
R: tokenize n-grams without stripping punctuation

Published on 2020-12-06 21:21:50

I am trying to tokenize n-grams (between 1 (minimum) and 3 (maximum)) on my data. After applying this function, I can see that it strips some relevant words such as [sad] (words that I have converted from emojis).

For example the input is:

  • I dislike lemons [sad]

When I apply the n-gram tokenizer and assess the frequencies (n-gram components are joined by "_"), the output for sad appears like this (bear in mind that I am only printing the top 100 n-grams; other words are included, but I want to assess this one specifically):

  • [_sad]
  • [_sad _]

How do I make sure that "[" is not stripped during n-gram tokenization, so that the token stays [sad]?

This is my code; I am using the quanteda package:

tokens <- tokens_ngrams(tokens(textcleaning), n = 1:3)

Then I create a corpus object and build the top 100 n-grams through a term-document matrix.
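A minimal sketch of that counting step, assuming textcleaning is a character vector of cleaned documents (dfm() and topfeatures() are standard quanteda functions; the toks and dfmat names are placeholders):

library(quanteda)

# textcleaning is assumed to be a character vector of cleaned documents
toks <- tokens_ngrams(tokens(textcleaning), n = 1:3)

# The document-feature matrix plays the role of the term-document matrix;
# topfeatures() then lists the 100 most frequent n-grams
dfmat <- dfm(toks)
topfeatures(dfmat, n = 100)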

Asked by Louise
Answered by Andrew Brown on 2020-12-07 06:26:39

I played around with this a little bit, and I think you should convert your [ and ] characters to something unique but alphanumeric. It seems like {quanteda} wants to parse tokens that contain or are adjacent to such special characters, and not consider them part of the "word" per se. Since your concept of "[sad]" is a single word, to tokenize it as one, just do something that distinguishes it from a regular "sad".

I use gsub and search for the patterns "\\[" and "\\]" respectively. [ is a regular-expression special character, so you need to escape it with two backslashes. I replace the first with the word "emoji" and the second with "" to form "emojisad".
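For clarity, this is what the intermediate string looks like after both substitutions (base R only, before any tokenization):

test_string <- "I dislike lemons [sad]."

# The inner call deletes "]"; the outer call turns "[" into "emoji"
gsub("\\[", "emoji", gsub("\\]", "", test_string))
#> [1] "I dislike lemons emojisad."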

Note how the period at the end of the sentence is handled. You said you stripped out punctuation, but this behavior seems like a "feature", not a bug.

library(quanteda, warn.conflicts = FALSE)
#> Package version: 2.1.2
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 

test_string <- "I dislike lemons [sad]."

tokens_ngrams(tokens(gsub("\\[", "emoji", gsub("\\]", "", test_string))))
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "I_dislike"       "dislike_lemons"  "lemons_emojisad" "emojisad_."

Created on 2020-12-06 by the reprex package (v0.3.0)
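If the trailing period should not end up inside any n-gram (the question's pipeline stripped punctuation), one possible variant, sketched here rather than tested, combines the same gsub fix with quanteda's remove_punct option and the n = 1:3 range from the question:

library(quanteda)

test_string <- "I dislike lemons [sad]."

# Rewrite the bracketed emoji tag before tokenizing
cleaned <- gsub("\\[", "emoji", gsub("\\]", "", test_string))

# remove_punct = TRUE drops the period so it never enters an n-gram;
# n = 1:3 reproduces the unigram-to-trigram range from the question
tokens_ngrams(tokens(cleaned, remove_punct = TRUE), n = 1:3)

The result should then contain n-grams such as lemons_emojisad and dislike_lemons_emojisad, with no punctuation-only tokens.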