
R: tokenize n-grams without stripping punctuation

Published on 2020-12-06 21:21:50

I am trying to tokenize n-grams (between 1 (minimum) and 3 (maximum)) on my data. After applying this function, I can see that it strips some relevant words such as [sad] (words that I have converted from emojis).

For example the input is:

  • I dislike lemons [sad]

When I apply the n-gram tokenizer and assess the frequencies (n-gram components are joined by "_"), the output for sad appears like this (bear in mind that I am only printing the top 100 n-grams; other words are included, but I want to assess this one specifically):

  • [_sad]
  • [_sad _]

How do I make sure that "[" is not stripped during n-gram tokenization, so that the token stays [sad]?

This is my code; I am using the quanteda package:

tokens <- tokens_ngrams(tokens(textcleaning), n = 1:3)

Then I create a corpus object and build the top 100 n-grams through a term-document matrix.
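A minimal sketch of that counting step, assuming textcleaning is a character vector of cleaned documents (dfm() and topfeatures() are standard quanteda functions; the toks and dfmat names are placeholders):

library(quanteda)

# textcleaning is assumed to be a character vector of cleaned documents
toks <- tokens_ngrams(tokens(textcleaning), n = 1:3)

# The document-feature matrix plays the role of the term-document matrix;
# topfeatures() then lists the 100 most frequent n-grams
dfmat <- dfm(toks)
topfeatures(dfmat, n = 100)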

Asked by Louise
Answered by Andrew Brown on 2020-12-07 06:26:39

I played around with this a little bit, and I think you should convert your [ and ] characters to something unique but alphanumeric. It seems like {quanteda} wants to parse tokens that contain or are adjacent to such special characters, and not consider them part of the "word" per se. Since your concept of "[sad]" is a single word, to tokenize it as one, just do something that distinguishes it from a regular "sad".

I use gsub and search for the patterns "\\[" and "\\]" respectively. [ is a regular-expression special character, so you need to escape it with two backslashes. I replace the first with the word "emoji" and the second with "" to form "emojisad".
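For clarity, this is what the intermediate string looks like after both substitutions (base R only, before any tokenization):

test_string <- "I dislike lemons [sad]."

# The inner call deletes "]"; the outer call turns "[" into "emoji"
gsub("\\[", "emoji", gsub("\\]", "", test_string))
#> [1] "I dislike lemons emojisad."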

Note how the period at the end of the sentence is handled. You said you stripped out punctuation, but this behavior seems like a "feature", not a bug.

library(quanteda, warn.conflicts = FALSE)
#> Package version: 2.1.2
#> Parallel computing: 2 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 

test_string <- "I dislike lemons [sad]."

tokens_ngrams(tokens(gsub("\\[", "emoji", gsub("\\]", "", test_string))))
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "I_dislike"       "dislike_lemons"  "lemons_emojisad" "emojisad_."

Created on 2020-12-06 by the reprex package (v0.3.0)
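If the trailing period should not end up inside any n-gram (the question's pipeline stripped punctuation), one possible variant, sketched here rather than tested, combines the same gsub fix with quanteda's remove_punct option and the n = 1:3 range from the question:

library(quanteda)

test_string <- "I dislike lemons [sad]."

# Rewrite the bracketed emoji tag before tokenizing
cleaned <- gsub("\\[", "emoji", gsub("\\]", "", test_string))

# remove_punct = TRUE drops the period so it never enters an n-gram;
# n = 1:3 reproduces the unigram-to-trigram range from the question
tokens_ngrams(tokens(cleaned, remove_punct = TRUE), n = 1:3)

The result should then contain n-grams such as lemons_emojisad and dislike_lemons_emojisad, with no punctuation-only tokens.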