Note: this post is a translation of a question from stackoverflow.com. Original: r - tidytext: Issue with unnest_tokens and token = 'ngrams'
r token whatsapp tidytext unnest

r - tidytext: Issue with unnest_tokens and token = 'ngrams'

Posted on 2020-04-07 11:39:08

I am running the following code:

library(rwhatsapp)
library(tidytext)
library(dplyr)  # provides %>% and as_tibble()

chat <- rwa_read(x = c(
  "31/1/15 04:10:59 - Menganito: Was it good?",
  "31/1/15 14:10:59 - Fulanito: Yes, it was"
))

chat %>% as_tibble() %>% 
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

But I get the following error:

Error in unnest_tokens.data.frame(., output = bigram, input = text, token = "ngrams",  : 
  If collapse = TRUE (such as for unnesting by sentence or paragraph), unnest_tokens needs all input columns to be atomic vectors (not lists)

I did some searching on Google but could not find an answer. text is a character vector, so I don't understand why I get an error saying that it is not.


Asked by: piblo95
Viewed: 115 times

Answer by akrun, 2020-02-01 02:18

The problem is that some columns are list columns whose elements are NULL (marked with ### below):

str(chat)
#tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
# $ time      : POSIXct[1:2], format: "2015-01-31 04:10:59" "2015-01-31 14:10:59"
# $ author    : Factor w/ 2 levels "Fulanito","Menganito": 2 1
# $ text      : chr [1:2] "Was it good?" "Yes, it was"
# $ source    : chr [1:2] "text input" "text input"
# $ emoji     :List of 2   ###
#  ..$ : NULL
#  ..$ : NULL
# $ emoji_name:List of 2    ###
#  ..$ : NULL
#  ..$ : NULL
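
To find the offending columns without reading the whole str() output, each column can be tested with is.list(). The following is a minimal base R sketch; chat_demo is a hypothetical hand-built stand-in for the chat object (it is not produced by rwa_read()), and the names is_list_col and list_cols are illustrative only.

```r
# Hypothetical stand-in for `chat`: two atomic columns plus two list columns
chat_demo <- data.frame(
  author = c("Menganito", "Fulanito"),
  text   = c("Was it good?", "Yes, it was"),
  stringsAsFactors = FALSE
)
chat_demo$emoji      <- list(NULL, NULL)
chat_demo$emoji_name <- list(NULL, NULL)

# TRUE for every column that is a list rather than an atomic vector
is_list_col <- vapply(chat_demo, is.list, logical(1))

# Names of the list columns
list_cols <- names(is_list_col)[is_list_col]
print(list_cols)
```

The same check should work on the real chat tibble by replacing chat_demo with chat.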

We can drop those columns, and then it works:

library(rwhatsapp)
library(tidytext)
library(dplyr)  # provides %>% and select_if()
chat %>% 
   select_if(~ !is.list(.)) %>%
   unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
# A tibble: 4 x 4
#  time                author    source     bigram 
#  <dttm>              <fct>     <chr>      <chr>  
#1 2015-01-31 04:10:59 Menganito text input was it 
#2 2015-01-31 04:10:59 Menganito text input it good
#3 2015-01-31 14:10:59 Fulanito  text input yes it 
#4 2015-01-31 14:10:59 Fulanito  text input it was 
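
In recent dplyr releases the same column filter can also be written with select() and where(). The sketch below assumes dplyr 1.0.0 or later and uses a hypothetical hand-built tibble, chat_demo, in place of the output of rwa_read(); it illustrates the idea and is not part of the original answer.

```r
library(dplyr)     # assumption: dplyr >= 1.0.0, which provides where()
library(tidytext)

# Hypothetical stand-in for `chat` with one list column of NULLs
chat_demo <- tibble(
  author = c("Menganito", "Fulanito"),
  text   = c("Was it good?", "Yes, it was"),
  emoji  = list(NULL, NULL)
)

bigrams <- chat_demo %>%
  select(!where(is.list)) %>%   # keep only the non-list columns
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

print(bigrams$bigram)
```

Selecting by predicate keeps the code independent of the specific column names (emoji, emoji_name), so it should also cope with any other list columns in the input.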

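Also, collapse = TRUE is the default, and this causes a problem when there are NULL elements, because the lengths differ once the text is collapsed. One option is to specify collapse = FALSE, which also keeps the list columns: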

chat %>% 
   unnest_tokens(output = bigram, input = text, token = "ngrams",
                 n = 2, collapse = FALSE)
# A tibble: 4 x 6
#  time                author    source     emoji  emoji_name bigram 
#  <dttm>              <fct>     <chr>      <list> <list>     <chr>  
#1 2015-01-31 04:10:59 Menganito text input <NULL> <NULL>     was it 
#2 2015-01-31 04:10:59 Menganito text input <NULL> <NULL>     it good
#3 2015-01-31 14:10:59 Fulanito  text input <NULL> <NULL>     yes it 
#4 2015-01-31 14:10:59 Fulanito  text input <NULL> <NULL>     it was