Note: This article is reproduced from serverfault.com.

sql - Pyspark: Extracting rows of a dataframe where value contains a string of characters

Posted on 2020-11-28 07:46:44

I am using pyspark and have a large dataframe with only a single column of values, in which each row is a long string of characters:

col1
-------
'2020-11-20;id09;150.09,-20.02'
'2020-11-20;id44;151.78,-25.14'
'2020-11-20;id78;148.24,-22.67'
'2020-11-20;id55;149.77,-27.89'
...
...
...

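I am trying to extract the rows of the dataframe where 'idxx' matches one of a list of strings, such as ["id01", "id02", "id22", "id77", ...]. Currently, the way I extract rows from my dataframe is: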

df.filter(df.col1.contains("id01") | df.col1.contains("id02") | df.col1.contains("id22") | ... )

Is there a way to make this more efficient, without having to hard-code every string item into the filter function?

Questioner
code_learner93
mck 2020-11-28 16:00:58
from functools import reduce
from operator import or_

str_list = ["id01", "id02", "id22", "id77"]
# Build one contains() condition per string, then combine them all with | (logical OR)
df.filter(reduce(or_, [df.col1.contains(s) for s in str_list]))
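
As a possible alternative to chaining many contains() conditions, the list can be joined into a single regular expression and passed to Column.rlike, which does an unanchored regex search. This is only a sketch: the helper names build_id_pattern and filter_by_ids are hypothetical, and it assumes the ids are simple strings, so that Python's re.escape output is also valid in the Java regex syntax Spark uses.

```python
import re


def build_id_pattern(ids):
    """Return a regex alternation such as 'id01|id02' from a list of literal strings."""
    if not ids:
        # An empty pattern would match every row, so refuse it explicitly.
        raise ValueError("ids must not be empty")
    # re.escape keeps any regex metacharacters in the ids literal.
    return "|".join(re.escape(s) for s in ids)


def filter_by_ids(df, ids, column="col1"):
    """Keep rows of a Spark DataFrame whose `column` contains any of `ids`.

    `df` is assumed to be a pyspark.sql.DataFrame; `column` defaults to the
    question's col1.
    """
    # A single rlike() condition replaces the chain of contains() calls.
    return df.filter(df[column].rlike(build_id_pattern(ids)))


str_list = ["id01", "id02", "id22", "id77"]
pattern = build_id_pattern(str_list)
```

With the question's dataframe, filter_by_ids(df, str_list) would be called in place of the reduce expression. Whether it is actually faster depends on the data and the Spark version, so it is worth timing both.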