I'm using PySpark and I have a large DataFrame with a single column, where each row is a long string of characters:
col1
-------
'2020-11-20;id09;150.09,-20.02'
'2020-11-20;id44;151.78,-25.14'
'2020-11-20;id78;148.24,-22.67'
'2020-11-20;id55;149.77,-27.89'
...
...
...
I'm trying to extract the rows where the 'idxx' token matches a list of strings such as ["id01", "id02", "id22", "id77", ...]. Currently, I extract rows like this:
df.filter(df.col1.contains("id01") | df.col1.contains("id02") | df.col1.contains("id22") | ... )
Is there a way to make this more efficient than hard-coding every string into the filter call?
You can build the same filter programmatically by OR-ing the per-string conditions together with `functools.reduce`:

from functools import reduce
from operator import or_

str_list = ["id01", "id02", "id22", "id77"]
# Combine the individual .contains() conditions into one boolean column expression,
# equivalent to chaining them with | by hand.
df.filter(reduce(or_, [df.col1.contains(s) for s in str_list]))
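Another option (a sketch, not the only way) is to collapse the list into a single regex alternation and pass it to `Column.rlike`, so the filter is one condition regardless of list size. The pattern-building step is plain Python, shown here against a couple of hypothetical sample rows from the question; in Spark you would apply it as `df.filter(df.col1.rlike(pattern))`:

```python
import re

# Hypothetical sample rows mirroring the question's data.
rows = [
    "2020-11-20;id09;150.09,-20.02",
    "2020-11-20;id44;151.78,-25.14",
]
str_list = ["id01", "id02", "id22", "id44"]

# Build one alternation pattern, e.g. "id01|id02|id22|id44".
# re.escape guards against regex metacharacters in the ids.
pattern = "|".join(re.escape(s) for s in str_list)

# Locally this is equivalent to the Spark filter
# df.filter(df.col1.rlike(pattern)):
matched = [r for r in rows if re.search(pattern, r)]
```

This keeps the query plan to a single predicate, though for very large id lists a join against a DataFrame of ids (or extracting the token with `split` and using `isin`) may scale better.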