
Pyspark: Extracting rows of a dataframe where value contains a string of characters

Published on 2020-11-28 07:46:44

I'm using PySpark and have a large DataFrame with a single column, where each row is one long string of characters:

col1
-------
'2020-11-20;id09;150.09,-20.02'
'2020-11-20;id44;151.78,-25.14'
'2020-11-20;id78;148.24,-22.67'
'2020-11-20;id55;149.77,-27.89'
...
...
...

I'm trying to extract the rows whose 'idxx' token matches one of a list of strings such as ["id01", "id02", "id22", "id77", ...]. Currently I extract rows like this:

df.filter(df.col1.contains("id01") | df.col1.contains("id02") | df.col1.contains("id22") | ... )

Is there a way to make this more efficient instead of having to hard code every string item into the filter function?

Questioner: code_learner93
Answer from mck (2020-11-28 16:00:58):
from functools import reduce
from operator import or_

str_list = ["id01", "id02", "id22", "id77"]

# OR together one contains() condition per id, producing a single boolean
# column -- equivalent to chaining the conditions with | by hand.
df.filter(reduce(or_, [df.col1.contains(s) for s in str_list]))
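
For completeness, here is a minimal runnable sketch of the same idea. The SparkSession setup, sample rows, and the second variant are illustrative additions, not part of the original answer; the alternative splits out the id field and tests exact membership with isin(), which avoids accidental substring matches (e.g. "id4" also matching "id44").

from functools import reduce
from operator import or_

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data mirroring the question's layout.
df = spark.createDataFrame(
    [("2020-11-20;id09;150.09,-20.02",),
     ("2020-11-20;id44;151.78,-25.14",)],
    ["col1"],
)

str_list = ["id01", "id02", "id44"]

# Variant 1: the reduce/or_ filter from the answer above.
df.filter(reduce(or_, [df.col1.contains(s) for s in str_list])).show(truncate=False)

# Variant 2 (assumption: the id is always the second ';'-separated field):
# split the string, take that field, and test exact membership.
df.filter(F.split(df.col1, ";").getItem(1).isin(str_list)).show(truncate=False)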