This article is reproduced from serverfault.com.

python - How can I read multiple S3 buckets using Glue?

Posted on 2020-11-30 19:00:30

With Spark, you can use a * wildcard in the path prefix to read data from multiple folders at once. For example, my folder structure looks like this:

s3://bucket/folder/computation_date=2020-11-01/
s3://bucket/folder/computation_date=2020-11-02/
s3://bucket/folder/computation_date=2020-11-03/
etc.

With PySpark, if I want to read all the data for month 11, I can do the following:

input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))

How can I achieve the same thing with Glue? The following does not seem to work:

input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]
df_glue = glueContext.create_dynamic_frame_from_options(
            connection_type="s3",
            connection_options = {
                "paths": ["s3://{}/{}/".format(input_bucket, input_prefix)]
            },
            format="parquet",
            transformation_ctx="df_spark")
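As far as I know, Glue's `paths` connection option takes a list of literal S3 prefixes and does not expand `*` wildcards the way Spark's reader does. One workaround (a sketch, not from the original post; the bucket and folder names are placeholders) is to enumerate the daily partition prefixes yourself and pass the whole list:

```python
from datetime import date, timedelta

def daily_partition_paths(bucket, folder, year, month):
    """Build one S3 prefix per computation_date partition in the given month."""
    paths = []
    d = date(year, month, 1)
    while d.month == month:
        paths.append(
            "s3://{}/{}/computation_date={}/".format(bucket, folder, d.isoformat())
        )
        d += timedelta(days=1)
    return paths

# Placeholder names, matching the [MY-BUCKET]/[MY-FOLDER] pattern above
paths = daily_partition_paths("MY-BUCKET", "MY-FOLDER", 2020, 11)
# paths[0] is "s3://MY-BUCKET/MY-FOLDER/computation_date=2020-11-01/"
```

The resulting list could then be handed to Glue as `connection_options={"paths": paths}` in the `create_dynamic_frame_from_options` call shown above.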
Questioner: Melissa Guo

Answer (Melissa Guo, 2020-12-01 03:18:55):

I ended up using Spark, rather than Glue, to read the files:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Get the Spark session that backs the Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

input_bucket = [MY-BUCKET]
input_prefix = [MY-FOLDER/computation_date=2020-11-*]

# Spark expands the * wildcard, so this reads every matching partition
df_spark = spark.read.parquet("s3://{}/{}/".format(input_bucket, input_prefix))