Note: this article is reproduced from serverfault.com.

amazon-s3 csv python s3fs python-s3fs

How to stream a large gzipped .tsv file from s3, process it, and write back to a new file on s3?

Posted on 2020-11-30 03:54:22

I have a large file, s3://my-bucket/in.tsv.gz, that I would like to load and process, then write the processed version back to an S3 output file, s3://my-bucket/out.tsv.gz.

How do I stream in.tsv.gz directly from S3 without loading the whole file into memory (it does not fit in memory)?
How do I write the processed gzipped stream directly to S3?

The following code shows how I was planning to load the gzipped input from S3 into a DataFrame, and how I would write the .tsv if the output were local (bucket_dir_local = './').

import pandas as pd
import s3fs
import os
import gzip
import csv
import io

bucket_dir = 's3://my-bucket/annotations/'
df = pd.read_csv(os.path.join(bucket_dir, 'in.tsv.gz'), sep='\t', compression="gzip")

bucket_dir_local='./'
# not sure how to do it with an s3 path
with gzip.open(os.path.join(bucket_dir_local, 'out.tsv.gz'), "w") as f:
    with io.TextIOWrapper(f, encoding='utf-8') as wrapper:
        w = csv.DictWriter(wrapper, fieldnames=['test', 'testing'], extrasaction="ignore")
        w.writeheader()
        for index, row in df.iterrows():
            my_dict = {"test": index, "testing": row[6]}
            w.writerow(my_dict)

Edit: smart_open looks like the way to go.
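One way to read that edit: the writer loop above only needs a text file handle, so the row logic can be separated from where the handles come from. Below is a minimal sketch under a few assumptions that are not in the original post: the function names and the `column_index` parameter are hypothetical, the output is tab-delimited (the original `DictWriter` call would default to commas), and the S3 variant assumes smart_open is installed and AWS credentials are configured.

```python
import csv
import gzip


def transform_rows(fin, fout, column_index=6):
    """Copy one column of a TSV stream into a new two-column TSV, row by row.

    `fin` and `fout` are text-mode file handles, so only the current row is
    held in memory. `column_index` is a hypothetical parameter mirroring the
    `row[6]` lookup in the question.
    """
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.DictWriter(
        fout, fieldnames=["test", "testing"], delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    next(reader, None)  # skip the input header line, as pd.read_csv does by default
    written = 0
    for index, row in enumerate(reader):
        writer.writerow({"test": index, "testing": row[column_index]})
        written += 1
    return written


def transform_local(src_path, dst_path):
    """Local variant: gzip.open in text mode supplies the handles."""
    with gzip.open(src_path, "rt", encoding="utf-8", newline="") as fin, \
         gzip.open(dst_path, "wt", encoding="utf-8", newline="") as fout:
        return transform_rows(fin, fout)


def transform_s3(src_url, dst_url):
    """S3 variant (assumption: smart_open installed, AWS credentials configured)."""
    # Imported lazily so the local variant has no extra dependency.
    from smart_open import open as s3_open

    with s3_open(src_url, "r", encoding="utf-8") as fin, \
         s3_open(dst_url, "w", encoding="utf-8") as fout:
        return transform_rows(fin, fout)
```

With this split, `transform_local("in.tsv.gz", "out.tsv.gz")` can be checked on a small local sample before pointing `transform_s3` at the real bucket paths.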

Questioner

0x90

0x90 2020-12-02 14:51:22

Here is a minimal example that reads a file from S3 and writes it back to S3 using smart_open:

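from smart_open import open as s3_open  # alias avoids shadowing the built-in open()

bucket_dir = "s3://my-bucket/annotations/"

# Both objects are streamed; the .gz extension makes smart_open
# decompress on read and compress on write.
with s3_open(bucket_dir + "in.tsv.gz", "rb") as fin:
    with s3_open(bucket_dir + "out.tsv.gz", "wb") as fout:
        for line in fin:
            # Strip whitespace around each tab-separated field.
            fields = [field.strip() for field in line.decode().split("\t")]
            fout.write(("\t".join(fields) + "\n").encode())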
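The loop above receives already-decompressed lines because, as far as I know, smart_open infers gzip from the .gz extension by default. To make that step explicit, here is a rough standard-library sketch of the same idea on arbitrary binary streams. The function name `recompress_stream` is hypothetical, and in-memory `io.BytesIO` buffers stand in for the S3 objects.

```python
import gzip
import io


def recompress_stream(raw_in, raw_out):
    """Read gzipped TSV bytes from raw_in, strip each field, write gzipped bytes to raw_out."""
    with gzip.GzipFile(fileobj=raw_in, mode="rb") as fin, \
         gzip.GzipFile(fileobj=raw_out, mode="wb") as fout:
        for line in fin:  # one decompressed line at a time
            fields = [field.strip() for field in line.decode("utf-8").split("\t")]
            fout.write(("\t".join(fields) + "\n").encode("utf-8"))


# In-memory stand-ins for the input and output objects.
source = io.BytesIO(gzip.compress(b"a \t b\n c\td \n"))
sink = io.BytesIO()
recompress_stream(source, sink)
print(gzip.decompress(sink.getvalue()))  # b'a\tb\nc\td\n'
```

Only one line plus the gzip buffers is held in memory at a time, which is the property the question asks for.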
