I have a 100 million row csv that I have to read in chunks with pandas like this:
import pandas

df_chunks = pandas.read_csv(
    'my-file.csv.gz',
    sep='\t',
    chunksize=100000,
    compression='gzip')

for df in df_chunks:
    # here I filter some rows and columns and after that
    # I write to a new csv
    filtered_df.to_csv(
        'my_filtered.csv.gz',
        sep=',',
        columns=['id', 'date'],
        compression='gzip',
        mode='a')
The data I am trying to write looks like this; it has only 2 columns:
id,date
42517544,2019-06-30
42517544,2019-06-30
42517544,2019-07-01
...
Now I could use something like df.drop_duplicates(), but since I am writing in chunks I could still end up with duplicates across chunks. Note that the file is big, around 10 GB, so I need to read and write in chunks.
I would like to find a way to do it with pandas, perhaps keeping a set in memory, as long as it doesn't consume too much memory, because that is a constraint as well.
What is a good approach for this?
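For reference, a minimal sketch of the "set in memory" idea mentioned above (not a full solution): keep one set of already-seen (id, date) pairs and drop rows whose pair appeared in an earlier chunk. The filter step below is a placeholder for the real row/column filtering, and this only fits the memory constraint if the number of distinct (id, date) pairs stays manageable.

import pandas as pd

seen = set()  # (id, date) pairs already written out

df_chunks = pd.read_csv(
    'my-file.csv.gz',
    sep='\t',
    chunksize=100000,
    compression='gzip')

for i, df in enumerate(df_chunks):
    # placeholder for the real filtering; keep only the two output columns
    filtered_df = df[['id', 'date']].drop_duplicates()
    # keep only rows whose (id, date) pair has not been seen in an earlier chunk
    keys = list(zip(filtered_df['id'], filtered_df['date']))
    mask = [k not in seen for k in keys]
    seen.update(keys)
    filtered_df[mask].to_csv(
        'my_filtered.csv.gz',
        compression='gzip',
        mode='a',
        header=(i == 0),  # write the header only for the first chunk
        index=False)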
1 Million Rows
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
n = 1_000_000

dfout = pd.DataFrame({
    'id': np.random.randint(1000, size=n),
    'date': np.random.choice(pd.date_range('2019-01-01', periods=1000), size=n)
})

dfout.to_csv('my-file.csv.gz', compression='gzip', sep='\t', index=False)
Chunk as you did
df_chunks = pd.read_csv(
    'my-file.csv.gz',
    sep='\t',
    chunksize=100000,
    compression='gzip')
Write individual files per unique date
for i, df in enumerate(df_chunks):
    for date, d in df.groupby('date'):
        date = pd.Timestamp(date)
        # append this chunk's rows for this date to that date's file;
        # drop_duplicates here only removes duplicates within the chunk
        d.drop_duplicates().to_csv(
            f'{date:%Y%m%d}.csv.gz',
            compression='gzip',
            mode='a',
            index=False,
            header=False
        )
    print(f'\r{i}', end='')  # simple progress indicator
Read in each individual date file, drop_duplicates, and write back out. Because every row for a given date ends up in the same per-date file, dropping duplicates within each file removes all remaining duplicate (id, date) pairs.
from pathlib import Path

path = Path('.')

# the glob matches the YYYYMMDD.csv.gz files written above
for i, fh in enumerate(path.glob('[0-9]' * 8 + '.csv.gz')):
    df = pd.read_csv(fh, header=None)
    df.drop_duplicates().to_csv(
        'my_filtered.csv.gz',
        compression='gzip',
        mode='a',
        index=False,
        header=False
    )
    print(f'\r{i}: {fh}', end='')
Validate: read the combined output back and check that the row count matches the de-duplicated source and that no duplicates remain.

df = pd.read_csv(
    'my_filtered.csv.gz',
    compression='gzip',
    header=None,
    names=['id', 'date']
)

assert len(df) == len(dfout) - dfout.duplicated().sum()
assert df.duplicated().sum() == 0
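Not part of the original answer, but if the per-date files are only intermediate artifacts, they can be removed once my_filtered.csv.gz has been validated:

from pathlib import Path

# delete the intermediate YYYYMMDD.csv.gz files created above
for fh in Path('.').glob('[0-9]' * 8 + '.csv.gz'):
    fh.unlink()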
Very detailed approach, this is indeed a top solution. I will try it out, thanks!