I have a 100 million row csv that I have to read in chunks with pandas like this:
import pandas

df_chunks = pandas.read_csv(
    'my-file.csv.gz',
    sep='\t',
    chunksize=100000,
    compression='gzip')

for df in df_chunks:
    # here I filter some rows and columns and after that
    # I write to a new csv
    filtered_df.to_csv(
        'my_filtered.csv.gz',
        sep=',',
        columns=['id', 'date'],
        compression='gzip',
        mode='a')
The data I am trying to write looks like this; it has only 2 columns:
id,date
42517544,2019-06-30
42517544,2019-06-30
42517544,2019-07-01
...
Now I could use something like df.drop_duplicates(), but since I am writing in chunks I could still end up with duplicates across chunks. Note that the file is big, around 10 GB, so I need to read and write in chunks.
I would like to find a way to do it with pandas, perhaps keeping a set in memory, as long as it doesn't consume too much memory, because that is a constraint as well.
What is a good approach for this?
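For reference, a minimal sketch of the "set in memory" idea mentioned above (not a full solution): keep one set of already-seen (id, date) pairs and drop rows whose pair appeared in an earlier chunk. The filter step below is a placeholder for the real row/column filtering, and this only fits the memory constraint if the number of distinct (id, date) pairs stays manageable.

import pandas as pd

seen = set()  # (id, date) pairs already written out

df_chunks = pd.read_csv(
    'my-file.csv.gz',
    sep='\t',
    chunksize=100000,
    compression='gzip')

for i, df in enumerate(df_chunks):
    # placeholder for the real filtering; keep only the two output columns
    filtered_df = df[['id', 'date']].drop_duplicates()
    # keep only rows whose (id, date) pair has not been seen in an earlier chunk
    keys = list(zip(filtered_df['id'], filtered_df['date']))
    mask = [k not in seen for k in keys]
    seen.update(keys)
    filtered_df[mask].to_csv(
        'my_filtered.csv.gz',
        compression='gzip',
        mode='a',
        header=(i == 0),  # write the header only for the first chunk
        index=False)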
1 Million Rows
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
n = 1_000_000

dfout = pd.DataFrame({
    'id': np.random.randint(1000, size=n),
    'date': np.random.choice(pd.date_range('2019-01-01', periods=1000), size=n)
})

dfout.to_csv('my-file.csv.gz', compression='gzip', sep='\t', index=False)
Chunk as you did
df_chunks = pd.read_csv(
    'my-file.csv.gz',
    sep='\t',
    chunksize=100000,
    compression='gzip')
Write individual files per unique date
for i, df in enumerate(df_chunks):
    for date, d in df.groupby('date'):
        date = pd.Timestamp(date)
        # append this chunk's rows for this date to that date's file;
        # drop_duplicates here only removes duplicates within the chunk
        d.drop_duplicates().to_csv(
            f'{date:%Y%m%d}.csv.gz',
            compression='gzip',
            mode='a',
            index=False,
            header=False
        )
    print(f'\r{i}', end='')  # simple progress indicator
Read in each individual date file, drop_duplicates, and write back out. Because every row for a given date ends up in the same per-date file, dropping duplicates within each file removes all remaining duplicate (id, date) pairs.
from pathlib import Path

path = Path('.')

# the glob matches the YYYYMMDD.csv.gz files written above
for i, fh in enumerate(path.glob('[0-9]' * 8 + '.csv.gz')):
    df = pd.read_csv(fh, header=None)
    df.drop_duplicates().to_csv(
        'my_filtered.csv.gz',
        compression='gzip',
        mode='a',
        index=False,
        header=False
    )
    print(f'\r{i}: {fh}', end='')
Validate: read the combined output back and check that the row count matches the de-duplicated source and that no duplicates remain.

df = pd.read_csv(
    'my_filtered.csv.gz',
    compression='gzip',
    header=None,
    names=['id', 'date']
)

assert len(df) == len(dfout) - dfout.duplicated().sum()
assert df.duplicated().sum() == 0
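Not part of the original answer, but if the per-date files are only intermediate artifacts, they can be removed once my_filtered.csv.gz has been validated:

from pathlib import Path

# delete the intermediate YYYYMMDD.csv.gz files created above
for fh in Path('.').glob('[0-9]' * 8 + '.csv.gz'):
    fh.unlink()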
Very detailed approach, this is indeed a top solution. I will try it out, thanks!