cdap google-cloud-data-fusion google-cloud-platform

GCP Datafusion repeating same data from GCS

Published on 2020-05-25 15:51:08

I have a pipeline that reads 20 files from storage, extracts the path of each file, and loads it into a table. Ideally the record count should be 20, but when I execute the pipeline, the same record flows through again and again, so the total record count keeps increasing indefinitely. Am I making a mistake here?

Questioner
code tutorial
Tlaquetzal 2020-03-10 01:51

I just replicated the issue. My guess is that you are inserting one record into BigQuery for each record in the files, rather than one per file. If you choose the Blob format, for example, you will get only one record per file.
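To show what I mean, here is a rough plain-Python sketch of the difference. It does not use Data Fusion or CDAP at all; the two reader functions, the bucket name and the file contents are made up for illustration. It only mimics how a line-oriented format emits one record per line, while a blob-style format emits one record per file.

```python
from typing import Dict, Iterator, Tuple


def read_as_text(files: Dict[str, str]) -> Iterator[Tuple[str, str]]:
    """Hypothetical line-oriented reader: one (path, line) record per line."""
    for path, content in files.items():
        for line in content.splitlines():
            yield path, line


def read_as_blob(files: Dict[str, str]) -> Iterator[Tuple[str, bytes]]:
    """Hypothetical blob reader: one (path, whole file body) record per file."""
    for path, content in files.items():
        yield path, content.encode("utf-8")


# 20 made-up files with 3 lines each (bucket and names are assumptions).
files = {
    f"gs://example-bucket/input/file_{i:02d}.csv": "a,b\n1,2\n3,4\n"
    for i in range(20)
}

# Keep only the path from each record, as the pipeline in the question does.
text_paths = [path for path, _ in read_as_text(files)]
blob_paths = [path for path, _ in read_as_blob(files)]

print(len(text_paths))       # 60: each path repeated once per line
print(len(blob_paths))       # 20: one path per file
print(len(set(text_paths)))  # 20 distinct paths hidden among the repeats
```

If the table only needs the path, the line-oriented case writes the same path once for every line in the file, which would look like the same record flowing through repeatedly.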