I have a pipeline that reads 20 files from storage, extracts the path of each file, and loads it into a table. Ideally the record count should be 20, but when I execute the pipeline, the same record flows through again and again, making the total record count increase indefinitely. I am wondering if I am making a mistake somewhere.
I just replicated the issue. My guess is that you are inserting one record into BigQuery for each record in the files. If you choose, for example, the Blob format, you will get only one record per file.
I am not reading the file contents; the files I am reading are DICOM files with a .dcm extension. I just want to capture the path of each file. Even when there is only one file, the pipeline loops indefinitely and repeats the same data again and again.
How is the pipeline configured? What source and transformations are you using to take the file and insert it into the table?
The source is GCS. I gave it a bucket path (which holds 20 .dcm images), and the output schema has path and body. The transformation is the JavaScript plugin (where I want to pick only the path), and the sink is the HTTP plugin, where I am posting the data.
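For reference, a minimal sketch of what that JavaScript transform stage could look like, assuming the CDAP JavaScript transform contract (`transform(input, emitter, context)`) and the `path`/`body` field names from the GCS source schema described above. The small harness at the bottom is only there so the function can be exercised outside the pipeline:

```javascript
// CDAP-style JavaScript transform: emit a record containing only the
// file path, dropping the binary DICOM body. Exactly one emit per input
// record, so one input file should produce one output record.
function transform(input, emitter, context) {
  emitter.emit({ path: input.path });
}

// --- local harness (not part of the plugin) ---
// Mock emitter that collects emitted records into an array.
const out = [];
const mockEmitter = { emit: (rec) => out.push(rec) };

// Simulate one GCS record: path plus (stand-in) binary body.
transform({ path: 'gs://bucket/image1.dcm', body: '<binary>' }, mockEmitter, null);
console.log(out);
```

If this stage emits once per input but the sink still receives duplicates, the repetition is happening outside the transform.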
In the JavaScript transform, add a log to check whether you are receiving each file path just once. In addition, check the HTTP return code at the POST endpoint; the records could be repeating because of HTTP retries.