I work as a data engineer, and every day I need to combine several files into one file.
Here is what I would like to do:
- Upload a .gz file to Azure Blob Storage every day.
- Decompress the file and convert it to Parquet format.
- Combine the daily files into one file per month (partitioned by month).
For example:
- Upload a file (file path: Sample/year=2022/month=06/day=22/sample.gz) to Blob Storage.
The earlier files (year=2022/month=06/day=1-21/sample.gz) have already been uploaded to the same directory, and they have already been combined into one file (202206.parquet).
- Decompress the 2022/06/22 file and convert it to Parquet format.
- Combine the 2022/06/22 file with the 202206.parquet file.
Note: after combining, if possible, I would like to delete the old 202206.parquet and create a new file that includes the data from the 2022/06/22 file.
I have already created the pipelines for step 1 and step 2.
So I need advice mainly on step 3.
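To make step 3 concrete, here is a rough sketch of the logic I have in mind. It is not a working pipeline. It assumes the daily file has already been converted to Parquet, that the files can be reached as ordinary file paths (for example through a mount), and that pandas with a Parquet engine such as pyarrow is installed. The function names `monthly_file_name` and `merge_day_into_month` are made up for illustration.

```python
import os
import re


def monthly_file_name(daily_path: str) -> str:
    """Derive the monthly file name (e.g. 202206.parquet) from a daily path."""
    match = re.search(r"year=(\d{4})/month=(\d{2})/day=\d{1,2}/", daily_path)
    if match is None:
        raise ValueError(f"unexpected path layout: {daily_path}")
    year, month = match.groups()
    return f"{year}{month}.parquet"


def merge_day_into_month(daily_parquet: str, monthly_parquet: str) -> None:
    """Add one day's rows to the monthly file by rewriting the monthly file."""
    # Imported here so the path helper above works without pandas installed.
    # Assumption: pandas plus a Parquet engine (pyarrow or fastparquet).
    import pandas as pd

    frames = [pd.read_parquet(daily_parquet)]
    if os.path.exists(monthly_parquet):
        # Keep the rows that were combined on earlier days.
        frames.insert(0, pd.read_parquet(monthly_parquet))
    combined = pd.concat(frames, ignore_index=True)

    # Write under a temporary name first, then swap, so a failed run
    # does not leave a half-written monthly file behind.
    tmp_path = monthly_parquet + ".tmp"
    combined.to_parquet(tmp_path, index=False)
    os.replace(tmp_path, monthly_parquet)


print(monthly_file_name("Sample/year=2022/month=06/day=22/sample.gz"))
# prints: 202206.parquet
```

In this sketch the old 202206.parquet is replaced by a new file that contains both the old rows and the new day's rows. Running it twice for the same day would duplicate that day's rows, so the real pipeline would need some guard against that.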
Any help would be greatly appreciated.
P.S. Each .gz file contains about 25 CSV files.
Thank you for reading my question despite my poor English.