I have many files with a timestamp suffix in their filenames. A few of these files arrive each day, and some of them arrive as multiple copies that differ only in the timestamp (the hh, mm, or ss component differs between copies from the same day).
I have a pipeline that copies these files every day into time-partitioned folders with "yyyy/mm/dd" granularity. If multiple copies of a file with different timestamps are present on the same day, they overwrite one another when copied into the time-partitioned folder.
I am thinking of merging all copies of a file during the copy, removing duplicate rows after the merge, and suffixing the merged filename with the most recent timestamp among all copies of that file.
How can I achieve this?
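To make the intended logic concrete, here is a rough Python sketch of what I have in mind. It is not tied to any particular pipeline tool, and it rests on assumptions I have not confirmed above: the files are CSV with a header row, the names follow a hypothetical `<base>_<yyyyMMddHHmmss>.csv` pattern, and `merge_copies` is just an illustrative helper name that works on in-memory text rather than real storage.

```python
import csv
import io
import re
from collections import defaultdict

# Assumed (hypothetical) naming scheme: <base>_<yyyyMMddHHmmss>.csv
NAME_RE = re.compile(r"^(?P<base>.+)_(?P<ts>\d{14})\.csv$")


def merge_copies(files):
    """Merge same-day copies of each file.

    files: dict mapping filename -> CSV text (header row assumed).
    Returns a dict mapping merged filename -> merged CSV text.
    """
    groups = defaultdict(list)
    for name, text in files.items():
        m = NAME_RE.match(name)
        if m is None:
            continue  # skip names that do not follow the assumed pattern
        # Group by base name and calendar day (first 8 digits of the timestamp).
        groups[(m["base"], m["ts"][:8])].append((m["ts"], text))

    merged = {}
    for (base, _day), copies in groups.items():
        copies.sort()  # fixed-width timestamps sort correctly as strings
        latest_ts = copies[-1][0]
        header, seen, rows = None, set(), []
        for _ts, text in copies:
            reader = csv.reader(io.StringIO(text))
            file_header = next(reader, None)
            if header is None:
                header = file_header  # keep the header of the oldest copy
            for row in reader:
                key = tuple(row)
                if key not in seen:  # drop exact duplicate rows
                    seen.add(key)
                    rows.append(row)
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
        # Suffix the merged file with the most recent timestamp in the group.
        merged[f"{base}_{latest_ts}.csv"] = out.getvalue()
    return merged
```

In this sketch, duplicates are rows that match exactly across all columns, and the first occurrence is kept in arrival order. If duplicates should instead be decided by a key column, the `key` line would need to change.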
Sample file in a container: