Hi anonymous user,
Sorry for the delayed response. Your previous reply was moved to moderation for some reason; I am working with the platform team to restore it.
Please find below insights on your follow-up query.
1) Why are we hitting wasb (wasb://)? Shall I directly hit abfs (abfs://)?
- I used wasb only as an example; you can use abfs. To explore WASB vs ABFS further, please refer to this doc.
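To make the difference between the two schemes concrete, here is a minimal sketch that rewrites a wasb:// URI into its abfs:// form. It assumes the usual layouts (container@account.blob.core.windows.net for WASB, container@account.dfs.core.windows.net for ABFS). The class name, method name, and the container/account values are hypothetical, and your storage account needs to support the dfs endpoint for the result to be usable.

```java
import java.net.URI;

class StorageUriSketch {
    // Hypothetical helper: maps wasb:// -> abfs:// and wasbs:// -> abfss://,
    // swapping the blob endpoint for the dfs endpoint.
    static URI toAbfs(URI wasb) {
        String scheme = wasb.getScheme();
        String target;
        if ("wasb".equals(scheme)) {
            target = "abfs";
        } else if ("wasbs".equals(scheme)) {
            target = "abfss";
        } else {
            throw new IllegalArgumentException("Not a WASB URI: " + wasb);
        }
        // For these URIs the container is parsed as user info and the
        // storage account endpoint as the host.
        String host = wasb.getHost()
                .replace(".blob.core.windows.net", ".dfs.core.windows.net");
        return URI.create(target + "://" + wasb.getUserInfo() + "@" + host + wasb.getPath());
    }

    public static void main(String[] args) {
        // "mycontainer" and "myaccount" are placeholder names.
        URI wasb = URI.create("wasb://mycontainer@myaccount.blob.core.windows.net/data/part-0000");
        System.out.println(toAbfs(wasb));
    }
}
```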
2) copyUtil using the open(read) and create function.
i) will this read/write will increase the RU's? compared to azure Blob copy!
- AFAIK, blob charges depend only on the amount of data at rest, the number and types of operations performed, and the data redundancy option selected. Since blob copy is outside my area of expertise, I would suggest opening a new thread in the Storage area, where our experts can better assist with this question.
ii) The entire part file data will be loaded to the memory and then written to target location. this will badly hit memory.
- Hadoop limits the amount of buffered data to 4 MB, so it reads 4 MB of data at a time and then writes it to the destination blob.
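To illustrate why a buffered copy does not load the whole part file into memory, here is a minimal sketch of a chunked copy loop. It uses plain java.io streams rather than the Hadoop FileSystem API, and the class name, method name, and the 4 MB constant are assumptions chosen to mirror the figure above, not the actual Hadoop implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class BufferedCopySketch {
    // Assumed buffer size, mirroring the 4 MB figure mentioned above.
    static final int BUFFER_SIZE = 4 * 1024 * 1024;

    // Copies the stream chunk by chunk; at most bufferSize bytes are
    // held in memory by this loop at any time, regardless of file size.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams stand in for the source and target files here.
        byte[] source = new byte[10 * 1024 * 1024];
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(source), target, BUFFER_SIZE);
        System.out.println("Copied bytes: " + copied);
    }
}
```

With the Hadoop API, the same loop would sit between the stream returned by open on the source path and the stream returned by create on the target path.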
Azure Blob Copy: https://learn.microsoft.com/en-us/java/api/overview/azure/storage-blob-readme?view=azure-java-stable#copy-a-blob
Distcp Java Utility: https://hadoop.apache.org/docs/r3.0.0/api/org/apache/hadoop/tools/DistCp.html
Could you suggest which one is best to choose as performance and cost effective?
- W.r.t. performance, I think the DistCp Java utility is better, but as Azure Blob copy is outside my area of expertise, I would recommend checking with the storage experts in the Storage area about Azure Blob copy.
If you are interested in concrete information about DistCp Java utility performance or cost, please reach out in the Cloudera community.
Hope this helps.