I'm answering my own question after consulting Microsoft Support.
A small correction regarding the contents: it is not one very large file, but tens of thousands of small files. This is also where the performance issues arose: the bottleneck was the IO on the storage account, as too many files were being written in parallel.
The resolution for us was to step away from the copy activity and use a small Scala script instead:
Batching the mini JSON documents into .jsonl documents with a batch size of 1000 lines reduced the execution time from 60 minutes to 20 seconds.
Reading .jsonl in Spark uses the same syntax as reading .json; a short example follows the script below.
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import java.io.BufferedInputStream
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.commons.io.IOUtils
val inputPath = "abfss://container@XYZ.dfs.core.windows.net/input/dump_From:2023-02-20_To:2023-02-21_On:2023-02-21_01-00-01.tgz"
val outputPath = "abfss://container@XYZ.dfs.core.windows.net/Output/tar_files/dump/"
val fileSystemClient = FileSystem.get(new URI("abfss://container@XYZ.dfs.core.windows.net/"), sc.hadoopConfiguration)
val inputStream = new BufferedInputStream(fileSystemClient.open(new Path(inputPath)))
// unzip the file
val gzInput = new GzipCompressorInputStream(inputStream)
val tarInput = new TarArchiveInputStream(gzInput)
// Initialize the iterator with the first entry of the tar archive
var entry = tarInput.getNextEntry
// Initialize variables for batching
var entryCounter = 0
var batchCounter = 1
val batchSize = 1000
// create first output file
var out = fileSystemClient.create(new Path(outputPath + "batch_" + batchCounter +".jsonl"))
while (entry != null) {
// Add entry to the current batch file if not empty
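// NOTE: this assumes each extracted JSON file already ends with a newline;
// otherwise consecutive documents would land on one line and break the JSON Lines format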
if (!entry.isDirectory && entry.getSize > 0) {
IOUtils.copy(tarInput, out)
entryCounter += 1
}
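// once the current batch holds batchSize documents, close it and start the next one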
if (entryCounter >= batchSize){
out.close()
batchCounter += 1
entryCounter = 0
out = fileSystemClient.create(new Path(outputPath + "batch_" + batchCounter +".jsonl"))
}
entry = tarInput.getNextEntry
}
// close last, potentially incomplete batch file
out.close()
tarInput.close()
gzInput.close()
inputStream.close()
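To verify the result, the batches can be read back with the standard JSON reader (a minimal sketch, assuming the spark session of the same notebook and the outputPath variable from the script above):
// .jsonl is read exactly like .json: one JSON document per line
val batchedDocuments = spark.read.json(outputPath)
batchedDocuments.show(5)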