The whole thing was really a bit haywire.
It's also a bit tricky, because I still don't know what the output should look like. Of course, I've never done anything like this before, so I don't even know what to expect. I suspect the whole thing can't be done in a reasonable amount of time with a PowerShell script.
If I just do the math naively, it won't finish within several weeks, even with dozens of shells and countless servers?!
So again, since I know more now than before:
- I have several .txt files with a total of over 10 million Item_IDs.
- And I have over 12 million TAR files. Each TAR file contains multiple files. Some are encrypted, but the important one is an XML file that also contains an Item_ID.
I need to check whether the Item_IDs from the .txt lists are present in the TAR files.
Sorry if I haven't taken some things into account.
Of course I'm still missing some important requirements, but I've already done this:
$folderPath = "xxx"
$IdListPath = "xxx"
# Load Ids from the .txt file into an ArrayList.
$Ids = [System.Collections.ArrayList]::new()
try
{
    $stream = [System.IO.StreamReader]::new($IdListPath)
    # Compare against $null explicitly; a blank line would otherwise end the loop early.
    while ($null -ne ($line = $stream.ReadLine()))
    {
        [void]$Ids.Add($line)
    }
}
finally
{
    $stream.Dispose()
}
# Get all TAR file names / full paths from the directory and its subdirectories.
$filesArray = @([System.IO.Directory]::EnumerateFiles($folderPath, '*.TAR', [System.IO.SearchOption]::AllDirectories))
$separator = [string[]]@("<field name='ha_auftragsnr'>", "</field>")
$stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
# Iterate over the file paths.
for ($i = 0; $i -lt $filesArray.Count; $i++)
{
    # Open the file with an encoding instance ([System.Text.Encoding]::UTF8), not the type itself.
    # Reading the raw TAR as text works because TAR stores its members uncompressed.
    $sread = [System.IO.StreamReader]::new($filesArray[$i], [System.Text.Encoding]::UTF8)
    try
    {
        # ReadToEnd consumes the whole stream in one call; no Peek loop is needed.
        $fileContent = $sread.ReadToEnd()
    }
    finally
    {
        $sread.Dispose()
    }
    # Split by the XML node name; the Id sits between the two separator strings.
    $splitted = $fileContent.Split($separator, [System.StringSplitOptions]::RemoveEmptyEntries)
    $Id = $splitted[1]
    for ($j = 0; $j -lt $Ids.Count; $j++)
    {
        # -eq compares the whole string; -match would treat the stored Id as a regex pattern.
        if ($Id -eq $Ids[$j])
        {
            # ToDo: Remove Id from $Ids because it was found.
            # Store the output?!
            break
        }
    }
}
$stopwatch
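For the two ToDo lines, a minimal sketch of one way to handle them, assuming the IDs fit in memory as a HashSet (lookup and removal are then O(1) instead of a scan over millions of entries); the output path is hypothetical:
$Ids = [System.Collections.Generic.HashSet[string]]::new()
foreach ($line in [System.IO.File]::ReadLines($IdListPath))
{
    [void]$Ids.Add($line)
}
# Inside the file loop, the inner for loop then becomes:
if ($Ids.Contains($Id))
{
    # Remove the Id so later files are not checked against it again.
    [void]$Ids.Remove($Id)
    # Store the hit: the Id and the TAR it was found in (hypothetical output file).
    "$Id`t$($filesArray[$i])" | Add-Content -Path 'C:\temp\found_ids.txt'
}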
Avoiding the use of pipelines when you already have an array of values is correct. But why wait the 13 minutes for the array to fill if you can begin processing as soon as the first element is available?
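A minimal sketch of that idea, assuming the script above is otherwise unchanged: EnumerateFiles already returns a lazy enumerator, so consuming it directly starts the per-file work on the first path instead of waiting for @(...) to materialize the full array.
# foreach over the enumerator begins work on the first path immediately;
# wrapping it in @(...) forces the whole directory scan to finish first.
foreach ($path in [System.IO.Directory]::EnumerateFiles($folderPath, '*.TAR', [System.IO.SearchOption]::AllDirectories))
{
    # ... per-file work from the loop above, with $path instead of $filesArray[$i] ...
}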
I thought maybe I'd need the paths for the output later, and also for error handling?!
So far you've said that the TAR file contains a compressed XLS file, but not what the XLS file contains. Do the XLS files all use identical column names? I.e., do you know where the item ID is located? Are there multiple rows in the XLS file? If there are, typically how many rows: 1, 10, 100, 1000, 10000, 1000000?
Yes, the size of the TAR files is an important fact. Possibly also the average size, for anticipating possible errors and for error handling?!
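If per-file error catching is the worry, a minimal sketch of one option (the log path is hypothetical): wrap the read in try/catch so a single unreadable TAR is logged and skipped instead of aborting the whole run.
try
{
    $fileContent = [System.IO.File]::ReadAllText($filesArray[$i], [System.Text.Encoding]::UTF8)
}
catch
{
    # Log the failing path plus the reason and move on to the next file.
    "$($filesArray[$i])`t$($_.Exception.Message)" | Add-Content -Path 'C:\temp\errors.txt'
    continue
}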
One more thing to consider: using PowerShell 7 and ForEach-Object with the -Parallel switch. Once the list of item IDs is loaded into an array (which would remain unchanged throughout), unpack the file, do a binary search for the item ID, remove the unpacked file, and save the item ID and file name, all within the ForEach-Object block. Doing those operations in parallel should give you a performance gain even after the added overhead of managing the threads.
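A minimal sketch of that suggestion, assuming PowerShell 7+ and keeping the raw-text read from the script above rather than actually unpacking each TAR; the throttle limit and output path are placeholders. [Array]::Sort and [Array]::BinarySearch use the same default comparer, so the sorted array and the search stay consistent:
# Sort once; the array is then shared read-only across threads via $using:.
$sortedIds = [string[]]$Ids
[Array]::Sort($sortedIds)
$hits = $filesArray | ForEach-Object -ThrottleLimit 8 -Parallel {
    $ids = $using:sortedIds
    $sep = $using:separator
    $content = [System.IO.File]::ReadAllText($_, [System.Text.Encoding]::UTF8)
    $parts = $content.Split($sep, [System.StringSplitOptions]::RemoveEmptyEntries)
    $id = $parts[1]
    # BinarySearch returns a non-negative index when the id is present.
    if ([Array]::BinarySearch($ids, $id) -ge 0)
    {
        [pscustomobject]@{ ItemId = $id; File = $_ }
    }
}
$hits | Export-Csv -Path 'C:\temp\found_ids.csv' -NoTypeInformation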
Could this really be faster?