How do I search for a few million values from an Excel column in a few million .TAR files?

youki 991 Reputation points
2023-02-08T18:10:37.66+00:00

Hi,

I have a few million item_IDs and I have to check which of them do not exist in the unzipped .TAR files.

(The tar files indicate whether a file has been processed.)

One .TAR file contains at least one item_ID. Searching for an item_ID in the unzipped .TAR with PowerShell's Select-String command works well.

Now, of course, I'm worried whether PowerShell is a good solution at all, given the volume of data and the number of files.

(I've already tested a bit and will be happy to post the solution as soon as I have access to it and it's done.)

  1. Is it a good idea to do this with PowerShell, or is there a better option, maybe with Excel or another Microsoft tool?
  2. Since I'm limited when it comes to Excel: could I link this to an Excel view model, perhaps via a macro, and would that be a more efficient solution?

4 answers

  1. Rich Matheisen 44,621 Reputation points
    2023-02-08T20:43:47.1333333+00:00

    Are the Excel files in the .tar archives .XLS or .XLSX files? The .XLSX files are themselves PKZip archives, so searching them for a string is unlikely to be effective.

    You can use the ImportExcel module (https://www.powershellgallery.com/packages/ImportExcel/7.0.1) to open the Excel file and then (assuming you know the name of the column you want to search) check each row's column for the value.

    Something like this (this is NOT a final solution to your problem, but use it as an example):

    $Directory = "C:\Junk\*" # this is where the tar files are found
    $Item_ID = "r11"         # the value to search for
    $ColumnName = "Col1"     # the column name to search in the Excel file
    Get-ChildItem $Directory -Recurse -File -Filter *.tar |
        ForEach-Object{
            # Extract the Excel file from the tar file
            # place the file into "some directory"
            Import-Excel "some directory\somefilename.xlsx" |
                ForEach-Object{
                    if ($_.$ColumnName -eq $Item_ID){
                        Write-Host "Found it!"
                    }
                }
            Remove-Item "some directory\somefilename.xlsx"
        }
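The extraction step left open in the comments above could be filled in with tar.exe, which ships with Windows 10 (build 1803 and later) and is standard on Linux/macOS. A minimal sketch under that assumption; the function name `Expand-TarMembers` and all paths are illustrative, not part of the answer itself:

```powershell
# Hypothetical helper for the "Extract the Excel file from the tar file" step.
# Assumes tar.exe is on the PATH (bsdtar ships with Windows 10 1803+).
# Note: bsdtar accepts a glob member pattern directly; GNU tar needs --wildcards.
function Expand-TarMembers {
    param(
        [string]$TarPath,      # the .tar archive to read
        [string]$Pattern,      # member name or glob to extract, e.g. *.xlsx
        [string]$Destination   # directory to extract into
    )
    New-Item -ItemType Directory -Path $Destination -Force | Out-Null
    # -x extract, -f archive file, -C change to the destination first
    tar -xf $TarPath -C $Destination $Pattern
}
```

Each extracted .xlsx could then be fed to Import-Excel exactly as in the loop above, and deleted afterwards.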
    

  2. Limitless Technology 43,926 Reputation points
    2023-02-09T15:14:23.15+00:00

    Hi. Thank you for your question and for reaching out. I'd be more than happy to help you with your query.

    To search for a few million values from an Excel column in a few million .TAR files, you can use the command-line utility grep (on Windows, available via WSL or Git Bash). Grep lets you search files for specific patterns and is very effective on large files; it works here because .tar archives are uncompressed containers, so member text is visible to a byte-level search. Here is an example of how to use grep to search for a value in a .TAR file:

    grep "value" filename.tar

    This command will search for the specified value in the specified .TAR file. If the value is found, it will display the line containing the value. You can also search multiple .TAR files at once by using the following command:

    grep "value" *.tar

    This command will search for the specified value in all .TAR files in the current directory. If the value is found, it will display the line containing the value for each file in which it is found.
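For reference, the asker's existing PowerShell approach maps directly onto this: Select-String plays the role of grep over the raw, uncompressed .tar bytes. A minimal sketch; the function name is illustrative:

```powershell
# Search raw .tar files for a literal value, like the grep one-liners above.
# This only works because .tar is an uncompressed container; it will not
# find text inside compressed archives such as .tar.gz.
function Find-ValueInTar {
    param([string]$Value, [string]$Path)
    # [regex]::Escape treats the value as a literal rather than a regex;
    # -List stops at the first matching line per file, which is enough here
    Select-String -Path $Path -Pattern ([regex]::Escape($Value)) -List
}
```

`Find-ValueInTar -Value "item_12345" -Path C:\Junk\*.tar` would return one match object per archive containing the value.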

    If the reply was helpful, please don’t forget to upvote or accept as answer, thank you.


  3. Limitless Technology 43,926 Reputation points
    2023-02-09T15:14:36.4766667+00:00

    Double post


  4. Rich Matheisen 44,621 Reputation points
    2023-02-15T03:34:58.0833333+00:00

    After installing the NuGet package SevenZipExtractor (https://www.nuget.org/packages/SevenZipExtractor), move the 64-bit 7z.dll and SevenZipExtractor.dll into a directory (the one holding the script will work) and change the directory name in the Add-Type cmdlet.

    Change the $initialsize variable to a value slightly larger than the maximum number of Item_IDs in your list. The value won't affect the size of the hash, but it lets you trade memory for speed. If memory is a real problem, just set it to 256 and let the hash figure out the bucket size and expand as needed.
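To illustrate the sizing advice, here is the case-insensitive hashtable used below, pre-sized the way described (the capacity of 1000 is just an example value):

```powershell
# Pre-sizing the hashtable near the expected key count avoids repeated
# internal rehashing while millions of Item_IDs are loaded; a small value
# such as 256 saves memory at the cost of that speed, as noted above.
$capacity = 1000   # example; in practice use a value near your Item_ID count
$ids = [System.Collections.Specialized.CollectionsUtil]::CreateCaseInsensitiveHashtable($capacity)
$ids.Add("ABC123", [byte]0)
$ids.ContainsKey("abc123")   # True - lookups ignore case
```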

    I don't have any of the XML files you'll be looking at, so I guessed that the "field" entities' "name" value is "Item_ID". XML is CASE SENSITIVE, so be careful!
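To make the case-sensitivity point concrete, a small sketch against a guessed document shape (the XML below is invented, matching the assumption just stated):

```powershell
# XPath string comparisons are case sensitive: only the exact spelling
# "Item_ID" matches. The document below is a guess at the real file layout.
[xml]$doc = '<root><field name="Item_ID">ABC123</field></root>'
$hit  = Select-Xml -Xml $doc -XPath '//field[@name="Item_ID"]'   # matches
$miss = Select-Xml -Xml $doc -XPath '//field[@name="item_id"]'   # no match
$hit.Node.'#text'   # ABC123
```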

    The code will only look at files in the TAR files that have an extension of ".XML". Beyond reading the TAR, all the rest is done in memory.

    $folderPath = "c:\junk"
    $IdListPath = "c:\junk\ids.txt"
    
    # Adjust this to use the directory containing the DLL
    # https://www.nuget.org/packages/SevenZipExtractor
    Add-Type -Path "C:\junk\SevenZipExtractor.dll"
    
    $initialsize = 50   # this has an effect on lookup speed and memory consumption
                        # refer to the REMARKS section: 
                        # https://learn.microsoft.com/en-us/dotnet/api/system.collections.hashtable.-ctor?view=netframework-4.8.1#System_Collections_Hashtable__ctor
                        # I tried using a value of 20,000,000 and it works with a 10-digit key and a 1-byte value
    
    # Load the IDs from the .txt file ==> this no longer needs to be sorted!
    # create a case-insensitive hash
    $Ids = [System.Collections.Specialized.CollectionsUtil]::CreateCaseInsensitiveHashtable($initialsize)
    try {
        $stream = [System.IO.StreamReader]::new($IdListPath)
        while ($null -ne ($line = $stream.ReadLine())) {
            [void]$Ids.Add($line, [byte]0)
        }
    }
    catch{
        $_
        return
    }
    finally {
        $stream.Dispose()
    }
    
    $stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
    # Get all TAR file names/full paths from the directory and its subdirectories
    [System.IO.Directory]::EnumerateFiles($folderPath, '*.TAR', [System.IO.SearchOption]::AllDirectories) |
        ForEach-Object{
    
            $extractor = New-Object SevenZipExtractor.ArchiveFile($_)
            foreach ($entry in $extractor.Entries){
                if ($entry.FileName -like "*.xml"){
                    [System.IO.MemoryStream]$memoryStream = New-Object System.IO.MemoryStream
                    $entry.Extract($memoryStream)
                    $x = New-Object Byte[] $memoryStream.Length
                    [void]$memoryStream.Seek(0,0)
                    [void]$memoryStream.Read($x, 0, $memoryStream.Length)
                    # create the XML doc; trim any trailing NUL bytes
                    $XML = [xml]([System.Text.Encoding]::UTF8.GetString($x)).TrimEnd([char]0)
                    Select-Xml -XPath '//field[@name="Item_ID"]' -Xml $XML | 
                        Select-Object -ExpandProperty Node |
                            ForEach-Object{
                                $key = $_.'#text'.Trim()
                                if ($Ids.ContainsKey($key)){
                                    $Ids.$key = 1
                                }
                            }
                    $memoryStream.Dispose()
                }
            }
            $extractor.Dispose()
        }
    $stopwatch.Stop()
    $stopwatch.Elapsed
    
    $Ids.GetEnumerator()|
        ForEach-Object{
            if ($_.Value -eq 0){
                $_.Key
            } 
        } | Out-File -FilePath c:\Junk\UnmatchedIds.txt