How do I search for a few million values from an Excel column in a few million .TAR files?

youki 991 Reputation points
2023-02-08T18:10:37.66+00:00

Hi,

I have a few million item_IDs and I have to check which of them do not exist in the unzipped .TAR files.

(The tar files indicate whether a file has been processed.)

One .TAR file contains at least one item_ID. Searching for an item_ID in the unzipped .TAR with PowerShell's Select-String command works well.

Now, of course, I'm worried whether PowerShell is a good solution at all, given the volume of data and the number of files.

(I've already tested a bit and will be happy to post the solution as soon as I have access to it and it's done.)

  1. Is it a good idea to do this with PowerShell, or is there a better option, maybe with Excel or another Microsoft tool?
  2. Since I'm limited when it comes to Excel: could I link this to an Excel view model, perhaps via a macro, and would that be a more efficient solution?

4 answers

  1. Rich Matheisen 44,621 Reputation points
    2023-02-08T20:43:47.1333333+00:00

    Are the Excel files in the .tar archives .XLS or .XLSX files? The .XLSX files are themselves PKZip archives, so searching them for a string is unlikely to be effective.

    You can use the ImportExcel module (https://www.powershellgallery.com/packages/ImportExcel/7.0.1) to open the Excel file and then (assuming you know the name of the column you want to search) check each row's column for the value.

    Something like this (this is NOT a final solution to your problem, but use it as an example):

    $Directory = "C:\Junk\*" # this is where the tar files are found
    $Item_ID = "r11"         # the value to search for
    $ColumnName = "Col1"     # the column name to search in the Excel file
    Get-ChildItem $Directory -Recurse -File -Filter *.tar |
        ForEach-Object{
            # Extract the Excel file from the tar file
            # place the file into "some directory"
            Import-Excel "some directory\somefilename.xlsx" |
                ForEach-Object{
                    if ($_.$ColumnName -eq $Item_ID){
                        Write-Host "Found it!"
                    }
                }
            Remove-Item "some directory\somefilename.xlsx"
        }
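The extraction step left open in the comments above could be filled in with tar.exe, which ships with Windows 10 (build 1803 and later) and is standard on Linux/macOS. A minimal sketch under that assumption; the function name `Expand-TarMembers` and all paths are illustrative, not part of the answer itself:

```powershell
# Hypothetical helper for the "Extract the Excel file from the tar file" step.
# Assumes tar.exe is on the PATH (bsdtar ships with Windows 10 1803+).
# Note: bsdtar accepts a glob member pattern directly; GNU tar needs --wildcards.
function Expand-TarMembers {
    param(
        [string]$TarPath,      # the .tar archive to read
        [string]$Pattern,      # member name or glob to extract, e.g. *.xlsx
        [string]$Destination   # directory to extract into
    )
    New-Item -ItemType Directory -Path $Destination -Force | Out-Null
    # -x extract, -f archive file, -C change to the destination first
    tar -xf $TarPath -C $Destination $Pattern
}
```

Each extracted .xlsx could then be fed to Import-Excel exactly as in the loop above, and deleted afterwards.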
    

  2. Limitless Technology 43,926 Reputation points
    2023-02-09T15:14:23.15+00:00

    Hi. Thank you for your question and for reaching out. I'd be more than happy to help you with your query.

    To search for a few million values from an Excel column in a few million .TAR files, you can use the command-line utility grep (on Windows, available via WSL or Git Bash). Grep lets you search files for specific patterns and is very effective on large files; it works here because .tar archives are uncompressed containers, so member text is visible to a byte-level search. Here is an example of how to use grep to search for a value in a .TAR file:

    grep "value" filename.tar

    This command will search for the specified value in the specified .TAR file. If the value is found, it will display the line containing the value. You can also search multiple .TAR files at once by using the following command:

    grep "value" *.tar

    This command will search for the specified value in all .TAR files in the current directory. If the value is found, it will display the line containing the value for each file in which it is found.
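For reference, the asker's existing PowerShell approach maps directly onto this: Select-String plays the role of grep over the raw, uncompressed .tar bytes. A minimal sketch; the function name is illustrative:

```powershell
# Search raw .tar files for a literal value, like the grep one-liners above.
# This only works because .tar is an uncompressed container; it will not
# find text inside compressed archives such as .tar.gz.
function Find-ValueInTar {
    param([string]$Value, [string]$Path)
    # [regex]::Escape treats the value as a literal rather than a regex;
    # -List stops at the first matching line per file, which is enough here
    Select-String -Path $Path -Pattern ([regex]::Escape($Value)) -List
}
```

`Find-ValueInTar -Value "item_12345" -Path C:\Junk\*.tar` would return one match object per archive containing the value.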

    If the reply was helpful, please don’t forget to upvote or accept as answer, thank you.


  3. Limitless Technology 43,926 Reputation points
    2023-02-09T15:14:36.4766667+00:00

    Double post


  4. Rich Matheisen 44,621 Reputation points
    2023-02-15T03:34:58.0833333+00:00

    After installing the NuGet package SevenZipExtractor (https://www.nuget.org/packages/SevenZipExtractor), move the 64-bit 7z.dll and SevenZipExtractor.dll into a directory (the one holding the script will work) and change the directory name in the Add-Type cmdlet.

    Change the $initialsize variable to a value slightly larger than the maximum number of Item_IDs in your list. The value won't affect the size of the hash, but it lets you trade memory for speed. If memory is a real problem, just set it to 256 and let the hash figure out the bucket size and expand as needed.
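To illustrate the sizing advice, here is the case-insensitive hashtable used below, pre-sized the way described (the capacity of 1000 is just an example value):

```powershell
# Pre-sizing the hashtable near the expected key count avoids repeated
# internal rehashing while millions of Item_IDs are loaded; a small value
# such as 256 saves memory at the cost of that speed, as noted above.
$capacity = 1000   # example; in practice use a value near your Item_ID count
$ids = [System.Collections.Specialized.CollectionsUtil]::CreateCaseInsensitiveHashtable($capacity)
$ids.Add("ABC123", [byte]0)
$ids.ContainsKey("abc123")   # True - lookups ignore case
```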

    I don't have any of the XML files you'll be looking at, so I guessed that the "field" entities' "name" value is "Item_ID". XML is CASE SENSITIVE, so be careful!
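To make the case-sensitivity point concrete, a small sketch against a guessed document shape (the XML below is invented, matching the assumption just stated):

```powershell
# XPath string comparisons are case sensitive: only the exact spelling
# "Item_ID" matches. The document below is a guess at the real file layout.
[xml]$doc = '<root><field name="Item_ID">ABC123</field></root>'
$hit  = Select-Xml -Xml $doc -XPath '//field[@name="Item_ID"]'   # matches
$miss = Select-Xml -Xml $doc -XPath '//field[@name="item_id"]'   # no match
$hit.Node.'#text'   # ABC123
```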

    The code will only look at files in the TAR files that have an extension of ".XML". Beyond reading the TAR, all the rest is done in memory.

    $folderPath = "c:\junk"
    $IdListPath = "c:\junk\ids.txt"
    
    # Adjust this to use the directory containing the DLL
    # https://www.nuget.org/packages/SevenZipExtractor
    Add-Type -Path "C:\junk\SevenZipExtractor.dll"
    
    $initialsize = 50   # this has an effect on lookup speed and memory consumption
                        # refer to the REMARKS section: 
                        # https://learn.microsoft.com/en-us/dotnet/api/system.collections.hashtable.-ctor?view=netframework-4.8.1#System_Collections_Hashtable__ctor
                        # I tried using a value of 20,000,000 and it works with a 10-digit key and a 1-byte value
    
    # Load the IDs from the .txt file ==> this no longer needs to be sorted!
    # create a case-insensitive hash
    $Ids = [System.Collections.Specialized.CollectionsUtil]::CreateCaseInsensitiveHashtable($initialsize)
    try {
        $stream = [System.IO.StreamReader]::new($IdListPath)
        while ($null -ne ($line = $stream.ReadLine())) {
            [void]$Ids.Add($line, [byte]0)
        }
    }
    catch{
        $_
        return
    }
    finally {
        $stream.Dispose()
    }
    
    $stopwatch = [System.Diagnostics.Stopwatch]::StartNew()
    # Get all TAR file names/full paths from the directory and its subdirectories
    [System.IO.Directory]::EnumerateFiles($folderPath, '*.TAR', [System.IO.SearchOption]::AllDirectories) |
        ForEach-Object{
    
            $extractor = New-Object SevenZipExtractor.ArchiveFile($_)
            foreach ($entry in $extractor.Entries){
                if ($entry.FileName -like "*.xml"){
                    [System.IO.MemoryStream]$memoryStream = New-Object System.IO.MemoryStream
                    $entry.Extract($memoryStream)
                    $x = New-Object Byte[] $memoryStream.Length
                    [void]$memoryStream.Seek(0,0)
                    [void]$memoryStream.Read($x, 0, $memoryStream.Length)
                    # create the XML doc; trim any trailing NUL bytes
                    $XML = [xml]([System.Text.Encoding]::UTF8.GetString($x)).TrimEnd([char]0)
                    Select-Xml -XPath '//field[@name="Item_ID"]' -Xml $XML | 
                        Select-Object -ExpandProperty Node |
                            ForEach-Object{
                                $key = $_.'#text'.Trim()
                                if ($Ids.ContainsKey($key)){
                                    $Ids.$key = 1
                                }
                            }
                    $memoryStream.Dispose()
                }
            }
            $extractor.Dispose()
        }
    $stopwatch.Stop()
    $stopwatch.Elapsed
    
    $Ids.GetEnumerator()|
        ForEach-Object{
            if ($_.Value -eq 0){
                $_.Key
            } 
        } | Out-File -FilePath c:\Junk\UnmatchedIds.txt