How to vectorize a CSV file in blob storage using integrated vectorization in the Azure portal

2024-09-23T14:18:42.8433333+00:00

I'm trying to vectorize a CSV file stored in blob storage using integrated vectorization in the Azure portal so I can search my CSV data using vector search. My CSV file is one row per product, with each row in this format: "LineNumber, Category, SKUNumber, MFGNumber, description, ConcatenatedText" (where ConcatenatedText is all the columns concatenated except for LineNumber). I found this help article (https://learn.microsoft.com/en-us/azure/search/search-howto-index-csv-blobs) but it's not working as I'd like.
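
For illustration, the file might look like this (hypothetical data):

    LineNumber,Category,SKUNumber,MFGNumber,description,ConcatenatedText
    1,Fasteners,SKU10001,MFG-555,"Hex bolt 10mm","Fasteners SKU10001 MFG-555 Hex bolt 10mm"
    2,Adhesives,SKU10002,MFG-777,"Two-part epoxy","Adhesives SKU10002 MFG-777 Two-part epoxy"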

I'd like to be able to search for a SKU#, for example, or a description, and have each hit return both the "chunk" containing the keyword and the parent row's columns (somewhat like this format: LineNumber, Category, SKUNumber, MFGNumber, description, chunk, chunk_id).
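
In other words, each search hit would ideally look roughly like this (hypothetical values):

    {
      "chunk_id": "aaa111_pages_0",
      "chunk": "Fasteners SKU10001 MFG-555 Hex bolt 10mm",
      "LineNumber": "1",
      "Category": "Fasteners",
      "SKUNumber": "SKU10001",
      "MFGNumber": "MFG-555",
      "description": "Hex bolt 10mm"
    }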

I've tried the "Import and vectorize data" button in the Azure portal. I was able to import my CSV file and create an index. However, the wizard defaults to the "default" parsing mode and doesn't seem to handle CSV files properly yet, even though the indexer config can be updated post-import to use the "delimitedText" parsing mode. After the import completed, I changed my indexer config to include this snippet:

 "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "delimitedText",
      "delimitedTextDelimiter": ",",
      "indexedFileNameExtensions": ".csv",
      "firstLineContainsHeaders": true
    }
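
For reference, that block sits under the indexer's "parameters" property, so the full indexer definition looks roughly like this (names are placeholders):

    {
      "name": "<my-indexer>",
      "dataSourceName": "<my-datasource>",
      "targetIndexName": "<my-index>",
      "skillsetName": "<my-skillset>",
      "parameters": {
        "configuration": {
          "dataToExtract": "contentAndMetadata",
          "parsingMode": "delimitedText",
          "delimitedTextDelimiter": ",",
          "indexedFileNameExtensions": ".csv",
          "firstLineContainsHeaders": true
        }
      }
    }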

I've updated my index to include the columns from my CSV. I've also updated my SplitSkill to chunk the ConcatenatedText column instead of /document/content (as below):

    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#1",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 1000,
      "pageOverlapLength": 50,
      "maximumPagesToTake": 0,
      "inputs": [
        {
          "name": "text",
          "source": "/document/concatenatedText"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
    }
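
For completeness, the wizard also generated an embedding skill downstream of the split skill that vectorizes each chunk; it looks roughly like this (resource details replaced with placeholders):

    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "context": "/document/pages/*",
      "resourceUri": "https://<my-openai-resource>.openai.azure.com",
      "deploymentId": "text-embedding-ada-002",
      "modelName": "text-embedding-ada-002",
      "inputs": [
        { "name": "text", "source": "/document/pages/*" }
      ],
      "outputs": [
        { "name": "embedding", "targetName": "text_vector" }
      ]
    }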

After reindexing, the search returns results that don't make sense to me. My CSV has 10 rows of data, yet a query for one of the SKU#s, which should return 1 row, returns 13 rows/chunks for some reason.

Any thoughts on what can be done to get the desired result?

thanks!


1 answer

Sina Salam
2024-09-23T16:37:30.8933333+00:00

    Hello Lamriben, Mahmoud (Cincinnati, OH),

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you would like to vectorize a CSV file stored in Azure Blob Storage using integrated vectorization and get the desired result.

    From what you've explained, you are on the right track, but there are a few things you will need to fix.

    1. After your CSV file is uploaded to an Azure Blob Storage container, navigate to your Azure AI Search service in the Azure portal and use the "Import and vectorize data" wizard to import the file. The wizard will create an index and configure vectorization for you.
    2. Then, update the indexer configuration to use the delimitedText parsing mode so your CSV file is parsed row by row (a REST sketch for applying this update follows the steps below). An example configuration snippet looks like this:
        {
          "configuration": {
            "dataToExtract": "contentAndMetadata",
            "parsingMode": "delimitedText",
            "delimitedTextDelimiter": ",",
            "indexedFileNameExtensions": ".csv",
            "firstLineContainsHeaders": true
          }
        }
      
    3. Now, to break the text into manageable chunks for vectorization, you will also need to modify the SplitSkill so that it chunks your ConcatenatedText column. Note that enrichment paths are case sensitive, so the skill's source must match your CSV header exactly (/document/ConcatenatedText, not /document/concatenatedText, if your header is ConcatenatedText). For example:
        {
          "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
          "name": "#1",
          "description": "Split skill to chunk documents",
          "context": "/document",
          "defaultLanguageCode": "en",
          "textSplitMode": "pages",
          "maximumPageLength": 1000,
          "pageOverlapLength": 50,
          "maximumPagesToTake": 0,
          "inputs": [
            {
              "name": "text",
              "source": "/document/ConcatenatedText"
            }
          ],
          "outputs": [
            {
              "name": "textItems",
              "targetName": "pages"
            }
          ]
        }
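
    As referenced in step 2, here is a minimal REST sketch for applying the indexer update outside the portal (service, key, and resource names are placeholders):

        PUT https://<my-service>.search.windows.net/indexers/<my-indexer>?api-version=2024-07-01
        Content-Type: application/json
        api-key: <admin-api-key>

        {
          "name": "<my-indexer>",
          "dataSourceName": "<my-datasource>",
          "targetIndexName": "<my-index>",
          "skillsetName": "<my-skillset>",
          "parameters": {
            "configuration": {
              "dataToExtract": "contentAndMetadata",
              "parsingMode": "delimitedText",
              "delimitedTextDelimiter": ",",
              "firstLineContainsHeaders": true
            }
          }
        }

    After changing the parsing mode, reset and rerun the indexer so your documents are re-processed with the new settings.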
      

    At this point, to get the desired result, confirm that the SplitSkill is chunking the ConcatenatedText column and that maximumPageLength and pageOverlapLength produce meaningful chunks; with short product rows like yours, each row should typically yield a single chunk, so 10 rows should produce roughly 10 chunks. Next, make sure your index schema includes fields for LineNumber, Category, SKUNumber, MFGNumber, description, and the chunked text, and that the skillset's index projections map those parent fields onto each chunk document, which is what lets you retrieve the parent row's columns along with the chunk (see the sketch below).
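
    Here is a minimal sketch of the index projections piece of the skillset, assuming the wizard-generated index and CSV headers named exactly as above; each mapped field must also exist in the index schema as a retrievable field:

        "indexProjections": {
          "selectors": [
            {
              "targetIndexName": "<my-index>",
              "parentKeyFieldName": "parent_id",
              "sourceContext": "/document/pages/*",
              "mappings": [
                { "name": "chunk", "source": "/document/pages/*" },
                { "name": "text_vector", "source": "/document/pages/*/text_vector" },
                { "name": "LineNumber", "source": "/document/LineNumber" },
                { "name": "Category", "source": "/document/Category" },
                { "name": "SKUNumber", "source": "/document/SKUNumber" },
                { "name": "MFGNumber", "source": "/document/MFGNumber" },
                { "name": "description", "source": "/document/description" }
              ]
            }
          ],
          "parameters": { "projectionMode": "skipIndexingParentDocuments" }
        }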

    Finally, make sure your query searches the chunked text and selects the relevant parent fields. Combining vector search with traditional keyword search (hybrid search) is a very good way to achieve this; a sketch follows.
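
    A hybrid query sketch, assuming the wizard created a vectorizer on the index and a vector field named text_vector (adjust the names to match your setup):

        POST https://<my-service>.search.windows.net/indexes/<my-index>/docs/search?api-version=2024-07-01
        Content-Type: application/json
        api-key: <query-api-key>

        {
          "search": "SKU10001",
          "vectorQueries": [
            {
              "kind": "text",
              "text": "SKU10001",
              "fields": "text_vector",
              "k": 5
            }
          ],
          "select": "LineNumber, Category, SKUNumber, MFGNumber, description, chunk, chunk_id"
        }

    The "search" part does keyword matching over the chunk text while "vectorQueries" handles the semantic side; Azure AI Search merges the two result sets with reciprocal rank fusion.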

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.

    Please don't forget to close the thread here by upvoting and accepting this as an answer if it is helpful.

