How to vectorize a CSV file in blob storage using integrated vectorization in the Azure portal

2024-09-23T14:18:42.8433333+00:00

I'm trying to vectorize a CSV file stored in blob storage using integrated vectorization in the Azure portal so I can search my CSV data using vector search. My CSV file is one row per product, with each row in this format: "LineNumber, Category, SKUNumber, MFGNumber, description, ConcatenatedText" (where ConcatenatedText is all the columns concatenated except for LineNumber). I found this help article (https://learn.microsoft.com/en-us/azure/search/search-howto-index-csv-blobs) but it's not working as I'd like.
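
For illustration, the file might look like this (hypothetical data):

    LineNumber,Category,SKUNumber,MFGNumber,description,ConcatenatedText
    1,Fasteners,SKU10001,MFG-555,"Hex bolt 10mm","Fasteners SKU10001 MFG-555 Hex bolt 10mm"
    2,Adhesives,SKU10002,MFG-777,"Two-part epoxy","Adhesives SKU10002 MFG-777 Two-part epoxy"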

I'd like to be able to search for a SKU#, for example, or a description, and have each hit return both the "chunk" containing the keyword and the parent row's columns (somewhat like this format: LineNumber, Category, SKUNumber, MFGNumber, description, chunk, chunk_id).
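
In other words, each search hit would ideally look roughly like this (hypothetical values):

    {
      "chunk_id": "aaa111_pages_0",
      "chunk": "Fasteners SKU10001 MFG-555 Hex bolt 10mm",
      "LineNumber": "1",
      "Category": "Fasteners",
      "SKUNumber": "SKU10001",
      "MFGNumber": "MFG-555",
      "description": "Hex bolt 10mm"
    }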

I've tried the "Import and vectorize data" button in the Azure portal. I was able to import my CSV file and create an index. However, the wizard defaults to the "default" parsing mode and doesn't seem to handle CSV files properly yet, even though the indexer config can be updated post-import to use the "delimitedText" parsing mode. After the import completed, I changed my indexer config to include this snippet:

 "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "delimitedText",
      "delimitedTextDelimiter": ",",
      "indexedFileNameExtensions": ".csv",
      "firstLineContainsHeaders": true
    }
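
For reference, that block sits under the indexer's "parameters" property, so the full indexer definition looks roughly like this (names are placeholders):

    {
      "name": "<my-indexer>",
      "dataSourceName": "<my-datasource>",
      "targetIndexName": "<my-index>",
      "skillsetName": "<my-skillset>",
      "parameters": {
        "configuration": {
          "dataToExtract": "contentAndMetadata",
          "parsingMode": "delimitedText",
          "delimitedTextDelimiter": ",",
          "indexedFileNameExtensions": ".csv",
          "firstLineContainsHeaders": true
        }
      }
    }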

I've updated my index to include the columns from my CSV. I've also updated my SplitSkill to chunk the ConcatenatedText column instead of /document/content (as below):

    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#1",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 1000,
      "pageOverlapLength": 50,
      "maximumPagesToTake": 0,
      "inputs": [
        {
          "name": "text",
          "source": "/document/concatenatedText"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
    }
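
For completeness, the wizard also generated an embedding skill downstream of the split skill that vectorizes each chunk; it looks roughly like this (resource details replaced with placeholders):

    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "context": "/document/pages/*",
      "resourceUri": "https://<my-openai-resource>.openai.azure.com",
      "deploymentId": "text-embedding-ada-002",
      "modelName": "text-embedding-ada-002",
      "inputs": [
        { "name": "text", "source": "/document/pages/*" }
      ],
      "outputs": [
        { "name": "embedding", "targetName": "text_vector" }
      ]
    }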

After reindexing, the search returns results that don't make sense to me. My CSV has 10 rows of data, yet a query for one of the SKU#s, which should return 1 row, returns 13 rows/chunks for some reason.

Any thoughts on what can be done to get the desired result?

thanks!


1 answer

Sina Salam
2024-09-23T16:37:30.8933333+00:00

    Hello Lamriben, Mahmoud (Cincinnati, OH),

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you would like to vectorize a CSV file stored in Azure Blob Storage using integrated vectorization and get the desired result.

    From what you've explained, you are on the right track, but there are a few things you will need to fix.

    1. After your CSV file is uploaded to an Azure Blob Storage container, navigate to your Azure AI Search service in the Azure portal and use the "Import and vectorize data" wizard to import the file. The wizard will create an index and configure vectorization for you.
    2. Then, update the indexer configuration to use the delimitedText parsing mode so your CSV file is parsed row by row (a REST sketch for applying this update follows the steps below). An example configuration snippet looks like this:
        {
          "configuration": {
            "dataToExtract": "contentAndMetadata",
            "parsingMode": "delimitedText",
            "delimitedTextDelimiter": ",",
            "indexedFileNameExtensions": ".csv",
            "firstLineContainsHeaders": true
          }
        }
      
    3. Now, to break the text into manageable chunks for vectorization, you will also need to modify the SplitSkill so that it chunks your ConcatenatedText column. Note that enrichment paths are case sensitive, so the skill's source must match your CSV header exactly (/document/ConcatenatedText, not /document/concatenatedText, if your header is ConcatenatedText). For example:
        {
          "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
          "name": "#1",
          "description": "Split skill to chunk documents",
          "context": "/document",
          "defaultLanguageCode": "en",
          "textSplitMode": "pages",
          "maximumPageLength": 1000,
          "pageOverlapLength": 50,
          "maximumPagesToTake": 0,
          "inputs": [
            {
              "name": "text",
              "source": "/document/ConcatenatedText"
            }
          ],
          "outputs": [
            {
              "name": "textItems",
              "targetName": "pages"
            }
          ]
        }
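
    As referenced in step 2, here is a minimal REST sketch for applying the indexer update outside the portal (service, key, and resource names are placeholders):

        PUT https://<my-service>.search.windows.net/indexers/<my-indexer>?api-version=2024-07-01
        Content-Type: application/json
        api-key: <admin-api-key>

        {
          "name": "<my-indexer>",
          "dataSourceName": "<my-datasource>",
          "targetIndexName": "<my-index>",
          "skillsetName": "<my-skillset>",
          "parameters": {
            "configuration": {
              "dataToExtract": "contentAndMetadata",
              "parsingMode": "delimitedText",
              "delimitedTextDelimiter": ",",
              "firstLineContainsHeaders": true
            }
          }
        }

    After changing the parsing mode, reset and rerun the indexer so your documents are re-processed with the new settings.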
      

    At this point, to get the desired result, confirm that the SplitSkill is chunking the ConcatenatedText column and that maximumPageLength and pageOverlapLength produce meaningful chunks; with short product rows like yours, each row should typically yield a single chunk, so 10 rows should produce roughly 10 chunks. Next, make sure your index schema includes fields for LineNumber, Category, SKUNumber, MFGNumber, description, and the chunked text, and that the skillset's index projections map those parent fields onto each chunk document, which is what lets you retrieve the parent row's columns along with the chunk (see the sketch below).
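
    Here is a minimal sketch of the index projections piece of the skillset, assuming the wizard-generated index and CSV headers named exactly as above; each mapped field must also exist in the index schema as a retrievable field:

        "indexProjections": {
          "selectors": [
            {
              "targetIndexName": "<my-index>",
              "parentKeyFieldName": "parent_id",
              "sourceContext": "/document/pages/*",
              "mappings": [
                { "name": "chunk", "source": "/document/pages/*" },
                { "name": "text_vector", "source": "/document/pages/*/text_vector" },
                { "name": "LineNumber", "source": "/document/LineNumber" },
                { "name": "Category", "source": "/document/Category" },
                { "name": "SKUNumber", "source": "/document/SKUNumber" },
                { "name": "MFGNumber", "source": "/document/MFGNumber" },
                { "name": "description", "source": "/document/description" }
              ]
            }
          ],
          "parameters": { "projectionMode": "skipIndexingParentDocuments" }
        }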

    Finally, make sure your query searches the chunked text and selects the relevant parent fields. Combining vector search with traditional keyword search (hybrid search) is a very good way to achieve this; a sketch follows.
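
    A hybrid query sketch, assuming the wizard created a vectorizer on the index and a vector field named text_vector (adjust the names to match your setup):

        POST https://<my-service>.search.windows.net/indexes/<my-index>/docs/search?api-version=2024-07-01
        Content-Type: application/json
        api-key: <query-api-key>

        {
          "search": "SKU10001",
          "vectorQueries": [
            {
              "kind": "text",
              "text": "SKU10001",
              "fields": "text_vector",
              "k": 5
            }
          ],
          "select": "LineNumber, Category, SKUNumber, MFGNumber, description, chunk, chunk_id"
        }

    The "search" part does keyword matching over the chunk text while "vectorQueries" handles the semantic side; Azure AI Search merges the two result sets with reciprocal rank fusion.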

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.

    Please don't forget to close the thread here by upvoting and accepting this as an answer if it is helpful.

