Hello Lamriben, Mahmoud (Cincinnati, OH),
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you would like to vectorize a CSV file stored in Azure Blob Storage using integrated vectorization and get the desired result.
From what you've described, you are on the right track, but there are a few things you will need to fix.
- Once your CSV file is uploaded to an Azure Blob Storage container, go to your Azure AI Search (formerly Azure Cognitive Search) service in the Azure portal and use the "Import and vectorize data" wizard to import the file. The wizard helps you create an index and configure vectorization.
- Then, update the indexer configuration to use the delimitedText parsing mode so that your CSV file is parsed correctly. An example configuration snippet, which goes under the indexer's "parameters", looks like:
  {
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "delimitedText",
      "delimitedTextDelimiter": ",",
      "indexedFileNameExtensions": ".csv",
      "firstLineContainsHeaders": true
    }
  }
- Now, to break the text down into manageable chunks for vectorization, you will also need to update the SplitSkill. For example:
  {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "name": "#1",
    "description": "Split skill to chunk documents",
    "context": "/document",
    "defaultLanguageCode": "en",
    "textSplitMode": "pages",
    "maximumPageLength": 1000,
    "pageOverlapLength": 50,
    "maximumPagesToTake": 0,
    "inputs": [
      { "name": "text", "source": "/document/concatenatedText" }
    ],
    "outputs": [
      { "name": "textItems", "targetName": "pages" }
    ]
  }
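To get a feel for what the two steps above do to your data, here is a minimal local sketch in Python. It is only an approximation under stated assumptions: `rows_as_documents` and `chunk_with_overlap` are hypothetical helper names, the sample CSV is made up, and the chunker is a naive fixed-size character splitter, not the SplitSkill's actual algorithm (which works on the service side and may split on different boundaries).

```python
import csv
import io


def rows_as_documents(csv_text, delimiter=","):
    """Roughly mimic delimitedText parsing: one document per row,
    with the header names used as field names."""
    return list(csv.DictReader(io.StringIO(csv_text), delimiter=delimiter))


def chunk_with_overlap(text, max_len=1000, overlap=50):
    """Naive fixed-size character chunking with overlap.
    An illustration of maximumPageLength / pageOverlapLength only."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


# Made-up sample data; the column names follow the ones in your question.
sample_csv = (
    "LineNumber,Category,ConcatenatedText\n"
    "1,Fasteners,Hex bolt zinc plated grade 5\n"
    "2,Fittings,Brass elbow 90 degree half inch\n"
)

docs = rows_as_documents(sample_csv)
for doc in docs:
    # Tiny limits here so the overlap is visible on short strings.
    pages = chunk_with_overlap(doc["ConcatenatedText"], max_len=20, overlap=5)
    print(doc["LineNumber"], pages)
```

Running something like this on a few real rows can help you judge whether 1000 and 50 are sensible values for the length of your ConcatenatedText column, or whether most rows will end up as a single chunk anyway.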
At this point, to get the desired result, check that the SplitSkill is configured to chunk the ConcatenatedText column (the input source path should match the column header as it appears in your CSV), and that maximumPageLength and pageOverlapLength are set to values that produce meaningful chunks. Next, make sure your index schema includes fields for LineNumber, Category, SKUNumber, MFGNumber, description, and the chunked text, so that you can retrieve the parent row's columns along with each chunk.
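As a rough illustration of such a schema, the Python sketch below builds the "fields" array of an index definition. Please treat the details as assumptions: `chunk_id`, `chunk`, `text_vector`, and `my-vector-profile` are placeholder names, every CSV column is typed as Edm.String for simplicity, and 1536 dimensions is only correct if your embedding model produces vectors of that size.

```python
import json

# Column names taken from your question.
PARENT_COLUMNS = ["LineNumber", "Category", "SKUNumber", "MFGNumber", "description"]


def build_index_fields(parent_columns, vector_dimensions=1536):
    """Build a list of index field definitions (placeholder names, see lead-in)."""
    fields = [
        # Key field for each chunk document.
        {"name": "chunk_id", "type": "Edm.String", "key": True, "filterable": True},
        # The chunked text produced by the SplitSkill.
        {"name": "chunk", "type": "Edm.String", "searchable": True, "retrievable": True},
        # The embedding of the chunk; dimensions must match your embedding model.
        {
            "name": "text_vector",
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "dimensions": vector_dimensions,
            "vectorSearchProfile": "my-vector-profile",
        },
    ]
    # One retrievable field per parent-row column, so each chunk can carry them.
    for col in parent_columns:
        fields.append(
            {
                "name": col,
                "type": "Edm.String",
                "searchable": True,
                "filterable": True,
                "retrievable": True,
            }
        )
    return fields


index_fields = build_index_fields(PARENT_COLUMNS)
print(json.dumps({"fields": index_fields}, indent=2))
```

You can compare the printed JSON with the index the wizard generated and add whichever parent columns are missing.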
Finally, make sure your query is structured to search within the chunked text and return the relevant fields. Combining vector search with traditional text search (hybrid search) is a good way to achieve this.
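A hybrid query of this kind could look roughly like the request body built below. This is a sketch, not a definitive query: `build_hybrid_query` is a hypothetical helper, `chunk` and `text_vector` are assumed field names, and the `"kind": "text"` vector query assumes your index has a vectorizer configured so the service can embed the query text for you.

```python
import json


def build_hybrid_query(user_text, select_fields, k=5):
    """Build a search request body that combines keyword and vector search."""
    return {
        # Traditional full-text search over the searchable fields.
        "search": user_text,
        # Vector search over the chunk embeddings (assumed field name).
        "vectorQueries": [
            {"kind": "text", "text": user_text, "fields": "text_vector", "k": k}
        ],
        # Return the chunk together with the parent row's columns.
        "select": ",".join(select_fields),
        "top": k,
    }


query_body = build_hybrid_query(
    "zinc plated hex bolt",
    ["chunk", "LineNumber", "Category", "SKUNumber", "MFGNumber", "description"],
)
print(json.dumps(query_body, indent=2))
```

The printed body is what you would POST to the index's docs/search endpoint, or paste into the JSON view of Search Explorer in the portal, after adjusting the field names to your index.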
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread by upvoting this reply and accepting it as the answer if it was helpful.