Index CSV blobs and files using delimitedText parsing mode

Article
10/24/2024

Applies to: Blob storage indexers, Files indexers

In Azure AI Search, indexers for Azure Blob Storage and Azure Files support a delimitedText parsing mode for CSV files that treats each line in the CSV as a separate search document. For example, given the following comma-delimited text, the delimitedText parsing mode would result in two documents in the search index:

id, datePublished, tags
1, 2016-01-12, "azure-search,azure,cloud"
2, 2016-07-07, "cloud,mobile"
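As a rough local illustration of the one-row-per-document behavior, the following sketch uses Python's standard csv module to turn the sample text into one dictionary per data row. It is not the indexer's actual parser; the function name rows_to_documents is hypothetical, and skipinitialspace=True is an assumption made so that the blank after each comma in the sample is ignored.

```python
import csv
import io

# Sample text copied from the example above.
SAMPLE = '''id, datePublished, tags
1, 2016-01-12, "azure-search,azure,cloud"
2, 2016-07-07, "cloud,mobile"
'''


def rows_to_documents(text):
    """Return one dict per data row, keyed by the header names (hypothetical helper)."""
    # skipinitialspace=True drops the blank that follows each comma, so the
    # quoted tags value is recognized as a single quoted field.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


documents = rows_to_documents(SAMPLE)
for doc in documents:
    print(doc)
```

Each dictionary here stands in for one search document; the header line supplies the field names and is not itself counted as a document.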

If a field inside the CSV file contains the delimiter, it should be wrapped in quotes. If the field contains a quote, it must be escaped using double quotes ("").

id, datePublished, tags
1, 2020-01-05, "tags,with,""quoted text"""
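The quoting rules can be checked locally before uploading a file. The sketch below uses Python's standard csv module, which by default wraps a field in quotes when it contains the delimiter and doubles any embedded quote. It only illustrates the convention; it is an assumption that the indexer's parser agrees with Python's in every edge case. The helper names write_row and read_row are hypothetical.

```python
import csv
import io


def write_row(fields):
    """Serialize one row using the csv module's default quoting rules."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue().rstrip("\n")


def read_row(line):
    """Parse one CSV line back into a list of field values."""
    # skipinitialspace=True tolerates a blank after each comma, as in the sample above.
    return next(csv.reader(io.StringIO(line), skipinitialspace=True))


# The tags value contains both the delimiter and embedded quotes.
line = write_row(["1", "2020-01-05", 'tags,with,"quoted text"'])
print(line)
print(read_row(line))
```

Writing the row and reading it back returns the original three values, which is a quick way to confirm that a file was escaped correctly.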

Without the delimitedText parsing mode, the entire contents of the CSV file would be treated as one search document.

Whenever you create multiple search documents from a single blob, be sure to review Indexing blobs to produce multiple search documents to understand how document key assignments work. The blob indexer is capable of finding or generating values that uniquely define each new document. Specifically, it can create a transitory AzureSearch_DocumentKey when a blob is parsed into smaller parts, where the value is then used as the search document's key in the index.

Set up CSV indexing

To index CSV blobs, set the delimitedText parsing mode in the indexer definition when you create or update the indexer.

Only UTF-8 encoding is supported.

{
  "name" : "my-csv-indexer",
  ... other indexer properties
  "parameters" : { "configuration" : { "parsingMode" : "delimitedText", "firstLineContainsHeaders" : true } }
}

firstLineContainsHeaders indicates that the first (nonblank) line of each blob contains headers. If blobs don't contain an initial header line, the headers should be specified in the indexer configuration:

"parameters" : { "configuration" : { "parsingMode" : "delimitedText", "delimitedTextHeaders" : "id,datePublished,tags" } }

You can customize the delimiter character using the delimitedTextDelimiter configuration setting. For example:

"parameters" : { "configuration" : { "parsingMode" : "delimitedText", "delimitedTextDelimiter" : "|" } }
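The three settings above can be assembled programmatically. The following is a minimal sketch, assuming a hypothetical helper named build_csv_configuration; only the property names (parsingMode, firstLineContainsHeaders, delimitedTextHeaders, delimitedTextDelimiter) come from this article. Joining the headers with commas follows the example above, and the single-character check on the delimiter is an assumption based on the phrase "delimiter character".

```python
import json


def build_csv_configuration(headers=None, delimiter=None):
    """Build the "parameters" fragment of an indexer definition (hypothetical helper)."""
    config = {"parsingMode": "delimitedText"}
    if headers is None:
        # No explicit headers: rely on the first nonblank line of each blob.
        config["firstLineContainsHeaders"] = True
    else:
        # Explicit headers, comma-separated as in the example above.
        config["delimitedTextHeaders"] = ",".join(headers)
    if delimiter is not None:
        # Assumption: the delimiter is a single character.
        if len(delimiter) != 1:
            raise ValueError("delimiter must be a single character")
        config["delimitedTextDelimiter"] = delimiter
    return {"parameters": {"configuration": config}}


print(json.dumps(build_csv_configuration(), indent=2))
print(json.dumps(build_csv_configuration(["id", "datePublished", "tags"], "|"), indent=2))
```

The returned dictionary can be merged into the rest of the indexer definition before it is serialized to JSON.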

Note

In delimited text parsing mode, Azure AI Search assumes that all blobs are CSV. If you have a mix of CSV and non-CSV blobs in the same data source, consider using file extension filters to control which files are imported on each indexer run.
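One way to apply such a filter is sketched below. The indexedFileNameExtensions setting is not described in this article; it is taken from the general blob indexer configuration, so treat its use here as an assumption and confirm it against the indexer reference. The helper name csv_only_configuration is hypothetical.

```python
import json


def csv_only_configuration(extensions=(".csv",)):
    """Return indexer parameters that restrict a run to the given file extensions."""
    return {
        "parameters": {
            "configuration": {
                "parsingMode": "delimitedText",
                "firstLineContainsHeaders": True,
                # Assumption: a comma-separated list of extensions to include.
                "indexedFileNameExtensions": ",".join(extensions),
            }
        }
    }


print(json.dumps(csv_only_configuration(), indent=2))
```

With this fragment in place, a second indexer with a different parsing mode could handle the non-CSV blobs in the same container.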

Request examples

Putting it all together, here are the complete payload examples.

Datasource:

POST https://[service name].search.windows.net/datasources?api-version=2024-07-01
Content-Type: application/json
api-key: [admin key]

{
    "name" : "my-blob-datasource",
    "type" : "azureblob",
    "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;" },
    "container" : { "name" : "my-container", "query" : "<optional, my-folder>" }
}

Indexer:

POST https://[service name].search.windows.net/indexers?api-version=2024-07-01
Content-Type: application/json
api-key: [admin key]

{
  "name" : "my-csv-indexer",
  "dataSourceName" : "my-blob-datasource",
  "targetIndexName" : "my-target-index",
  "parameters" : { "configuration" : { "parsingMode" : "delimitedText", "delimitedTextHeaders" : "id,datePublished,tags" } }
}
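The same requests can be prepared from a script. The sketch below builds the indexer request with Python's standard urllib.request module and does not send it. The function name build_create_request and the placeholder values "my-service" and "my-admin-key" are hypothetical; the URL pattern, headers, API version, and payload mirror the examples above.

```python
import json
import urllib.request

API_VERSION = "2024-07-01"


def build_create_request(service_name, admin_key, collection, payload):
    """Prepare a POST request for /datasources or /indexers (hypothetical helper)."""
    url = (
        f"https://{service_name}.search.windows.net/"
        f"{collection}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="POST",
    )


indexer_payload = {
    "name": "my-csv-indexer",
    "dataSourceName": "my-blob-datasource",
    "targetIndexName": "my-target-index",
    "parameters": {
        "configuration": {
            "parsingMode": "delimitedText",
            "delimitedTextHeaders": "id,datePublished,tags",
        }
    },
}

# Placeholder service name and key; replace with real values before sending.
request = build_create_request("my-service", "my-admin-key", "indexers", indexer_payload)
print(request.get_method(), request.full_url)
```

Passing the prepared object to urllib.request.urlopen would send it. The data source request can be built the same way by using "datasources" as the collection and the data source definition as the payload.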
