Quickstart: Integrated vectorization (preview)

Important

The Import and vectorize data wizard is in public preview under Supplemental Terms of Use. It targets the 2023-10-01-Preview REST API.

Get started with integrated vectorization (preview) using the Import and vectorize data wizard in the Azure portal. This wizard calls an Azure OpenAI text embedding model to vectorize content, both during indexing and at query time.

In this preview version of the wizard:

  • Source data is blob only, using the default parsing mode (one search document per blob).

  • Index schema is nonconfigurable. Source fields include content (chunked and vectorized), metadata_storage_name for the title, and metadata_storage_path for the document key, which is populated as parent_id in the index.

  • Vectorization is Azure OpenAI only (text-embedding-ada-002), using the HNSW algorithm with defaults.

  • Chunking is nonconfigurable. The effective settings are:

    textSplitMode: "pages",
    maximumPageLength: 2000,
    pageOverlapLength: 500
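
To get a feel for what these settings imply, here is a minimal Python sketch of a fixed-size sliding window with overlap. It is only an approximation under stated assumptions: it cuts purely on character counts, whereas the Text Split skill in "pages" mode is documented to try to avoid breaking sentences. The function name chunk_pages is hypothetical and is not part of any Azure SDK.

```python
def chunk_pages(text: str, max_len: int = 2000, overlap: int = 500) -> list[str]:
    """Split text into windows of at most max_len characters.

    Each window starts `overlap` characters before the end of the previous
    one, so consecutive windows share that many characters.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap  # 1500 characters with the wizard's settings
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# A 5,000-character blob yields three overlapping chunks.
sample = "x" * 5000
pages = chunk_pages(sample)
print([len(p) for p in pages])  # [2000, 2000, 2000]
```

With these numbers, each new chunk advances 1,500 characters, so a long PDF produces roughly one chunk (and one embedding call) per 1,500 characters of extracted text.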
    

Prerequisites

  • An Azure subscription. Create one for free.

  • Azure AI Search, in any region and on any tier. Most existing services support vector search. For a small subset of services created prior to January 2019, an index containing vector fields fails on creation. In this situation, a new service must be created.

  • Azure OpenAI endpoint with a deployment of text-embedding-ada-002 and an API key or Cognitive Services OpenAI User permissions to upload data. You can only choose one vectorizer in this preview, and the vectorizer must be Azure OpenAI.

  • Azure Storage account, standard performance (general-purpose v2). The Hot and Cool access tiers are supported.

  • Blobs that provide text content and metadata. Only unstructured documents are supported. In this preview, your data source must be Azure blobs.

  • Read permissions in Azure Storage. A storage connection string that includes an access key gives you read access to storage content. If instead you're using Microsoft Entra logins and roles, make sure the search service's managed identity has Storage Blob Data Reader permissions.

  • All components (data source and embedding endpoint) must have public access enabled so that the portal nodes can reach them. Otherwise, the wizard fails. After the wizard runs, you can enable firewalls and private endpoints on the individual components for security. If private endpoints are already present and can't be disabled, the alternative is to run the end-to-end flow from a script or program on a virtual machine in the same virtual network as the private endpoint. A Python code sample for integrated vectorization is available on GitHub, and the same repository has samples in other programming languages.

Check for space

Many customers start with the free service. The free tier is limited to three indexes, three data sources, three skillsets, and three indexers. Make sure you have room for extra items before you begin. This quickstart creates one of each object.
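
As a small illustration of this check, the Python sketch below compares current object counts against the free-tier limits quoted above. The counts are hypothetical values that you would read from your service's Overview page, and has_room is an illustrative helper, not an Azure API.

```python
# Free-tier limits as stated in this quickstart.
FREE_TIER_LIMITS = {"indexes": 3, "data sources": 3, "skillsets": 3, "indexers": 3}


def has_room(current_counts: dict, limits: dict = FREE_TIER_LIMITS) -> dict:
    """Return, per object type, whether one more object fits under the limit."""
    return {
        kind: current_counts.get(kind, 0) + 1 <= limit
        for kind, limit in limits.items()
    }


# Hypothetical counts for an existing free-tier service.
counts = {"indexes": 2, "data sources": 3, "skillsets": 0, "indexers": 1}
room = has_room(counts)
blocked = [kind for kind, ok in room.items() if not ok]
print(blocked)  # ['data sources']
```

In this example the wizard would have no room to create its data source, so an existing one would need to be deleted first.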

Check for semantic ranking

This wizard supports semantic ranking, but only on Basic tier and above, and only if semantic ranking is already enabled on your search service. If you're using a billable tier, check to see if semantic ranking is enabled.

Screenshot of the semantic ranker configuration page.

Prepare sample data

This section points you to data that works for this quickstart.

  1. Sign in to the Azure portal with your Azure account, and go to your Azure Storage account.

  2. In the navigation pane, under Data Storage, select Containers.

  3. Create a new container and then upload the health-plan PDF documents used for this quickstart.

  4. Before leaving the Azure Storage account in the Azure portal, decide how the search service connects. For role-based access, grant Storage Blob Data Reader permissions on the container to the search service's managed identity. Otherwise, get a connection string for the storage account from the Access keys page.

Get connection details for Azure OpenAI

The wizard needs an endpoint, a deployment of text-embedding-ada-002, and either an API key or a search service managed identity with Cognitive Services OpenAI User permissions.

  1. Sign in to the Azure portal with your Azure account, and go to your Azure OpenAI resource.

  2. Under Keys and management, copy the endpoint.

  3. On the same page, copy a key. Alternatively, select Access control and assign the Cognitive Services OpenAI User role to your search service identity.

  4. Under Model deployments, select Manage deployments to open Azure AI Studio. Copy the deployment name of text-embedding-ada-002.
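
Before running the wizard, it can help to confirm that the endpoint, key, and deployment name work together. The following Python sketch builds a request for the Azure OpenAI embeddings REST endpoint using only the standard library. The endpoint, deployment name, and key are placeholders, build_embedding_request is a hypothetical helper, and the api-version value is one generally available version chosen for illustration. The request is built but not sent, so the snippet runs without credentials.

```python
import json
import urllib.request


def build_embedding_request(endpoint: str, deployment: str, api_key: str,
                            text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to an embeddings deployment."""
    url = (
        f"{endpoint.rstrip('/')}/openai/deployments/{deployment}"
        "/embeddings?api-version=2023-05-15"
    )
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


# Placeholder values: substitute the endpoint, deployment name, and key you copied.
req = build_embedding_request(
    "https://my-openai.openai.azure.com/", "my-ada-002", "<api-key>", "hello"
)
print(req.full_url)

# To actually send the request, uncomment the lines below.
# with urllib.request.urlopen(req) as resp:
#     vector = json.load(resp)["data"][0]["embedding"]
#     print(len(vector))  # text-embedding-ada-002 produces 1536 dimensions
```

A successful response confirms the same three values that the wizard asks for later.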

Start the wizard

To get started, browse to your Azure AI Search service in the Azure portal and open the Import and vectorize data wizard.

  1. Sign in to the Azure portal with your Azure account, and go to your Azure AI Search service.

  2. On the Overview page, select Import and vectorize data.

    Screenshot of the wizard command.

Connect to your data

The next step is to connect to a data source to use for the search index.

  1. In the Import and vectorize data wizard on the Connect to your data tab, expand the Data Source dropdown list and select Azure Blob Storage.

  2. Specify the Azure subscription, storage account, and container that provides the data.

  3. For the connection, either provide a full access connection string that includes a key, or specify a managed identity that has Storage Blob Data Reader permissions on the container.

  4. Specify whether you want deletion detection.

    Screenshot of the data source page.

  5. Select Next: Vectorize and Enrich to continue.
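
For orientation, the choices on this page correspond roughly to a data source definition in the REST API. The Python sketch below assembles such a definition as a dictionary. The helper name, object names, and placeholder connection strings are assumptions, and the soft-delete policy shown is one of the deletion detection policies documented for blob data sources, not necessarily the one the wizard emits.

```python
import json


def blob_data_source(name: str, container: str, connection_string: str,
                     soft_delete: bool = False) -> dict:
    """Assemble a blob data source definition (a sketch, not wizard output)."""
    definition = {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }
    if soft_delete:
        # Assumes native blob soft delete is enabled on the storage account.
        definition["dataDeletionDetectionPolicy"] = {
            "@odata.type": "#Microsoft.Azure.Search."
                           "NativeBlobSoftDeleteDeletionDetectionPolicy"
        }
    return definition


# Option 1: full access connection string that includes a key (placeholders).
key_based = blob_data_source(
    "health-plan-ds", "health-plan-docs",
    "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;"
    "EndpointSuffix=core.windows.net",
)

# Option 2: managed identity, expressed as a resource ID with no key.
identity_based = blob_data_source(
    "health-plan-ds", "health-plan-docs",
    "ResourceId=/subscriptions/<subscription-id>/resourceGroups/<group>"
    "/providers/Microsoft.Storage/storageAccounts/<account>;",
    soft_delete=True,
)

print(json.dumps(identity_based, indent=2))
```

The managed identity form carries no secret, which is why it depends on the Storage Blob Data Reader role assignment described earlier.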

Enrich and vectorize your data

In this step, specify the embedding model used to vectorize chunked data.

  1. Provide the subscription, endpoint, API key, and model deployment name.

  2. Optionally, you can extract images from binary files (for example, scanned document files) and use OCR to recognize the text in them.

  3. Optionally, you can add semantic ranking to rerank results at the end of query execution, promoting the most semantically relevant matches to the top.

  4. Specify a run time schedule for the indexer.

    Screenshot of the enrichment page.

  5. Select Next: Create and Review to continue.
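
The settings on this page end up in a skillset. As a rough sketch, the Python below assembles a two-skill definition in the shape used by the 2023-10-01-Preview REST API: a Text Split skill with the chunking settings listed at the top of this article, followed by an Azure OpenAI embedding skill. The helper name, skillset name, and the "pages" and "vector" target names are assumptions, and the skillset the wizard generates may contain additional properties that are omitted here.

```python
import json


def build_skillset(name: str, openai_endpoint: str, deployment: str,
                   api_key: str) -> dict:
    """Assemble a chunk-then-embed skillset definition (illustrative only)."""
    split_skill = {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "context": "/document",
        "textSplitMode": "pages",
        "maximumPageLength": 2000,
        "pageOverlapLength": 500,
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "textItems", "targetName": "pages"}],
    }
    embedding_skill = {
        "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
        # Runs once per chunk produced by the split skill.
        "context": "/document/pages/*",
        "resourceUri": openai_endpoint,
        "deploymentId": deployment,
        "apiKey": api_key,
        "inputs": [{"name": "text", "source": "/document/pages/*"}],
        "outputs": [{"name": "embedding", "targetName": "vector"}],
    }
    return {"name": name, "skills": [split_skill, embedding_skill]}


# Placeholder values for illustration.
skillset = build_skillset(
    "health-plan-skillset", "https://my-openai.openai.azure.com",
    "my-ada-002", "<api-key>",
)
print(json.dumps(skillset, indent=2))
```

The embedding skill's context points at each chunk, which is how one blob fans out into many vectors.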

Run the wizard

This step creates the following objects:

  • Data source connection to your blob container.

  • Index with vector fields, vectorizers, vector profiles, and vector algorithms. You aren't prompted to design or modify the default index during the wizard workflow. Indexes conform to the 2023-10-01-Preview REST API version.

  • Skillset with the Text Split skill for chunking and the AzureOpenAIEmbedding skill for vectorization.

  • Indexer with field mappings and output field mappings (if applicable).
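
To show how the index pieces relate, here is a hedged Python sketch of an index definition with one vector field, an HNSW algorithm left at its defaults, an Azure OpenAI vectorizer, and a profile that ties them together. Field names follow the ones that appear in the queries later in this article. The field attributes, the object names, and the helper itself are assumptions, not the wizard's exact output.

```python
def build_vector_index(name: str, openai_endpoint: str, deployment: str,
                       api_key: str) -> dict:
    """Assemble an index definition with integrated vectorization (a sketch)."""
    return {
        "name": name,
        "fields": [
            {"name": "chunk_id", "type": "Edm.String", "key": True},
            {"name": "parent_id", "type": "Edm.String", "filterable": True},
            {"name": "chunk", "type": "Edm.String", "searchable": True},
            # Filterable so that queries can filter on the source PDF name.
            {"name": "title", "type": "Edm.String", "searchable": True,
             "filterable": True},
            {
                "name": "vector",
                "type": "Collection(Edm.Single)",
                "searchable": True,
                "dimensions": 1536,  # output size of text-embedding-ada-002
                "vectorSearchProfile": "vector-profile",
            },
        ],
        "vectorSearch": {
            # No hnswParameters given, so service defaults apply.
            "algorithms": [{"name": "hnsw-default", "kind": "hnsw"}],
            "vectorizers": [{
                "name": "openai-vectorizer",
                "kind": "azureOpenAI",
                "azureOpenAIParameters": {
                    "resourceUri": openai_endpoint,
                    "deploymentId": deployment,
                    "apiKey": api_key,
                },
            }],
            "profiles": [{
                "name": "vector-profile",
                "algorithm": "hnsw-default",
                "vectorizer": "openai-vectorizer",
            }],
        },
    }


index = build_vector_index(
    "health-plan-index", "https://my-openai.openai.azure.com",
    "my-ada-002", "<api-key>",
)
profile = index["vectorSearch"]["profiles"][0]
print(profile)
```

The vectorizer on the profile is what lets a query pass plain text instead of a precomputed vector.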

If you get errors, review permissions first. You need Cognitive Services OpenAI User on Azure OpenAI and Storage Blob Data Reader on Azure Storage. Your blobs must be unstructured (chunked data is pulled from the blob's "content" property).

Check results

Search explorer accepts text strings as input and then vectorizes the text for vector query execution.

  1. Select your index.

  2. Optionally, select Query options and hide vector values in search results. This step makes your search results easier to read.

    Screenshot of the query options button.

  3. Select JSON view so that you can enter text for your vector query in the text vector query parameter.

    Screenshot of JSON selector.

    This wizard offers a default query that issues a vector query on the "vector" field, returning the 5 nearest neighbors. If you opted to hide vector values, the default query includes a "select" parameter that excludes the vector field from search results.

    {
       "select": "chunk_id,parent_id,chunk,title",
       "vectorQueries": [
           {
              "kind": "text",
              "text": "*",
              "k": 5,
              "fields": "vector"
           }
        ]
    }
    
  4. Replace the text "*" with a question related to health plans, such as "which plan has the lowest deductible".

  5. Select Search to run the query.

    Screenshot of search results.

    You should see 5 matches, where each document is a chunk of the original PDF. The title field shows which PDF the chunk comes from.

  6. To see all of the chunks from a specific document, add a filter for the title field for a specific PDF:

    {
       "select": "chunk_id,parent_id,chunk,title",
       "filter": "title eq 'Benefit_Options.pdf'",
       "count": true,
       "vectorQueries": [
           {
              "kind": "text",
              "text": "*",
              "k": 5,
              "fields": "vector"
           }
        ]
    }
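
The same queries can be issued outside Search explorer. The Python sketch below rebuilds both request bodies shown above and wraps them in a POST request to the documents search endpoint for the 2023-10-01-Preview API version. The service name, index name, and key are placeholders, and both helper functions are hypothetical. The request is constructed but not sent, so the snippet runs offline.

```python
import json
import urllib.request
from typing import Optional

API_VERSION = "2023-10-01-Preview"


def build_text_vector_query(text: str, k: int = 5,
                            title: Optional[str] = None) -> dict:
    """Build a query body that lets the service vectorize `text` itself."""
    query = {
        "select": "chunk_id,parent_id,chunk,title",
        "vectorQueries": [
            {"kind": "text", "text": text, "k": k, "fields": "vector"}
        ],
    }
    if title is not None:
        # OData string literals escape a single quote by doubling it.
        safe_title = title.replace("'", "''")
        query["filter"] = f"title eq '{safe_title}'"
        query["count"] = True
    return query


def build_search_request(service: str, index: str, api_key: str,
                         query: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for a search query."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/search?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


question = build_text_vector_query("which plan has the lowest deductible")
filtered = build_text_vector_query("*", title="Benefit_Options.pdf")
request = build_search_request(
    "my-search-service", "health-plan-index", "<query-key>", filtered
)
print(json.dumps(filtered, indent=2))
```

Because the filter is built from a string, escaping the quote character keeps file names that contain an apostrophe from breaking the OData expression.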
    
    

Clean up

Azure AI Search is a billable resource. If it's no longer needed, delete it from your subscription to avoid charges.

Next steps

This quickstart introduced you to the Import and vectorize data wizard that creates all of the objects necessary for integrated vectorization. If you want to explore each step in detail, try an integrated vectorization sample.