Quickstart: Import and vectorize data wizard (preview)

Artikkeli
06/17/2024

Important

Import and vectorize data wizard is in public preview under Supplemental Terms of Use. By default, it targets the 2024-05-01-Preview REST API.

Get started with integrated vectorization (preview) using the Import and vectorize data wizard in the Azure portal. This wizard calls a user-specified embedding model to vectorize content during indexing and for queries.

You need three Azure resources and some sample files to complete this walkthrough:

Azure Blob storage or Microsoft Fabric with OneLake for your data
Azure vectorizations: either Azure AI services multiservice account, Azure OpenAI, or Azure AI Studio model catalog
Azure AI Search for indexing and queries

Preview limitations

Source data is either Azure Blob Storage or OneLake files and shortcuts, using the default parsing mode (one search document per blob or file).
Index schema is nonconfigurable. Source fields include "content" (chunked and vectorized), "metadata_storage_name" for title, and a "metadata_storage_path" for the document key, represented as parent_id in the Index.

Chunking is nonconfigurable. The effective settings are:

textSplitMode: "pages",
maximumPageLength: 2000,
pageOverlapLength: 500

For fewer limitations or more data source options, try a code-base approach. See integrated vectorization sample for details.

Prerequisites

An Azure subscription. Create one for free.
For data, use either an Azure Storage account or a OneLake lakehouse. For Azure Storage, use a standard performance (general-purpose v2) account. Access tiers can be hot, cool, and cold.
For vectorization, have an Azure AI services multiservice account or Azure OpenAI endpoint with deployments.

For multimodal with Azure AI Vision, create an Azure AI service in SwedenCentral, EastUS, NorthEurope, WestEurope, WestUS, SoutheastAsia, KoreaCentral, FranceCentral, AustraliaEast, WestUS2, SwitzerlandNorth, JapanEast. Check the documentation for an updated list.

You can also use Azure AI Studio model catalog (and hub and project) with model deployments.
Azure AI Search, in the same region as your Azure AI service. We recommend Basic tier or higher.s
Role assignments or API keys are required for connections to embedding models and data sources. Instructions for role-based access are provided in this article.

All of the above resources must have public access enabled for the portal nodes to be able to access them. Otherwise, the wizard fails. After the wizard runs, firewalls and private endpoints can be enabled on the different integration components for security.

If private endpoints are already present and can't be disabled, the alternative option is to run the respective end-to-end flow from a script or program from a virtual machine within the same virtual network as the private endpoint. Here's a Python code sample for integrated vectorization. In the same GitHub repo are samples in other programming languages.

A free search service supports role-based access control on connections to Azure AI Search, but it doesn't support managed identities on outbound connections to Azure Storage or Azure AI Vision. This means you must use key-based authentication on free search service connections to other Azure services. For more secure connections, use basic tier or above and configure a managed identity and role assignments to admit requests from Azure AI Search on other Azure services.

Check for space

If you're starting with the free service, you're limited to three indexes, three data sources, three skillsets, and three indexers. Make sure you have room for extra items before you begin. This quickstart creates one of each object.

Check for service identity

We recommend role assignments for search service connections to other resources.

On Azure AI Search, enable role-based access.
Configure your search service to use a system or user-assigned managed identity.

In the following sections, you can assign the search service managed identity to roles in other services. Steps for role assignments are provided where applicable.

Check for semantic ranking

This wizard supports semantic ranking, but only on Basic tier and higher, and only if semantic ranking is already enabled on your search service. If you're using a billable tier, check to see if semantic ranking is enabled.

Prepare sample data

This section points you to data that works for this quickstart.

Azure Storage
OneLake

Sign in to the Azure portal with your Azure account, and go to your Azure Storage account.
In the navigation pane, under Data Storage, select Containers.
Create a new container and then upload the health-plan PDF documents used for this quickstart.
On Access control, assign Storage Blob Data Reader on the container to the search service identity. Or, get a connection string to the storage account from the Access keys page.

Sign in to the Power BI and create a workspace.
In Power BI, select Workspaces from the left-hand menu and open the workspace you created.
Assign permissions at the workspace level:
1. Select Manage access in the top right menu.
2. Select Add people or groups.
3. Enter the name of your search service. For example, if the URL is https://my-demo-service.search.windows.net, the search service name is my-demo-service.
4. Select a role. The default is Viewer, but you need Contributor to pull data into a search index.
Load the sample data:
1. From the Power BI switcher located at the bottom left, select Data Engineering.
2. In the Data Engineering screen, select Lakehouse to create a lakehouse.
3. Provide a name and then select Create to create and open the new lakehouse.
4. Select Upload files and then upload the health-plan PDF documents used for this quickstart.
Before leaving the lakehouse, copy the URL, or get the workspace and lakehouse IDs, so that you can specify the lakehouse in the wizard. The URL is in this format: https://msit.powerbi.com/groups/00000000-0000-0000-0000-000000000000/lakehouses/11111111-1111-1111-1111-111111111111?experience=data-engineering

Set up embedding models

Integrated vectorization and the Import and vectorize data wizard tap into deployed embedding models during indexing to convert text and images into vectors.

You can use embedding models deployed in Azure OpenAI, Azure AI Vision for multimodal embeddings, or in the model catalog in Azure AI Studio.

Import and vectorize data supports: text-embedding-ada-002, text-embedding-3-large, text-embedding-3-small. Internally, the wizard uses the AzureOpenAIEmbedding skill to connect to Azure OpenAI.

Use these instructions to assign permissions or get an API key for search service connection to Azure OpenAI. You should set up permissions or have connection information in hand before running the wizard.

Sign in to the Azure portal with your Azure account, and go to your Azure OpenAI resource.
Set up permissions:
1. Select Access control from the left menu.
2. Select Add and then select Add role assignment.
3. Under Job function roles, select Cognitive Services OpenAI User and then select Next.
4. Under Members, select Managed identity and then select Members.
5. Filter by subscription and resource type (Search services), and then select the managed identity of your search service.
6. Select Review + assign.
On the Overview page, select Click here to view endpoints and Click here to manage keys if you need to copy an endpoint or API key. You can paste these values into the wizard if you're using an Azure OpenAI resource with key-based authentication.
Under Resource Management and Model deployments, select Manage Deployments to open Azure AI Studio.
Copy the deployment name of text-embedding-ada-002 or another supported embedding model. If you don't have an embedding model, deploy one now.

Start the wizard

Sign in to the Azure portal with your Azure account, and go to your Azure AI Search service.
On the Overview page, select Import and vectorize data.

Connect to your data

The next step is to connect to a data source to use for the search index.

In the Import and vectorize data wizard on the Connect to your data tab, expand the Data Source dropdown list and select Azure Blob Storage or OneLake.
Specify the Azure subscription.
For OneLake, specify the lakehouse URL or provide the workspace and lakehouse IDs.
For Azure Storage, select the account and container that provides the data.
Specify whether you want deletion detection.
Select Next.

Vectorize your text

In this step, specify the embedding model used to vectorize chunked data.

Specify whether deployed models are on Azure OpenAI, the Azure AI Studio model catalog, or an existing Azure AI Vision multimodal resource in the same region as Azure AI Search.
Specify the Azure subscription.
For Azure OpenAI, select the service, model deployment, and authentication type. See Set up embedding models for details.
For AI Studio catalog, select the project, model deployment, and authentication type. See Set up embedding models for details.
For AI Vision vectorization, select the account. See Set up embedding models for details.
Select the checkbox acknowledging the billing impact of using these resources.
Select Next.

Vectorize and enrich your images

If your content includes images, you can apply AI in two ways:

Use a supported image embedding model from the catalog, or choose the Azure AI Vision multimodal embeddings API to vectorize images.
Use OCR to recognize text in images.

Azure AI Search and your Azure AI resource must be in the same region.

Specify the kind of connection the wizard should make. For image vectorization, it can connect to embedding models in Azure AI Studio or Azure AI Vision.
Specify the subscription.
For Azure AI Studio model catalog, specify the project and deployment. See Setting up an embedding model for details.
Optionally, you can crack binary images (for example, scanned document files) and use OCR to recognize text.
Select the checkbox acknowledging the billing impact of using these resources.
Select Next.

Advanced settings

Optionally, you can add semantic ranking to rerank results at the end of query execution, promoting the most semantically relevant matches to the top.
Optionally, specify a run time schedule for the indexer.
Select Next.

Run the wizard

On Review and create, specify a prefix for the objects created when the wizard runs. A common prefix helps you stay organized.
Select Create to run the wizard. This step creates the following objects:
- Data source connection.
- Index with vector fields, vectorizers, vector profiles, vector algorithms. You aren't prompted to design or modify the default index during the wizard workflow. Indexes conform to the 2024-05-01-preview REST API.
- Skillset with Text Split skill for chunking and an embedding skill for vectorization. The embedding skill is either the AzureOpenAIEmbeddingModel skill for Azure OpenAI or AML skill for Azure AI Studio model catalog.
- Indexer with field mappings and output field mappings (if applicable).

If you can't select Azure AI Vision vectorizer, make sure you have an Azure AI Vision resource in a supported region, and that your search service managed identity has Cognitive Services OpenAI User permissions.

If you can't progress through the wizard because other options aren't available (for example, you can't select a data source or an embedding model), revisit the role assignments. Error messages indicate that models or deployments don't exist, when in fact the real issue is that the search service doesn't have permission to access them.

Check results

Search explorer accepts text strings as input and then vectorizes the text for vector query execution.

In the Azure portal, under Search Management and Indexes, select the index your created.
Optionally, select Query options and hide vector values in search results. This step makes your search results easier to read.
Select JSON view so that you can enter text for your vector query in the text vector query parameter.

This wizard offers a default query that issues a vector query on the "vector" field, returning the 5 nearest neighbors. If you opted to hide vector values, your default query includes a "select" statement that excludes the vector field from search results.
```
{
   "select": "chunk_id,parent_id,chunk,title",
   "vectorQueries": [
       {
          "kind": "text",
          "text": "*",
          "k": 5,
          "fields": "vector"
       }
    ]
}
```
Replace the text "*" with a question related to health plans, such as "which plan has the lowest deductible".
Select Search to run the query.

You should see 5 matches, where each document is a chunk of the original PDF. The title field shows which PDF the chunk comes from.

To see all of the chunks from a specific document, add a filter for the title field for a specific PDF:

{
   "select": "chunk_id,parent_id,chunk,title",
   "filter": "title eq 'Benefit_Options.pdf'",
   "count": true,
   "vectorQueries": [
       {
          "kind": "text",
          "text": "*",
          "k": 5,
          "fields": "vector"
       }
    ]
}

Clean up

Azure AI Search is a billable resource. If it's no longer needed, delete it from your subscription to avoid charges.

Next steps

This quickstart introduced you to the Import and vectorize data wizard that creates all of the objects necessary for integrated vectorization. If you want to explore each step in detail, try an integrated vectorization sample.

Jaa