Tutorial: Enrich Cognitive Search index with custom classes from your data

With the abundance of electronic documents within the enterprise, searching through them becomes a tiring and expensive task. Azure Cognitive Search helps you search through your files based on their indices. Custom text classification enriches the indexing of these files by classifying them into your custom classes.

In this tutorial, you will learn how to:

  • Create a custom text classification project.
  • Publish your Azure function.
  • Add an index to your Azure Cognitive Search instance.

Prerequisites

Upload sample data to blob container

After you have created an Azure storage account and connected it to your Language resource, you will need to upload the documents from the sample dataset to the root directory of your container. These documents will later be used to train your model.

  1. Download the sample dataset for multi-label classification projects.

  2. Open the .zip file, and extract the folder containing the documents.

The provided sample dataset contains about 200 documents, each of which is a summary for a movie. Each document belongs to one or more of the following classes:

  • "Mystery"
  • "Drama"
  • "Thriller"
  • "Comedy"
  • "Action"
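
Because each document can carry one or more of these classes, a quick sanity check on your own labels can catch typos before training. The following is a minimal sketch, not part of the sample dataset or the service: the function name `validate_labels` and the dictionary shape (file name mapped to a list of class names) are assumptions made for illustration.

```python
# The five classes listed above for the sample dataset.
ALLOWED_CLASSES = {"Mystery", "Drama", "Thriller", "Comedy", "Action"}


def validate_labels(doc_labels):
    """Return {document name: problem} for entries that are not valid multi-label assignments.

    `doc_labels` is a hypothetical in-memory mapping such as
    {"movie_01.txt": ["Drama", "Comedy"]}; it is not the service's labels file format.
    """
    problems = {}
    for doc, labels in doc_labels.items():
        if not labels:
            # Each document must belong to at least one class.
            problems[doc] = "no class assigned"
            continue
        unknown = sorted(set(labels) - ALLOWED_CLASSES)
        if unknown:
            problems[doc] = "unknown classes: " + ", ".join(unknown)
    return problems
```

An empty result means every document has at least one class and only uses the classes listed above.
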
  1. In the Azure portal, navigate to the storage account you created, and select it.

  2. In your storage account, select Containers from the left menu, located below Data storage. On the screen that appears, select + Container. Give the container the name example-data and leave the default Public access level.

    A screenshot showing the main page for a storage account.

  3. After your container has been created, select it. Then select the Upload button and choose the .txt and .json files you downloaded earlier.

    A screenshot showing the button for uploading files to the storage account.
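
If you prefer to script the upload instead of using the portal, the steps above can be sketched with the `azure-storage-blob` Python package. This is an illustrative alternative, not part of the tutorial: the function names are hypothetical, the package must be installed separately, and the sketch assumes the `example-data` container already exists and that you have your storage account connection string.

```python
from pathlib import Path


def collect_dataset_files(folder):
    """Return the .txt and .json files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in {".txt", ".json"}
    )


def upload_dataset(connection_string, folder, container_name="example-data"):
    """Upload each collected file to the root directory of the container."""
    # Imported here so collect_dataset_files works without the SDK installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    container = service.get_container_client(container_name)
    for path in collect_dataset_files(folder):
        with path.open("rb") as data:
            # The blob name has no folder prefix, so the file lands in the container root.
            container.upload_blob(name=path.name, data=data, overwrite=True)
```

Calling `upload_dataset(connection_string, "path/to/extracted/folder")` would then upload the extracted sample documents and the labels file in one pass.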

Create a custom text classification project

Once your resource and storage container are configured, create a new custom text classification project. A project is a work area for building your custom ML models based on your data. Your project can only be accessed by you and others who have access to the Language resource being used.

  1. Sign in to Language Studio. A window will appear to let you select your subscription and Language resource. Select your Language resource.

  2. Under the Classify text section of Language Studio, select Custom text classification.

    A screenshot showing the location of custom text classification in the Language Studio landing page.

  3. Select Create new project from the top menu in your projects page. Creating a project will let you label data, train, evaluate, improve, and deploy your models.

    A screenshot of the project creation page.

  4. After you click Create new project, a window will appear to let you connect your storage account. If you've already connected a storage account, you will see the connected storage account. If not, choose your storage account from the dropdown that appears and click Connect storage account; this will set the required roles for your storage account. This step may return an error if you are not assigned as owner on the storage account.

    Note

    • You only need to do this step once for each new Language resource you use.
    • This process is irreversible. If you connect a storage account to your Language resource, you cannot disconnect it later.
    • You can only connect your Language resource to one storage account.

    A screenshot of the storage connection screen for custom classification projects.

  5. Select the project type. You can create either a Multi label classification project, where each document can belong to one or more classes, or a Single label classification project, where each document can belong to only one class. The selected type can't be changed later. Learn more about project types.

    A screenshot of the available custom classification project types.

  6. Enter the project information, including a name, description, and the language of the documents in your project. If you're using the example dataset, select English. You won’t be able to change the name of your project later. Click Next.

    Tip

    Your dataset doesn't have to be entirely in the same language. You can have multiple documents, each in a different supported language. If your dataset contains documents in different languages, or if you expect text in different languages during runtime, select the enable multi-lingual dataset option when you enter the basic information for your project. This option can also be enabled later from the Project settings page.

  7. Select the container where you have uploaded your dataset.

    Note

    If you have already labeled your data, make sure it follows the supported format. Then select Yes, my documents are already labeled and I have formatted JSON labels file, and choose the labels file from the drop-down menu below. Click Next.

  8. Review the data you entered and select Create Project.

Train your model

Typically, after you create a project, you start tagging the documents in the container connected to your project. For this tutorial, you have imported a sample tagged dataset and initialized your project with the sample JSON tags file.

To start training your model from within the Language Studio:

  1. Select Training jobs from the left side menu.

  2. Select Start a training job from the top menu.

  3. Select Train a new model and type in the model name in the text box. You can also overwrite an existing model by selecting this option and choosing the model you want to overwrite from the dropdown menu. Overwriting a trained model is irreversible, but it won't affect your deployed models until you deploy the new model.

    A screenshot showing the page for creating a new training job.

  4. Select the data splitting method. You can choose Automatically splitting the testing set from training data, where the system splits your labeled data between the training and testing sets according to the specified percentages. Alternatively, you can Use a manual split of training and testing data; this option is only enabled if you added documents to your testing set during data labeling. See How to train a model for more information on data splitting.

  5. Click on the Train button.

  6. If you click on the training job ID from the list, a side pane will appear where you can check the Training progress, Job status, and other details for this job.

    Note

    • Only successfully completed training jobs will generate models.
    • Training can take anywhere from a couple of minutes to several hours, depending on the size of your labeled data.
    • You can only have one training job running at a time. You can't start another training job within the same project until the running job is completed.

Deploy your model

Generally, after training a model, you would review its evaluation details and make improvements if necessary. In this tutorial, you will just deploy your model and make it available to try in Language Studio or to call through the prediction API.

To deploy your model from within the Language Studio:

  1. Select Deploying a model from the left side menu.

  2. Click on Add deployment to start a new deployment job.

    A screenshot showing the deployment button

  3. Select Create new deployment to create a new deployment and assign a trained model from the dropdown below. You can also Overwrite an existing deployment by selecting this option and choosing the trained model you want to assign to it from the dropdown below.

    Note

    Overwriting an existing deployment doesn't require changes to your Prediction API call but the results you get will be based on the newly assigned model.

    A screenshot showing the deployment screen

  4. Click Deploy to start the deployment job.

  5. After deployment is successful, an expiration date will appear next to it. Deployment expiration is when your deployed model will be unavailable to be used for prediction, which typically happens twelve months after a training configuration expires.
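
Once the deployment exists, a call to the prediction API can be prepared as in the following sketch. It only builds the request and does not send it. The URL path, the `2022-05-01` API version, the task kind `CustomMultiLabelClassification`, and the body layout reflect my understanding of the Language service's asynchronous analyze-text REST API and should be treated as assumptions to verify against the official REST reference; the function name and the `displayName` and `taskName` values are hypothetical.

```python
import json
import urllib.request


def build_classification_request(endpoint, key, project_name, deployment_name, texts,
                                 api_version="2022-05-01"):
    """Build (but do not send) a POST request that submits `texts` for classification."""
    # Assumed route for asynchronous analyze-text jobs; check the REST reference.
    url = f"{endpoint.rstrip('/')}/language/analyze-text/jobs?api-version={api_version}"
    body = {
        "displayName": "Classify documents",
        "analysisInput": {
            "documents": [
                {"id": str(i), "language": "en", "text": text}
                for i, text in enumerate(texts, start=1)
            ]
        },
        "tasks": [
            {
                "kind": "CustomMultiLabelClassification",
                "taskName": "classify",
                "parameters": {
                    # Project names are case-sensitive.
                    "projectName": project_name,
                    "deploymentName": deployment_name,
                },
            }
        ],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The endpoint and key are the values shown on your Language resource's Keys and Endpoint page. Passing the returned object to `urllib.request.urlopen` would submit the job, after which the service is expected to reply with a location to poll for results.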

Use CogSvc language utilities tool for Cognitive search integration

Publish your Azure Function

  1. Download and use the provided sample function.

  2. After you download the sample function, open the program.cs file in Visual Studio and publish the function to Azure.

Prepare configuration file

  1. Download the sample configuration file and open it in a text editor.

  2. Get your storage account connection string by:

    1. Navigating to your storage account overview page in the Azure portal.
    2. In the Access Keys section in the menu to the left of the screen, copy your Connection string to the connectionString field in the configuration file, under blobStorage.
    3. Go to the container where you have the files you want to index and copy the container name to the containerName field in the configuration file, under blobStorage.
  3. Get your Cognitive Search endpoint and keys by:

    1. Navigating to your Cognitive Search resource overview page in the Azure portal.
    2. Copy the URL at the top-right section of the page to the endpointUrl field within cognitiveSearch.
    3. Go to the Keys section in the menu to the left of the screen. Copy your Primary admin key to the apiKey field within cognitiveSearch.
  4. Get your Azure Function endpoint and keys

    1. To get your Azure Function endpoint and keys, go to your function overview page in the Azure portal.
    2. Go to the Functions menu on the left of the screen, and click on the function you created.
    3. From the top menu, click Get Function Url. The URL will be formatted like this: YOUR-ENDPOINT-URL?code=YOUR-API-KEY.
    4. Copy YOUR-ENDPOINT-URL to the endpointUrl field in the configuration file, under azureFunction.
    5. Copy YOUR-API-KEY to the apiKey field in the configuration file, under azureFunction.
  5. Get your Language resource key and endpoint

    • Go to your Language resource overview page in the Azure portal.

    • From the menu on the left side, select Keys and Endpoint. You will use the endpoint and key for the API requests.

    A screenshot showing the key and endpoint page in the Azure portal.

  6. Get your custom text classification project secrets

    1. You will need your project-name. Project names are case-sensitive and can be found on the project settings page.

    2. You will also need the deployment-name. Deployment names can be found on the Deploying a model page.
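
The copy-and-paste steps above can also be scripted. The sketch below assumes the sample configuration file is JSON and fills in only the fields named in the steps (connectionString and containerName under blobStorage, endpointUrl and apiKey under cognitiveSearch and azureFunction). The function names are hypothetical. The field names for the Language resource and project secrets are not given here, so the sketch leaves every other section of the file untouched.

```python
import copy
from urllib.parse import parse_qs, urlsplit, urlunsplit


def split_function_url(function_url):
    """Split 'YOUR-ENDPOINT-URL?code=YOUR-API-KEY' into (endpoint URL, API key)."""
    parts = urlsplit(function_url)
    codes = parse_qs(parts.query).get("code")
    if not codes:
        raise ValueError("function URL has no 'code' query parameter")
    # Rebuild the URL without the query string or fragment.
    endpoint = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return endpoint, codes[0]


def fill_known_fields(config, *, storage_connection_string, container_name,
                      search_url, search_admin_key, function_url):
    """Return a copy of `config` with the storage, search, and function fields set."""
    updated = copy.deepcopy(config)
    function_endpoint, function_key = split_function_url(function_url)
    updated.setdefault("blobStorage", {}).update(
        connectionString=storage_connection_string,
        containerName=container_name,
    )
    updated.setdefault("cognitiveSearch", {}).update(
        endpointUrl=search_url,
        apiKey=search_admin_key,
    )
    updated.setdefault("azureFunction", {}).update(
        endpointUrl=function_endpoint,
        apiKey=function_key,
    )
    return updated
```

Load the downloaded file with `json.load`, pass the resulting dictionary through `fill_known_fields`, fill in the remaining Language and project values by hand, and write the result back with `json.dump`.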

Run the indexer command

After you've published your Azure function and prepared your configuration file, you can run the indexer command.

    indexer index --index-name <name-your-index-here> --configs <absolute-path-to-configs-file>

Replace <name-your-index-here> with the index name that appears in your Cognitive Search instance, and <absolute-path-to-configs-file> with the absolute path to your configuration file.
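
If you run the indexer from a script, the command above can be assembled as in this minimal sketch. It assumes the `indexer` executable is on your PATH and uses only the two flags shown above; the function names are hypothetical. A relative configuration path is converted to an absolute one, because the command expects an absolute path.

```python
import os
import subprocess


def build_indexer_command(index_name, configs_path):
    """Return the argument list for the indexer command shown above."""
    if not index_name:
        raise ValueError("index_name must not be empty")
    # The command expects an absolute path to the configuration file.
    absolute_path = os.path.abspath(configs_path)
    return ["indexer", "index", "--index-name", index_name, "--configs", absolute_path]


def run_indexer(index_name, configs_path):
    """Run the indexer and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_indexer_command(index_name, configs_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the configuration path contains spaces.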

Next steps