Tutorial: Enrich a Cognitive Search index with custom entities from your data

In an enterprise, an abundance of electronic documents can make searching through them time-consuming and expensive. Azure Cognitive Search helps you search your files by building indexes over them. Custom named entity recognition (NER) can extract relevant entities from your files and use them to enrich the indexing process.

In this tutorial, you learn how to:

  • Create a custom named entity recognition project.
  • Publish your Azure function.
  • Add an index to Azure Cognitive Search.

Prerequisites

Upload sample data to blob container

After you have created an Azure storage account and connected it to your Language resource, you will need to upload the documents from the sample dataset to the root directory of your container. These documents will later be used to train your model.

  1. Download the sample dataset from GitHub.

  2. Open the .zip file, and extract the folder containing the documents.

  3. In the Azure portal, navigate to the storage account you created, and select it.

  4. In your storage account, select Containers from the left menu, located below Data storage. On the screen that appears, select + Container. Give the container the name example-data and leave the default Public access level.

    A screenshot showing the main page for a storage account.

  5. After your container has been created, select it. Then select the Upload button and choose the .txt and .json files you downloaded earlier.

    A screenshot showing the button for uploading files to the storage account.

The provided sample dataset contains 20 loan agreements. Each agreement includes two parties: a lender and a borrower. You can use the sample files to extract the relevant information from each agreement: both parties, the agreement date, the loan amount, and the interest rate.
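Before uploading, it can help to confirm which files in the extracted folder are the ones step 5 asks for. The following is a minimal sketch, assuming the extracted folder sits at a local path of your choosing. The function name files_to_upload is hypothetical and not part of the sample.

```python
from pathlib import Path


def files_to_upload(folder):
    """Return the .txt and .json files directly inside `folder`, sorted by path."""
    wanted = {".txt", ".json"}
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )
```

Calling files_to_upload with the path to the extracted folder returns the list of files to select in the Upload dialog. Anything else in the folder is skipped.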

Create a custom named entity recognition project

Once your resource and storage account are configured, create a new custom NER project. A project is a work area for building your custom ML models based on your data. Your project can only be accessed by you and others who have access to the Language resource being used.

  1. Sign in to Language Studio. A window will appear to let you select your subscription and Language resource. Select the Language resource you created in the above step.

  2. Under the Extract information section of Language Studio, select Custom named entity recognition.

    A screenshot showing the location of custom NER in the Language Studio landing page.

  3. Select Create new project from the top menu in your projects page. Creating a project will let you tag data, train, evaluate, improve, and deploy your models.

    A screenshot of the project creation page.

  4. After you select Create new project, a window will appear to let you connect your storage account. If you've already connected a storage account, you will see the connected storage account. If not, choose your storage account from the dropdown that appears and select Connect storage account; this will set the required roles for your storage account. This step may return an error if you are not assigned as owner on the storage account.

    Note

    • You only need to do this step once for each new resource you use.
    • This process is irreversible. If you connect a storage account to your Language resource, you cannot disconnect it later.
    • You can only connect your Language resource to one storage account.

    A screenshot showing the storage connection screen.

  5. Enter the project information, including a name, description, and the language of the files in your project. If you're using the example dataset, select English. You won’t be able to change the name of your project later. Click Next.

    Tip

    Your dataset doesn't have to be entirely in the same language. You can have multiple documents, each in a different supported language. If your dataset contains documents in different languages, or if you expect text in different languages during runtime, select the enable multi-lingual dataset option when you enter the basic information for your project. This option can also be enabled later from the Project settings page.

  6. Select the container where you have uploaded your dataset. If you have already labeled your data, make sure it follows the supported format, select Yes, my files are already labeled and I have formatted JSON labels file, and then select the labels file from the drop-down menu. Click Next.

  7. Review the data you entered and select Create Project.

Train your model

Typically after you create a project, you go ahead and start tagging the documents you have in the container connected to your project. For this tutorial, you have imported a sample tagged dataset and initialized your project with the sample JSON tags file.

To start training your model from within the Language Studio:

  1. Select Training jobs from the left side menu.

  2. Select Start a training job from the top menu.

  3. Select Train a new model and type in the model name in the text box. You can also overwrite an existing model by selecting this option and choosing the model you want to overwrite from the dropdown menu. Overwriting a trained model is irreversible, but it won't affect your deployed models until you deploy the new model.

    Create a new training job

  4. Select a data splitting method. You can choose Automatically splitting the testing set from training data, where the system will split your labeled data between the training and testing sets according to the specified percentages. Or you can choose Use a manual split of training and testing data; this option is only enabled if you have added documents to your testing set during data labeling. See How to train a model for information about data splitting.

  5. Click on the Train button.

  6. If you click on the Training Job ID from the list, a side pane will appear where you can check the Training progress, Job status, and other details for this job.

    Note

    • Only successfully completed training jobs will generate models.
    • Training can take anywhere from a couple of minutes to several hours, depending on the size of your labeled data.
    • You can only have one training job running at a time. You can't start another training job within the same project until the running job is completed.
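To make the automatic splitting option in step 4 concrete, the following is a minimal sketch of a percentage-based split. It illustrates the idea only; it is not the algorithm Language Studio uses, and the function name, the 80/20 default, the fixed seed, and the file names are all assumptions.

```python
import random


def split_documents(documents, train_percent=80, seed=0):
    """Shuffle document names and split them into (training, testing) lists."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 0 and 100 (exclusive)")
    shuffled = list(documents)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


# The sample dataset has 20 documents; these file names are placeholders.
train, test = split_documents([f"doc{i:02d}.txt" for i in range(1, 21)])
print(len(train), len(test))  # 16 4
```

With a manual split, by contrast, you decide which documents belong to the testing set while labeling, so no shuffling is involved.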

Deploy your model

Generally, after training a model you would review its evaluation details and make improvements if necessary. In this tutorial, you will just deploy your model and make it available to try in Language Studio or to call through the prediction API.

To deploy your model from within the Language Studio:

  1. Select Deploying a model from the left side menu.

  2. Click on Add deployment to start a new deployment job.

    A screenshot showing the deployment button

  3. Select Create new deployment to create a new deployment and assign a trained model from the dropdown below. You can also select Overwrite an existing deployment and then choose the trained model you want to assign to it from the dropdown below.

    Note

    Overwriting an existing deployment doesn't require changes to your prediction API call but the results you get will be based on the newly assigned model.

    A screenshot showing the deployment screen

  4. Click on Deploy to start the deployment job.

  5. After deployment is successful, an expiration date will appear next to it. Deployment expiration is when your deployed model will be unavailable to be used for prediction, which typically happens twelve months after a training configuration expires.
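Once the deployment succeeds, the prediction API mentioned above can be called with your project and deployment names. The following is a minimal sketch that builds such a request without sending it. It assumes the asynchronous analyze-text jobs route and API version 2022-05-01; check both against the current Language service REST reference before relying on them. The function name and the displayName and taskName values are placeholders.

```python
import json
import urllib.request


def build_prediction_request(endpoint, key, project_name, deployment_name, text,
                             api_version="2022-05-01"):
    """Build (but do not send) a request that submits one document for custom NER."""
    # Assumed route: the asynchronous analyze-text jobs endpoint.
    url = f"{endpoint.rstrip('/')}/language/analyze-text/jobs?api-version={api_version}"
    body = {
        "displayName": "Extract entities",
        "analysisInput": {
            "documents": [{"id": "1", "language": "en", "text": text}]
        },
        "tasks": [{
            "kind": "CustomEntityRecognition",
            "taskName": "Entity recognition",
            "parameters": {
                "projectName": project_name,        # case-sensitive
                "deploymentName": deployment_name,
            },
        }],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen submits the job. Because the job runs asynchronously, the results are retrieved with a follow-up request rather than from the first response.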

Use CogSvc language utilities tool for Cognitive search integration

Publish your Azure Function

  1. Download and use the provided sample function.

  2. After you download the sample function, open the program.cs file in Visual Studio and publish the function to Azure.

Prepare configuration file

  1. Download the sample configuration file and open it in a text editor.

  2. Get your storage account connection string by:

    1. Navigating to your storage account overview page in the Azure portal.
    2. In the Access Keys section in the menu to the left of the screen, copy your Connection string to the connectionString field in the configuration file, under blobStorage.
    3. Go to the container where you have the files you want to index and copy container name to the containerName field in the configuration file, under blobStorage.
  3. Get your cognitive search endpoint and keys by:

    1. Navigating to your resource overview page in the Azure portal.
    2. Copy the Url at the top-right section of the page to the endpointUrl field within cognitiveSearch.
    3. Go to the Keys section in the menu to the left of the screen. Copy your Primary admin key to the apiKey field within cognitiveSearch.
  4. Get Azure Function endpoint and keys

    1. To get your Azure Function endpoint and keys, go to your function overview page in the Azure portal.
    2. Go to the Functions menu on the left of the screen, and select the function you created.
    3. From the top menu, select Get Function Url. The URL will be formatted like this: YOUR-ENDPOINT-URL?code=YOUR-API-KEY.
    4. Copy YOUR-ENDPOINT-URL to the endpointUrl field in the configuration file, under azureFunction.
    5. Copy YOUR-API-KEY to the apiKey field in the configuration file, under azureFunction.
  5. Get your resource keys and endpoint

    1. Go to your resource overview page in the Azure portal.

    2. From the menu on the left side, select Keys and Endpoint. You will use the endpoint and key for the API requests.

      A screenshot showing the key and endpoint page in the Azure portal

  6. Get your custom NER project secrets

    1. You will need your project-name. Project names are case-sensitive and can be found on the project settings page.

    2. You will also need the deployment-name. Deployment names can be found on the Deploying a model page.
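Two parts of the steps above are easy to get wrong by hand: splitting the function URL into its endpoint and key, and leaving a field empty. The following is a minimal sketch that covers only the sections and fields named in the steps above (blobStorage, cognitiveSearch, azureFunction). The sample file's fields for the Language resource and project are not named in this tutorial, so they are left out. The function names are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Sections and fields named in the steps above; any other fields in the
# sample configuration file are not checked here.
NAMED_FIELDS = {
    "blobStorage": ("connectionString", "containerName"),
    "cognitiveSearch": ("endpointUrl", "apiKey"),
    "azureFunction": ("endpointUrl", "apiKey"),
}


def split_function_url(function_url):
    """Split 'YOUR-ENDPOINT-URL?code=YOUR-API-KEY' into (endpoint_url, api_key)."""
    parts = urlsplit(function_url)
    codes = parse_qs(parts.query).get("code")
    if not codes:
        raise ValueError("function URL has no 'code' query parameter")
    # Drop the query string so only the endpoint remains.
    endpoint = parts._replace(query="", fragment="").geturl()
    return endpoint, codes[0]


def missing_fields(config):
    """List 'section.field' names that are absent or empty in the config dict."""
    return [
        f"{section}.{field}"
        for section, fields in NAMED_FIELDS.items()
        for field in fields
        if not config.get(section, {}).get(field)
    ]
```

After loading the configuration file with json.load, the two values returned by split_function_url go into the endpointUrl and apiKey fields under azureFunction, and an empty result from missing_fields suggests the named fields are all filled in.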

Run the indexer command

After you've published your Azure function and prepared your configuration file, you can run the indexer command.

    indexer index --index-name <name-your-index-here> --configs <absolute-path-to-configs-file>

Replace name-your-index-here with the index name that appears in your Cognitive Search instance, and absolute-path-to-configs-file with the absolute path to your configuration file.
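If you prefer to launch the command from a script, the following minimal sketch assembles the same arguments and rejects a relative configuration path up front. It assumes the indexer executable is on your PATH. The function names are hypothetical, and only the flags shown in the command above are used.

```python
import os
import subprocess


def build_indexer_command(index_name, configs_path):
    """Return the argument list for the indexer command shown above."""
    if not os.path.isabs(configs_path):
        raise ValueError("--configs expects an absolute path")
    return ["indexer", "index", "--index-name", index_name, "--configs", configs_path]


def run_indexer(index_name, configs_path):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_indexer_command(index_name, configs_path), check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when the configuration path contains spaces.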

Next steps