Configure a reference dataset in Azure IoT Data Processor Preview

Important

Azure IoT Operations Preview – enabled by Azure Arc is currently in PREVIEW. You shouldn't use this preview software in production environments.

See the Supplemental Terms of Use for Microsoft Azure Previews for legal terms that apply to Azure features that are in beta, preview, or otherwise not yet released into general availability.

Reference datasets within the Azure IoT Data Processor Preview store reference data that pipelines can use for enrichment and contextualization. The data inside the reference data store is organized into datasets, each with multiple keys.

Prerequisites

  • A functioning instance of Data Processor.
  • A Data Processor pipeline with an input stage that deserializes incoming data.

Configure a reference data store

To add a dataset to the data store, you have two options:

  • Select the Reference datasets tab on the pipeline configuration page.
  • Select Create new when the destination type is selected as Reference datasets in the output stage of a pipeline.

To configure the dataset, set the following fields:

| Field | Description | Required | Example |
|-------|-------------|----------|---------|
| Name | Name of the dataset. | Yes | mes-sql |
| Description | Description of the dataset. | No | erp data |
| Payload | Path to the data within the message to store in the dataset. | No | .payload |
| Expiration time | How long the reference data remains valid for enriching ingested messages. | No | 12h |
| Timestamp | The jq path to the timestamp field in the reference data. This field is used for timestamp-based joins in the enrich stage. | No | .payload.saptimestamp |
| Keys | See the keys configuration in the following table. | | |

Timestamps referenced should be in RFC 3339, ISO 8601, or Unix timestamp format. By default, the expiration time for a dataset is set to 24h. This default ensures that stale data isn't used for enrichment beyond 24 hours (if the data isn't updated) and that the dataset doesn't grow unbounded and fill up the disk.
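
For illustration, the following Python sketch (not Data Processor code) shows the same instant, midnight UTC on 5 March 2002, expressed in each of the accepted timestamp formats:

from datetime import datetime, timezone

# The same instant expressed in each accepted timestamp format.
rfc3339 = "2002-03-05T00:00:00Z"          # RFC 3339
iso8601 = "2002-03-05T00:00:00+00:00"     # ISO 8601
unix_seconds = 1015286400                 # Unix timestamp (seconds since the epoch)

# Confirm all three represent the same point in time.
parsed_rfc3339 = datetime.fromisoformat(rfc3339.replace("Z", "+00:00"))
parsed_iso8601 = datetime.fromisoformat(iso8601)
parsed_unix = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

assert parsed_rfc3339 == parsed_iso8601 == parsed_unix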

Each key includes:

| Field | Description | Required | Selection | Example |
|-------|-------------|----------|-----------|---------|
| Property name | Name of the key. This key is used for name-based joins in the enrich stage. | No | None | assetSQL |
| Property path | The jq path to the key within the message. | No | None | .payload.unique_id |
| Primary key | Determines whether the property is a primary key. Used for updating or appending ingested data into a dataset. | No | Yes/No | Yes |

Keys in the dataset aren't required but are recommended for keeping the dataset up to date.

Important

Remember that the .payload prefix is automatically added to the jq path. Reference datasets only store the data within the .payload object of the message. Specify the path without the .payload prefix.
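
For example, the following Python sketch (an illustration only, using a hypothetical message) shows why you configure the path as .equipment rather than .payload.equipment:

# Hypothetical deserialized message as it flows through the pipeline.
message = {
    "payload": {
        "equipment": "Oven",
        "location": "Seattle",
    },
    "topic": "erp/equipment",   # metadata outside .payload isn't stored
}

# In the dataset or key configuration you specify the path relative to .payload,
# for example ".equipment". Data Processor adds the ".payload" prefix for you,
# so the value is read from message["payload"]["equipment"].
configured_path = ".equipment"
effective_path = ".payload" + configured_path   # ".payload.equipment"

value = message["payload"]["equipment"]
print(f"{effective_path} -> {value}")           # .payload.equipment -> Oven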

Tip

It takes a few seconds for the dataset to deploy to your cluster and become visible in the dataset list view.

The following notes relate to the dataset configuration options in the previous tables:

  • Property names are case sensitive.
  • You can have up to 10 properties per dataset.
  • Only one primary key can be selected in each dataset.
  • String is the only valid data type for the dataset key values.
  • Primary keys are used to update or append ingested data into a dataset. If a new message arrives with the same primary key value, the previous entry is updated. If a message arrives with a new primary key value, that key and its associated values are appended to the dataset.
  • The timestamp in the reference dataset is used for timestamp-based join conditions in the enrich stage.
  • Reference datasets store only the data within the .payload object of the message and exclude the associated metadata. Use the transform stage to move any data you want to store into the payload object (see the sketch after this list).
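
As a sketch of that last point, suppose contextual fields arrive outside the payload object. In a real pipeline, the transform stage would express the reshaping (typically as a jq expression); this hypothetical Python version only illustrates the before and after:

# Hypothetical message where useful context arrives outside .payload.
message = {
    "payload": {"equipment": "Oven"},
    "properties": {
        "location": "Seattle",
        "installationDate": "2002-03-05T00:00:00Z",
    },
}

# Move the contextual fields into the payload object so the reference
# dataset can store them; everything outside .payload is discarded.
message["payload"].update(message.pop("properties"))

# Only this object is written to the reference dataset:
# {"equipment": "Oven", "location": "Seattle", "installationDate": "2002-03-05T00:00:00Z"}
print(message["payload"])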

View your datasets

To view the available datasets:

  1. Select Reference datasets in the pipeline editor experience. A list of all available datasets is visible in the Reference datasets view.
  2. Select a dataset to view its configuration details, including dataset keys and timestamps.

Example

This example describes a manufacturing facility where several pieces of equipment are installed at different locations. An ERP system tracks the installations, stores the data in a database, and records the following details for each piece of equipment: name, location, installation date, and a boolean that indicates whether it's a spare. For example:

| equipment | location | installationDate | isSpare |
|-----------|----------|------------------|---------|
| Oven | Seattle | 3/5/2002 | FALSE |
| Mixer | Tacoma | 11/15/2005 | FALSE |
| Slicer | Seattle | 4/25/2021 | TRUE |

This ERP data is a useful source of contextual data for the time series data that comes from each location. You can send this data to Data Processor to store in a reference dataset and use it to enrich messages in other pipelines.

When you send data from a database, such as Microsoft SQL Server, to Data Processor, Data Processor deserializes it into a format that it can process. The following JSON shows an example payload that represents the database data within Data Processor:

{
    "payload": [
        {
            "equipment": "Oven",
            "location": "Seattle",
            "installationDate": "2002-03-05T00:00:00Z",
            "isSpare": "FALSE"
        },
        {
            "equipment": "Mixer",
            "location": "Tacoma",
            "installationDate": "2005-11-15T00:00:00Z",
            "isSpare": "FALSE"
        },
        {
            "equipment": "Slicer",
            "location": "Seattle",
            "installationDate": "2021-04-25T00:00:00Z",
            "isSpare": "TRUE"
        }
    ]
}

Use the following configuration for the reference dataset:

| Field | Example |
|-------|---------|
| Name | equipment |
| Timestamp | .installationDate |
| Expiration time | 12h |

The two keys:

| Field | Example |
|-------|---------|
| Property name | equipment name |
| Property path | .equipment |
| Primary key | Yes |

| Field | Example |
|-------|---------|
| Property name | location |
| Property path | .location |
| Primary key | No |

Each dataset can only have one primary key.

All incoming data within the pipeline is stored in the equipment dataset in the reference data store. The stored data includes the installationDate timestamp and keys such as equipment name and location.

These properties are available in the enrichment stages of other pipelines where you can use them to provide context and add additional information to the messages being processed. For example, you can use this data to supplement sensor readings from a specific piece of equipment with its installation date and location. To learn more, see the Enrich stage.
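
As an illustration of that kind of enrichment, the following Python sketch joins a hypothetical sensor reading to the stored equipment entry by the equipment name key. It only mimics a name-based join; it isn't the Enrich stage configuration, and the reading's fields are assumptions for the example:

# Snapshot of the equipment reference dataset, keyed by the primary key value.
equipment_dataset = {
    "Oven":   {"location": "Seattle", "installationDate": "2002-03-05T00:00:00Z", "isSpare": "FALSE"},
    "Mixer":  {"location": "Tacoma",  "installationDate": "2005-11-15T00:00:00Z", "isSpare": "FALSE"},
    "Slicer": {"location": "Seattle", "installationDate": "2021-04-25T00:00:00Z", "isSpare": "TRUE"},
}

# Hypothetical sensor reading flowing through another pipeline.
reading = {"payload": {"equipment": "Oven", "temperature": 182.5}}

# Name-based join: look up the reference entry whose key matches the message.
context = equipment_dataset.get(reading["payload"]["equipment"], {})
reading["payload"]["enrichment"] = context

# The reading now carries the installation date and location of the Oven.
print(reading["payload"])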

Within the equipment dataset, the equipment name key serves as the primary key. When the pipeline ingests new data, Data Processor checks this property to determine how to handle the incoming data, as illustrated in the sketch after the following list:

  • If a message arrives with an equipment name key that doesn't yet exist in the dataset (such as Pump), Data Processor adds a new entry to the dataset. This entry includes the new equipment name and its associated data such as location, installationDate, and isSpare.
  • If a message arrives with an equipment name key that matches an existing entry in the dataset (such as Slicer), Data Processor updates that entry. The associated data for that equipment such as location, installationDate, and isSpare updates with the values from the incoming message.
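
The following Python sketch mimics that update-or-append behavior. It's an illustration of the logic only, not Data Processor's implementation, and the values for the new Pump entry are hypothetical:

def upsert(dataset: dict, entry: dict, primary_key: str = "equipment") -> None:
    """Update the existing entry for the primary key value, or append a new one."""
    key_value = entry[primary_key]
    # Same primary key value -> the previous entry is replaced with the new values.
    # New primary key value  -> a new entry is appended to the dataset.
    dataset[key_value] = {k: v for k, v in entry.items() if k != primary_key}

equipment_dataset = {
    "Slicer": {"location": "Seattle", "installationDate": "2021-04-25T00:00:00Z", "isSpare": "TRUE"},
}

# A message for a new piece of equipment (Pump) appends a new entry.
upsert(equipment_dataset, {"equipment": "Pump", "location": "Tacoma",
                           "installationDate": "2023-06-01T00:00:00Z", "isSpare": "FALSE"})

# A message for existing equipment (Slicer) updates that entry in place.
upsert(equipment_dataset, {"equipment": "Slicer", "location": "Tacoma",
                           "installationDate": "2021-04-25T00:00:00Z", "isSpare": "FALSE"})

print(equipment_dataset)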

The equipment dataset in the reference data store is an up-to-date source of information that the enrich stage can use to enhance and contextualize the data flowing through other pipelines in Data Processor.