Configure streaming ingestion on your Azure Synapse Data Explorer pool (Preview)

Streaming ingestion is useful for loading data when you need low latency between ingestion and query. Consider using streaming ingestion in the following scenarios:

  • Latency of less than a second is required.
  • To optimize operational processing of many tables where the stream of data into each table is relatively small (a few records per second), but the overall data ingestion volume is high (thousands of records per second).

If the stream of data into each table is high (over 4 GB per hour), consider using batch ingestion.

To learn more about different ingestion methods, see data ingestion overview.

Choose the appropriate streaming ingestion type

Two streaming ingestion types are supported:

  • Event Hub or IoT Hub: Hubs are configured as table streaming data sources. For information about setting these up, see Event Hub.
  • Custom ingestion: Requires you to write an application that uses one of the Azure Synapse Data Explorer client libraries. Use the information in this topic to configure custom ingestion. You may also find the C# streaming ingestion sample application helpful.

Use the following table to help you choose the ingestion type that's appropriate for your environment:

  • Data delay between ingestion initiation and availability for query: longer with Event Hub / IoT Hub, shorter with custom ingestion.
  • Development overhead: Event Hub / IoT Hub require only fast, easy setup with no development overhead; custom ingestion carries high development overhead to create an application that ingests the data, handles errors, and ensures data consistency.

Note

Ingesting data from an Event Hub into Data Explorer pools will not work if your Synapse workspace uses a managed virtual network with data exfiltration protection enabled.

Prerequisites

  • An Azure subscription. Create a free Azure account.

  • Create a Data Explorer pool using Synapse Studio or the Azure portal.

  • Create a Data Explorer database.

    1. In Synapse Studio, on the left-side pane, select Data.

    2. Select + (Add new resource) > Data Explorer database, and use the following information:

      • Pool name (suggested value: contosodataexplorer): The name of the Data Explorer pool to use.
      • Name (suggested value: TestDatabase): The database name must be unique within the cluster.
      • Default retention period (suggested value: 365): The time span (in days) for which it's guaranteed that the data is kept available to query. The time span is measured from the time that data is ingested.
      • Default cache period (suggested value: 31): The time span (in days) for which to keep frequently queried data available in SSD storage or RAM, rather than in longer-term storage.

      The retention and cache settings map to the database retention and caching policies; a sketch for inspecting or changing them from a KQL script appears at the end of the prerequisites.
    3. Select Create to create the database. Creation typically takes less than a minute.

  • Get the Query and Data Ingestion endpoints.
    1. In Synapse Studio, on the left-side pane, select Manage > Data Explorer pools.

    2. Select the Data Explorer pool you want to use to view its details.

      Screenshot of the Data Explorer pools screen, showing the list of existing pools.

    3. Make a note of the Query and Data Ingestion endpoints. Use the Query endpoint as the cluster when configuring connections to your Data Explorer pool. When configuring SDKs for data ingestion, use the data ingestion endpoint.

      Screenshot of the Data Explorer pools properties pane, showing the Query and Data Ingestion URI addresses.
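
The Default retention period and Default cache period values you choose when creating the database correspond to the database retention and caching policies. If you later want to inspect or adjust them from a KQL script, the following is a minimal sketch; it assumes the TestDatabase example and the suggested values from the table above:

    // Current retention and caching policies of the database
    .show database TestDatabase policy retention
    .show database TestDatabase policy caching

    // Adjust them later if needed (365-day retention, 31-day hot cache, per the suggested values above)
    .alter-merge database TestDatabase policy retention softdelete = 365d
    .alter database TestDatabase policy caching hot = 31d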

Performance and operational considerations

The main factors that affect streaming ingestion are:

  • Compute specification: Streaming ingestion performance and capacity scale with increased Data Explorer pool sizes. The number of concurrent ingestion requests is limited to six per core. For example, for a 16-core workload type, such as Compute Optimized (Large) or Storage Optimized (Large), the maximum supported load is 96 concurrent ingestion requests. For a two-core workload type, such as Compute Optimized (Extra Small), the maximum supported load is 12 concurrent ingestion requests.
  • Data size limit: The data size limit for a streaming ingestion request is 4 MB.
  • Schema updates: Schema updates, such as creation and modification of tables and ingestion mappings, may take the streaming ingestion service up to five minutes to pick up. For more information, see Streaming ingestion and schema changes. If a change must take effect sooner, you can clear the streaming ingestion schema cache, as shown in the sketch after this list.
  • SSD capacity: Enabling streaming ingestion on a Data Explorer pool, even when data isn't ingested via streaming, uses part of the local SSD disk of the Data Explorer pool machines for streaming ingestion data and reduces the storage available for hot cache.
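
If a schema change has to take effect sooner than the propagation window, the streaming ingestion schema cache can be cleared explicitly. A minimal sketch, using the standard Kusto management commands and the TestTable example from this article:

    // Force the streaming ingestion service to pick up schema changes for one table...
    .clear table TestTable cache streamingingestion schema

    // ...or for the whole database
    .clear database cache streamingingestion schema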

Enable streaming ingestion on your Data Explorer pool

Before you can use streaming ingestion, you must enable the capability on your Data Explorer pool and define a streaming ingestion policy. You can enable the capability when creating the Data Explorer pool, or add it to an existing Data Explorer pool.

Warning

Review the limitations prior to enabling streaming ingestion.

Enable streaming ingestion while creating a new Data Explorer pool

You can enable streaming ingestion while creating a new Data Explorer pool using Azure Synapse Studio or the Azure portal.

While creating a Data Explorer pool using the steps in Create a Data Explorer pool using Synapse Studio, in the Additional settings tab, select Streaming ingestion > Enabled.

Enable streaming ingestion while creating a Data Explorer pool in Azure Synapse Data Explorer.

Enable streaming ingestion on an existing Data Explorer pool

If you have an existing Data Explorer pool, you can enable streaming ingestion using the Azure portal.

  1. In the Azure portal, go to your Data Explorer pool.
  2. In Settings, select Configurations.
  3. In the Configurations pane, select On to enable Streaming ingestion.
  4. Select Save.

Create a target table and define the policy

Create a table to receive the streaming ingestion data and define its related policy using Azure Synapse Studio or the Azure portal.

  1. In Synapse Studio, on the left-side pane, select Develop.

  2. Under KQL scripts, select + (Add new resource) > KQL script. On the right-side pane, you can name your script.

  3. In the Connect to menu, select contosodataexplorer.

  4. In the Use database menu, select TestDatabase.

  5. Paste in the following command, and select Run to create the table.

    .create table TestTable (TimeStamp: datetime, Name: string, Metric: int, Source: string)
    
  6. Copy one of the following commands into the Query pane and select Run. This defines the streaming ingestion policy on the table you created or on the database that contains the table.

    Tip

    A policy that is defined at the database level applies to all existing and future tables in the database.

    • To define the policy on the table you created, use:

      .alter table TestTable policy streamingingestion enable
      
    • To define the policy on the database that contains the table you created (replace StreamingTestDb with your database name if it differs), use:

      .alter database StreamingTestDb policy streamingingestion enable
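
    To confirm that the policy is in place, you can show it on the table or the database. A minimal sketch, using the same names as the commands above:

      // Show the effective streaming ingestion policy on the table and on the database
      .show table TestTable policy streamingingestion
      .show database StreamingTestDb policy streamingingestion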
      

Create a streaming ingestion application to ingest data to your Data Explorer pool

Create your application for ingesting data into your Data Explorer pool using your preferred language. For the poolPath variable, use the Query endpoint you made a note of in the Prerequisites.

using Kusto.Data;
using Kusto.Ingest;
using System.IO;
using Kusto.Data.Common;

namespace StreamingIngestion
{
    class Program
    {
        static void Main(string[] args)
        {
            string poolPath = "https://<Poolname>.<WorkspaceName>.kusto.windows.net";
            string appId = "<appId>";
            string appKey = "<appKey>";
            string appTenant = "<appTenant>";
            string dbName = "<dbName>";
            string tableName = "<tableName>";

            // Create Kusto connection string with App Authentication
            var csb =
                new KustoConnectionStringBuilder(poolPath)
                    .WithAadApplicationKeyAuthentication(
                        applicationClientId: appId,
                        applicationKey: appKey,
                        authority: appTenant
                    );

            // Create a disposable client that will execute the ingestion
            using (IKustoIngestClient client = KustoIngestFactory.CreateStreamingIngestClient(csb))
            {
                // Initialize ingestion properties with the target database and table
                var ingestionProperties =
                    new KustoIngestionProperties(
                        databaseName: dbName,
                        tableName: tableName
                    );

                // Open the compressed source file; dispose the stream when ingestion completes
                using var fileStream = File.Open("MyFile.gz", FileMode.Open);
                // Create source options
                var sourceOptions = new StreamSourceOptions()
                {
                    CompressionType = DataSourceCompressionType.GZip,
                };
                // Ingest from stream
                var status = client.IngestFromStreamAsync(fileStream, ingestionProperties, sourceOptions).GetAwaiter().GetResult();
            }
        }
    }
}
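
After the application runs, you can confirm that rows arrived with a simple query against the target table. A minimal sketch, assuming the TestTable example from this article; replace the name with your own table:

    // Number of rows that have landed in the target table so far
    TestTable
    | count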

Disable streaming ingestion on your Data Explorer pool

Warning

Disabling streaming ingestion may take a few hours.

Before disabling streaming ingestion on your Data Explorer pool, drop the streaming ingestion policy from all relevant tables and databases. The removal of the streaming ingestion policy triggers data rearrangement inside your Data Explorer pool. The streaming ingestion data is moved from the initial storage to permanent storage in the column store (extents or shards). This process can take from a few seconds to a few hours, depending on the amount of data in the initial storage.
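
If you want to observe this rearrangement, you can list a table's extents before and after dropping the policy; rows held in the initial streaming store appear in extents once the move completes. A minimal sketch, using the TestTable example from this article:

    // Column-store extents currently backing TestTable
    .show table TestTable extents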

Drop the streaming ingestion policy

You can drop the streaming ingestion policy using Azure Synapse Studio or the Azure portal.

  1. In Synapse Studio, on the left-side pane, select Develop.

  2. Under KQL scripts, select + (Add new resource) > KQL script. On the right-side pane, you can name your script.

  3. In the Connect to menu, select contosodataexplorer.

  4. In the Use database menu, select TestDatabase.

  5. Paste in the following command, and select Run to drop the streaming ingestion policy from the table. If you also defined the policy at the database level, see the sketch after these steps.

    .delete table TestTable policy streamingingestion
    
  6. In the Azure portal, go to your Data Explorer pool.

  7. In Settings, select Configurations.

  8. In the Configurations pane, select Off to disable Streaming ingestion.

  9. Select Save.
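
If you defined the streaming ingestion policy at the database level, drop it there as well before disabling the capability. A minimal sketch, using the StreamingTestDb example from the earlier section; replace the name with your own database:

    // Drop a database-level streaming ingestion policy
    .delete database StreamingTestDb policy streamingingestion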

Limitations

  • Database cursors aren't supported for a database if the database itself or any of its tables have the Streaming ingestion policy defined and enabled.
  • Data mappings must be pre-created for use in streaming ingestion. Individual streaming ingestion requests don't accommodate inline data mappings; see the mapping sketch after this list.
  • Extent tags can't be set on the streaming ingestion data.
  • Update policy: The update policy can reference only the newly ingested data in the source table, and not any other data or tables in the database.
  • If streaming ingestion is used on any of the tables of the database, that database cannot be used as a leader for follower databases or as a data provider for Azure Synapse Analytics Data Share.
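
For example, if your streaming application sends JSON records, the ingestion mapping must exist before the first streaming request that references it. The following is a minimal sketch for the TestTable example in this article; the mapping name TestTable_json_mapping and the JSON paths are illustrative assumptions:

    // Pre-create a JSON ingestion mapping for TestTable (name and paths are illustrative)
    .create table TestTable ingestion json mapping "TestTable_json_mapping"
        '[{"column":"TimeStamp","path":"$.TimeStamp","datatype":"datetime"},{"column":"Name","path":"$.Name","datatype":"string"},{"column":"Metric","path":"$.Metric","datatype":"int"},{"column":"Source","path":"$.Source","datatype":"string"}]'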

Next steps