Azure Data Explorer data ingestion overview
Data ingestion is the process used to load data from one or more sources into a table in Azure Data Explorer. Once ingested, the data becomes available for query.
The diagram below shows the end-to-end flow for working in Azure Data Explorer and shows different ingestion methods.
The Azure Data Explorer Data Management service, which is responsible for data ingestion, implements the following process:
- Azure Data Explorer can pull data from an external source or read requests from a pending Azure queue that is shared with clients.
- Data is batched or streamed by the Data Management service. Batch data flowing to the same database and table is optimized for ingestion throughput.
- Azure Data Explorer validates initial data and converts data formats where necessary.
- Further data manipulation includes matching schema, organizing, indexing, encoding, and compressing the data.
- Data is persisted in storage according to the set retention policy.
- The Data Management service then triggers the ingest operation in Azure Data Explorer, where the data is made available for query.
Supported data formats, properties, and permissions
Supported data formats: The data formats that Azure Data Explorer can understand and ingest natively, such as Parquet and JSON.
Ingestion properties: The properties that affect how the data is ingested, such as tagging, mapping, and creation time.
Permissions: The permissions required to access resources used in commands and processes, including the following:
- Ingesting data into an existing table without changing its schema requires Database Ingestor permissions.
- Creating a new table requires Database User or Database Admin permissions.
- Changing the schema of an existing table requires Table Admin permissions (which the user who created the table receives automatically) or Database Admin permissions.
For more information, see Kusto role-based access control.
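The permission rules listed above can be sketched as a small lookup. This is an illustration only: the role names come from the text, but the operation keys and the `can_perform` helper are hypothetical, and the sketch does not model the full role hierarchy of Kusto role-based access control.

```python
# Minimum roles per operation, as listed in the text above.
# The operation keys are illustrative labels, not Azure Data Explorer identifiers.
REQUIRED_ROLES = {
    "ingest_existing_table": {"Database Ingestor"},
    "create_table": {"Database User", "Database Admin"},
    "alter_table_schema": {"Table Admin", "Database Admin"},
}


def can_perform(operation, roles):
    """Return True if at least one of the caller's roles is listed for the operation."""
    return bool(REQUIRED_ROLES[operation] & set(roles))


if __name__ == "__main__":
    print(can_perform("create_table", ["Database User"]))
    print(can_perform("alter_table_schema", ["Database Ingestor"]))
```

A real access check is performed by the service itself; a lookup like this is only useful for documenting or pre-validating what a pipeline identity is expected to hold.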
Batching vs streaming ingestion
Batching ingestion groups incoming data into batches and is optimized for high ingestion throughput. This method is the preferred and most performant type of ingestion. Data is batched according to ingestion properties. Small batches of data are then merged and optimized for fast query results. By default, a batch is sealed after 5 minutes, 1000 items, or a total size of 1 GB, whichever limit is reached first. The data size limit for a batch ingestion command is 6 GB. To learn more, see the ingestion batching policy.
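To make the default limits concrete, here is a minimal sketch of the decision to close a batch once any one of the three limits is reached. The `BatchLimits` class and `should_seal` function are hypothetical names for illustration, not part of any Azure Data Explorer SDK, and 1 GB is assumed to mean 1024 MB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchLimits:
    max_seconds: float = 5 * 60   # 5 minutes
    max_items: int = 1000         # 1000 items
    max_bytes: int = 1024 ** 3    # 1 GB (assumed to be 1024 MB)


def should_seal(age_seconds, item_count, total_bytes, limits=BatchLimits()):
    """A batch is sealed as soon as any single limit is reached."""
    return (
        age_seconds >= limits.max_seconds
        or item_count >= limits.max_items
        or total_bytes >= limits.max_bytes
    )


if __name__ == "__main__":
    # A small, young batch stays open; it is sealed once it is 5 minutes old.
    print(should_seal(age_seconds=10, item_count=5, total_bytes=2048))
    print(should_seal(age_seconds=300, item_count=5, total_bytes=2048))
```

This is why a handful of small files typically waits for the time limit, while a burst of many blobs is sealed by the item count or size limit first.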
Streaming ingestion is ongoing data ingestion from a streaming source. Streaming ingestion allows near real-time latency for small sets of data per table. Data is initially ingested to row store, then moved to column store extents. Streaming ingestion can be done using an Azure Data Explorer client library or one of the supported data pipelines. To learn more, see Configure streaming ingestion.
Ingestion methods and tools
Azure Data Explorer supports several ingestion methods, each with its own target scenarios. These methods include ingestion tools, connectors and plugins to diverse services, managed pipelines, programmatic ingestion using SDKs, and direct access to ingestion.
For a list of data connectors, see Data connectors overview.
Ingestion using managed pipelines
For organizations that want management (throttling, retries, monitors, alerts, and more) handled by an external service, using a connector is likely the most appropriate solution. Queued ingestion is appropriate for large data volumes. Azure Data Explorer supports the following managed pipelines:
Event Grid: A pipeline that listens to Azure storage, and updates Azure Data Explorer to pull information when subscribed events occur. For more information, see Ingest Azure Blobs into Azure Data Explorer.
Event Hub: A pipeline that transfers events from services to Azure Data Explorer. For more information, see Ingest data from event hub into Azure Data Explorer.
Azure Data Factory (ADF): A fully managed data integration service for analytic workloads in Azure. Azure Data Factory connects with over 90 supported sources to provide efficient and resilient data transfer. ADF prepares, transforms, and enriches data to produce insights that can be monitored in several ways. This service can be used as a one-time solution, on a periodic timeline, or triggered by specific events.
- Integrate Azure Data Explorer with Azure Data Factory.
- Use Azure Data Factory to copy data from supported sources to Azure Data Explorer.
- Copy in bulk from a database to Azure Data Explorer by using the Azure Data Factory template.
- Use Azure Data Factory command activity to run Azure Data Explorer management commands.
Programmatic ingestion using SDKs
Azure Data Explorer provides SDKs that can be used for query and data ingestion. Programmatic ingestion is optimized for reducing ingestion costs (COGS) by minimizing storage transactions during and following the ingestion process.
Available SDKs and open-source projects
The ingestion wizard: Enables you to quickly ingest data by creating and adjusting tables from a wide range of source types. The ingestion wizard automatically suggests tables and mapping structures based on the data source in Azure Data Explorer. The wizard can be used for one-time ingestion, or to define continuous ingestion via Event Grid on the container to which the data was ingested.
LightIngest: A command-line utility for ad-hoc data ingestion into Azure Data Explorer. The utility can pull source data from a local folder or from an Azure blob storage container.
Ingest management commands
Use commands to ingest data directly into your cluster. This method bypasses the Data Management service, and therefore should be used only for exploration and prototyping. Don't use this method in production or high-volume scenarios.
Inline ingestion: A management command .ingest inline is sent to your cluster, with the data to be ingested being a part of the command text itself. This method is intended for improvised testing purposes.
Ingest from query: A management command .set, .append, .set-or-append, or .set-or-replace is sent to your cluster, with the data specified indirectly as the results of a query or a command.
Ingest from storage (pull): A management command .ingest into is sent to your cluster, with the data stored in external storage, such as Azure Blob Storage, accessible by your cluster and pointed-to by the command.
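As a rough illustration of the three command shapes described above, the sketch below only builds the command text as strings. The helper names, table names, and blob URI are hypothetical placeholders, no escaping or quoting of values is attempted, and the result would still have to be sent to the cluster with a client of your choice.

```python
def ingest_inline(table, rows):
    """Build an '.ingest inline' command with CSV-style rows in the command text."""
    body = "\n".join(",".join(str(value) for value in row) for row in rows)
    return f".ingest inline into table {table} <|\n{body}"


def set_or_append(table, query):
    """Build a '.set-or-append' command whose data is the result of a query."""
    return f".set-or-append {table} <| {query}"


def ingest_from_storage(table, blob_uri, data_format):
    """Build an '.ingest into' command that points at data in external storage."""
    return f".ingest into table {table} ('{blob_uri}') with (format='{data_format}')"


if __name__ == "__main__":
    print(ingest_inline("Purchases", [["Shoes", 1000], ["Hat", 20]]))
    print(set_or_append("Errors", "RawLogs | where Level == 'error'"))
    # Placeholder URI; a real blob URI would also carry credentials.
    print(ingest_from_storage("RawLogs", "https://example.invalid/container/logs.csv", "csv"))
```

Because all three commands bypass the Data Management service, builders like these suit notebooks and prototypes rather than production pipelines.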
Comparing ingestion methods and tools
|Ingestion name||Data type||Maximum file size||Streaming, batching, direct||Most common scenarios||Considerations|
|Get data experience||*sv, JSON||1 GB uncompressed (see note)||Batching to container, local file and blob in direct ingestion||One-off, create table schema, definition of continuous ingestion with Event Grid, bulk ingestion with container (up to 5,000 blobs; no limit when using historical ingestion)|
|LightIngest||All formats supported||1 GB uncompressed (see note)||Batching via DM or direct ingestion||Data migration, historical data with adjusted ingestion timestamps, bulk ingestion (no size restriction)||Case-sensitive, space-sensitive|
|ADX Kafka||Avro, ApacheAvro, JSON, CSV, Parquet, and ORC||Unlimited. Inherits Java restrictions.||Batching, streaming||Existing pipeline, high volume consumption from the source.||Preference may be determined by which “multiple producer/consumer” service is already used, or how managed of a service is desired.|
|ADX to Apache Spark||Every format supported by the Spark environment||Unlimited||Batching||Existing pipeline, preprocessing on Spark before ingestion, fast way to create a safe (Spark) streaming pipeline from the various sources the Spark environment supports.||Consider cost of Spark cluster. For batch write, compare with Azure Data Explorer data connection for Event Grid. For Spark streaming, compare with the data connection for event hub.|
|LogStash||JSON||Unlimited. Inherits Java restrictions.||Inputs to the connector are Logstash events, and the connector outputs to Kusto using batching ingestion.||Existing pipeline, leverage the mature, open source nature of Logstash for high volume consumption from the input(s).||Preference may be determined by which “multiple producer/consumer” service is already used, or how managed of a service is desired.|
|Azure Data Factory (ADF)||Supported data formats||Unlimited (per ADF restrictions)||Batching or per ADF trigger||Supports formats that are usually unsupported, large files, can copy from over 90 sources, from on-premises to cloud||This method takes relatively more time until data is ingested. ADF uploads all data to memory and then begins ingestion.|
|Power Automate||All formats supported||1 GB uncompressed (see note)||Batching||Ingestion commands as part of flow. Used to automate pipelines.|
|Logic Apps||All formats supported||1 GB uncompressed (see note)||Batching||Used to automate pipelines|
|IoT Hub||Supported data formats||N/A||Batching, streaming||IoT messages, IoT events, IoT properties|
|Event Hub||Supported data formats||N/A||Batching, streaming||Messages, events|
|Event Grid||Supported data formats||1 GB uncompressed||Batching||Continuous ingestion from Azure storage, external data in Azure storage||Ingestion can be triggered by blob renaming or blob creation actions|
|.NET SDK||All formats supported||1 GB uncompressed (see note)||Batching, streaming, direct||Write your own code according to organizational needs|
|Python||All formats supported||1 GB uncompressed (see note)||Batching, streaming, direct||Write your own code according to organizational needs|
|Node.js||All formats supported||1 GB uncompressed (see note)||Batching, streaming, direct||Write your own code according to organizational needs|
|Java||All formats supported||1 GB uncompressed (see note)||Batching, streaming, direct||Write your own code according to organizational needs|
|REST||All formats supported||1 GB uncompressed (see note)||Batching, streaming, direct||Write your own code according to organizational needs|
|Go||All formats supported||1 GB uncompressed (see note)||Batching, streaming, direct||Write your own code according to organizational needs|
Note: Where the table above says "see note", ingestion supports a maximum file size of 6 GB. The recommendation is to ingest files between 100 MB and 1 GB.
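The size guidance in the note can be turned into a simple pre-flight check. This is a sketch only: `classify_file_size` and its return labels are hypothetical, and binary units (1 GB = 1024 MB) are assumed.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def classify_file_size(size_bytes):
    """Classify a file against the 6 GB maximum and the 100 MB to 1 GB recommendation."""
    if size_bytes > 6 * GB:
        return "too_large"
    if 100 * MB <= size_bytes <= 1 * GB:
        return "recommended"
    return "allowed"


if __name__ == "__main__":
    for size in (10 * MB, 500 * MB, 3 * GB, 7 * GB):
        print(size, classify_file_size(size))
```

Files that come back as "allowed" but not "recommended" could be merged or split upstream before they are queued.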
Once you have chosen the most suitable ingestion method for your needs, do the following steps:
Set batching policy (optional)
The batching manager batches ingestion data based on the ingestion batching policy. Define a batching policy before ingestion. See ingestion best practices - optimizing for throughput. Batching policy changes can require up to 5 minutes to take effect. The policy sets batch limits according to three factors: time elapsed since batch creation, accumulated number of items (blobs), or total batch size. By default, the settings are 5 minutes, 1000 blobs, or 1 GB, and whichever limit is reached first takes effect. Therefore, there's usually a 5-minute delay when queueing sample data for ingestion.
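A batching policy is set with a management command that carries the three limits as JSON. The sketch below builds such a command as text; the `batching_policy_command` helper and the table name are hypothetical, and the property names and command form are written from memory of the ingestion batching policy documentation, so verify them there before use.

```python
import json


def batching_policy_command(table, max_timespan="00:05:00", max_items=1000, max_raw_mb=1024):
    """Build an '.alter table ... policy ingestionbatching' command as a string."""
    policy = {
        "MaximumBatchingTimeSpan": max_timespan,  # time elapsed since batch creation
        "MaximumNumberOfItems": max_items,        # accumulated number of blobs
        "MaximumRawDataSizeMB": max_raw_mb,       # total batch size in MB
    }
    return f".alter table {table} policy ingestionbatching @'{json.dumps(policy)}'"


if __name__ == "__main__":
    # Shorten the time limit so sample data becomes queryable sooner.
    print(batching_policy_command("RawLogs", max_timespan="00:00:30"))
```

Lowering the time limit reduces the delay for small test loads, at the cost of producing more and smaller batches.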
Set retention policy
Data ingested into a table in Azure Data Explorer is subject to the table's effective retention policy. Unless set on a table explicitly, the effective retention policy is derived from the database's retention policy. Hot retention is a function of cluster size and your retention policy. Ingesting more data than you have available space for forces the oldest (first-in) data to cold retention.
Make sure that the database's retention policy is appropriate for your needs. If not, explicitly override it at the table level. For more information, see retention policy.
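The inheritance rule described above (a table-level policy wins, otherwise the database policy applies) can be sketched as follows. The `effective_retention` function and the `soft_delete_days` field are hypothetical and only model the fallback, not the actual policy object.

```python
def effective_retention(table_policy, database_policy):
    """Return the table's own policy if one is set, else the database's policy."""
    return table_policy if table_policy is not None else database_policy


if __name__ == "__main__":
    database_policy = {"soft_delete_days": 365}

    # No table-level policy: the database policy is inherited.
    print(effective_retention(None, database_policy))

    # An explicit table-level policy overrides the database policy.
    print(effective_retention({"soft_delete_days": 30}, database_policy))
```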
Create a table
To ingest data programmatically, a table needs to be created beforehand. If you're using the Get data experience, you can create a table as part of the ingestion flow.
- Create a table with a command.
If a record is incomplete or a field cannot be parsed as the required data type, the corresponding table columns will be populated with null values.
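The null-filling behavior can be pictured with a small sketch. It is not the service's parser: `parse_field`, `parse_record`, and the three type names handled here are simplifications chosen for illustration.

```python
from datetime import datetime


def parse_field(raw, kind):
    """Convert a raw value to the column type, or return None if that fails."""
    if raw is None:
        return None  # field missing from the record
    try:
        if kind == "long":
            return int(raw)
        if kind == "datetime":
            return datetime.fromisoformat(raw)
        return str(raw)  # treat everything else as a string column
    except (ValueError, TypeError):
        return None  # value could not be parsed as the required type


def parse_record(record, schema):
    """Produce one value per table column; unparseable or missing fields become None."""
    return {column: parse_field(record.get(column), kind) for column, kind in schema.items()}


if __name__ == "__main__":
    schema = {"Timestamp": "datetime", "Level": "string", "Count": "long"}
    # "Count" is not a number and "Timestamp" is missing, so both become None.
    print(parse_record({"Level": "info", "Count": "abc"}, schema))
```

Because bad values are replaced rather than rejected, it is worth checking for unexpected nulls after a first test ingestion.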
Create schema mapping
Schema mapping helps bind source data fields to destination table columns. Mapping allows you to take data from different sources into the same table, based on the defined attributes. Different types of mappings are supported, both row-oriented (CSV, JSON, and Avro) and column-oriented (Parquet). In most methods, mappings can also be pre-created on the table and referenced from the ingest command parameter.
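The idea of binding differently shaped sources to the same table columns can be sketched as below. The `apply_mapping` helper, the field names, and the column names are all hypothetical, and only flat, top-level fields are handled; real mappings are defined on the table or passed with the ingest command.

```python
def apply_mapping(record, mapping):
    """Bind source fields to destination columns (mapping: column -> source field)."""
    return {column: record.get(field) for column, field in mapping.items()}


# Two sources that name their fields differently but feed the same table.
WEB_MAPPING = {"Timestamp": "ts", "Message": "msg"}
DEVICE_MAPPING = {"Timestamp": "eventTime", "Message": "payload"}

if __name__ == "__main__":
    web_row = apply_mapping({"ts": "2024-01-02T03:04:05", "msg": "login"}, WEB_MAPPING)
    device_row = apply_mapping({"eventTime": "2024-01-02T03:05:00", "payload": "boot"}, DEVICE_MAPPING)
    # Both rows now share the destination table's column names.
    print(web_row)
    print(device_row)
```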
Set update policy (optional)
Some of the data format mappings (Parquet, JSON, and Avro) support simple and useful ingest-time transformations. If the scenario requires more complex processing at ingestion, adjust the update policy, which supports lightweight processing using query commands. The update policy automatically runs extractions and transformations on ingested data on the original table, and ingests the resulting data into one or more destination tables.
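Conceptually, an update policy behaves like a trigger: each time data lands in the source table, a transformation runs over the newly ingested rows and its output is appended to a destination table. The sketch below models that flow in memory; the `Tables` class and the table names are hypothetical, and in Azure Data Explorer the transformation is a query rather than a Python function. Cascading policies are not modeled.

```python
class Tables:
    """In-memory stand-in for a database with update policies."""

    def __init__(self):
        self.data = {}      # table name -> list of rows
        self.policies = {}  # source table -> list of (destination, transform)

    def set_update_policy(self, source, destination, transform):
        self.policies.setdefault(source, []).append((destination, transform))

    def ingest(self, table, rows):
        # Rows always land in the table they were ingested into.
        self.data.setdefault(table, []).extend(rows)
        # Each policy on that table transforms only the newly ingested rows.
        for destination, transform in self.policies.get(table, []):
            self.data.setdefault(destination, []).extend(transform(rows))


if __name__ == "__main__":
    db = Tables()
    db.set_update_policy(
        "RawLogs", "Errors",
        lambda rows: [row for row in rows if row["Level"] == "error"],
    )
    db.ingest("RawLogs", [{"Level": "error", "Message": "disk full"},
                          {"Level": "info", "Message": "started"}])
    print(db.data["Errors"])
```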
You can ingest sample data into the table you created in your database using commands or the ingestion wizard. To ingest your own data, you can select from a range of options, including ingestion tools, connectors to diverse services, managed pipelines, programmatic ingestion using SDKs, and direct access to ingestion.