Ingest data from Cribl Stream into Azure Data Explorer

09/19/2024

Cribl Stream is a processing engine that securely collects, processes, and streams machine event data from any source. You can parse and process that data, then route it to any destination for analysis and management.

This article shows how to ingest data from Cribl Stream into Azure Data Explorer.

For a complete list of data connectors, see Data integrations overview.

Prerequisites

A Cribl Stream account.
An Azure Data Explorer cluster and database with the default cache and retention policies.
A query environment. For more information, see Query integrations overview.
Your Kusto cluster URI for the TargetURI value in the format https://ingest-<cluster>.<region>.kusto.windows.net. For more information, see Add a cluster connection.
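
As a minimal sketch of the URI format above, the following Python helper builds the ingestion URI from a cluster name and region. The function name `cluster_uris` and the sample values are hypothetical. The base (query) URI is assumed to be the same host without the `ingest-` prefix.

```python
def cluster_uris(cluster: str, region: str) -> dict:
    """Build the base and ingestion URIs for a cluster.

    Assumption: the base URI is the ingestion URI without the "ingest-" prefix.
    """
    host = f"{cluster}.{region}.kusto.windows.net"
    return {
        "base": f"https://{host}",
        "ingest": f"https://ingest-{host}",
    }


# Hypothetical cluster name and region, for illustration only.
uris = cluster_uris("mycluster", "westeurope")
print(uris["ingest"])  # https://ingest-mycluster.westeurope.kusto.windows.net
```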

Create a Microsoft Entra service principal

The Microsoft Entra service principal can be created through the Azure portal or programmatically, as in the following example.

This service principal is the identity used by the connector to write data to your table in Kusto. You grant permissions for this service principal to access Kusto resources.

Sign in to your Azure subscription via Azure CLI. Then authenticate in the browser.
```
az login
```
Choose the subscription to host the principal. This step is needed when you have multiple subscriptions.
```
az account set --subscription YOUR_SUBSCRIPTION_GUID
```

Create the service principal. In this example, the service principal is called my-service-principal.

```
az ad sp create-for-rbac -n "my-service-principal" --role Contributor --scopes /subscriptions/{SubID}
```

From the returned JSON data, copy the appId, password, and tenant for future use.

```
{
  "appId": "00001111-aaaa-2222-bbbb-3333cccc4444",
  "displayName": "my-service-principal",
  "name": "my-service-principal",
  "password": "00001111-aaaa-2222-bbbb-3333cccc4444",
  "tenant": "00001111-aaaa-2222-bbbb-3333cccc4444"
}
```

You've created your Microsoft Entra application and service principal.
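
If you capture the JSON returned by the CLI, a short script can pull out the three values you need later. The following is a minimal sketch: the function name `extract_credentials` and the output key names are hypothetical, and the sample input is the placeholder JSON shown above.

```python
import json

# Placeholder output copied from the example above; real values will differ.
SAMPLE_OUTPUT = """
{
  "appId": "00001111-aaaa-2222-bbbb-3333cccc4444",
  "displayName": "my-service-principal",
  "name": "my-service-principal",
  "password": "00001111-aaaa-2222-bbbb-3333cccc4444",
  "tenant": "00001111-aaaa-2222-bbbb-3333cccc4444"
}
"""


def extract_credentials(cli_json: str) -> dict:
    """Map the CLI fields to the names used later in the Cribl settings."""
    data = json.loads(cli_json)
    return {
        "tenant_id": data["tenant"],        # Tenant ID
        "client_id": data["appId"],         # Client ID
        "client_secret": data["password"],  # Client secret
    }


creds = extract_credentials(SAMPLE_OUTPUT)
print(sorted(creds))  # ['client_id', 'client_secret', 'tenant_id']
```

Treat the `password` value as a secret: avoid printing it or committing it to source control.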

Create a target table

Create a target table for the incoming data and an ingestion mapping to map the ingested data columns to the columns in the target table.

Run the following table creation command in your query editor, replacing the placeholder TableName with the name of the target table:
```
.create table <TableName> (_raw: string, _time: long, cribl_pipe: dynamic)
```

Run the following create ingestion mapping command, replacing the placeholders TableName with the target table name and TableNameMapping with the name of the ingestion mapping:

```
.create table <TableName> ingestion csv mapping '<TableNameMapping>' '[{"Name":"_raw","DataType":"string","Ordinal":"0","ConstValue":null},{"Name":"_time","DataType":"long","Ordinal":"1","ConstValue":null},{"Name":"cribl_pipe","DataType":"dynamic","Ordinal":"2","ConstValue":null}]'
```

Grant the service principal you created in Create a Microsoft Entra service principal the database ingestor role so that it can work with the database. For more information, see Examples. Replace the placeholder DatabaseName with the name of the target database and ApplicationID with the appId value you saved when you created the service principal.
```
.add database <DatabaseName> ingestors ('aadapp=<ApplicationID>') 'App Registration'
```
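
The mapping JSON in the ingestion mapping command is easy to mistype by hand. As a minimal sketch, the following Python generates it from an ordered column list. The helper names `csv_mapping` and `create_mapping_command`, and the table and mapping names in the usage line, are hypothetical. The column list mirrors the table created above.

```python
import json

# Columns in the same order as the target table definition.
COLUMNS = [("_raw", "string"), ("_time", "long"), ("cribl_pipe", "dynamic")]


def csv_mapping(columns) -> str:
    """Serialize the mapping, assigning ordinals by position in the list."""
    entries = [
        {"Name": name, "DataType": dtype, "Ordinal": str(i), "ConstValue": None}
        for i, (name, dtype) in enumerate(columns)
    ]
    # Compact separators keep the JSON on one line inside the command.
    return json.dumps(entries, separators=(",", ":"))


def create_mapping_command(table: str, mapping_name: str, columns) -> str:
    """Build the management command text to paste into the query editor."""
    return (
        f".create table {table} ingestion csv mapping "
        f"'{mapping_name}' '{csv_mapping(columns)}'"
    )


# Hypothetical table and mapping names.
print(create_mapping_command("CriblLogs", "CriblLogsMapping", COLUMNS))
```

Deriving the ordinals from list position keeps the mapping in step with the table definition when you add or reorder columns.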

Create Cribl Stream destination

The following section describes how to create a Cribl Stream destination that writes data to your table in Kusto. Each table requires a separate Cribl Stream destination connector.

Select destination

To connect Cribl Stream to your table:

From the top navigation in Cribl, select Manage then select a Worker Group.
Select Routing > QuickConnect (Stream) > Add Destination.
In the Set up new QuickConnect Destination window, choose Azure Data Explorer, then Add now.

Set up general settings

In the New Data Explorer window, under General Settings, configure the following settings:

Setting	Value	Description
Output ID	<OutputID>, for instance, KustoDestination	The name used to identify your destination.
Ingestion Mode	Batching (default) or Streaming	The ingestion mode. With Batching, your table pulls batches of data from a Cribl storage container, which suits ingesting large amounts of data over a short period. Streaming sends data directly to the target KQL table. Streaming is useful for ingesting smaller amounts of data or, for example, for sending a critical alert in real time, and it can achieve lower latency than batching. If the ingestion mode is set to Streaming, you need to enable a streaming ingestion policy. For more information, see Streaming ingestion policy.
Cluster base URI	base URI	The base URI of your cluster.
Ingestion service URI	ingestion URI	The ingestion URI of your cluster, in the format described in Prerequisites. Displays when Batching mode is selected.
Database name	<DatabaseName>	The name of your target database.
Table name	<TableName>	The name of your target table.
Validate database settings	Yes (default) or No	Validates the service principal app credentials you entered when you save or start your destination. It also validates the table name, except when Add mapping object is on. Disable this setting if your app doesn't have both the Database Viewer and Table Viewer roles.
Add mapping object	Yes or No (default)	Displayed only when Batching mode is selected, in place of the default Data mapping text field. Selecting Yes opens a window to enter a data mapping as a JSON object.
Data mapping	The mapping schema name as defined in the Create a target table step.	The mapping schema name. The default view when Add mapping object is set to No.
Compress	gzip (default)	When Data format is set to Parquet, Compress isn't available.
Data format	JSON (default), Raw, or Parquet.	The data format. Parquet is only available in Batching mode and only supported on Linux.
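Backpressure behavior	Block (default), Drop, or Persistent Queue	Choose whether to block, drop, or queue events when receivers are exerting backpressure. The Persistent Queue option is described in the Persistent Queue section.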
Tags	Optional values	Optional tags to filter and group destinations in Cribl Stream’s Manage Destinations page. Use a tab or hard return between tag names. These tags aren’t added to processed events.

When completed, select Next.

Authentication settings

Select Authentication Settings in the sidebar. Use the values you saved in Create a Microsoft Entra service principal along with your base URI as follows:

Setting	Value	Description
Tenant ID	<TenantID>	Use the `tenant` value you saved in Create a Microsoft Entra service principal.
Client ID	<ClientID>	Use the `appId` value you saved in Create a Microsoft Entra service principal.
Scope	`<baseuri>/.default`	Use the value from base URI for baseuri.
Authentication method	Client secret, Client secret (text secret), or Certificate	For Client secret, use the client secret of the Microsoft Entra application you created in Create a Microsoft Entra service principal. For Certificate, use the certificate whose public key you registered, or will register, for that Microsoft Entra application.

Then select Next.
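
To see how the Tenant ID, Client ID, client secret, and Scope fit together, the following sketch builds (but doesn't send) a client-credentials token request for the Microsoft identity platform v2.0 token endpoint. Cribl Stream handles authentication itself, so this is illustrative only. The function name `build_token_request` and the sample values are hypothetical.

```python
from urllib.parse import urlencode


def build_token_request(tenant_id: str, client_id: str,
                        client_secret: str, base_uri: str):
    """Return the token endpoint URL and form-encoded body (nothing is sent)."""
    # Scope is the cluster base URI followed by "/.default".
    scope = f"{base_uri.rstrip('/')}/.default"
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    })
    return url, body


# Hypothetical values, for illustration only.
url, body = build_token_request(
    "my-tenant-id", "my-client-id", "my-secret",
    "https://mycluster.westeurope.kusto.windows.net",
)
print(url)
```

A trailing slash on the base URI is stripped so that the scope never contains a double slash before `.default`.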

Persistent Queue

Displays when Ingestion mode is set to Streaming, and Backpressure behavior is set to Persistent Queue.

Setting	Value	Description
Max file size	1 MB (default)	The maximum queue file size to reach before closing the file. Include units, such as KB or MB, when entering a number.
Max queue size	5 GB (default)	The maximum amount of disk space that the queue can consume on each Worker Process before the Destination stops queueing data. Required. Enter a positive number with units such as KB, MB, or GB. The maximum value is 1 TB.
Queue file path	`$CRIBL_HOME/state/queues` (default)	The persistent queue file location. Cribl Stream appends `/<worker‑id>/<output‑id>` to this value.
Compression	None (default), gzip	The compression method to use to compress the persisted data, upon closing.
Queue-full behavior	Block or Drop	Choose whether to block or drop events when the queue exerts backpressure because disk capacity is low or exhausted.
Strict ordering	Yes (default) or No	When set to Yes, events are forwarded in first in, first out order. Set to No to send new events before earlier queued events.
Drain rate limit (EPS)	0 (default)	This option is displayed when Strict ordering is set to No, to allow you to set a throttling rate (in events per second) on writing from the queue to receivers. Throttling the drain rate of queued events boosts new or active connection throughput. Zero disables throttling.
Clear Persistent Queue	NA	Select to delete files currently queued for delivery to your Destination. You'll need to confirm this action since queued data is permanently deleted without getting delivered.

When complete, select Next.
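
The size fields above expect a positive number plus a unit, and Max queue size is capped at 1 TB. The following sketch validates such values before you enter them. It assumes binary units (1 KB = 1,024 bytes), which the settings description doesn't confirm. The function names are hypothetical.

```python
import re

# Assumption: binary multiples. Cribl Stream's exact interpretation may differ.
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}
MAX_QUEUE_BYTES = UNITS["TB"]  # documented maximum of 1 TB

_SIZE_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(KB|MB|GB|TB)\s*", re.IGNORECASE)


def parse_size(text: str) -> int:
    """Convert a value such as '5 GB' to bytes; reject missing units or zero."""
    match = _SIZE_RE.fullmatch(text)
    if not match:
        raise ValueError(f"expected a number with units, got {text!r}")
    size = int(float(match.group(1)) * UNITS[match.group(2).upper()])
    if size <= 0:
        raise ValueError("size must be positive")
    return size


def validate_max_queue_size(text: str) -> int:
    """Parse a Max queue size value and enforce the 1 TB ceiling."""
    size = parse_size(text)
    if size > MAX_QUEUE_BYTES:
        raise ValueError("Max queue size can't exceed 1 TB")
    return size


print(validate_max_queue_size("5 GB"))  # 5368709120
```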

Processing settings

Setting	Value	Description
Pipeline	<defined_pipeline>	An optional pipeline to process data before sending it out using this output.
System fields	`cribl_pipe` (default), `cribl_host`, `cribl_input`, `cribl_output`, `cribl_route`, or `cribl_wp`	A list of fields that are automatically added to events before they're sent to their destination. Wildcards are supported.

When complete, select Next.

Parquet settings

Displays when Parquet is selected for Data Format.

Choosing Parquet opens a Parquet Settings tab, where you select the Parquet schema.

Setting	Value	Description
Automatic schema	On or Off	Select On to generate a Parquet schema based on the events of each Parquet file that Cribl Stream writes.
Parquet schema	drop-down	Displays when Automatic schema is set to Off, so that you can select your Parquet schema.
Parquet version	1.0, 2.4, 2.6 (default)	The version determines the supported data types and how they're represented.
Data page version	V1, V2 (default)	The data page serialization format. If your Parquet reader doesn't support Parquet V2, use V1.
Group row limit	1000 (default)	The maximum number of rows that every group can contain.
Page size	1 MB (default)	The target memory size for page segments. Lower values can improve reading speed, while higher values can improve compression.
Log invalid rows	Yes or No	When Yes is selected, and Log level is set to `debug`, outputs up to 20 unique rows that were skipped due to data format mismatch.
Write statistics	On (default) or Off	Select On if you have Parquet statistic viewing tools configured.
Write page indexes	On (default) or Off	Select On if your Parquet reader uses Parquet page index statistics to enable page skipping.
Write page checksum	On or Off	Select On if you use Parquet tools to check data integrity using Parquet page checksums.
Metadata (optional)		The Destination file metadata properties that can be included as key-value pairs.

Retries

Displays when Ingestion mode is set to Streaming.

Setting	Value	Description
Honor Retry-After header	Yes or No	Whether to honor a `Retry-After` header. When enabled, a received `Retry-After` header takes precedence over the other configured options in the Retries section, as long as the header specifies a delay of 180 seconds or less. Otherwise, `Retry-After` headers are ignored.
Settings for failed HTTP requests	HTTP status codes	A list of HTTP status codes for which failed requests are retried automatically. Cribl Stream automatically retries requests that fail with status 429.
Retry timed-out HTTP requests	On or Off	When set, more retry behavior settings become available.
Pre-backoff interval (ms)	1000 ms (default)	The wait time before retrying.
Backoff multiplier	2 (default)	The base of the exponential backoff algorithm that determines the interval between retries.
Backoff limit (ms)	10,000 ms (default)	The maximum backoff interval for the final streaming retry. Possible values range from 10,000 milliseconds (10 seconds) to 180,000 milliseconds (3 minutes).

When complete, select Next.
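
To get a feel for how the retry settings interact, the following sketch computes retry delays from the default values in the table. It assumes a conventional exponential backoff (pre-backoff interval multiplied by the multiplier raised to the retry number, capped at the backoff limit). Cribl Stream's exact algorithm may differ, and the function names are hypothetical.

```python
from typing import Optional

MAX_HONORED_RETRY_AFTER_S = 180  # longer Retry-After delays are ignored


def backoff_ms(retry: int, initial_ms: int = 1000,
               multiplier: float = 2.0, limit_ms: int = 10_000) -> int:
    """Delay before the given retry (0 = first retry), capped at the limit."""
    return int(min(initial_ms * multiplier ** retry, limit_ms))


def next_delay_ms(retry: int, retry_after_s: Optional[float] = None,
                  honor_retry_after: bool = True) -> int:
    """Prefer a usable Retry-After header; otherwise fall back to backoff."""
    if (honor_retry_after and retry_after_s is not None
            and retry_after_s <= MAX_HONORED_RETRY_AFTER_S):
        return int(retry_after_s * 1000)
    return backoff_ms(retry)


print([backoff_ms(n) for n in range(5)])  # [1000, 2000, 4000, 8000, 10000]
```

With the defaults, the delay reaches the 10,000 ms limit on the fifth retry and stays there.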

Advanced settings

Select Advanced Settings from the sidebar. The following describes the advanced settings when Batching is selected:

Setting	Value	Description
Flush immediately	Yes or No (default)	Set to Yes to override data aggregation in Kusto. For more information, see Best practices for the Kusto Ingest library.
Retain blob on success	Yes or No (default)	Set to Yes to retain the data blob upon ingestion completion.
Extent tags	<ExtentTag, ET2,...>	Optional tags to apply to the partitioned extents of the target table.
Enforce uniqueness via tag values		Select Add value to specify an `ingest-by` value list to use to filter incoming extents and discard the extents matching a listed value. For more information, see Extents (data shards).
Report level	DoNotReport, FailuresOnly (default), or FailuresAndSuccesses	The ingestion status reporting level.
Report method	Queue (default), Table, or QueueAndTable (recommended)	Target for ingestion status reporting.
Additional fields		Add more configuration properties, if desired, to send to the ingestion service.
Staging location	`/tmp` (default)	Local filesystem location in which to buffer files before compressing and moving them to the final destination. Cribl recommends a stable and high-performance location.
File name suffix expression	`.${C.env["CRIBL_WORKER_ID"]}.${__format}${__compression === "gzip" ? ".gz" : ""}` (default)	A JavaScript expression enclosed in quotes or backticks used as the output filename suffix. `__format` can be JSON or raw, and `__compression` can be none or gzip. A random sequence of six characters is appended to the end of the file names to prevent them from getting overwritten.
Max file size (MB)	32 MB (default)	The maximum uncompressed output file size that files can reach before they close and are moved to the storage container.
Max file open time (sec)	300 seconds (default)	The maximum amount of time, in seconds, to write to a file before it's closed and moved to the storage container.
Max file idle time (sec)	30 seconds (default)	The maximum amount of time, in seconds, to keep inactive files open before they close and are moved to the storage container.
Max open files	100 (default)	The maximum number of files to keep open at the same time before the oldest open files are closed and moved to the storage container.
Max concurrent file parts	1 (default)	The maximum number of file parts to upload at the same time. The default is 1 and the highest is 10. Setting the value to one allows sending one part at a time, sequentially.
Remove empty staging dirs	Yes (default) or No	When toggled on, Cribl Stream deletes empty staging directories after moving files. This prevents the proliferation of orphaned empty directories. When enabled, exposes Staging cleanup period.
Staging cleanup period	300 (default)	The amount of time, in seconds, until empty directories are deleted. Displays when Remove empty staging dirs is set to Yes. The minimum value is 10 seconds, and the maximum is 86,400 seconds (24 hours).
Environment		When empty (default) the configuration is enabled everywhere. If you’re using GitOps, you can specify the Git branch where you want to enable the configuration.

When completed, select Save.
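
The three file limits in the table (Max file size, Max file open time, and Max file idle time) each independently cause a staged file to close and move to the storage container. The following sketch models that decision with the default values. It's an illustration of the described behavior, not Cribl Stream's implementation, and the `StagedFile` and `close_reason` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StagedFile:
    size_bytes: int        # uncompressed bytes written so far
    opened_at: float       # seconds, when the file was opened
    last_write_at: float   # seconds, when the file was last written to


def close_reason(f: StagedFile, now: float, max_size_mb: int = 32,
                 max_open_s: int = 300, max_idle_s: int = 30) -> Optional[str]:
    """Return which limit closes the file, or None if it stays open."""
    if f.size_bytes >= max_size_mb * 1024 * 1024:
        return "max file size"
    if now - f.opened_at >= max_open_s:
        return "max file open time"
    if now - f.last_write_at >= max_idle_s:
        return "max file idle time"
    return None


# A small file that has received no writes for 31 seconds closes as idle.
print(close_reason(StagedFile(1024, opened_at=0.0, last_write_at=0.0), now=31.0))
```

With the defaults, a low-volume source is usually governed by the 30-second idle limit, while a busy source tends to hit the size or open-time limit first.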

Connection configuration

From the Connection Configuration window that opens, select Passthru connection then Save. The connector starts queueing the data.

Confirm data ingestion

Once data arrives in the table, confirm the transfer by checking the row count:
```
<TableName>
| count
```

Confirm the ingestions queued in the last five minutes, replacing the placeholder DatabaseName with the name of your database:
```
.show commands-and-queries
| where Database == "<DatabaseName>" and CommandType == "DataIngestPull"
| where LastUpdatedOn >= ago(5m)
```

Confirm that there are no failures in the ingestion process:

For batching:

```
.show ingestion failures
```

For streaming:

```
.show streamingingestion failures
| order by LastFailureOn desc
```

Verify data in your table:
```
<TableName>
| take 10
```

For query examples and guidance, see Write queries in KQL and Kusto Query Language documentation.
