Events
Mar 31, 11 PM - Apr 2, 11 PM
The ultimate Microsoft Fabric, Power BI, SQL, and AI community-led event. March 31 to April 2, 2025.
Register todayThis browser is no longer supported.
Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support.
Azure Data Lake Storage is a highly scalable and cost-effective data lake solution for big data analytics. It combines the power of a high-performance file system with massive scale and economy to help you reduce your time to insight. Data Lake Storage Gen2 extends Azure Blob Storage capabilities and is optimized for analytics workloads.
Azure Data Explorer integrates with Azure Blob Storage and Azure Data Lake Storage (Gen1 and Gen2), providing fast, cached, and indexed access to data stored in external storage. You can analyze and query data without prior ingestion into Azure Data Explorer. You can also query across ingested and uningested external data simultaneously. For more information, see how to create an external table using the Azure Data Explorer web UI wizard. For a brief overview, see external tables.
Tip
The best query performance necessitates data ingestion into Azure Data Explorer. The capability to query external data without prior ingestion should only be used for historical data or data that are rarely queried. Optimize your external data query performance for best results.
Let's say you have lots of CSV files containing historical info on products stored in a warehouse, and you want to do a quick analysis to find the five most popular products from last year. In this example, the CSV files look like:
Timestamp | ProductId | ProductDescription |
---|---|---|
2019-01-01 11:21:00 | TO6050 | 3.5in DS/HD Floppy Disk |
2019-01-01 11:30:55 | YDX1 | Yamaha DX1 Synthesizer |
... | ... | ... |
The files are stored in Azure Blob storage mycompanystorage
under a container named archivedproducts
, partitioned by date:
https://mycompanystorage.blob.core.windows.net/archivedproducts/2019/01/01/part-00000-7e967c99-cf2b-4dbb-8c53-ce388389470d.csv.gz
https://mycompanystorage.blob.core.windows.net/archivedproducts/2019/01/01/part-00001-ba356fa4-f85f-430a-8b5a-afd64f128ca4.csv.gz
https://mycompanystorage.blob.core.windows.net/archivedproducts/2019/01/01/part-00002-acb644dc-2fc6-467c-ab80-d1590b23fc31.csv.gz
https://mycompanystorage.blob.core.windows.net/archivedproducts/2019/01/01/part-00003-cd5fad16-a45e-4f8c-a2d0-5ea5de2f4e02.csv.gz
https://mycompanystorage.blob.core.windows.net/archivedproducts/2019/01/02/part-00000-ffc72d50-ff98-423c-913b-75482ba9ec86.csv.gz
...
To run a KQL query on these CSV files directly, use the .create external table
command to define an external table in Azure Data Explorer. For more information on external table create command options, see external table commands.
.create external table ArchivedProducts(Timestamp:datetime, ProductId:string, ProductDescription:string)
kind=blob
partition by (Date:datetime = bin(Timestamp, 1d))
dataformat=csv
(
h@'https://mycompanystorage.blob.core.windows.net/archivedproducts;StorageSecretKey'
)
The external table is now visible in the left pane of the Azure Data Explorer web UI:
Once an external table is defined, the external_table()
function can be used to refer to it. The rest of the query is standard Kusto Query Language.
external_table("ArchivedProducts")
| where Timestamp > ago(365d)
| summarize Count=count() by ProductId,
| top 5 by Count
You can query both external tables and ingested data tables within the same query. You can join
or union
the external table with other data from Azure Data Explorer, SQL servers, or other sources. Use a let( ) statement
to assign a shorthand name to an external table reference.
In the example below, Products is an ingested data table and ArchivedProducts is an external table that we've defined previously:
let T1 = external_table("ArchivedProducts") | where TimeStamp > ago(100d);
let T = Products; //T is an internal table
T1 | join T on ProductId | take 10
Azure Data Explorer allows querying hierarchical formats, such as JSON
, Parquet
, Avro
, and ORC
. To map hierarchical data schema to an external table schema (if it's different), use external table mappings commands. For instance, if you want to query JSON log files with the following format:
{
"timestamp": "2019-01-01 10:00:00.238521",
"data": {
"tenant": "e1ef54a6-c6f2-4389-836e-d289b37bcfe0",
"method": "RefreshTableMetadata"
}
}
{
"timestamp": "2019-01-01 10:00:01.845423",
"data": {
"tenant": "9b49d0d7-b3e6-4467-bb35-fa420a25d324",
"method": "GetFileList"
}
}
...
The external table definition looks like this:
.create external table ApiCalls(Timestamp: datetime, TenantId: guid, MethodName: string)
kind=blob
dataformat=multijson
(
h@'https://storageaccount.blob.core.windows.net/container1;StorageSecretKey'
)
Define a JSON mapping that maps data fields to external table definition fields:
.create external table ApiCalls json mapping 'MyMapping' '[{"Column":"Timestamp","Properties":{"Path":"$.timestamp"}},{"Column":"TenantId","Properties":{"Path":"$.data.tenant"}},{"Column":"MethodName","Properties":{"Path":"$.data.method"}}]'
When you query the external table, the mapping will be invoked, and relevant data will be mapped to the external table columns:
external_table('ApiCalls') | take 10
For more info on mapping syntax, see data mappings.
Use the test cluster called help to try out different Azure Data Explorer capabilities. The help cluster contains an external table definition for a New York City taxi dataset containing billions of taxi rides.
This section shows the query used to create the TaxiRides external table in the help cluster. Since this table has already been created, you can skip this section and go directly to query TaxiRides external table data.
.create external table TaxiRides
(
trip_id: long,
vendor_id: string,
pickup_datetime: datetime,
dropoff_datetime: datetime,
store_and_fwd_flag: string,
rate_code_id: int,
pickup_longitude: real,
pickup_latitude: real,
dropoff_longitude: real,
dropoff_latitude: real,
passenger_count: int,
trip_distance: real,
fare_amount: real,
extra: real,
mta_tax: real,
tip_amount: real,
tolls_amount: real,
ehail_fee: real,
improvement_surcharge: real,
total_amount: real,
payment_type: string,
trip_type: int,
pickup: string,
dropoff: string,
cab_type: string,
precipitation: int,
snow_depth: int,
snowfall: int,
max_temperature: int,
min_temperature: int,
average_wind_speed: int,
pickup_nyct2010_gid: int,
pickup_ctlabel: string,
pickup_borocode: int,
pickup_boroname: string,
pickup_ct2010: string,
pickup_boroct2010: string,
pickup_cdeligibil: string,
pickup_ntacode: string,
pickup_ntaname: string,
pickup_puma: string,
dropoff_nyct2010_gid: int,
dropoff_ctlabel: string,
dropoff_borocode: int,
dropoff_boroname: string,
dropoff_ct2010: string,
dropoff_boroct2010: string,
dropoff_cdeligibil: string,
dropoff_ntacode: string,
dropoff_ntaname: string,
dropoff_puma: string
)
kind=blob
partition by (Date:datetime = bin(pickup_datetime, 1d))
dataformat=csv
(
h@'https://storageaccount.blob.core.windows.net/container1;secretKey'
)
You can find the created TaxiRides table by looking at the left pane of the Azure Data Explorer web UI:
Sign in to https://dataexplorer.azure.com/clusters/help/databases/Samples.
Run this query on the external table TaxiRides to show rides for each day of the week, across the entire dataset.
external_table("TaxiRides")
| summarize count() by dayofweek(pickup_datetime)
| render columnchart
This query shows the busiest day of the week. Since the data isn't partitioned, the query may take up to several minutes to return results.
Run this query on the external table TaxiRides to show taxi cab types (yellow or green) used in January of 2017.
external_table("TaxiRides")
| where pickup_datetime between (datetime(2017-01-01) .. datetime(2017-02-01))
| summarize count() by cab_type
| render piechart
This query uses partitioning, which optimizes query time and performance. The query filters on a partitioned column (pickup_datetime) and returns results in a few seconds.
You can write other queries to run on the external table TaxiRides and learn more about the data.
Optimize your query performance in the lake by using the following best practices for querying external data.
Check that external data is in the same Azure region as your Azure Data Explorer cluster. This setup reduces cost and data fetch time.
The optimal file size is hundreds of Mb (up to 1 GB) per file. Avoid many small files that require unneeded overhead, such as slower file enumeration process and limited use of columnar format. The number of files should be greater than the number of CPU cores in your Azure Data Explorer cluster.
Use compression to reduce the amount of data being fetched from the remote storage. For Parquet format, use the internal Parquet compression mechanism that compresses column groups separately, allowing you to read them separately. To validate use of compression mechanism, check that the files are named as follows: <filename>.gz.parquet or <filename>.snappy.parquet and not <filename>.parquet.gz.
Organize your data using "folder" partitions that enable the query to skip irrelevant paths. When planning partitioning, consider file size and common filters in your queries such as timestamp or tenant ID.
Select VM SKUs with more cores and higher network throughput (memory is less important). For more information, see Select the correct VM SKU for your Azure Data Explorer cluster.
Events
Mar 31, 11 PM - Apr 2, 11 PM
The ultimate Microsoft Fabric, Power BI, SQL, and AI community-led event. March 31 to April 2, 2025.
Register today