infer_storage_schema plugin

This plugin infers schema of external data, and returns it as CSL schema string. The string can be used when creating external tables. The plugin is invoked with the evaluate operator.

Syntax

evaluate infer_storage_schema( Options )

Parameters

Name Type Required Description
Options dynamic A property bag specifying the properties of the request.

Properties of the request

Name Type Required Description
StorageContainers dynamic An array of storage connection strings that represent prefix URI for stored data artifacts.
DataFormat string One of the supported data formats.
FileExtension string If specified, the function will only scan files ending with this file extension. Specifying the extension may speed up the process or eliminate data reading issues.
FileNamePrefix bool If specified, the function will only scan files starting with this prefix. Specifying the prefix may speed up the process.
Mode string The schema inference strategy. A value of: any, last, all. The function will infer data schema from the first found file, from the last written file, or from all files respectively. The default value is last.

Returns

The infer_storage_schema plugin returns a single result table containing a single row/column holding CSL schema string.

Note

  • Storage container URI secret keys must have the permissions for List in addition to Read.
  • Schema inference strategy 'all' is a very "expensive" operation, as it implies reading from all artifacts found and merging their schema.
  • Some returned types may not be the actual ones as a result of wrong type guess (or, as a result of schema merge process). This is why you should review the result carefully before creating an external table.

Example

let options = dynamic({
  'StorageContainers': [
    h@'https://storageaccount.blob.core.windows.net/MovileEvents;secretKey'
  ],
  'FileExtension': '.parquet',
  'FileNamePrefix': 'part-',
  'DataFormat': 'parquet'
});
evaluate infer_storage_schema(options)

Output

CslSchema
app_id:string, user_id:long, event_time:datetime, country:string, city:string, device_type:string, device_vendor:string, ad_network:string, campaign:string, site_id:string, event_type:string, event_name:string, organic:string, days_from_install:int, revenue:real

Use the returned schema in external table definition:

.create external table MobileEvents(
    app_id:string, user_id:long, event_time:datetime, country:string, city:string, device_type:string, device_vendor:string, ad_network:string, campaign:string, site_id:string, event_type:string, event_name:string, organic:string, days_from_install:int, revenue:real
)
kind=blob
partition by (dt:datetime = bin(event_time, 1d), app:string = app_id)
pathformat = ('app=' app '/dt=' datetime_pattern('yyyyMMdd', dt))
dataformat = parquet
(
    h@'https://storageaccount.blob.core.windows.net/MovileEvents;secretKey'
)