Exploring captured Avro files in Azure Event Hubs

This article provides the schema for Avro files captured by Azure Event Hubs and a few tools to explore the files.

Schema

The Avro files produced by Event Hubs Capture have the following Avro schema:

Image showing the schema of Avro files captured by Azure Event Hubs.

Azure Storage Explorer

You can verify that captured files were created in the Azure Storage account using tools such as Azure Storage Explorer. You can download files locally to work on them.
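The same check can also be scripted. The following is a minimal sketch, not an official procedure. It assumes the `azure-storage-blob` Python package is installed and that Capture uses its default blob naming convention ({Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.avro). The function names, connection string, container name, and prefix are hypothetical placeholders.

```python
from datetime import datetime, timezone


def parse_capture_blob_name(blob_name):
    """Split a Capture blob name into its parts.

    Assumes the default naming convention; a customized capture file
    name format would need different parsing.
    """
    stem = blob_name[:-len(".avro")] if blob_name.endswith(".avro") else blob_name
    parts = stem.split("/")
    if len(parts) != 9:
        raise ValueError(f"unexpected Capture blob name: {blob_name!r}")
    namespace, event_hub, partition_id = parts[:3]
    year, month, day, hour, minute, second = (int(p) for p in parts[3:])
    return {
        "namespace": namespace,
        "event_hub": event_hub,
        "partition_id": partition_id,
        "timestamp_utc": datetime(year, month, day, hour, minute, second,
                                  tzinfo=timezone.utc),
    }


def download_capture_files(connection_string, container_name, prefix, target_dir="."):
    """List blobs under a prefix and download each .avro file locally."""
    import os
    # Imported lazily so the parsing helper works without the SDK installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    container = service.get_container_client(container_name)
    for blob in container.list_blobs(name_starts_with=prefix):
        if not blob.name.endswith(".avro"):
            continue
        # Flatten the blob path into a single local file name.
        local_path = os.path.join(target_dir, blob.name.replace("/", "_"))
        with open(local_path, "wb") as handle:
            handle.write(container.download_blob(blob.name).readall())
        print("downloaded", blob.name, "->", local_path)
```

Calling `download_capture_files` with a prefix such as `mynamespace/myeventhub/` (hypothetical names) would fetch every captured file for that event hub into the target directory.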

An easy way to explore Avro files is by using the Avro Tools jar from Apache. You can also use Apache Drill for a lightweight SQL-driven experience or Apache Spark to perform complex distributed processing on the ingested data.

Use Apache Drill

Apache Drill is an "open-source SQL query engine for Big Data exploration" that can query structured and semi-structured data wherever it is. The engine can run as a standalone node or as a large cluster for better performance.

Native support for Azure Blob storage is available, which makes it easy to query data in an Avro file, as described in the documentation:

Apache Drill: Azure Blob Storage Plugin

To easily query captured files, you can create and run a VM with Apache Drill enabled via a container to access Azure Blob storage. See the following sample: Streaming at Scale with Event Hubs Capture.

Use Apache Spark

Apache Spark is a "unified analytics engine for large-scale data processing." It supports several languages, including SQL, and can easily access Azure Blob storage. There are a few options for running Apache Spark in Azure, and each provides easy access to Azure Blob storage.
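As an illustration, the following PySpark sketch loads captured files into a DataFrame. It assumes a Spark environment where the `spark-avro` module is available and where the storage account key has already been configured for the `wasbs` scheme. The account, container, and path pattern are hypothetical, and the function names are introduced here for illustration only.

```python
def wasbs_url(account, container, pattern):
    """Build a wasbs:// URL for a path or glob pattern inside a container."""
    return (f"wasbs://{container}@{account}.blob.core.windows.net/"
            f"{pattern.lstrip('/')}")


def read_capture(spark, account, container, pattern):
    """Load captured Avro files and expose the event body as text.

    Assumes the caller has already set the account key, for example:
    spark.conf.set(
        "fs.azure.account.key.<account>.blob.core.windows.net", "<key>")
    """
    # Imported lazily so wasbs_url can be used without PySpark installed.
    from pyspark.sql.functions import col

    df = spark.read.format("avro").load(wasbs_url(account, container, pattern))
    # Body is binary; casting to string interprets the bytes as UTF-8.
    return df.select(
        "SequenceNumber",
        "Offset",
        "EnqueuedTimeUtc",
        col("Body").cast("string").alias("BodyText"),
    )
```

With the default Capture naming convention, a pattern such as `mynamespace/myeventhub/*/*/*/*/*/*/*.avro` would match every captured file for one event hub. The resulting DataFrame can then be queried with Spark SQL or the DataFrame API.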

Use Avro Tools

Avro Tools are available as a jar package. After you download the jar file, you can see the schema of a specific Avro file by running the following command:

java -jar avro-tools-1.9.1.jar getschema <name of capture file>

This command returns:

{
    "type":"record",
    "name":"EventData",
    "namespace":"Microsoft.ServiceBus.Messaging",
    "fields":[
        {"name":"SequenceNumber","type":"long"},
        {"name":"Offset","type":"string"},
        {"name":"EnqueuedTimeUtc","type":"string"},
        {"name":"SystemProperties","type":{"type":"map","values":["long","double","string","bytes"]}},
        {"name":"Properties","type":{"type":"map","values":["long","double","string","bytes"]}},
        {"name":"Body","type":["null","bytes"]}
    ]
}

You can also use Avro Tools to convert the file to JSON format and perform other processing.
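As an example of such processing, the sketch below parses the output of the Avro Tools `tojson` subcommand, which writes one JSON object per record. It assumes Avro's standard JSON encoding, in which a non-null union value is wrapped in a single-key object naming its branch (for example, `{"bytes": "..."}`) and bytes are written as strings whose characters map one-to-one to byte values. The function names and the `events.json` file name are hypothetical.

```python
import json


def unwrap_union(value):
    """Return the payload of an Avro JSON union value.

    {"string": "x"} -> "x"; {"bytes": "..."} -> bytes; None stays None.
    """
    if isinstance(value, dict) and len(value) == 1:
        branch, payload = next(iter(value.items()))
        if branch == "bytes":
            # Avro JSON encodes each byte as the character with that code point.
            return payload.encode("latin-1")
        return payload
    return value


def decode_tojson_line(line):
    """Convert one line of tojson output into a plain Python dict."""
    record = json.loads(line)
    return {
        "sequence_number": record["SequenceNumber"],
        "offset": record["Offset"],
        "enqueued_time_utc": record["EnqueuedTimeUtc"],
        "properties": {key: unwrap_union(val)
                       for key, val in record["Properties"].items()},
        "body": unwrap_union(record.get("Body")),
    }


def read_tojson_file(path):
    """Yield one decoded event per non-empty line of a tojson dump."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield decode_tojson_line(line)
```

For instance, after running `java -jar avro-tools-1.9.1.jar tojson <name of capture file> > events.json`, iterating over `read_tojson_file("events.json")` would yield each event with its body as raw bytes.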

To perform more advanced processing, download and install Avro for the platform of your choice. At the time of this writing, implementations are available for C, C++, C#, Java, Node.js, Perl, PHP, Python, and Ruby.
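For Python, a reader might look like the following sketch. It assumes the third-party `fastavro` package (the official `avro` package would also work, with a different API) and that event bodies are UTF-8 encoded JSON, which depends entirely on what your producers send. The function names are hypothetical.

```python
import json


def decode_body(record):
    """Decode the Body field of one Capture record as UTF-8 JSON.

    Returns None for a null body. Treating the body as JSON is an
    assumption made for illustration.
    """
    body = record.get("Body")
    if body is None:
        return None
    return json.loads(body.decode("utf-8"))


def read_capture_file(path):
    """Yield (sequence number, enqueued time, decoded body) per record."""
    # Imported lazily so decode_body can be used without fastavro installed.
    from fastavro import reader

    with open(path, "rb") as handle:
        for record in reader(handle):
            yield (record["SequenceNumber"],
                   record["EnqueuedTimeUtc"],
                   decode_body(record))
```

Iterating over `read_capture_file("<name of capture file>")` would then give one tuple per captured event, in the order the events appear in the file.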

Apache Avro has complete Getting Started guides for Java and Python. You can also read the Getting started with Event Hubs Capture article.

Next steps

Event Hubs Capture is the easiest way to get data into Azure. Using Azure Data Lake, Azure Data Factory, and Azure HDInsight, you can perform batch processing and other analytics using familiar tools and platforms of your choosing, at any scale you need. See the following articles to learn more about this feature.