Extraction of ROS (bag) files

In implementations where data is ingested into the raw zone as large bag files (ROS format), these files need to be extracted before the next set of data pipelines can process them.
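To make the extraction step concrete, the following is a minimal sketch of walking the top-level record framing of a ROS bag (format version 2.0): a magic line followed by length-prefixed header and data blocks. This is illustrative only; a production extractor (for example, the `rosbag` library) also parses record headers, connection records, and message serialization.

```python
import io
import struct

# ROS bag v2.0 files start with this magic line, followed by a sequence of
# records: 4-byte little-endian header length, header bytes, 4-byte
# little-endian data length, data bytes.
MAGIC = b"#ROSBAG V2.0\n"

def iter_records(stream):
    """Yield (header_bytes, data_bytes) for each top-level bag record."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("not a ROS bag v2.0 file")
    while True:
        raw = stream.read(4)
        if not raw:
            return  # clean end of file
        (header_len,) = struct.unpack("<i", raw)
        header = stream.read(header_len)
        (data_len,) = struct.unpack("<i", stream.read(4))
        data = stream.read(data_len)
        yield header, data

# Demonstrate on a synthetic in-memory "bag" with one record whose data
# payload is b"hello" (real records carry serialized ROS messages).
fake = MAGIC + struct.pack("<i", 0) + struct.pack("<i", 5) + b"hello"
records = list(iter_records(io.BytesIO(fake)))
print(len(records), records[0][1])
```

Because bag files can be tens of gigabytes, streaming through the records like this (rather than loading the whole file) is what makes direct-mount access patterns, discussed later in this article, attractive.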

Data pre-processing

Some of the factors to consider are autoscaling, parallel processing, the ability to run container workloads, fit with the overall data processing pipeline, and cost effectiveness. The following Learn how to use Azure Batch tools and technologies section discusses the preferred options, their limitations, challenges, and best practices.

Learn how to use Azure Batch tools and technologies

There are multiple compute options in Azure, and Azure Batch is one such option for data processing. Some of the high-level feature considerations for compute are:

  • Easy integration with data pipelines (for example, Azure Data Factory (ADF) pipelines).
  • Supports long running jobs.
  • Supports parallel processing.
  • Can run container workloads.
  • Supports auto-scaling.
  • Provides options for choosing the right SKU for different workloads.

Azure Batch supports all these features, but with some limitations and challenges. Here are some best practices to overcome those limitations:

How to manage Batch application dependencies

To process workloads, processor code needs to be deployed on the Azure Batch nodes. This involves installing the dependencies that the application requires on each Batch node. Azure Batch provides a start task, which can be used to install these dependencies. However, if there are many dependencies, node readiness is delayed. Autoscaling is also affected, because every new node must install these dependencies as its first step. Setup time can vary from 3 to 30 minutes, depending on the list of required dependencies.
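As an illustration, a pool start task in the Batch REST/pool schema looks roughly like the following fragment. The package names are placeholders for whatever your processor code needs; `waitForSuccess` keeps the node out of rotation until the install finishes, which is exactly the readiness delay described above.

```json
{
  "startTask": {
    "commandLine": "/bin/bash -c 'apt-get update && apt-get install -y python3-pip && pip3 install -r requirements.txt'",
    "userIdentity": {
      "autoUser": { "elevationLevel": "admin" }
    },
    "waitForSuccess": true,
    "maxTaskRetryCount": 2
  }
}
```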

In such cases, it's a best practice to containerize the application and run container workloads on Azure Batch.
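With containers, the dependencies are baked into a prebuilt image that nodes only have to pull. A hedged sketch of the relevant pool settings is shown below; the image reference values and the registry and image names (`myregistry.azurecr.io/rosbag-extractor`) are placeholders you would replace with your own.

```json
{
  "virtualMachineConfiguration": {
    "imageReference": {
      "publisher": "microsoft-azure-batch",
      "offer": "ubuntu-server-container",
      "sku": "20-04-lts",
      "version": "latest"
    },
    "containerConfiguration": {
      "type": "dockerCompatible",
      "containerImageNames": [
        "myregistry.azurecr.io/rosbag-extractor:latest"
      ],
      "containerRegistries": [
        {
          "registryServer": "myregistry.azurecr.io",
          "userName": "<registry-user>",
          "password": "<registry-password>"
        }
      ]
    }
  }
}
```

Prefetching the image via `containerImageNames` at pool creation time means new nodes become ready without repeating a long dependency install.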

Mount Azure Blob storage by using the NFS 3.0 protocol on Batch nodes

Consider the use case where there's a need to process large files. It isn't optimal to download these files onto the Batch nodes from the storage account, extract the contents, and then upload the results back to the storage account. To ensure nodes have enough attached storage, you might also have to clean up files to free space after each job is done. Downloading and uploading contents to the storage account takes extra time.

The best practice here is to mount the storage accounts onto the Batch nodes and access the data directly. Note that NFS mounts aren't supported on Windows nodes. For more information, see Mounting storage accounts via NFS.
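A pool-level mount configuration for an NFS 3.0-enabled Blob storage account looks roughly like the following fragment. The storage account name (`mystorageaccount`) and container name (`raw-bags`) are placeholders; the share then appears under the Batch mounts directory at the given `relativeMountPath` on each Linux node.

```json
{
  "mountConfiguration": [
    {
      "nfsMountConfiguration": {
        "source": "mystorageaccount.blob.core.windows.net:/mystorageaccount/raw-bags",
        "relativeMountPath": "data",
        "mountOptions": "-o sec=sys,vers=3,nolock,proto=tcp"
      }
    }
  ]
}
```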

Running container workloads on Azure Batch

ADF doesn't natively support running container workloads on Azure Batch. The Modern Data Warehouse includes a code sample that works around this limitation:

Running container workloads from Azure Data Factory on Azure Batch for data processing

For more information