Custom data upload clients

From: Developing big data solutions on Microsoft Azure HDInsight

The topics in this section demonstrate how you can use PowerShell and the .NET SDKs to upload and serialize data.

You can also use the AzCopy utility in scripts to automate uploading data to HDInsight. For more details, see AzCopy – Uploading/Downloading files for Windows Azure Blobs on the Azure storage team blog. In addition, a library called Casablanca can be used to access Azure storage from native C++ code. For more details, see Announcing Casablanca, a Native Library to Access the Cloud From C++.
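
As an illustration, the following PowerShell fragment is a minimal sketch of invoking AzCopy from a script. The installation path, account name, container, and key shown here are placeholders, and the exact parameter syntax varies between AzCopy releases, so treat this as a starting point rather than a definitive command.

  # Placeholder values - replace with your own storage account details.
  $source = "C:\Data\Logs"
  $dest   = "https://mystorageaccount.blob.core.windows.net/mycontainer/data"
  $key    = "<storage-account-key>"

  # /S copies the folder recursively; /Y suppresses confirmation prompts.
  & "C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\AzCopy.exe" `
      /Source:$source /Dest:$dest /DestKey:$key /S /Y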

Considerations

Consider the following factors when designing your automated data ingestion processes:

  • Consider how much effort is required to create an automated upload solution, and balance this against the advantages it provides. If you are simply exploring data iteratively, you may not need an automated solution at all. Creating automated upload processes is usually worthwhile only when you will repeat the operation on a regular basis, or when you need to integrate big data processing into a business application.
  • When creating custom tools or scripts to upload data to a cluster, consider including the ability to accept command-line parameters so that the tools can be used in a range of automation processes.
  • Consider how you will protect the data, the cluster, and the solution as a whole from inappropriate use of custom upload tools and applications. It may be possible to set permissions on tools, files, folders, and other resources to restrict access to only authorized users.
  • PowerShell is a good solution for uploading data files in scenarios where users are exploring data iteratively and need a simple, repeatable way to upload source data for processing. You can also use PowerShell as part of an automated processing solution in which data is uploaded automatically by a scheduled operating system task or SQL Server Integration Services package. A minimal sketch of such a script appears after this list.
  • .NET Framework code that uses the .NET SDK for HDInsight can be used to upload data for processing by HDInsight jobs. This may be a better choice than using PowerShell for large volumes of data.
  • In addition to the HDInsight-specific APIs for uploading data to the cluster, the more general Azure Storage API offers greater flexibility by allowing you to upload data directly to Azure blob storage as files, or write data directly to blobs in an Azure blob storage container. This enables you to build client applications that capture real-time data and write it directly to a blob for processing in HDInsight without first storing the data in local files; a sketch of this approach also follows the list.
  • Other tools and frameworks are available that can help you to build data ingestion mechanisms. For example, Apache Falcon provides an automatable system for data replication, data lifecycle management (such as data eviction), data lineage and tracing, and process coordination and scheduling based on a declarative programming model.
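
As an example of the PowerShell approach described above, the following is a minimal sketch of a repeatable, parameterized upload script built on the Azure PowerShell storage cmdlets (New-AzureStorageContext and Set-AzureStorageBlobContent). The account name, key, container, and folder values are placeholders, and the parameters make the script reusable from a scheduled task or an SSIS package.

  # Parameters allow the script to be reused from scheduled tasks or SSIS.
  param(
      [string]$StorageAccountName = "mystorageaccount",
      [string]$StorageAccountKey  = "<storage-account-key>",
      [string]$ContainerName      = "mycontainer",
      [string]$LocalFolder        = "C:\Data\Logs"
  )

  # Create a context for the storage account that backs the HDInsight cluster.
  $context = New-AzureStorageContext -StorageAccountName $StorageAccountName `
                                     -StorageAccountKey $StorageAccountKey

  # Upload each file in the folder to the container, preserving file names.
  Get-ChildItem $LocalFolder -File | ForEach-Object {
      Set-AzureStorageBlobContent -File $_.FullName `
                                  -Container $ContainerName `
                                  -Blob "data/$($_.Name)" `
                                  -Context $context -Force
  }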

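To sketch the direct-to-blob technique mentioned in the list, the following fragment loads the Azure Storage client library (Microsoft.WindowsAzure.Storage.dll, which ships with the Azure SDK and the WindowsAzure.Storage NuGet package) and writes a string straight to a block blob without creating a local file first. The DLL path, connection string, and blob names are placeholders; in a real client application the same calls would typically be made from .NET code.

  # Load the storage client library (the path is a placeholder).
  Add-Type -Path "Microsoft.WindowsAzure.Storage.dll"

  $connectionString = "DefaultEndpointsProtocol=https;AccountName=mystorageaccount;AccountKey=<storage-account-key>"
  $account   = [Microsoft.WindowsAzure.Storage.CloudStorageAccount]::Parse($connectionString)
  $client    = $account.CreateCloudBlobClient()
  $container = $client.GetContainerReference("mycontainer")

  # Write captured data directly to a block blob - no local file is required.
  $blob = $container.GetBlockBlobReference("data/events/event-001.txt")
  $blob.UploadText("sensor-reading,2014-01-01T00:00:00Z,42")
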
More information

For information about creating end-to-end automated solutions that include automated upload stages, see Building end-to-end solutions using HDInsight.

For more details of the tools and technologies available for automating upload processes, see Appendix A - Tools and technologies reference.

For information on using PowerShell with HDInsight, see HDInsight PowerShell Cmdlets Reference Documentation.

For information on using the HDInsight SDK, see HDInsight SDK Reference Documentation and the incubator projects on the CodePlex website.
