HDInsight provides a Hadoop distributed file system (HDFS) over Azure Storage and Azure Data Lake Storage Gen2. Both are designed as HDFS extensions, enabling the full set of components in the Hadoop environment to operate directly on the data they manage. Azure Storage and Data Lake Storage Gen2 are distinct file systems, each optimized for storing data and for computations on that data. For information about the benefits of using Azure Storage, see Use Azure Storage with HDInsight. See also Use Data Lake Storage Gen2 with HDInsight.
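Each file system is addressed from Hadoop tools through its own URI scheme. As a minimal sketch (the container and account names below are placeholders), listing a path on each looks like this:

```bash
# Azure Storage uses the WASB driver (wasb://, or wasbs:// for TLS).
hadoop fs -ls wasbs://<container>@<account>.blob.core.windows.net/example/data

# Data Lake Storage Gen2 uses the ABFS driver (abfs://, or abfss:// for TLS).
hadoop fs -ls abfss://<container>@<account>.dfs.core.windows.net/example/data
```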
For example, `hadoop fs -copyFromLocal data.txt /example/data/data.txt`
Because the default file system for HDInsight is in Azure Storage, `/example/data/data.txt` is actually in Azure Storage. You can also refer to the file as:
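Assuming the default container is on Azure Storage, the same file can be addressed either by the relative WASB URI or by the fully qualified form (the container and account names are placeholders):

```bash
# Relative to the cluster's default container:
wasbs:///example/data/data.txt

# Fully qualified:
wasbs://<ContainerName>@<StorageAccountName>.blob.core.windows.net/example/data/data.txt
```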
On Apache HBase clusters, the default block size used when writing data is 256 KB. This works fine when using HBase APIs or REST APIs, but using the `hadoop` or `hdfs dfs` commands to write data larger than ~12 GB results in an error, because a block blob in Azure Storage supports at most 50,000 blocks (50,000 × 256 KB ≈ 12 GB). For more information, see storage exception for write on blob.
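One possible workaround (a sketch, not the only fix) is to raise the WASB write block size for a single command via the hadoop-azure property `fs.azure.write.request.size`; the file name here is just a placeholder:

```bash
# Raise the block size from 256 KB to 4 MB for this copy only,
# which raises the maximum writable blob size to roughly 190 GB.
hadoop fs -D fs.azure.write.request.size=4194304 -copyFromLocal test_large_file.bin /example/data
```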
Graphical clients
There are also several applications that provide a graphical interface for working with Azure Storage. The following table lists a few of these applications:
The Azure Data Factory service is a fully managed service for composing data storage, processing, and movement services into streamlined, adaptable, and reliable data production pipelines.
Sqoop is a tool designed to transfer data between Hadoop and relational databases. Use it to import data from a relational database management system (RDBMS), such as SQL Server, MySQL, or Oracle, into the Hadoop distributed file system (HDFS). Transform the data in Hadoop with MapReduce or Hive, and then export the data back into the RDBMS.
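As a minimal sketch of that round trip (the server, database, table, and path names are hypothetical), a Sqoop import and export might look like this:

```bash
# Import a SQL Server table into HDFS (all names below are placeholders).
sqoop import \
  --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb" \
  --username sqluser --password-file /user/sqluser/.sqoop-password \
  --table Customers \
  --target-dir /example/data/customers

# After transforming the data with MapReduce or Hive,
# export the results back into an RDBMS table.
sqoop export \
  --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb" \
  --username sqluser --password-file /user/sqluser/.sqoop-password \
  --table CustomerSummary \
  --export-dir /example/data/customer_summary
```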
Demonstrate an understanding of common data engineering tasks to implement and manage data engineering workloads on Microsoft Azure, using a range of Azure services.