Azure HDInsight FAQ
Below are some frequently asked questions about Windows Azure HDInsight:
General
Q: What is HDInsight?
HDInsight is a Hadoop-based service from Microsoft that brings a 100 percent Apache Hadoop solution to the cloud. A modern, cloud-based data platform that manages data of any type, whether structured or unstructured, and of any size, HDInsight makes it possible for you to gain the full value of big data.
Q: Where can I find more information about HDInsight?
See windowsazure.com.
Q: What are the benefits of using HDInsight?
The benefits include:
- Insights with familiar tools.
- Deployment agility.
- Enterprise-ready Hadoop.
- Rich developer experience.
See HDInsight Service.
Q: How much does HDInsight cost?
See HDInsight Pricing Details.
Q: How can I enable HDInsight?
Windows Azure HDInsight has been released to the public. You need a Windows Azure account to use it. If you don't have a Windows Azure account, you can create a free trial account in just a couple of minutes. For details, see Windows Azure Free Trial.
Q: What are the Hadoop components supported by HDInsight?
The default cluster version used by Windows Azure HDInsight is 2.1. It is based on the Hortonworks Data Platform version 1.3.0 and provides Hadoop services with the component versions itemized in the following table:
Component | Version |
Apache Hadoop | 1.2.0 |
Apache Hive | 0.11.0 |
Apache Pig | 0.11 |
Apache Sqoop | 1.4.3 |
Apache Oozie | 3.2.2 |
Apache HCatalog | Merged with Hive |
Apache Templeton | Merged with Hive |
Ambari | API v1.0 |
For more information, see What version of Hadoop is in Windows Azure HDInsight?
Q: Is HBase currently supported by HDInsight?
No. For a list of supported components, see What version of Hadoop is in Windows Azure HDInsight?
Q: How can I connect to the head node of the cluster using RDP?
You must enable remote access from the Windows Azure Management portal before you can connect to the head node using RDP. To enable remote access:
- Log on to the Windows Azure Management portal.
- Click HDInsight on the left. You should see a list of clusters on the right.
- Click the cluster you want to connect to using RDP.
- Click CONFIGURATION from the top.
- Click ENABLE REMOTE from the bottom.
You must create an RDP user account. The account username must be different from the Hadoop user account username you created during the provisioning process. You must also set an expiration date, which must fall between today and seven days from today. You don't usually need more than seven days, because most tasks can be performed using Windows Azure PowerShell. For more information, see the following articles:
- Administer HDInsight clusters using Management Portal
- Administer HDInsight using PowerShell
- Submit Hadoop jobs programmatically
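The two constraints on the RDP account described above (a username that differs from the Hadoop username, and an expiration date no more than seven days away) can be sketched as a small pre-check. This is only an illustration: the class and method names are hypothetical, and the case-insensitive username comparison and the inclusive date boundaries are assumptions, not documented portal behavior.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class RdpAccountCheck {
    // Maximum lifetime of the RDP account in days, as stated in this FAQ.
    static final long MAX_DAYS = 7;

    /** True when the expiration date lies between today and today + 7 days (inclusive, by assumption). */
    public static boolean isValidExpiration(LocalDate today, LocalDate expiration) {
        long days = ChronoUnit.DAYS.between(today, expiration);
        return days >= 0 && days <= MAX_DAYS;
    }

    /** True when the RDP username differs from the Hadoop username (case-insensitive, by assumption). */
    public static boolean isValidUserName(String rdpUser, String hadoopUser) {
        return !rdpUser.equalsIgnoreCase(hadoopUser);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.now();
        System.out.println(isValidExpiration(today, today.plusDays(3)));   // within the window
        System.out.println(isValidExpiration(today, today.plusDays(10)));  // too far in the future
        System.out.println(isValidUserName("rdpadmin", "admin"));          // distinct names
    }
}
```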
Cluster provisioning
Q: What are the options for provisioning an HDInsight cluster?
There are four ways you can provision an HDInsight cluster:
- Windows Azure PowerShell
- Windows Azure Management portal
- HDInsight .NET SDK
- Cross-platform command line interface
See Provision HDInsight clusters.
Q: Where does my HDInsight cluster store its data?
An HDInsight cluster uses a Windows Azure Blob storage container as the default file system for its system files. The same container can be used to store data files. You can also add more Windows Azure storage accounts during the provisioning process. See Using Windows Azure Blob storage for HDInsight.
Q: How can I add more storage accounts to my cluster?
You can add storage accounts during the provisioning process. See Provision HDInsight clusters.
Q: Can two clusters share a Blob storage container as their default file system container?
No.
Q: Can I reuse a Blob storage container as the default file system container?
Q: Should I store the data file in the default file system container or a different Blob storage container?
HDFS and WASB
Q: What is WASB?
WASB stands for Windows Azure Storage Blob, that is, Windows Azure Blob storage. Windows Azure HDInsight uses Windows Azure Blob storage as the default file system. You can also link other Windows Azure storage accounts during the provisioning process. Because Hadoop is designed for offline batch processing, it is recommended that you provision a cluster, run MapReduce jobs, and then delete the cluster to save cost. Windows Azure Blob storage retains the data after the cluster is deleted. For more information, see Using Blob storage with HDInsight.
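Data in Blob storage is addressed with a wasb:// URI of the form wasb://<container>@<account>.blob.core.windows.net/<path>. The following minimal sketch assembles such a URI with java.net.URI. The class name, method name, and the sample container, account, and path values are hypothetical and used only for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class WasbUri {
    /** Builds wasb://<container>@<account>.blob.core.windows.net/<path>. */
    public static URI build(String container, String account, String path) {
        String host = account + ".blob.core.windows.net";
        // The path component of a hierarchical URI with an authority must start with "/".
        String absolutePath = path.startsWith("/") ? path : "/" + path;
        try {
            // URI(scheme, userInfo, host, port, path, query, fragment); the container is the user-info part.
            return new URI("wasb", container, host, -1, absolutePath, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid WASB location", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical container, account, and blob path.
        System.out.println(build("mycontainer", "mystorageaccount", "example/data/sample.log"));
    }
}
```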
Q: Where can I run Hadoop HDFS shell commands?
HDInsight uses Windows Azure Blob storage (WASB) as the default file system. Typically, you use Windows Azure PowerShell to perform most file operations. To run HDFS shell commands, connect to the cluster using RDP, and then open the Hadoop command line. For more information about administering HDInsight, see Administer HDInsight clusters using Management Portal. For a list of HDFS commands, see the Apache Hadoop website.
Q: How can I open Hadoop command line?
You must first connect to the cluster using RDP. From the desktop, click Hadoop command line. For more information, see Administer HDInsight clusters using Management Portal.
Q: What are the 0 byte files shown in Windows Azure explorers?
Using Windows Azure Explorer tools, you may notice some 0 byte files with the same file names as the folder names. These files serve two purposes:
- In case of empty folders, they serve as a marker of the existence of the folder. Blob storage is clever enough to know that if a blob exists called foo/bar then there is a folder called foo. But if you want to have an empty folder called foo, then the only way to signify that is by having this special 0 byte file in place.
- They hold some special metadata needed by the Hadoop file system, notably the permissions and owners for the folders. So do NOT delete these files unless you want to delete the corresponding folders.
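Before cleaning up a container, it can help to list which 0 byte blobs double as folder markers. The sketch below works purely on blob names and sizes held in a map, so it does not call any storage API. The class name, method name, and sample blob names are hypothetical. Note that a name-based check like this only finds markers of non-empty folders: a 0 byte blob with no children could be either an empty-folder marker or an ordinary empty file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FolderMarkerCheck {
    /** Returns the zero-byte blobs whose name is also the folder prefix of at least one other blob. */
    public static List<String> markersOfNonEmptyFolders(Map<String, Long> blobSizes) {
        List<String> markers = new ArrayList<>();
        for (Map.Entry<String, Long> entry : blobSizes.entrySet()) {
            if (entry.getValue() != 0L) {
                continue; // only zero-byte blobs can be folder markers
            }
            String folderPrefix = entry.getKey() + "/";
            for (String other : blobSizes.keySet()) {
                if (other.startsWith(folderPrefix)) {
                    markers.add(entry.getKey());
                    break;
                }
            }
        }
        return markers;
    }

    public static void main(String[] args) {
        // Hypothetical container listing: blob name -> size in bytes.
        Map<String, Long> blobs = new LinkedHashMap<>();
        blobs.put("foo", 0L);
        blobs.put("foo/bar", 120L);
        blobs.put("empty", 0L);
        blobs.put("data.txt", 42L);
        System.out.println(markersOfNonEmptyFolders(blobs));
    }
}
```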
Q: How can I get data up to HDInsight?
Windows Azure HDInsight provides two options for managing its data: Windows Azure Blob storage (WASB) and Hadoop Distributed File System (HDFS). HDFS is designed to store data used by Hadoop applications. Data stored in Windows Azure Blob storage can be accessed by Hadoop applications using WASB, which provides a full-featured HDFS file system over Windows Azure Blob storage. It has been designed as an HDFS extension to provide a seamless experience to customers by enabling the full set of components in the Hadoop ecosystem to operate directly on the data it manages. The two options are distinct file systems, each optimized for storing data and running computations on that data. For the benefits of using WASB, see Using Windows Azure Blob Storage with HDInsight. For more information on uploading data to HDInsight, see Upload data to HDInsight.
Q: Why am I getting the org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /lineitemData already exists error?
You get this error when you run a MapReduce job and the output folder already exists.
If the folder doesn't exist and you still get the same error, the cause could be a 0 byte file with the same name as the folder. You must delete that 0 byte file. For more information, see the "What are the 0 byte files shown in Windows Azure explorers?" question in this FAQ.
MapReduce
Q: Can I write MapReduce programs in a .NET language?
You can use the Hadoop streaming API to run MapReduce jobs written in programming languages other than Java. For a C# Hadoop streaming example, see Develop and deploy a Hadoop streaming job to HDInsight.
Q: How can I submit MapReduce Jobs?
There are several ways to submit MapReduce jobs on HDInsight clusters:
- Windows Azure PowerShell
- HDInsight .NET SDK
- Hadoop command line
- Cross-platform command line interface. Currently the CLI only supports cluster management functions, such as provisioning, listing, and deleting clusters. The job submission function is under development.
For more information, see Submit Hadoop jobs.
Q: How can I find out the status of a MapReduce job?
Q: How can I get the output of a MapReduce job?
Q: Where is the source code for the hadoop-example.jar file samples?
Windowsazure.com has several articles that walk you through the samples. You can find the source code in those articles. The source code can also be found on GitHub.
Q: Why am I getting the following errors?
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: -Dmapreduce.job.credentials.binary=c:/hdfs/mapred/local/taskTracker/admin/jobcache/job_201311272153_0006/jobToken
Templeton added a feature that passes -Dmapreduce.job.credentials.binary in the command. This feature causes a regression in some MapReduce classes that do not parse their arguments completely.
HDP1.3:
[cmd, /c, call, C:\hadoop\hadoop-1.2.0-SNAPSHOT/bin/hadoop.cmd, jar, WordCount-1.0-SNAPSHOT.jar, com.microsoft.hdinsight.samples.WordCount, -D"mapreduce.job.credentials.binary=C:/hdp/data/hdfs/mapred/local/taskTracker/hadoop/jobcache/job_201308221621_0027/jobToken", "/Test/testinput.txt", "/Test/output"]
HDP1.1:
[cmd, /c, call, C:\hadoop\hadoop-1.2.0-SNAPSHOT/bin/hadoop.cmd, jar, WordCount-1.0-SNAPSHOT.jar, com.microsoft.hdinsight.samples.WordCount, "/Test/testinput.txt", "/Test/output"]
In the MapReduce class that fails, the main function parses the parameters as follows:
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
This code mistakes the generic option for the input path, which causes the error in the Path constructor.
To fix the problem, use the following code instead:
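// Let GenericOptionsParser consume generic options such as -D and return the remaining arguments
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
    System.err.println("Usage: wordcount <in> <out>");
    System.exit(2);
}
// Only the remaining arguments are the input and output paths
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));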
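The failure can be reproduced without a cluster, because the exception in the error message comes from java.net.URI. The following sketch uses only the JDK. It shows that the -D argument from the error message does not parse as a URI, while the real input path does. The positionalArgs method is a deliberately simplified stand-in that only drops arguments starting with -D. It is not Hadoop's GenericOptionsParser, which handles more option forms, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

public class GenericOptionDemo {
    /** True when the string parses as a URI, which a path-like argument must do. */
    public static boolean parsesAsUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Simplified stand-in for generic option handling: drops -D arguments, keeps the rest in order. */
    public static String[] positionalArgs(String[] args) {
        List<String> rest = new ArrayList<>();
        for (String arg : args) {
            if (!arg.startsWith("-D")) {
                rest.add(arg);
            }
        }
        return rest.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] received = {
            "-Dmapreduce.job.credentials.binary=c:/hdfs/mapred/local/taskTracker/admin/jobcache/job_201311272153_0006/jobToken",
            "/Test/testinput.txt",
            "/Test/output"
        };
        // Reading by position treats the -D option as the input path, and it is not a valid URI.
        System.out.println(parsesAsUri(received[0]));
        // After the generic option is filtered out, position 0 is the real input path.
        String[] rest = positionalArgs(received);
        System.out.println(rest[0] + " " + parsesAsUri(rest[0]));
    }
}
```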
Hive
Q: How can I run Hive Queries?
You can use Windows Azure PowerShell cmdlets to execute Hive queries. These cmdlets are Invoke-Hive and Start-AzureHDInsightJob. For some samples, see Submit Hadoop jobs programmatically.
To open the Hive command line, you must first connect to the cluster using RDP, and then follow these steps:
- Double-click Hadoop command line on the desktop.
- From the Hadoop command line, enter the following commands to run the show tables Hive command:
cd %hive_home%\bin
hive
show tables;
Q: What are the differences between internal table and external table?
Internal tables are also known as managed tables. For an internal table, both the data and the metadata are managed by Hive, whereas for an external table, only the metadata is managed by Hive and the table points to a location where the data is already present or is going to be loaded. If data files are already present in HDFS, you can create external tables that point to those files and query them without explicitly loading the files into a Hive table. If an internal table is dropped, both data and metadata are erased, whereas if an external table is dropped, the data remains intact and only the metadata is erased. Internal tables are analogous to tables in SQL Server, and external tables are analogous to views in SQL Server.
Q: Why can't I create an external Hive table from the data stored outside the default file system container?
This is a Hive restriction.
Pig
Q: How can I open Pig command line?
Q: How can I submit a Pig job?
Sqoop
Q: What is the syntax of JDBC connection string?
Oozie
HCatalog
Templeton
Windows Azure PowerShell
Q: Where can I get Windows Azure PowerShell?
You can get it from the WindowsAzure.com Download page.
Q: How can I hide the credential prompt when calling Start-AzureHDInsightJob?
When you call Start-AzureHDInsightJob, the cmdlet prompts you to enter the Hadoop user credential. To avoid the prompt, you can use the -Credentials parameter to supply a PSCredential object. Here is a sample:
$clusterName = "myhdicluster"
$username = "admin"
$password = "Pass@word1"
# Create a PSCredential object from the user name and password
$securePassword = ConvertTo-SecureString $password -AsPlainText -Force
$creds = New-Object System.Management.Automation.PSCredential ($username, $securePassword)
# Submit the job to the cluster. $HiveJobDef is assumed to hold a job definition created earlier in the script.
$HiveJob = Start-AzureHDInsightJob -Credentials $creds -Cluster $clusterName -JobDefinition $HiveJobDef
HDInsight .NET SDK
Other Languages
FAQ do Windows Azure HDInsight (pt-BR)
See Also
Another important place to find an extensive amount of Cortana Intelligence Suite related articles is the TechNet Wiki itself. The best entry point is Cortana Intelligence Suite Resources on the TechNet Wiki.