Troubleshoot Apache Hadoop HDFS by using Azure HDInsight

07/30/2024

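Learn about the most common issues and their resolutions when working with the Hadoop Distributed File System (HDFS). For a full list of commands, see the HDFS Commands Guide and the File System Shell Guide.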

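How do I access the local HDFS from inside a cluster?

Issue

Access the local HDFS from the command line and from application code, instead of using Azure Blob storage or Azure Data Lake Storage, from inside the HDInsight cluster.

Resolution steps

At the command prompt, use hdfs dfs -D "fs.default.name=hdfs://mycluster/" ... literally, as in the following command: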

hdfs dfs -D "fs.default.name=hdfs://mycluster/" -ls /
Found 3 items
drwxr-xr-x   - hdiuser hdfs          0 2017-03-24 14:12 /EventCheckpoint-30-8-24-11102016-01
drwx-wx-wx   - hive    hdfs          0 2016-11-10 18:42 /tmp
drwx------   - hdiuser hdfs          0 2016-11-10 22:22 /user
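
To avoid retyping the -D option on every call, the command above can be wrapped in a small shell function. This is a minimal sketch: the function name lhdfs is hypothetical and not part of Hadoop or HDInsight, and the URI hdfs://mycluster/ is the one used throughout this article.

```shell
# Hypothetical helper (not part of Hadoop): run any "hdfs dfs" subcommand
# against the cluster-local HDFS instead of the default Azure storage.
lhdfs() {
  hdfs dfs -D "fs.default.name=hdfs://mycluster/" "$@"
}

# Usage on a cluster node:
#   lhdfs -ls /
#   lhdfs -du -s -h /tmp
```

Because the function forwards all of its arguments unchanged, any file system shell subcommand can follow it.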

From source code, use the URI hdfs://mycluster/ literally, as in the following sample application:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class JavaUnitTests {

    public static void main(String[] args) throws Exception {

        // Point the client at the cluster-local HDFS instead of the default Azure storage.
        Configuration conf = new Configuration();
        String hdfsUri = "hdfs://mycluster/";
        conf.set("fs.defaultFS", hdfsUri);
        FileSystem fileSystem = FileSystem.get(URI.create(hdfsUri), conf);

        // Recursively list every file under /tmp and print its full path.
        RemoteIterator<LocatedFileStatus> fileStatusIterator = fileSystem.listFiles(new Path("/tmp"), true);
        while (fileStatusIterator.hasNext()) {
            System.out.println(fileStatusIterator.next().getPath().toString());
        }
    }
}

Run the compiled .jar file (for example, a file named java-unit-tests-1.0.jar) on the HDInsight cluster with the following command:

hadoop jar java-unit-tests-1.0.jar JavaUnitTests
hdfs://mycluster/tmp/hive/hive/5d9cf301-2503-48c7-9963-923fb5ef79a7/inuse.info
hdfs://mycluster/tmp/hive/hive/5d9cf301-2503-48c7-9963-923fb5ef79a7/inuse.lck
hdfs://mycluster/tmp/hive/hive/a0be04ea-ae01-4cc4-b56d-f263baf2e314/inuse.info
hdfs://mycluster/tmp/hive/hive/a0be04ea-ae01-4cc4-b56d-f263baf2e314/inuse.lck

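Storage exception for write on blob

Issue

When you use the hadoop or hdfs dfs commands to write files that are ~12 GB or larger on an HBase cluster, you might encounter the following error: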

ERROR azure.NativeAzureFileSystem: Encountered Storage Exception for write on Blob : example/test_large_file.bin._COPYING_ Exception details: null Error Code : RequestBodyTooLarge
copyFromLocal: java.io.IOException
        at com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:661)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:366)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:350)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.microsoft.azure.storage.StorageException: The request body is too large and exceeds the maximum permissible limit.
        at com.microsoft.azure.storage.StorageException.translateException(StorageException.java:89)
        at com.microsoft.azure.storage.core.StorageRequest.materializeException(StorageRequest.java:307)
        at com.microsoft.azure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:182)
        at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlockInternal(CloudBlockBlob.java:816)
        at com.microsoft.azure.storage.blob.CloudBlockBlob.uploadBlock(CloudBlockBlob.java:788)
        at com.microsoft.azure.storage.blob.BlobOutputStream$1.call(BlobOutputStream.java:354)
        ... 7 more

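Cause

HBase on HDInsight clusters defaults to a block size of 256 KB when writing to Azure Storage. This block size works for the HBase APIs and REST APIs, but it results in an error when you use the hadoop or hdfs dfs command-line utilities.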
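
One plausible reading of the ~12 GB threshold, offered as an assumption rather than something this article states: Azure Storage documents a limit of 50,000 committed blocks per block blob, so the block size caps the largest file that can be written. The sketch below works through that arithmetic; the function name max_blob_gib is hypothetical.

```shell
# Assumption: a block blob can hold at most 50,000 committed blocks.
MAX_BLOCKS=50000

# Print the largest file size, in whole GiB, that fits for a block size given in bytes.
max_blob_gib() {
  echo $(( MAX_BLOCKS * $1 / 1024 / 1024 / 1024 ))
}

max_blob_gib 262144    # default 256 KB block size -> prints 12
max_blob_gib 4194304   # 4 MB block size           -> prints 195
```

Under that assumption, the default 256 KB block size allows roughly 12 GiB per file, which matches the file size at which the error is reported.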

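Resolution

Use fs.azure.write.request.size to specify a larger block size. You can make this change on a per-use basis by using the -D parameter. The following command is an example of using this parameter with the hadoop command: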

hadoop fs -D fs.azure.write.request.size=4194304 -copyFromLocal test_large_file.bin /example/data
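
The per-use override can also be wrapped in a helper that takes the block size in MiB and converts it to the byte value the property expects. This is a minimal sketch: the function name copy_with_request_size is hypothetical, and it assumes the hadoop fs form of the file system shell.

```shell
# Hypothetical helper: copy a local file with a larger Azure write request size.
# $1: request size in MiB, $2: local source file, $3: destination path
copy_with_request_size() {
  size_bytes=$(( $1 * 1024 * 1024 ))
  hadoop fs -D "fs.azure.write.request.size=${size_bytes}" -copyFromLocal "$2" "$3"
}

# Usage on a cluster node (4 MiB = 4194304 bytes):
#   copy_with_request_size 4 test_large_file.bin /example/data
```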

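You can also increase the value of fs.azure.write.request.size globally by using Apache Ambari. Use the following steps to change the value in the Ambari Web UI:

In your browser, go to the Ambari Web UI for your cluster. The URL is https://CLUSTERNAME.azurehdinsight.net, where CLUSTERNAME is the name of your cluster. When prompted, enter the admin name and password for the cluster.
From the left side of the screen, select HDFS, and then select the Configs tab.
In the Filter... field, enter fs.azure.write.request.size.
Change the value from 262144 (256 KB) to the new value. For example, 4194304 (4 MB).

For more information on using Ambari, see Manage HDInsight clusters using the Apache Ambari Web UI.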

du

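The -du command displays the sizes of the files and directories contained in the given directory, or the length of the file if the argument is a single file.

The -s option produces an aggregate summary of the file lengths being displayed.
The -h option formats the file sizes in a human-readable form.

Example: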

hdfs dfs -du -s -h hdfs://mycluster/
hdfs dfs -du -s -h hdfs://mycluster/tmp
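
When the goal is to find what is consuming space, the raw -du output (without -h, so that sizes stay numeric) can be sorted by its first column. This is a minimal sketch: the function name largest_entries is hypothetical, and it assumes the first column of the -du output is the size in bytes.

```shell
# Hypothetical helper: show the largest entries directly under an HDFS path.
# $1: HDFS path, $2: number of entries to show (defaults to 5)
largest_entries() {
  hdfs dfs -du "$1" | sort -k1,1nr | head -n "${2:-5}"
}

# Usage on a cluster node:
#   largest_entries hdfs://mycluster/tmp 10
```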

rm

The -rm command deletes the files specified as arguments.

Example:

hdfs dfs -rm hdfs://mycluster/tmp/testfile
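
In a script, it can help to confirm that a path exists before deleting it, so a missing file is reported explicitly. This is a minimal sketch using the file system shell's -test -e check; the function name remove_if_exists is hypothetical.

```shell
# Hypothetical helper: delete an HDFS file only if it exists.
# $1: HDFS path of the file to delete
remove_if_exists() {
  if hdfs dfs -test -e "$1"; then
    hdfs dfs -rm "$1"
  else
    echo "not found: $1" >&2
    return 1
  fi
}

# Usage on a cluster node:
#   remove_if_exists hdfs://mycluster/tmp/testfile
```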

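Next steps

If you didn't see your problem or are unable to solve your issue, visit one of the following channels for more support:

Get answers from Azure experts through Azure Community Support.
Connect with @AzureSupport, the official Microsoft Azure account for improving customer experience. It connects the Azure community to the right resources: answers, support, and experts.
If you need more help, you can submit a support request from the Azure portal. Select Support from the menu bar or open the Help + support hub. For more detailed information, see How to create an Azure support request. Access to subscription management and billing support is included with your Microsoft Azure subscription, and technical support is provided through one of the Azure support plans.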
