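FileDataset Class

Reference

Represents a collection of file references in datastores or public URLs to use in Azure Machine Learning.

A FileDataset defines a series of lazily evaluated, immutable operations to load data from the data source into file streams. Data is not loaded from the source until the FileDataset is asked to deliver data.

A FileDataset is created using the from_files method of the FileDatasetFactory class.

For more information, see the article Add & register datasets. To get started working with a file dataset, see https://aka.ms/filedataset-samplenotebook.

Initialize the FileDataset object.

This constructor is not supposed to be invoked directly. Datasets are intended to be created using the FileDatasetFactory class.

Inheritance: AbstractDataset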

FileDataset

Constructor

FileDataset()

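Remarks

A FileDataset can be used as input to an experiment run. It can also be registered to a workspace with a specified name and retrieved by that name later.

A FileDataset can be subsetted by invoking the different subsetting methods available on this class. The result of subsetting is always a new FileDataset.

The actual data loading happens when the FileDataset is asked to deliver the data into another storage mechanism (for example, files downloaded or mounted to a local path).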
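
The lazy, immutable behavior described in these remarks can be illustrated with a small plain-Python model. This is only a sketch and not the SDK's implementation: `LazyFileSet`, `list_files` and `calls` are hypothetical names introduced here, and the only assumption is that subsetting records a pending step instead of touching the data.

```python
from typing import Callable, List, Tuple


class LazyFileSet:
    """Toy model (NOT azureml code): a source plus an immutable tuple of steps."""

    def __init__(self, source: Callable[[], List[str]],
                 steps: Tuple[Tuple[str, int], ...] = ()):
        self._source = source
        self._steps = steps

    def take(self, count: int) -> "LazyFileSet":
        # Return a new object; the original is left unchanged.
        return LazyFileSet(self._source, self._steps + (("take", count),))

    def skip(self, count: int) -> "LazyFileSet":
        return LazyFileSet(self._source, self._steps + (("skip", count),))

    def to_path(self) -> List[str]:
        # The source is only read here, when data is actually requested.
        paths = list(self._source())
        for name, count in self._steps:
            paths = paths[:count] if name == "take" else paths[count:]
        return paths


calls: List[int] = []  # records every read of the hypothetical source


def list_files() -> List[str]:
    calls.append(1)
    return ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]


full = LazyFileSet(list_files)
subset = full.skip(1).take(2)     # only steps are recorded, nothing is read
loads_before = len(calls)
subset_paths = subset.to_path()   # the source is read now
full_paths = full.to_path()       # the original dataset is unaffected
```

In this model, chaining `skip` and `take` costs nothing until `to_path` is called, which mirrors the statement that data loading happens only when delivery is requested.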

Methods

as_cache	Note: This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information. Create a DatacacheConsumptionConfig mapped to a datacache_store and a dataset.
as_download	Create a DatasetConsumptionConfig with the mode set to download. In the submitted run, files in the dataset will be downloaded to a local path on the compute target. The download location can be retrieved from argument values and the input_datasets field of the run context. An input name is generated automatically. To specify a custom input name, call the as_named_input method.
as_hdfs	Set the mode to hdfs. In the submitted synapse run, files in the dataset will be converted to a local path on the compute target. The hdfs path can be retrieved from argument values and the OS environment variables.
as_mount	Create a DatasetConsumptionConfig with the mode set to mount. In the submitted run, files in the dataset will be mounted to a local path on the compute target. The mount point can be retrieved from argument values and the input_datasets field of the run context. An input name is generated automatically. To specify a custom input name, call the as_named_input method.
download	Download file streams defined by the dataset as local files.
file_metadata	Note: This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information. Get a file metadata expression by specifying the metadata column name. Supported file metadata columns are Size, LastModifiedTime, CreationTime, Extension and CanSeek.
filter	Note: This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information. Filter the data, leaving only the records that match the specified expression.
hydrate	Note: This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information. Hydrate the dataset into the requested replicas specified in datacache_store.
mount	Create a context manager for mounting file streams defined by the dataset as local files.
random_split	Split file streams in the dataset into two parts randomly and approximately by the percentage specified. The first dataset returned contains approximately `percentage` of the total number of file references and the second dataset contains the remaining file references.
skip	Skip file streams from the top of the dataset by the specified count.
take	Take a sample of file streams from the top of the dataset by the specified count.
take_sample	Take a random sample of file streams in the dataset approximately by the probability specified.
to_path	Get a list of file paths for each file stream defined by the dataset.

as_cache

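Note

This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information.

Create a DatacacheConsumptionConfig mapped to a datacache_store and a dataset.

as_cache(datacache_store)

Parameters

Name	Type	Description
datacache_store Required	DatacacheStore	The datacachestore to be used to hydrate.

Returns

Type	Description
DatacacheConsumptionConfig	The configuration object describing how the datacache should be materialized in the run.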

as_download

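Create a DatasetConsumptionConfig with the mode set to download.

In the submitted run, files in the dataset will be downloaded to a local path on the compute target. The download location can be retrieved from argument values and the input_datasets field of the run context. An input name is generated automatically. To specify a custom input name, call the as_named_input method.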


   # Given a run submitted with dataset input like this:
   dataset_input = dataset.as_download()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # Following are sample codes running in context of the submitted run:

   # The download location can be retrieved from argument values
   import sys
   download_location = sys.argv[1]

   # The download location can also be retrieved from input_datasets of the run context.
   from azureml.core import Run
   download_location = Run.get_context().input_datasets['input_1']

as_download(path_on_compute=None)

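Parameters

Name	Type	Description
path_on_compute	str	The target path on the compute to make the data available at. Default value: None

Remarks

When the dataset is created from the path of a single file, the download location will be the path of the single downloaded file. Otherwise, the download location will be the path of the enclosing folder for all the downloaded files.

If path_on_compute starts with a /, it will be treated as an absolute path. If it doesn't start with a /, it will be treated as a relative path relative to the working directory. If you have specified an absolute path, make sure that the job has permission to write to that directory.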
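
The absolute-versus-relative rule for path_on_compute can be written out as a few lines of Python. This is a minimal sketch of the rule as stated in the remarks, not the SDK's own code; `resolve_path_on_compute` and the working directory `/mnt/job` are hypothetical.

```python
import posixpath


def resolve_path_on_compute(path_on_compute: str, working_dir: str) -> str:
    """Toy model of the documented rule (hypothetical helper, not an SDK call)."""
    if path_on_compute.startswith("/"):
        # A leading slash means the path is used as given.
        return posixpath.normpath(path_on_compute)
    # Otherwise the path is interpreted relative to the working directory.
    return posixpath.normpath(posixpath.join(working_dir, path_on_compute))


absolute = resolve_path_on_compute("/tmp/data", "/mnt/job")
relative = resolve_path_on_compute("inputs/images", "/mnt/job")
print(absolute)
print(relative)
```

With an absolute path the job needs write permission on that exact directory, whereas a relative path lands under the working directory.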

as_hdfs

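Set the mode to hdfs.

In the submitted synapse run, files in the dataset will be converted to a local path on the compute target. The hdfs path can be retrieved from argument values and the OS environment variables.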


   # Given a run submitted with dataset input like this:
   dataset_input = dataset.as_hdfs()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # Following are sample codes running in context of the submitted run:

   # The hdfs path can be retrieved from argument values
   import sys
   hdfs_path = sys.argv[1]

   # The hdfs path can also be retrieved from an OS environment variable.
   import os
   hdfs_path = os.environ['input_<hash>']

as_hdfs()

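Remarks

When the dataset is created from the path of a single file, the hdfs path will be the path of that single file. Otherwise, the hdfs path will be the path of the enclosing folder for all the mounted files.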

as_mount

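Create a DatasetConsumptionConfig with the mode set to mount.

In the submitted run, files in the dataset will be mounted to a local path on the compute target. The mount point can be retrieved from argument values and the input_datasets field of the run context. An input name is generated automatically. To specify a custom input name, call the as_named_input method.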


   # Given a run submitted with dataset input like this:
   dataset_input = dataset.as_mount()
   experiment.submit(ScriptRunConfig(source_directory, arguments=[dataset_input]))


   # Following are sample codes running in context of the submitted run:

   # The mount point can be retrieved from argument values
   import sys
   mount_point = sys.argv[1]

   # The mount point can also be retrieved from input_datasets of the run context.
   from azureml.core import Run
   mount_point = Run.get_context().input_datasets['input_1']

as_mount(path_on_compute=None)

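Parameters

Name	Type	Description
path_on_compute	str	The target path on the compute to make the data available at. Default value: None

Remarks

When the dataset is created from the path of a single file, the mount point will be the path of the single mounted file. Otherwise, the mount point will be the path of the enclosing folder for all the mounted files.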
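
Because the mount point may be either a single file or an enclosing folder, a training script can handle both cases with standard-library calls. The sketch below is illustrative only: `list_input_files` is a hypothetical helper, and a temporary directory stands in for the path that a submitted run would receive in `sys.argv[1]`.

```python
import os
import tempfile
from typing import List


def list_input_files(input_path: str) -> List[str]:
    """Return the files behind an input path (hypothetical helper).

    Covers both documented cases: a single file path, or the enclosing
    folder of all files.
    """
    if os.path.isfile(input_path):
        return [input_path]
    found = []
    for root, _dirs, files in os.walk(input_path):
        for name in files:
            found.append(os.path.join(root, name))
    return sorted(found)


# Stand-in for a mount point: a temporary folder with two placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("1.jpg", "2.jpg"):
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write("placeholder")

folder_case = list_input_files(demo_dir)
file_case = list_input_files(os.path.join(demo_dir, "1.jpg"))
```

The same helper works for a download location, since that follows the same single-file versus folder rule.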

download

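Download file streams defined by the dataset as local files.

download(target_path=None, overwrite=False, ignore_not_found=False)

Parameters

Name	Type	Description
target_path	str	The local directory to download the files to. If None, the data will be downloaded into a temporary directory. Default value: None
overwrite	bool	Indicates whether to overwrite existing files. If set to True, existing files are overwritten; otherwise an exception is raised. Default value: False
ignore_not_found	bool	Indicates whether to fail the download if some files pointed to by the dataset are not found. If set to False, the download fails if any file download fails for any reason; otherwise, a warning is logged for not-found errors and the download succeeds as long as no other error types are encountered. Default value: False

Returns

Type	Description
list(str)	Returns an array of file paths for each file downloaded.

Remarks

If target_path starts with a /, it will be treated as an absolute path. If it doesn't start with a /, it will be treated as a relative path relative to the current working directory.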

file_metadata

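Note

This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information.

Get a file metadata expression by specifying the metadata column name.

Supported file metadata columns are Size, LastModifiedTime, CreationTime, Extension and CanSeek.

file_metadata(col)

Parameters

Name	Type	Description
col Required	str	Name of the column.

Returns

Type	Description
azureml.dataprep.api.expression.RecordFieldExpression	Returns an expression that retrieves the value in the specified column.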

filter

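Note

This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information.

Filter the data, leaving only the records that match the specified expression.

filter(expression)

Parameters

Name	Type	Description
expression Required	azureml.dataprep.api.expression.Expression	The expression to evaluate.

Returns

Type	Description
FileDataset	The modified dataset (unregistered).

Remarks

Expressions are started by indexing the dataset with the name of a column. They support a variety of functions and operators and can be combined using logical operators. The resulting expression is lazily evaluated for each record when a data pull occurs, not where it is defined.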


   (dataset.file_metadata('Size') > 10000) & (dataset.file_metadata('CanSeek') == True)
   dataset.file_metadata('Extension').starts_with('j')
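
To make the "evaluated per record, not where it is defined" remark concrete, here is a toy expression class built with operator overloading. It is a stand-in written for illustration and is not the azureml.dataprep Expression type; `Expr`, the module-level `file_metadata` function and the sample records are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class Expr:
    """Toy deferred expression (NOT azureml.dataprep's Expression)."""

    def __init__(self, fn: Callable[[Record], Any]):
        self._fn = fn

    def evaluate(self, record: Record) -> Any:
        # Evaluation happens here, once per record, when data is pulled.
        return self._fn(record)

    def __gt__(self, other: Any) -> "Expr":
        return Expr(lambda r: self._fn(r) > other)

    def __eq__(self, other: Any) -> "Expr":  # type: ignore[override]
        return Expr(lambda r: self._fn(r) == other)

    def __and__(self, other: "Expr") -> "Expr":
        return Expr(lambda r: bool(self._fn(r)) and bool(other.evaluate(r)))

    def starts_with(self, prefix: str) -> "Expr":
        return Expr(lambda r: str(self._fn(r)).startswith(prefix))


def file_metadata(col: str) -> Expr:
    """Build an expression that reads one metadata column from a record."""
    return Expr(lambda r: r[col])


# Defining the expression computes nothing yet.
big_and_seekable = (file_metadata("Size") > 10000) & (file_metadata("CanSeek") == True)

records: List[Record] = [
    {"Size": 20480, "CanSeek": True, "Extension": "jpg"},
    {"Size": 512, "CanSeek": True, "Extension": "json"},
]
kept = [r for r in records if big_and_seekable.evaluate(r)]
json_like = file_metadata("Extension").starts_with("j").evaluate(records[1])
```

The comparison operators return new expression objects rather than booleans, which is why such expressions are combined with `&` instead of Python's `and`.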

hydrate

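Note

This is an experimental method and may change at any time. See https://aka.ms/azuremlexperimental for more information.

Hydrate the dataset into the requested replicas specified in datacache_store.

hydrate(datacache_store, replica_count=None)

Parameters

Name	Type	Description
datacache_store Required	DatacacheStore	The datacachestore to be used to hydrate.
replica_count	int, optional	Number of replicas to hydrate. Default value: None

Returns

Type	Description
DatacacheHydrationTracker	The configuration object describing how the datacache should be materialized in the run.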

mount

Create a context manager for mounting file streams defined by the dataset as local files.

mount(mount_point=None, **kwargs)

Parameters

Name	Type	Description
mount_point	str	The local directory to mount the files to. If None, the data will be mounted into a temporary directory, which you can find through MountContext.mount_point. Default value: None

Returns

Type	Description
MountContext	Returns a context manager for managing the lifecycle of the mount. Upon entering the context manager, the dataflow will be mounted to the mount_point. Upon exit, it will remove the mount point and clean up the daemon process used to mount the dataflow.

Remarks

A context manager is returned to manage the lifecycle of the mount. To mount, enter the context manager; to unmount, exit from the context manager.

Mount is only supported on Unix or Unix-like operating systems with the native package libfuse installed. If you are running inside a Docker container, the container must be started with the --privileged flag or with --cap-add SYS_ADMIN --device /dev/fuse.


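   import os
   from azureml.core import Dataset, Datastore

   # `workspace` is assumed to be an existing azureml.core.Workspace object
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   dataset = Dataset.File.from_files((datastore, 'animals/dog/year-*/*.jpg'))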

   with dataset.mount() as mount_context:
       # list top level mounted files and folders in the dataset
       os.listdir(mount_context.mount_point)

   # You can also use the start and stop methods
   mount_context = dataset.mount()
   mount_context.start()  # this will mount the file streams
   mount_context.stop()  # this will unmount the file streams

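If mount_point starts with a /, it will be treated as an absolute path. If it doesn't start with a /, it will be treated as a relative path relative to the current working directory.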
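
The lifecycle described in these remarks (mount on entering, unmount and clean up on exit) follows the ordinary Python context-manager protocol. The class below is a toy stand-in that creates a temporary directory instead of performing a real FUSE mount; `ToyMountContext` and its `mounted` flag are hypothetical and only illustrate why the `with` form is safer than calling start and stop by hand.

```python
import os
import tempfile


class ToyMountContext:
    """Stand-in for a mount context; it only creates and removes a directory."""

    def __init__(self, mount_point=None):
        # With no mount point given, fall back to a temporary directory.
        self.mount_point = mount_point or tempfile.mkdtemp()
        self.mounted = False

    def start(self):
        os.makedirs(self.mount_point, exist_ok=True)
        self.mounted = True

    def stop(self):
        self.mounted = False
        # Remove the directory only if it is empty, to avoid deleting user data.
        if os.path.isdir(self.mount_point) and not os.listdir(self.mount_point):
            os.rmdir(self.mount_point)

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.stop()
        return False  # do not swallow exceptions raised inside the block


with ToyMountContext() as ctx:
    inside = ctx.mounted
    exists_inside = os.path.isdir(ctx.mount_point)
after = ctx.mounted
exists_after = os.path.isdir(ctx.mount_point)

# Cleanup still runs when the body raises, unlike a forgotten stop() call.
try:
    with ToyMountContext() as failing:
        raise RuntimeError("training failed")
except RuntimeError:
    pass
cleaned_up_after_error = not failing.mounted
```

When start and stop are called manually, wrapping the work in try/finally gives the same guarantee that the unmount step runs.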

random_split

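Split file streams in the dataset into two parts randomly and approximately by the percentage specified.

The first dataset returned contains approximately percentage of the total number of file references and the second dataset contains the remaining file references.

random_split(percentage, seed=None)

Parameters

Name	Type	Description
percentage Required	float	The approximate percentage to split the dataset by. This must be a number between 0.0 and 1.0.
seed	int	An optional seed to use for the random generator. Default value: None

Returns

Type	Description
(FileDataset, FileDataset)	Returns a tuple of new FileDataset objects representing the two datasets after the split.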
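
One way an "approximate" split can arise is when each file is assigned independently with the given probability. The sketch below models that idea in plain Python; whether the SDK uses per-file draws is an assumption, and `toy_random_split` is a hypothetical function, not part of azureml.

```python
import random
from typing import List, Optional, Tuple


def toy_random_split(paths: List[str], percentage: float,
                     seed: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Assign each path to the first part with probability `percentage`."""
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0.0 and 1.0")
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first: List[str] = []
    second: List[str] = []
    for path in paths:
        (first if rng.random() < percentage else second).append(path)
    return first, second


paths = [f"img_{i}.jpg" for i in range(1000)]
train, holdout = toy_random_split(paths, 0.8, seed=42)
train_again, _ = toy_random_split(paths, 0.8, seed=42)
print(len(train), len(holdout))
```

Under this model the first part holds roughly, but not exactly, 80% of the files, the two parts never overlap, and reusing the seed reproduces the same split.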

skip

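Skip file streams from the top of the dataset by the specified count.

skip(count)

Parameters

Name	Type	Description
count Required	int	The number of file streams to skip.

Returns

Type	Description
FileDataset	Returns a new FileDataset object representing a dataset with file streams skipped.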

take

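Take a sample of file streams from the top of the dataset by the specified count.

take(count)

Parameters

Name	Type	Description
count Required	int	The number of file streams to take.

Returns

Type	Description
FileDataset	Returns a new FileDataset object representing the sampled dataset.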

take_sample

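Take a random sample of file streams in the dataset approximately by the probability specified.

take_sample(probability, seed=None)

Parameters

Name	Type	Description
probability Required	float	The probability of a file stream being included in the sample.
seed	int	An optional seed to use for the random generator. Default value: None

Returns

Type	Description
FileDataset	Returns a new FileDataset object representing the sampled dataset.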

to_path

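Get a list of file paths for each file stream defined by the dataset.

to_path()

Returns

Type	Description
list(str)	Returns an array of file paths.

Remarks

The file paths are relative paths for local files when the file streams are downloaded or mounted.

A common prefix is removed from the file paths based on how the data source was specified to create the dataset. For example: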


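   from azureml.core import Dataset, Datastore

   # `workspace` is assumed to be an existing azureml.core.Workspace object
   datastore = Datastore.get(workspace, 'workspaceblobstore')
   dataset = Dataset.File.from_files((datastore, 'animals/dog/year-*/*.jpg'))
   print(dataset.to_path())

   # ['year-2018/1.jpg',
   #  'year-2018/2.jpg',
   #  'year-2019/1.jpg']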

   dataset = Dataset.File.from_files('https://dprepdata.blob.core.windows.net/demo/green-small/*.csv')

   print(dataset.to_path())
   # ['/green_tripdata_2013-08.csv']
