TextLoaderSaverCatalog.CreateTextLoader Method
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Overloads
CreateTextLoader(DataOperationsCatalog, TextLoader+Options, IMultiStreamSource) | Creates a TextLoader.
CreateTextLoader(DataOperationsCatalog, TextLoader+Column[], Char, Boolean, IMultiStreamSource, Boolean, Boolean, Boolean) | Creates a TextLoader.
CreateTextLoader<TInput>(DataOperationsCatalog, TextLoader+Options, IMultiStreamSource) | Creates a TextLoader by inferring the dataset schema from a data model type.
CreateTextLoader<TInput>(DataOperationsCatalog, Char, Boolean, IMultiStreamSource, Boolean, Boolean, Boolean) | Creates a TextLoader by inferring the dataset schema from a data model type.
CreateTextLoader(DataOperationsCatalog, TextLoader+Options, IMultiStreamSource)
Creates a TextLoader.
public static Microsoft.ML.Data.TextLoader CreateTextLoader (this Microsoft.ML.DataOperationsCatalog catalog, Microsoft.ML.Data.TextLoader.Options options, Microsoft.ML.Data.IMultiStreamSource dataSample = default);
static member CreateTextLoader : Microsoft.ML.DataOperationsCatalog * Microsoft.ML.Data.TextLoader.Options * Microsoft.ML.Data.IMultiStreamSource -> Microsoft.ML.Data.TextLoader
<Extension()>
Public Function CreateTextLoader (catalog As DataOperationsCatalog, options As TextLoader.Options, Optional dataSample As IMultiStreamSource = Nothing) As TextLoader
Parameters
- catalog
- DataOperationsCatalog
The DataOperationsCatalog catalog.
- options
- TextLoader.Options
Defines the settings of the load operation.
- dataSample
- IMultiStreamSource
The optional location of a data sample. The sample can be used to infer slot name annotations, if present, and the number of slots in Columns defined with a TextLoader.Range whose maximum index is null.
If the sample was saved with ML.NET's SaveAsText(DataOperationsCatalog, IDataView, Stream, Char, Boolean, Boolean, Boolean, Boolean), its header also contains schema information that the loader can read even if Columns is not specified.
To use the schema defined in the file, leave all other TextLoader.Options fields at their default values.
Returns
TextLoader
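Examples
The following is a minimal sketch of this overload, not taken from the ML.NET samples. It assumes a hypothetical comma-separated file (`options_sketch.csv`) with a header line, one label column, and three feature columns. The file name, column names, and class name are illustrative only.

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class OptionsLoaderSketch
{
    public static IDataView LoadSample()
    {
        // Hypothetical sample file: a header line followed by two data rows.
        var path = Path.Combine(Path.GetTempPath(), "options_sketch.csv");
        File.WriteAllLines(path, new[]
        {
            "Label,F0,F1,F2",
            "1,0.5,0.25,0.125",
            "0,1.5,2.5,3.5"
        });

        var mlContext = new MLContext();

        // All load settings live in a single TextLoader.Options object.
        var options = new TextLoader.Options
        {
            Separators = new[] { ',' },
            HasHeader = true,
            Columns = new[]
            {
                new TextLoader.Column("Label", DataKind.Single, 0),
                new TextLoader.Column("Features", DataKind.Single, 1, 3)
            }
        };

        // The file doubles as the data sample, so the loader can read the
        // header line when it builds the output schema.
        var loader = mlContext.Data.CreateTextLoader(
            options, dataSample: new MultiFileSource(path));

        return loader.Load(path);
    }

    public static void Main()
    {
        IDataView data = LoadSample();
        Console.WriteLine(data.Schema["Features"].Type);
    }
}
```

Because Columns is set explicitly here, the loader uses that schema rather than any schema stored in the file header.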
Applies to
CreateTextLoader(DataOperationsCatalog, TextLoader+Column[], Char, Boolean, IMultiStreamSource, Boolean, Boolean, Boolean)
Creates a TextLoader.
public static Microsoft.ML.Data.TextLoader CreateTextLoader (this Microsoft.ML.DataOperationsCatalog catalog, Microsoft.ML.Data.TextLoader.Column[] columns, char separatorChar = '\t', bool hasHeader = false, Microsoft.ML.Data.IMultiStreamSource dataSample = default, bool allowQuoting = false, bool trimWhitespace = false, bool allowSparse = false);
static member CreateTextLoader : Microsoft.ML.DataOperationsCatalog * Microsoft.ML.Data.TextLoader.Column[] * char * bool * Microsoft.ML.Data.IMultiStreamSource * bool * bool * bool -> Microsoft.ML.Data.TextLoader
<Extension()>
Public Function CreateTextLoader (catalog As DataOperationsCatalog, columns As TextLoader.Column(), Optional separatorChar As Char = '\t', Optional hasHeader As Boolean = false, Optional dataSample As IMultiStreamSource = Nothing, Optional allowQuoting As Boolean = false, Optional trimWhitespace As Boolean = false, Optional allowSparse As Boolean = false) As TextLoader
Parameters
- catalog
- DataOperationsCatalog
The DataOperationsCatalog catalog.
- columns
- TextLoader.Column[]
Array of TextLoader.Column objects that define the schema.
- separatorChar
- Char
The character used as separator between data points in a row. By default the tab character is used as separator.
- hasHeader
- Boolean
Whether the file has a header with feature names. When a dataSample is provided, true indicates that the first line in the dataSample will be used for feature names, and that the first line will be skipped when Load(IMultiStreamSource) is called. When no dataSample is provided, true only indicates that the loader should skip the first line when Load(IMultiStreamSource) is called; the columns will not have slot name annotations. This is because the output schema is built when the loader is created, not when Load(IMultiStreamSource) is called.
- dataSample
- IMultiStreamSource
The optional location of a data sample. The sample can be used to infer slot name annotations, if present, and the number of slots in a column defined with a TextLoader.Range whose maximum index is null.
If the sample was saved with ML.NET's SaveAsText(DataOperationsCatalog, IDataView, Stream, Char, Boolean, Boolean, Boolean, Boolean), its header also contains schema information that the loader can read even if columns is null.
To use the schema defined in the file, leave all other arguments at their default values.
- allowQuoting
- Boolean
Whether the input may include double-quoted values. This parameter is used to distinguish separator characters in an input value from actual separators. When true, separators within double quotes are treated as part of the input value. When false, all separators, even those within quotes, are treated as delimiting a new column.
- trimWhitespace
- Boolean
Whether to remove trailing whitespace from lines.
- allowSparse
- Boolean
Whether the input may include sparse representations. For example, a row containing "5 2:6 4:3" means that there are 5 columns, and the only non-zero values are in columns 2 and 4, which have values 6 and 3, respectively. Column indices are zero-based, so columns 2 and 4 represent the 3rd and 5th columns. A row may also have dense values followed by sparse values represented in this fashion. For example, a row containing "1 2 5 2:6 4:3" represents two dense columns with values 1 and 2, followed by 5 sparsely represented columns with values 0, 0, 6, 0, and 3. The indices of the sparse columns start from 0, even though index 0 then refers to the third column of the row.
Returns
TextLoader
Examples
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace Samples.Dynamic.DataOperations
{
    public static class LoadingText
    {
        // This example shows all the ways to load data with TextLoader.
        public static void Example()
        {
            // Create 5 data files to illustrate different loading methods.
            var dataFiles = new List<string>();
            var random = new Random(1);
            var dataDirectoryName = "DataDir";
            Directory.CreateDirectory(dataDirectoryName);
            for (int i = 0; i < 5; i++)
            {
                var fileName = Path.Combine(dataDirectoryName, $"Data_{i}.csv");
                dataFiles.Add(fileName);
                using (var fs = File.CreateText(fileName))
                {
                    // Write without header with 10 random columns, forcing
                    // approximately 80% of values to be 0.
                    for (int line = 0; line < 10; line++)
                    {
                        var sb = new StringBuilder();
                        for (int pos = 0; pos < 10; pos++)
                        {
                            var value = random.NextDouble();
                            sb.Append((value < 0.8 ? 0 : value).ToString() + '\t');
                        }
                        fs.WriteLine(sb.ToString(0, sb.Length - 1));
                    }
                }
            }

            // Create a TextLoader.
            var mlContext = new MLContext();
            var loader = mlContext.Data.CreateTextLoader(
                columns: new[]
                {
                    new TextLoader.Column("Features", DataKind.Single, 0, 9)
                },
                hasHeader: false
            );

            // Load a single file from path.
            var singleFileData = loader.Load(dataFiles[0]);
            PrintRowCount(singleFileData);

            // Expected Output:
            //   10

            // Load all 5 files from path.
            var multipleFilesData = loader.Load(dataFiles.ToArray());
            PrintRowCount(multipleFilesData);

            // Expected Output:
            //   50

            // Load all files using path wildcard.
            var multipleFilesWildcardData =
                loader.Load(Path.Combine(dataDirectoryName, "Data_*.csv"));
            PrintRowCount(multipleFilesWildcardData);

            // Expected Output:
            //   50

            // Create a TextLoader with user defined type.
            var loaderWithCustomType =
                mlContext.Data.CreateTextLoader<Data>(hasHeader: false);

            // Load a single file from path.
            var singleFileCustomTypeData = loaderWithCustomType.Load(dataFiles[0]);
            PrintRowCount(singleFileCustomTypeData);

            // Expected Output:
            //   10

            // Create a TextLoader with unknown column length to illustrate
            // how a data sample may be used to infer column size.
            var dataSample = new MultiFileSource(dataFiles[0]);
            var loaderWithUnknownLength = mlContext.Data.CreateTextLoader(
                columns: new[]
                {
                    new TextLoader.Column("Features",
                        DataKind.Single,
                        new[] { new TextLoader.Range(0, null) })
                },
                dataSample: dataSample
            );

            var dataWithInferredLength = loaderWithUnknownLength.Load(dataFiles[0]);
            var featuresColumn = dataWithInferredLength.Schema.GetColumnOrNull("Features");
            if (featuresColumn.HasValue)
                Console.WriteLine(featuresColumn.Value.ToString());

            // Expected Output:
            //   Features: Vector<Single, 10>
            //
            // ML.NET infers the correct length of 10 for the Features column,
            // which is of type Vector<Single>.

            PrintRowCount(dataWithInferredLength);

            // Expected Output:
            //   10

            // Save the data with 10 rows to a text file to illustrate the use of
            // sparse format.
            var sparseDataFileName = Path.Combine(dataDirectoryName, "saved_data.tsv");
            using (FileStream stream = new FileStream(sparseDataFileName, FileMode.Create))
                mlContext.Data.SaveAsText(singleFileData, stream);

            // Since there are many zeroes in the data, it will be saved in a sparse
            // representation to save disk space. The data may be forced to be saved
            // in a dense representation by setting forceDense to true. The sparse
            // data will look like the following:
            //
            //   10 7:0.943862259
            //   10 3:0.989767134
            //   10 0:0.949778438 8:0.823028445 9:0.886469543
            //
            // The sparse representation of the first row indicates that there are
            // 10 columns, the column 7 (8-th column) has value 0.943862259, and other
            // omitted columns have value 0.

            // Create a TextLoader that allows sparse input.
            var sparseLoader = mlContext.Data.CreateTextLoader(
                columns: new[]
                {
                    new TextLoader.Column("Features", DataKind.Single, 0, 9)
                },
                allowSparse: true
            );

            // Load the saved sparse data.
            var sparseData = sparseLoader.Load(sparseDataFileName);
            PrintRowCount(sparseData);

            // Expected Output:
            //   10

            // Create a TextLoader without any column schema using TextLoader.Options.
            // Since the sparse data file was saved with ML.NET, it has the schema
            // encoded in its header that the loader can understand:
            //
            // #@ TextLoader{
            // #@   sep=tab
            // #@   col=Features:R4:0-9
            // #@ }
            //
            // The schema syntax is unimportant since it is only used internally. In
            // short, it tells the loader that the values are separated by tabs, and
            // that columns 0-9 in the text file are to be read into one column named
            // "Features" of type Single (internal type R4).
            var options = new TextLoader.Options()
            {
                AllowSparse = true,
            };
            var dataSampleWithSchema = new MultiFileSource(sparseDataFileName);
            var sparseLoaderWithSchema =
                mlContext.Data.CreateTextLoader(options, dataSample: dataSampleWithSchema);

            // Load the saved sparse data.
            var sparseDataWithSchema = sparseLoaderWithSchema.Load(sparseDataFileName);
            PrintRowCount(sparseDataWithSchema);

            // Expected Output:
            //   10
        }

        private static void PrintRowCount(IDataView idv)
        {
            // IDataView is lazy so we need to iterate through it
            // to get the number of rows.
            long rowCount = 0;
            using (var cursor = idv.GetRowCursor(idv.Schema))
                while (cursor.MoveNext())
                    rowCount++;

            Console.WriteLine(rowCount);
        }

        private class Data
        {
            [LoadColumn(0, 9)]
            public float[] Features { get; set; }
        }
    }
}
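The sample above does not exercise allowQuoting or a header line. The following is a minimal sketch of those two parameters, under stated assumptions: the file `quoted_sketch.csv`, its contents, and the `Row` class are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class QuotingSketch
{
    // Hypothetical row type used only to read the loaded values back.
    public class Row
    {
        public string Text { get; set; }
        public float Score { get; set; }
    }

    public static List<Row> LoadRows()
    {
        // Hypothetical file: the first data row has a comma inside a
        // double-quoted value.
        var path = Path.Combine(Path.GetTempPath(), "quoted_sketch.csv");
        File.WriteAllLines(path, new[]
        {
            "Text,Score",
            "\"Hello, world\",1",
            "\"No comma here\",2"
        });

        var mlContext = new MLContext();
        var loader = mlContext.Data.CreateTextLoader(
            columns: new[]
            {
                new TextLoader.Column("Text", DataKind.String, 0),
                new TextLoader.Column("Score", DataKind.Single, 1)
            },
            separatorChar: ',',
            hasHeader: true,      // skip the "Text,Score" line when loading
            allowQuoting: true);  // a quoted comma stays inside the value

        IDataView data = loader.Load(path);

        // Materialize the lazy IDataView into ordinary objects.
        return mlContext.Data
            .CreateEnumerable<Row>(data, reuseRowObject: false)
            .ToList();
    }

    public static void Main()
    {
        foreach (var row in LoadRows())
            Console.WriteLine($"{row.Text} -> {row.Score}");
    }
}
```

With allowQuoting left at its default of false, the comma inside the first value would instead be treated as a column separator, as described for that parameter.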
Applies to
CreateTextLoader<TInput>(DataOperationsCatalog, TextLoader+Options, IMultiStreamSource)
Creates a TextLoader by inferring the dataset schema from a data model type.
public static Microsoft.ML.Data.TextLoader CreateTextLoader<TInput> (this Microsoft.ML.DataOperationsCatalog catalog, Microsoft.ML.Data.TextLoader.Options options, Microsoft.ML.Data.IMultiStreamSource dataSample = default);
static member CreateTextLoader : Microsoft.ML.DataOperationsCatalog * Microsoft.ML.Data.TextLoader.Options * Microsoft.ML.Data.IMultiStreamSource -> Microsoft.ML.Data.TextLoader
<Extension()>
Public Function CreateTextLoader(Of TInput) (catalog As DataOperationsCatalog, options As TextLoader.Options, Optional dataSample As IMultiStreamSource = Nothing) As TextLoader
Type Parameters
- TInput
Parameters
- catalog
- DataOperationsCatalog
The DataOperationsCatalog catalog.
- options
- TextLoader.Options
Defines the settings of the load operation. There is no need to specify the Columns field, because the columns are inferred by this method.
- dataSample
- IMultiStreamSource
The optional location of a data sample. The sample can be used to infer information about the columns, such as slot names.
Returns
TextLoader
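Examples
The following is a minimal sketch of this overload, assuming a hypothetical semicolon-separated file and a hypothetical data model class `HousingRow`. Neither comes from the ML.NET samples. Columns is left unset in the options because the schema comes from the LoadColumn attributes on the class.

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class GenericOptionsSketch
{
    // Hypothetical data model: one scalar column and one 3-slot vector.
    public class HousingRow
    {
        [LoadColumn(0)]
        public float Price { get; set; }

        [LoadColumn(1, 3)]
        [VectorType(3)]
        public float[] Features { get; set; }
    }

    public static IDataView LoadSample()
    {
        // Hypothetical file: a header line followed by two data rows.
        var path = Path.Combine(Path.GetTempPath(), "generic_options_sketch.csv");
        File.WriteAllLines(path, new[]
        {
            "Price;F0;F1;F2",
            "250000;3;2;120",
            "180000;2;1;85"
        });

        var mlContext = new MLContext();

        // Only the non-schema settings are given; Columns is derived
        // from HousingRow by the generic overload.
        var options = new TextLoader.Options
        {
            Separators = new[] { ';' },
            HasHeader = true
        };

        var loader = mlContext.Data.CreateTextLoader<HousingRow>(options);
        return loader.Load(path);
    }

    public static void Main()
    {
        foreach (var column in LoadSample().Schema)
            Console.WriteLine($"{column.Name}: {column.Type}");
    }
}
```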
Applies to
CreateTextLoader<TInput>(DataOperationsCatalog, Char, Boolean, IMultiStreamSource, Boolean, Boolean, Boolean)
Creates a TextLoader by inferring the dataset schema from a data model type.
public static Microsoft.ML.Data.TextLoader CreateTextLoader<TInput> (this Microsoft.ML.DataOperationsCatalog catalog, char separatorChar = '\t', bool hasHeader = false, Microsoft.ML.Data.IMultiStreamSource dataSample = default, bool allowQuoting = false, bool trimWhitespace = false, bool allowSparse = false);
static member CreateTextLoader : Microsoft.ML.DataOperationsCatalog * char * bool * Microsoft.ML.Data.IMultiStreamSource * bool * bool * bool -> Microsoft.ML.Data.TextLoader
<Extension()>
Public Function CreateTextLoader(Of TInput) (catalog As DataOperationsCatalog, Optional separatorChar As Char = '\t', Optional hasHeader As Boolean = false, Optional dataSample As IMultiStreamSource = Nothing, Optional allowQuoting As Boolean = false, Optional trimWhitespace As Boolean = false, Optional allowSparse As Boolean = false) As TextLoader
Type Parameters
- TInput
Defines the schema of the data to be loaded. Use public fields or properties decorated with LoadColumnAttribute (and possibly other attributes) to specify the column names and their data types in the schema of the loaded data.
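As an illustration of such a data model type, here is a minimal sketch. The class `FlowerRow`, its members, and the sample file are hypothetical. It shows LoadColumnAttribute mapping text columns to members, and ColumnNameAttribute giving a column a name that differs from the member name.

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class DataModelSketch
{
    // Hypothetical data model type passed as TInput.
    public class FlowerRow
    {
        // Text columns 0 through 3 are read into one vector column.
        [LoadColumn(0, 3)]
        [VectorType(4)]
        public float[] Features;

        // Text column 4 is exposed under the name "Label".
        [LoadColumn(4)]
        [ColumnName("Label")]
        public string Species;
    }

    public static DataViewSchema LoadSchema()
    {
        // Hypothetical headerless, comma-separated file.
        var path = Path.Combine(Path.GetTempPath(), "data_model_sketch.csv");
        File.WriteAllLines(path, new[]
        {
            "1.0,2.0,3.0,4.0,a",
            "5.0,6.0,7.0,8.0,b"
        });

        var mlContext = new MLContext();
        var loader = mlContext.Data.CreateTextLoader<FlowerRow>(
            separatorChar: ',',
            hasHeader: false);

        return loader.Load(path).Schema;
    }

    public static void Main()
    {
        foreach (var column in LoadSchema())
            Console.WriteLine($"{column.Name}: {column.Type}");
    }
}
```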
Parameters
- catalog
- DataOperationsCatalog
The DataOperationsCatalog catalog.
- separatorChar
- Char
The character used as separator between columns in a row. The default is the tab character ('\t').
- hasHeader
- Boolean
Whether the file has a header with feature names. When a dataSample is provided, true indicates that the first line in the dataSample will be used for feature names, and that the first line will be skipped when Load(IMultiStreamSource) is called. When no dataSample is provided, true only indicates that the loader should skip the first line when Load(IMultiStreamSource) is called; the columns will not have slot name annotations. This is because the output schema is built when the loader is created, not when Load(IMultiStreamSource) is called.
- dataSample
- IMultiStreamSource
The optional location of a data sample. The sample can be used to infer slot name annotations if present.
- allowQuoting
- Boolean
Whether the input may include double-quoted values. This parameter is used to distinguish separator characters in an input value from actual separators. When true, separators within double quotes are treated as part of the input value. When false, all separators, even those within quotes, are treated as delimiting a new column.
- trimWhitespace
- Boolean
Whether to remove trailing whitespace from lines.
- allowSparse
- Boolean
Whether the input may include sparse representations. For example, a row containing "5 2:6 4:3" means that there are 5 columns, and the only non-zero values are in columns 2 and 4, which have values 6 and 3, respectively. Column indices are zero-based, so columns 2 and 4 represent the 3rd and 5th columns. A row may also have dense values followed by sparse values represented in this fashion. For example, a row containing "1 2 5 2:6 4:3" represents two dense columns with values 1 and 2, followed by 5 sparsely represented columns with values 0, 0, 6, 0, and 3. The indices of the sparse columns start from 0, even though index 0 then refers to the third column of the row.
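To make the sparse notation concrete, the following sketch expands the two example rows from the allowSparse description into dense values. It is a hypothetical illustration in plain C#, not ML.NET's parser. It assumes space-separated tokens and treats the token just before the first index:value pair as the sparse column count, so it does not handle a sparse row with no pairs at all.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class SparseRowSketch
{
    // Expands a row such as "1 2 5 2:6 4:3" into its dense values.
    public static float[] Expand(string row)
    {
        var tokens = row.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Locate the first "index:value" token, if any.
        int firstPair = Array.FindIndex(tokens, t => t.IndexOf(':') >= 0);
        if (firstPair < 0)
            return tokens.Select(t => Parse(t)).ToArray(); // purely dense row
        if (firstPair == 0)
            throw new FormatException("Missing sparse column count.");

        // The token before the first pair is the number of sparse columns;
        // everything before that token is a dense value.
        int denseCount = firstPair - 1;
        int sparseCount = int.Parse(tokens[denseCount], CultureInfo.InvariantCulture);

        var result = new float[denseCount + sparseCount]; // zero-initialized
        for (int i = 0; i < denseCount; i++)
            result[i] = Parse(tokens[i]);

        for (int i = firstPair; i < tokens.Length; i++)
        {
            var parts = tokens[i].Split(':');
            int index = int.Parse(parts[0], CultureInfo.InvariantCulture);
            // Sparse indices start at 0 relative to the sparse block.
            result[denseCount + index] = Parse(parts[1]);
        }
        return result;
    }

    private static float Parse(string s) =>
        float.Parse(s, CultureInfo.InvariantCulture);

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Expand("5 2:6 4:3")));
        // prints: 0, 0, 6, 0, 3
        Console.WriteLine(string.Join(", ", Expand("1 2 5 2:6 4:3")));
        // prints: 1, 2, 0, 0, 6, 0, 3
    }
}
```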