TextLoaderSaverCatalog.CreateTextLoader Method

Definition

Overloads

CreateTextLoader(DataOperationsCatalog, TextLoader+Options, IMultiStreamSource)

Create a text loader TextLoader.

CreateTextLoader(DataOperationsCatalog, TextLoader+Column[], Char, Boolean, IMultiStreamSource, Boolean, Boolean, Boolean)

Create a text loader TextLoader.

CreateTextLoader<TInput>(DataOperationsCatalog, TextLoader+Options, IMultiStreamSource)

Create a text loader TextLoader by inferencing the dataset schema from a data model type.

CreateTextLoader<TInput>(DataOperationsCatalog, Char, Boolean, IMultiStreamSource, Boolean, Boolean, Boolean)

Create a text loader TextLoader by inferencing the dataset schema from a data model type.

CreateTextLoader(DataOperationsCatalog, TextLoader+Options, IMultiStreamSource)

Create a text loader TextLoader.

public static Microsoft.ML.Data.TextLoader CreateTextLoader (this Microsoft.ML.DataOperationsCatalog catalog, Microsoft.ML.Data.TextLoader.Options options, Microsoft.ML.Data.IMultiStreamSource dataSample = default);
static member CreateTextLoader : Microsoft.ML.DataOperationsCatalog * Microsoft.ML.Data.TextLoader.Options * Microsoft.ML.Data.IMultiStreamSource -> Microsoft.ML.Data.TextLoader
<Extension()>
Public Function CreateTextLoader (catalog As DataOperationsCatalog, options As TextLoader.Options, Optional dataSample As IMultiStreamSource = Nothing) As TextLoader

Parameters

options
TextLoader.Options

Defines the settings of the load operation.

dataSample
IMultiStreamSource

The optional location of a data sample. The sample can be used to infer slot name annotations if present, and also the number of slots in Columns defined with TextLoader.Range with null maximum index. If the sample has been saved with ML.NET's SaveAsText(DataOperationsCatalog, IDataView, Stream, Char, Boolean, Boolean, Boolean, Boolean), it will also contain the schema information in the header that the loader can read even if Columns are not specified. In order to use the schema defined in the file, all other TextLoader.Options sould be left with their default values.

Returns

Applies to

CreateTextLoader(DataOperationsCatalog, TextLoader+Column[], Char, Boolean, IMultiStreamSource, Boolean, Boolean, Boolean)

Create a text loader TextLoader.

public static Microsoft.ML.Data.TextLoader CreateTextLoader (this Microsoft.ML.DataOperationsCatalog catalog, Microsoft.ML.Data.TextLoader.Column[] columns, char separatorChar = '\t', bool hasHeader = false, Microsoft.ML.Data.IMultiStreamSource dataSample = default, bool allowQuoting = false, bool trimWhitespace = false, bool allowSparse = false);
static member CreateTextLoader : Microsoft.ML.DataOperationsCatalog * Microsoft.ML.Data.TextLoader.Column[] * char * bool * Microsoft.ML.Data.IMultiStreamSource * bool * bool * bool -> Microsoft.ML.Data.TextLoader
<Extension()>
Public Function CreateTextLoader (catalog As DataOperationsCatalog, columns As TextLoader.Column(), Optional separatorChar As Char = '\t', Optional hasHeader As Boolean = false, Optional dataSample As IMultiStreamSource = Nothing, Optional allowQuoting As Boolean = false, Optional trimWhitespace As Boolean = false, Optional allowSparse As Boolean = false) As TextLoader

Parameters

columns
TextLoader.Column[]

Array of columns TextLoader.Column defining the schema.

separatorChar
Char

The character used as separator between data points in a row. By default the tab character is used as separator.

hasHeader
Boolean

Whether the file has a header with feature names. When a is provided, true indicates that the first line in the will be used for feature names, and that when Load(IMultiStreamSource) is called, the first line will be skipped. When there is no provided, true just indicates that the loader should skip the first line when Load(IMultiStreamSource) is called, but columns will not have slot names annotations. This is because the output schema is made when the loader is created, and not when Load(IMultiStreamSource) is called.

dataSample
IMultiStreamSource

The optional location of a data sample. The sample can be used to infer slot name annotations if present, and also the number of slots in a column defined with TextLoader.Range with null maximum index. If the sample has been saved with ML.NET's SaveAsText(DataOperationsCatalog, IDataView, Stream, Char, Boolean, Boolean, Boolean, Boolean), it will also contain the schema information in the header that the loader can read even if columns is null. In order to use the schema defined in the file, all other arguments sould be left with their default values.

allowQuoting
Boolean

Whether the input may include double-quoted values. This parameter is used to distinguish separator characters in an input value from actual separators. When true, separators within double quotes are treated as part of the input value. When false, all separators, even those within quotes, are treated as delimiting a new column.

trimWhitespace
Boolean

Remove trailing whitespace from lines.

allowSparse
Boolean

Whether the input may include sparse representations. For example, a row containing "5 2:6 4:3" means that there are 5 columns, and the only non-zero are columns 2 and 4, which have values 6 and 3, respectively. Column indices are zero-based, so columns 2 and 4 represent the 3rd and 5th columns. A column may also have dense values followed by sparse values represented in this fashion. For example, a row containing "1 2 5 2:6 4:3" represents two dense columns with values 1 and 2, followed by 5 sparsely represented columns with values 0, 0, 6, 0, and 3. The indices of the sparse columns start from 0, even though 0 represents the third column.

Returns

Examples

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace Samples.Dynamic.DataOperations
{
    public static class LoadingText
    {
        // This examples shows all the ways to load data with TextLoader.
        public static void Example()
        {
            // Create 5 data files to illustrate different loading methods.
            var dataFiles = new List<string>();
            var random = new Random(1);
            var dataDirectoryName = "DataDir";
            Directory.CreateDirectory(dataDirectoryName);
            for (int i = 0; i < 5; i++)
            {
                var fileName = Path.Combine(dataDirectoryName, $"Data_{i}.csv");
                dataFiles.Add(fileName);
                using (var fs = File.CreateText(fileName))
                {
                    // Write without header with 10 random columns, forcing
                    // approximately 80% of values to be 0.
                    for (int line = 0; line < 10; line++)
                    {
                        var sb = new StringBuilder();
                        for (int pos = 0; pos < 10; pos++)
                        {
                            var value = random.NextDouble();
                            sb.Append((value < 0.8 ? 0 : value).ToString() + '\t');
                        }
                        fs.WriteLine(sb.ToString(0, sb.Length - 1));
                    }
                }
            }

            // Create a TextLoader.
            var mlContext = new MLContext();
            var loader = mlContext.Data.CreateTextLoader(
                columns: new[]
                {
                    new TextLoader.Column("Features", DataKind.Single, 0, 9)
                },
                hasHeader: false
            );

            // Load a single file from path.
            var singleFileData = loader.Load(dataFiles[0]);
            PrintRowCount(singleFileData);

            // Expected Output:
            //   10


            // Load all 5 files from path.
            var multipleFilesData = loader.Load(dataFiles.ToArray());
            PrintRowCount(multipleFilesData);

            // Expected Output:
            //   50


            // Load all files using path wildcard.
            var multipleFilesWildcardData =
                loader.Load(Path.Combine(dataDirectoryName, "Data_*.csv"));
            PrintRowCount(multipleFilesWildcardData);

            // Expected Output:
            //   50


            // Create a TextLoader with user defined type.
            var loaderWithCustomType =
                mlContext.Data.CreateTextLoader<Data>(hasHeader: false);

            // Load a single file from path.
            var singleFileCustomTypeData = loaderWithCustomType.Load(dataFiles[0]);
            PrintRowCount(singleFileCustomTypeData);

            // Expected Output:
            //   10


            // Create a TextLoader with unknown column length to illustrate
            // how a data sample may be used to infer column size.
            var dataSample = new MultiFileSource(dataFiles[0]);
            var loaderWithUnknownLength = mlContext.Data.CreateTextLoader(
                columns: new[]
                {
                    new TextLoader.Column("Features",
                                          DataKind.Single,
                                          new[] { new TextLoader.Range(0, null) })
                },
                dataSample: dataSample
            );

            var dataWithInferredLength = loaderWithUnknownLength.Load(dataFiles[0]);
            var featuresColumn = dataWithInferredLength.Schema.GetColumnOrNull("Features");
            if (featuresColumn.HasValue)
                Console.WriteLine(featuresColumn.Value.ToString());

            // Expected Output:
            //   Features: Vector<Single, 10>
            //
            // ML.NET infers the correct length of 10 for the Features column,
            // which is of type Vector<Single>.

            PrintRowCount(dataWithInferredLength);

            // Expected Output:
            //   10


            // Save the data with 10 rows to a text file to illustrate the use of
            // sparse format.
            var sparseDataFileName = Path.Combine(dataDirectoryName, "saved_data.tsv");
            using (FileStream stream = new FileStream(sparseDataFileName, FileMode.Create))
                mlContext.Data.SaveAsText(singleFileData, stream);

            // Since there are many zeroes in the data, it will be saved in a sparse
            // representation to save disk space. The data may be forced to be saved
            // in a dense representation by setting forceDense to true. The sparse
            // data will look like the following:
            //
            //   10 7:0.943862259
            //   10 3:0.989767134
            //   10 0:0.949778438   8:0.823028445   9:0.886469543
            //
            // The sparse representation of the first row indicates that there are
            // 10 columns, the column 7 (8-th column) has value 0.943862259, and other
            // omitted columns have value 0.

            // Create a TextLoader that allows sparse input.
            var sparseLoader = mlContext.Data.CreateTextLoader(
                columns: new[]
                {
                    new TextLoader.Column("Features", DataKind.Single, 0, 9)
                },
                allowSparse: true
            );

            // Load the saved sparse data.
            var sparseData = sparseLoader.Load(sparseDataFileName);
            PrintRowCount(sparseData);

            // Expected Output:
            //   10


            // Create a TextLoader without any column schema using TextLoader.Options.
            // Since the sparse data file was saved with ML.NET, it has the schema
            // enoded in its header that the loader can understand:
            //
            // #@ TextLoader{
            // #@   sep=tab
            // #@   col=Features:R4:0-9
            // #@ }
            //
            // The schema syntax is unimportant since it is only used internally. In
            // short, it tells the loader that the values are separated by tabs, and
            // that columns 0-9 in the text file are to be read into one column named
            // "Features" of type Single (internal type R4).

            var options = new TextLoader.Options()
            {
                AllowSparse = true,
            };
            var dataSampleWithSchema = new MultiFileSource(sparseDataFileName);
            var sparseLoaderWithSchema =
                mlContext.Data.CreateTextLoader(options, dataSample: dataSampleWithSchema);

            // Load the saved sparse data.
            var sparseDataWithSchema = sparseLoaderWithSchema.Load(sparseDataFileName);
            PrintRowCount(sparseDataWithSchema);

            // Expected Output:
            //   10
        }

        private static void PrintRowCount(IDataView idv)
        {
            // IDataView is lazy so we need to iterate through it
            // to get the number of rows.
            long rowCount = 0;
            using (var cursor = idv.GetRowCursor(idv.Schema))
                while (cursor.MoveNext())
                    rowCount++;

            Console.WriteLine(rowCount);
        }

        private class Data
        {
            [LoadColumn(0, 9)]
            public float[] Features { get; set; }
        }
    }
}

Applies to

CreateTextLoader<TInput>(DataOperationsCatalog, TextLoader+Options, IMultiStreamSource)

Create a text loader TextLoader by inferencing the dataset schema from a data model type.

public static Microsoft.ML.Data.TextLoader CreateTextLoader<TInput> (this Microsoft.ML.DataOperationsCatalog catalog, Microsoft.ML.Data.TextLoader.Options options, Microsoft.ML.Data.IMultiStreamSource dataSample = default);
static member CreateTextLoader : Microsoft.ML.DataOperationsCatalog * Microsoft.ML.Data.TextLoader.Options * Microsoft.ML.Data.IMultiStreamSource -> Microsoft.ML.Data.TextLoader
<Extension()>
Public Function CreateTextLoader(Of TInput) (catalog As DataOperationsCatalog, options As TextLoader.Options, Optional dataSample As IMultiStreamSource = Nothing) As TextLoader

Type Parameters

TInput

Parameters

options
TextLoader.Options

Defines the settings of the load operation. Defines the settings of the load operation. No need to specify a Columns field, as columns will be infered by this method.

dataSample
IMultiStreamSource

The optional location of a data sample. The sample can be used to infer information about the columns, such as slot names.

Returns

Applies to

CreateTextLoader<TInput>(DataOperationsCatalog, Char, Boolean, IMultiStreamSource, Boolean, Boolean, Boolean)

Create a text loader TextLoader by inferencing the dataset schema from a data model type.

public static Microsoft.ML.Data.TextLoader CreateTextLoader<TInput> (this Microsoft.ML.DataOperationsCatalog catalog, char separatorChar = '\t', bool hasHeader = false, Microsoft.ML.Data.IMultiStreamSource dataSample = default, bool allowQuoting = false, bool trimWhitespace = false, bool allowSparse = false);
static member CreateTextLoader : Microsoft.ML.DataOperationsCatalog * char * bool * Microsoft.ML.Data.IMultiStreamSource * bool * bool * bool -> Microsoft.ML.Data.TextLoader
<Extension()>
Public Function CreateTextLoader(Of TInput) (catalog As DataOperationsCatalog, Optional separatorChar As Char = '\t', Optional hasHeader As Boolean = false, Optional dataSample As IMultiStreamSource = Nothing, Optional allowQuoting As Boolean = false, Optional trimWhitespace As Boolean = false, Optional allowSparse As Boolean = false) As TextLoader

Type Parameters

TInput

Defines the schema of the data to be loaded. Use public fields or properties decorated with LoadColumnAttribute (and possibly other attributes) to specify the column names and their data types in the schema of the loaded data.

Parameters

separatorChar
Char

Column separator character. Default is '\t'

hasHeader
Boolean

Whether the file has a header with feature names. When a is provided, true indicates that the first line in the will be used for feature names, and that when Load(IMultiStreamSource) is called, the first line will be skipped. When there is no provided, true just indicates that the loader should skip the first line when Load(IMultiStreamSource) is called, but columns will not have slot names annotations. This is because the output schema is made when the loader is created, and not when Load(IMultiStreamSource) is called.

dataSample
IMultiStreamSource

The optional location of a data sample. The sample can be used to infer slot name annotations if present.

allowQuoting
Boolean

Whether the input may include double-quoted values. This parameter is used to distinguish separator characters in an input value from actual separators. When true, separators within double quotes are treated as part of the input value. When false, all separators, even those whitin quotes, are treated as delimiting a new column.

trimWhitespace
Boolean

Remove trailing whitespace from lines.

allowSparse
Boolean

Whether the input may include sparse representations. For example, a row containing "5 2:6 4:3" means that there are 5 columns, and the only non-zero are columns 2 and 4, which have values 6 and 3, respectively. Column indices are zero-based, so columns 2 and 4 represent the 3rd and 5th columns. A column may also have dense values followed by sparse values represented in this fashion. For example, a row containing "1 2 5 2:6 4:3" represents two dense columns with values 1 and 2, followed by 5 sparsely represented columns with values 0, 0, 6, 0, and 3. The indices of the sparse columns start from 0, even though 0 represents the third column.

Returns

Applies to