TextCatalog.ProduceHashedNgrams Method

Definition

Overloads

ProduceHashedNgrams(TransformsCatalog+TextTransforms, String, String, Int32, Int32, Int32, Boolean, UInt32, Boolean, Int32, Boolean)

Creates an NgramHashingEstimator, which copies the data from the column specified in inputColumnName to a new column, outputColumnName, and produces a vector of counts of hashed n-grams.

ProduceHashedNgrams(TransformsCatalog+TextTransforms, String, String[], Int32, Int32, Int32, Boolean, UInt32, Boolean, Int32, Boolean)

Creates an NgramHashingEstimator, which takes the data from the multiple columns specified in inputColumnNames, writes it to a new column, outputColumnName, and produces a vector of counts of hashed n-grams.

ProduceHashedNgrams(TransformsCatalog+TextTransforms, String, String, Int32, Int32, Int32, Boolean, UInt32, Boolean, Int32, Boolean)

Creates an NgramHashingEstimator, which copies the data from the column specified in inputColumnName to a new column, outputColumnName, and produces a vector of counts of hashed n-grams.

public static Microsoft.ML.Transforms.Text.NgramHashingEstimator ProduceHashedNgrams (this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, string inputColumnName = default, int numberOfBits = 16, int ngramLength = 2, int skipLength = 0, bool useAllLengths = true, uint seed = 314489979, bool useOrderedHashing = true, int maximumNumberOfInverts = 0, bool rehashUnigrams = false);
static member ProduceHashedNgrams : Microsoft.ML.TransformsCatalog.TextTransforms * string * string * int * int * int * bool * uint32 * bool * int * bool -> Microsoft.ML.Transforms.Text.NgramHashingEstimator
<Extension()>
Public Function ProduceHashedNgrams (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, Optional inputColumnName As String = Nothing, Optional numberOfBits As Integer = 16, Optional ngramLength As Integer = 2, Optional skipLength As Integer = 0, Optional useAllLengths As Boolean = true, Optional seed As UInteger = 314489979, Optional useOrderedHashing As Boolean = true, Optional maximumNumberOfInverts As Integer = 0, Optional rehashUnigrams As Boolean = false) As NgramHashingEstimator

Parameters

catalog
TransformsCatalog.TextTransforms

The transform's catalog.

outputColumnName
String

Name of the column resulting from the transformation of inputColumnName. This column's data type will be a vector of Single.

inputColumnName
String

Name of the column to copy the data from. This estimator operates over a vector of key type.

numberOfBits
Int32

Number of bits to hash into. Must be between 1 and 30, inclusive.

ngramLength
Int32

N-gram length: the number of tokens in each n-gram, or the maximum number when useAllLengths is true.

skipLength
Int32

Maximum number of tokens to skip when constructing an n-gram.

useAllLengths
Boolean

Whether to include all n-gram lengths up to ngramLength or only ngramLength.

seed
UInt32

Hashing seed.

useOrderedHashing
Boolean

Whether the position of each source column should be included in the hash (when there are multiple source columns).

maximumNumberOfInverts
Int32

During hashing, mappings are constructed between the original values and the produced hash values. The text representations of the original values are stored in the slot names of the annotations for the new column. Because hashing can map many initial values to one hash, maximumNumberOfInverts specifies the upper bound on the number of distinct input values mapping to a hash that should be retained. 0 does not retain any input values. -1 retains all input values mapping to each hash.

rehashUnigrams
Boolean

Whether to rehash unigrams.

Returns

NgramHashingEstimator

Remarks

NgramHashingEstimator differs from WordHashBagEstimator in that NgramHashingEstimator takes tokenized text as input, while WordHashBagEstimator tokenizes text internally.
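
The interplay of ngramLength, useAllLengths, and numberOfBits can be illustrated without ML.NET. The following is a minimal sketch in plain C#, not the library's implementation: Ngrams and SlotCount are hypothetical helpers, skipLength is assumed to be 0, and the order in which n-grams are listed here is an arbitrary choice for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class NgramSketch
{
    // Hypothetical helper, not an ML.NET API: lists the contiguous n-grams
    // (skipLength = 0) of one tokenized row, joined with '|'.
    public static List<string> Ngrams(
        IReadOnlyList<string> tokens, int ngramLength, bool useAllLengths)
    {
        var result = new List<string>();

        // With useAllLengths, every length from 1 up to ngramLength is used;
        // otherwise only n-grams of exactly ngramLength tokens are listed.
        int minLength = useAllLengths ? 1 : ngramLength;

        for (int n = minLength; n <= ngramLength; n++)
        {
            for (int start = 0; start + n <= tokens.Count; start++)
            {
                var parts = new string[n];
                for (int k = 0; k < n; k++)
                    parts[k] = tokens[start + k];
                result.Add(string.Join("|", parts));
            }
        }

        return result;
    }

    // Hypothetical helper: hashing into numberOfBits bits gives
    // 2^numberOfBits slots in the output vector.
    public static int SlotCount(int numberOfBits)
    {
        if (numberOfBits < 1 || numberOfBits > 30)
            throw new ArgumentOutOfRangeException(nameof(numberOfBits));
        return 1 << numberOfBits;
    }
}
```

For the tokens "This", "is", "an", "example" and ngramLength 2, Ngrams returns seven entries when useAllLengths is true (four unigrams and three bigrams) and three entries when it is false. SlotCount(5) is 32 and SlotCount(16), the default, is 65536, so many distinct n-grams can share a slot.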

ProduceHashedNgrams(TransformsCatalog+TextTransforms, String, String[], Int32, Int32, Int32, Boolean, UInt32, Boolean, Int32, Boolean)

Creates an NgramHashingEstimator, which takes the data from the multiple columns specified in inputColumnNames, writes it to a new column, outputColumnName, and produces a vector of counts of hashed n-grams.

public static Microsoft.ML.Transforms.Text.NgramHashingEstimator ProduceHashedNgrams (this Microsoft.ML.TransformsCatalog.TextTransforms catalog, string outputColumnName, string[] inputColumnNames = default, int numberOfBits = 16, int ngramLength = 2, int skipLength = 0, bool useAllLengths = true, uint seed = 314489979, bool useOrderedHashing = true, int maximumNumberOfInverts = 0, bool rehashUnigrams = false);
static member ProduceHashedNgrams : Microsoft.ML.TransformsCatalog.TextTransforms * string * string[] * int * int * int * bool * uint32 * bool * int * bool -> Microsoft.ML.Transforms.Text.NgramHashingEstimator
<Extension()>
Public Function ProduceHashedNgrams (catalog As TransformsCatalog.TextTransforms, outputColumnName As String, Optional inputColumnNames As String() = Nothing, Optional numberOfBits As Integer = 16, Optional ngramLength As Integer = 2, Optional skipLength As Integer = 0, Optional useAllLengths As Boolean = true, Optional seed As UInteger = 314489979, Optional useOrderedHashing As Boolean = true, Optional maximumNumberOfInverts As Integer = 0, Optional rehashUnigrams As Boolean = false) As NgramHashingEstimator

Parameters

catalog
TransformsCatalog.TextTransforms

The transform's catalog.

outputColumnName
String

Name of the column resulting from the transformation of inputColumnNames. This column's data type will be a known-size vector of Single.

inputColumnNames
String[]

Names of the columns to take the data from. This estimator operates over a vector of key type.

numberOfBits
Int32

Number of bits to hash into. Must be between 1 and 30, inclusive.

ngramLength
Int32

N-gram length: the number of tokens in each n-gram, or the maximum number when useAllLengths is true.

skipLength
Int32

Maximum number of tokens to skip when constructing an n-gram.

useAllLengths
Boolean

Whether to include all n-gram lengths up to ngramLength or only ngramLength.

seed
UInt32

Hashing seed.

useOrderedHashing
Boolean

Whether the position of each source column should be included in the hash (when there are multiple source columns).

maximumNumberOfInverts
Int32

During hashing, mappings are constructed between the original values and the produced hash values. The text representations of the original values are stored in the slot names of the annotations for the new column. Because hashing can map many initial values to one hash, maximumNumberOfInverts specifies the upper bound on the number of distinct input values mapping to a hash that should be retained. 0 does not retain any input values. -1 retains all input values mapping to each hash.

rehashUnigrams
Boolean

Whether to rehash unigrams.

Returns

NgramHashingEstimator

Examples

using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace Samples.Dynamic
{
    public static class ProduceHashedNgrams
    {
        public static void Example()
        {
            // Create a new ML context, for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a small dataset as an IEnumerable.
            var samples = new List<TextData>()
            {
                new TextData(){ Text = "This is an example to compute n-grams " +
                "using hashing." },

                new TextData(){ Text = "N-gram is a sequence of 'N' consecutive" +
                " words/tokens." },

                new TextData(){ Text = "ML.NET's ProduceHashedNgrams API " +
                "produces count of n-grams and hashes it as an index into a " +
                "vector of given bit length." },

                new TextData(){ Text = "The hashing reduces the size of the " +
                "output feature vector" },

                new TextData(){ Text = "which is useful in case when number of " +
                "n-grams is very large." },
            };

            // Convert training data to IDataView.
            var dataview = mlContext.Data.LoadFromEnumerable(samples);

            // A pipeline for converting text into numeric hashed n-gram features.
            // 'ProduceHashedNgrams' requires tokenized text mapped to key values
            // as input. This is achieved by calling 'TokenizeIntoWords' and
            // 'MapValueToKey' first, followed by 'ProduceHashedNgrams'.
            // Note that the length of the output feature vector depends on the
            // 'numberOfBits' setting.
            var textPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens",
                "Text")
                .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
                .Append(mlContext.Transforms.Text.ProduceHashedNgrams(
                    "NgramFeatures", "Tokens",
                    numberOfBits: 5,
                    ngramLength: 3,
                    useAllLengths: false,
                    maximumNumberOfInverts: 1));

            // Fit to data.
            var textTransformer = textPipeline.Fit(dataview);
            var transformedDataView = textTransformer.Transform(dataview);

            // Create the prediction engine to get the features extracted from the
            // text.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData,
                TransformedTextData>(textTransformer);

            // Convert the text into numeric features.
            var prediction = predictionEngine.Predict(samples[0]);

            // Print the length of the feature vector.
            Console.WriteLine("Number of Features: " + prediction.NgramFeatures
                .Length);

            // Preview of the produced n-grams.
            // Get the slot names from the column's metadata.
            // The slot names of a vector column are the names associated with
            // each position in the vector.
            VBuffer<ReadOnlyMemory<char>> slotNames = default;
            transformedDataView.Schema["NgramFeatures"].GetSlotNames(ref slotNames);
            var NgramFeaturesColumn = transformedDataView.GetColumn<VBuffer<float>>(
                transformedDataView.Schema["NgramFeatures"]);

            var slots = slotNames.GetValues();
            Console.Write("N-grams: ");
            foreach (var featureRow in NgramFeaturesColumn)
            {
                foreach (var item in featureRow.Items())
                    Console.Write($"{slots[item.Key]}  ");
                Console.WriteLine();
            }

            // Print the first 10 feature values.
            Console.Write("Features: ");
            for (int i = 0; i < 10; i++)
                Console.Write($"{prediction.NgramFeatures[i]:F4}  ");

            //  Expected output:
            //   Number of Features:  32
            //   N-grams:   This|is|an  example|to|compute  compute|n-grams|using  n-grams|using|hashing.  an|example|to  is|an|example  a|sequence|of  of|'N'|consecutive  is|a|sequence  N-gram|is|a  ...
            //   Features:    0.0000          0.0000               2.0000               0.0000               0.0000        1.0000          0.0000        0.0000              1.0000          0.0000  ...
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public float[] NgramFeatures { get; set; }
        }
    }
}

Remarks

NgramHashingEstimator differs from WordHashBagEstimator in that NgramHashingEstimator takes tokenized text as input, while WordHashBagEstimator tokenizes text internally.
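
The sample above passes a single input column. A sketch of the multi-column form, assuming each input column is tokenized and mapped to keys separately before hashing, might look like the following. The Article and ArticleFeatures classes, the column names, and the Run method are hypothetical and exist only for illustration; the sketch has not been checked against a specific ML.NET version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

public static class MultiColumnHashedNgrams
{
    // Hypothetical input row with two text fields.
    private class Article
    {
        public string Title { get; set; }
        public string Body { get; set; }
    }

    // Hypothetical output row; only the hashed n-gram vector is read back.
    private class ArticleFeatures
    {
        public float[] NgramFeatures { get; set; }
    }

    // Builds the pipeline and returns the length of the feature vector.
    public static int Run()
    {
        var mlContext = new MLContext();

        var samples = new List<Article>
        {
            new Article { Title = "Hashed n-grams",
                Body = "Hashing keeps the vector small." },
            new Article { Title = "Tokenized text",
                Body = "Each column is tokenized first." },
        };
        var data = mlContext.Data.LoadFromEnumerable(samples);

        // Tokenize each text column, map the tokens to keys, then hash the
        // n-grams of both token columns into one output vector.
        var pipeline = mlContext.Transforms.Text
            .TokenizeIntoWords("TitleTokens", "Title")
            .Append(mlContext.Transforms.Text
                .TokenizeIntoWords("BodyTokens", "Body"))
            .Append(mlContext.Transforms.Conversion
                .MapValueToKey("TitleTokens"))
            .Append(mlContext.Transforms.Conversion
                .MapValueToKey("BodyTokens"))
            .Append(mlContext.Transforms.Text.ProduceHashedNgrams(
                "NgramFeatures",
                new[] { "TitleTokens", "BodyTokens" },
                numberOfBits: 5,
                ngramLength: 2));

        var transformed = pipeline.Fit(data).Transform(data);

        // Read the rows back as plain objects and inspect the first one.
        var rows = mlContext.Data.CreateEnumerable<ArticleFeatures>(
            transformed, reuseRowObject: false);
        return rows.First().NgramFeatures.Length;
    }
}
```

With useOrderedHashing left at its default of true, the position of each source column is included in the hash, so the same n-gram in the title and in the body is hashed differently. Setting it to false would hash such n-grams identically regardless of the column they came from.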
