
HashingTF Class

Definition

A HashingTF maps a sequence of terms to their term frequencies using the hashing trick. It currently uses Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) to calculate the hash code of each term object. Because a simple modulo is used to turn the hash value into a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.
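The following Python sketch illustrates the idea described above: hash each term, take a simple modulo to get a column index, and count occurrences per column. It is only a conceptual illustration and not the Spark implementation. The helper names (`term_index`, `term_frequencies`, `bucket_sizes`) are hypothetical, and `zlib.crc32` is an assumed stand-in for MurmurHash3_x86_32, so the indices it produces will differ from those Spark computes.

```python
import zlib
from collections import Counter


def term_index(term: str, num_features: int) -> int:
    """Map a term to a column index: hash it, then apply a simple modulo.

    zlib.crc32 is a stand-in chosen only because it ships with Python.
    It is NOT MurmurHash3_x86_32, so indices differ from Spark's.
    """
    return zlib.crc32(term.encode("utf-8")) % num_features


def term_frequencies(terms, num_features):
    """Return a sparse {column index: count} mapping for one document."""
    return dict(Counter(term_index(t, num_features) for t in terms))


def bucket_sizes(hash_bits: int, num_features: int):
    """Count how many of the 2**hash_bits possible hash values fall in each column."""
    sizes = [0] * num_features
    for h in range(2 ** hash_bits):
        sizes[h % num_features] += 1
    return sizes


# "spark" appears twice, so its column holds a count of at least 2.
tf = term_frequencies(["spark", "hash", "spark"], 16)
print(tf)

# A toy 8-bit hash space makes the power-of-two advice visible.
print(bucket_sizes(8, 8))  # [32, 32, 32, 32, 32, 32, 32, 32]
print(bucket_sizes(8, 6))  # [43, 43, 43, 43, 42, 42]
```

With 8 columns every column receives the same share of the possible hash values, while with 6 columns the first four receive one more value than the last two. The imbalance is small, but it is the effect the power-of-two recommendation avoids.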

public class HashingTF : Microsoft.Spark.ML.Feature.FeatureBase<Microsoft.Spark.ML.Feature.HashingTF>
type HashingTF = class
    inherit FeatureBase<HashingTF>
Public Class HashingTF
Inherits FeatureBase(Of HashingTF)
Inheritance

Constructors

HashingTF()

Creates a HashingTF without any parameters.

HashingTF(String)

Creates a HashingTF with a UID that uniquely identifies it.

Methods

Clear(Param)

Clears any value that was previously set for this Microsoft.Spark.ML.Feature.Param. The value is reset to the default value.

(Inherited from FeatureBase<T>)
ExplainParam(Param)

Returns a description of how a specific Microsoft.Spark.ML.Feature.Param works and is currently set.

(Inherited from FeatureBase<T>)
ExplainParams()

Returns a description of how all of the Microsoft.Spark.ML.Feature.Param objects that apply to this object work and how they are currently set.

(Inherited from FeatureBase<T>)
GetBinary()

Gets the binary toggle that controls term frequency counts.

GetInputCol()

Gets the column that the HashingTF should read from.

GetNumFeatures()

Gets the number of features that should be used. Because a simple modulo is used to turn the hash value into a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.

GetOutputCol()

Gets the name of the new column that the HashingTF creates in the DataFrame.

GetParam(String)

Retrieves a Microsoft.Spark.ML.Feature.Param so that it can be used to set the value of the Microsoft.Spark.ML.Feature.Param on the object.

(Inherited from FeatureBase<T>)
Load(String)

Loads the HashingTF that was previously saved using Save.

Save(String)

Saves the object so that it can be loaded later using Load. Note that these objects can be shared with Scala by loading or saving them in Scala.

(Inherited from FeatureBase<T>)
Set(Param, Object)

Sets the value of a specific Microsoft.Spark.ML.Feature.Param.

(Inherited from FeatureBase<T>)
SetBinary(Boolean)

Sets the binary toggle that controls term frequency counts. If true, all non-zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

SetInputCol(String)

Sets the column that the HashingTF should read from.

SetNumFeatures(Int32)

Sets the number of features that should be used. Because a simple modulo is used to turn the hash value into a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.

SetOutputCol(String)

Sets the name of the new column that the HashingTF creates in the DataFrame.

ToString()

Returns the JVM toString value rather than the .NET ToString default.

(Inherited from FeatureBase<T>)
Transform(DataFrame)

Executes the HashingTF and transforms the DataFrame to include the new column containing the term frequencies.

Uid()

Gets the UID that was used to create the object. If no UID is passed in at creation, a random UID is generated.

(Inherited from FeatureBase<T>)
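As a conceptual illustration of the binary toggle described under SetBinary(Boolean), the Python sketch below clamps every non-zero count to 1 when the toggle is on. The function name `apply_binary` and the dictionary representation of a term-frequency vector are assumptions made for this example; they are not part of the Microsoft.Spark API.

```python
def apply_binary(counts, binary):
    """Return term counts with the binary toggle applied.

    counts: hypothetical sparse {column index: count} mapping.
    binary: when True, every non-zero count becomes 1;
            when False, the counts are returned unchanged (as a copy).
    """
    if not binary:
        return dict(counts)
    return {index: (1 if count != 0 else 0) for index, count in counts.items()}


counts = {3: 2, 7: 1}
print(apply_binary(counts, binary=True))   # {3: 1, 7: 1}
print(apply_binary(counts, binary=False))  # {3: 2, 7: 1}
```

With the toggle on, the output records only whether a term's column was hit, which matches models that treat term presence as a binary event.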

Applies to