SvmLightLoader Class

Definition

This loader attempts to read data in a format close to the SVM-light format. The goal is that the majority of SVM-light formatted data should be interpretable by this loader.

public sealed class SvmLightLoader : Microsoft.ML.IDataLoader<Microsoft.ML.Data.IMultiStreamSource>
type SvmLightLoader = class
    interface IDataLoader<IMultiStreamSource>
    interface ICanSaveModel
Public NotInheritable Class SvmLightLoader
Implements IDataLoader(Of IMultiStreamSource)
Inheritance
SvmLightLoader
Implements
ICanSaveModel, IDataLoader<IMultiStreamSource>

Remarks

This loader's behavior may also differ from SVM-light's parsing behavior in the following general ways:

  1. In an IDataView, vectors are required to have a logical length, and for practical reasons it is helpful if the output of this loader has a fixed-length vector type, since few estimators and no basic trainer estimators accept features of a variable-length vector type. SVM-light has no such concept.
  2. The IDataView idiom has different behavior with respect to parse errors.
  3. The SVM-light format has some restrictions that would be unnatural to enforce in this loader.
  4. Some common "extensions" of the format that have emerged over the years are accommodated where sensible, often by specifying options.

The SVM-light format can be summarized as follows. An SVM-light file can begin with any number of lines starting with '#'; these are discarded. Each remaining line has the form:

    {label} {key}:{value} {key}:{value} ... {key}:{value}[#{comment}]

Lines are not whitespace trimmed, though whitespace within the line, prior to the # comment character (if any), is ignored. SVM-light itself uses the standard C "isspace" function, while we respect only space and tab as whitespace. So the spaces in the line above could be, say, tabs, and there could even be several of them in sequence. Unlike the text loader's format, for instance, there is no concept of a "blank" field having any status.
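The line layout described above can be sketched in a few lines of Python. This is only an illustration of the format, not the loader's implementation: the function name split_svmlight_line is hypothetical, and dropping empty tokens (which effectively trims the line) is a simplifying assumption of the sketch.

```python
import re

def split_svmlight_line(line):
    """Split one line into (label, [(key, value), ...], comment).

    Returns None for lines with no content before '#', which covers
    the leading comment lines that are discarded.
    """
    # Everything after the first '#' is the comment.
    body, sep, comment = line.rstrip("\r\n").partition("#")
    # Only space and tab are separators; a run of them acts as one.
    # Dropping empty tokens is a simplification made by this sketch.
    tokens = [t for t in re.split(r"[ \t]+", body) if t]
    if not tokens:
        return None
    label = float(tokens[0])
    pairs = []
    for tok in tokens[1:]:
        key, _, value = tok.partition(":")
        pairs.append((key, value))  # kept as raw text at this stage
    return label, pairs, (comment if sep else None)
```

Keys and values are left as text here so that later steps can decide how to interpret them.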

The feature vector is specified through a series of key/value pairs. SVM-light requires that the keys be positive, increasing integers, except for three special keys: cost (we interpret as Weight), qid (we interpret as GroupId) and sid (we ignore these, but might present them as a column in the future if any of our learners implement anything resembling slack id). The value for 'cost' is float, 'qid' is a long, and 'sid' is a long that must be positive. If these keys are specified multiple times, the last one wins.
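As a rough illustration of the special-key handling described above, the following Python sketch separates cost, qid, and sid from ordinary feature keys. The function name split_special_keys is hypothetical, and using None when a special key is absent is an assumption of the sketch, not documented loader behavior.

```python
def split_special_keys(pairs):
    """Separate special keys from numeric feature keys.

    `pairs` is a list of (key, value) strings from one line.
    Returns (features, weight, group_id).
    """
    features = []
    weight = None    # assumption: None when 'cost' is absent
    group_id = None  # assumption: None when 'qid' is absent
    for key, value in pairs:
        if key == "cost":
            weight = float(value)   # repeated keys: the last one wins
        elif key == "qid":
            group_id = int(value)   # repeated keys: the last one wins
        elif key == "sid":
            continue                # recognized but ignored
        else:
            features.append((int(key), float(value)))
    return features, weight, group_id
```

Because each assignment overwrites the previous one, a key that appears several times on a line ends up with its last value.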

If the tail of a value is not interpretable as a number, SVM-light ignores the tail. For example, "5:3.14hello" is interpreted the same as "5:3.14". This loader does not support this syntax.
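The difference between the two parsing styles can be sketched in Python as follows. Both function names are hypothetical, the regular expression is only an approximation of a C-style numeric prefix, and returning None from the strict parser is an assumption: the text above does not say what the loader does with such a value.

```python
import re

# Approximate pattern for the longest leading decimal number.
_NUMERIC_PREFIX = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")

def parse_lenient(text):
    """Tail-ignoring style: use the numeric prefix, drop the rest."""
    match = _NUMERIC_PREFIX.match(text)
    return float(match.group(0)) if match else None

def parse_strict(text):
    """Whole-token style: the entire text must be a number."""
    try:
        return float(text)
    except ValueError:
        return None  # assumption: failure is signalled with None
```

With these definitions, "3.14hello" yields 3.14 under the lenient parser but is rejected by the strict one.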

Because of the way we compose our feature vectors, our loader does not retain the restriction that keys must be increasing, but loading is most efficient if this policy is still followed: when it is, no sort is required.
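The "sort only when needed" idea can be sketched as below. This is a hypothetical helper, not the loader's code, and it does not address how duplicate keys are resolved, since that is not described here.

```python
def order_features(pairs):
    """Return (pairs ordered by key, whether a sort was needed).

    `pairs` is a list of (index, value) tuples from one line.
    """
    # A single linear pass detects strictly increasing keys.
    already_increasing = all(a[0] < b[0] for a, b in zip(pairs, pairs[1:]))
    if already_increasing:
        return pairs, False
    # Out-of-order keys are accepted, at the cost of a sort.
    return sorted(pairs, key=lambda kv: kv[0]), True
```

Input that already follows the SVM-light ordering rule takes the cheap path and is returned unchanged.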

This loader has the special option to read raw text for the keys and convert to feature indices, retaining the text key values as feature names for the resulting feature vector. The intent of this is to allow string keys, a common variant of the format, but one emphatically not allowed by the original format.
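One way to picture the text-key option is a dictionary that assigns each new key the next free index and remembers the key as that slot's name. The sketch below is an assumption-laden illustration: the name index_text_keys is hypothetical, and assigning indices in order of first appearance is a choice made for the example, not a documented guarantee.

```python
def index_text_keys(rows):
    """Map text keys to feature indices across all rows.

    `rows` is a list of rows, each a list of (text_key, value) pairs.
    Returns (indexed_rows, feature_names), where feature_names[i] is
    the text key assigned to index i.
    """
    index_of = {}
    feature_names = []
    indexed_rows = []
    for row in rows:
        indexed = []
        for key, value in row:
            if key not in index_of:
                # Assumption: indices follow order of first appearance.
                index_of[key] = len(feature_names)
                feature_names.append(key)
            indexed.append((index_of[key], value))
        indexed_rows.append(indexed)
    return indexed_rows, feature_names
```

The returned list of names plays the role of the feature names attached to the resulting vector.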

Methods

GetOutputSchema()
Load(IMultiStreamSource)

Explicit Interface Implementations

ICanSaveModel.Save(ModelSaveContext)

Extension Methods

Preview<TSource>(IDataLoader<TSource>, TSource, Int32)

Preview an effect of the loader on a given source.

Append<TSource,TTrans>(IDataLoader<TSource>, TTrans)

Create a new composite loader, by appending a transformer to this data loader.

Append<TSource,TTrans>(IDataLoader<TSource>, IEstimator<TTrans>)

Create a new composite loader estimator, by appending an estimator to this data loader.
