SvmLightLoader Class
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
This loader attempts to read data in a format close to the SVM-light format, the goal being that the majority of SVM-light formatted data should be interpretable by this loader.
public sealed class SvmLightLoader : Microsoft.ML.IDataLoader<Microsoft.ML.Data.IMultiStreamSource>
type SvmLightLoader = class
interface IDataLoader<IMultiStreamSource>
interface ICanSaveModel
Public NotInheritable Class SvmLightLoader
Implements IDataLoader(Of IMultiStreamSource)
Inheritance
Object → SvmLightLoader
Implements
IDataLoader<IMultiStreamSource>, ICanSaveModel
Remarks
This loader's behavior may also differ from SVM-light's parsing behavior in the following general ways:
- In an IDataView, vectors are required to have a logical length, and for practical reasons it is helpful if the output of this loader has a fixed-length vector type, since few estimators and no basic trainer estimators accept features of a variable-length vector type. SVM-light has no such concept.
- The IDataView idiom has different behavior with respect to parse errors.
- The SVM-light format has some restrictions that would be unnatural to enforce in the context of this loader.
- Some common "extensions" of this format that have appeared over the years are accommodated where sensible, often enabled by specifying some options.
The SVM-light format can be summarized as follows. An SVM-light file can lead with any number of lines starting with '#'. These are discarded. Each remaining line has the form:
{label} {key}:{value} {key}:{value} ... {key}:{value}[#{comment}]
Lines are not whitespace trimmed, though whitespace within the line, prior to the # comment character (if any), is ignored. SVM-light itself uses the standard C "isspace" function, while we respect only space and tab as whitespace. So, the spaces in the line above could be, say, tabs, and there could even be several of them in sequence. Unlike the text loader's format, for instance, there is no concept of a "blank" field having any status.
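The whitespace and comment rules above can be sketched in a few lines. This is an illustration of the format only, written in Python for brevity; it is not this loader's implementation. The function name `split_svmlight_line` is hypothetical, and the sketch assumes that everything from the first '#' onward is the comment and that empty tokens produced by repeated separators are simply dropped.

```python
import re

def split_svmlight_line(line):
    """Return (tokens, comment) for one data line.

    Assumptions of this sketch: only ' ' and '\t' separate tokens
    (not the wider set accepted by C's isspace), and the comment is
    everything after the first '#'.
    """
    # Drop only the line terminator; other characters are left alone.
    body, sep, comment = line.rstrip("\r\n").partition("#")
    # Runs of spaces and tabs act as a single separator.
    tokens = [t for t in re.split(r"[ \t]+", body) if t]
    return tokens, (comment if sep else None)

tokens, comment = split_svmlight_line("1 \t3:0.5\t\t7:2 # first row\n")
print(tokens)   # label followed by key:value tokens
print(comment)  # text after '#', or None when there is no comment
```

Note that a vertical tab or form feed would stay inside a token here, which mirrors the "space and tab only" rule rather than the "isspace" behavior.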
The feature vector is specified through a series of key/value pairs. SVM-light requires that the keys be positive, increasing integers, except for three special keys: cost (which we interpret as Weight), qid (which we interpret as GroupId), and sid (which we ignore, but might present as a column in the future if any of our learners implement anything resembling slack id). The value for 'cost' is a float, the value for 'qid' is a long, and the value for 'sid' is a long that must be positive. If these keys are specified multiple times, the last one wins.
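As a rough illustration of how the special keys separate from ordinary feature keys, the following Python sketch walks a list of key:value tokens. The function name `parse_pairs` and its return shape are hypothetical. The sketch does not validate the 'sid' value or the sign of feature keys, and it lets the last occurrence of a repeated feature key win, which is an assumption: the text above states the last-one-wins rule only for the special keys.

```python
def parse_pairs(tokens):
    """Split key:value tokens into (features, weight, group_id).

    'cost' maps to the weight, 'qid' to the group id, and 'sid' is
    skipped. Later occurrences of a special key overwrite earlier ones.
    """
    features = {}
    weight = None
    group_id = None
    for token in tokens:
        key, _, value = token.partition(":")
        if key == "cost":
            weight = float(value)      # last 'cost' wins
        elif key == "qid":
            group_id = int(value)      # last 'qid' wins
        elif key == "sid":
            continue                   # ignored; not validated in this sketch
        else:
            features[int(key)] = float(value)
    return features, weight, group_id

features, weight, group_id = parse_pairs(
    ["1:0.5", "cost:2", "qid:7", "cost:3.5", "sid:4", "3:1"]
)
print(features, weight, group_id)
```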
If the tail of a value is not interpretable as a number, SVM-light ignores the tail. For example, "5:3.14hello" is interpreted the same as "5:3.14". This loader does not support this syntax.
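The difference between the two policies can be sketched as follows. Both functions are hypothetical illustrations in Python: `parse_lenient` models the tail-ignoring behavior described for SVM-light by reading the longest numeric prefix, and `parse_strict` models a parser that rejects any trailing text. The regular expression is a simplification chosen for this sketch, not either tool's actual number grammar.

```python
import re

# Simplified decimal number pattern, used only for this illustration.
_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")

def parse_lenient(text):
    """Parse the longest numeric prefix and ignore whatever follows."""
    match = _NUMBER.match(text)
    if match is None:
        raise ValueError(f"no leading number in {text!r}")
    return float(match.group(0))

def parse_strict(text):
    """Parse only if the whole string is a number."""
    if _NUMBER.fullmatch(text) is None:
        raise ValueError(f"not a number: {text!r}")
    return float(text)

print(parse_lenient("3.14hello"))  # the tail "hello" is ignored
print(parse_strict("3.14"))
try:
    parse_strict("3.14hello")
except ValueError as error:
    print("rejected:", error)
```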
We do not retain the restriction that keys must be increasing in our loader, due to the way we compose our feature vectors, but it is most efficient if this policy is still followed. If it is followed, a sort will not be required.
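The efficiency point can be illustrated with a small sketch: check in one pass whether the keys are already strictly increasing, and sort only when they are not. The function name `to_sorted_pairs` is hypothetical, and this Python fragment does not describe how the loader actually composes its feature vectors, which the text above does not specify.

```python
def to_sorted_pairs(pairs):
    """Return (pairs_in_index_order, was_sorted).

    'pairs' is a list of (index, value) tuples. The sort is skipped
    when the indices are already strictly increasing.
    """
    already_increasing = all(a[0] < b[0] for a, b in zip(pairs, pairs[1:]))
    if already_increasing:
        return list(pairs), False
    return sorted(pairs, key=lambda pair: pair[0]), True

print(to_sorted_pairs([(1, 0.5), (3, 1.0)]))  # in order, no sort needed
print(to_sorted_pairs([(3, 1.0), (1, 0.5)]))  # out of order, sorted
```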
This loader has the special option to read raw text for the keys and convert to feature indices, retaining the text key values as feature names for the resulting feature vector. The intent of this is to allow string keys, a common variant of the format, but one emphatically not allowed by the original format.
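One way to picture the text-key option is a dictionary that assigns each distinct string key a feature index and keeps the strings as feature names. The sketch below is a Python illustration only. The function name `index_text_keys` is hypothetical, and assigning 0-based indices in order of first appearance is an assumption of this sketch, not documented behavior of the loader.

```python
def index_text_keys(rows):
    """Map string keys to dense feature indices.

    'rows' is a list of rows, each a list of (text_key, value) pairs.
    Returns the rows with keys replaced by indices, plus the list of
    feature names where position i holds the name of feature i.
    """
    key_to_index = {}
    indexed_rows = []
    for row in rows:
        indexed = []
        for key, value in row:
            # Assign the next free index the first time a key is seen.
            index = key_to_index.setdefault(key, len(key_to_index))
            indexed.append((index, value))
        indexed_rows.append(indexed)
    feature_names = list(key_to_index)  # dicts preserve insertion order
    return indexed_rows, feature_names

rows = [[("age", 3.0), ("height", 1.5)], [("height", 2.0), ("weight", 7.0)]]
indexed_rows, feature_names = index_text_keys(rows)
print(indexed_rows)
print(feature_names)
```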
Methods
GetOutputSchema()
Load(IMultiStreamSource)
Explicit Interface Implementations
ICanSaveModel.Save(ModelSaveContext)
Extension Methods
Preview<TSource>(IDataLoader<TSource>, TSource, Int32)
Preview an effect of the
Append<TSource,TTrans>(IDataLoader<TSource>, TTrans)
Create a new composite loader, by appending a transformer to this data loader.
Append<TSource,TTrans>(IDataLoader<TSource>, IEstimator<TTrans>)
Create a new composite loader estimator, by appending an estimator to this data loader.