FileDataStream Class

Data view from a file.

Inheritance: nimbusml.internal.utils.data_stream.DataStream -> FileDataStream

Constructor

FileDataStream(filename, schema, roles=None)

Examples


   from nimbusml import FileDataStream
   from nimbusml import Pipeline
   from nimbusml.ensemble import LightGbmRegressor
   from nimbusml.feature_extraction.categorical import OneHotVectorizer
   import numpy as np
   import pandas as pd

   # Write a small dataset to disk so it can be streamed from a file.
   data = pd.DataFrame(dict(real=[0.1, 2.2],
                            text=['word', 'class'],
                            y=[1, 3]))
   data.to_csv('data.csv', index=False, header=True)

   # Infer the schema from the file; numeric columns are read as float32.
   ds = FileDataStream.read_csv('data.csv', collapse=False,
                                numeric_dtype=np.float32, sep=',')
   ds.head()
   #   real   text    y
   #0   0.1   word  1.0
   #1   2.2  class  3.0

   # One-hot encode the text column, then train a regressor on label 'y'.
   exp = Pipeline([
       OneHotVectorizer(columns=['text']),
       LightGbmRegressor(minimum_example_count_per_leaf=1)
   ])

   exp.fit(ds, 'y')

Remarks

FileDataStream enables training from files by streaming the examples sequentially. Some trainers require the full data matrix to be resident in memory and will cache the data if required. For trainers that implement online or batch techniques, using FileDataStream substantially reduces overall memory use. When nimbusml trainers and transforms are used together with the FileDataStream text reader, runtime efficiency also improves and data copying is minimized.
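To make the memory argument concrete, here is a minimal sketch of sequential reading that uses only the Python standard library. It is not how nimbusml implements FileDataStream; the helper `iter_batches` and the batch size are hypothetical names chosen for illustration. The point is that an online computation (here, a running mean of `y`) only ever holds one batch of rows in memory.

```python
import csv
import itertools
import os
import tempfile


def iter_batches(path, batch_size):
    """Yield lists of row dicts, reading the file sequentially (hypothetical helper)."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                return
            yield batch


count = 0
total = 0.0
largest_batch = 0

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small CSV so the sketch is self-contained.
    csv_path = os.path.join(tmp_dir, "data.csv")
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["real", "text", "y"])
        writer.writerows([[0.5, "word", 1], [1.5, "class", 3], [2.5, "word", 5]])

    # Update the statistic one batch at a time; earlier batches are discarded.
    for batch in iter_batches(csv_path, batch_size=2):
        largest_batch = max(largest_batch, len(batch))
        for row in batch:
            count += 1
            total += float(row["y"])

mean_y = total / count
print(count, mean_y, largest_batch)
```

A trainer that needs the whole matrix at once cannot work this way, which is why such trainers cache the data instead.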

A schema is required to describe the column names, positions, types and delimiters. It can be provided explicitly to FileDataStream by constructing it with the DataSchema class, or the read_csv method can be used to infer it automatically. For more control over column names and index ranges, especially for vector-type columns, the schema can be written manually.

For more details of the schema format, refer to Schema and DataSchema.
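As a rough illustration of what a manually designed schema might look like, the sketch below assembles a schema string from column names, type codes and positions. The helper `build_schema_string` is hypothetical, and the string layout (`sep=`, `col=name:type:position`, `header=+`) and the type codes `R4` and `TX` are assumptions based on the schema format described in the Schema page; check that page before relying on them.

```python
def build_schema_string(columns, sep=",", header=True):
    """Assemble a schema string (hypothetical helper; the format is an assumption).

    `columns` is a list of (name, type_code, position) tuples, where position
    is either a single index or a (first, last) tuple for a vector column.
    """
    parts = ["sep=" + sep]
    for name, type_code, position in columns:
        if isinstance(position, tuple):
            first, last = position
            index = "{}-{}".format(first, last)
        else:
            index = str(position)
        parts.append("col={}:{}:{}".format(name, type_code, index))
    parts.append("header=+" if header else "header=-")
    return " ".join(parts)


# One entry per column of the example file used earlier.
scalar_schema = build_schema_string(
    [("real", "R4", 0), ("text", "TX", 1), ("y", "R4", 2)])

# A vector column spanning positions 1 to 3.
vector_schema = build_schema_string(
    [("Label", "R4", 0), ("Features", "R4", (1, 3))])

print(scalar_schema)
print(vector_schema)
```

A string like this would then be handed to DataSchema or to the `schema` argument of the constructor; whether a plain string is accepted directly is also an assumption to verify against the DataSchema documentation.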

Methods

Name	Description
clone	Copy/clone the object.
read_csv	Creates a FileDataStream from a filename or a buffer. For more details of the schema format for a FileDataStream, refer to Schema. All the arguments that DataSchema.read_schema() uses apply to this method as well.
read_csv_pandas	Creates a FileDataStream from a filename or a buffer. The method leverages pandas read_csv to guess the schema of a file from its first nrows rows.

clone

Copy/clone the object.

clone()

read_csv

read_csv(filepath_or_buffer, tool=None, nrows=100, **kwargs)

Parameters

Name	Description
filepath_or_buffer Required	filename or stream
tool	parser used to guess the schema: `'internal'` (this module) or `'pandas'`. If None, the function chooses the most relevant one given the additional arguments passed to the function
nrows	number of rows used to guess the schema
numeric_dtype	converts all numeric types into the same one; numpy.float32 is recommended in many cases
collapse	(False by default) collapses consecutive columns of the same type into one column, using the internal structure of a dataframe. If `collapse == 'all'`, the method collapses all columns not specified in the names parameter.
sep	separator of the data columns, such as ',' or '\t'
header	whether the input data has a header; can be True or False
names	renames the data columns. Users can specify a dictionary with the column number as the key, such as {0:'Label', 1:'GroupId', (2,None):'Features'}. This renames columns 0 and 1 to Label and GroupId, and renames columns 2:end to Features_0, ..., Features_2040.
dtype	overrides the data column types. Users can specify a dictionary with the column name as the key, such as {'column1':numpy.float32}
kwargs	additional parameters sent to read_csv or the internal parser
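The names dictionary mixes single column indices with a `(start, None)` range, which is easier to follow in code. The sketch below shows one plausible way such a mapping expands into final column names. It is not nimbusml's implementation: `expand_names` is a hypothetical helper, the fallback name `c<i>` for unnamed columns is invented for illustration, and treating the end of the range as exclusive is an assumption.

```python
def expand_names(names, n_columns):
    """Expand a names mapping into one name per column (hypothetical helper)."""
    resolved = {}
    for key, name in names.items():
        if isinstance(key, tuple):
            # (start, None) means "from start to the last column".
            start, stop = key
            stop = n_columns if stop is None else stop  # assumed exclusive end
            for offset, position in enumerate(range(start, stop)):
                resolved[position] = "{}_{}".format(name, offset)
        else:
            resolved[key] = name
    # Columns not mentioned in the mapping get a placeholder name.
    return [resolved.get(i, "c{}".format(i)) for i in range(n_columns)]


columns = expand_names({0: "Label", 1: "GroupId", (2, None): "Features"}, 5)
print(columns)
```

With five columns in the file, the range entry covers positions 2 to 4 and yields three numbered Features names.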

Returns

Type	Description
	a FileDataStream instance

read_csv_pandas

Creates a FileDataStream from a filename or a buffer.

The method leverages pandas read_csv to guess the schema of a file from its first nrows rows.

read_csv_pandas(filepath_or_buffer, nrows=100, collapse=False, numeric_dtype=None, **kwargs)

Parameters

Name	Description
filepath_or_buffer Required	filename or stream
nrows	number of rows used to guess the schema
kwargs	additional parameters sent to read_csv or the internal parser
numeric_dtype	converts all numeric types into the same one
collapse	collapses all columns sharing the same type into one vector column

Returns

Type	Description
	a FileDataStream instance
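Guessing a schema from the first nrows rows has a practical consequence: a column that looks numeric in the sample may contain text further down the file. The sketch below illustrates the idea with the standard library only. It is not the parser nimbusml or pandas uses; `guess_column_types` and the labels `'numeric'` and `'text'` are hypothetical.

```python
import csv
import io
import itertools


def guess_column_types(buffer, nrows=100, sep=","):
    """Guess a type label per column from the first nrows rows (hypothetical helper)."""
    reader = csv.reader(buffer, delimiter=sep)
    header = next(reader)
    sample = list(itertools.islice(reader, nrows))

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    guessed = {}
    for index, name in enumerate(header):
        values = [row[index] for row in sample]
        all_numeric = bool(values) and all(is_number(v) for v in values)
        guessed[name] = "numeric" if all_numeric else "text"
    return guessed


# The small file from the example at the top of this page.
types = guess_column_types(io.StringIO("real,text,y\n0.1,word,1\n2.2,class,3\n"))

# The third row is text, but a sample of two rows never sees it.
mixed = "a\n1\n2\nx\n"
short_sample = guess_column_types(io.StringIO(mixed), nrows=2)
full_sample = guess_column_types(io.StringIO(mixed), nrows=100)
print(types, short_sample, full_sample)
```

If a file is known to have mixed values late in a column, raising nrows or passing an explicit dtype avoids a wrong guess.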
