FileDataStream Class
Data view from a file.
Inheritance
nimbusml.internal.utils.data_stream.DataStream → FileDataStream
Constructor
FileDataStream(filename, schema, roles=None)
Examples
from nimbusml import FileDataStream
from nimbusml import Pipeline
from nimbusml.ensemble import LightGbmRegressor
from nimbusml.feature_extraction.categorical import OneHotVectorizer
import numpy as np
import pandas as pd
# Write a small sample dataset to a CSV file.
data = pd.DataFrame(dict(real = [0.1, 2.2],
                         text = ['word', 'class'],
                         y = [1, 3]))
data.to_csv('data.csv', index = False, header = True)

# Infer the schema from the file and create the stream.
ds = FileDataStream.read_csv('data.csv', collapse = False,
                             numeric_dtype = np.float32, sep = ',')
ds.head()
#    real   text    y
# 0   0.1   word  1.0
# 1   2.2  class  3.0

# Train a pipeline directly on the stream; 'y' is the label column.
exp = Pipeline([
    OneHotVectorizer(columns = ['text']),
    LightGbmRegressor(minimum_example_count_per_leaf = 1)
    ])
exp.fit(ds, 'y')
Remarks
FileDataStream enables training from files by streaming the examples sequentially. Some trainers require the full data matrix to be resident in memory and will cache the data if required. For trainers that implement online or batch techniques, using FileDataStream substantially reduces overall memory utilization. Runtime efficiency also improves, and data copying is minimized, when nimbusml trainers and transforms are used with the FileDataStream text reader.
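As a rough illustration of why sequential streaming keeps memory use low, the sketch below uses only the Python standard library (not nimbusml) to compute a column mean in a single pass, keeping just a running value in memory. The function name streaming_mean and the in-memory sample are assumptions made for illustration; they are not part of nimbusml and do not show how FileDataStream is implemented.

```python
import csv
import io


def streaming_mean(lines, column):
    """Compute the mean of one numeric column in a single sequential pass."""
    reader = csv.DictReader(lines)
    count = 0
    mean = 0.0
    for row in reader:
        count += 1
        # Incremental update: only the running mean and count stay in memory.
        mean += (float(row[column]) - mean) / count
    return mean


# Stand-in for an open file; any iterable of text lines works.
sample = io.StringIO("real,text,y\n0.1,word,1\n2.2,class,3\n")
result = streaming_mean(sample, "y")
print(result)
```

The same loop works on a file object opened with open('data.csv', newline=''), so the full file never has to be loaded at once.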
A schema of the data is required to describe the column names, positions, types and delimiters. This can be provided explicitly to FileDataStream by using the DataSchema class to construct it, or optionally the read_csv method can be used to infer the schema automatically. For more control over column names and index ranges, especially Vector Type columns, the schema can be designed manually.
For more details of the schema format, refer to Schema and DataSchema.
Methods
Name | Description |
---|---|
clone | Copy/clone the object. |
read_csv | Creates a FileDataStream from a filename or a buffer. For more details of the schema format for a FileDataStream, refer to Schema. All the arguments that DataSchema.read_schema() uses apply to this method as well. |
read_csv_pandas | Creates a FileDataStream from a filename or a buffer. The method leverages read_csv to guess the schema of a file from its first nrows rows. |
clone
Copy/clone the object.
clone()
read_csv
Creates a FileDataStream from a filename or a buffer. For more details of the schema format for a FileDataStream, refer to Schema. All the arguments that DataSchema.read_schema() uses apply to this method as well.
read_csv(filepath_or_buffer, tool=None, nrows=100, **kwargs)
Parameters
Name | Description |
---|---|
filepath_or_buffer (required) | filename or stream |
tool | parser used to guess the schema, such as this module's own parser (default None) |
nrows | number of rows used to guess the schema (default 100) |
numeric_dtype | converts all numeric columns to the same type; numpy.float32 is recommended in many cases |
collapse | False by default. Collapses columns of the same type, following the column layout produced by read_csv (uses the internal structure of a dataframe) |
sep | separator between data columns, such as ',' or '\t' |
header | whether the input data has a header row; True or False |
names | renames the data columns. Users can specify a dictionary keyed by column number, such as {0:'Label', 1:'GroupId', (2,None):'Features'}. This renames columns 0 and 1 to Label and GroupId, and renames columns 2 through the end to Features_0, ..., Features_2040. |
dtype | overrides the data column types. Users can specify a dictionary keyed by column name, such as {'column1':numpy.float32} |
kwargs | additional parameters sent to read_csv or the internal parser |
Returns
Type | Description |
---|---|
a FileDataStream instance |
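The names dictionary mixes single column numbers with (start, end) ranges, which can be hard to picture. The sketch below is a plain-Python illustration of how such a mapping could expand into one name per column. The helper expand_names is hypothetical and not part of nimbusml, and treating a non-None end as inclusive is an assumption; only the None case ("through the last column") comes from the description of names.

```python
def expand_names(names, n_columns):
    """Expand a names mapping into one name per column position."""
    result = [None] * n_columns
    for key, name in names.items():
        if isinstance(key, tuple):
            start, end = key
            # None means "through the last column"; an explicit end is
            # treated as inclusive here, which is an assumption.
            stop = n_columns if end is None else end + 1
            for offset, index in enumerate(range(start, stop)):
                result[index] = "{}_{}".format(name, offset)
        else:
            result[key] = name
    return result


expanded = expand_names({0: 'Label', 1: 'GroupId', (2, None): 'Features'}, 5)
print(expanded)
# ['Label', 'GroupId', 'Features_0', 'Features_1', 'Features_2']
```

With a five-column file, the range entry covers columns 2, 3 and 4, and the suffix restarts at 0 within the range.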
read_csv_pandas
Creates a FileDataStream from a filename or a buffer.
The method leverages read_csv to guess the schema of a file from its first nrows rows.
read_csv_pandas(filepath_or_buffer, nrows=100, collapse=False, numeric_dtype=None, **kwargs)
Parameters
Name | Description |
---|---|
filepath_or_buffer (required) | filename or stream |
nrows | number of rows used to guess the schema (default 100) |
kwargs | additional parameters sent to read_csv or the internal parser |
numeric_dtype | converts all numeric columns to the same type (default None) |
collapse | collapses all columns sharing the same type into one vector column (default False) |
Returns
Type | Description |
---|---|
a FileDataStream instance |
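To make the idea of guessing a schema from the first nrows rows concrete, here is a minimal standard-library sketch. It is not the nimbusml or pandas inference logic: the helper guess_column_types and its two labels, 'numeric' and 'text', are assumptions chosen for illustration.

```python
import csv
import io
from itertools import islice


def guess_column_types(lines, nrows=100):
    """Guess 'numeric' or 'text' for each column from the first nrows rows."""
    reader = csv.reader(lines)
    header = next(reader)
    # Start optimistic; demote a column to text on the first non-numeric value.
    kinds = {name: "numeric" for name in header}
    for row in islice(reader, nrows):
        for name, value in zip(header, row):
            try:
                float(value)
            except ValueError:
                kinds[name] = "text"
    return kinds


sample = io.StringIO("real,text,y\n0.1,word,1\n2.2,class,3\n")
guessed = guess_column_types(sample, nrows=100)
print(guessed)
# {'real': 'numeric', 'text': 'text', 'y': 'numeric'}
```

Because only the first nrows rows are inspected, a column whose non-numeric values appear later in the file would be misclassified, which is one reason to supply an explicit schema or dtype for large files.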