DataSchema Class
Defines a schema for a dataset.
Inheritance
builtins.object -> DataSchema
Constructor
DataSchema(schema, **options)
Examples
import numpy as np
import pandas as pd
from nimbusml import DataSchema, FileDataStream, Pipeline
from nimbusml.ensemble import LightGbmRegressor
from nimbusml.feature_extraction.categorical import OneHotVectorizer

# Write a small dataset to disk so the schema can be inferred from a file.
data = pd.DataFrame(dict(real=[0.1, 2.2],
                         text=['word', 'class'],
                         y=[1, 3]))
data.to_csv('data.csv', index=False, header=True)

# Infer the schema, mapping every numeric column to float32 (R4).
schema = DataSchema.read_schema('data.csv', collapse=False,
                                numeric_dtype=np.float32,
                                sep=',')
print(schema)
# col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,

# Train a pipeline on the file, using the inferred schema and 'y' as the label.
exp = Pipeline([
    OneHotVectorizer(columns=['text']),
    LightGbmRegressor(minimum_example_count_per_leaf=1)
])
exp.fit(FileDataStream('data.csv', schema=schema), 'y')
Remarks
The DataSchema class automatically generates a description of the data schema from various data sources. The data source may be a list, an array, a dataframe, or a file. A schema is required for all nimbusml trainers and transforms; when it is not provided explicitly, it must be inferred automatically before any data processing can occur. For lists, arrays, and dataframes, schema inference is usually straightforward. When the data source is a file, the inferred schema may need further inspection to ensure that it matches the data and that the types are aligned as needed (e.g. R4 vs I4).
For more details on the schema format, refer to Schema, Types and Vector Type.
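The schema string printed in the example above (`col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,`) appears to consist of `col=name:type:index` entries followed by `key=value` options. The following is a minimal sketch of reading such a string with the standard library only. It is not the nimbusml parser: `parse_schema_string` is a hypothetical helper, and it assumes column names contain no spaces or colons.

```python
def parse_schema_string(text):
    """Split a schema string into column entries and options (illustrative only)."""
    columns, options = [], {}
    for token in text.split():
        key, _, value = token.partition("=")
        if key == "col":
            # Assumed layout: name:type:index, as in the printed example.
            name, kind, position = value.split(":")
            columns.append({"name": name, "type": kind, "position": position})
        else:
            options[key] = value
    return columns, options


columns, options = parse_schema_string(
    "col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,"
)
print(columns)
print(options)
```

Reading the string this way makes it easy to check, for instance, that a label column was inferred as R4 rather than TX before training.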
Methods
Name | Description |
---|---|
clone | |
extract_idv_schema_from_file | |
format_options | Formats the options for the parser from the core library. |
parse | Parses a schema defined as a string. |
read_schema | Infers the schema of a data view. |
read_schema_file | Infers the schema of a file. |
rename | Renames a column. |
to_string | Converts the schema into a string. |
clone
clone()
extract_idv_schema_from_file
extract_idv_schema_from_file(path)
Parameters
Name | Description |
---|---|
path (Required) | |
format_options
Formats the options for the parser from the core library.
format_options(add_sep=False)
Parameters
Name | Description |
---|---|
add_sep | The core library usually requires the separator. It is not added if the user does not explicitly specify it, unless add_sep is True, in which case the default value is added. Default value: False |
Returns
Type | Description |
---|---|
formatted options as a string |
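The behaviour of add_sep described above can be pictured with a small stand-alone sketch. This is not the nimbusml implementation: `format_options_sketch` and its `default_sep` argument are hypothetical, and the actual default separator used by the library is not stated in this page, so a comma is assumed here purely for illustration.

```python
def format_options_sketch(options, add_sep=False, default_sep=","):
    """Format options as 'key=value' pairs (illustrative stand-in only)."""
    opts = dict(options)
    # The separator is only injected when the caller asks for it
    # and did not already specify one.
    if add_sep and "sep" not in opts:
        opts["sep"] = default_sep  # assumed default, not confirmed by the source
    return " ".join(f"{key}={value}" for key, value in opts.items())


without_sep = format_options_sketch({"header": "+"})
with_default = format_options_sketch({"header": "+"}, add_sep=True)
user_sep = format_options_sketch({"header": "+", "sep": ";"}, add_sep=True)
print(without_sep)
print(with_default)
print(user_sep)
```

A separator given by the user is always kept; add_sep only matters when none was given.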
parse
Parses a schema defined as a string.
parse(schema)
Parameters
Name | Description |
---|---|
schema (Required) | |
read_schema
Infers the schema of a data view.
read_schema(*data, **options)
Parameters
Name | Description |
---|---|
data (Required) | features, labels, weights, groups |
collapse | (False by default) collapse columns of the same type if it follows the read_csv function. Uses the internal structure of a dataframe. If collapse == 'all', the method collapses all columns not specified in parameter names. |
sep | string value of the file separator character (for example: ',') |
header | whether the data has a header row; defaults to True |
dtype | change the dtype of specific columns; takes a dictionary of column names mapped to the desired dtype |
numeric_dtype | if not None, changes all numeric types into this type |
names | specify new names for columns; takes a dictionary of column index mapped to the desired name |
ind | first column index (in case DataFrames are concatenated) |
tool | 'pandas' or 'nimbusml' |
Returns
Type | Description |
---|---|
schema as a string |
read_schema_file
Infers the schema of a file.
Additional options:
- collapse: (False by default) collapse columns of the same type if it follows the read_csv function. Uses the internal structure of a dataframe. If collapse == 'all', the method collapses all columns not specified in parameter names.
- numeric_dtype: if not None, changes all numeric types into this type
read_schema_file(filepath_or_buffer, tool='pandas', nrows=100, **options)
Parameters
Name | Description |
---|---|
filepath_or_buffer (Required) | stream or filename |
tool | 'pandas' or 'nimbusml'. Default value: pandas |
nrows | use only the first nrows rows of the file. Default value: 100 |
options | additional options for read_csv from pandas or the internal reader |
Returns
Type | Description |
---|---|
schema |
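To make the idea of inferring a schema from the first nrows rows of a file more concrete, here is a rough sketch built on the standard csv module. It is an assumption-laden illustration, not how nimbusml or pandas infers types: `infer_schema_from_csv` is a hypothetical helper, every numeric column is mapped to R4 (mirroring numeric_dtype = np.float32 in the example at the top of the page), and everything else is mapped to TX.

```python
import csv
import io
from itertools import islice


def infer_schema_from_csv(stream, nrows=100, sep=","):
    """Guess column types from the first nrows data rows (illustrative only)."""
    reader = csv.reader(stream, delimiter=sep)
    header = next(reader)                 # assumes the file has a header row
    sample = list(islice(reader, nrows))  # inspect only the top rows

    def is_number(value):
        try:
            float(value)
            return True
        except ValueError:
            return False

    parts = []
    for index, name in enumerate(header):
        values = [row[index] for row in sample]
        # Assumed mapping: numeric -> R4, anything else -> TX.
        kind = "R4" if values and all(is_number(v) for v in values) else "TX"
        parts.append(f"col={name}:{kind}:{index}")
    return " ".join(parts + ["header=+", f"sep={sep}"])


csv_text = "real,text,y\n0.1,word,1\n2.2,class,3\n"
schema_text = infer_schema_from_csv(io.StringIO(csv_text))
print(schema_text)
```

Because only a sample of rows is inspected, a column that looks numeric at the top of a file may contain text further down, which is one reason a schema inferred from a file is worth checking by hand.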
rename
Renames a column.
rename(old_name, new_name)
Parameters
Name | Description |
---|---|
old_name (Required) | current name of the column |
new_name (Required) | new name of the column |
Returns
Type | Description |
---|---|
self |
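Since rename returns self, calls can presumably be chained. The sketch below shows that pattern on a small stand-in class. `TinySchema` is hypothetical and only mimics the documented signature; in particular, raising KeyError for an unknown column is an assumption, not documented nimbusml behaviour.

```python
class TinySchema:
    """Hypothetical stand-in used for illustration; not the nimbusml class."""

    def __init__(self, columns):
        self.columns = list(columns)  # list of (name, type) pairs

    def rename(self, old_name, new_name):
        names = [name for name, _ in self.columns]
        if old_name not in names:
            # Assumed behaviour for a missing column.
            raise KeyError(old_name)
        self.columns = [
            (new_name if name == old_name else name, kind)
            for name, kind in self.columns
        ]
        return self  # returning self is what allows chaining

    def to_string(self):
        return " ".join(
            f"col={name}:{kind}:{index}"
            for index, (name, kind) in enumerate(self.columns)
        )


schema = TinySchema([("real", "R4"), ("text", "TX"), ("y", "R4")])
result = schema.rename("y", "label").rename("text", "word").to_string()
print(result)
```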
to_string
Converts the schema into a string.
to_string(add_sep=False)
Parameters
Name | Description |
---|---|
add_sep | The separator is not added if the user does not specify it. Because the core library requires it, when add_sep is True the method adds the default value if none was specified. Default value: False |
Returns
Type | Description |
---|---|
formatted schema as a string |