DataSchema Class

Defines a schema for a dataset.

Constructor

DataSchema(schema, **options)

Examples


   from nimbusml import DataSchema, FileDataStream
   from nimbusml import Pipeline
   from nimbusml.ensemble import LightGbmRegressor
   from nimbusml.feature_extraction.categorical import OneHotVectorizer
   import numpy as np
   import pandas as pd

   data = pd.DataFrame(dict(real = [0.1, 2.2],
                           text = ['word','class'],
                           y = [1,3]))
   data.to_csv('data.csv', index = False, header = True)

   schema = DataSchema.read_schema('data.csv', collapse = False,
                                   numeric_dtype = np.float32,
                                   sep = ',')
   print(schema)
   #col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,

   exp = Pipeline([
                OneHotVectorizer(columns = ['text']),
                LightGbmRegressor(minimum_example_count_per_leaf = 1)
               ])

   exp.fit(FileDataStream('data.csv', schema = schema), 'y')

Remarks

The DataSchema class automatically generates a description of the data schema from various data sources. The data source may be a list, an array, a dataframe, or a file. A schema is required for all nimbusml trainers and transforms; when it is not provided explicitly, it must be inferred before any data processing can occur. For lists, arrays, and dataframes, schema inference is usually straightforward. When the data source is a file, the inferred schema may need further inspection to ensure that it matches the data and that the types are as intended (e.g. R4 vs I4).

For more details on the schema format, refer to Schema, Types and Vector Type.
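
To make the string form concrete, the sketch below splits the schema string printed in the example above into column entries and global options. It uses only the Python standard library; `Column` and `split_schema` are hypothetical names for this illustration and are not part of nimbusml. It assumes space-separated `key=value` tokens and column names without spaces or colons.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    kind: str      # type code such as R4 or TX
    position: str  # column index in the source, kept as text


def split_schema(text):
    """Split a schema string into column entries and global options."""
    columns, options = [], {}
    for token in text.split():
        key, _, value = token.partition("=")
        if key == "col":
            # A column token has the form col=<name>:<type>:<index>.
            name, kind, position = value.split(":")
            columns.append(Column(name, kind, position))
        else:
            # Anything else (header, sep) is treated as a global option.
            options[key] = value
    return columns, options


columns, options = split_schema(
    "col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,"
)
for column in columns:
    print(column)
print(options)
```

Reading the string this way can help when checking that an inferred schema lists the expected type code for each column.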

Methods

clone
extract_idv_schema_from_file
format_options	Formats the options for the parser from the core library.
parse	Parses a schema defined as a string.
read_schema	Infers the schema of a data view.
read_schema_file	Infers the schema of a file. Additional options: collapse: (False by default) collapses columns of the same type when the data is read with the read_csv function; uses the internal structure of a dataframe. If collapse == 'all', the method collapses all columns not specified in the parameter names. numeric_dtype: if not None, changes all numeric types into this type.
rename	Renames a column.
to_string	Converts the schema into a string.

clone

clone()

extract_idv_schema_from_file

extract_idv_schema_from_file(path)

Parameters

Name	Description
path Required

format_options

Formats the options for the parser from the core library.

format_options(add_sep=False)

Parameters

Name	Description
add_sep	The core library usually requires the separator. It is not added if the user does not explicitly specify it, unless add_sep is True, in which case the default value is added. Default value: False

Returns

Type	Description
	formatted options as a string

parse

Parses a schema defined as a string.

parse(schema)

Parameters

Name	Description
schema Required

read_schema

Infers the schema of a data view.

read_schema(*data, **options)

Parameters

Name	Description
data Required	features, labels, weights, groups
collapse Required	(False by default) collapses columns of the same type when the data is read with the read_csv function; uses the internal structure of a dataframe. If collapse == 'all', the method collapses all columns not specified in the parameter names.
sep Required	string value of the file separation character (for example: ',')
header Required	whether the data has a header row; defaults to True
dtype Required	change dtype of specific columns; takes dictionary of column names mapped to desired dtype
numeric_dtype Required	if not None, changes all numeric types into this type
names Required	specify new names for columns; takes dictionary of column index mapped to desired name
ind Required	first column index (in case DataFrames are concatenated)
tool Required	'pandas' or 'nimbusml'

Returns

Type	Description
	schema as a string

read_schema_file

Infers the schema of a file.

Additional options:

collapse: (False by default) collapses columns of the same type when the data is read with the read_csv function; uses the internal structure of a dataframe. If collapse == 'all', the method collapses all columns not specified in the parameter names.
numeric_dtype: if not None, changes all numeric types into this type

read_schema_file(filepath_or_buffer, tool='pandas', nrows=100, **options)

Parameters

Name	Description
filepath_or_buffer Required	stream or filename
tool	'pandas' or 'nimbusml' Default value: pandas
nrows	use only the first nrows rows Default value: 100
options Required	additional options for read_csv from pandas or internal reader

Returns

Type	Description
	schema
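
The idea of inferring a schema from only the first nrows rows can be sketched with the standard library alone. This is not how nimbusml implements read_schema_file; `infer_schema` and `is_number` are hypothetical helpers. The sketch assumes a header row and labels every numeric column R4, which mirrors the effect of numeric_dtype = np.float32 in the example at the top of this page.

```python
import csv
import io
from itertools import islice


def is_number(text):
    """Return True when the text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_schema(stream, nrows=100, sep=","):
    """Guess a type code per column from the first nrows data rows."""
    reader = csv.reader(stream, delimiter=sep)
    names = next(reader)                  # first row is the header
    sample = list(islice(reader, nrows))  # later rows are never read
    parts = []
    for index, name in enumerate(names):
        values = [row[index] for row in sample]
        # R4 when every sampled value parses as a number, otherwise TX.
        numeric = bool(values) and all(is_number(v) for v in values)
        parts.append(f"col={name}:{'R4' if numeric else 'TX'}:{index}")
    return " ".join(parts) + f" header=+ sep={sep}"


data = io.StringIO("real,text,y\n0.1,word,1\n2.2,class,3\n")
print(infer_schema(data))
```

Because only a sample is inspected, a column whose first rows look numeric but which later contains text would be mislabeled, which is one reason a schema inferred from a file deserves a second look.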

rename

Renames a column.

rename(old_name, new_name)

Parameters

Name	Description
old_name Required	old name
new_name Required	new name

Returns

Type	Description
	self

to_string

Converts the schema into a string.

to_string(add_sep=False)

Parameters

Name	Description
add_sep	The core library requires sep. It is not added if the user does not specify it, unless add_sep is True, in which case the method adds the default value. Default value: False

Returns

Type	Description
	formatted schema as a string
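
The add_sep behaviour described above can be illustrated with a small stand-alone sketch. `schema_to_string` is a hypothetical function, not the nimbusml method, and `DEFAULT_SEP` is an assumed value chosen only for the illustration; the actual default separator is not stated on this page.

```python
DEFAULT_SEP = ","  # hypothetical default, chosen only for this sketch


def schema_to_string(columns, options, add_sep=False):
    """Join column specs and options, adding sep only when asked to."""
    parts = list(columns)
    parts += [f"{key}={value}" for key, value in options.items()]
    # A separator given by the user is always kept; the default is
    # appended only when add_sep is True and none was given.
    if add_sep and "sep" not in options:
        parts.append(f"sep={DEFAULT_SEP}")
    return " ".join(parts)


cols = ["col=real:R4:0", "col=text:TX:1"]
print(schema_to_string(cols, {"header": "+"}))
print(schema_to_string(cols, {"header": "+"}, add_sep=True))
print(schema_to_string(cols, {"header": "+", "sep": ";"}, add_sep=True))
```

The first call leaves the separator out, the second appends the assumed default, and the third keeps the user-specified separator unchanged.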
