DataSchema Class

Defines a schema for a dataset.

Inheritance
builtins.object
DataSchema

Constructor

DataSchema(schema, **options)

Examples


   from nimbusml import DataSchema, FileDataStream
   from nimbusml import Pipeline
   from nimbusml.ensemble import LightGbmRegressor
   from nimbusml.feature_extraction.categorical import OneHotVectorizer
   import numpy as np
   import pandas as pd

   data = pd.DataFrame(dict(real = [0.1, 2.2],
                           text = ['word','class'],
                           y = [1,3]))
   data.to_csv('data.csv', index = False, header = True)

   schema = DataSchema.read_schema('data.csv', collapse = False,
                                   numeric_dtype = np.float32,
                                   sep = ',')
   print(schema)
   #col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,

   exp = Pipeline([
                OneHotVectorizer(columns = ['text']),
                LightGbmRegressor(minimum_example_count_per_leaf = 1)
               ])

   exp.fit(FileDataStream('data.csv', schema = schema), 'y')

Remarks

The DataSchema class automatically generates a description of the data schema from various data sources. The data source may be a list, an array, a dataframe, or a file. A schema is required for all nimbusml trainers and transforms; when it is not provided explicitly, it must be inferred before any data processing can occur. For lists, arrays, and dataframes, schema inference is usually straightforward. When the data source is a file, the inferred schema may need further inspection to ensure that it matches the data and that the types are aligned as needed (e.g. R4 vs I4).

For more details on the schema format, refer to Schema, Types and Vector Type.
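Why file-based inference needs a second look can be illustrated with a small standard-library sketch. It is not nimbusml's implementation: the helper names infer_type_code and infer_schema are hypothetical, and the sketch always picks the 4-byte codes I4, R4 and TX, whereas the real inference may choose other types. It only shows how an integer-looking column ends up with an integer type unless a numeric type is forced, which is the role of numeric_dtype in the example above.

```python
import csv
import io


def infer_type_code(values, numeric_dtype=None):
    """Guess a type code for one column of raw CSV strings (illustrative only)."""
    def all_parse(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        # Integer-looking column: I4 unless a numeric type is forced.
        return numeric_dtype or "I4"
    if all_parse(float):
        return numeric_dtype or "R4"
    return "TX"


def infer_schema(csv_text, sep=",", numeric_dtype=None):
    """Build a schema string from CSV text that has a header row."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=sep))
    header, body = rows[0], rows[1:]
    parts = []
    for index, name in enumerate(header):
        column = [row[index] for row in body]
        code = infer_type_code(column, numeric_dtype)
        parts.append("col=%s:%s:%d" % (name, code, index))
    return " ".join(parts) + " header=+ sep=" + sep


csv_text = "real,text,y\n0.1,word,1\n2.2,class,3\n"
print(infer_schema(csv_text))
# col=real:R4:0 col=text:TX:1 col=y:I4:2 header=+ sep=,
print(infer_schema(csv_text, numeric_dtype="R4"))
# col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,
```

In the first call the label column y is typed as an integer because every value parses as one; forcing a single numeric type aligns it with the other numeric column.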

Methods

clone
extract_idv_schema_from_file
format_options

Formats the options for the parser from the core library.

parse

Parses a schema defined as a string.

read_schema

Infers the schema of a data view.

read_schema_file

Infers the schema of a file.

Additional options:

  • collapse: False by default. Collapses columns of the same type if it follows the read_csv function; uses the internal structure of a dataframe. If collapse == 'all', the method collapses all columns not specified in the names parameter.

  • numeric_dtype: if not None, changes all numeric types into this type

rename

Renames a column.

to_string

Converts the schema into a string.

clone

clone()

extract_idv_schema_from_file

extract_idv_schema_from_file(path)

Parameters

Name Description
path
Required

format_options

Formats the options for the parser from the core library.

format_options(add_sep=False)

Parameters

Name Description
add_sep

the core library usually requires the separator; it is not added if the user does not explicitly specify it, unless add_sep is True, in which case the default value is added.

Default value: False

Returns

Type Description

formatted options as a string

parse

Parses a schema defined as a string.

parse(schema)

Parameters

Name Description
schema
Required
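The string form shown in the example above (col=name:type:position entries followed by options such as header=+ and sep=,) can be taken apart with a few lines of plain Python. This is a minimal sketch of the format, not the library's parser: parse_schema is a hypothetical helper, and it assumes that column names and option values contain no spaces.

```python
def parse_schema(text):
    """Split a schema string into column entries and loader options."""
    columns, options = [], {}
    for token in text.split():
        key, _, value = token.partition("=")
        if key == "col":
            name, type_code, position = value.split(":")
            # The position is kept as a string; it is not interpreted here.
            columns.append({"name": name, "type": type_code, "pos": position})
        else:
            options[key] = value
    return columns, options


columns, options = parse_schema(
    "col=real:R4:0 col=text:TX:1 col=y:R4:2 header=+ sep=,")
for column in columns:
    print(column)
print(options)
```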

read_schema

Infers the schema of a data view.

read_schema(*data, **options)

Parameters

Name Description
data
Required

features, labels, weights, groups

collapse
Required

False by default. Collapses columns of the same type if it follows the read_csv function; uses the internal structure of a dataframe. If collapse == 'all', the method collapses all columns not specified in the names parameter.

sep
Required

string value of the file separator character (for example: ',')

header
Required

whether the data has a header row; defaults to True

dtype
Required

change dtype of specific columns; takes dictionary of column names mapped to desired dtype

numeric_dtype
Required

if not None, changes all numeric types into this type

names
Required

specify new names for columns; takes dictionary of column index mapped to desired name

ind
Required

first column index (in case DataFrames are concatenated)

tool
Required

'pandas' or 'nimbusml'

Returns

Type Description

schema as a string

read_schema_file

Infers the schema of a file.

Additional options:

  • collapse: False by default. Collapses columns of the same type if it follows the read_csv function; uses the internal structure of a dataframe. If collapse == 'all', the method collapses all columns not specified in the names parameter.

  • numeric_dtype: if not None, changes all numeric types into this type

read_schema_file(filepath_or_buffer, tool='pandas', nrows=100, **options)

Parameters

Name Description
filepath_or_buffer
Required

stream or filename

tool

'pandas' or 'nimbusml'

Default value: pandas
nrows

use only the first nrows rows

Default value: 100
options
Required

additional options for read_csv from pandas or internal reader

Returns

Type Description

schema

rename

Renames a column.

rename(old_name, new_name)

Parameters

Name Description
old_name
Required

old name

new_name
Required

new name

Returns

Type Description

self
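Because rename returns self, calls can be chained. The pattern is sketched below with a hypothetical stand-in class, SchemaSketch, which stores columns as (name, type) pairs; it is not the DataSchema implementation, and raising KeyError for an unknown column is an assumption of the sketch.

```python
class SchemaSketch:
    """Hypothetical stand-in for a schema: an ordered list of (name, type) pairs."""

    def __init__(self, columns):
        self.columns = list(columns)

    def rename(self, old_name, new_name):
        names = [name for name, _ in self.columns]
        if old_name not in names:
            # Assumption of this sketch: unknown columns are an error.
            raise KeyError(old_name)
        self.columns = [
            (new_name if name == old_name else name, type_code)
            for name, type_code in self.columns
        ]
        return self  # returning self is what makes chaining possible


schema = SchemaSketch([("real", "R4"), ("text", "TX"), ("y", "R4")])
schema.rename("y", "label").rename("text", "word")
print(schema.columns)
# [('real', 'R4'), ('word', 'TX'), ('label', 'R4')]
```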

to_string

Converts the schema into a string.

to_string(add_sep=False)

Parameters

Name Description
add_sep

the core library requires sep, but it is not added unless the user specifies it; if add_sep is True, the method adds the default value when none was specified.

Default value: False

Returns

Type Description

formatted schema as a string
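The effect of add_sep can be sketched as follows. The function schema_to_string and its default_sep parameter are hypothetical; in particular, the comma used as the default separator is an assumption of this sketch, not a documented value. The sketch only shows the three cases: a separator given by the user, no separator, and no separator with add_sep set.

```python
def schema_to_string(columns, sep=None, add_sep=False, default_sep=","):
    """Format (name, type) pairs as a schema string (illustrative only)."""
    parts = [
        "col=%s:%s:%d" % (name, type_code, index)
        for index, (name, type_code) in enumerate(columns)
    ]
    if sep is not None:
        # A separator given by the user is always written.
        parts.append("sep=" + sep)
    elif add_sep:
        # No user value: fall back to the (assumed) default.
        parts.append("sep=" + default_sep)
    return " ".join(parts)


columns = [("real", "R4"), ("text", "TX")]
print(schema_to_string(columns))
# col=real:R4:0 col=text:TX:1
print(schema_to_string(columns, add_sep=True))
# col=real:R4:0 col=text:TX:1 sep=,
print(schema_to_string(columns, sep=";"))
# col=real:R4:0 col=text:TX:1 sep=;
```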