Schema
The NimbusML data framework relies on a schema to understand the column names and mix of column
types in the dataset, which may originate from any of the supported Data Sources. It is
automatically inferred when a nimbusml.FileDataStream or nimbusml.DataSchema is created.
Transforms can operate on subsets of columns in the dataset and can alter the
resulting output schema, which affects other transforms downstream. It is therefore useful to
understand how NimbusML processes the data in a pipeline, both for debugging and for training
a model with nimbusml.FileDataStream.
The schema has two representations: (1) an object representation and (2) a string format.
After creating a nimbusml.FileDataStream, users can view the object representation of the
schema with the repr() function:
from nimbusml import FileDataStream
import numpy as np
import pandas as pd
data = pd.DataFrame(dict(real = [0.1, 2.2], text = ['word','class'], y = [1,3]))
data.to_csv('data.csv', index = False, header = True)
ds = FileDataStream.read_csv('data.csv', collapse = True,
                             numeric_dtype = np.float32, sep = ',')
print(repr(ds.schema))
#DataSchema([DataColumn(name='real', type='R4', pos=0),
# DataColumn(name='text', type='TX', pos=1),
# DataColumn(name='y', type='R4', pos=2)],
# header=True,
# sep=',')
The name, type and position of each column are shown, along with whether the data has a header and
which separator delimits the columns. It is always useful to examine the schema of a
nimbusml.FileDataStream before training the model.
As the example above shows, the arguments of nimbusml.FileDataStream.read_csv() are used
to modify the schema of the generated nimbusml.FileDataStream. More details about how
to modify the schema are presented in DataSchema Alterations. All the arguments
discussed in this section are also applicable for nimbusml.FileDataStream.
The schema string format looks like a series of entries shown below:
col=<name>:<type>:<position> [options]
where
col= is specified for every column in the dataset,
name is the name of the column,
position is the 0-based index (or index range) of the column(s),
type is one of the column-types. When the position is a range (i.e. start_index-end_index), the column is of VectorDataViewType.
options
header= [+-] : Specifies if there is a header present in the text file
sep= [delimiter] : the delimiter for the columns
For instance,
schema = 'sep=, col=Features:R4:0-2 col=Label:R4:3 col=Text:TX:4 header=+'
schema = 'sep=tab col=Sentiment:BL:1 col=SentimentSource:TX:2 col=SentimentText:TX:3 col=rownum:R4:4 header=+'
schema = 'sep=, col=Features:R4:0-4 col=UniqueCarrier:TX:5 col=Origin:TX:6 col=Dest:TX:7 col=Label:BL:9 header=+'
The first example indicates that the data is separated by commas and that the first three columns
(with indices 0 to 2) are grouped under the name Features with type R4, i.e. single-precision floating point.
The fourth column (with index 3) is named Label and has type R4. The fifth column (with index 4) is named Text and has type TX. The data has a header.
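To make the string format concrete, here is a minimal sketch of how such a string could be split into its parts. The helper parse_schema is hypothetical and written only for illustration; it is not part of the NimbusML API, and it assumes that entries are separated by whitespace and that the separator itself is not a space.

```python
# Hypothetical helper for illustration only; not part of the NimbusML API.
def parse_schema(schema):
    """Split a schema string into column entries and options."""
    columns, options = [], {}
    for token in schema.split():
        key, _, value = token.partition('=')
        if key == 'col':
            # Each column entry has the form <name>:<type>:<position>.
            name, col_type, position = value.split(':')
            columns.append({'name': name, 'type': col_type, 'pos': position})
        else:
            # Anything else (sep, header) is stored as an option.
            options[key] = value
    return columns, options


columns, options = parse_schema(
    'sep=, col=Features:R4:0-2 col=Label:R4:3 col=Text:TX:4 header=+')
for column in columns:
    print(column['name'], column['type'], column['pos'])
print(options)
```

The position is kept as a string so that ranges such as 0-2 survive unchanged.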
The nimbusml.DataSchema class can be used to automatically infer the schema from the different data sources.
Lists are the simplest source of data. The schema inferred below shows that a list of lists is treated as three columns named c0, c1 and c2, all of type R8, at positions 0 to 2. The header=+ indicates that there is a header row in the data.
import numpy as np
from pandas import DataFrame
from nimbusml import DataSchema
list = [[1.0, 1.0, 2.0], [3.0, 5.0, 6.0]]
schema = DataSchema.read_schema(list)
print(repr(schema))
#DataSchema([DataColumn(name='c0', type='R8', pos=0),
# DataColumn(name='c1', type='R8', pos=1),
# DataColumn(name='c2', type='R8', pos=2)],
# header=True)
print(schema)
#col=c0:R8:0 col=c1:R8:1 col=c2:R8:2 header=+
For a numpy array, the DataSchema class infers that there is a header row in the dataset and groups the 3 columns into a single vector column named Data of type R4, with an index range of 0 to 2. When the type is changed from float32 to int16, the schema changes accordingly.
arr = np.array(list).astype(np.float32)
schema = DataSchema.read_schema(arr)
print(repr(schema))
#DataSchema([DataColumn(name='Data', type='R4', pos=(0, 1, 2))],
# header=True)
print(schema)
#col=Data:R4:0-2 header=+
arr = np.array(list).astype(np.int16)
schema = DataSchema.read_schema(arr)
print(repr(schema))
#DataSchema([DataColumn(name='Data', type='I2', pos=(0, 1, 2))],
# header=True)
print(schema)
#col=Data:I2:0-2 header=+
For a pandas DataFrame, the DataSchema class infers that there is a header row in the dataset and that there are 3 columns of types R8, I8 and TX, with column names X1, X2 and X3.
df = DataFrame(dict(X1=[0.1, 0.2], X2=[1, 2], X3=["a", "b"]))
schema = DataSchema.read_schema(df)
print(repr(schema))
#DataSchema([DataColumn(name='X1', type='R8', pos=0),
# DataColumn(name='X2', type='I8', pos=1),
# DataColumn(name='X3', type='TX', pos=2)],
# header=True)
print(schema)
#col=X1:R8:0 col=X2:I8:1 col=X3:TX:2 header=+
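The correspondence between dtypes and type codes seen in the examples above can be summarized in a small lookup table. The sketch below is only an illustration: DTYPE_TO_CODE and schema_string are hypothetical names, the table lists only the codes that appear in this section, and the mapping of bool to BL is an assumption since BL is used here but never defined.

```python
# Illustrative lookup table built from the examples in this section.
# 'bool' -> 'BL' is an assumption; the other pairs appear in the outputs above.
DTYPE_TO_CODE = {
    'float32': 'R4',
    'float64': 'R8',
    'int16': 'I2',
    'int64': 'I8',
    'object': 'TX',  # how pandas has traditionally stored strings
    'bool': 'BL',
}


def schema_string(dtypes, header=True):
    """Build a schema string from (column name, dtype name) pairs."""
    parts = ['col=%s:%s:%d' % (name, DTYPE_TO_CODE[dtype], pos)
             for pos, (name, dtype) in enumerate(dtypes)]
    parts.append('header=+' if header else 'header=-')
    return ' '.join(parts)


print(schema_string([('X1', 'float64'), ('X2', 'int64'), ('X3', 'object')]))
# col=X1:R8:0 col=X2:I8:1 col=X3:TX:2 header=+
```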
The transforms and trainers in NimbusML support various Data Sources as inputs.
When the data is in a pandas.DataFrame, the schema is inferred automatically from the
dtype of the columns.
When the data is in a file, the schema is inferred when creating a nimbusml.FileDataStream
using read_csv() or using nimbusml.DataSchema.read_schema().
Example (from file):
from nimbusml import DataSchema
from pandas import DataFrame
from collections import OrderedDict
data = DataFrame(OrderedDict(real1=[0.1, 0.2], real2=[0.1, 0.2], integer=[1, 2], text=["a", "b"]))
# write dataframe to file
data.to_csv('data.txt', index=False)
# infer schema directly from file
schema = DataSchema.read_schema('data.txt')
print(repr(schema))
#DataSchema([DataColumn(name='real1', type='R8', pos=0),
# DataColumn(name='real2', type='R8', pos=1),
# DataColumn(name='integer', type='I8', pos=2),
# DataColumn(name='text', type='TX', pos=3)], header=True)
print(schema)
#col=real1:R8:0 col=real2:R8:1 col=integer:I8:2 col=text:TX:3 header=+
Data may consist of numerous columns of the same type, and often it’s convenient to group them
under a single name. The nimbusml.DataSchema class provides the collapse
argument to shorten the schema representation by grouping homogeneous types.
Example:
schema = DataSchema.read_schema('data.txt', collapse=True)
print(repr(schema))
#DataSchema([DataColumn(name='real1', type='R8', pos=(0, 1)),
# DataColumn(name='integer', type='I8', pos=2),
# DataColumn(name='text', type='TX', pos=3)], header=True)
print(schema)
#col=real1:R8:0-1 col=integer:I8:2 col=text:TX:3 header=+
We see that columns real1 and real2 are merged into a single one, col=real1:R8:0-1. It is no
longer a single real value but a vector of two floats. Every learner uses features encoded as a
vector, and every transform in a pipeline converts text, categories and floats into feature vectors.
It is faster to do that at loading time. The parameter collapse=True forces the function to merge
consecutive columns with the same type into vectors.
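The merging rule can be sketched in a few lines of plain Python. This is not NimbusML's implementation; collapse_consecutive is a hypothetical helper, and it assumes that the merged column keeps the name of the first column in its run and that text (TX) columns are never merged.

```python
# Hypothetical sketch; the real logic lives inside nimbusml.DataSchema.
def collapse_consecutive(columns):
    """Merge runs of adjacent columns that share a numeric type.

    `columns` is a list of (name, type) pairs in file order. The result is
    a list of (name, type, positions) triples, where a merged column keeps
    the name of the first column in its run.
    """
    merged = []
    for pos, (name, col_type) in enumerate(columns):
        # Assumption: text columns are left alone, only numeric runs merge.
        if merged and col_type != 'TX' and merged[-1][1] == col_type:
            merged[-1][2].append(pos)
        else:
            merged.append((name, col_type, [pos]))
    return merged


cols = [('real1', 'R8'), ('real2', 'R8'), ('integer', 'I8'), ('text', 'TX')]
for name, col_type, positions in collapse_consecutive(cols):
    print(name, col_type, positions)
```

For the four columns of data.txt this yields real1 at positions 0 and 1, followed by integer and text on their own.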
If collapse == 'all', it merges all columns of the same type unless they are specified in the
argument names. Let’s see an example:
from nimbusml.datasets import get_dataset
from pandas import read_csv
path = get_dataset("infert").as_filepath()
df = read_csv(path)
print(df.head(n=2))
Output:
row_num education age parity induced case spontaneous stratum pooled.stratum
0 1 0-5yrs 26 6 1 1 2 1 3
1 2 0-5yrs 42 1 1 1 0 2 1
case is the target; everything else must be a feature if it is numeric. We want to merge every column into Features except row_num (row index), education (text) and case (target). education is not merged by default as it is not a numerical column.
Example:
import numpy as np
schema = DataSchema.read_schema(path, collapse='all', sep=',',
                                numeric_dtype=np.float32,  # convert all numeric columns to R4
                                names={0:'row_num', 5:'case'})
print(repr(schema))
#DataSchema([DataColumn(name='row_num', type='R4', pos=0),
# DataColumn(name='education', type='TX', pos=1),
# DataColumn(name='age', type='R4', pos=(2, 3, 4, 6, 7, 8)),
# DataColumn(name='case', type='R4', pos=5)], header=True, sep=',')
print(schema)
#col=row_num:R4:0 col=education:TX:1 col=age:R4:2-4,6-8 col=case:R4:5 header=+ sep=,
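The string form above writes the non-contiguous positions (2, 3, 4, 6, 7, 8) as 2-4,6-8. A minimal sketch of that conversion follows; format_positions is a hypothetical helper written for illustration and assumes a non-empty, sorted list of indices.

```python
# Hypothetical helper; assumes `positions` is non-empty and sorted.
def format_positions(positions):
    """Render column indices as comma-separated ranges."""
    parts = []
    start = prev = positions[0]
    for pos in positions[1:]:
        if pos == prev + 1:
            # Still inside the current run of consecutive indices.
            prev = pos
            continue
        # The run ended: emit it and start a new one.
        parts.append(str(start) if start == prev else '%d-%d' % (start, prev))
        start = prev = pos
    parts.append(str(start) if start == prev else '%d-%d' % (start, prev))
    return ','.join(parts)


print(format_positions([2, 3, 4, 6, 7, 8]))  # 2-4,6-8
print(format_positions([5]))                 # 5
```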
Some datasets have many columns and it is convenient to modify the first ones and let the function handle the rest. Below is an example of how to modify column names.
Example:
schema = DataSchema.read_schema('data.txt', collapse=True, sep=',',
                                names={0: 'newname', 1: 'newname2'})
print(repr(schema))
#DataSchema([DataColumn(name='newname', type='R8', pos=0),
# DataColumn(name='newname2', type='R8', pos=1),
# DataColumn(name='integer', type='I8', pos=2),
# DataColumn(name='text', type='TX', pos=3)], header=True, sep=',')
print(schema)
#col=newname:R8:0 col=newname2:R8:1 col=integer:I8:2 col=text:TX:3 header=+
The next example renames columns 0 through 1 to real_0, real_1, …
Example:
schema = DataSchema.read_schema('data.txt', collapse=False, sep=',',
                                names={(0,1): 'real'})
print(repr(schema))
#DataSchema([DataColumn(name='real_0', type='R8', pos=0),
# DataColumn(name='real_1', type='R8', pos=1),
# DataColumn(name='integer', type='I8', pos=2),
# DataColumn(name='text', type='TX', pos=3)], header=True, sep=',')
print(schema)
#col=real_0:R8:0 col=real_1:R8:1 col=integer:I8:2 col=text:TX:3 header=+
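The two key styles accepted by names (a single index, or a tuple covering a range of columns) can be pictured as follows. This is only a sketch of the behaviour shown in the outputs above: expand_names is a hypothetical helper, and treating the tuple as an inclusive (first, last) pair is an assumption based on the example.

```python
# Hypothetical helper mirroring the renaming behaviour shown above.
def expand_names(names):
    """Expand a names mapping so that every key is a single column index.

    Integer keys are kept as they are. A (first, last) tuple key is
    expanded into one entry per column, suffixed with _0, _1, ...
    """
    expanded = {}
    for key, name in names.items():
        if isinstance(key, tuple):
            first, last = key  # assumed inclusive on both ends
            for offset, pos in enumerate(range(first, last + 1)):
                expanded[pos] = '%s_%d' % (name, offset)
        else:
            expanded[key] = name
    return expanded


print(expand_names({(0, 1): 'real'}))           # {0: 'real_0', 1: 'real_1'}
print(expand_names({0: 'Label', 1: 'GroupId'}))  # {0: 'Label', 1: 'GroupId'}
```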
The read_schema() method uses the dtype argument to change the type of all columns or only a few.
We can also use numeric_dtype=np.float32 to change all numeric columns to R4 type.
Example:
schema = DataSchema.read_schema('data.txt', collapse=True, sep=',',
                                dtype={'real1': np.float32})
print(repr(schema))
#DataSchema([DataColumn(name='real1', type='R4', pos=0),
# DataColumn(name='real2', type='R8', pos=1),
# DataColumn(name='integer', type='I8', pos=2),
# DataColumn(name='text', type='TX', pos=3)], header=True, sep=',')
print(schema)
#col=real1:R4:0 col=real2:R8:1 col=integer:I8:2 col=text:TX:3 header=+
The sep argument can be used to specify a separator other than ',', which is the default
delimiter. Users can also inspect the schema manually by iterating over its columns.
Example:
for col in schema:
    print(type(col), col)
#<class 'nimbusml.internal.utils.data_schema.DataColumn'> col=real1:R4:0
#<class 'nimbusml.internal.utils.data_schema.DataColumn'> col=real2:R8:1
#<class 'nimbusml.internal.utils.data_schema.DataColumn'> col=integer:I8:2
#<class 'nimbusml.internal.utils.data_schema.DataColumn'> col=text:TX:3
In this section, we only show the string representation of the schema for simplicity.
Ranking models require three kinds of columns. Two of them are the typical Features and
Label columns (of numeric type R4 == numpy.float32), and the third is a GroupId column which ties
all observations to a specific ranking group. Note that all examples with the same GroupId must
appear sequentially and that the GroupId type must be TX == str. When reading the file without any
additional information, the raw schema is the following:
col=c0:I8:0 col=c1:I8:1 col=c2:I8:2 col=c3:I8:3 col=c4:I8:4 ... header=- sep=,
But we need to have this:
col=Label:R4:0 col=GroupId:TX:1 col=Features:R4:2-2109 header=- sep=,
Let’s see step by step how to get there, starting with the raw schema generated using read_schema():
Example:
import numpy as np
from nimbusml import DataSchema
from nimbusml.datasets import get_dataset
path = get_dataset('gen_tickettrain').as_filepath()
schema = DataSchema.read_schema(path, sep=',')
print(str(schema))
#col=rank:I8:0 col=group:I8:1 col=carrier:TX:2 col=price:I8:3 col=Class:I8:4
#col=dep_day:I8:5 col=nbr_stops:I8:6 col=duration:R8:7 header=+ sep=,
Let’s rename label and group id:
Example:
schema = DataSchema.read_schema(path, sep=',', header=True,
                                names={0:'Label', 1:'GroupId'}) # added
print(str(schema))
#col=Label:I8:0 col=GroupId:I8:1 col=carrier:TX:2...
Let’s change the column types. Types are changed by column name rather than by position, so the dtype argument refers to the new names Label and GroupId.
Example:
schema = DataSchema.read_schema(path, sep=',',
                                names={0:'Label', 1:'GroupId'},
                                dtype={'GroupId': str, 'Label': np.float32}) # added
print(str(schema))
#col=Label:R4:0 col=GroupId:TX:1 col=carrier:TX:2 col=price:I8:3
Let’s then merge all the columns used later as features under a single name.
Example:
schema = DataSchema.read_schema(path, sep=',',
                                names={0:'Label', 1:'GroupId'},
                                dtype={'GroupId': str, 'Label': np.float32},
                                collapse = 'all') # added
print(str(schema))
#col=Label:R4:0 col=GroupId:TX:1 col=carrier:TX:2 col=price:I8:3-6 col=duration:R8:7 header=+ sep=,
And finally, let’s rename price to Features:
Example:
schema.rename('price', 'Features') # added
print(schema)
#col=Label:R4:0 col=GroupId:TX:1 col=carrier:TX:2 col=Features:I8:3-6 col=duration:R8:7 header=+ sep=, #Voila!
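Since ranking requires that all rows sharing a GroupId appear sequentially, it can be worth checking the group column before training. The function below is a minimal sketch of such a check in plain Python; groups_are_contiguous is a hypothetical helper and not something NimbusML provides.

```python
# Hypothetical pre-training check; not part of the NimbusML API.
def groups_are_contiguous(group_ids):
    """Return True if equal group ids always appear in one unbroken run."""
    seen = set()
    previous = object()  # sentinel that compares unequal to any real id
    for group in group_ids:
        if group != previous:
            if group in seen:
                # This id already had a run that ended earlier.
                return False
            seen.add(group)
            previous = group
    return True


print(groups_are_contiguous(['a', 'a', 'b', 'b', 'c']))  # True
print(groups_are_contiguous(['a', 'b', 'a']))            # False
```

If the check fails, sorting the rows by group id before writing the file is one way to restore the required order.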
Most datasets are stored in text files. It is usually more convenient to load them into memory with
pandas. But when a dataset is too big, nimbusml has to load the data directly from its
location. It is more efficient to tell the parser which names and types it should use than to change
them by adding transforms to the pipeline. Given a nimbusml.DataSchema generated as
above, a nimbusml.FileDataStream can be created to train the model:
Example:
from nimbusml.datasets import get_dataset
from nimbusml import Pipeline, FileDataStream, DataSchema
from nimbusml.ensemble import LightGbmClassifier
path = get_dataset('infert').as_filepath()
schema = DataSchema.read_schema(path, sep=',')
ds = FileDataStream(path, schema = schema)
#Equivalent to
#ds = FileDataStream.read_csv(path, sep=',')
pipeline = Pipeline([
    LightGbmClassifier(feature=['age', 'parity', 'induced'], label='case')
])
pipeline.fit(ds)
pipeline.predict(ds)