This article helps developers find the Common Data Model equivalents of Parquet data types.
Parquet type: The Parquet data type. For more details, see the Parquet documentation.
Common Data Model equivalent type: Each attribute in a Common Data Model entity can be associated with a single data type. A Common Data Model data type is an object that represents a collection of traits. Every data type should indicate its data format traits and can also carry additional semantic information. For more details, see the Common Data Model documentation on data types.
Traits included in the equivalent data type: When an attribute is defined by using a data type, the attribute gains the traits of that data type. Traits are the fundamental mechanism in the Common Data Model metadata grammar for describing the data format, semantic meaning, and specifications of entities, attributes, and other objects, such as partitions or manifests. For more details, see the Common Data Model documentation on traits.
Traits to add: Traits that aren't implicitly included when you specify the Common Data Model data type. Add them to complete the suggested data type and match the equivalent Parquet type.
Unsupported: Common Data Model doesn't offer an out-of-the-box equivalent. Depending on the use case, you can define a new data type, but it won't be standard.
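To make the relationship between these columns concrete, the following self-contained C# sketch models one row of the mapping. The `TypeMapping` and `TypeMappingExample` names are hypothetical helpers introduced only for illustration; they aren't part of the CDM SDK. The sketch assumes that the traits an attribute needs are the traits included in the data type plus the traits to add.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper type for illustration only; it is not part of the CDM SDK.
// It models one row of the mapping: a Parquet type, its Common Data Model
// data type, the traits that data type already includes, and the traits
// that must be added explicitly.
public sealed class TypeMapping
{
    public string ParquetType { get; }
    public string CdmDataType { get; }
    public IReadOnlyList<string> IncludedTraits { get; }
    public IReadOnlyList<string> TraitsToAdd { get; }

    public TypeMapping(string parquetType, string cdmDataType,
        IReadOnlyList<string> includedTraits, IReadOnlyList<string> traitsToAdd)
    {
        ParquetType = parquetType;
        CdmDataType = cdmDataType;
        IncludedTraits = includedTraits;
        TraitsToAdd = traitsToAdd;
    }

    // Assumption: the attribute should end up with the included traits
    // followed by the explicitly added ones.
    public IEnumerable<string> EffectiveTraits() => IncludedTraits.Concat(TraitsToAdd);
}

public static class TypeMappingExample
{
    // The TIMESTAMP row with MILLIS precision, using the values from this article.
    public static TypeMapping TimestampMillis() => new TypeMapping(
        "TIMESTAMP (MILLIS)",
        "biginteger",
        new[]
        {
            "is.dataFormat.integer",
            "is.dataFormat.big",
            "is.dataFormat.signed",
            "is.dataFormat.numeric"
        },
        new[] { "means.timestamp.parquet.milli" });
}
```

Calling `TypeMappingExample.TimestampMillis().EffectiveTraits()` yields the four data format traits of `biginteger` followed by the one trait that has to be added by hand.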
The following code snippet sets the integer data type on a Common Data Model attribute. See the CDM SDK API documentation for API references.
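// Assumes `corpus` is an existing CdmCorpusDefinition instance. Create a type attribute named "count".
CdmTypeAttributeDefinition artAtt = corpus.MakeObject<CdmTypeAttributeDefinition>(CdmObjectType.TypeAttributeDef, "count");
// Reference the standard "integer" data type by name (simple named reference).
artAtt.DataType = corpus.MakeObject<CdmDataTypeReference>(CdmObjectType.DataTypeRef, "integer", true);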
Here is a sample application demonstrating how data types can be set.
| Parquet type | Common Data Model equivalent type | Traits included in the equivalent data type | Traits to add |
|---|---|---|---|
| BOOLEAN | boolean | is.dataFormat.boolean | N/A |
| FLOAT | float | is.dataFormat.floatingPoint | N/A |
| DOUBLE | double | is.dataFormat.floatingPoint, is.dataFormat.big | N/A |
| BYTE_ARRAY | binary | is.dataFormat.byte, is.dataFormat.array | N/A |
| STRING | string | is.dataFormat.character, is.dataFormat.array, is.dataFormat.big | N/A |
| UUID | guid | is.dataFormat.guid | N/A |
| DECIMAL | decimal | is.dataFormat.numeric.shaped (extends is.dataFormat.numeric). It has two integer parameters: precision (the total number of significant digits) and scale (the number of digits to the right of the decimal point). | N/A |
| DATE | date | is.dataFormat.date | N/A |
| INTERVAL | binary | is.dataFormat.byte, is.dataFormat.array | N/A |
| JSON | json | is.dataFormat.character, is.dataFormat.array, means.content.text.JSON | N/A |
| ENUM | string | is.dataFormat.character, is.dataFormat.array, is.dataFormat.big | N/A |
| MAP | (complex object) | N/A | Structured resolution form: is.dataFormat.map, is.dataFormat.mapKey, is.dataFormat.mapValue. Non-structured resolution form: indicates.expansionInfo.mapKey, indicates.expansionInfo.mapValue. For a detailed description and use cases refer to this page. |
| LIST | (complex object) | N/A | Structured resolution form: is.dataFormat.list. Non-structured resolution form: has.expansionInfo.list. For a detailed description and use cases refer to this page. |
| Signed integers | | | |
| INT(8, true) | byte | is.dataFormat.byte | N/A |
| INT(16, true) | smallinteger | is.dataFormat.integer, is.dataFormat.signed, is.dataFormat.numeric, is.dataFormat.small | N/A |
| INT(32, true) | integer | is.dataFormat.integer, is.dataFormat.signed, is.dataFormat.numeric | N/A |
| INT(64, true) | biginteger | is.dataFormat.integer, is.dataFormat.big, is.dataFormat.signed, is.dataFormat.numeric | N/A |
| Unsigned integers | biginteger | is.dataFormat.integer, is.dataFormat.unsigned, is.dataFormat.numeric, is.dataFormat.big | N/A |
| TIME (UTC adjustment (true/false) and precision (MILLIS/MICRO/NANO)) | integer (MILLIS); biginteger (MICRO and NANO) | is.dataFormat.integer, is.dataFormat.big (additional for MICRO and NANO), is.dataFormat.signed, is.dataFormat.numeric | Only one of the following traits should be used: means.time.parquet.milli, means.time.parquet.micro, means.time.parquet.nano |
| TIMESTAMP (UTC adjustment (true/false) and precision (MILLIS/MICRO/NANO)) | biginteger | is.dataFormat.integer, is.dataFormat.big, is.dataFormat.signed, is.dataFormat.numeric | Only one of the following traits should be used: means.timestamp.parquet.milli, means.timestamp.parquet.micro, means.timestamp.parquet.nano |
| Unsupported | | | |
| INT96 | Not available | Not available | Not available |
| BSON | Not available | Not available | Not available |
| Null | Not available | Not available | Not available |
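
The parameterized rows of the table (integer widths and TIME/TIMESTAMP precisions) can be expressed as a small lookup. The following self-contained C# sketch is one possible way to encode them; `ParquetToCdm` and `TimePrecision` are hypothetical names introduced for illustration and aren't part of the CDM SDK. The returned strings are the data type and trait names listed in the table.

```csharp
using System;

// Hypothetical enum for illustration; it mirrors the three precisions in the table.
public enum TimePrecision { Millis, Micros, Nanos }

// Hypothetical helper for illustration only; it is not part of the CDM SDK.
public static class ParquetToCdm
{
    // Maps a Parquet INT(bitWidth, isSigned) type to the Common Data Model
    // data type name from the table. Per the table, every unsigned integer
    // maps to biginteger.
    public static string IntegerType(int bitWidth, bool isSigned)
    {
        switch (bitWidth)
        {
            case 8: return isSigned ? "byte" : "biginteger";
            case 16: return isSigned ? "smallinteger" : "biginteger";
            case 32: return isSigned ? "integer" : "biginteger";
            case 64: return "biginteger";
            default:
                throw new ArgumentOutOfRangeException(
                    nameof(bitWidth), "Expected a bit width of 8, 16, 32, or 64.");
        }
    }

    // Returns the data type and the single trait to add for a Parquet TIME value.
    public static (string DataType, string TraitToAdd) TimeType(TimePrecision precision)
    {
        switch (precision)
        {
            case TimePrecision.Millis: return ("integer", "means.time.parquet.milli");
            case TimePrecision.Micros: return ("biginteger", "means.time.parquet.micro");
            case TimePrecision.Nanos: return ("biginteger", "means.time.parquet.nano");
            default: throw new ArgumentOutOfRangeException(nameof(precision));
        }
    }

    // Returns the data type and the single trait to add for a Parquet TIMESTAMP value.
    public static (string DataType, string TraitToAdd) TimestampType(TimePrecision precision)
    {
        switch (precision)
        {
            case TimePrecision.Millis: return ("biginteger", "means.timestamp.parquet.milli");
            case TimePrecision.Micros: return ("biginteger", "means.timestamp.parquet.micro");
            case TimePrecision.Nanos: return ("biginteger", "means.timestamp.parquet.nano");
            default: throw new ArgumentOutOfRangeException(nameof(precision));
        }
    }
}
```

For example, `ParquetToCdm.TimeType(TimePrecision.Millis)` returns `("integer", "means.time.parquet.milli")`. Returning exactly one trait per call reflects the table's rule that only one of the precision traits should be applied to an attribute.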