Mapping Parquet types to Common Data Model data types

This article helps developers find the Common Data Model equivalents of Parquet data types. The mapping table in this article uses the following columns and labels:

  • Parquet type: The Parquet data type. For more details, see the Apache Parquet documentation.

  • Common Data Model equivalent type: Each attribute in a Common Data Model entity can be associated with a single data type. A Common Data Model data type is an object that represents a collection of traits. Every data type should indicate its data format traits and can also add semantic information. For more details, see the Common Data Model documentation on data types.

  • Traits included in the equivalent data type: When an attribute is defined by using a data type, the attribute gains the traits of that data type. Traits are the fundamental mechanism in the Common Data Model metadata grammar for describing the data format, semantic meaning, and specifications of entities, attributes, and other objects, such as partitions or manifests. For more details, see the Common Data Model documentation on traits.

  • Traits to add: These traits won't be implicitly included when specifying the Common Data Model data type. Users must add them to complete the suggested data type and match the equivalent Parquet type.

  • Unsupported: Common Data Model doesn't offer out-of-the-box equivalents for these types. Depending on the use case, users can define new data types, but those types will not be standard.

The following code snippet sets the integer data type on a Common Data Model attribute. See the CDM SDK API documentation for the API references.

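// Create a type attribute named "count".
CdmTypeAttributeDefinition artAtt = MakeObject<CdmTypeAttributeDefinition>(CdmObjectType.TypeAttributeDef, "count");
// Assign the Common Data Model "integer" data type to the attribute.
artAtt.DataType = MakeObject<CdmDataTypeReference>(CdmObjectType.DataTypeRef, "integer", true);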

Here is a sample application demonstrating how data types can be set.

| Parquet type | Common Data Model equivalent type | Traits included in the equivalent data type | Traits to add |
|---|---|---|---|
| BOOLEAN | boolean | is.dataFormat.boolean | N/A |
| FLOAT | float | is.dataFormat.floatingPoint | N/A |
| DOUBLE | double | is.dataFormat.floatingPoint, is.dataFormat.big | N/A |
| BYTE_ARRAY | binary | is.dataFormat.byte, is.dataFormat.array | N/A |
| STRING | string | is.dataFormat.character, is.dataFormat.array, is.dataFormat.big | N/A |
| UUID | guid | is.dataFormat.guid | N/A |
| DECIMAL | decimal | is.dataFormat.numeric.shaped (extends is.dataFormat.numeric). It has two integer parameters: precision (the total number of significant digits) and scale (the number of digits to the right of the decimal point). | N/A |
| DATE | date | is.dataFormat.date | N/A |
| INTERVAL | binary | is.dataFormat.byte, is.dataFormat.array | N/A |
| JSON | json | is.dataFormat.character, is.dataFormat.array, means.content.text.JSON | N/A |
| ENUM | string | is.dataFormat.character, is.dataFormat.array, is.dataFormat.big | N/A |
| MAP | (complex object) | N/A | Structured resolution form: is.dataFormat.map, is.dataFormat.mapKey, is.dataFormat.mapValue. Non-structured resolution form: indicates.expansionInfo.mapKey, indicates.expansionInfo.mapValue. For a detailed description and use cases refer to this page. |
| LIST | (complex object) | N/A | Structured resolution form: is.dataFormat.list. Non-structured resolution form: has.expansionInfo.list. For a detailed description and use cases refer to this page. |
| Signed integers | | | |
| INT(8, true) | byte | is.dataFormat.byte | N/A |
| INT(16, true) | smallinteger | is.dataFormat.integer, is.dataFormat.signed, is.dataFormat.numeric, is.dataFormat.small | N/A |
| INT(32, true) | integer | is.dataFormat.integer, is.dataFormat.signed, is.dataFormat.numeric | N/A |
| INT(64, true) | biginteger | is.dataFormat.integer, is.dataFormat.big, is.dataFormat.signed, is.dataFormat.numeric | N/A |
| Unsigned integers | biginteger | is.dataFormat.integer, is.dataFormat.unsigned, is.dataFormat.numeric, is.dataFormat.big | N/A |
| TIME (UTC adjustment true/false; precision MILLIS/MICROS/NANOS) | integer (MILLIS), biginteger (MICROS and NANOS) | is.dataFormat.integer, is.dataFormat.big (MICROS and NANOS only), is.dataFormat.signed, is.dataFormat.numeric | Exactly one of: means.time.parquet.milli, means.time.parquet.micro, means.time.parquet.nano |
| TIMESTAMP (UTC adjustment true/false; precision MILLIS/MICROS/NANOS) | biginteger | is.dataFormat.integer, is.dataFormat.big, is.dataFormat.signed, is.dataFormat.numeric | Exactly one of: means.timestamp.parquet.milli, means.timestamp.parquet.micro, means.timestamp.parquet.nano |
| Unsupported | | | |
| INT96 | Not available | Not available | Not available |
| BSON | Not available | Not available | Not available |
| Null | Not available | Not available | Not available |
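
The mapping above can also be expressed as a small lookup helper. The following is a minimal sketch only, and it does not use the CDM SDK: the class name ParquetToCdm and the methods TryMapScalar, MapInteger, and MapTemporal are hypothetical names introduced for illustration. It assumes the caller has already extracted the Parquet type name, integer bit width, signedness, and time unit from the Parquet schema. The returned strings are the Common Data Model data type and trait names listed in the table.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the CDM SDK: encodes the mapping table as lookups.
public static class ParquetToCdm
{
    // Scalar Parquet types that map to a CDM data type with no traits to add.
    private static readonly Dictionary<string, string> Scalars =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["BOOLEAN"] = "boolean",
            ["FLOAT"] = "float",
            ["DOUBLE"] = "double",
            ["BYTE_ARRAY"] = "binary",
            ["STRING"] = "string",
            ["UUID"] = "guid",
            ["DECIMAL"] = "decimal",
            ["DATE"] = "date",
            ["INTERVAL"] = "binary",
            ["JSON"] = "json",
            ["ENUM"] = "string",
        };

    // Returns false for types the table does not map this way (for example INT96 or BSON).
    public static bool TryMapScalar(string parquetType, out string cdmType)
    {
        return Scalars.TryGetValue(parquetType, out cdmType);
    }

    // INT(bitWidth, isSigned): all unsigned integers map to biginteger in the table.
    public static string MapInteger(int bitWidth, bool isSigned)
    {
        if (!isSigned) return "biginteger";
        switch (bitWidth)
        {
            case 8: return "byte";
            case 16: return "smallinteger";
            case 32: return "integer";
            case 64: return "biginteger";
            default: throw new ArgumentOutOfRangeException(nameof(bitWidth));
        }
    }

    // TIME and TIMESTAMP: returns the CDM data type and the single trait to add.
    // Accepts both MICRO/NANO and MICROS/NANOS spellings of the unit.
    public static (string DataType, string TraitToAdd) MapTemporal(bool isTimestamp, string unit)
    {
        string suffix;
        switch (unit.ToUpperInvariant())
        {
            case "MILLIS": suffix = "milli"; break;
            case "MICRO":
            case "MICROS": suffix = "micro"; break;
            case "NANO":
            case "NANOS": suffix = "nano"; break;
            default: throw new ArgumentException("Unknown time unit: " + unit, nameof(unit));
        }

        string kind = isTimestamp ? "timestamp" : "time";
        // Only TIME with millisecond precision fits in integer; everything else is biginteger.
        string dataType = (!isTimestamp && suffix == "milli") ? "integer" : "biginteger";
        return (dataType, "means." + kind + ".parquet." + suffix);
    }
}
```

For example, ParquetToCdm.MapTemporal(false, "MILLIS") yields the data type integer with the trait means.time.parquet.milli, which matches the TIME row of the table. The data type name could then be passed to the data type reference shown in the earlier snippet, and the trait would still need to be added to the attribute separately.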