How to set parquet data types in copy activity sink?

joba-2596 5

I'm trying to convert CSV files to Parquet using the copy activity, but I can't get typed data written to the Parquet output. Whatever I do, every field in the Parquet file ends up as "BYTE_ARRAY" (string):

[Image: Parquet schema viewer]

The mapping I'm currently using looks like this:

"mapping": {
    "type": "TabularTranslator",
    "mappings": [           
        {
            "source": {
                "name": "REQUEST_ID",
                "type": "String",
                "physicalType": "String"
            },
            "sink": {
                "name": "request_id",
                "type": "String",
                "physicalType": "UTF8"
            }
        },
        [...]
        {
            "source": {
                "name": "t_job_id",
                "type": "INT32",
                "physicalType": "String"
            },
            "sink": {
                "name": "t_job_id",
                "type": "Int32",
                "physicalType": "INT_32"
            }
        },
        [...]
    ]
}

What am I doing wrong? The jobid and fileid fields are "additional columns" added by the copy activity, but they are definitely valid integers.


The warnings on the additional columns say "Expression of type 'Int' does not match the field: 'value'", but I'm not sure what that means. I can cast the values to strings in the expression to make the warning go away, but that doesn't solve the Parquet problem either.

AnnuKumari-MSFT 33,476 Reputation points Microsoft Employee

2023-01-31T08:59:59.2333333+00:00
Hi joba,

Welcome to Microsoft Q&A platform and thanks for posting your question here.

As I understand your query, you are trying to copy data from a CSV file to a Parquet file, but every column's data type is written as 'BYTE_ARRAY'. Please let me know if that is not what you are asking.

Ideally, the copy activity should be able to convert CSV to Parquet, but I haven't yet reproduced the scenario with it. In the meantime, I suggest using a data flow, which has a derived column transformation to create new columns with the required values and a cast transformation to convert columns to the required data types.

Suggested resources to go through:

https://github.com/apache/parquet-format#types

Parquet binary data type

Derived column transformation

Cast transformation

Please let us know how it goes.
joba-2596 5 Reputation points

2023-01-31T09:24:34.5133333+00:00

Hi,

Yes, the question is "how to write correct datatypes to parquet".

I'm aware of dataflows. However, copy activity should be able to cope with this requirement. Switching to dataflows has some other re-engineering consequences in our setup, which I would like to avoid.

So the question: bug? limitation? Or should I use something else in the mapping definition? The documentation for copy activity is lacking.
joba-2596 5 Reputation points

2023-01-31T09:25:54.2666667+00:00

(duplicate comment posted by accident, ignore this one)
Satyasobhan Dasmohapatra 0 Reputation points

2023-11-08T11:24:09.2233333+00:00

Did you get an answer to the question or solve it?
joba-2596 5 Reputation points

2023-11-08T15:28:33.6533333+00:00

Unfortunately, no answers nor solutions :(
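
One thing that may be worth trying within the copy activity itself: the documented schema mapping settings for the TabularTranslator include a "typeConversion" flag and "typeConversionSettings", alongside an interim "type" on each source and sink column. The fragment below is only a sketch under assumptions, not a confirmed fix for this thread. It assumes the translator sits under the copy activity's "translator" property (the question shows it as "mapping"), that CSV columns and additional columns are read as String, and that the "physicalType" hints can be left out. It has not been verified to change the Parquet physical type from BYTE_ARRAY to INT32.

```json
{
    "translator": {
        "type": "TabularTranslator",
        "mappings": [
            {
                "source": { "name": "REQUEST_ID", "type": "String" },
                "sink": { "name": "request_id", "type": "String" }
            },
            {
                "source": { "name": "t_job_id", "type": "String" },
                "sink": { "name": "t_job_id", "type": "Int32" }
            }
        ],
        "typeConversion": true,
        "typeConversionSettings": {
            "allowDataTruncation": false,
            "treatBooleanAsNumber": false
        }
    }
}
```

Compared with the mapping in the question, the source type for t_job_id is String rather than INT32, since the value arrives as text, and the sink type Int32 names the interim type to convert to. Only the two columns shown in the question are included; the elided mappings would need the same treatment.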
