Delimited text format in Azure Data Factory and Azure Synapse Analytics

Статия
09/26/2024

APPLIES TO: Azure Data Factory Azure Synapse Analytics

Tip

Try out Data Factory in Microsoft Fabric, an all-in-one analytics solution for enterprises. Microsoft Fabric covers everything from data movement to data science, real-time analytics, business intelligence, and reporting. Learn how to start a new trial for free!

Follow this article when you want to parse the delimited text files or write the data into delimited text format.

Delimited text format is supported for the following connectors:

Dataset properties

For a full list of sections and properties available for defining datasets, see the Datasets article. This section provides a list of properties supported by the delimited text dataset.

Property	Description	Required
type	The type property of the dataset must be set to DelimitedText.	Yes
location	Location settings of the file(s). Each file-based connector has its own location type and supported properties under `location`.	Yes
columnDelimiter	The character(s) used to separate columns in a file. The default value is comma `,`. When the column delimiter is defined as empty string, which means no delimiter, the whole line is taken as a single column. Currently, column delimiter as empty string is only supported for mapping data flow but not Copy activity.	No
rowDelimiter	For Copy activity, the single character or "\r\n" used to separate rows in a file. The default value is any of the following values on read: ["\r\n", "\r", "\n"]; on write: "\r\n". "\r\n" is only supported in copy command. For Mapping data flow, the single or two characters used to separate rows in a file. The default value is any of the following values on read: ["\r\n", "\r", "\n"]; on write: "\n". When the row delimiter is set to no delimiter (empty string), the column delimiter must be set as no delimiter (empty string) as well, which means to treat the entire content as a single value. Currently, row delimiter as empty string is only supported for mapping data flow but not Copy activity.	No
quoteChar	The single character to quote column values if it contains column delimiter. The default value is double quotes `"`. When `quoteChar` is defined as empty string, it means there is no quote char and column value is not quoted, and `escapeChar` is used to escape the column delimiter and itself.	No
escapeChar	The single character to escape quotes inside a quoted value. The default value is backslash `\`. When `escapeChar` is defined as empty string, the `quoteChar` must be set as empty string as well, in which case make sure all column values don't contain delimiters.	No
firstRowAsHeader	Specifies whether to treat/make the first row as a header line with names of columns. Allowed values are true and false (default). When first row as header is false, note UI data preview and lookup activity output auto generate column names as Prop_{n} (starting from 0), copy activity requires explicit mapping from source to sink and locates columns by ordinal (starting from 1), and mapping data flow lists and locates columns with name as Column_{n} (starting from 1).	No
nullValue	Specifies the string representation of null value. The default value is empty string.	No
encodingName	The encoding type used to read/write test files. Allowed values are as follows: "UTF-8","UTF-8 without BOM", "UTF-16", "UTF-16BE", "UTF-32", "UTF-32BE", "US-ASCII", "UTF-7", "BIG5", "EUC-JP", "EUC-KR", "GB2312", "GB18030", "JOHAB", "SHIFT-JIS", "CP875", "CP866", "IBM00858", "IBM037", "IBM273", "IBM437", "IBM500", "IBM737", "IBM775", "IBM850", "IBM852", "IBM855", "IBM857", "IBM860", "IBM861", "IBM863", "IBM864", "IBM865", "IBM869", "IBM870", "IBM01140", "IBM01141", "IBM01142", "IBM01143", "IBM01144", "IBM01145", "IBM01146", "IBM01147", "IBM01148", "IBM01149", "ISO-2022-JP", "ISO-2022-KR", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-13", "ISO-8859-15", "WINDOWS-874", "WINDOWS-1250", "WINDOWS-1251", "WINDOWS-1252", "WINDOWS-1253", "WINDOWS-1254", "WINDOWS-1255", "WINDOWS-1256", "WINDOWS-1257", "WINDOWS-1258". Note mapping data flow doesn't support UTF-7 encoding. Note mapping data flow doesn't support UTF-8 encoding with Byte Order Mark (BOM).	No
compressionCodec	The compression codec used to read/write text files. Allowed values are bzip2, gzip, deflate, ZipDeflate, TarGzip, Tar, snappy, or lz4. Default is not compressed. Note currently Copy activity doesn't support "snappy" & "lz4", and mapping data flow doesn't support "ZipDeflate", "TarGzip" and "Tar". Note when using copy activity to decompress ZipDeflate/TarGzip/Tar file(s) and write to file-based sink data store, by default files are extracted to the folder:`<path specified in dataset>/<folder named as source compressed file>/`, use `preserveZipFileNameAsFolder`/`preserveCompressionFileNameAsFolder` on copy activity source to control whether to preserve the name of the compressed file(s) as folder structure.	No
compressionLevel	The compression ratio. Allowed values are Optimal or Fastest. - Fastest: The compression operation should complete as quickly as possible, even if the resulting file is not optimally compressed. - Optimal: The compression operation should be optimally compressed, even if the operation takes a longer time to complete. For more information, see Compression Level topic.	No

Below is an example of delimited text dataset on Azure Blob Storage:

{
    "name": "DelimitedTextDataset",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "<Azure Blob Storage linked service name>",
            "type": "LinkedServiceReference"
        },
        "schema": [ < physical schema, optional, retrievable during authoring > ],
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "containername",
                "folderPath": "folder/subfolder",
            },
            "columnDelimiter": ",",
            "quoteChar": "\"",
            "escapeChar": "\"",
            "firstRowAsHeader": true,
            "compressionCodec": "gzip"
        }
    }
}

Copy activity properties

For a full list of sections and properties available for defining activities, see the Pipelines article. This section provides a list of properties supported by the delimited text source and sink.

Delimited text as source

The following properties are supported in the copy activity *source* section.

Property	Description	Required
type	The type property of the copy activity source must be set to DelimitedTextSource.	Yes
formatSettings	A group of properties. Refer to Delimited text read settings table below.	No
storeSettings	A group of properties on how to read data from a data store. Each file-based connector has its own supported read settings under `storeSettings`.	No

Supported delimited text read settings under formatSettings:

Property	Description	Required
type	The type of formatSettings must be set to DelimitedTextReadSettings.	Yes
skipLineCount	Indicates the number of non-empty rows to skip when reading data from input files. If both skipLineCount and firstRowAsHeader are specified, the lines are skipped first and then the header information is read from the input file.	No
compressionProperties	A group of properties on how to decompress data for a given compression codec.	No
preserveZipFileNameAsFolder (under `compressionProperties`->`type` as `ZipDeflateReadSettings`)	Applies when input dataset is configured with ZipDeflate compression. Indicates whether to preserve the source zip file name as folder structure during copy. - When set to true (default), the service writes unzipped files to `<path specified in dataset>/<folder named as source zip file>/`. - When set to false, the service writes unzipped files directly to `<path specified in dataset>`. Make sure you don't have duplicated file names in different source zip files to avoid racing or unexpected behavior.	No
preserveCompressionFileNameAsFolder (under `compressionProperties`->`type` as `TarGZipReadSettings` or `TarReadSettings`)	Applies when input dataset is configured with TarGzip/Tar compression. Indicates whether to preserve the source compressed file name as folder structure during copy. - When set to true (default), the service writes decompressed files to `<path specified in dataset>/<folder named as source compressed file>/`. - When set to false, the service writes decompressed files directly to `<path specified in dataset>`. Make sure you don't have duplicated file names in different source files to avoid racing or unexpected behavior.	No

"activities": [
    {
        "name": "CopyFromDelimitedText",
        "type": "Copy",
        "typeProperties": {
            "source": {
                "type": "DelimitedTextSource",
                "storeSettings": {
                    "type": "AzureBlobStorageReadSettings",
                    "recursive": true
                },
                "formatSettings": {
                    "type": "DelimitedTextReadSettings",
                    "skipLineCount": 3,
                    "compressionProperties": {
                        "type": "ZipDeflateReadSettings",
                        "preserveZipFileNameAsFolder": false
                    }
                }
            },
            ...
        }
        ...
    }
]

Delimited text as sink

The following properties are supported in the copy activity *sink* section.

Property	Description	Required
type	The type property of the copy activity source must be set to DelimitedTextSink.	Yes
formatSettings	A group of properties. Refer to Delimited text write settings table below.	No
storeSettings	A group of properties on how to write data to a data store. Each file-based connector has its own supported write settings under `storeSettings`.	No

Supported delimited text write settings under formatSettings:

Property	Description	Required
type	The type of formatSettings must be set to DelimitedTextWriteSettings.	Yes
fileExtension	The file extension used to name the output files, for example, `.csv`, `.txt`. It must be specified when the `fileName` is not specified in the output DelimitedText dataset. When file name is configured in the output dataset, it will be used as the sink file name and the file extension setting will be ignored.	Yes when file name is not specified in output dataset
maxRowsPerFile	When writing data into a folder, you can choose to write to multiple files and specify the max rows per file.	No
fileNamePrefix	Applicable when `maxRowsPerFile` is configured. Specify the file name prefix when writing data to multiple files, resulted in this pattern: `<fileNamePrefix>_00000.<fileExtension>`. If not specified, file name prefix will be auto generated. This property does not apply when source is file-based store or partition-option-enabled data store.	No

Mapping data flow properties

In mapping data flows, you can read and write to delimited text format in the following data stores: Azure Blob Storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2 and SFTP, and you can read delimited text format in Amazon S3.

Inline dataset

Mapping data flows supports "inline datasets" as an option for defining your source and sink. An inline delimited dataset is defined directly inside your source and sink transformations and is not shared outside of the defined dataflow. It is useful for parameterizing dataset properties directly inside your data flow and can benefit from improved performance from shared ADF datasets.

When you are reading large numbers of source folders and files, you can improve the performance of data flow file discovery by setting the option "User projected schema" inside the Projection | Schema options dialog. This option turns off ADF's default schema auto-discovery and will greatly improve the performance of file discovery. Before setting this option, make sure to import the projection so that ADF has an existing schema for projection. This option does not work with schema drift.

Source properties

The below table lists the properties supported by a delimited text source. You can edit these properties in the Source options tab.

Name	Description	Required	Allowed values	Data flow script property
Wild card paths	All files matching the wildcard path will be processed. Overrides the folder and file path set in the dataset.	no	String[]	wildcardPaths
Partition root path	For file data that is partitioned, you can enter a partition root path in order to read partitioned folders as columns	no	String	partitionRootPath
List of files	Whether your source is pointing to a text file that lists files to process	no	`true` or `false`	fileList
Multiline rows	Does the source file contain rows that span multiple lines. Multiline values must be in quotes.	no `true` or `false`	multiLineRow
Column to store file name	Create a new column with the source file name and path	no	String	rowUrlColumn
After completion	Delete or move the files after processing. File path starts from the container root	no	Delete: `true` or `false` Move: `['<from>', '<to>']`	purgeFiles moveFiles
Filter by last modified	Choose to filter files based upon when they were last altered	no	Timestamp	modifiedAfter modifiedBefore
Allow no files found	If true, an error is not thrown if no files are found	no	`true` or `false`	ignoreNoFilesFound
Maximum columns	The default value is 20480. Customize this value when the column number is over 20480	no	Integer	maxColumns

Note

Data flow sources support for list of files is limited to 1024 entries in your file. To include more files, use wildcards in your file list.

Source example

The below image is an example of a delimited text source configuration in mapping data flows.

DelimitedText source

The associated data flow script is:

source(
    allowSchemaDrift: true,
	validateSchema: false,
	ignoreNoFilesFound: false,
	multiLineRow: true,
	wildcardPaths:['*.csv']) ~> CSVSource

Note

Data flow sources support a limited set of Linux globbing that is supported by Hadoop file systems

Sink properties

The below table lists the properties supported by a delimited text sink. You can edit these properties in the Settings tab.

Name	Description	Required	Allowed values	Data flow script property
Clear the folder	If the destination folder is cleared prior to write	no	`true` or `false`	truncate
File name option	The naming format of the data written. By default, one file per partition in format `part-#####-tid-<guid>`	no	Pattern: String Per partition: String[] Name file as column data: String Output to single file: `['<fileName>']` Name folder as column data: String	filePattern partitionFileNames rowUrlColumn partitionFileNames rowFolderUrlColumn
Quote all	Enclose all values in quotes	no	`true` or `false`	quoteAll
Header	Add customer headers to output files	no	`[<string array>]`	header

Sink example

The below image is an example of a delimited text sink configuration in mapping data flows.

DelimitedText sink

The associated data flow script is:

CSVSource sink(allowSchemaDrift: true,
    validateSchema: false,
    truncate: true,
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> CSVSink

Here are some common connectors and formats related to the delimited text format:

Споделяне чрез

Delimited text format in Azure Data Factory and Azure Synapse Analytics

Dataset properties

Copy activity properties

Delimited text as source

Delimited text as sink

Mapping data flow properties

Inline dataset

Source properties

Source example

Sink properties

Sink example

Обратна връзка

Допълнителни ресурси

Споделяне чрез

Delimited text format in Azure Data Factory and Azure Synapse Analytics

Dataset properties

Copy activity properties

Delimited text as source

Delimited text as sink

Mapping data flow properties

Inline dataset

Source properties

Source example

Sink properties

Sink example

Related connectors and formats

Related content

Обратна връзка

Допълнителни ресурси