`read_files` table-valued function

Applies to: Databricks SQL, Databricks Runtime 13.3 LTS and above

Reads files under a provided location and returns the data in tabular form.

Supports reading JSON, CSV, XML, TEXT, BINARYFILE, PARQUET, AVRO, and ORC file formats. Can detect the file format automatically and infer a unified schema across all files.

Syntax

read_files(path [, option_key => option_value ] [...])

Arguments

This function requires named parameter invocation for the option keys.

path: A STRING with the URI of the location of the data. Supports reading from Azure Data Lake Storage ('abfss://'), S3 ('s3://'), and Google Cloud Storage ('gs://'). Can contain globs. See File discovery for more details.
option_key: The name of the option to configure. You need to use backticks (`) for options that contain dots (.).
option_value: A constant expression to set the option to. Accepts literals and scalar functions.
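
As a minimal sketch of the named-parameter syntax described above, the following query reads CSV files and passes two options by name. The `orders` subdirectory is a hypothetical location; `format` and `header` are options described later on this page.

```sql
-- Read CSV files from a hypothetical directory in a Unity Catalog volume.
-- Each option is passed by name using the => operator.
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/orders',
  format => 'csv',
  header => true
);
```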

Returns

A table containing the data from files read under the given path. The schema depends on the file format:

BINARYFILE: Returns a fixed schema:

Column	Type	Description
`path`	`STRING`	The full path to the file.
`modificationTime`	`TIMESTAMP`	The last modification time of the file.
`length`	`LONG`	The size of the file in bytes.
`content`	`BINARY`	The binary content of the file. Use `* EXCEPT (content)` to exclude binary content when querying file metadata.

TEXT: Returns a fixed schema with a single value (STRING) column.
All other formats (JSON, CSV, XML, PARQUET, AVRO, ORC): The schema is inferred from the file contents, or provided explicitly using the schema option.

`_metadata` column

read_files exposes a _metadata column with file-level metadata. This column is not included in SELECT * results and must be explicitly selected. It contains the following fields:

Field	Type	Description
`file_path`	`STRING`	The full path to the source file.
`file_name`	`STRING`	The name of the source file.
`file_size`	`LONG`	The size of the source file in bytes.
`file_modification_time`	`TIMESTAMP`	The last modification time of the source file.
`file_block_start`	`LONG`	The start of the block of the file being read.
`file_block_length`	`LONG`	The length of the block of the file being read.

To include _metadata in results, select it explicitly:

SELECT * EXCEPT (content), _metadata
FROM read_files('/Volumes/my_catalog/my_schema/my_volume', format => 'binaryFile');
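
Individual fields of _metadata can also be selected by name. The sketch below assumes a hypothetical `events` directory containing JSON files:

```sql
-- Return selected file-level metadata fields next to the data columns.
-- The events directory is an assumed example location.
SELECT
  _metadata.file_name,
  _metadata.file_size,
  _metadata.file_modification_time,
  *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/events',
  format => 'json'
);
```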

File discovery

read_files can read an individual file or read files under a provided directory. read_files discovers all files under the provided directory recursively unless a glob is provided, which instructs read_files to recurse into a specific directory pattern.

Filtering directories or files using glob patterns

Glob patterns can be used for filtering directories and files when provided in the path.

Pattern	Description
`?`	Matches any single character
`*`	Matches zero or more characters
`[abc]`	Matches a single character from character set {a,b,c}.
`[a-z]`	Matches a single character from the character range {a…z}.
`[^a]`	Matches a single character that is not from character set or range {a}. Note that the `^` character must occur immediately to the right of the opening bracket.
`{ab,cd}`	Matches a string from the string set {ab, cd}.
`{ab,c{de, fh}}`	Matches a string from the string set {ab, cde, cfh}.
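
For illustration, the query below combines a character range and a star pattern in the path. It assumes a hypothetical layout in which a `logs` directory contains monthly subdirectories named `2024-01`, `2024-02`, and so on:

```sql
-- [1-3] restricts the month directories to 2024-01 through 2024-03.
-- *.json keeps only files whose names end in .json.
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/logs/2024-0[1-3]/*.json',
  format => 'json'
);
```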

read_files uses Auto Loader's strict globber when discovering files with globs. This is configured by the useStrictGlobber option. When the strict globber is disabled, trailing slashes (/) are dropped and a star pattern such as /*/ can expand into discovering multiple directories. See the examples below to see the difference in behavior.

Pattern	File path	Strict globber disabled	Strict globber enabled
`/a/b`	`/a/b/c/file.txt`	Yes	Yes
`/a/b`	`/a/b_dir/c/file.txt`	No	No
`/a/b`	`/a/b.txt`	No	No
`/a/b/`	`/a/b.txt`	No	No
`/a/*/c/`	`/a/b/c/file.txt`	Yes	Yes
`/a/*/c/`	`/a/b/c/d/file.txt`	Yes	Yes
`/a/*/d/`	`/a/b/c/d/file.txt`	Yes	No
`/a/*/c/`	`/a/b/x/y/c/file.txt`	Yes	No
`/a/*/c`	`/a/b/c_file.txt`	Yes	No
`/a/*/c/`	`/a/b/c_file.txt`	Yes	No
`/a/*/c`	`/a/b/cookie/file.txt`	Yes	No
`/a/b*`	`/a/b.txt`	Yes	Yes
`/a/b*`	`/a/b/file.txt`	Yes	Yes
`/a/{0.txt,1.txt}`	`/a/0.txt`	Yes	Yes
`/a/*/{0.txt,1.txt}`	`/a/0.txt`	No	No
`/a/b/[cde-h]/i/`	`/a/b/c/i/file.txt`	Yes	Yes
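
The following sketch shows how the strict globber might be disabled for a single query. The directory layout is hypothetical. Based on the table above, the non-strict globber would let `*/c/` also match deeper paths such as `b/x/y/c/`.

```sql
-- Disable the strict globber so that the star pattern can expand
-- across multiple directory levels (see the comparison table above).
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/a/*/c/',
  useStrictGlobber => false
);
```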

Schema inference

The schema of the files can be explicitly provided to read_files with the schema option. When the schema is not provided, read_files attempts to infer a unified schema across the discovered files, which requires reading all the files unless a LIMIT statement is used. Even when using a LIMIT query, a larger set of files than required might be read to return a more representative schema of the data. Databricks automatically adds a LIMIT statement for SELECT queries in notebooks and the SQL editor if a user hasn't provided one.

The schemaHints option can be used to fix subsets of the inferred schema. See Override schema inference with schema hints for more details.

A rescuedDataColumn is provided by default to rescue any data that doesn't match the schema. See What is the rescued data column? for more details. You can drop the rescuedDataColumn by setting the option schemaEvolutionMode => 'none'.
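
The three approaches above can be sketched as follows. The `events` directory and the column names are illustrative assumptions, not part of any real dataset:

```sql
-- 1. Provide the full schema explicitly, so no inference is needed.
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/events',
  format => 'csv',
  schema => 'id int, ts timestamp, event string'
);

-- 2. Infer the schema, but pin the type of one column with a hint.
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/events',
  format => 'csv',
  schemaHints => 'id int'
);

-- 3. Infer the schema and drop the rescued data column.
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/events',
  format => 'csv',
  schemaEvolutionMode => 'none'
);
```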

Partition schema inference

read_files can also infer partitioning columns if files are stored under Hive-style partitioned directories, that is /column_name=column_value/. If a schema is provided, the discovered partition columns use the types provided in the schema. If the partition columns are not part of the provided schema, then the inferred partition columns are ignored.

If a column exists in both the partition schema and in the data columns, the value that is read from the partition value is used instead of the data value. If you would like to ignore the values coming from the directory and use the data column, you can provide the list of partition columns in a comma-separated list with the partitionColumns option.

The partitionColumns option can also be used to instruct read_files on which discovered columns to include in the final inferred schema. Providing an empty string ignores all partition columns.

The schemaHints option can also be provided to override the inferred schema for a partition column.

The TEXT and BINARYFILE formats have a fixed schema, but read_files also attempts to infer partitioning for these formats when possible.
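
As a sketch, assume a hypothetical `sales` directory with files stored under paths such as `year=2022/month=2/day=3/`. The first query keeps only the listed partition columns; the second ignores partition columns entirely:

```sql
-- Include only the year and month partition columns in the result.
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/sales',
  format => 'parquet',
  partitionColumns => 'year,month'
);

-- An empty string ignores all partition columns.
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/sales',
  format => 'parquet',
  partitionColumns => ''
);
```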

Authentication for cloud storage

read_files accesses cloud storage through Unity Catalog external locations. You must have the READ FILES privilege on the external location that contains the files you want to read. See Connect to cloud object storage using Unity Catalog.

Usage in streaming tables

read_files can be used in streaming tables to ingest files into Delta Lake. read_files leverages Auto Loader when used in a streaming table query. You must use the STREAM keyword with read_files. See What is Auto Loader? for more details.

When used in a streaming query, read_files uses a sample of the data to infer the schema, and can evolve the schema as it processes more data. See Configure schema inference and evolution in Auto Loader for more details.
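
A minimal sketch of a streaming table that ingests files incrementally is shown below. The table name `raw_events` and the `events` directory are hypothetical:

```sql
-- The STREAM keyword is required when read_files feeds a streaming table.
CREATE OR REFRESH STREAMING TABLE raw_events
AS SELECT *
FROM STREAM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/events',
  format => 'json'
);
```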

Basic Options

Option
`format` Type: `String` The data file format in the source path. Auto-inferred if not provided. Allowed values: `avro`, `binaryFile`, `csv`, `json`, `orc`, `parquet`, `text`, and `xml`. Default value: None
`schema` Type: `String` The schema of the files to read. Provide a schema string using DDL format, for example `'id int, ts timestamp, event string'`. When the schema is not provided, `read_files` attempts to infer a unified schema across the discovered files. Default value: None
`inferColumnTypes` Type: `Boolean` Whether to infer exact column types when leveraging schema inference. By default, column types are inferred for JSON and CSV datasets. See schema inference for more details. Note that this is the opposite of the default of Auto Loader. Default value: `true`
`partitionColumns` Type: `String` A comma-separated list of Hive style partition columns that you would like inferred from the directory structure of the files. Hive style partition columns are key-value pairs combined by an equality sign such as `<base-path>/a=x/b=1/c=y/file.format`. In this example, the partition columns are `a`, `b`, and `c`. By default these columns will be automatically added to your schema if you are using schema inference and provide the `<base-path>` to load data from. If you provide a schema, Auto Loader expects these columns to be included in the schema. If you do not want these columns as part of your schema, you can specify `""` to ignore these columns. In addition, you can use this option when you want columns to be inferred from the file path in complex directory structures, like the example below: `<base-path>/year=2022/week=1/file1.csv` `<base-path>/year=2022/month=2/day=3/file2.csv` `<base-path>/year=2022/month=2/day=4/file3.csv` Specifying `partitionColumns` as `year,month,day` will return `year=2022` for `file1.csv`, but the `month` and `day` columns will be `null`. `month` and `day` will be parsed correctly for `file2.csv` and `file3.csv`. Default value: None
`schemaHints` Type: `String` Schema information that you provide to Auto Loader during schema inference. See schema hints for more details. Default value: None
`useStrictGlobber` Type: `Boolean` Whether to use a strict globber that matches the default globbing behavior of other file sources in Apache Spark. See Common data loading patterns for more details. Available in Databricks Runtime 12.2 LTS and above. Note that this is the opposite of the default for Auto Loader. Default value: `true`

Generic options

The following options apply to all file formats.

Option
`ignoreCorruptFiles` Type: `Boolean` Whether to ignore corrupt files. If true, the Spark jobs will continue to run when encountering corrupted files and the contents that have been read will still be returned. Observable as `numSkippedCorruptFiles` in the `operationMetrics` column of the Delta Lake history. Available in Databricks Runtime 11.3 LTS and above. Default value: `false`
`ignoreMissingFiles` Type: `Boolean` Whether to ignore missing files. If true, the Spark jobs will continue to run when encountering missing files and the contents that have been read will still be returned. Available in Databricks Runtime 11.3 LTS and above. Default value: `false` for Auto Loader, `true` for `COPY INTO` (legacy)
`modifiedAfter` Type: `Timestamp String`, for example, `2021-01-01 00:00:00.000000 UTC+0` An optional timestamp as a filter to only ingest files that have a modification timestamp after the provided timestamp. Default value: None
`modifiedBefore` Type: `Timestamp String`, for example, `2021-01-01 00:00:00.000000 UTC+0` An optional timestamp as a filter to only ingest files that have a modification timestamp before the provided timestamp. Default value: None
`pathGlobFilter` or `fileNamePattern` Type: `String` A potential glob pattern to provide for choosing files. Equivalent to `PATTERN` in `COPY INTO` (legacy). `fileNamePattern` can be used in `read_files`. Default value: None
`recursiveFileLookup` Type: `Boolean` This option searches through nested directories even if their names do not follow a partition naming scheme like date=2019-07-01. Default value: `false`
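
To illustrate how generic options combine, the sketch below filters files by modification time and file name, and skips corrupt files. The `exports` directory and the timestamp are arbitrary example values:

```sql
-- Read only Parquet files modified after the given timestamp,
-- and continue past any corrupt files instead of failing.
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/exports',
  format => 'parquet',
  modifiedAfter => '2024-01-01 00:00:00.000000 UTC+0',
  fileNamePattern => '*.parquet',
  ignoreCorruptFiles => true
);
```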

`JSON` options

Option
`allowBackslashEscapingAnyCharacter` Type: `Boolean` Whether to allow backslashes to escape any character that follows them. If not enabled, only characters that are explicitly listed by the JSON specification can be escaped. Default value: `false`
`allowComments` Type: `Boolean` Whether to allow the use of Java, C, and C++ style comments (`/* */` and `//` varieties) within parsed content or not. Default value: `false`
`allowNonNumericNumbers` Type: `Boolean` Whether to allow the set of not-a-number (`NaN`) tokens as legal floating number values. Default value: `true`
`allowNumericLeadingZeros` Type: `Boolean` Whether to allow integral numbers to start with additional (ignorable) zeroes (for example, `000001`). Default value: `false`
`allowSingleQuotes` Type: `Boolean` Whether to allow use of single quotes (apostrophe, character `'`) for quoting strings (names and String values). Default value: `true`
`allowUnquotedControlChars` Type: `Boolean` Whether to allow JSON strings to contain unescaped control characters (ASCII characters with value less than 32, including tab and line feed characters) or not. Default value: `false`
`allowUnquotedFieldNames` Type: `Boolean` Whether to allow use of unquoted field names (which are allowed by JavaScript, but not by the JSON specification). Default value: `false`
`badRecordsPath` Type: `String` The path to store files for recording the information about bad JSON records. Using the `badRecordsPath` option in a file-based data source has the following limitations: It is non-transactional and can lead to inconsistent results. Transient errors are treated as failures. Default value: None
`columnNameOfCorruptRecord` Type: `String` The column for storing records that are malformed and cannot be parsed. If the `mode` for parsing is set as `DROPMALFORMED`, this column will be empty. Default value: `_corrupt_record`
`dateFormat` Type: `String` The format for parsing date strings. Default value: `yyyy-MM-dd`
`dropFieldIfAllNull` Type: `Boolean` Whether to ignore columns of all null values or empty arrays and structs during schema inference. Default value: `false`
`encoding` or `charset` Type: `String` The name of the encoding of the JSON files. See `java.nio.charset.Charset` for the list of options. You cannot use `UTF-16` and `UTF-32` when `multiLine` is `true`. Default value: `UTF-8`
`inferTimestamp` Type: `Boolean` Whether to try and infer timestamp strings as a `TimestampType`. When set to `true`, schema inference might take noticeably longer. You must enable `cloudFiles.inferColumnTypes` to use with Auto Loader. Default value: `false`
`lineSep` Type: `String` A string between two consecutive JSON records. Default value: None, which covers `\r`, `\r\n`, and `\n`
`locale` Type: `String` A `java.util.Locale` identifier. Influences default date, timestamp, and decimal parsing within the JSON. Default value: `US`
`mode` Type: `String` Parser mode around handling malformed records. One of `PERMISSIVE`, `DROPMALFORMED`, or `FAILFAST`. Default value: `PERMISSIVE`
`multiLine` Type: `Boolean` Whether the JSON records span multiple lines. Default value: `false`
`prefersDecimal` Type: `Boolean` Attempts to infer strings as `DecimalType` instead of float or double type when possible. You must also use schema inference, either by enabling `inferSchema` or using `cloudFiles.inferColumnTypes` with Auto Loader. Default value: `false`
`primitivesAsString` Type: `Boolean` Whether to infer primitive types like numbers and booleans as `StringType`. Default value: `false`
`readerCaseSensitive` Type: `Boolean` Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. Available in Databricks Runtime 13.3 and above. Default value: `true`
`rescuedDataColumn` Type: `String` Whether to collect all data that can’t be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. `COPY INTO` (legacy) does not support the rescued data column because you cannot manually set the schema using `COPY INTO`. Databricks recommends using Auto Loader for most ingestion scenarios. Default value: None
`singleVariantColumn` Type: `String` Whether to ingest the entire JSON document, parsed into a single Variant column with the given string as the column’s name. If disabled, the JSON fields will be ingested into their own columns. Default value: None
`timestampFormat` Type: `String` The format for parsing timestamp strings. Default value: `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`
`timeZone` Type: `String` The `java.time.ZoneId` to use when parsing timestamps and dates. Default value: None
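
As a brief sketch, the query below reads JSON documents that span multiple lines and contain comments. The `config_dumps` directory is a hypothetical location:

```sql
-- multiLine lets a single JSON record span several lines;
-- allowComments tolerates comments inside the files.
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/config_dumps',
  format => 'json',
  multiLine => true,
  allowComments => true
);
```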

`CSV` options

Option
`badRecordsPath` Type: `String` The path to store files for recording the information about bad CSV records. Default value: None
`charToEscapeQuoteEscaping` Type: `Char` The character used to escape the character used for escaping quotes. For example, for the following record: `[ " a\\", b ]`: If the character to escape the `'\'` is undefined, the record won't be parsed. The parser will read characters: `[a],[\],["],[,],[ ],[b]` and throw an error because it cannot find a closing quote. If the character to escape the `'\'` is defined as `'\'`, the record will be read with 2 values: `[a\]` and `[b]`. Default value: `'\0'`
`columnNameOfCorruptRecord` Supported for Auto Loader. Not supported for `COPY INTO` (legacy). Type: `String` The column for storing records that are malformed and cannot be parsed. If the `mode` for parsing is set as `DROPMALFORMED`, this column will be empty. Default value: `_corrupt_record`
`comment` Type: `Char` Defines the character that represents a line comment when found in the beginning of a line of text. Use `'\0'` to disable comment skipping. Default value: `'\u0000'`
`dateFormat` Type: `String` The format for parsing date strings. Default value: `yyyy-MM-dd`
`emptyValue` Type: `String` String representation of an empty value. Default value: `""`
`encoding` or `charset` Type: `String` The name of the encoding of the CSV files. See `java.nio.charset.Charset` for the list of options. `UTF-16` and `UTF-32` cannot be used when `multiLine` is `true`. Default value: `UTF-8`
`enforceSchema` Type: `Boolean` Whether to forcibly apply the specified or inferred schema to the CSV files. If the option is enabled, headers of CSV files are ignored. This option is ignored by default when using Auto Loader to rescue data and allow schema evolution. Default value: `true`
`escape` Type: `Char` The escape character to use when parsing the data. Default value: `'\'`
`header` Type: `Boolean` Whether the CSV files contain a header. Auto Loader assumes that files have headers when inferring the schema. Default value: `false`
`ignoreLeadingWhiteSpace` Type: `Boolean` Whether to ignore leading whitespaces for each parsed value. Default value: `false`
`ignoreTrailingWhiteSpace` Type: `Boolean` Whether to ignore trailing whitespaces for each parsed value. Default value: `false`
`inferSchema` Type: `Boolean` Whether to infer the data types of the parsed CSV records or to assume all columns are of `StringType`. Requires an additional pass over the data if set to `true`. For Auto Loader, use `cloudFiles.inferColumnTypes` instead. Default value: `false`
`lineSep` Type: `String` A string between two consecutive CSV records. Default value: None, which covers `\r`, `\r\n`, and `\n`
`locale` Type: `String` A `java.util.Locale` identifier. Influences default date, timestamp, and decimal parsing within the CSV. Default value: `US`
`maxCharsPerColumn` Type: `Int` Maximum number of characters expected from a value to parse. Can be used to avoid memory errors. Defaults to `-1`, which means unlimited. Default value: `-1`
`maxColumns` Type: `Int` The hard limit of how many columns a record can have. Default value: `20480`
`mergeSchema` Type: `Boolean` Whether to infer the schema across multiple files and to merge the schema of each file. Enabled by default for Auto Loader when inferring the schema. Default value: `false`
`mode` Type: `String` Parser mode around handling malformed records. One of `'PERMISSIVE'`, `'DROPMALFORMED'`, and `'FAILFAST'`. Default value: `PERMISSIVE`
`multiLine` Type: `Boolean` Whether the CSV records span multiple lines. Default value: `false`
`nanValue` Type: `String` The string representation of a not-a-number value when parsing `FloatType` and `DoubleType` columns. Default value: `"NaN"`
`negativeInf` Type: `String` The string representation of negative infinity when parsing `FloatType` or `DoubleType` columns. Default value: `"-Inf"`
`nullValue` Type: `String` String representation of a null value. Default value: `""`
`parserCaseSensitive` (deprecated) Type: `Boolean` While reading files, whether to align columns declared in the header with the schema case sensitively. This is `true` by default for Auto Loader. Columns that differ by case will be rescued in the `rescuedDataColumn` if enabled. This option has been deprecated in favor of `readerCaseSensitive`. Default value: `false`
`positiveInf` Type: `String` The string representation of positive infinity when parsing `FloatType` or `DoubleType` columns. Default value: `"Inf"`
`preferDate` Type: `Boolean` Attempts to infer strings as dates instead of timestamp when possible. You must also use schema inference, either by enabling `inferSchema` or using `cloudFiles.inferColumnTypes` with Auto Loader. Default value: `true`
`quote` Type: `Char` The character used for escaping values where the field delimiter is part of the value. Default value: `"`
`readerCaseSensitive` Type: `Boolean` Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. Default value: `true`
`rescuedDataColumn` Type: `String` Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. `COPY INTO` (legacy) does not support the rescued data column because you cannot manually set the schema using `COPY INTO`. Databricks recommends using Auto Loader for most ingestion scenarios. Default value: None
`sep` or `delimiter` Type: `String` The separator string between columns. Default value: `","`
`skipRows` Type: `Int` The number of rows from the beginning of the CSV file that should be ignored (including commented and empty rows). If `header` is true, the header will be the first unskipped and uncommented row. Default value: `0`
`timestampFormat` Type: `String` The format for parsing timestamp strings. Default value: `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`
`timeZone` Type: `String` The `java.time.ZoneId` to use when parsing timestamps and dates. Default value: None
`unescapedQuoteHandling` Type: `String` The strategy for handling unescaped quotes. Allowed options: `STOP_AT_CLOSING_QUOTE`: If unescaped quotes are found in the input, accumulate the quote character and proceed parsing the value as a quoted value, until a closing quote is found. `BACK_TO_DELIMITER`: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters of the current parsed value until the delimiter defined by `sep` is found. If no delimiter is found in the value, the parser will continue accumulating characters from the input until a delimiter or line ending is found. `STOP_AT_DELIMITER`: If unescaped quotes are found in the input, consider the value as an unquoted value. This will make the parser accumulate all characters until the delimiter defined by `sep`, or a line ending is found in the input. `SKIP_VALUE`: If unescaped quotes are found in the input, the content parsed for the given value will be skipped (until the next delimiter is found) and the value set in `nullValue` will be produced instead. `RAISE_ERROR`: If unescaped quotes are found in the input, a `TextParsingException` will be thrown. Default value: `STOP_AT_DELIMITER`
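
The sketch below combines several CSV options. It assumes hypothetical pipe-delimited files in a `reports` directory, where a banner line precedes the header row and the string `NA` marks missing values:

```sql
-- skipRows drops the banner line; the header is then the first remaining row.
-- FAILFAST makes the query fail on the first malformed record.
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/reports',
  format => 'csv',
  sep => '|',
  skipRows => 1,
  header => true,
  nullValue => 'NA',
  mode => 'FAILFAST'
);
```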

`XML` options

Option	Description	Scope
`rowTag`	The row tag of the XML files to treat as a row. In the example XML `<books> <book>...</book> ...</books>`, the appropriate value is `book`. This is a required option.	read
`samplingRatio`	Defines a fraction of rows used for schema inference. XML built-in functions ignore this option. Default: `1.0`.	read
`excludeAttribute`	Whether to exclude attributes in elements. Default: `false`.	read
`mode`	Mode for dealing with corrupt records during parsing. `PERMISSIVE`: For corrupted records, puts the malformed string into a field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To keep corrupt records, you can set a `string` type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, corrupt records are dropped during parsing. When inferring a schema, the parser implicitly adds a `columnNameOfCorruptRecord` field in an output schema. `DROPMALFORMED`: Ignores corrupted records. This mode is unsupported for XML built-in functions. `FAILFAST`: Throws an exception when the parser meets corrupted records.	read
`inferSchema`	If `true`, attempts to infer an appropriate type for each resulting DataFrame column. If `false`, all resulting columns are of `string` type. Default: `true`. XML built-in functions ignore this option.	read
`columnNameOfCorruptRecord`	Allows renaming the new field that contains a malformed string created by `PERMISSIVE` mode. Default: `spark.sql.columnNameOfCorruptRecord`.	read
`attributePrefix`	The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Default is `_`. Can be empty for reading XML, but not for writing.	read, write
`valueTag`	The tag used for the character data within elements that also have attributes or child elements. You can specify the `valueTag` field in the schema, or it is added automatically during schema inference when character data is present in elements with other elements or attributes. Default: `_VALUE`.	read, write
`encoding`	For reading, decodes the XML files by the given encoding type. For writing, specifies encoding (charset) of saved XML files. XML built-in functions ignore this option. Default: `UTF-8`.	read, write
`ignoreSurroundingSpaces`	Defines whether surrounding white spaces from values being read should be skipped. Default: `true`. Whitespace-only character data are ignored.	read
`rowValidationXSDPath`	Path to an optional XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred.	read
`ignoreNamespace`	If `true`, namespaces' prefixes on XML elements and attributes are ignored. Tags `<abc:author>` and `<def:author>`, for example, are treated as if both are just `<author>`. Namespaces cannot be ignored on the `rowTag` element, only on its children. XML parsing is not namespace-aware even if `false`. Default: `false`.	read
`timestampFormat`	Custom timestamp format string that follows the datetime pattern format. This applies to `timestamp` type. Default: `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`.	read, write
`timestampNTZFormat`	Custom format string for timestamp without timezone that follows the datetime pattern format. This applies to TimestampNTZType type. Default: `yyyy-MM-dd'T'HH:mm:ss[.SSS]`	read, write
`dateFormat`	Custom date format string that follows the datetime pattern format. This applies to date type. Default: `yyyy-MM-dd`.	read, write
`locale`	Sets a locale as a language tag in IETF BCP 47 format. For instance, `locale` is used while parsing dates and timestamps. Default: `en-US`.	read
`rootTag`	Root tag of the XML files. For example, in `<books> <book>...</book> ...</books>`, the appropriate value is `books`. You can include basic attributes by specifying a value like `books foo="bar"`. Default: `ROWS`.	write
`declaration`	Content of XML declaration to write at the start of every output XML file, before the `rootTag`. For example, a value of `foo` causes `<?xml foo?>` to be written. Set to an empty string to suppress. Default: `version="1.0" encoding="UTF-8" standalone="yes"`.	write
`arrayElementName`	Name of XML element that encloses each element of an array-valued column when writing. Default: `item`.	write
`nullValue`	Sets the string representation of a null value. Default: string `null`. When this is `null`, the parser does not write attributes and elements for fields.	read, write
`compression`	Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (`none`, `bzip2`, `gzip`, `lz4`, `snappy`, and `deflate`). XML built-in functions ignore this option. Default: `none`.	write
`validateName`	If true, throws an error on XML element name validation failure. For example, SQL field names can have spaces, but XML element names cannot. Default: `true`.	write
`readerCaseSensitive`	Specifies the case sensitivity behavior when rescuedDataColumn is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. Default: `true`.	read
`rescuedDataColumn`	Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, see What is the rescued data column?. `COPY INTO` (legacy) does not support the rescued data column because you cannot manually set the schema using `COPY INTO`. Databricks recommends using Auto Loader for most ingestion scenarios. Default: None.	read
`singleVariantColumn`	Specifies the name of the single variant column. If this option is specified for reading, parse the entire XML record into a single Variant column with the given option string value as the column’s name. If this option is provided for writing, write the value of the single Variant column to XML files. Default: `none`.	read, write
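
Because `rowTag` is required, a minimal XML read might look like the sketch below. The `catalog` directory is hypothetical, and the files are assumed to contain repeated `<book>` elements:

```sql
-- Each <book> element becomes one row in the result.
SELECT *
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/catalog',
  format => 'xml',
  rowTag => 'book'
);
```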

`PARQUET` options

Option
`datetimeRebaseMode` Type: `String` Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. Default value: `LEGACY`
`int96RebaseMode` Type: `String` Controls the rebasing of the INT96 timestamp values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. Default value: `LEGACY`
`mergeSchema` Type: `Boolean` Whether to infer the schema across multiple files and to merge the schema of each file. Default value: `false`
`readerCaseSensitive` Type: `Boolean` Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. Default value: `true`
`rescuedDataColumn` Type: `String` Whether to collect all data that can't be parsed due to a data type mismatch or schema mismatch (including column casing) to a separate column. This column is included by default when using Auto Loader. For more details, refer to What is the rescued data column?. `COPY INTO` (legacy) does not support the rescued data column because you cannot manually set the schema using `COPY INTO`. Databricks recommends using Auto Loader for most ingestion scenarios. Default value: None

`AVRO` options

Option
`avroSchema` Type: `String` Optional schema provided by a user in Avro format. When reading Avro, this option can be set to an evolved schema, which is compatible with but different from the actual Avro schema. The deserialization schema will be consistent with the evolved schema. For example, if you set an evolved schema containing one additional column with a default value, the read result will contain the new column too. Default value: None
`datetimeRebaseMode` Type: `String` Controls the rebasing of the DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Allowed values: `EXCEPTION`, `LEGACY`, and `CORRECTED`. Default value: `LEGACY`
`mergeSchema` Type: `Boolean` Whether to infer the schema across multiple files and to merge the schema of each file. `mergeSchema` for Avro does not relax data types. Default value: `false`
`readerCaseSensitive` Type: `Boolean` Specifies the case sensitivity behavior when `rescuedDataColumn` is enabled. If true, rescue the data columns whose names differ by case from the schema; otherwise, read the data in a case-insensitive manner. Default value: `true`
`rescuedDataColumn` Type: `String` Whether to collect all data that can't be parsed because of a data type mismatch or a schema mismatch (including column casing) into a separate column. This column is included by default when using Auto Loader. `COPY INTO` (legacy) does not support the rescued data column because you cannot manually set the schema using `COPY INTO`. Databricks recommends using Auto Loader for most ingestion scenarios. For more details, refer to What is the rescued data column?. Default value: None
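
The evolved-schema behavior described for `avroSchema` might look like the following sketch. The record name `Event` and the fields `id` and `source` are hypothetical; the example assumes the files were written with a record that contains only `id`, so that `source` is the additional column with a default value.

```sql
-- Sketch only: the Avro schema below is an assumed, illustrative schema.
SELECT *
FROM read_files(
  'gs://my-bucket/avroData',
  format => 'avro',
  -- Evolved schema: adds a `source` column with a default value.
  avroSchema => '{"type":"record","name":"Event","fields":[{"name":"id","type":"long"},{"name":"source","type":"string","default":"unknown"}]}'
);
```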

`BINARYFILE` options

Binary files do not have any additional configuration options.

`TEXT` options

Option
`encoding` Type: `String` The name of the encoding of the TEXT file line separator. For a list of options, see `java.nio.charset.Charset`. The content of the file is not affected by this option and is read as-is. Default value: `UTF-8`
`lineSep` Type: `String` A string between two consecutive TEXT records. Default value: None, which covers `\r`, `\r\n` and `\n`
`wholeText` Type: `Boolean` Whether to read a file as a single record. Default value: `false`
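
A minimal sketch of the `wholeText` option, assuming a hypothetical path that contains small text files: each file becomes one row, with its contents in the `value` column and its name taken from the `_metadata` column.

```sql
-- Sketch only: 's3://bucket/path/notes' is a hypothetical location.
SELECT
  _metadata.file_name,
  value
FROM read_files(
  's3://bucket/path/notes',
  format => 'text',
  -- Read each file as a single record instead of one record per line.
  wholeText => true
);
```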

`ORC` options

Option
`mergeSchema` Type: `Boolean` Whether to infer the schema across multiple files and to merge the schema of each file. Default value: `false`

Streaming options

These options apply when using read_files inside a streaming table or streaming query.

Option
`allowOverwrites` Type: `Boolean` Whether to re-process files that have been modified after discovery. The latest available version of the file will be processed during a refresh if it has been modified since the last successful refresh query start time. Default value: `false`
`includeExistingFiles` Type: `Boolean` Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup. This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has no effect. Default value: `true`
`maxBytesPerTrigger` Type: `Byte String` The maximum number of new bytes to be processed in every trigger. You can specify a byte string such as `10g` to limit each microbatch to 10 GB of data. This is a soft maximum: with a `10g` limit and files that are 3 GB each, Azure Databricks processes 12 GB (four files) in a microbatch. When used together with `maxFilesPerTrigger`, Azure Databricks consumes up to the lower limit of `maxFilesPerTrigger` or `maxBytesPerTrigger`, whichever is reached first. Note: For streaming tables created on serverless SQL warehouses, do not set this option or `maxFilesPerTrigger`, so that dynamic admission control can scale by workload size and serverless compute resources to give you the best latency and performance. Default value: None
`maxFilesPerTrigger` Type: `Integer` The maximum number of new files to be processed in every trigger. When used together with `maxBytesPerTrigger`, Azure Databricks consumes up to the lower limit of `maxFilesPerTrigger` or `maxBytesPerTrigger`, whichever is reached first. Note: For streaming tables created on serverless SQL warehouses, do not set this option or `maxBytesPerTrigger`, so that dynamic admission control can scale by workload size and serverless compute resources to give you the best latency and performance. Default value: 1000
`schemaEvolutionMode` Type: `String` The mode for evolving the schema as new columns are discovered in the data. By default, columns are inferred as strings when inferring JSON datasets. See schema evolution for more details. This option doesn't apply to `text` and `binaryFile` files. Default value: `"addNewColumns"` when a schema is not provided. `"none"` otherwise.
`schemaLocation` Type: `String` The location to store the inferred schema and subsequent changes. See schema inference for more details. The schema location is not required when used in a streaming table query. Default value: None
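
The streaming options can be combined in a single call, as in the following sketch. The table name `raw_events`, the path, and the limit values are assumptions chosen for illustration. Per the note above, the two trigger limits should be left unset for streaming tables on serverless SQL warehouses.

```sql
-- Sketch only: `raw_events` and the path are hypothetical names.
CREATE OR REFRESH STREAMING TABLE raw_events
AS SELECT *
FROM STREAM read_files(
  's3://bucket/path',
  format => 'json',
  -- Process files already present in the path on the first run.
  includeExistingFiles => true,
  -- Re-process files that were modified after discovery.
  allowOverwrites => true,
  -- Soft cap per microbatch; whichever limit is reached first applies.
  maxBytesPerTrigger => '10g',
  maxFilesPerTrigger => 500
);
```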

Examples

-- Reads the files available in the given path. Auto-detects the format and schema of the data.
> SELECT * FROM read_files('abfss://container@storageAccount.dfs.core.windows.net/base/path');

-- Reads the headerless CSV files in the given path with the provided schema.
> SELECT * FROM read_files(
    's3://bucket/path',
    format => 'csv',
    schema => 'id int, ts timestamp, event string');

-- Infers the schema of CSV files with headers. Because the schema is not provided,
-- the CSV files are assumed to have headers.
> SELECT * FROM read_files(
    's3://bucket/path',
    format => 'csv');

-- Reads files that have a csv suffix.
> SELECT * FROM read_files('s3://bucket/path/*.csv');

-- Reads a single JSON file
> SELECT * FROM read_files(
    'abfss://container@storageAccount.dfs.core.windows.net/path/single.json');

-- Reads JSON files and overrides the data type of the column `id` to integer.
> SELECT * FROM read_files(
    's3://bucket/path',
    format => 'json',
    schemaHints => 'id int');

-- Reads files that were uploaded or modified yesterday.
> SELECT * FROM read_files(
    'gs://my-bucket/avroData',
    modifiedAfter => date_sub(current_date(), 1),
    modifiedBefore => current_date());

-- Creates a Delta table and stores the source file path as part of the data
> CREATE TABLE my_avro_data
  AS SELECT *, _metadata.file_path
  FROM read_files('gs://my-bucket/avroData');

-- Creates a streaming table that processes files that appear only after the table's creation.
-- The table will most likely be empty (if there's no clock skew) after being first created,
-- and future refreshes will bring new data in.
> CREATE OR REFRESH STREAMING TABLE avro_data
  AS SELECT * FROM STREAM read_files('gs://my-bucket/avroData', includeExistingFiles => false);

Work with unstructured files

The following examples use BINARYFILE format to read and filter unstructured files stored in Unity Catalog volumes, and combine read_files with AI functions to process file contents.

List all files in a volume: Use * EXCEPT (content) to return file metadata without loading binary content, and select _metadata explicitly to include file-level metadata fields.

SELECT
  * EXCEPT (content),
  _metadata
FROM read_files(
  '/Volumes/<catalog>/<schema>/<volume>',
  format => 'binaryFile'
);

List image files filtered by size: Use fileNamePattern to target specific image file types and filter on _metadata.file_size to return only files within a given size range.

SELECT
  * EXCEPT (content),
  _metadata
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume',
  format => 'binaryFile',
  fileNamePattern => '*.{jpg,jpeg,png,JPG,JPEG,PNG}'
)
WHERE _metadata.file_size BETWEEN 20000 AND 1000000;

List PDF files modified within the past day: Use fileNamePattern to target PDF files and filter on modificationTime to return only files changed within the past day.

SELECT
  * EXCEPT (content),
  _metadata
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume',
  format => 'binaryFile',
  fileNamePattern => '*.{pdf,PDF}'
)
WHERE modificationTime >= current_timestamp() - INTERVAL 1 DAY;

Run an AI function on image files: Use ai_query to process image files read from a cloud storage path. Filter on _metadata fields to target specific files.

SELECT
  path AS file_path,
  ai_query(
    'databricks-llama-4-maverick',
    'Describe this image in ten words or less: ',
    files => content
  ) AS result
FROM read_files(
  's3://my-s3-bucket/path/to/images/',
  format => 'binaryFile',
  fileNamePattern => '*.{jpg,jpeg,png,JPG,JPEG,PNG}'
)
WHERE _metadata.file_size < 1000000
  AND _metadata.file_name LIKE '%robots%';

Parse documents matching a filename pattern: Use ai_parse_document to extract structured content from PDFs and images. Filter by _metadata.file_name to target specific files.

SELECT
  path AS file_path,
  ai_parse_document(
    content,
    map('version', '2.0')
  ) AS result
FROM read_files(
  '/Volumes/main/public/my_files/',
  format => 'binaryFile',
  fileNamePattern => '*.{jpg,jpeg,pdf,png}'
)
WHERE _metadata.file_name ILIKE '%receipt%';

Join files with a structured table: Unstructured workflows often require merging structured data stored in tables with unstructured files. The following example joins files in a cloud storage path with two structured tables, filtering by file size and a user attribute. The join with user_files is done by extracting the file ID from the file path using split and element_at.

SELECT
  users.user_id,
  user_files.file_id,
  files._metadata.file_name AS file_name,
  files.* EXCEPT (content),
  ai_parse_document(files.content, map('version', '2.0')) AS parsed_document
FROM read_files(
  's3://my-bucket-name/files/',
  format => 'binaryFile',
  fileNamePattern => '*.{pdf,doc,docx,ppt,pptx,png,jpg,jpeg}'
) AS files
JOIN user_files
  ON user_files.file_id = element_at(split(files.path, '/'), -2)
JOIN users
  ON users.user_id = user_files.user_id
WHERE users.email LIKE '%@databricks.com'
  AND files._metadata.file_size < 10000000;

Last updated on 2026-04-20