Applies to: Databricks SQL and Databricks Runtime 13.3 LTS and above
Reads files under a provided location and returns the data in tabular form.
Supports reading JSON, CSV, XML, TEXT, BINARYFILE, PARQUET, AVRO, and ORC file formats.
Can detect the file format automatically and infer a unified schema across all files.
Syntax
read_files(path [, option_key => option_value ] [...])
Arguments
This function requires named parameter invocation for the option keys.
- `path`: A STRING with the URI of the location of the data. Supports reading from Azure Data Lake Storage (`abfss://`), S3 (`s3://`), and Google Cloud Storage (`gs://`). Can contain globs. See File discovery for more details.
- `option_key`: The name of the option to configure. You need to use backticks (`` ` ``) for options that contain dots (`.`).
- `option_value`: A constant expression to set the option to. Accepts literals and scalar functions.
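A minimal invocation sketch using named parameters; the dotted option key below is a hypothetical placeholder to illustrate the backtick quoting, not a documented read_files option:

-- Options are passed as named parameters after the path.
SELECT * FROM read_files(
  's3://bucket/path',
  format => 'csv',
  `my.dotted.option` => 'value');   -- hypothetical dotted key; note the backticks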
Returns
A table containing the data from files read under the given path. The schema depends on the file format:
- BINARYFILE: Returns a fixed schema:

  | Column | Type | Description |
  |---|---|---|
  | path | STRING | The full path to the file. |
  | modificationTime | TIMESTAMP | The last modification time of the file. |
  | length | LONG | The size of the file in bytes. |
  | content | BINARY | The binary content of the file. |

  Use `* EXCEPT (content)` to exclude binary content when querying file metadata.
- TEXT: Returns a fixed schema with a single `value` (STRING) column.
- All other formats (JSON, CSV, XML, PARQUET, AVRO, ORC): The schema is inferred from the file contents, or provided explicitly using the `schema` option.
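For illustration, a minimal sketch of the fixed TEXT schema (the volume path is a placeholder); every row comes back in the single `value` column:

-- TEXT format returns rows with a single STRING column named `value`.
SELECT value
FROM read_files(
  '/Volumes/my_catalog/my_schema/my_volume/logs',
  format => 'text');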
_metadata column
read_files exposes a _metadata column with file-level metadata. This column is not included in SELECT * results and must be explicitly selected. It contains the following fields:
| Field | Type | Description |
|---|---|---|
| file_path | STRING | The full path to the source file. |
| file_name | STRING | The name of the source file. |
| file_size | LONG | The size of the source file in bytes. |
| file_modification_time | TIMESTAMP | The last modification time of the source file. |
| file_block_start | LONG | The start of the block of the file being read. |
| file_block_length | LONG | The length of the block of the file being read. |
To include _metadata in results, select it explicitly:
SELECT * EXCEPT (content), _metadata
FROM read_files('/Volumes/my_catalog/my_schema/my_volume', format => 'binaryFile');
File discovery
read_files can read an individual file or read files under a provided directory.
read_files discovers all files under the provided directory recursively unless a glob is provided, which instructs read_files to recurse into a specific directory pattern.
Filtering directories or files using glob patterns
Glob patterns can be used for filtering directories and files when provided in the path.
| Pattern | Description |
|---|---|
| ? | Matches any single character. |
| * | Matches zero or more characters. |
| [abc] | Matches a single character from character set {a,b,c}. |
| [a-z] | Matches a single character from the character range {a…z}. |
| [^a] | Matches a single character that is not from character set or range {a}. Note that the ^ character must occur immediately to the right of the opening bracket. |
| {ab,cd} | Matches a string from the string set {ab, cd}. |
| {ab,c{de,fh}} | Matches a string from the string set {ab, cde, cfh}. |
read_files uses Auto Loader's strict globber when discovering files with globs. This is configured by the useStrictGlobber option. When the strict globber is disabled, trailing slashes (/) are dropped and a star pattern such as /*/ can expand into discovering multiple directories. See the examples below for the difference in behavior; a usage sketch of the useStrictGlobber option follows the table.
| Pattern | File path | Strict globber disabled | Strict globber enabled |
|---|---|---|---|
| /a/b | /a/b/c/file.txt | Yes | Yes |
| /a/b | /a/b_dir/c/file.txt | No | No |
| /a/b | /a/b.txt | No | No |
| /a/b/ | /a/b.txt | No | No |
| /a/*/c/ | /a/b/c/file.txt | Yes | Yes |
| /a/*/c/ | /a/b/c/d/file.txt | Yes | Yes |
| /a/*/d/ | /a/b/c/d/file.txt | Yes | No |
| /a/*/c/ | /a/b/x/y/c/file.txt | Yes | No |
| /a/*/c | /a/b/c_file.txt | Yes | No |
| /a/*/c/ | /a/b/c_file.txt | Yes | No |
| /a/*/c | /a/b/cookie/file.txt | Yes | No |
| /a/b* | /a/b.txt | Yes | Yes |
| /a/b* | /a/b/file.txt | Yes | Yes |
| /a/{0.txt,1.txt} | /a/0.txt | Yes | Yes |
| /a/*/{0.txt,1.txt} | /a/0.txt | No | No |
| /a/b/[cde-h]/i/ | /a/b/c/i/file.txt | Yes | Yes |
Schema inference
The schema of the files can be explicitly provided to read_files with the schema option. When the schema is not provided, read_files attempts to infer a unified schema across the discovered files, which requires reading all the files unless a LIMIT statement is used. Even when using a LIMIT query, a larger set of files than required might be read to return a more representative schema of the data. Databricks automatically adds a LIMIT statement for SELECT queries in notebooks and the SQL editor if a user hasn't provided one.
The schemaHints option can be used to fix subsets of the inferred schema. See Override schema inference with schema hints for more details.
A rescuedDataColumn is provided by default to rescue any data that doesn't match the schema. See What is the rescued data column? for more details. You can drop the rescuedDataColumn by setting the option schemaEvolutionMode => 'none'.
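A minimal sketch combining these options (the path and column name are placeholders): pin one column type with schemaHints and drop the rescued data column by disabling schema evolution:

-- Infer the schema, but force `zip_code` to string and omit the rescued data column.
SELECT * FROM read_files(
  's3://bucket/path',
  format => 'json',
  schemaHints => 'zip_code string',
  schemaEvolutionMode => 'none');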
Partition schema inference
read_files can also infer partitioning columns if files are stored under Hive-style partitioned directories, that is /column_name=column_value/. If a schema is provided, the discovered partition columns use the types provided in the schema. If the partition columns are not part of the provided schema, then the inferred partition columns are ignored.
If a column exists in both the partition schema and in the data columns, the value read from the partition directory is used instead of the data value. If you want to ignore the values coming from the directory and use the data column instead, you can provide a comma-separated list of partition columns with the partitionColumns option.
The partitionColumns option can also be used to instruct read_files on which discovered columns to include in the final inferred schema. Providing an empty string ignores all partition columns.
The schemaHints option can also be provided to override the inferred schema for a partition column.
The TEXT and BINARYFILE formats have a fixed schema, but read_files also attempts to infer partitioning for these formats when possible.
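A minimal sketch under an assumed Hive-style layout such as s3://bucket/events/year=2024/month=01/... (the bucket and column names are placeholders):

-- Include only `year` in the inferred schema; pass '' to ignore all partition columns.
SELECT * FROM read_files(
  's3://bucket/events',
  format => 'parquet',
  partitionColumns => 'year');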
Authentication for cloud storage
read_files accesses cloud storage through Unity Catalog external locations. You must have the READ FILES privilege on the external location that contains the files you want to read. See Connect to cloud object storage using Unity Catalog.
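If the privilege has not been granted yet, a user with sufficient permissions can grant it; a minimal sketch (the external location and group names are placeholders):

-- Grant read access on the external location that contains the files.
GRANT READ FILES ON EXTERNAL LOCATION my_external_location TO `data_engineers`;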
Usage in streaming tables
read_files can be used in streaming tables to ingest files into Delta Lake. read_files leverages Auto Loader when used in a streaming table query. You must use the STREAM keyword with read_files. See What is Auto Loader? for more details.
When used in a streaming query, read_files uses a sample of the data to infer the schema, and can evolve the schema as it processes more data. See Configure schema inference and evolution in Auto Loader for more details.
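A minimal sketch of a streaming table definition (the table name, path, and schema hint are placeholders); note the required STREAM keyword:

-- The STREAM keyword is required when read_files backs a streaming table.
CREATE OR REFRESH STREAMING TABLE raw_events
AS SELECT * FROM STREAM read_files(
  'abfss://container@storageAccount.dfs.core.windows.net/events',
  format => 'json',
  schemaHints => 'event_time timestamp');   -- hypothetical column used as a hint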
Options
Basic options
| Option | Type | Description | Default value |
|---|---|---|---|
| format | String | The data file format in the source path. Auto-inferred if not provided. Allowed values include: avro, binaryFile, csv, json, orc, parquet, text, xml. | None |
| schema | String | The schema of the files to read. Provide a schema string using DDL format, for example 'id int, ts timestamp, event string'. When the schema is not provided, read_files attempts to infer a unified schema across the discovered files. | None |
| inferColumnTypes | Boolean | Whether to infer exact column types when leveraging schema inference. By default, exact column types are inferred for JSON and CSV datasets. See schema inference for more details. Note that this is the opposite of the default of Auto Loader. | true |
| partitionColumns | String | A comma-separated list of Hive-style partition columns that you would like inferred from the directory structure of the files. Hive-style partition columns are key-value pairs combined by an equality sign such as <base-path>/a=x/b=1/c=y/file.format. In this example, the partition columns are a, b, and c. By default these columns are automatically added to your schema if you are using schema inference and provide the <base-path> to load data from. If you provide a schema, Auto Loader expects these columns to be included in the schema. If you do not want these columns as part of your schema, you can specify "" to ignore them. In addition, you can use this option when you want columns to be inferred from the file path in complex directory structures, for example: <base-path>/year=2022/week=1/file1.csv, <base-path>/year=2022/month=2/day=3/file2.csv, <base-path>/year=2022/month=2/day=4/file3.csv. Specifying partitionColumns as year,month,day returns year=2022 for file1.csv, but the month and day columns are null; month and day are parsed correctly for file2.csv and file3.csv. | None |
| schemaHints | String | Schema information that you provide to Auto Loader during schema inference. See schema hints for more details. | None |
| useStrictGlobber | Boolean | Whether to use a strict globber that matches the default globbing behavior of other file sources in Apache Spark. See Common data loading patterns for more details. Available in Databricks Runtime 12.2 LTS and above. Note that this is the opposite of the default for Auto Loader. | true |
Format-specific options
For options specific to each file format (JSON, CSV, XML, Parquet, Avro, text, ORC, and binary), see DataFrameReader options.
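For example, CSV reader options such as header and delimiter can be passed the same way as the basic options; a minimal sketch under the assumption that these DataFrameReader CSV options apply (the path is a placeholder):

-- Pass CSV-specific reader options alongside the basic read_files options.
SELECT * FROM read_files(
  's3://bucket/path',
  format => 'csv',
  header => true,
  delimiter => ';');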
Streaming options
These options apply when using read_files inside a streaming table or streaming query.
| Option | Type | Description | Default value |
|---|---|---|---|
| allowOverwrites | Boolean | Whether to re-process files that have been modified after discovery. The latest available version of the file is processed during a refresh if it has been modified since the last successful refresh query start time. | false |
| includeExistingFiles | Boolean | Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup. This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has no effect. | true |
| maxBytesPerTrigger | Byte String | The maximum number of new bytes to be processed in every trigger. You can specify a byte string such as 10g to limit each microbatch to 10 GB of data. This is a soft maximum. If you have files that are 3 GB each, Azure Databricks processes 12 GB in a microbatch. When used together with maxFilesPerTrigger, Azure Databricks consumes up to the lower limit of maxFilesPerTrigger or maxBytesPerTrigger, whichever is reached first. Note: For streaming tables created on serverless SQL warehouses, this option and maxFilesPerTrigger should not be set, so that dynamic admission control, which scales by workload size and serverless compute resources, can give you the best latency and performance. | None |
| maxFilesPerTrigger | Integer | The maximum number of new files to be processed in every trigger. When used together with maxBytesPerTrigger, Azure Databricks consumes up to the lower limit of maxFilesPerTrigger or maxBytesPerTrigger, whichever is reached first. Note: For streaming tables created on serverless SQL warehouses, this option and maxBytesPerTrigger should not be set, so that dynamic admission control, which scales by workload size and serverless compute resources, can give you the best latency and performance. | 1000 |
| schemaEvolutionMode | String | The mode for evolving the schema as new columns are discovered in the data. By default, columns are inferred as strings when inferring JSON datasets. See schema evolution for more details. This option doesn't apply to text and binaryFile files. | "addNewColumns" when a schema is not provided; "none" otherwise. |
| schemaLocation | String | The location to store the inferred schema and subsequent changes. See schema inference for more details. The schema location is not required when used in a streaming table query. | None |
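A minimal sketch that caps microbatch size during ingestion (the table name, path, and option values are illustrative placeholders):

-- Limit each microbatch to at most 500 files or roughly 1 GB, whichever is reached first.
CREATE OR REFRESH STREAMING TABLE json_data
AS SELECT * FROM STREAM read_files(
  'gs://my-bucket/jsonData',
  format => 'json',
  maxFilesPerTrigger => 500,
  maxBytesPerTrigger => '1g');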
Examples
-- Reads the files available in the given path. Auto-detects the format and schema of the data.
> SELECT * FROM read_files('abfss://container@storageAccount.dfs.core.windows.net/base/path');
-- Reads the headerless CSV files in the given path with the provided schema.
> SELECT * FROM read_files(
's3://bucket/path',
format => 'csv',
schema => 'id int, ts timestamp, event string');
-- Infers the schema of CSV files with headers. Because the schema is not provided,
-- the CSV files are assumed to have headers.
> SELECT * FROM read_files(
's3://bucket/path',
format => 'csv')
-- Reads files that have a csv suffix.
> SELECT * FROM read_files('s3://bucket/path/*.csv')
-- Reads a single JSON file
> SELECT * FROM read_files(
'abfss://container@storageAccount.dfs.core.windows.net/path/single.json')
-- Reads JSON files and overrides the data type of the column `id` to integer.
> SELECT * FROM read_files(
's3://bucket/path',
format => 'json',
schemaHints => 'id int')
-- Reads files that have been uploaded or modified yesterday.
> SELECT * FROM read_files(
'gs://my-bucket/avroData',
modifiedAfter => date_sub(current_date(), 1),
modifiedBefore => current_date())
-- Creates a Delta table and stores the source file path as part of the data
> CREATE TABLE my_avro_data
AS SELECT *, _metadata.file_path
FROM read_files('gs://my-bucket/avroData')
-- Creates a streaming table that processes files that appear only after the table's creation.
-- The table will most likely be empty (if there's no clock skew) after being first created,
-- and future refreshes will bring new data in.
> CREATE OR REFRESH STREAMING TABLE avro_data
AS SELECT * FROM STREAM read_files('gs://my-bucket/avroData', includeExistingFiles => false);
Work with unstructured files
The following examples use BINARYFILE format to read and filter unstructured files stored in Unity Catalog volumes, and combine read_files with AI functions to process file contents.
List all files in a volume: Use * EXCEPT (content) to return file metadata without loading binary content, and select _metadata explicitly to include file-level metadata fields.
SELECT
* EXCEPT (content),
_metadata
FROM read_files(
'/Volumes/<catalog>/<schema>/<volume>',
format => 'binaryFile'
);
List image files filtered by size: Use fileNamePattern to target specific image file types and filter on _metadata.file_size to return only files within a given size range.
SELECT
* EXCEPT (content),
_metadata
FROM read_files(
'/Volumes/my_catalog/my_schema/my_volume',
format => 'binaryFile',
fileNamePattern => '*.{jpg,jpeg,png,JPG,JPEG,PNG}'
)
WHERE _metadata.file_size BETWEEN 20000 AND 1000000;
List PDF files modified within the past day: Use fileNamePattern to target PDF files and filter on modificationTime to return only files changed within the past day.
SELECT
* EXCEPT (content),
_metadata
FROM read_files(
'/Volumes/my_catalog/my_schema/my_volume',
format => 'binaryFile',
fileNamePattern => '*.{pdf,PDF}'
)
WHERE modificationTime >= current_timestamp() - INTERVAL 1 DAY;
Run an AI function on image files: Use ai_query to process image files read from a cloud storage path. Filter on _metadata fields to target specific files.
SELECT
path AS file_path,
ai_query(
'databricks-llama-4-maverick',
'Describe this image in ten words or less: ',
files => content
) AS result
FROM read_files(
's3://my-s3-bucket/path/to/images/',
format => 'binaryFile',
fileNamePattern => '*.{jpg,jpeg,png,JPG,JPEG,PNG}'
)
WHERE _metadata.file_size < 1000000
AND _metadata.file_name LIKE '%robots%';
Parse documents matching a filename pattern: Use ai_parse_document to extract structured content from PDFs and images. Filter by _metadata.file_name to target specific files.
SELECT
path AS file_path,
ai_parse_document(
content,
map('version', '2.0')
) AS result
FROM read_files(
'/Volumes/main/public/my_files/',
format => 'binaryFile',
fileNamePattern => '*.{jpg,jpeg,pdf,png}'
)
WHERE _metadata.file_name ILIKE '%receipt%';
Join files with a structured table: Unstructured workflows often require merging structured data stored in tables with unstructured files. The following example joins files in a cloud storage path with two structured tables, filtering by file size and a user attribute. The join with user_files is done by extracting the file ID from the file path using split and element_at.
SELECT
users.user_id,
user_files.file_id,
files._metadata.file_name AS file_name,
files.* EXCEPT (content),
ai_parse_document(files.content, map('version', '2.0')) AS parsed_document
FROM read_files(
's3://my-bucket-name/files/',
format => 'binaryFile',
fileNamePattern => '*.{pdf,doc,docx,ppt,pptx,png,jpg,jpeg}'
) AS files
JOIN user_files
ON user_files.file_id = element_at(split(files.path, '/'), -2)
JOIN users
ON users.user_id = user_files.user_id
WHERE users.email LIKE '%@databricks.com'
AND files._metadata.file_size < 10000000;