COPY INTO

Applies to: Databricks SQL and Databricks Runtime

Loads data from a file location into a Delta table. This is a retryable and idempotent operation: files in the source location that have already been loaded are skipped, even if they have been modified since they were loaded. For examples, see Common data loading patterns using COPY INTO.

Syntax

COPY INTO target_table [ BY POSITION | ( col_name [ , <col_name> ... ] ) ]
  FROM { source_clause |
         ( SELECT expression_list FROM source_clause ) }
  FILEFORMAT = data_source
  [ VALIDATE [ ALL | num_rows ROWS ] ]
  [ FILES = ( file_name [, ...] ) | PATTERN = glob_pattern ]
  [ FORMAT_OPTIONS ( { data_source_reader_option = value } [, ...] ) ]
  [ COPY_OPTIONS ( { copy_option = value } [, ...] ) ]

source_clause
  source [ WITH ( [ CREDENTIAL { credential_name |
                                 (temporary_credential_options) } ]
                  [ ENCRYPTION (encryption_options) ] ) ]
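
A minimal sketch of this syntax, assuming the target Delta table already exists and the source path is an external location you have READ FILES permissions on; the table name and path below are placeholders.

COPY INTO my_catalog.my_schema.sales_bronze
  FROM 's3://my-bucket/raw/sales/'
  FILEFORMAT = PARQUET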

Parameters

  • target_table

    Identifies an existing Delta table. The target_table must not include a temporal specification or options specification.

    If the table name is provided in the form of a location, such as delta.`/path/to/table`, Unity Catalog can govern access to the locations that are being written to. You can write to an external location by:

    • Defining the location as an external location and having WRITE FILES permissions on that external location.
    • Having WRITE FILES permissions on a named storage credential that provides authorization to write to a location using: COPY INTO delta.`/some/location` WITH (CREDENTIAL <named-credential>)

    See Connect to cloud object storage using Unity Catalog for more details.

  • BY POSITION | ( col_name [ , <col_name> ... ] )

    Matches source columns to target table columns by ordinal position. Type casting of the matched columns is done automatically.

    This parameter is supported only for the headerless CSV file format. You must specify FILEFORMAT = CSV, and FORMAT_OPTIONS must be set to ("headers" = "false"), which is the default. For an illustration, see the example sketches after this parameter list.

    Syntax option 1: BY POSITION

    • Matches source columns to target table columns by ordinal position automatically.
      • The default matching by column name is not used.
      • IDENTITY columns and GENERATED columns of the target table are ignored when matching the source columns.
      • If the number of source columns doesn't equal the number of filtered target table columns, COPY INTO raises an error.

    Syntax option 2: ( col_name [ , <col_name> ... ] )

    • Matches source columns to the specified target table columns by relative ordinal position, using a comma-separated list of target table column names in parentheses.
      • The original table column order and column names are not used for matching.
      • IDENTITY columns and GENERATED columns cannot be specified in the column name list, otherwise COPY INTO raises an error.
      • The specified columns cannot be duplicated.
      • If the number of source columns doesn't equal the number of specified table columns, COPY INTO raises an error.
      • For the columns that are not specified in the column name list, COPY INTO assigns default values, if any, and assigns NULL otherwise. If any column is not nullable, COPY INTO raises an error.
  • source

    The file location to load the data from. Files in this location must have the format specified in FILEFORMAT. The location is provided in the form of a URI.

    Access to the source location can be provided through:

    • credential_name

      Optional name of the credential used to access or write to the storage location. You use this credential only if the file location is not included in an external location. See credential_name.

    • Inline temporary credentials.

    • Defining the source location as an external location and having READ FILES permissions on the external location through Unity Catalog.
    • Using a named storage credential with READ FILES permissions that provide authorization to read from a location through Unity Catalog.

    You don't need to provide inline or named credentials if the path is already defined as an external location that you have permissions to use. See Overview of external locations for more details.

    Note

    If the source file path is a root path, please add a slash (/) at the end of the file path, for example, s3://my-bucket/.

    Accepted credential options are:

    • AZURE_SAS_TOKEN for ADLS and Azure Blob Storage
    • AWS_ACCESS_KEY, AWS_SECRET_KEY, and AWS_SESSION_TOKEN for AWS S3

    Accepted encryption options are:

    • TYPE = 'AWS_SSE_C', and MASTER_KEY for AWS S3

    See Load data using COPY INTO with temporary credentials. A sketch using inline temporary credentials appears after this parameter list.

  • SELECT expression_list

    Selects the specified columns or expressions from the source data before copying into the Delta table. The expressions can be anything you use with SELECT statements, including window operations. You can use aggregation expressions only for global aggregates; you cannot GROUP BY on columns with this syntax. A sketch using a SELECT expression list appears after this parameter list.

  • FILEFORMAT = data_source

    The format of the source files to load. One of CSV, JSON, AVRO, ORC, PARQUET, TEXT, BINARYFILE.

  • VALIDATE

    Applies to: Databricks SQL and Databricks Runtime 10.4 LTS and above

    The data that is to be loaded into a table is validated but not written to the table. These validations include:

    • Whether the data can be parsed.
    • Whether the schema matches that of the table or if the schema needs to be evolved.
    • Whether all nullability and check constraints are met.

    The default is to validate all of the data that is to be loaded. You can provide a number of rows to be validated with the ROWS keyword, such as VALIDATE 15 ROWS. The COPY INTO statement returns a preview of the data of 50 rows or fewer when a number less than 50 is used with the ROWS keyword.

  • FILES

    A list of file names to load, with a limit of 1000 files. Cannot be specified with PATTERN.

  • PATTERN

    A glob pattern that identifies the files to load from the source directory. Cannot be specified with FILES.

    Pattern          Description
    ?                Matches any single character
    *                Matches zero or more characters
    [abc]            Matches a single character from character set {a,b,c}.
    [a-z]            Matches a single character from the character range {a...z}.
    [^a]             Matches a single character that is not from character set or range {a}. Note that the ^ character must occur immediately to the right of the opening bracket.
    {ab,cd}          Matches a string from the string set {ab, cd}.
    {ab,c{de, fh}}   Matches a string from the string set {ab, cde, cfh}.
  • FORMAT_OPTIONS

    Options to be passed to the Apache Spark data source reader for the specified format. See Format options for each file format.

  • COPY_OPTIONS

    Options to control the operation of the COPY INTO command.

    • force: boolean, default false. If set to true, idempotency is disabled and files are loaded regardless of whether they've been loaded before.
    • mergeSchema: boolean, default false. If set to true, the schema can be evolved according to the incoming data.
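
The following sketches illustrate the parameters above; they are not authoritative examples. All table names, paths, columns, and credential values are placeholders, and each statement assumes the target Delta table already exists. First, BY POSITION and the column name list for headerless CSV:

COPY INTO my_catalog.my_schema.people
  BY POSITION
  FROM 's3://my-bucket/raw/people/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('headers' = 'false')

-- Alternatively, match by relative position against named target columns:
COPY INTO my_catalog.my_schema.people (first_name, last_name, age)
  FROM 's3://my-bucket/raw/people/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('headers' = 'false')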
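
A sketch of loading specific files with inline temporary credentials; the ADLS path, file names, and SAS token value are placeholders (for S3 sources, the AWS_ACCESS_KEY, AWS_SECRET_KEY, and AWS_SESSION_TOKEN options listed above apply instead):

COPY INTO my_catalog.my_schema.events
  FROM 'abfss://container@storageaccount.dfs.core.windows.net/base/path'
    WITH (CREDENTIAL (AZURE_SAS_TOKEN = '...'))
  FILEFORMAT = JSON
  FILES = ('f1.json', 'f2.json')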
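
A sketch combining a SELECT expression list with VALIDATE to preview a transformed load before writing anything; the column names are placeholders. Rerun the statement without the VALIDATE clause to perform the actual load:

COPY INTO my_catalog.my_schema.orders
  FROM (
    SELECT order_id,
           CAST(amount AS DOUBLE) AS amount,
           current_timestamp() AS load_time
    FROM 's3://my-bucket/raw/orders/'
  )
  FILEFORMAT = JSON
  VALIDATE 15 ROWS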
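
A sketch combining PATTERN, FORMAT_OPTIONS, and COPY_OPTIONS to load only matching CSV files and allow the table schema to evolve; the glob and reader options shown are illustrative:

COPY INTO my_catalog.my_schema.clickstream
  FROM 's3://my-bucket/raw/clicks/'
  FILEFORMAT = CSV
  PATTERN = '2024/*/*.csv'
  FORMAT_OPTIONS ('delimiter' = '|', 'inferSchema' = 'true')
  COPY_OPTIONS ('mergeSchema' = 'true')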

Invoke COPY INTO concurrently

COPY INTO supports concurrent invocations against the same table. As long as COPY INTO is invoked concurrently on distinct sets of input files, each invocation should eventually succeed; otherwise, you get a transaction conflict. COPY INTO should not be invoked concurrently to improve performance; a single COPY INTO command with multiple files typically performs better than running concurrent COPY INTO commands with a single file each. COPY INTO can be called concurrently when:

  • Multiple data producers don't have an easy way to coordinate and cannot make a single invocation.
  • A very large directory can be ingested sub-directory by sub-directory, as sketched below. When ingesting directories with a very large number of files, Databricks recommends using Auto Loader when possible.
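
For instance, two jobs could concurrently ingest distinct sub-directories of the same source into one table; a sketch with placeholder table and paths:

-- Job 1
COPY INTO my_catalog.my_schema.events
  FROM 's3://my-bucket/events/date=2024-01-01/'
  FILEFORMAT = JSON

-- Job 2, running at the same time against a different sub-directory
COPY INTO my_catalog.my_schema.events
  FROM 's3://my-bucket/events/date=2024-01-02/'
  FILEFORMAT = JSON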

Access file metadata

To learn how to access metadata for file-based data sources, see File metadata column.
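
For instance, the _metadata column can be referenced in a SELECT expression list to capture the source file name alongside the data; a sketch with placeholder names, assuming the target table has a source_file column:

COPY INTO my_catalog.my_schema.raw_events
  FROM (
    SELECT *, _metadata.file_name AS source_file
    FROM 's3://my-bucket/raw/events/'
  )
  FILEFORMAT = JSON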

Format options

For options specific to each file format (JSON, CSV, XML, Parquet, Avro, text, ORC, and binary), see DataFrameReader options.