Ingest Google Drive files into Azure Databricks

Important

This feature is in Beta. Workspace admins can control access to this feature from the Previews page. See Manage Azure Databricks previews.

Learn how to ingest data from Google Drive into Azure Databricks. The standard Google Drive connector in Lakeflow Connect lets you use Spark and Databricks SQL interfaces (read_files, spark.read, COPY INTO, and Auto Loader) to create Spark DataFrames, materialized views, and streaming tables directly from files in Google Drive.

This approach allows you to build custom pipelines for common file ingestion use cases:

  • Streaming file ingestion (unstructured): Ingesting many source files (for example, PDFs, Google Docs, and Google Slides) into a single target table as binary data, ideal for RAG pipelines.
  • Streaming file ingestion (structured): Merging many source files (for example, CSV and JSON files) into a single, structured target table.
  • Batch file ingestion: Ingesting a single, specific file (like a Google Sheet) or a batch of files into one target table.

The following interfaces are supported:

  • Databricks Asset Bundles
  • Databricks APIs
  • Databricks SDKs
  • Databricks CLI

Limitations

  • The connector is API-only and does not support pipeline creation in the Azure Databricks UI.
  • You can use the pathGlobFilter option to filter files by name (for example, pathGlobFilter => '*.csv'). Built-in Google formats (for example, Google Docs or Google Sheets) can't be filtered using this option. Folder path filtering is also not supported.
  • Unsupported formats include Google Forms, Google Sites, Google Jams, and Google Vids. Unsupported formats are skipped during ingestion.
  • Setting recursiveFileLookup=false is not supported for Spark batch reads. Using recursiveFileLookup=false will behave the same as recursiveFileLookup=true.

Requirements

Before you begin, make sure you have:

  • A Unity Catalog-enabled workspace.
  • CREATE CONNECTION permissions.
  • Databricks Runtime 17.3 or above.
  • The Excel file format Beta feature enabled, if you want to ingest Google Sheets or Excel files. See Read Excel files.
  • A Google account with the necessary permissions to create a Google Cloud project.

Configure OAuth 2.0

Set up a Google Cloud project and activate the Google Drive API

  1. Go to the Google Cloud console.
  2. Create a new project. You might be prompted to add two-factor authentication.
  3. Go to APIs & Services > Library.
  4. Search for "Google Drive API".
  5. Select Google Drive API.
  6. Click Enable.

Configure the OAuth consent screen

  1. On the Google Cloud console homescreen, go to APIs & Services > OAuth Consent Screen. You will see a message that says "Google Auth Platform not configured yet".

  2. Click Get Started.

  3. Fill in the App Information section. Enter any name for the app (for example, Databricks connection). The support email can be any email address in your organization.

  4. Click Next.

  5. In the Audience section, select External, then click Next.

  6. Fill in the Contact Information section, then click Next.

  7. Review the Google API Services User Data Policy, then click Create.

  8. Navigate back to Google Auth Platform > Data Access.

  9. Click Add or remove scopes.

  10. Add the following scope in the Manually add scopes section, click Add to table, then click Update:

    https://www.googleapis.com/auth/drive.readonly

  11. Click Save.

Create OAuth 2.0 client credentials

  1. On the Google Cloud console homescreen, go to APIs & Services > Credentials.
  2. Click Create Credentials > OAuth Client ID.
  3. Choose Web Application and set a custom name.
  4. In Authorized Redirect URIs, click Add URI.
  5. Add the redirect URI <databricks-instance-url>/login/oauth/google.html, replacing <databricks-instance-url> with your Azure Databricks instance URL. For example: https://instance-name.databricks.com/login/oauth/google.html
  6. Click Create. A pop-up that contains your credentials appears.
  7. Record the following values. Alternatively, you can download the OAuth Client JSON file, which contains this information:
    • Client ID (format: 0123******-********************************.apps.googleusercontent.com)
    • Client secret (format: ABCD**-****************************)

Add test users to your project

  1. Navigate to Google Auth Platform > Audience.
  2. Under Test users, click Add users.
  3. Add the email address of the Google account you will use to create the connection.

Create a connection

  1. In the Azure Databricks workspace, click Catalog > External locations > Connections > Create connection.

  2. On the Connection basics page of the Set up connection wizard, specify a unique Connection name.

  3. In the Connection type drop-down menu, search for and select Google Drive.

  4. (Optional) Add a comment.

  5. Click Next.

  6. On the Authentication page, enter the Client ID and Client secret that you recorded in Create OAuth 2.0 client credentials.

  7. Click Sign in with Google and sign in with the Google account from Add test users to your project.

  8. Click Continue, then click Continue again.

  9. After you're redirected back to the Azure Databricks workspace, click Create connection.

Ingest files from Google Drive

This connector allows you to ingest files from Google Drive using read_files (Databricks SQL), Auto Loader (.readStream with cloudFiles), COPY INTO, and spark.read. You must provide the following values:

  • The Unity Catalog connection, using the databricks.connection data source option.
  • The URL of the Google Drive resource you want to access as the path. The URL can refer to a specific file, a folder, or an entire drive. For example:
    • https://docs.google.com/spreadsheets/d/12345/edit?random_query_params_here
    • https://drive.google.com/drive/u/0/folders/12345
    • https://docs.google.com/document/d/12345/edit
    • https://drive.google.com/file/d/1kiXnHmU4Y8X66ijULky5EPDNCGtT14Ps/view?usp=drive_link
    • https://drive.google.com/drive/
    • https://drive.google.com/drive/my-drive
    • https://drive.google.com/

Auto Loader

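A minimal sketch of streaming ingestion with Auto Loader, assuming a connection named my_connection; the folder URL, checkpoint path, and target table name are placeholders:

# Stream every file in a Drive folder into one table as binary data (for example, for a RAG pipeline).
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .option("databricks.connection", "my_connection")  # Unity Catalog connection (placeholder name)
    .load("https://drive.google.com/drive/u/0/folders/12345")  # placeholder folder URL
)

(
    df.writeStream
    .option("checkpointLocation", "/Volumes/main/default/checkpoints/gdrive")  # placeholder path
    .trigger(availableNow=True)
    .toTable("main.default.gdrive_files")  # placeholder target table
)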

Spark batch read

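A minimal sketch of a batch read over a Drive folder, assuming a connection named my_connection; the folder URL and target table name are placeholders:

# Batch-read all CSV files in a Drive folder; subfolders are always read recursively.
df = (
    spark.read.format("csv")
    .option("databricks.connection", "my_connection")  # placeholder connection name
    .option("header", "true")
    .option("pathGlobFilter", "*.csv")  # optional filename filter (see Limitations)
    .load("https://drive.google.com/drive/u/0/folders/12345")  # placeholder folder URL
)

df.write.mode("overwrite").saveAsTable("main.default.gdrive_csv")  # placeholder target table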

Single-file Spark read

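A minimal sketch that reads a single file by its Drive URL as binary content, assuming a connection named my_connection:

# Read one specific file; the resulting DataFrame uses the binaryFile schema described in Schemas below.
df = (
    spark.read.format("binaryFile")
    .option("databricks.connection", "my_connection")  # placeholder connection name
    .load("https://drive.google.com/file/d/1kiXnHmU4Y8X66ijULky5EPDNCGtT14Ps/view?usp=drive_link")
)

display(df.select("path", "length", "modificationTime"))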

Spark SQL

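A minimal sketch using read_files in SQL, following the same pattern as the _metadata example later on this page; the folder URL and connection name are placeholders:

SELECT *
FROM read_files(
  "https://drive.google.com/drive/u/0/folders/12345",
  `databricks.connection` => "my_connection",
  format => "csv",
  header => true
);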

Lakeflow Spark Declarative Pipelines

Databricks SQL

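A minimal sketch of a streaming table defined in declarative pipelines SQL; the table name, folder URL, and connection name are placeholders:

CREATE OR REFRESH STREAMING TABLE gdrive_docs
AS SELECT *
FROM STREAM read_files(
  "https://drive.google.com/drive/u/0/folders/12345",
  `databricks.connection` => "my_connection",
  format => "binaryFile"
);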

Python

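A minimal sketch of the same streaming table defined with the dlt Python API; the table name, folder URL, and connection name are placeholders:

import dlt

@dlt.table(name="gdrive_docs")
def gdrive_docs():
    # Incrementally ingest files from the Drive folder as binary data.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .option("databricks.connection", "my_connection")  # placeholder connection name
        .load("https://drive.google.com/drive/u/0/folders/12345")  # placeholder folder URL
    )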

ai_parse_document

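A minimal sketch that combines binary ingestion with the ai_parse_document SQL function to extract document content (for example, from PDFs) for downstream RAG use; the folder URL and connection name are placeholders:

SELECT
  path,
  ai_parse_document(content) AS parsed
FROM read_files(
  "https://drive.google.com/drive/u/0/folders/12345",
  `databricks.connection` => "my_connection",
  format => "binaryFile"
);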

How built-in Google formats are handled

You do not need to export built-in Google files (such as Docs, Sheets, Slides, and Drawings) manually. The connector automatically exports them to an open format during ingestion.

The default export formats are:

  • Google Docs: application/vnd.openxmlformats-officedocument.wordprocessingml.document (DOCX)
  • Google Sheets: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet (XLSX)
  • Google Slides: application/vnd.openxmlformats-officedocument.presentationml.presentation (PPTX)
  • Google Drawings: application/pdf (PDF)

Google Drive export format configuration

You can control how Google Drive native files are exported by setting Spark configurations using spark.conf.set(). These configurations determine the MIME type used when exporting Google Docs, Sheets, Slides, and Drawings.

Configuration keys

  • fs.gdrive.format.document.export: Google Docs export format.
  • fs.gdrive.format.spreadsheet.export: Google Sheets export format.
  • fs.gdrive.format.presentation.export: Google Slides export format.
  • fs.gdrive.format.drawing.export: Google Drawings export format.

For a complete list of supported export formats, see Export MIME types for Google Workspace documents in the Google Workspace documentation.

Example: Exporting a Google Docs file as TXT:

spark.conf.set("fs.gdrive.format.document.export", "text/plain")
df = spark.read.text("https://docs.google.com/document/d/1a2b3c4d...")

Example: Exporting a Google Sheets file as CSV:

spark.conf.set("fs.gdrive.format.spreadsheet.export", "text/csv")
df = spark.read.option("header", "true").csv("https://docs.google.com/spreadsheets/d/1a2b3c4d...")

Schemas

binaryFile format

When you use format => 'binaryFile', the resulting table has the following schema:

  • path (string): The full URL to the file.
  • modificationTime (timestamp): The last modified time of the file.
  • length (long): The size of the file in bytes.
  • content (binary): The raw binary content of the file.

_metadata column

You can get metadata information for input files with the _metadata column like file_name, file_path, file_size, and file_modification_time. The _metadata column is a hidden column, and is available for all input file formats. To include the _metadata column in the returned DataFrame, you must explicitly select it in the read query where you specify the source. For more information, see File metadata column.

For Google Drive native files (such as Google Docs, Google Sheets, and Google Slides), the file_size field refers to the size of the file stored on Google Drive, not the size of the file in the exported format (for example, DOCX, XLSX, or PPTX).

Example of selecting _metadata:

SELECT *, _metadata FROM read_files(
  "https://drive.google.com/",
  `databricks.connection` => "my_connection",
  format => "binaryFile"
);

FAQ

I have a folder of multiple structured files (for example, multiple Google Sheets). How do I load each sheet or file as its own Delta table?

You must create a separate ingestion query for each file that you want to load into its own Delta table.
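
For example, a sketch that loads two placeholder Google Sheets into two separate tables, assuming the spreadsheet export format is set to CSV and a connection named my_connection:

# One read and one write per source file; each file lands in its own Delta table.
spark.conf.set("fs.gdrive.format.spreadsheet.export", "text/csv")

sheets = {
    "main.default.sales": "https://docs.google.com/spreadsheets/d/12345/edit",  # placeholder
    "main.default.costs": "https://docs.google.com/spreadsheets/d/67890/edit",  # placeholder
}

for table_name, url in sheets.items():
    (
        spark.read.option("header", "true")
        .option("databricks.connection", "my_connection")  # placeholder connection name
        .csv(url)
        .write.mode("overwrite")
        .saveAsTable(table_name)
    )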

My files require custom parsing. How do I provide these parsing parameters to ensure my files are read correctly?

The connector supports all file format options available in Auto Loader, COPY INTO, and Spark. For details, see the format options documentation for each of these interfaces.

Are files in subfolders recursively ingested?

When you use Auto Loader APIs (spark.readStream and read_files), all subfolders are recursively discovered and ingested. This is also true for batch spark.read, where files in the provided folder path are always read recursively.

My Google Sheet has many irregularities and requires specific parsing and cell-range extraction (for example, multiple tables per sheet). The schema cannot be inferred automatically. How do I handle this?

You can use the Excel file format parsing options to parse your Google Sheets file into your desired format. See Read Excel files.

Alternatively, you can turn off schema inference in Auto Loader, PySpark, or read_files. The resulting table has default column names, all columns are read as strings, and the table might be sparse. You can then perform any necessary transformations downstream.
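
As a minimal sketch, the following reads a Sheet with schema inference turned off so that every column arrives as a string with default names (_c0, _c1, ...); the connection name and Sheet URL are placeholders:

spark.conf.set("fs.gdrive.format.spreadsheet.export", "text/csv")

raw = (
    spark.read.option("header", "false")      # keep default column names (_c0, _c1, ...)
    .option("inferSchema", "false")           # read all columns as strings
    .option("databricks.connection", "my_connection")  # placeholder connection name
    .csv("https://docs.google.com/spreadsheets/d/12345/edit")  # placeholder Sheet URL
)

# Apply custom transformations downstream (rename columns, slice cell ranges, cast types).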