Microsoft SharePoint connector reference

Important

The Microsoft SharePoint connector is in Beta.

This page contains reference material for the Microsoft SharePoint connector in Databricks Lakeflow Connect.

Ingested data format

The ingested data lands in the following format. A site in SharePoint maps to a schema in Azure Databricks. A drive in the SharePoint site maps to a table in the destination schema.

Field Type Description
file_id String The unique SharePoint identifier of the file.
file_metadata Struct Contains generic file metadata:
  • name (string): The name of the file, as it appears in SharePoint.
  • size_in_bytes (bigint): The size of the file.
  • created_timestamp (timestamp): The timestamp at which the file was created in SharePoint.
  • last_modified_timestamp (timestamp): The timestamp at which the file was last modified in SharePoint.
source_metadata Struct Contains SharePoint-specific metadata for the file:
  • site_id (string): The SharePoint site identifier.
  • drive_id (string): The SharePoint drive identifier.
  • file_folder_path (string): The file path of the file in SharePoint (for example, /drives/d1/root:/folder1).
  • quick_xor_hash (string): A custom hash provided by Microsoft that can be used to validate that your downloaded content is accurate. This value can be NULL (for example, if the format does not support hashing). See Code Snippets: QuickXorHash Algorithm in the Microsoft documentation.
  • mime_type (string): The MIME type (format) of the file.
  • web_url (string): A link to the file in SharePoint.
content Struct Contains the file content. Databricks does not recommend accessing this struct directly. Instead, access it using the UDFs described in Downstream RAG use case.
sequence_id Long A sequencing key for ordering different versions of the same file (see the version-ordering example after this table).
is_deleted Boolean Ignore this column. The value is always false. If you need to identify deleted files, Databricks recommends enabling SCD type 2 and using the __END_AT column (see the deletion example after this table).
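
The following is a minimal sketch of querying an ingested drive table and reading the nested metadata fields described above. It assumes a Databricks notebook where `spark` is available; the catalog, schema, and table names (`my_catalog.my_sharepoint_site.my_drive`) are placeholders for the destination you configured for your pipeline.

```python
from pyspark.sql import functions as F

# Placeholder name: <destination catalog>.<schema for the site>.<table for the drive>
df = spark.table("my_catalog.my_sharepoint_site.my_drive")

# Nested struct fields such as file_metadata and source_metadata are
# addressed with dot notation.
(
    df.select(
        "file_id",
        F.col("file_metadata.name").alias("file_name"),
        F.col("file_metadata.size_in_bytes").alias("size_in_bytes"),
        F.col("file_metadata.last_modified_timestamp").alias("last_modified"),
        F.col("source_metadata.web_url").alias("web_url"),
    )
    .orderBy(F.col("last_modified").desc())
    .show(truncate=False)
)
```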
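
Because sequence_id orders different versions of the same file, a window over file_id can reduce the table to the latest version of each file. This is a sketch under the same placeholder table name as above, not part of the connector itself.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.table("my_catalog.my_sharepoint_site.my_drive")

# Rank versions of each file by sequence_id, newest first, and keep the top row.
latest_per_file = Window.partitionBy("file_id").orderBy(F.col("sequence_id").desc())

latest = (
    df.withColumn("rn", F.row_number().over(latest_per_file))
      .filter(F.col("rn") == 1)
      .drop("rn")
)
```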
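
If you enable SCD type 2, one way to surface deleted files is to look for files that no longer have an open row. This sketch assumes the standard SCD type 2 behavior in which the current row of a record has a NULL __END_AT and a deleted record's last row is closed with a non-null __END_AT; verify this against your pipeline's output before relying on it.

```python
from pyspark.sql import functions as F

hist = spark.table("my_catalog.my_sharepoint_site.my_drive")

# A file that still exists has at least one row with __END_AT IS NULL.
# Files with no open row have been removed from the SharePoint drive.
deleted_files = (
    hist.groupBy("file_id")
        .agg(F.max(F.col("__END_AT").isNull().cast("int")).alias("has_open_row"))
        .filter(F.col("has_open_row") == 0)
)
```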