Reporting across scaled-out cloud databases (preview)

Article
11/04/2022

Query across shards

Sharded databases distribute rows across a scaled out data tier. The schema is identical on all participating databases, also known as horizontal partitioning. Using an elastic query, you can create reports that span all databases in a sharded database.

For a quickstart, see Reporting across scaled-out cloud databases.

For non-sharded databases, see Query across cloud databases with different schemas.

Prerequisites

Create a shard map using the elastic database client library. see Shard map management. Or use the sample app in Get started with elastic database tools.
Alternatively, see Migrate existing databases to scaled-out databases.
The user must possess ALTER ANY EXTERNAL DATA SOURCE permission. This permission is included with the ALTER DATABASE permission.
ALTER ANY EXTERNAL DATA SOURCE permissions are needed to refer to the underlying data source.

Overview

These statements create the metadata representation of your sharded data tier in the elastic query database.

1.1 Create database scoped master key and credentials

The credential is used by the elastic query to connect to your remote databases.

SQL

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'password';
CREATE DATABASE SCOPED CREDENTIAL [<credential_name>]  WITH IDENTITY = '<username>',  
SECRET = '<password>';

Note

Make sure that the "<username>" does not include any "@servername" suffix.

1.2 Create external data sources

Syntax:

syntaxsql

<External_Data_Source> ::=
    CREATE EXTERNAL DATA SOURCE <data_source_name> WITH
        (TYPE = SHARD_MAP_MANAGER,
                   LOCATION = '<fully_qualified_server_name>',
        DATABASE_NAME = '<shardmap_database_name>',
        CREDENTIAL = <credential_name>,
        SHARD_MAP_NAME = '<shardmapname>'
               ) [;]

Example

SQL

CREATE EXTERNAL DATA SOURCE MyExtSrc
WITH
(
    TYPE=SHARD_MAP_MANAGER,
    LOCATION='myserver.database.windows.net',
    DATABASE_NAME='ShardMapDatabase',
    CREDENTIAL= SMMUser,
    SHARD_MAP_NAME='ShardMap'
);

Retrieve the list of current external data sources:

SQL

select * from sys.external_data_sources;

The external data source references your shard map. An elastic query then uses the external data source and the underlying shard map to enumerate the databases that participate in the data tier. The same credentials are used to read the shard map and to access the data on the shards during the processing of an elastic query.

1.3 Create external tables

Syntax:

syntaxsql

CREATE EXTERNAL TABLE [ database_name . [ schema_name ] . | schema_name. ] table_name  
    ( { <column_definition> } [ ,...n ])
    { WITH ( <sharded_external_table_options> ) }
) [;]  

<sharded_external_table_options> ::=
  DATA_SOURCE = <External_Data_Source>,
  [ SCHEMA_NAME = N'nonescaped_schema_name',]
  [ OBJECT_NAME = N'nonescaped_object_name',]
  DISTRIBUTION = SHARDED(<sharding_column_name>) | REPLICATED |ROUND_ROBIN

Example

SQL

CREATE EXTERNAL TABLE [dbo].[order_line](
     [ol_o_id] int NOT NULL,
     [ol_d_id] tinyint NOT NULL,
     [ol_w_id] int NOT NULL,
     [ol_number] tinyint NOT NULL,
     [ol_i_id] int NOT NULL,
     [ol_delivery_d] datetime NOT NULL,
     [ol_amount] smallmoney NOT NULL,
     [ol_supply_w_id] int NOT NULL,
     [ol_quantity] smallint NOT NULL,
      [ol_dist_info] char(24) NOT NULL
)

WITH
(
    DATA_SOURCE = MyExtSrc,
     SCHEMA_NAME = 'orders',
     OBJECT_NAME = 'order_details',
    DISTRIBUTION=SHARDED(ol_w_id)
);

Retrieve the list of external tables from the current database:

SQL

SELECT * from sys.external_tables;

To drop external tables:

syntaxsql

DROP EXTERNAL TABLE [ database_name . [ schema_name ] . | schema_name. ] table_name[;]

Remarks

The DATA_SOURCE clause defines the external data source (a shard map) that is used for the external table.

The SCHEMA_NAME and OBJECT_NAME clauses map the external table definition to a table in a different schema. If omitted, the schema of the remote object is assumed to be dbo and its name is assumed to be identical to the external table name being defined. This is useful if the name of your remote table is already taken in the database where you want to create the external table. For example, you want to define an external table to get an aggregate view of catalog views or DMVs on your scaled out data tier. Since catalog views and DMVs already exist locally, you cannot use their names for the external table definition. Instead, use a different name and use the catalog view's or the DMV's name in the SCHEMA_NAME and/or OBJECT_NAME clauses. (See the example below.)

The DISTRIBUTION clause specifies the data distribution used for this table. The query processor utilizes the information provided in the DISTRIBUTION clause to build the most efficient query plans.

SHARDED means data is horizontally partitioned across the databases. The partitioning key for the data distribution is the <sharding_column_name> parameter.
REPLICATED means that identical copies of the table are present on each database. It is your responsibility to ensure that the replicas are identical across the databases.
ROUND_ROBIN means that the table is horizontally partitioned using an application-dependent distribution method.

Data tier reference: The external table DDL refers to an external data source. The external data source specifies a shard map that provides the external table with the information necessary to locate all the databases in your data tier.

Security considerations

Users with access to the external table automatically gain access to the underlying remote tables under the credential given in the external data source definition. Avoid undesired elevation of privileges through the credential of the external data source. Use GRANT or REVOKE for an external table as though it were a regular table.

Once you have defined your external data source and your external tables, you can now use full T-SQL over your external tables.

Example: querying horizontal partitioned databases

The following query performs a three-way join between warehouses, orders, and order lines and uses several aggregates and a selective filter. It assumes (1) horizontal partitioning (sharding) and (2) that warehouses, orders, and order lines are sharded by the warehouse ID column, and that the elastic query can co-locate the joins on the shards and process the expensive part of the query on the shards in parallel.

SQL

    select  
         w_id as warehouse,
         o_c_id as customer,
         count(*) as cnt_orderline,
         max(ol_quantity) as max_quantity,
         avg(ol_amount) as avg_amount,
         min(ol_delivery_d) as min_deliv_date
    from warehouse
    join orders
    on w_id = o_w_id
    join order_line
    on o_id = ol_o_id and o_w_id = ol_w_id
    where w_id > 100 and w_id < 200
    group by w_id, o_c_id

Stored procedure for remote T-SQL execution: sp_execute_remote

Elastic query also introduces a stored procedure that provides direct access to the shards. The stored procedure is called sp_execute _remote and can be used to execute remote stored procedures or T-SQL code on the remote databases. It takes the following parameters:

Data source name (nvarchar): The name of the external data source of type RDBMS.
Query (nvarchar): The T-SQL query to be executed on each shard.
Parameter declaration (nvarchar) - optional: String with data type definitions for the parameters used in the Query parameter (like sp_executesql).
Parameter value list - optional: Comma-separated list of parameter values (like sp_executesql).

The sp_execute_remote uses the external data source provided in the invocation parameters to execute the given T-SQL statement on the remote databases. It uses the credential of the external data source to connect to the shardmap manager database and the remote databases.

Example:

SQL

    EXEC sp_execute_remote
        N'MyExtSrc',
        N'select count(w_id) as foo from warehouse'

Connectivity for tools

Use regular SQL Server connection strings to connect your application, your BI, and data integration tools to the database with your external table definitions. Make sure that SQL Server is supported as a data source for your tool. Then reference the elastic query database like any other SQL Server database connected to the tool, and use external tables from your tool or application as if they were local tables.

Best practices

Ensure that the elastic query endpoint database has been given access to the shardmap database and all shards through the SQL Database firewalls.
Validate or enforce the data distribution defined by the external table. If your actual data distribution is different from the distribution specified in your table definition, your queries may yield unexpected results.
Elastic query currently does not perform shard elimination when predicates over the sharding key would allow it to safely exclude certain shards from processing.
Elastic query works best for queries where most of the computation can be done on the shards. You typically get the best query performance with selective filter predicates that can be evaluated on the shards or joins over the partitioning keys that can be performed in a partition-aligned way on all shards. Other query patterns may need to load large amounts of data from the shards to the head node and may perform poorly

Next steps

For an overview of elastic query, see Elastic query overview.
For a vertical partitioning tutorial, see Getting started with cross-database query (vertical partitioning).
For syntax and sample queries for vertically partitioned data, see Querying vertically partitioned data)
For a horizontal partitioning (sharding) tutorial, see Getting started with elastic query for horizontal partitioning (sharding).
See sp_execute _remote for a stored procedure that executes a Transact-SQL statement on a single remote Azure SQL Database or set of databases serving as shards in a horizontal partitioning scheme.

Additional resources

Documentation

Elastic Query Overview - Azure SQL Database

Elastic query enables you to run a Transact-SQL query that spans multiple databases.
Get Started With Cross-Database Queries - Azure SQL Database

How to use elastic database query with vertically partitioned databases
Query across cloud databases with different schema - Azure SQL Database

how to set up cross-database queries over vertical partitions
Report across scaled-out cloud databases - Azure SQL Database

Use cross database database queries to report across multiple databases.
sp_execute_remote (Azure SQL Database) - SQL Server

sp_execute_remote executes a Transact-SQL statement on a single remote Azure SQL Database or set of databases serving as shards in a horizontal partitioning scheme.
Managing credentials in the elastic database client library - Azure SQL Database

How to set the right level of credentials, admin to read-only, for elastic database apps
T-SQL Differences Between SQL Server and Azure SQL Database - Azure SQL Database

This article discusses the Transact-SQL (T-SQL) differences between an Azure SQL Database and SQL Server.
T-SQL Differences Between SQL Server & Azure SQL Managed Instance - Azure SQL Managed Instance

This article discusses the Transact-SQL (T-SQL) differences between an Azure SQL Managed Instance and SQL Server.

Training

Module

Introduction to SQL Server 2022 data virtualization - Training

Learn about data virtualization, how to use Polybase to access and query external data, and enhanced Polybase features in SQL Server 2022.

Certification

Microsoft Certified: Azure Database Administrator Associate - Certifications

Administer an SQL Server database infrastructure for cloud, on-premises and hybrid relational databases using the Microsoft PaaS relational database offerings.

Share via

Reporting across scaled-out cloud databases (preview)

Prerequisites

Overview

1.1 Create database scoped master key and credentials

1.2 Create external data sources

Example

1.3 Create external tables

Remarks

Security considerations

Example: querying horizontal partitioned databases

Stored procedure for remote T-SQL execution: sp_execute_remote

Connectivity for tools

Best practices

Next steps

Feedback

Additional resources