Azure SQL Server on Managed Instance not the best fit for Data Warehouse + ETL workloads! Do you agree?

Marcelo Barbieri 30 Reputation points
2023-06-07T15:01:15.14+00:00

Hi,

I currently have a Data Warehouse on an Azure Managed Instance and SSIS Packages hosted on the SSISDB Catalog of the same instance being executed via Azure-SSIS Integration Runtime.

By looking at the features offered by a Microsoft Managed Instance, it gives me the impression that it’s designed for OLTP (Online Transaction Processing) workloads, and not optimized for heavy ETL processes that involve large volumes of data transformations and loading.

Here are some arguments that supports the statement above and considering moving from Azure SQL Server MI to SQL Server on a VM:

§  Recovery Model

Staging and DW database can be set to SIMPLE or Bulk Logged Recover Model, when applicable.
A database in Full recover model generates a considerable amount of transaction log that needs to be backup up frequently, e.g. every 5 minutes. This is recommended for systems that need to recover to a certain point in time, high availability groups, log shipping, etc. This is not the case for many Data Warehouse solutions, which can be recovery by re-running jobs or restoring last night backup, i.e. RPO of 1 day.
Full recover model puts an unnecessary pressure on the system while Simple allows the transaction log to be recycled automatically.

§  Control of Backups

Currently I have Full Databases backups running once a week, differential ones every day and transaction logs every 5 minutes. I have no control of it, as they are set up automatically by the Managed Instance. I can only input the retention policy, i.e. No. of days to keep a backup. However, the Full and differential backups run in the middle of the night while the SSIS jobs are running. The impact in performance is evident, specially when the Full backup runs.  
Backups are also being replicated to another geographical location causing network waits. However, it can be configured to replicate locally.

§  Flexible Resource Configuration

My current solution runs on a SQL MI Standard Configuration, which uses magnetic drives. I know that SSDs can be order of magnitude faster, but the vCore purchasing model of the MI doesn't allow me to upgrade the disks only. I must move to a Premium Performance Configuration in order to upgrade the disks, and even though, I must increase the number of cores until I'm allowed to select the amount of disk I need, which has a high cost implication.

On a VM, I have four options of storage on SSDs to choose from, which is not tied to the number of cores. And also much more freedom to tweak the system for data integration workloads.

§  Replaces the SSIS Integration Runtime
I can kill two birds with one stone as the same SQL Server on a VM can be used to host the databases and the SSIS runtime.

§  Choose SQL Server Edition

As most features of Enterprise Edition is not required for the current solution, it could be replaced by an Standard Edition, which can be 3x cheaper.

§  Full control of SQL Server

o   Full control of SQL Server Instance and database configuration

o  I can use open source and 3rd-party monitoring and health check tools that are currently designed and tested on VMs only, which is the case for most of tools out there.

 

Some of the consideration (things you lose) when moving away from MI can be considered as wanted features for most Data Warehouse + ETL environments, such as:

§  You may use manual or automated backups.

§  You need to implement your own High-Availability solution.

§  There is a downtime while changing the resources(CPU/storage).

§  You need SQL Server License (your own or included with the VM).

So, I would really appreciate if you could share your take on this. How do the arguments above sound? Any recommendations?

Thanks so much,

Regards,

Marcelo.

SQL Server on Azure Virtual Machines
Azure SQL Database
SQL Server Integration Services
SQL Server Integration Services
A Microsoft platform for building enterprise-level data integration and data transformations solutions.
2,453 questions
{count} votes

Accepted answer
  1. Alberto Morillo 32,886 Reputation points MVP
    2023-06-07T16:37:27.2066667+00:00

    I work for an organization that creates data warehouse solutions. In my experience, data warehouse environments demand a lot of RAM more than anything else (CPU or storage performance). With Managed Instance the RAM available depends of the CPU/vcores you have assigned to Managed Instance, and to have good amount of RAM available for ETL (for example OLDB Data Sources) you will need to assign a good number of vCores to the instance and that will make it very expensive.

    In addition, Managed Instance (General Purpose) storage subsystem does not have a scalable and with good enough performance for large workloads like a Data Warehouse. Managed instance offers 5-10 ms latency, 30-50 K IOPS. You can make huge performance gains on ETL processes by configuring correctly the size of the log files of the Azure Managed instance databases, but still not enough. A SQL VM with Azure Premium managed disks may have better IO performance. There are other limitations also related to the maximum storage for the instance and limitations on the number of files and databases.

    The performance of the backups is another topic that may be not good enough for a data warehouse. Backup and restore a database of 1 TB could take 50-55 minutes to finish. Automated backups can also have an impact on performance of the storage and the log rates.

    My suggestion is to go with a SQL VM for data warehouse workloads. If you still use SSRS or SSAS cubes then SQL VM is the only option.

    Hope it helps

    1 person found this answer helpful.

1 additional answer

Sort by: Most helpful
  1. Bas Pruijn 946 Reputation points
    2023-06-08T14:16:57.11+00:00

    You could investigate the combination of Hyperscale database with Data Factory SSIS Integration runtime.

    The big advantages I see with Hyperscale is the fast scaling of the database (DWH usually busy during nights; less during days) and independant scaling of read replicas. Futhermore, backup and restore are not pressing on your compute resources, but are handled independant.

    SSIS-IR can be expensive though, depending on your requirements.

    1 person found this answer helpful.