Orica’s S/4HANA Foundational Architecture Design on Azure
This blog is a customer success story detailing how Cognizant and Orica successfully deployed and went live with a global S/4HANA transformation project on Azure. It analyzes key decision points taken by Cognizant and Orica over the last two years, leading to their successful go-live in August 2018.
This blog was written by Sivakumar Varadananjayan. Siva is Global Head of Cognizant's SAP Cloud Practice and has been personally involved in Orica's 4S program from day one, first as presales head and now as Chief Architect for Orica's S/4HANA on Azure adoption.
Over the last two years, Cognizant has engaged as a trusted technology advisor and managed cloud platform provider to build highly available, scalable, disaster-proof IT platforms for SAP S/4HANA and other SAP applications on Microsoft Azure. Our customer Orica is the world's largest provider of commercial explosives and innovative blasting systems to the mining, quarrying, oil and gas and construction markets, a leading supplier of sodium cyanide for gold extraction, and a specialist provider of ground support services in mining and tunneling. As part of this program, Cognizant built Orica's new SAP S/4HANA platform on Microsoft Azure and provides a Managed Public Cloud Platform as a Service (PaaS) offering.
Cognizant started the actual cloud foundation work during December 2016. In this blog article, we will cover some of the best practices that Cognizant adopted and share key learnings which may be essential for any customer planning to deploy their SAP workloads on Azure.
The following topics will be covered:
- Target Infrastructure Architecture Design
- Choosing the right Azure Region
- Write Accelerator
- Accelerated Networking
- SAP Application Architecture Design
- Sizing Your SAP Landscape for the Dynamic Cloud
- Increasing/decreasing capacity
- HA / DR Design (SUSE HA Cluster)
- SUSE cluster
- Azure Site Recovery (ASR)
- Security on Cloud
- Network Security Groups
- Encryption – Disk, Storage account, HANA Data Volume, Backup
- Role-Based Access Control
- Locking resources to prevent deletion
- Operations & Management
- Creation of clone environments
- Backup & restore
Target Infrastructure Architecture Design
The design of a fail-proof infrastructure architecture involves visualizing the end-state with great detail. Capturing key business requirements and establishing a set of design principles will clarify objectives and help in proper prioritization while making design choices. Such design principles include but are not limited to choosing a preferred Azure Region for hosting the SAP Applications, as well as determining preferences of Operating System, database, end user access methodology, application integration strategy, high availability, disaster recovery strategy, definition of system criticality and business impacts of disruption, definition of environments, etc. During the Design phase, Cognizant involved Microsoft and SUSE along with other key program stakeholders to finalize the target architecture based on the customer's business & security requirements. As part of the infrastructure design, critical foundational aspects such as Azure Region, ExpressRoute connectivity with Orica's MPLS WAN, and integration of DNS and Active Directory domain controllers were finalized.
While discussing infrastructure preparation, various topics were worked through, including VNet design (subnet IP ranges), host naming conventions, storage requirements, and initial VM types based on compute requirements. For Orica's 4S implementation, Cognizant implemented a three-tier subnet architecture: Web Tier, Application Tier and Database Tier. This three-tier subnet design was applied to each of the Sandpit, Development, Project Test, Quality and Production environments, giving Orica the flexibility to deploy fine-grained NSGs at the subnet level as per security requirements. A clearly defined tier-based subnet architecture also avoids having to define complex NSGs for individual VM hosts.
The Web Tier subnet is intended to host the SAP Web Dispatcher VMs; the Application Tier hosts the Central Services Instance VMs, the Primary Application Server VMs and any additional application server VMs; and the Database Tier hosts the database VMs. This is supplemented by additional subnets for infrastructure and management components, such as jump servers, domain controllers, etc.
Choosing the Right Azure Region
Although Azure operates over several regions, it is essential to choose a primary region into which main workloads will be deployed. Choosing the right Azure region for hosting the SAP Application is a vital decision to be made. The following factors must be considered for choosing the Right Azure Region for Hosting: (1) Legal and regulatory requirements dictating physical residence, (2) Proximity to the company's WAN points of presence and end users to minimize latency, (3) Availability of VMs and other Azure Services, and (4) Cost. For more information on availability of VMs, refer to the section "Sizing Your SAP Landscape for the Dynamic Cloud" under SAP Application Architecture Design.
Accelerated Networking
Accelerated Networking enables single root I/O virtualization (SR-IOV) to a VM, greatly improving its networking performance. This high-performance path bypasses the host in the data path, reducing latency, jitter, and CPU utilization for the most demanding network workloads on supported VM types. Without Accelerated Networking, all network traffic in and out of the VM must traverse the host machine and the virtual switch. With Accelerated Networking, network traffic arrives at the VM's network interface (NIC) and is then forwarded directly to the guest VM; all network policies that the virtual switch applies are offloaded and applied in hardware. While essential for good and predictable HANA performance, not all VM types and operating system versions support Accelerated Networking, and this must be taken into account in the infrastructure design. Note also that Accelerated Networking minimizes latency for network communication within the same Azure Virtual Network (VNet); it has minimal impact on latency for communication across multiple VNets.
Azure provides several storage options, including Azure Disk Storage – Standard, Premium and Managed (attached to VMs) – and Azure Blob Storage. At the time of writing, Azure is the only public cloud service provider that offers a Single VM SLA of 99.9%, under the condition that the VM operates with Premium Disks attached. The cost-value proposition of choosing Premium Disks over Standard Disks in order to obtain the Single VM SLA is significantly beneficial, and hence Cognizant recommends provisioning all VMs with Premium Disks for application and database storage. Standard Disks are appropriate for storing database backups, and Azure Blob is used for VM snapshots and for transferring and storing backups as per the retention policy. For achieving an SLA above 99.9%, high availability techniques can be used; refer to the section 'High Availability and Disaster Recovery' in this article for more information.
Write Accelerator
Write Accelerator is a disk capability exclusively for M-Series Virtual Machines (VMs) on Azure running Premium Storage with Azure Managed Disks. As the name suggests, its purpose is to improve the I/O latency of writes against Azure Premium Storage. Write Accelerator is ideally suited for disks to which database redo logs are written, to meet the performance requirements of modern databases such as HANA. For production usage, it is essential that the final VM infrastructure be verified using the SAP HANA Hardware Configuration Check Tool (HWCCT). The results should be validated with relevant subject matter experts to ensure the VM is capable of running production workloads and is certified by SAP.
SAP Application Architecture Design
The SAP Application Architecture Design must be based on the guiding principles adopted for building the SAP applications, systems and components. To lay out the design well, you must first determine the list of SAP applications in scope for the implementation.
It is also essential to review the following SAP Notes that provide important information on deploying and operating SAP systems on public cloud infrastructure:
SAP Note 1380654 - SAP support in public cloud environments
SAP Note 1928533 - SAP Applications on Azure: Supported Products and Azure VM types
SAP Note 2316233 - SAP HANA on Microsoft Azure (Large Instances)
SAP Note 2235581 - SAP HANA: Supported Operating Systems
SAP Note 2369910 - SAP Software on Linux: General information
Choosing the OS/DB Mix of your SAP Landscape
Using this list, the SAP Product Availability Matrix can be leveraged to determine whether the preferred operating system and database are supported for each SAP application in scope. From an ease of maintenance and management perspective, consider using no more than two database variants across your SAP landscape. SAP now supports the SAP HANA database for most applications, and since SAP HANA supports multi-tenant databases, you may want to run most of your SAP applications on the HANA platform. For applications that do not support the HANA database, other databases may be required in the mix. SAP's S/4HANA application runs only on the HANA database. Orica chose to run HANA for every SAP application where supported, and SQL Server otherwise, as this was in line with the design rationale and simplified database maintenance, backups, HA/DR configuration, etc.
With SAP HANA 2.0 becoming mainstream (it is mandatory for S/4HANA 1709 and higher), fewer operating systems are supported than with SAP HANA 1.0. For example, SUSE Linux Enterprise Server for SAP Applications is now the only SUSE flavor supported, while standard SUSE Linux Enterprise was sufficient for HANA 1.0. This may have a licensing impact for customers, as Azure only provides BYO-subscription images for it; hence customers must supply their own operating system licenses.
Type of Architecture
SAP NetWeaver-based applications can be deployed either in a Central System Architecture (Primary Application Server and database on the same host) or in a Distributed System Architecture (Primary Application Server, additional application servers and database on separate hosts). Choose the type of architecture based on a thorough cost-value analysis, business criticality and application availability requirements. You also need to determine the number of environments each SAP application will require, such as Sandbox, Development, Quality, Production, Training, etc. This is predominantly driven by the change governance you plan to set up for the project. Systems that are business critical and have high availability requirements, such as the Production environment, must always be deployed in a Distributed System Architecture with a high availability cluster. For both critical and non-critical systems, parameters should be enabled to ensure the application and database start automatically in the event of an inadvertent server restart. Disaster recovery is recommended for most business-critical SAP applications, based on their Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
Cognizant designed a four-system landscape and a distributed SAP architecture for Orica. We separated the SAP application and database servers because, in the context of HANA MDC and running everything on HANA by default, a Central System Architecture no longer makes sense. We also named the HANA database SIDs without any correlation to the tenants each HANA database holds; this future-proofs the design and allows tenants to be moved to different HANA hosts if needed. For Orica, we also implemented custom scripting for automated start of SAP applications, which can be centrally enabled or disabled via a parameter file. High availability is designed for the production and quality environments. Disaster recovery is designed to meet Orica's Recovery Point Objective (RPO) and Recovery Time Objective (RTO), as defined by business requirements.
Sizing Your SAP Landscape for the Dynamic Cloud
Once you have determined the type of SAP architecture, you will have a fair idea of the number of individual virtual machines required to deploy each component. From an infrastructure perspective, the next step is to size the virtual machines. You can leverage standard SAP methodologies such as Quick Sizer, using concurrent-user or throughput-based sizing; the best practice is throughput-based sizing. This provides the SAPS and memory requirements for the application and database components (memory being the key metric for a HANA database). Tabulate this critical sizing information in a spreadsheet and refer to the standard SAP notes to determine the equivalent VM types in Azure. Microsoft obtains SAP certification for new VMs on a regular basis, so it is always advisable to check the latest SAP notes. For HANA databases, you will most often require E-Series (memory optimized) or M-Series (large memory optimized) VMs, depending on the size of the database. At the time of writing, the maximum memory supported by the E-Series and M-Series is 432 GB and 3.8 TB respectively. The E-Series offers a better cost-value proposition than the earlier GS-Series VMs offered by Azure. At this point, verify that the resulting VM types are available in your preferred Azure region. Depending on the geography, some VM types may not be available, so it is essential to choose a geography and Azure region where all required VM types exist. Remember, however, that the public cloud offers great scalability and elasticity: you do not need an accurate peak sizing to provision your environments. You always have room to scale your SAP systems up or down based on actual usage, by monitoring utilization metrics such as CPU, memory and disk utilization. Within the same VM series, this is done simply by powering off the VM, changing the VM size and powering it back on; the whole resizing procedure typically takes no more than a few minutes. Ensure that your system will fit into what Azure offers at any point in time. For instance, spinning up a 1 TB M-Series VM and then discovering that a 1.7 TB instance is needed causes little hassle, as the VM can easily be resized. However, if you are not sure whether your system will grow beyond 3.8 TB (the maximum M-Series capacity), you are at greater risk, as complications will start to creep in (Azure Large Instances may be needed for rescue in such cases). Reserved Instances are also available in Azure and can be leveraged for further cost optimization, provided accurate sizing of actual hardware requirements is performed before purchasing (to avoid over-committing).
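The sizing-to-VM-type mapping described above can be sketched in code. The catalogue below is a hypothetical, abbreviated example (VM names and memory figures are illustrative, and availability varies by region); always check the current SAP notes and regional availability for real sizing decisions.

```python
# Illustrative sketch: map a sized memory requirement to the smallest
# suitable Azure VM type available in the chosen region.
# The catalogue below is a hypothetical example, not an official list.

VM_CATALOGUE = [
    # (vm_type, memory_gb, series)
    ("E16s_v3", 128, "E"),
    ("E32s_v3", 256, "E"),
    ("E64s_v3", 432, "E"),
    ("M64s", 1024, "M"),
    ("M128s", 2048, "M"),
    ("M128ms", 3800, "M"),
]

def pick_vm(required_memory_gb, available_in_region):
    """Return the smallest catalogued VM that satisfies the memory
    requirement and is actually offered in the chosen region."""
    candidates = [
        (mem, vm) for vm, mem, _series in VM_CATALOGUE
        if mem >= required_memory_gb and vm in available_in_region
    ]
    if not candidates:
        # Beyond the largest VM, HANA Large Instances may be needed.
        raise ValueError("No suitable VM type in this region")
    return min(candidates)[1]

region_vms = {"E16s_v3", "E32s_v3", "E64s_v3", "M64s", "M128s", "M128ms"}
print(pick_vm(300, region_vms))   # smallest VM with >= 300 GB memory
```

A sketch like this also makes the region-availability check explicit: if a required VM type is missing from `available_in_region`, the sizing fails early rather than at deployment time.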
High Availability and Disaster Recovery
Making business-critical systems such as SAP S/4HANA highly available with better than 99.9% availability requires a well-defined high availability architecture design. VMs deployed in an availability set within an Azure region offer a 99.95% SLA, and Azure offers an SLA of 99.99% when the compute VMs are deployed across multiple Availability Zones within a region. To achieve the latter, check whether Availability Zones are offered in the region chosen for hosting the SAP applications. Note that Azure Availability Zones are still being rolled out by Microsoft and will eventually arrive in all regions over time. Components that are a Single Point of Failure (SPOF) in SAP must be deployed in a cluster, such as a SUSE cluster, and such a cluster must reside within an availability set to attain 99.95% availability. To achieve high availability at the Azure infrastructure level, all clustered VMs are added to an availability set and exposed via an Azure Internal Load Balancer (ILB). These components include the (A)SCS cluster, DB cluster and NFS. It is also recommended to provision at least two application servers within an availability set (the Primary Application Server and an Additional Application Server) to ensure application server redundancy. Cognizant, Microsoft and SUSE worked together to build a collaborative solution based on a multi-node iSCSI server configuration; the SAP applications at Orica were the first to be deployed with this configuration on the Azure platform.
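The availability gain from redundancy can be illustrated with a simple calculation. This is a simplified model assuming independent failures, not Azure's actual SLA formula, but it shows why clustered, redundant components reach far higher availability than a single VM.

```python
# Simplified availability model: a redundant pair is unavailable only
# when all members fail at once (assumes independent failures -- an
# idealization, not Azure's contractual SLA arithmetic).
def redundant_availability(single_vm_sla, n_vms):
    return 1 - (1 - single_vm_sla) ** n_vms

# Two VMs, each at the 99.9% Single VM SLA:
print(round(redundant_availability(0.999, 2) * 100, 4))  # 99.9999
```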
As discussed earlier, where SAP components are not protected by a high availability setup, it is recommended to provision those VMs with Premium Storage Disks attached, to take advantage of the Single VM SLA. All VMs at Orica use Premium Disks for their application and database volumes, because this is the only way they are covered by the SLA, and we also found performance to be better and more consistent.
Details of the SUSE cluster are described below.
SUSE Cluster Setup for HA:
(A)SCS layer high availability is achieved using a SUSE HA Extension cluster. DRBD is not used to replicate application files such as SAP kernel files. This design is based on the recommendation from Microsoft and is supported by SAP as well. The reason for not enabling DRBD replication is the potential performance issues that can arise with synchronous replication, and the fact that recovery cannot be guaranteed with such a configuration when ASR is enabled for disaster recovery replication at the application layer. NFS layer high availability is also achieved using a SUSE HA Extension cluster, with DRBD used for data replication. Microsoft also recommends using a single NFS cluster to serve multiple SAP systems, to reduce the complexity of the overall design.
HA testing needs to be performed thoroughly and must simulate many different failure situations beyond a simple clean shutdown of the VM – for example, using the halt command to simulate a VM power-off, or adding firewall rules in the NSG to simulate problems with the VM's network stack.
We are excited to announce that Orica is the first customer on the Multi-SID SUSE HA cluster configuration.
More details on the technical configuration of setting up HA for SAP, including the recommended Pacemaker-on-SLES setup with an SBD device, are described in Microsoft's documentation. Alternatively, if you do not want to invest in one additional virtual machine, you can use the Azure Fence Agent. The downside of the Azure Fence Agent is that a failover can take 10 to 15 minutes if a resource stop fails or the cluster nodes can no longer communicate with each other.
Another important aspect of ensuring application availability during a disaster is a well-architected DR solution, invoked through a well-orchestrated Disaster Recovery Plan.
Azure Site Recovery (ASR):
Azure Site Recovery assists in business continuity by keeping business apps and workloads running during outages. Site Recovery replicates workloads running on physical and virtual machines (VMs) from a primary site to a secondary location. At the time of failover, apps are started in the secondary location and accessed from there, after the relevant changes are made to the cluster configuration and DNS. After the primary location is running again, you can fail back to it. ASR was not tested for Orica at the time of the current go-live, as Microsoft's support for SLES 12.3 became generally available too close to cut-over. However, we are currently evaluating this feature and will use it for DR at the time of the next phase's go-live.
Security on Cloud
Most traditional security concepts – security at the physical, server, hypervisor, network, compute and storage layers – apply to the overall security of the cloud. These are provided inherently by the public cloud platform and are audited by third-party IT security certification providers. Security on the cloud protects the hosted applications by leveraging the features and customization available through the cloud provider, together with the security features provided within the hosted applications themselves.
Network Security Groups
Network Security Groups (NSGs) are rules applied at the networking layer that control traffic to and from VMs hosted in Azure. Separate NSGs can be associated with the Prod, Non-Prod, Infra, Management and DMZ environments. It is important to arrive at a strategy for defining NSG rules that is modular and easy to comprehend and implement, and to enforce strict procedures for controlling these rules. Otherwise, you may end up with unnecessary redundant rules that make it harder to troubleshoot network communication issues.
In the case of Orica, an initiative was implemented to optimize the number of NSG rules by combining multiple ports for the same source and destination ranges into a single rule. A change approval process was introduced once the NSGs were associated. All NSG rules are maintained in a custom-formatted template (CSVs), which a script uses for the actual configuration in Azure. Doing this manually across multiple VNets and multiple regions (e.g. primary, DR, etc.) would be impractical.
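The rule-consolidation step can be sketched as follows. The CSV column names and addresses below are hypothetical illustrations, not Orica's actual template; the point is how rules sharing a source/destination pair collapse into one NSG rule with multiple ports.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV template -- column names, subnets and ports are
# illustrative only.
RULES_CSV = """name,source,destination,port
app-to-db-1,10.0.2.0/24,10.0.3.0/24,30013
app-to-db-2,10.0.2.0/24,10.0.3.0/24,30015
web-to-app,10.0.1.0/24,10.0.2.0/24,3200
"""

def consolidate(csv_text):
    """Group rules sharing the same (source, destination) pair so that
    multiple ports collapse into a single NSG rule."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[(row["source"], row["destination"])].append(row["port"])
    return {pair: sorted(ports) for pair, ports in grouped.items()}

rules = consolidate(RULES_CSV)
print(len(rules))  # three input rows collapse into two consolidated rules
```

A consolidation pass like this, run before the rules are pushed to Azure, keeps the rule count low and the rule set reviewable in change approval.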
Encryption of Storage Account and Azure Disk
Azure Storage Service Encryption (SSE) is recommended to be enabled for all the Azure Storage Accounts. Through this, Azure Blobs will be encrypted in the Azure Storage. Any data that is written to the storage after enabling the SSE will be encrypted. SSE for Managed Disks is enabled by default.
Azure Disk Encryption leverages the industry standard BitLocker feature of Windows and the DM-Crypt feature of Linux to provide volume encryption for the OS and the data disks. The solution is integrated with Azure Key Vault to help you control and manage the disk-encryption keys and secrets in your key vault subscription. Encryption of the OS volume will help protect the boot volume's data at rest in your storage. Encryption of data volumes will help protect the data volumes in your storage. Azure Storage automatically encrypts your data before persisting it to Azure Storage, and decrypts the data before retrieval.
SAP Data at Rest Encryption
Data at rest is encrypted for SAP applications by encrypting the database. SAP HANA 2.0 and SQL Server natively support data-at-rest encryption, providing the additional security needed in case of data theft. In addition, the backups of both databases are encrypted and secured by a passphrase, ensuring the backups are readable and usable only by authorized users.
In the case of Orica, both Azure Storage Service Encryption and Azure Disk Encryption were enabled. In addition to this, SAP Data at Rest Encryption was enabled in SAP HANA 2.0 and TDE encryption was enabled in SQL Server database.
Role-Based Access Control (RBAC)
Azure Resource Manager provides a granular Role-Based Access Control (RBAC) model for assigning administrative privileges at the resource level (VMs, storage, etc.). Using an RBAC model (e.g. separate roles for the service development team and the app development team) helps with segregation of duties, granting users and groups only the access they need to perform their jobs on selected resources. This enforces the principle of least privilege.
Locking Resources to Prevent Deletion
An administrator may need to lock a subscription, resource group, or resource to prevent other users in the organization from accidentally deleting or modifying critical resources. The lock level can be set to CanNotDelete or ReadOnly (called Delete and Read-only in the portal, respectively). Unlike RBAC, resource locks apply to all users – including those with owner access – and prevent both intentional and accidental deletion. CanNotDelete means authorized users can still read and modify a resource but cannot delete it. ReadOnly means authorized users can read a resource but cannot delete or update it; applying this lock is similar to restricting all authorized users to the permissions of the Reader role. For Orica, we configured locks on critical pieces of Azure infrastructure to provide an additional layer of safety.
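The lock semantics described above can be captured in a few lines. This is a sketch of the CanNotDelete/ReadOnly behavior for illustration, not an Azure SDK call.

```python
# Sketch of Azure resource-lock semantics: which actions a lock level
# permits, regardless of the caller's RBAC role (even Owner).
def is_allowed(action, lock_level):
    """action: 'read', 'modify' or 'delete'.
    lock_level: 'CanNotDelete', 'ReadOnly' or None (no lock)."""
    if lock_level == "ReadOnly":
        return action == "read"           # effectively Reader-role access
    if lock_level == "CanNotDelete":
        return action in ("read", "modify")
    return True                           # no lock: RBAC alone decides
```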
Operations & Management
Cognizant provides a Managed Platform as a Service (mPaaS) for Orica through the Microsoft Azure Cloud. Cognizant leverages several advantages of operating SAP systems in the public cloud, including scheduled automated startup and shutdown, automated backup management, and automated monitoring and alerting, to optimize the overall cost of technical operations and management. Some of the recommendations are described below.
Azure Costing and Reporting
Azure Cost Management by Cloudyn, a Microsoft subsidiary, allows you to track cloud usage and expenditure for your Azure resources and other cloud providers, including AWS and Google. Monitoring your usage and spending is critically important for cloud infrastructures, because organizations pay for the resources they consume over time. When usage exceeds agreement thresholds, unexpected cost overruns can quickly occur.
Reports help you monitor spending to analyze and track cloud usage, costs, and trends. Using Over Time reports, you can detect anomalies that differ from normal trends.
Detailed, line-item-level data may also be available in the EA Portal (https://ea.azure.com); these reports are more flexible than the Cloudyn reports and can be more useful.
Backup & Restore
One of the primary requirements of system availability management, as part of technical operations, is to protect systems from accidental data loss due to factors such as infrastructure failure, data corruption, or even complete loss of the systems in the event of a disaster. While high availability and disaster recovery help mitigate infrastructure failures, a robust backup and restore strategy is essential for handling events such as data corruption or loss of data. Backups allow us to restore an application to a working state after system corruption and represent the "last line of defense" in a disaster recovery scenario. The main goal of the backup/restore procedure is to restore the system to a known working state.
Some of the key requirements for Backup and Restore Strategy include:
- Backups must be restorable
- Prefer native database backup and restore tools
- Backups must be secure and encrypted
- Clearly defined retention requirements
VM Snapshot Backups
Azure infrastructure offers native backup for VMs (inclusive of attached disks) using VM snapshots. VM snapshot backups are stored within Azure Vaults, which are part of the Azure Storage architecture and are geo-redundant by default. Note that Microsoft Azure does not support traditional data retention media such as tapes; data retention in the cloud is achieved using technologies such as Azure Vault and Azure Blob, which are part of the Azure Storage Account architecture. In general, all VMs provisioned in Microsoft Azure (including databases) should be included in the VM snapshot backup plan, although the frequency can vary based on the criticality of the environment and of the application. Encryption should be enabled at the Azure Storage Account level so that backups stored in the Azure Vault are also encrypted when accessed outside the Azure subscription.
While restorability of the file system and database software can be achieved using the VM snapshot process described above, a VM snapshot of a database server may not restore the database to a consistent state. Hence, dedicated database backups are essential to guarantee database restorability. It is advisable to include all databases in the landscape in full database backups, with schedules defined according to the business criticality and requirements of each application. The consistency of the database backup file should be checked after the backup is taken, to ensure the backup is restorable.
In addition to Full Database backups, it is recommended to perform transaction log backups at regular intervals. This frequency must be higher for a production environment to support point in time recovery requests and the frequency can be relatively lower for non-production environments.
Both Full Database Backups and Transaction Log Backups must be transferred to an offline device (such as Azure Blob) and retained as per data retention requirement. It is recommended to have all database backups to be encrypted using Native Database Backup Data Encryption methodology if the database supports it. SAP HANA 2.0 supports Native DB Backup Encryption.
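A retention policy like the one described above can be enforced with a simple pruning check. The retention periods, backup names and dates below are illustrative assumptions, not Orica's actual policy.

```python
from datetime import datetime, timedelta

# Illustrative retention periods -- actual values come from the data
# retention requirements agreed with the business.
RETENTION = {"full": timedelta(days=90), "log": timedelta(days=30)}

def expired_backups(backups, now):
    """Return names of backups past their retention period.
    'backups' is a list of (name, kind, taken_at) tuples, where kind
    is 'full' or 'log'."""
    return [name for name, kind, taken_at in backups
            if now - taken_at > RETENTION[kind]]

now = datetime(2018, 8, 1)
backups = [
    ("PRD_full_2018-04-01", "full", datetime(2018, 4, 1)),  # > 90 days old
    ("PRD_full_2018-07-01", "full", datetime(2018, 7, 1)),  # within retention
    ("PRD_log_2018-06-15",  "log",  datetime(2018, 6, 15)), # > 30 days old
]
print(expired_backups(backups, now))
```

Run regularly against the backup inventory in offline storage (e.g. Azure Blob), a check like this keeps retention compliant without manual review.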
Database Backup Monitoring and Restorability Tests
Backup monitoring is essential to ensure backups occur according to the defined frequency and schedule, and it can be automated through scripts. Restorability tests of backups help guarantee that an application can be restored in the event of a disaster, data loss or data corruption.
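The monitoring script mentioned above might look like the following sketch: flag any database whose most recent successful backup is older than its schedule allows. The database names and the 26-hour threshold are illustrative assumptions.

```python
from datetime import datetime, timedelta

def overdue(last_backup_times, max_age, now):
    """Return databases whose most recent backup is older than max_age.
    'last_backup_times' maps database name -> timestamp of the last
    successful backup."""
    return sorted(db for db, taken_at in last_backup_times.items()
                  if now - taken_at > max_age)

now = datetime(2018, 8, 1, 12, 0)
last = {
    "PRD": datetime(2018, 8, 1, 2, 0),   # last night -- on schedule
    "QAS": datetime(2018, 7, 29, 2, 0),  # three days ago -- overdue
}
# Daily full backups, with a few hours of slack before alerting:
print(overdue(last, timedelta(hours=26), now))  # ['QAS']
```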
Cognizant's SAP Cloud Practice, in collaboration with SAP, Microsoft and SUSE, built and applied some of the best practices for deploying an SAP landscape on Azure for Orica's 4S program. This article has presented some of the key topics relevant to architecting an SAP landscape on Azure. We hope you found it useful; feel free to add your comments.