Manufacturing HPC storage
Storage access is an important part of planning for HPC workload performance. The following materials help to streamline your decision process and minimize any misunderstandings around a particular storage solution's capabilities (or lack of capabilities).
Design considerations
It's important to ensure that the required data gets to the HPC cluster machines at the right time. You also want to make sure results from those individual machines are quickly saved and available for further analysis.
Distribution of workload traffic
Account for the types of traffic your HPC environment generates and processes. This step is especially important if you plan to run multiple types of workloads and plan to use the storage for other purposes. Consider and record the following traffic types:
- Single stream versus multiple streams
- Ratio of read traffic to write traffic
- Average file sizes and counts
- Random versus sequential access patterns
Data locality
The next category accounts for the location of the data. Locality awareness helps you determine whether you can use copying, caching, or synchronization as your data movement strategy. The following are locality items to check beforehand:
- Source data on-premises, in Azure, or both?
- Results data on-premises, in Azure, or both?
- HPC workloads in Azure to be coordinated with source-data modification timelines?
- Sensitive/HIPAA data?
Performance requirements
Performance requirements for storage solutions are generally summarized as follows:
- Single-stream throughput (in Gb/ps)
- Multi-stream throughput (in Gb/ps)
- Expected maximum IOPS
- Average latency (ms)
Every consideration affects performance, so these numbers represent a guide that a particular solution should achieve. For example, you might have an HPC workload that does extensive file creation and deletion as part of the workflow. Those operations could affect the overall throughput.
Access methods
Account for the client access protocol required and be clear about what features of the protocol you need. There are different versions of NFS and SMB.
Here are some things to consider:
- NFS/SMB versions required
- Expected protocol features (ACLs, encryption)
- Parallel file system solution
Total capacity requirement
Storage capacity in Azure is the next consideration. It helps to inform the overall cost of the solution. If you plan to store a large amount of data for a long time, you might want to consider tiering as part of the storage solution. Tiering provides lower-cost storage options combined with higher-cost but higher-performance storage in a hot tier. So, evaluate the capacity requirements as follows:
- Total capacity required
- Total hot-tier capacity required
- Total warm-tier capacity required
- Total cold-tier capacity required
Authentication and authorization method
Regarding authentication and authorization requirements, like using an LDAP server or Active Directory environment, ensures you include the appropriate supporting systems for the architecture. If you need to support capabilities like UID/GID mapping to Active Directory users, confirm that the storage solution supports that capability.
Here are some things to consider:
- Local (UID/GID on file server only)
- Directory (LDAP, Active Directory)
- UID/GID mapping to Active Directory users?
Common Azure storage solutions comparison
Category | Azure Blob Storage | Azure Files | Azure Managed Lustre | Azure NetApp Files |
---|---|---|---|---|
Use cases | Azure Blob Storage is best suited for large-scale, read-heavy sequential access workloads where data is ingested once with few or no further modifications. Blob Storage offers the lowest total cost of ownership, if there's little or no maintenance. Some example scenarios are: Large scale analytical data, throughput sensitive high-performance computing, backup and archive, autonomous driving, media rendering, or genomic sequencing. |
Azure Files is a highly available service best suited for random access workloads. For NFS shares, Azure Files provides full POSIX file system support. You can easily use it from container platforms like Azure Container Instance (ACI) and Azure Kubernetes Service (AKS) with the built-in CSI driver, and VM-based platforms. Some example scenarios are: Shared files, databases, home directories, traditional applications, ERP, CMS, NAS migrations that don't require advanced management, and custom applications requiring scale-out file storage. |
Azure Managed Lustre is a fully managed parallel file system best suited to medium to large HPC workloads. Enables HPC applications in the cloud without breaking application compatibility by providing familiar Lustre parallel file system functionality, behaviors, and performance, securing long-term application investments. |
Fully managed file service in the cloud, powered by NetApp, with advanced management capabilities. NetApp Files is suited for workloads that require random access and provides broad protocol support and data protection capabilities. Some example scenarios are: On-premises enterprise NAS migration that requires rich management capabilities, latency sensitive workloads like SAP HANA, latency-sensitive or IOPS intensive high-performance compute, or workloads that require simultaneous multi-protocol access. |
Available protocols | NFS 3.0 REST Data Lake Storage Gen2 |
SMB NFS 4.1 (No interoperability between either protocol) |
Lustre | NFS 3.0 and 4.1 SMB |
Key features | Integrated with HPC cache for low-latency workloads. Integrated management, including lifecycle, immutable blobs, data failover, and metadata index. |
Zonally redundant for high availability. Consistent single-digit millisecond latency. Predictable performance and cost that scales with capacity. |
High storage capacity up to 2.5PB. Low (~2ms) latency. Spin up new clusters in minutes. Supports containerized workloads with AKS. |
Extremely low latency (as low as sub-ms). Rich NetApp ONTAP management capability such as SnapMirror in cloud. Consistent hybrid cloud experience. |
Performance (Per volume) | Up to 20,000 IOPS, up to 100 GiB/s throughput. | Up to 100,000 IOPS, up to 80 GiB/s throughput. | Up to 100,000 IOPS, up to 500 GiB/s throughput. | Up to 460,000 IOPS, up to 36 GiB/s throughput. |
Pricing | Azure Blob Storage pricing | Azure Files pricing | Azure Managed Lustre pricing | Azure NetApp Files pricing |
Roll-your-own parallel file system
As with NFS, you can create a multi-node BeeGFS or Lustre file system. Performance of such systems is largely dependent on the type of Virtual Machines you select. You can use images found in the Azure Marketplace for BeeGFS, or a Lustre implementation by DDN called Whamcloud. Using third-party images from vendors such as BeeGFS or DDN lets you purchase their support. Otherwise, you can use both BeeGFS and Lustre by way of their GPL licenses without other charges (beyond the machines and disks). These tools are easy to roll out using the Azure HPC scripts with either ephemeral local disks (for scratch) or Premium / Ultra SSD for persistent storage.
Cray ClusterStor
One of the biggest challenges with larger workloads is replicating the pure “bare-metal” performance of large compute clusters working alongside large Lustre environments (in terms of TB/s throughput, and possibly Petabytes of storage). You can now run these workloads with the Azure Cray ClusterStor solution. This approach is a pure bare-metal Lustre deployment placed in the relevant Azure data center. Parallel file systems such as BeeGFS and Lustre provide the highest performance due to their architecture. But that architecture comes with a high management price and so does the use of these technologies.
Next steps
The following articles provide guidance on each step in the cloud adoption journey for manufacturing HPC environments.
- Manufacturing HPC Azure billing and Active Directory tenants
- Azure identity and access management for HPC in manufacturing
- Management for HPC in the manufacturing industry
- Manufacturing HPC network topology and connectivity
- Platform automation and DevOps for Azure HPC in the manufacturing industry
- Manufacturing HPC resource organization
- Azure governance for manufacturing HPC
- Security for HPC in manufacturing industries
- Azure high-performance computing (HPC) landing zone accelerator