Performance and reliability
From: Developing big data solutions on Microsoft Azure HDInsight
Big data solutions typically work against vast quantities of source data, and are usually automated to some extent. The results they produce may be used to make important business decisions, so it is vital that the processes you use to collect and load data both provide an appropriate level of performance and operate in a consistent and reliable way. This topic discusses the requirements and considerations for performance and scalability, and for reliability.
Performance and scalability
Big data solutions typically work against vast quantities of source data. In some cases this data may be progressively added to the cluster’s data store by appending it to existing files, or by adding new files, as the data is collected. This is common when collecting streaming data such as clickstreams or sensor data that will be processed later.
However, in many cases you may need to upload large volumes of data to a cluster before starting a processing job. Uploading multiple terabytes, or even petabytes, of data can take many hours. During this time the data becomes progressively less fresh, and the results may be obtained too late to be useful.
Therefore, it’s vital to consider how you can optimize upload processes in order to mitigate this latency. Common approaches are to use multiple parallel upload streams, compress the data, use the most efficient data transfer protocols, and ensure that data uploads can be resumed from the point where they failed should network connectivity be interrupted.
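For example, a minimal sketch of the compression step, using only the Python standard library, might look like the following. The file names are hypothetical, and gzip is chosen simply because it is one of the compression formats that Hadoop-based clusters such as HDInsight can typically read directly.

```python
import gzip
import shutil

def compress_for_upload(source_path, target_path):
    """Gzip a file so that less data travels over the network.

    If the cluster can read the compressed format directly, the file can be
    left compressed after upload; otherwise decompress it in the datacenter.
    """
    with open(source_path, "rb") as source, gzip.open(target_path, "wb") as target:
        shutil.copyfileobj(source, target)

# Hypothetical file names, for illustration only.
compress_for_upload("clickstream-2016-04-01.log", "clickstream-2016-04-01.log.gz")
```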
Considerations for performance and scalability
Consider the following performance and scalability factors when designing your data ingestion processes:
- Consider if you can avoid the need to upload large volumes of data as a discrete operation before you can begin processing it. For example, you might be able to append data to existing files in the cluster, or upload it in small batches on a schedule.
- If possible, choose or create a utility that can upload data in parallel using multiple threads to reduce upload time, and that can resume uploads interrupted by temporary loss of network connectivity. Some utilities can split the data into small blocks and upload multiple blocks or small files in parallel, and then combine them into larger files after they have been uploaded (a sketch of this block-based approach appears after this list).
- Bottlenecks when loading data are often caused by lack of network bandwidth. Adding more threads may not improve throughput, and can cause additional latency due to the opening and closing of the connection for each item. In many cases, reusing the connection (which avoids the TCP ramp-up) is more important.
- You can often reduce upload time considerably by compressing the data. If the data at the destination should be uncompressed, consider compressing it before uploading it and then decompressing it within the datacenter. Alternatively, you can use one of the HDInsight-compatible compression codecs so that the data in compressed form can be read directly by HDInsight. This can also improve the efficiency and reduce the running time of jobs that use large volumes of data. Compression may be done as a discrete operation before you upload the data, or within a custom utility as part of the upload process. For more details see Pre-processing and serializing the data.
- Consider if you can reduce the volume of data to upload by pre-processing it. For example, you may be able to remove null values or empty rows, consolidate some parts of the data, or strip out unnecessary columns and values. This should be done in staging, and you should ensure that you keep a copy of the original data in case the information it contains is required elsewhere or later. What may seem unnecessary today may turn out to be useful tomorrow.
- If you are uploading data continuously or on a schedule by appending it to existing files, consider how this may impact the query and transformation processes you execute on the data. The source data folder contents for a query in HDInsight should not be modified while a query process is running against this data. This may mean saving new data to a separate location and combining it later, or scheduling processing so that it executes only when new data is not being appended to the files.
- Choose efficient transfer protocols for uploading data, and ensure that the process can resume an interrupted upload from the point where it failed. For example, some tools such as Aspera, Signiant, and File Catalyst use UDP for the data transfer, with TCP working in parallel to validate the uploaded data packages by confirming that each one is complete and has not been corrupted in transit.
- If one instance of the uploader tool does not meet the performance criteria, consider using multiple instances to scale out and increase the upload velocity if the tool can support this. Tools such as Flume, Storm, Kafka, and Samza can scale to multiple instances. SSIS can also be scaled out, as described in the presentation Scaling Out SSIS with Parallelism (note that you will require additional licenses for this). Each instance of the uploader you choose might create a separate file or set of files that can be processed as a batch, or could be combined into fewer files or a single file by a process running on the cluster servers.
- Measure the performance of your upload processes to confirm that the steps you take to maximize performance are appropriate for different types and varying volumes of data. What works well for one type of upload process may not provide optimum performance for other upload processes with different types of data; this is particularly the case when using serialization or compression. Balance the effects of the techniques you use to maximize upload performance against their impact on subsequent query and transformation jobs within the cluster.
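As an illustration of the block-based parallel upload approach referred to earlier in this list, the following sketch uses the azure-storage-blob Python client library to stage blocks from several threads and then commit them as a single blob. It is a sketch only: the container, blob, and file names are placeholders, and the block size and thread count are arbitrary values that you would tune for your own network and data.

```python
import os
import uuid
from concurrent.futures import ThreadPoolExecutor
from azure.storage.blob import BlobBlock, BlobServiceClient

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB per block; tune for the available bandwidth


def upload_in_parallel(local_path, connection_string, container, blob_name, max_workers=8):
    """Split a file into blocks, stage the blocks in parallel, then commit them.

    Committing the block list assembles the staged blocks into a single blob
    in the original order, mirroring the 'upload blocks in parallel and
    combine them afterwards' technique described in this topic.
    """
    service = BlobServiceClient.from_connection_string(connection_string)
    blob_client = service.get_blob_client(container=container, blob=blob_name)

    # Read the file into (block_id, chunk) pairs, preserving the original order.
    # For very large files you would stream blocks rather than hold them all in memory.
    blocks = []
    with open(local_path, "rb") as source:
        while True:
            chunk = source.read(BLOCK_SIZE)
            if not chunk:
                break
            blocks.append((uuid.uuid4().hex, chunk))

    # Stage the blocks concurrently; the client library applies its own retry
    # policy to each block if a transient failure occurs.
    def stage(block):
        block_id, chunk = block
        blob_client.stage_block(block_id=block_id, data=chunk)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        list(executor.map(stage, blocks))

    # Commit the block list in order to create the final blob.
    blob_client.commit_block_list([BlobBlock(block_id=block_id) for block_id, _ in blocks])


# Hypothetical values, for illustration only.
upload_in_parallel(
    "clickstream-2016-04-01.log.gz",
    os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    "source-data",
    "clickstream/2016-04-01.log.gz",
)
```

Note that the library's upload_blob method can perform a similar parallel transfer through its max_concurrency parameter; the manual version above is shown only to make the block splitting and recombination explicit.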
Reliability
Data uploads must be reliable to ensure that the data is accurately represented in the cluster. For example, you might need to validate the uploaded data before processing it. Transient failures or errors that might occur during the upload process must be prevented from corrupting the data.
However, keep in mind that validation extends beyond simply comparing the uploaded data with the original files. For example, you may extend data validation to ensure that the original source data does not contain values that are obviously inaccurate or invalid, and that there are no temporal or logical gaps where data that should be present is missing.
To ensure reliability, and to be able to track and resolve faults, you will also need to monitor the process. Using logs to record upload successes and failures, and capturing any available error messages, provides a way to confirm that the process is working as expected and to locate issues that may affect reliability.
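As a minimal sketch of this kind of logging, assuming a hypothetical upload_file function supplied by your transfer tool or SDK, the standard Python logging module can record each attempt together with any error details:

```python
import logging

logging.basicConfig(
    filename="upload.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)


def logged_upload(upload_file, local_path, destination):
    """Run an upload and record its outcome, including any error details."""
    try:
        upload_file(local_path, destination)
        logging.info("Upload succeeded: %s -> %s", local_path, destination)
    except Exception:
        # logging.exception records the full stack trace for later diagnosis.
        logging.exception("Upload failed: %s -> %s", local_path, destination)
        raise
```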
Considerations for reliability
Consider the following reliability factors when designing your data ingestion processes:
- Choose a technology or create an upload tool that can handle transient connectivity and transmission failures, and can properly resume the process when the problem clears. Many of the APIs exposed by Azure and HDInsight, and SDKs such as the Azure Storage client libraries, include transient fault handling. If you are building custom tools that do not use these libraries or APIs, you can add this capability using a framework such as the Transient Fault Handling Application Block (a simple retry sketch appears after this list).
- Monitor upload processes so that failures are detected early and can be fixed before they have an impact on the reliability of the solution and the accuracy or timeliness of the results. Also ensure you log all upload operations, including both successes and failures, and any error information that is available. This is invaluable when trying to trace problems. Some tools, such as Flume and CloudBerry, can generate log files. AzCopy provides a command line option to log the upload or download status. You can also enable the built-in monitoring and logging for many Azure features such as storage, and use the APIs they expose to generate logs. If you are building a custom data upload utility, you should ensure it can be configured to log all operations.
- Implement lineage tracking where possible by recording each stage involved in a process so that the root cause of failures can be identified by tracing the issue back to its original source.
- Consider validating the data after it has been uploaded to ensure consistency, integrity, and accuracy of the results and to detect any loss or corruption that may have occurred during the transmission to cluster storage. You might also consider validating the data before you upload it, although a large volume of data arriving at high velocity may make this impossible. Common types of validation include counting the number of rows or records, checking for values that exceed specific minimum or maximum values, and comparing the overall totals for numeric fields. You may also apply more in-depth approaches such as using a data dictionary to ensure relevant values meet business rules and constraints, or cross-referencing fields to ensure that matching values are present in the corresponding reference tables (a sketch of some of these checks follows this list).
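Where you cannot use the Azure client libraries or the Transient Fault Handling Application Block, a custom uploader can apply the same idea with a simple retry wrapper. The sketch below assumes a hypothetical upload_once callable and treats only connection and timeout errors as transient; the exception types, attempt count, and delays would depend on the transfer library you actually use.

```python
import random
import time


def upload_with_retries(upload_once, max_attempts=5, base_delay=2.0):
    """Call an upload function, retrying transient failures with back-off.

    upload_once is any callable that raises an exception on failure; which
    exception types count as transient depends on the transfer library used.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return upload_once()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise
            # Exponential back-off with jitter to avoid synchronized retries.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            time.sleep(delay)
```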
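The simpler of the validation checks described above, such as row counts, range checks, and column totals, could be expressed along the following lines for a delimited text file. This is a sketch only: the column position, value range, and tolerance are hypothetical, and the comparison assumes you have downloaded a copy of the uploaded data (or can run an equivalent summary job on the cluster).

```python
import csv


def summarize(path, amount_column=2):
    """Return row count, total, minimum, and maximum for one numeric column.

    Assumes a delimited text file with no header row.
    """
    count, total = 0, 0.0
    minimum, maximum = float("inf"), float("-inf")
    with open(path, newline="") as source:
        for row in csv.reader(source):
            value = float(row[amount_column])
            count += 1
            total += value
            minimum = min(minimum, value)
            maximum = max(maximum, value)
    return count, total, minimum, maximum


def validate_upload(original_path, uploaded_copy_path, expected_min=0.0, expected_max=1000000.0):
    """Compare the original and uploaded data, and range-check the values."""
    original = summarize(original_path)
    uploaded = summarize(uploaded_copy_path)
    assert original[0] == uploaded[0], "Row counts differ"
    assert abs(original[1] - uploaded[1]) < 0.01, "Column totals differ"
    assert expected_min <= uploaded[2] and uploaded[3] <= expected_max, "Values out of range"
```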