Workflow dependencies and constraints
From: Developing big data solutions on Microsoft Azure HDInsight
A big data batch processing solution typically consists of multiple steps that take some source data, apply transformations to shape and filter it, and consume the results in an application or analytical system. As with all workflows, this sequence of tasks usually includes some dependencies and constraints that you must take into account when planning the solution. Typical dependencies and constraints include:
- Latency tolerance in consuming systems. How up-to-date does the data need to be in reports, data models, and applications that consume the data processing results?
- Volatility of source data. How frequently does the source data get updated or added to?
- Data source dependencies. Are there data processing tasks for which data from one source cannot be processed until data from another source is available?
- Duration of processing tasks. How long does it typically take to complete each task in the workflow?
- Resource contention for existing workloads. To what degree can data processing operations degrade the performance and scalability of systems that are in use for ongoing business processes?
- Cost. What is the budget for the employee time and infrastructure resources used to process the data?
An example scenario
As an example, consider a scenario in which business analysts want to use an Excel report in an Office 365 Power BI site to visualize web server activity for an online retail site. The data in Excel is in a PivotTable, which is based on a connection to Azure SQL Database. The web server log data must be processed using a Pig job in HDInsight, and then loaded into SQL Database using Sqoop. The business analysts want to be able to view daily page activity for each day the site is operational, up to and including the previous day.
To plan a workflow for this requirement, you might consider the following questions:
- How up-to-date does the data need to be in reports, data models, and applications that consume the data processing results?
  - The requirement is that the Excel workbook includes all data up to and including the previous day’s web server logs, so a solution is required that refreshes the workbook as soon as possible after the last log entry of the day has been processed. In a 24x7 system this means that the data must be processed daily, just after midnight.
- How frequently does the source data get updated or added to?
  - This depends on how active the website is. Many large online retailers handle thousands of requests per second, so the log files may grow extremely quickly.
- Are there data processing tasks for which data from one source cannot be processed until data from another source is available?
  - If analysis in Excel is limited to just the website activity, there are no dependencies between data sources. However, if the web server log data must be combined with sales data captured by the organization’s e-commerce application (perhaps by mapping product IDs in web request query strings to orders placed), there may be a requirement to capture and stage data from both sources before processing it.
- How long does it typically take to complete each task in the workflow?
  - You will need to test samples of data to determine this. Given the high volatility of the source data, and the requirement to include log entries right up to midnight, you might find that it takes a significant amount of time to upload a single daily log file and process it with HDInsight before the Excel workbook can be refreshed with the latest data. You might therefore decide that a better approach is to use hourly log files and perform multiple uploads during the day, or to capture the log data in real time using a tool such as Flume and write it directly to Azure storage. You could also process the data periodically during the day to reduce the volume of data that must be processed at midnight, enabling the Excel workbook to be refreshed sooner.
- To what degree can data processing operations degrade the performance and scalability of systems that are in use for ongoing business processes?
  - There may be some impact on the web servers as the log files are read, and you should test the resource utilization overhead this causes.
- What is the budget for the employee time and infrastructure resources used to process the data?
  - The process can be fully automated, which minimizes human resource costs. The main running cost is the HDInsight cluster, and you can mitigate this by provisioning the cluster only when it is needed to perform the data processing jobs. For example, you could design a workflow in which log files are uploaded to Azure storage on an hourly basis (see the upload sketch that follows this list), but the HDInsight cluster is provisioned only at midnight when the last log file has been uploaded, and then deleted when the data has been processed. If processing an entire day’s logs in a single run means the Excel workbook cannot be refreshed quickly enough, you could automate provisioning of the cluster and processing of the data twice per day.
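As an illustration of the hourly upload step discussed above, the following Python sketch uploads the previous hour’s log file to Azure Blob storage, which also serves as the default storage for the HDInsight cluster. It assumes the azure-storage-blob package, an IIS-style hourly log file naming convention, a local log folder, a container named weblogs, and a connection string supplied in an environment variable; all of these are illustrative choices rather than details prescribed by the scenario.

```python
# Illustrative sketch: upload the previous hour's web server log file to Azure
# Blob storage. The folder, container, file naming pattern, and environment
# variable are assumptions made for this example.
import os
from datetime import datetime, timedelta
from azure.storage.blob import BlobServiceClient

LOG_DIR = r"C:\inetpub\logs\hourly"   # assumed location of the hourly log files
CONTAINER = "weblogs"                 # assumed container that holds the raw logs

def upload_previous_hour_log():
    # Identify the hourly log file that the web server has just finished writing.
    previous_hour = (datetime.utcnow().replace(minute=0, second=0, microsecond=0)
                     - timedelta(hours=1))
    file_name = previous_hour.strftime("u_ex%y%m%d%H.log")   # assumed IIS-style naming
    local_path = os.path.join(LOG_DIR, file_name)

    # Store the blob under a date-based path so a Pig job can read a whole day
    # (or the day so far) with a single wildcard location.
    blob_name = previous_hour.strftime("logs/%Y/%m/%d/") + file_name

    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"])
    blob_client = service.get_blob_client(container=CONTAINER, blob=blob_name)
    with open(local_path, "rb") as data:
        blob_client.upload_blob(data, overwrite=True)

if __name__ == "__main__":
    upload_previous_hour_log()
```

A scheduler such as Windows Task Scheduler or cron can invoke a script like this shortly after each hour; in the example workflow below, each upload is assumed to take between three and five minutes.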
In the previous example, based on measurements you make by experimenting with each stage of the process, you might design a workflow in which:
- The web servers are configured to create a new log file each hour.
- On an hourly schedule the log files for the previous hour are uploaded to Azure storage. For the purposes of this example, uploading an hourly log file takes between three and five minutes.
- At noon each day an HDInsight cluster is provisioned and the log files for the day so far are processed (the shape of this run is sketched in code after the list). The results are then transferred to Azure SQL Database, and the cluster and the log files that have been processed are deleted. For the purposes of the example, this takes between five and ten minutes.
- The remaining logs for the day are uploaded on an hourly schedule.
- At midnight an HDInsight cluster is provisioned and the log files for the day so far are processed. The results are then transferred to SQL Database and the cluster and the log files that have been processed are deleted. For the purposes of the example, this takes between five and ten minutes.
- Fifteen minutes later the data model in the Excel workbook is refreshed to include the new data that was added to the SQL Database tables by the two processing runs during the day.
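To make the ordering of the noon and midnight processing runs concrete, the following Python outline sketches a single run. Every helper function is a hypothetical placeholder rather than a real API (in practice you might script these steps with PowerShell, the Azure CLI, or the Azure management SDKs), and the cluster name, script path, and output locations are likewise assumptions; the point of the sketch is the sequence of steps and the short, on-demand lifetime of the cluster.

```python
# Outline of one processing run, executed at noon and again just after midnight.
# All helpers are hypothetical placeholders; only the ordering of the steps and
# the on-demand cluster lifetime are taken from the example workflow.

def provision_hdinsight_cluster(name: str) -> None:
    """Create a cluster attached to the storage account that holds the log blobs."""
    raise NotImplementedError("for example, via the Azure CLI or a management SDK")

def run_pig_job(cluster: str, script_path: str) -> None:
    """Submit the Pig script that shapes and filters the uploaded log files."""
    raise NotImplementedError

def run_sqoop_export(cluster: str, results_path: str, table: str) -> None:
    """Use Sqoop to transfer the Pig output into the Azure SQL Database table."""
    raise NotImplementedError

def delete_cluster_and_processed_logs(cluster: str) -> None:
    """Remove the cluster and any log blobs that have already been processed."""
    raise NotImplementedError

def processing_run() -> None:
    cluster = "weblog-batch"                     # assumed cluster name
    provision_hdinsight_cluster(cluster)
    try:
        # Paths are illustrative; wasb:/// refers to the cluster's default blob container.
        run_pig_job(cluster, "wasb:///scripts/summarize-page-activity.pig")
        run_sqoop_export(cluster, "wasb:///output/page-activity/", "dbo.PageActivity")
    finally:
        # Remove the cluster (and any already-processed blobs) even if a job step
        # fails, so the cluster never runs longer than needed.
        delete_cluster_and_processed_logs(cluster)
```

The same scheduler that triggers the hourly uploads can invoke this run at noon and again just after midnight, with the Excel data model refresh following fifteen minutes after the midnight run.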
The specific dependencies and constraints in each big data processing scenario can vary significantly. However, spending the time upfront to consider how you will accommodate them will help you plan and implement an effective end-to-end solution.