Identifying source data
In addition to determining the analytical goals of the project, you must identify sources of data that can be used to meet these goals. Often you will already know which data sources need to be included in the analysis. For example, if the goal is to analyze sales trends over the past three years, you can use historic sales data from internal business applications or a data warehouse. In some cases, however, you may need to search for data to support the analysis you want to perform. For example, if the goal is to determine the best location to open a new store, you may need to search for useful demographic data that covers the locations under consideration.
It is common in big data projects to combine data from multiple sources and create a “mashup” that enables you to analyze many different aspects of the problem within a single solution. For example, you might combine internal historic sales data with geographic data obtained from an external source to plot sales volumes on a map. You might then overlay the map with demographic data to try to correlate sales volume with particular geo-demographic attributes.
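The mashup described above can be sketched in a few lines of Python. This is a minimal illustration only: the sample records, the use of a postal code as the join key, and the function name `build_mashup` are all assumptions made for the example, not part of any particular product or dataset.

```python
from collections import defaultdict

# Hypothetical internal sales records: (store_id, postal_code, amount)
SALES = [
    ("s1", "98052", 1200.0),
    ("s2", "98052", 800.0),
    ("s3", "10001", 500.0),
]

# Hypothetical external geographic data, keyed by postal code
GEO = {
    "98052": {"lat": 47.68, "lon": -122.12},
    "10001": {"lat": 40.75, "lon": -73.99},
}

# Hypothetical demographic data, keyed by postal code
DEMOGRAPHICS = {
    "98052": {"median_income": 95000},
    "10001": {"median_income": 72000},
}


def build_mashup(sales, geo, demographics):
    """Aggregate sales per postal code, then attach location and demographics."""
    totals = defaultdict(float)
    for _store_id, postal_code, amount in sales:
        totals[postal_code] += amount

    rows = []
    for postal_code, total in sorted(totals.items()):
        if postal_code not in geo:
            continue  # cannot be plotted on a map without coordinates
        row = {"postal_code": postal_code, "total_sales": total}
        row.update(geo[postal_code])
        row.update(demographics.get(postal_code, {}))
        rows.append(row)
    return rows


mashup = build_mashup(SALES, GEO, DEMOGRAPHICS)
```

Each resulting row carries a sales total, a map position, and a demographic attribute, which is the shape of data needed to plot sales on a map and look for geo-demographic correlations. In a real solution the join would typically run inside the cluster rather than in memory.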
Common types of data source used in a big data solution include:
- Internal business data from existing applications or BI solutions. Often this data is historic in nature or includes demographic profile information that the business gathered from its customers. For example, you might use historic sales records to correlate customer attributes with purchasing patterns, and then use this information to support targeted advertising or predictive modeling of future product plans.
- Log files. Applications or infrastructure services often generate log data that can be useful for analysis and decision making with regard to managing IT reliability and scalability. Additionally, in some cases, combining log data with business data can reveal useful insights into how IT services support the business. For example, you might use log files generated by Internet Information Services (IIS) to assess network bandwidth utilization, or to correlate web site traffic with sales transactions in an ecommerce application.
- Sensors. Increased automation in almost every aspect of life has led to a growth in the amount of data recorded by electronic sensors (often referred to as the “Internet of Things”). For example, RFID tags in smart cards are now routinely used to track passenger progress through mass transit infrastructure, sensors in plant machinery generate huge quantities of data in production lines, and smart metering provides detailed views of energy usage. This type of data is often well suited to highly dynamic analysis and real-time reporting.
- Social media. The massive popularity of social media services such as Facebook, Twitter, and others is a major factor in the growth of data volumes on the Internet. Many social media services provide application programming interfaces (APIs) that you can use to query the data shared by users of these services, and consume this data for analysis. For example, a business might use Twitter’s query API to find tweets that mention the name of the company or its products, and analyze the data to determine how customers feel about the company’s brand.
- Data feeds. Many web sites and services provide data as a feed that can be consumed by client applications and analytical solutions. Common feed formats include RSS, Atom, and industry-defined XML formats. The data sources themselves include blogs, news services, weather forecasts, and financial market data.
- Governments and special interest groups. Many government organizations and special interest groups publish data that can be used for analysis. For example, the UK government publishes over 9,000 downloadable datasets in a variety of formats, including statistics on population, crime, government spending, health, and more. Similarly, the US government provides census data and other statistics as downloadable datasets or in dBASE format on CD-ROM. Additionally, many international organizations provide data free of charge. For example, the United Nations makes statistical data available through its own website and through Azure Marketplace.
- Commercial data providers. There are many organizations that sell data commercially, including geographical data, historical weather data, economic indicators, and others. Azure Marketplace provides a central service through which you can locate and purchase subscriptions to many of these data sources.
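As an illustration of the log file scenario in the list above, the following sketch totals the bytes sent per hour from a web server log in W3C extended format, the format IIS uses by default. The sample log lines are invented, and the sketch assumes that the `date`, `time`, and `sc-bytes` fields are enabled in the server's logging configuration and that every record has a numeric `sc-bytes` value. The function name `bytes_sent_per_hour` is hypothetical.

```python
from collections import defaultdict

# Invented sample in W3C extended log format; real logs will have other fields.
SAMPLE_LOG = """\
#Software: Microsoft Internet Information Services 8.0
#Fields: date time cs-method cs-uri-stem sc-status sc-bytes
2014-03-01 10:15:02 GET /products/1 200 5120
2014-03-01 10:47:40 GET /checkout 200 2048
2014-03-01 11:05:13 GET /products/2 404 512
"""


def bytes_sent_per_hour(log_text):
    """Sum the sc-bytes field for each hour found in the log text."""
    fields = []
    totals = defaultdict(int)
    for line in log_text.splitlines():
        if line.startswith("#Fields:"):
            # The directive names the columns of the records that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and other directives
        record = dict(zip(fields, line.split()))
        hour = f"{record['date']} {record['time'][:2]}:00"
        totals[hour] += int(record["sc-bytes"])
    return dict(totals)


usage = bytes_sent_per_hour(SAMPLE_LOG)
```

The hourly totals give a rough view of bandwidth utilization. Keeping the `cs-uri-stem` and timestamp fields as well would allow the same records to be matched against sales transactions.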
Just because data is available doesn’t mean it is useful, or that the effort of using it will be viable. Think about the value the analysis can add to your business before you devote inordinate time and effort to collecting and analyzing data.
When planning data sources to use in your big data solution, consider the following factors:
- Availability. How easy is it to find and obtain the data? You may have a specific analytical goal in mind, but if the data required to support the analysis is difficult (or impossible) to find, you may waste valuable time trying to obtain it. When planning a big data project, it can be useful to define a schedule that allows sufficient time to research what data is available. If the data cannot be found by an agreed deadline, you may need to revise the analytical goals.
- Format. In what format is the data available, and how can it be consumed? Some data is available in standard formats and can be downloaded over a network or retrieved through an Internet API. In other cases the data may be available only as a real-time stream that you must capture and structure for analysis. Later in the process you will consider tools and techniques for consuming the data from its source and ingesting it into your cluster, but even during this early stage you should identify the format and connectivity options for the data sources you want to use.
- Relevance. Is the data relevant to the analytical goals? You may have identified a potential data source and already be planning how you will consume it and ingest it into the analytical process. However, you should first examine the data source carefully to ensure the data it contains is relevant to the analysis you intend to perform.
- Cost. Does the cost of obtaining the data outweigh the potential business benefit of using it? You may find a relevant dataset, only to discover that it is too expensive to justify. This can be particularly true if the analytical goal is to augment an enterprise BI solution with external data on an ongoing basis, and the external data is available only through a commercial data provider.