
Pre-processing and serializing the data

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

Big Data solutions such as HDInsight are designed to meet almost any type of data processing requirement, with almost any type of source data. However, there may be times when you need to perform some transformations or other operations on the source data, before or as you load it into HDInsight. These operations may include capturing and staging streaming data, staging data from multiple sources that arrives at different velocities, and compressing or serializing the data. This topic discusses the considerations for pre-processing data and techniques for serialization and compression.

Pre-processing data

You may want to perform some pre-processing on the source data before you load it into the cluster. For example, you may decide to pre-process the data in order to simplify queries or transformations, to improve performance, or to ensure accuracy of the results. Pre-processing might also be required to cleanse and validate the data before uploading it, to serialize or compress the data, or to improve upload efficiency by removing irrelevant or unnecessary rows or values.

Considerations for pre-processing data

Consider the following pre-processing factors when designing your data ingestion processes:

  • Before you implement a mechanism to pre-process the source data, consider whether this processing could be better handled within your cluster as part of a query, transformation, or workflow. Many of the data preparation tasks may not be practical, or even possible, when you have very large volumes of data. They are more likely to be possible when you stream data from your data sources, or extract it in small blocks on a regular basis. Where you have large volumes of data to process, you will probably perform these pre-processing tasks within your big data solution as the initial steps in a series of transformations and queries.
  • You may need to handle data that arrives as a stream. You may choose to convert and buffer the incoming data so that it can be processed in batches, or consider a real-time stream processing technology such as Storm (see the section “Overview of Storm” in the topic Data processing tools and techniques) or StreamInsight.
  • You may need to format individual parts of the data by, for example, combining fields in an address, removing duplicates, converting numeric values to their text representation, or changing date strings to standard numerical date values (see the sketch after this list).
  • You may want to perform some automated data validation and cleansing by using a technology such as SQL Server Data Quality Services before submitting the data to cluster storage. For example, you might need to convert different versions of the same value into a single leading value (such as changing “NY” and “Big Apple” into “New York”).
  • If reference data you need to combine with the source data is not already available as an appropriately formatted file, you can prepare it for upload and processing using a tool such as Excel to extract a relatively small volume of tabular data from a data source, reformat it as required, and save it as a delimited text file. Excel supports a range of data sources, including relational databases, XML documents, OData feeds, and the Azure Data Market. You can also use Excel to import a table of data from any website, including an RSS feed. In addition to the standard Excel data import capabilities, you can use add-ins such as Power Query to import and transform data from a wide range of sources.
  • Be careful when removing information from the data; if possible keep a copy of the original files. You may subsequently find the fields you removed are useful as you refine queries, or if you use the data for a different analytical task.
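As an illustration of the kind of formatting and cleansing steps described in this list, the following is a minimal sketch in Java that normalizes a date column, combines two address fields, and removes duplicate rows before upload. The tab-delimited layout, column positions, and file paths are hypothetical, not part of the guide.

```java
import java.io.IOException;
import java.nio.file.*;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.*;

// Minimal pre-processing sketch: standardizes a date column, combines two
// address columns, and removes duplicate rows before upload.
// Assumes a hypothetical tab-delimited input of: id, date (MM/dd/yyyy), street, city.
public class PreprocessSource {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);
        DateTimeFormatter source = DateTimeFormatter.ofPattern("MM/dd/yyyy");
        DateTimeFormatter target = DateTimeFormatter.ISO_LOCAL_DATE;

        Set<String> seen = new LinkedHashSet<>();                     // drops duplicate rows
        for (String line : Files.readAllLines(input)) {
            String[] f = line.split("\t");
            String date = LocalDate.parse(f[1], source).format(target); // standardize the date
            String address = f[2] + ", " + f[3];                        // combine address fields
            seen.add(String.join("\t", f[0], date, address));
        }
        Files.write(output, seen);
    }
}
```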

Serialization and compression

In many cases you can reduce data upload time for large source files by serializing and/or compressing the source data before, or as, you upload it to your cluster storage. Serialization is useful when the data can be assembled as a series of objects, or when it is in a semi-structured format. Although many different serialization formats are available, the Avro format is ideal because it works seamlessly with Hadoop and stores both the schema and the data. Avro provides rich data structures, a compact and fast binary data format that suits data transmission using remote procedure calls (RPC), a container file for storing persistent data, and easy integration with dynamic programming languages.

Other commonly used serialization formats are:

  • ProtocolBuffers. This is an open source project designed to provide a platform-neutral and language-neutral inter-process communication (IPC) and serialization framework. See ProtocolBuffers on the Hadoop Wiki for more information.
  • Optimized Row Columnar (ORC). This format provides a highly efficient way to store Hive data and was designed to overcome limitations of the other Hive file formats. Using the ORC format can improve performance when Hive is reading, writing, and processing data. See ORC File Format for more information.

Compression can improve the performance of data processing on the cluster by reducing I/O and network usage for each node in the cluster as it loads the data from storage into memory. However, compression does increase the processing overhead for each node, and so it cannot be guaranteed to reduce execution time. Compression is typically carried out using one of the standard algorithms for which a compression codec is installed by default in Hadoop.

You can combine serialization and compression to achieve optimum performance when you use Avro because, in addition to serializing the data, you can specify a codec that will compress it.
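As a brief illustration of combining serialization and compression, the following minimal sketch uses the Apache Avro Java API (not the Microsoft .NET Library for Avro discussed below) to write records to an Avro container file with the deflate codec applied. The schema, record values, and output file name are illustrative assumptions.

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Serializes records to an Avro container file and compresses the data
// blocks with the deflate codec in a single step.
public class AvroSerializeExample {
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Reading\",\"fields\":["
      + "{\"name\":\"sensorId\",\"type\":\"string\"},"
      + "{\"name\":\"value\",\"type\":\"double\"}]}";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord record = new GenericData.Record(schema);
        record.put("sensorId", "sensor-01");
        record.put("value", 42.5);

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.deflateCodec(6));  // compress while serializing
            writer.create(schema, new File("readings.avro"));
            writer.append(record);
        }
    }
}
```

Because the codec is set before the container file is created, the data blocks are compressed as they are written, so no separate compression pass is required.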

Tools for Avro serialization and compression

An SDK is available from NuGet that contains classes to help you work with Avro from programs and tools you create using .NET languages. For more information see Serialize data with the Avro Library on the Azure website and Apache Avro Documentation on the Apache website. A simple example of using the Microsoft library for Avro is included in this guide—see Serializing data with the Microsoft .NET Library for Avro.

If you are not using Avro or another utility that supports compression, you can usually compress the source data with the tools provided by the codec supplier. For example, the downloadable libraries for both GZip and BZip2 include tools that can help you apply compression. For more details see the distribution sources for GZip and BZip2 on SourceForge.

You can also use the classes in the .NET Framework to perform GZip and DEFLATE compression on your source files, perhaps by writing command line utilities that are executed as part of an automated upload and processing sequence. For more details see the GZipStream Class and DeflateStream Class reference sections on MSDN.
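If you are scripting this step outside the .NET Framework, the following is a minimal Java sketch that applies GZip compression to a single file using the standard java.util.zip classes as part of an automated upload sequence; the file name handling is illustrative.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.zip.GZIPOutputStream;

// Compresses a single source file with GZip so it can be uploaded in
// compressed form as part of an automated pipeline.
public class GZipFile {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get(args[0]);                  // e.g. data.csv
        Path target = Paths.get(args[0] + ".gz");          // keep the standard .gz extension

        try (GZIPOutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            Files.copy(source, out);                       // stream the file through the codec
        }
    }
}
```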

Another alternative is to create a query job that is configured to write output in compressed form using one of the built-in codecs, and then execute the job against existing uncompressed data in storage so that it selects all or some part of the source data and writes it back to storage in compressed form. For an example of using Hive to do this see the Microsoft White Paper Compression in Hadoop.

Compression libraries available in HDInsight

The following table lists the codecs provided with HDInsight at the time this guide was written, together with the standard file extension for files compressed with each codec and an indication of whether the codec supports splittable compression and decompression.

Format     Codec                                           Extension   Splittable
DEFLATE    org.apache.hadoop.io.compress.DefaultCodec      .deflate    No
GZip       org.apache.hadoop.io.compress.GzipCodec         .gz         No
BZip2      org.apache.hadoop.io.compress.BZip2Codec        .bz2        Yes
           (this codec is not enabled by default in configuration)
Snappy     org.apache.hadoop.io.compress.SnappyCodec       .snappy     Yes

A codec that supports splittable compression and decompression allows HDInsight to decompress the data in parallel across multiple mapper and node instances, which typically provides better performance. However, splittable codecs are less efficient at runtime, so there is a trade-off between the two types.

There is also a difference in the size reduction (compression rate) that each codec can achieve. For the same data, BZip2 tends to produce a smaller file than GZip but takes longer to perform the decompression. The Snappy codec works best with container data formats such as Sequence Files or Avro Data Files. It is fast and typically provides a good compression ratio.

Considerations for serialization and compression

Consider the following points when you are deciding whether to compress the source data:

  • Compression may not produce any improvement in performance with small files. However, with very large files (for example, files over 100 GB) compression is likely to provide dramatic improvement. The gains in performance also depend on the contents of the file and the level of compression that was achieved.
  • When optimizing a job, use the configuration settings to compress the output of the mappers and the reducers (as shown in the sketch after this list) before experimenting with compression of the source data. Compression within the job stages often provides a more substantial gain in performance compared to compressing the source data.
  • Consider using a splittable algorithm for very large files so that they can be decompressed in parallel by multiple tasks.
  • Ensure that the format you choose is compatible with the processing tools you intend to use. For example, ensure the format is compatible with Hive and Pig if you intend to use these to query your data.
  • Use the default file extension for the files if possible. This allows HDInsight to detect the file type and automatically apply the correct decompression algorithm. If you use a different file extension you must set the io.compression.codec property for the job to indicate the codec used.
  • If you are serializing the source data using Avro, you can apply a codec to the process so that the serialized data is also compressed.
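The following minimal Java sketch shows one way to apply the job-level compression settings mentioned in the list above, enabling compression of both the intermediate map output and the final job output. The property names are the Hadoop 2.x names; treat the exact settings as an assumption to verify against your cluster version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Enables compression of intermediate (map) output and of the final job
// output, as suggested in the optimization bullet above.
public class CompressedJobConfig {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // Compress the map output that is shuffled between nodes.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      GzipCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output-job");

        // Compress the final output written by the reducers.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        return job;
    }
}
```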

Note

For more information about using the compression codecs in Hadoop see the documentation for the CompressionCodecFactory and CompressionCodec classes on the Apache website.
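As a brief illustration of these classes, the following Java sketch uses CompressionCodecFactory to select a codec based on the file extension and then reads the decompressed content; the file path is illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Uses CompressionCodecFactory to pick the codec from the file extension
// (for example .gz or .bz2) and then reads the decompressed content.
public class ReadCompressedFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);                     // e.g. /data/input.gz
        FileSystem fs = FileSystem.get(path.toUri(), conf);

        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                 codec == null ? fs.open(path)             // no matching codec: read as-is
                               : codec.createInputStream(fs.open(path))))) {
            System.out.println(reader.readLine());         // print the first decompressed line
        }
    }
}
```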
