April 2011

Volume 26 Number 04

MapReduce in F# - Parsing Log Files with F#, MapReduce and Microsoft Azure

By Noah Gift | April 2011

As a longtime Python programmer, I was intrigued by an interview with Don Syme, architect of the F# language. In the interview, Don mentioned that, “some people see [F#] as a sort of strongly typed Python, up to syntactic differences.” This struck me as something worth investigating further.

As it turns out, F# is a bold and exciting new programming language that’s still a secret to many developers. F# offers the same benefits in productivity that Ruby and Python programmers have enjoyed in recent years. Like Ruby and Python, F# is a high-level language with minimal syntax and elegance of expression. What makes F# truly unique is that it combines these pragmatic features with a sophisticated type-inference system and many of the best ideas of the functional programming world. This puts F# in a class with few peers.

But a new high-productivity programming language isn’t the only interesting new technology available to you today.

Broad availability of cloud platforms such as Azure makes distributed storage and computing resources available to single developers or enterprise-sized companies. Along with cloud storage have come helpful tools like the horizontally scalable MapReduce algorithm, which lets you rapidly write code that can quickly analyze and sort potentially gigantic data sets.

Tools like this let you write a few lines of code, deploy it to the cloud and manipulate gigabytes of data in an afternoon of work. Amazing stuff.

In this article, I hope to share some of my excitement about F#, Azure and MapReduce. I’ll pull together all of the ideas to show how you can use F# and the MapReduce algorithm to parse log files on Azure. First, I’ll cover some prototyping techniques to make MapReduce programming less complex; then, I’ll take the results … to the cloud.

Hacking with F#

One of the new styles of working that’s available to .NET programmers with F# is the Interactive workflow that many Perl, Python and Ruby programmers are accustomed to. This style of programming often uses an interactive coding environment like the Python shell itself, or a tool such as IPython, which provides readline completion. This allows a developer to import a class from a module, instantiate it, and then use tab completion to discover methods and data on the object.

A common method of interactive development that will be familiar to .NET developers is to simply write your code in Visual Studio and send snippets to the F# Interactive window for execution. Snippets are sent using the Alt+Enter key combination for the currently selected text, or Alt+SingleQuote for a single line. Figure 1 shows an example of this technique in action.

image: Using the F# Interactive Window

Figure 1 Using the F# Interactive Window

This approach is useful for quick-and-dirty debugging of an F# program and both IntelliSense and Tab completion are available in the F# script you are developing.

A second approach is to again write your code in Visual Studio, but then copy a section of code from Visual Studio and paste it directly into the standalone F# Interactive Console (see Figure 2). If you employ this technique, you’ll need to remember to place two semicolons after your pasted code. This lets you interact with your code with the additional benefit of tab completion. You may find yourself using this technique as you get used to programming in a more interactive fashion.

image: The F# Interactive Console

Figure 2 The F# Interactive Console

You can also develop interactively with F# by running your code directly from Windows PowerShell—in other words, by passing a script to fsi.exe (the F# Interactive Console executable) itself. One of the advantages of this approach is that it lets you prototype quick scripts and print the results to standard output. In addition, you can edit the code iteratively with a lightweight text editor, such as Notepad++. Figure 3 shows an example of a Map-Reduce script output that I’ll be using throughout this article, as it’s run from Windows PowerShell.

Figure 3 Running an F# Script in Windows PowerShell

PS C:\Users\Administrator\Desktop> & 'C:\Program Files (x86)\FSharp-\bin\fsi.exe' mapreduce.fsscript, 11, 9, 8, 7, 6, 5, 5

These different ways to write code all come in handy in helping you deal with complex algorithms, network programming and the cloud. You can write a quick-and-dirty prototype and run it from the command line to see if it gives you the results you expected. Then you can start building your larger project back in Visual Studio.

With this background information out of the way, let’s dive into some actual code.

MapReduce-Style Log Parsing

In addition to the aforementioned benefits of interactive programming, F# code is also concise and powerful. The example in Figure 4 is less than 50 lines of code, yet it contains all of the important parts of a Map-Reduce algorithm to calculate the top 10 IP addresses in a set of log files.

Figure 4 MapReduce Algorithm to Parse a Log File

open System.IO
open System.Collections.Generic

// Map Phase
let inputFile = @"web.log"
let mapLogFileIpAddr logFile =
  let fileReader logFile = 
    seq { use fileReader = new StreamReader(File.OpenRead(logFile))
      while not fileReader.EndOfStream do
        yield fileReader.ReadLine() }    

  // Takes lines and extracts IP Address Out, 
  // filter invalid lines out first
  let cutIp = 
    let line = fileReader inputFile 
    |> Seq.filter (fun line -> not (line.StartsWith("#")))
    |> Seq.map (fun line -> line.Split [|' '|])
    |> Seq.map (fun line -> line.[8],1)
    |> Seq.toArray

// Reduce Phase
let ipMatches = mapLogFileIpAddr inputFile
let reduceFileIpAddr = 
    (fun (acc : Map<string, int>) ((ipAddr, num) : string * int) ->
      if Map.containsKey ipAddr acc then
        let ipFreq = acc.[ipAddr]
        Map.add ipAddr (ipFreq + num) acc
        Map.add ipAddr 1 acc)

// Display Top 10 Ip Addresses
let topIpAddressOutput reduceOutput = 
  let sortedResults = 
    |> Map.toSeq
    |> Seq.sortBy (fun (ip, ipFreq) -> -ipFreq) 
    |> Seq.take 10
  |> Seq.iter(fun (ip, ipFreq) ->
    printfn "%s, %d" ip ipFreq);;

reduceFileIpAddr |> topIpAddressOutput

This standalone version, which will later become a network version, can be broken into three distinct phases: the map phase, the reduce phase and the display phase.

Phase 1 is the map phase. The function mapLogFileIpAddr takes a log file as a parameter. Defined inside this function is another function, fileReader, which uses a functional programming technique to lazily yield a line of text from the log file (although languages like C# and Python also have this). Next, the cutIp function parses each line of input, discards comment lines and then returns the IP address and an integer, 1.

To see why this is lazy, highlight the whole map block of code and run it in the F# Interactive window, along with the line:

let ipMatches = mapLogFileIpAddr inputFile

You’ll see the following output:

val ipMatches : seq<string * int>

Note that nothing has actually been done yet, and the log file has not been read. The only thing that has happened is that an expression has been evaluated. By doing this, execution is delayed until it’s actually needed, without pulling data into memory just for the sake of evaluating an expression. This is a powerful technique for processing data and it becomes especially noticeable when you have gigantic log files to parse that are in the gigabyte or terabyte range.

If you want to compare the difference and evaluate the code more eagerly, then simply add a line to the cutIp function, so that it looks like this (note: the line |> Seq.toArray is entirely optional to the construction of the function; the purpose, in our example, is to intentionally make our function eager—if this was left out, the function would simply operate lazily, like the mapLogFileIpAddr function):

let cutIp = 
  let line = fileReader inputFile 
  |> Seq.filter (fun line -> not (line.StartsWith("#")))
  |> Seq.map (fun line -> line.Split [|' '|])
  |> Seq.map (fun line -> line.[8],1)
  |> Seq.toArray

If you resend this code to the F# interpreter and you give it a large log file that contains several gigabytes of data, you might want to go get a cup of coffee, because your machine will busy reading in the whole file and generating key/value mappings for it in memory.

In the next section of the data pipeline, I take the output of the map result and fold the results of the sequence into an anonymous function that counts how many occurrences of IP addresses are in the sequence. It does this by continuously adding to the Map data structure via recursion. This style of programming can be hard to understand for developers new to functional programming, so you may want to embed print statements inside the anonymous function to see exactly what it does.

In an imperative programming style, you could accomplish the same thing by updating a mutable dictionary that holds each IP address as a key, looping over the sequence of IP addresses, and then updating the count of each value.

The final phase has nothing to do with the MapReduce algorithm, but is useful in hacking scripts during the prototyping phase. The results from the map phase are pipelined from a Map data structure to a Seq. The results are sorted and the top 10 results are printed out. Note that this data pipelining style enables the results of one operation to flow seamlessly into the next operation without a for loop in sight.

MapReduce Plus Azure

With my proof-of-concept script completed—in less than 50 lines of code, remember—it’s time move this to something resembling a production environment. As an example, I’ll move the example from desktop to Azure.

As background, you might find it useful to look at the Azure F# examples at code.msdn.microsoft.com/fsharpazure and to install the Azure templates. Of particular interest is the webcrawler example in which an F# Worker Role uses both blob and queue storage endpoints. This would be a handy project to review as you further explore using F# with Azure.

I’m not going to go into too much detail about setting up a multinode MapReduce farm. Instead I’ll cover it from a higher level. For specifics, see the MSDNMagazine article “Synchronizing Multiple Nodes in Microsoft Azure” by Josh Twist (msdn.microsoft.com/magazine/gg309174).

There are several ways to set up a MapReduce farm on Azure. Figure 5 demonstrates one example using F# Worker Roles that would be equally divided between Map Workers and Reduce Workers. Going back to script, this would be almost as simple as copying and pasting the map function to the Map Worker, and the reduce function to the Reduce Worker.

image: MapReduce Farm in Azure

Figure 5 MapReduce Farm in Azure

The MapReduce presentation by Jeff Dean and Sanjay Ghemawat is a great reference for further detail on the distributed algorithm and possible implementations (labs.google.com/papers/mapreduce-osdi04-slides/). In the Figure 5 example, though, it shows that several log files are consumed in parallel by F# Worker Roles. They then return their output, consisting of IP address keys with a value of 1 via the Azure Service Bus to the Reduce Workers, or by writing it to disk.

Next, the Reduce Workers read this intermediate data and produce a summarized count of the key value pairs by writing it out to blob storage. Each Reduce Worker produces its own summarized report, which would need to be combined together before being sorted and displayed by the Master Worker.

Worker Role Creation and Publishing

With a prototype finished, and a high-level architecture planned, the next step is to create the necessary project in Visual Studio 2010 and publish it to Azure.

Creating an F# Worker Role isn’t as straightforward as it could be, so let’s walk through the steps involved. First, you’ll need to download the Azure F# templates mentioned earlier. Next, you’ll need to create a Visual C# project for Azure. I called mine AzureFSharpProject.

Next, you’ll have the option to create an F# Worker Role as shown in Figure 6.

image: Creating an F# Worker Role

Figure 6 Creating an F# Worker Role

At that point, you can place your map function in your Map Worker role, or your reduce function in your Reduce Worker role. Then create further Worker Roles for additional Map Workers or Reduce Workers, depending on the scale of your data-crunching needs. The canonical reference to refer to is the Google MapReduce paper at labs.google.com/papers/-mapreduce.html. It goes into further detail about the Map-Reduce architecture, caveats and use cases.

When you’re ready to publish to Azure, you can right-click on the project, select Publish, then Create Service Package Only, as shown in Figure 7.

image: Publishing to Azure

Figure 7 Publishing to Azure

Finally, you sign into the new Azure management portal and use the interface to create your Worker Role (see Figure 8).

image: Configuring the New Worker Role

Figure 8 Configuring the New Worker Role

At this point, you can wire up your nodes in any way you see fit, and MapReduce your logs in the cloud. Of course, this same technique could be applied easily to data sources other than simple log files. The general outline of this F# 
MapReduce algorithm—along with the interactive techniques I demonstrated in coding it—can be used for just about any parsing, mapping and reducing job.

Next Steps

F# is a powerful language that enables you to solve problems by both cranking out quick hacks and by building up more complex solutions from those hacks. In this article I used it to cut down the MapReduce algorithm into bite-sized morsels. This enabled me to demonstrate how only 50 lines of F# could then turn into an Azure-based log analyzer.

Regarding the implementation of MapReduce on Azure, you might want to take a look at two other interesting articles on the subject. First, take a look at “Building a Scalable, Multi-Tenant Application for Microsoft Azure” on MSDN for discussion of Worker Roles and MapReduce (msdn.microsoft.com/library/ff966483). Also, Juan G. Diaz has a blog entry, “Comparison of the use of Amazon’s EC2 and Azure, cloud computing and implementation of MapReduce,” that’s worthwhile reading (firatatagun.com/comparison-of-the-use-of-amazons-ec2-and-windows-azure-cloud-computing-and-implementation-of-map-reduce/).

If you haven’t looked at F# yet, I hope this article encouraged you to give it a try. And if you’re interested in hearing all of the Don Syme interview that turned me on to F#, head over to the Simple-Talk blog and give it a listen (bit.ly/eI74iO).

Noah Gift is associate director of engineering at AT&T Interactive. He has a B.S. in Nutritional Science from Cal Poly San Luis Obispo, an M.S. in Computer Information Systems from California State University, Los Angeles, and is an MBA Candidate at UC Davis, specializing in Business Analytics, Finance and Entrepreneurship.

Thanks to the following technical experts for reviewing the article: Michael Bakkemo, Don Syme and Paras Wadehra