Moving eBird to the Azure Cloud

Re-posted from the Azure Data Lake & HDInsight blog.

Hosted by the Cornell Lab of Ornithology, eBird is a citizen science project that allows birders to submit observations to a central database. Birders seek to identify and record the birds that they discover, and can also report how much effort it took to find those birds. eBird's web and mobile apps make data recording and interaction super convenient. eBird has accumulated over 350 million records of birds all over the world in the past 14 years.

What's more, birds are strong indicators of environmental health. They use a variety of habitats, respond to seasonal and environmental cues in specific ways, and undergo dramatic migrations across the globe. By studying their distribution, abundance and movements across large geographic areas over long periods of time, researchers can build models to understand these patterns, monitor trends and identify conservation priorities.

 Species distribution model showing an abundance of tree swallows throughout an entire year. The model was generated using information collected entirely by eBirders. (Image courtesy of eBird and the Cornell Lab of Ornithology.)

Although the eBird project was providing research opportunities at a scale that would have been inconceivable otherwise, it ran into challenges with data growth and the time it took to run analytics models. The project, which has thus far captured 25 million hours of bird observation, faced exponential growth in data volumes. The mid-sized high-performance computers being used to run these analytics models were taking as many as 3 weeks to process the results for a single species. That made it very inefficient to generate the results that the researchers needed for the 700-odd species of birds that regularly inhabit North America.

Thanks to a recent collaboration between the Cornell Lab and Microsoft, this project was migrated to the fully managed, highly scalable Azure HDInsight (Hadoop) service, a key component of the Microsoft Cortana Intelligence Suite. As a result of this partnership, researchers were able to scale their clusters sufficiently and streamline the associated machine learning workflows to reduce analysis run times to as little as 3 hours, generating results across more species dramatically faster. This, in turn, provides more timely results for conservation staff to then use in their planning process. They have also been able to run models on dozens more species than they would have otherwise.

The complete solution is built on Azure Storage, HDInsight, Microsoft R Server, Linux Ubuntu, Apache Hadoop MapReduce and Spark.

You can click on this link to read the original post, or on the architecture diagram above. 

Taking advantage of the scalability, manageability and open source support of the Microsoft Azure cloud platform, the researchers behind eBird hope to drive further innovation and accelerate their research and conservation efforts, working closely with the community.

CIML Blog Team