opendatasets Package

Contains functionality for consuming Azure Open Datasets as dataframes and for enriching customer data.

Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine learning solutions for more accurate models. You can convert these public datasets into Spark and pandas dataframes with filters applied. For some datasets, you can use an enricher to join the public data with your data. For example, you can join your data with weather data by location (longitude and latitude, or zip code) and time.
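As a rough illustration of converting a dataset into a pandas dataframe with a date filter, the sketch below loads one month of green taxi records. It assumes the azureml-opendatasets package is installed and that NycTlcGreen accepts start_date and end_date arguments and exposes to_pandas_dataframe(). The names month_window and load_green_taxi_month are hypothetical helpers, not part of the package.

```python
from datetime import datetime, timedelta


def month_window(year, month):
    """Return (start, end) datetimes spanning one calendar month."""
    start = datetime(year, month, 1)
    # The first day of the following month, minus one second, ends this month.
    next_month = datetime(year + (1 if month == 12 else 0), month % 12 + 1, 1)
    return start, next_month - timedelta(seconds=1)


def load_green_taxi_month(year, month):
    """Load one month of green taxi trips as a pandas dataframe."""
    # Imported lazily so month_window stays usable without the package installed.
    # Assumption: the constructor takes start_date and end_date filters.
    from azureml.opendatasets import NycTlcGreen

    start, end = month_window(year, month)
    return NycTlcGreen(start_date=start, end_date=end).to_pandas_dataframe()
```

Calling load_green_taxi_month(2015, 1) would download data, so a narrow date window is a sensible starting point for experiments.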

Included in Azure Open Datasets are public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions. Open Datasets are in the cloud on Microsoft Azure and are integrated into Azure Machine Learning. For more information about working with Azure Open Datasets, see Create datasets with Azure Open Datasets.

For general information about Azure Open Datasets, see Azure Open Datasets Documentation.

Packages

accessories

Contains functionality that helps identify column types in data, including lat/long, zipcode, and time.

aggregators

Contains functionality for defining how joined data is aggregated.

Aggregators define operations that can be performed on the result of joining data from two datasets. For example, when you use one of the classes in enrichers, you can specify an aggregator as part of the operation. If no aggregation is needed, use AggregatorAll.
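To make the idea of aggregating joined data concrete, here is a minimal conceptual sketch in plain Python. It does not use the aggregators package. The data and the function names aggregate_mean and aggregate_all are hypothetical and only mimic the difference between collapsing joined rows and keeping them all.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical joined rows: customer row 1 matched two public rows.
joined = [
    {"customer_id": 1, "temperature": 10.0},
    {"customer_id": 1, "temperature": 14.0},
    {"customer_id": 2, "temperature": 20.0},
]


def aggregate_mean(rows, key, value):
    """Collapse joined rows to one mean value per key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: mean(v) for k, v in groups.items()}


def aggregate_all(rows):
    """Keep every joined row unchanged (no aggregation)."""
    return list(rows)


means = aggregate_mean(joined, "customer_id", "temperature")
```

Here aggregate_all plays the role of a pass-through, similar in spirit to choosing no aggregation.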

data

Contains the init file for data resources in the publicholidays module.

dataaccess

Contains functionality providing blob file access methods.

When you use a class from the opendatasets package like the ChicagoSafety class, the dataaccess classes and functions in this package are used internally. In general, you won't need to use the functionality in the dataaccess package directly.

enrichers

Contains functionality for enriching and joining together data from two datasets.

Generally, enrichers join together data from different sources. Specifically, enrichers enable you to join your data (customer data) with data from Azure Open Datasets or other public datasets.
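The following sketch shows the kind of join an enricher performs, using plain Python rather than the enrichers package. The sample rows and the enrich function are hypothetical. The real classes may work differently, so treat this only as an illustration of a left join on location and time keys.

```python
from datetime import date

# Hypothetical customer rows and public weather rows.
customers = [
    {"order": "A", "zipcode": "98101", "day": date(2020, 1, 1)},
    {"order": "B", "zipcode": "60601", "day": date(2020, 1, 2)},
]
weather = [
    {"zipcode": "98101", "day": date(2020, 1, 1), "temperature": 5.0},
    {"zipcode": "60601", "day": date(2020, 1, 1), "temperature": -3.0},
]


def enrich(customer_rows, public_rows, keys, columns):
    """Left-join the given public columns onto customer rows by key columns."""
    index = {tuple(r[k] for k in keys): r for r in public_rows}
    enriched = []
    for row in customer_rows:
        match = index.get(tuple(row[k] for k in keys))
        # Customer rows without a public match keep None for the new columns.
        extra = {c: (match[c] if match else None) for c in columns}
        enriched.append({**row, **extra})
    return enriched


result = enrich(customers, weather, keys=("zipcode", "day"), columns=("temperature",))
```

Every customer row is preserved. Order B has no weather row for its zip code and day, so its temperature stays empty.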

granularities

Contains functionality defining time and distance measures used by enrichers.

Granularities are measures of time or distance used by enrichers when enriching (joining) data. There are time granularities, such as hourly or daily, and location granularities, such as closest distance.
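As a small illustration of time granularity, the sketch below reduces two timestamps to hourly and daily buckets so that they can be matched. It is plain Python, not the granularities package. The function names are hypothetical, and truncation is used as a simple stand-in for whatever rounding the library applies.

```python
from datetime import datetime


def truncate_to_hour(ts):
    """Drop minutes and smaller units so timestamps in one hour compare equal."""
    return ts.replace(minute=0, second=0, microsecond=0)


def truncate_to_day(ts):
    """Drop the time of day so timestamps on one date compare equal."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


# Hypothetical customer event and public weather reading.
pickup = datetime(2020, 1, 1, 14, 37, 12)
reading = datetime(2020, 1, 1, 14, 5, 0)

same_hour = truncate_to_hour(pickup) == truncate_to_hour(reading)
same_day = truncate_to_day(pickup) == truncate_to_day(reading)
```

A coarser granularity produces more matches per customer row, which is one reason an aggregator may be needed after the join.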

selectors

Contains functionality for selecting and joining data from a customer dataset with data from a public dataset.

Selectors define logic that enables you to enrich your data with public datasets based on time and distance measures. For example, with a selector you can find public data to join with your data based on nearest location, or by rounding to the same time granularity.

Specify selectors when working with one of the classes in the enrichers package.
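The nearest-location idea behind selectors can be sketched with a great-circle distance calculation. This example is plain Python and does not use the selectors package. The station list and the function names haversine_km and nearest_station are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


# Hypothetical public data locations: (name, latitude, longitude).
stations = [
    ("seattle", 47.61, -122.33),
    ("chicago", 41.88, -87.63),
    ("boston", 42.36, -71.06),
]


def nearest_station(lat, lon):
    """Return the name of the station closest to the given point."""
    return min(stations, key=lambda s: haversine_km(lat, lon, s[1], s[2]))[0]
```

For instance, nearest_station(40.71, -74.0) picks the Boston entry for a point in New York City, because it is the closest of the three sample stations.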

Modules

environ

Defines runtime environment classes where Azure Open Datasets are used.

The classes in this module ensure Azure Open Datasets functionality is optimized for different environments. In general, you do not need to instantiate these environment classes or worry about their implementation. Instead, use the get_environ module function to return the environment.

Classes

BingCOVID19Data

Represents the Bing COVID-19 dataset.

This dataset contains Bing COVID-19 data from multiple trusted, reliable sources, including the World Health Organization (WHO), Centers for Disease Control and Prevention (CDC), national and state public health departments, BNO News, 24/7 Wall St., and Wikipedia. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see Bing COVID-19 Data in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

BostonSafety

Represents the Boston Safety public dataset.

This dataset contains 311 calls reported to the city of Boston. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see Boston Safety Data in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

COVID19OpenResearch

Represents COVID-19 Open Research Dataset.

For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see COVID-19 Open Research Dataset in the Microsoft Azure Open Datasets catalog.

COVIDTrackingProject

Represents the COVID Tracking Project dataset.

This dataset contains COVID Tracking Project data, which provides the latest numbers on tests, confirmed cases, hospitalizations, and patient outcomes from every US state and territory. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see COVID Tracking Project dataset in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

ChicagoSafety

Represents the Chicago Safety public dataset.

This dataset contains 311 service requests from the city of Chicago, including historical sanitation code complaints, pot holes reported, and street light issues. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see Chicago Safety Data in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

CitySafety

City safety base class. This is a parent class that the individual city safety classes inherit from.

Initialize filtering fields.

Diabetes

Represents the Sample Diabetes public dataset.

The Diabetes dataset has 442 samples with 10 features, making it ideal for getting started with machine learning algorithms. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see Sample: Diabetes in the Microsoft Azure Open Datasets catalog.

EcdcCOVIDCases

Represents the European Centre for Disease Prevention and Control (ECDC) Covid-19 Cases.

This dataset contains data from the European Centre for Disease Prevention and Control (ECDC). Each row/entry contains the number of new cases reported per day and per country/region. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see European Centre for Disease Prevention and Control (ECDC) Covid-19 Cases in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

MNIST

Represents the MNIST dataset of handwritten digits.

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see The MNIST database of handwritten digits in the Microsoft Azure Open Datasets catalog.

For an example of using the MNIST dataset, see the tutorial Train image classification models with MNIST data and scikit-learn using Azure Machine Learning.

NoParameterOpenDatasetBase

US labor base class.

Initialize.

NoaaGfsWeather

Represents the National Oceanic and Atmospheric Administration (NOAA) Global Forecast System (GFS) dataset.

This dataset contains 15-day US hourly weather forecast data (for example, temperature, precipitation, and wind) produced by the Global Forecast System (GFS) from the National Oceanic and Atmospheric Administration (NOAA). For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see NOAA Global Forecast System in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

NoaaIsdWeather

Represents the National Oceanic and Atmospheric Administration (NOAA) Integrated Surface Dataset (ISD).

This dataset contains worldwide hourly weather history data (for example, temperature, precipitation, and wind) sourced from the National Oceanic and Atmospheric Administration (NOAA). For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see NOAA Integrated Surface Data in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

NycSafety

Represents the New York City Safety public dataset.

This dataset contains all New York City 311 service requests from 2010 to the present. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see New York City Safety Data in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

NycTaxiBase

New York Taxi base class. This is a parent class that the NYC taxi dataset classes inherit from.

Initialize filtering fields.

NycTlcFhv

Represents the NYC Taxi & Limousine Commission public dataset.

This dataset contains For-Hire Vehicle (FHV) trip records, which include fields capturing the dispatching base license number and the pick-up date, time, and taxi zone location ID (shape file below). These records are generated from the FHV Trip Record submissions made by bases. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see NYC Taxi & Limousine Commission - For-Hire Vehicle (FHV) trip records in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

NycTlcGreen

Represents the NYC Taxi & Limousine Commission green taxi trip public dataset.

The green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see NYC Taxi & Limousine Commission - green taxi trip records in the Microsoft Azure Open Datasets catalog.

For an example of using the NycTlcGreen class, see the tutorial Use automated machine learning to predict taxi fares.

Initialize filtering fields.

NycTlcYellow

Represents the NYC Taxi & Limousine Commission yellow taxi trip public dataset.

The yellow taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see NYC Taxi & Limousine Commission - yellow taxi trip records in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

OjSalesSimulated

Represents the Sample Orange Juice Sales Simulated data dataset.

For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see Sample: OJ Sales Simulated Data in the Microsoft Azure Open Datasets catalog.

PublicHolidays

Represents the Public Holidays public dataset.

This dataset contains worldwide public holiday data sourced from the PyPI holidays package and Wikipedia, covering 38 countries or regions from 1970 to 2099. Each row indicates the holiday info for a specific date, country or region, and whether most people have paid time off. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see Public Holidays in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

PublicHolidaysOffline

Represents the Public Holidays Offline public dataset.

For a description of the rows, see Public Holidays in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

SampleDatasetBase

Represents the Sample Dataset Base class.

SanFranciscoSafety

Represents the San Francisco Safety public dataset.

This dataset contains fire department calls for service and 311 cases in San Francisco. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see San Francisco Safety Data in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

SeattleSafety

Represents the Seattle Safety public dataset.

This dataset contains Seattle Fire Department 911 dispatch data. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see Seattle Safety Data in the Microsoft Azure Open Datasets catalog.

Initialize filtering fields.

UsLaborCPI

Represents the US Consumer Price Index public dataset.

The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see US Consumer Price Index in the Microsoft Azure Open Datasets catalog.

Initialize.

UsLaborEHENational

Represents the US National Employment Hours and Earnings public dataset.

This dataset contains industry estimates of nonfarm employment, hours, and earnings of workers on payrolls in the United States. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see US National Employment Hours and Earning in the Microsoft Azure Open Datasets catalog.

Initialize.

UsLaborEHEState

Represents the US State Employment Hours and Earnings public dataset.

This dataset contains industry estimates of nonfarm employment, hours, and earnings of workers on payrolls in the United States. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see US State Employment Hours and Earning in the Microsoft Azure Open Datasets catalog.

Initialize.

UsLaborLAUS

Represents the US Local Area Unemployment Statistics public dataset.

This dataset contains monthly and annual employment, unemployment, and labor force data for Census regions and divisions, States, counties, metropolitan areas, and many cities in the United States. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see US Local Area Unemployment Statistics in the Microsoft Azure Open Datasets catalog.

Initialize.

UsLaborLFS

Represents the US Labor Force Statistics public dataset.

This dataset contains data about the labor force in the United States, including labor force participation rates, and the civilian noninstitutional population by age, gender, race, and ethnic groups. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see US Labor Force Statistics in the Microsoft Azure Open Datasets catalog.

Initialize.

UsLaborPPICommodity

Represents the US Producer Price Index (PPI) - Commodities public dataset.

The Producer Price Index (PPI) is a measure of average change over time in the selling prices received by domestic producers for their output. The prices included in the PPI are from the first commercial transaction for products and services covered. This dataset contains PPIs for individual products and groups of products released monthly. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see US Producer Price Index - Commodities in the Microsoft Azure Open Datasets catalog.

Initialize.

UsLaborPPIIndustry

Represents the US Producer Price Index (PPI) - Industry public dataset.

The Producer Price Index (PPI) is a measure of average change over time in the selling prices received by domestic producers for their output. The prices included in the PPI are from the first commercial transaction for products and services covered. This dataset contains PPIs for a wide range of industry sectors of the U.S. economy. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see US Producer Price Index - Industry in the Microsoft Azure Open Datasets catalog.

Initialize.

UsPopulationCounty

Represents the US Population by County public dataset.

This dataset contains US population by gender and race for each US county, sourced from the 2000 and 2010 Decennial Census. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see US Population by County in the Microsoft Azure Open Datasets catalog.

Initialize.

UsPopulationZip

Represents the US Population by Zip Code public dataset.

This dataset contains US population by gender and race for each US ZIP code, sourced from the 2010 Decennial Census. For more information about this dataset, including column descriptions, different ways to access the dataset, and examples, see US Population by ZIP Code in the Microsoft Azure Open Datasets catalog.

Initialize.