Load the data

In this section and the next (3.2), we'll use national/regional statistics that the United Nations (UN) gathered from 2009 to 2011. The data was retrieved from the United Nations Statistics Division's Social indicators page on April 23, 2012. It includes national/regional health and welfare statistics for 199 countries/regions and territories. These locations are mostly UN members, but the list also includes areas that aren't independent countries/regions (such as Hong Kong SAR).

The dataset includes 199 observations with the following features:

  • region: Region of the world.

  • group: A factor (or categorical variable) with the following levels:

    • oecd: Countries/regions that were members of the Organisation for Economic Co-operation and Development (OECD) by May 25, 2012.
    • africa: Countries/regions on the continent of Africa. (No OECD countries/regions are located in Africa.)
    • other: All other countries/regions.

  • fertility: The total fertility rate, that is, the number of children a woman would have over her lifetime if she experienced the prevailing age-specific fertility rates of the population.

  • ppgdp: Per-capita gross domestic product (GDP) in 2012 US dollars.

  • lifeExpF: Female life expectancy in years.

  • pctUrban: Percentage of the population that's urbanized.
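
As a rough illustration of how these features map onto pandas types, the sketch below builds a small stand-in table from the first five rows of the dataset (typed in by hand) and treats group as a categorical variable. The variable name sample and the order of the category levels are assumptions made for this example, not part of the dataset itself.

```python
import pandas as pd

# Illustrative stand-in: the first five rows of the dataset, typed in by hand.
sample = pd.DataFrame({
    "region": ["Asia", "Europe", "Africa", "Africa", "Caribbean"],
    "group": ["other", "other", "africa", "africa", "other"],
    "fertility": [5.968, 1.525, 2.142, 5.135, 2.000],
    "ppgdp": [499.0, 3677.2, 4473.0, 4321.9, 13750.1],
    "lifeExpF": [49.49, 80.40, 75.00, 53.17, 81.10],
    "pctUrban": [23, 53, 67, 59, 100],
})

# Treat group as a factor with the three documented levels.
# The level order here is an arbitrary choice for the example.
sample["group"] = pd.Categorical(
    sample["group"], categories=["oecd", "africa", "other"]
)

# Show the resulting column types and how many rows fall in each level.
print(sample.dtypes)
print(sample["group"].value_counts())
```

Declaring the levels explicitly keeps oecd in the list of categories even though none of these five rows belongs to it.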

We'll need to load several modules for this section to handle the machine learning and visualizations.

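import pandas as pd  # tabular data handling
import numpy as np  # numerical arrays
from sklearn.linear_model import LinearRegression  # regression model
import matplotlib.pyplot as plt  # plotting
# Jupyter/IPython magic: render plots inside the notebook
%matplotlib inline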

Next, we'll load the data and preview the first five rows.

df = pd.read_csv('Data/UN11.csv')
df.head()

Output

      region   group  fertility    ppgdp  lifeExpF  pctUrban
0       Asia   other      5.968    499.0     49.49        23
1     Europe   other      1.525   3677.2     80.40        53
2     Africa  africa      2.142   4473.0     75.00        67
3     Africa  africa      5.135   4321.9     53.17        59
4  Caribbean   other      2.000  13750.1     81.10       100

df['pctUrban'].min()

Output

11
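
A few quick checks like the ones above can be bundled into a small helper. The sketch below is one possible way to do that; the function name summarize and the EXPECTED_COLUMNS list are hypothetical names introduced for this example, and the demo table is a hand-typed copy of the five preview rows rather than the full file.

```python
import pandas as pd

# Column names taken from the feature list in this section.
EXPECTED_COLUMNS = ["region", "group", "fertility", "ppgdp", "lifeExpF", "pctUrban"]

def summarize(df):
    """Return a few quick facts about a DataFrame shaped like the UN data."""
    return {
        "n_rows": len(df),
        "missing_columns": [c for c in EXPECTED_COLUMNS if c not in df.columns],
        "pctUrban_min": df["pctUrban"].min(),
        "pctUrban_max": df["pctUrban"].max(),
        # A percentage should always fall between 0 and 100.
        "pctUrban_in_range": bool(df["pctUrban"].between(0, 100).all()),
    }

# Stand-in for df: the five preview rows, typed in by hand.
demo = pd.DataFrame({
    "region": ["Asia", "Europe", "Africa", "Africa", "Caribbean"],
    "group": ["other", "other", "africa", "africa", "other"],
    "fertility": [5.968, 1.525, 2.142, 5.135, 2.000],
    "ppgdp": [499.0, 3677.2, 4473.0, 4321.9, 13750.1],
    "lifeExpF": [49.49, 80.40, 75.00, 53.17, 81.10],
    "pctUrban": [23, 53, 67, 59, 100],
})

print(summarize(demo))
```

Calling summarize(df) on the full dataset should report 199 rows and a minimum pctUrban of 11, matching the description and output in this section.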

Note

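The path name passed to read_csv() is case sensitive on case-sensitive file systems (such as Linux), so 'Data/UN11.csv' must match the folder and file names exactly.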
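
If a case mismatch is a concern, one option is to check the path before calling read_csv(). The sketch below is a minimal example of that idea using only the standard library; the helper name resolve_csv is hypothetical, and the demo runs in a temporary folder rather than against the real Data folder.

```python
from pathlib import Path
import tempfile

def resolve_csv(path):
    """Return path as a Path if a file exists there; otherwise raise with a hint."""
    p = Path(path)
    if p.is_file():
        return p
    # Look in the same folder for names that differ only by letter case.
    near = []
    if p.parent.is_dir():
        near = [q.name for q in p.parent.iterdir() if q.name.lower() == p.name.lower()]
    hint = f" Did you mean {near[0]!r}?" if near else ""
    raise FileNotFoundError(f"No file at {p}.{hint}")

# Demo in a temporary folder so the example doesn't depend on local files.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "UN11.csv"
    good.write_text("region,group\nAsia,other\n")
    print(resolve_csv(good))
    try:
        # On a case-sensitive file system this lookup fails and prints the hint.
        resolve_csv(Path(tmp) / "un11.csv")
    except FileNotFoundError as err:
        print(err)
```

The returned Path can be passed directly to pd.read_csv(), which accepts path objects as well as strings.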