
Question 1

Question 1a

Suppose we wanted to build a histogram of our data to understand the distributions of literacy rate and income per capita individually. seaborn's countplot creates bar charts from categorical data; let's see how it behaves on these quantitative columns.

# Standard setup (these imports are normally loaded at the top of the notebook)
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
sns.countplot(x=df['lit'])
plt.xlabel("Adult literacy rate: Female: % ages 15 and older: 2005-14")
plt.title('World Bank Female Adult Literacy Rate')

plt.subplot(1, 2, 2)
sns.countplot(x=df['inc'])
plt.xlabel('Gross national income per capita, Atlas method: $: 2016')
plt.title('World Bank Gross National Income Per Capita')
plt.show()


Question 1b

In the cell below, create a plot of literacy rate and income per capita using the distplot function. As above, you should have two subplots: the left subplot for literacy and the right subplot for income. When you call distplot, set the kde parameter to False, e.g. distplot(s, kde=False).

Don’t forget to title the plot and label axes!

Hint: Copy and paste from above to start.

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
sns.distplot(df['lit'], kde=False)
plt.xlabel("Adult literacy rate: Female: % ages 15 and older: 2005-14")
plt.title('World Bank Female Adult Literacy Rate')

plt.subplot(1, 2, 2)
sns.distplot(df['inc'], kde=False)
plt.xlabel('Gross national income per capita, Atlas method: $: 2016')
plt.title('World Bank Gross National Income Per Capita')
plt.show()


You should see histograms showing how many data points fall in each bin. distplot uses a heuristic called the Freedman-Diaconis rule to choose bin widths automatically, though it is possible to set the bins yourself (we won't).
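For intuition, the Freedman-Diaconis rule sets the bin width to 2 · IQR · n^(-1/3). NumPy implements the same rule, so a minimal sketch like the following (assuming df is loaded as above) shows the bins it would choose for the literacy data:

lit = df['lit'].dropna().to_numpy()
q75, q25 = np.percentile(lit, [75, 25])
fd_width = 2 * (q75 - q25) * len(lit) ** (-1 / 3)  # Freedman-Diaconis bin width
print(fd_width)

edges = np.histogram_bin_edges(lit, bins='fd')  # NumPy's built-in FD rule
print(len(edges) - 1)  # number of bins implied by that width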

In the cell below, try creating the exact same plot again, but this time set the kde parameter to False and the rug parameter to True.

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
sns.distplot(df['lit'], kde=False, rug=True)
plt.xlabel("Adult literacy rate: Female: % ages 15 and older: 2005-14")
plt.title('World Bank Female Adult Literacy Rate')

plt.subplot(1, 2, 2)
sns.distplot(df['inc'], kde=False, rug=True)
plt.xlabel('Gross national income per capita, Atlas method: $: 2016')
plt.title('World Bank Gross National Income Per Capita')
plt.show()


Above, you should see short vertical ticks (the "rug") at the bottom of each plot marking the actual data points. In the cell below, let's make one last tweak and plot with the kde parameter set to True.

plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
sns.distplot(df['lit'], kde=True, rug=True)
plt.xlabel("Adult literacy rate: Female: % ages 15 and older: 2005-14")
plt.title('World Bank Female Adult Literacy Rate')

plt.subplot(1, 2, 2)
sns.distplot(df['inc'], kde=True, rug=True)
plt.xlabel('Gross national income per capita, Atlas method: $: 2016')
plt.title('World Bank Gross National Income Per Capita')
plt.show()


You should see roughly the same histogram as before, but now with an overlaid smooth curve: this is the kernel density estimate discussed in class.

Observations:

  • Notice that the y-axis value is no longer a count. Instead, it is a density, scaled so that the total area under the KDE curve is 1 and the total area of the histogram bars is 1. The KDE is a smooth estimate of the distribution of the given variable (see the quick check after this list).

  • The KDE is just an estimate, as is the histogram. Notice that it assigns a noticeable fraction of its area to literacy rates between 100% and 120%, which are impossible values.
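As a quick sanity check on the density scaling (a minimal sketch, assuming df is loaded as above), the bar areas of a density-normalized histogram sum to 1:

counts, edges = np.histogram(df['lit'].dropna(), density=True)
print(np.sum(counts * np.diff(edges)))  # total bar area: 1.0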

We’ll talk more about KDEs later in this lab.

Question 1c

Looking at the income data, it is difficult to see the distribution among low-income countries because they are all scrunched up at the left side of the plot. The KDE also has a problem: it places a substantial fraction of its area below 0, even though incomes cannot be negative.

Transforming the inc data logarithmically gives us a more symmetric distribution of values. This can make it easier to see patterns.

In addition, summary statistics like the mean and standard deviation (the square root of the variance) are more stable for symmetric distributions.
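To see the effect quantitatively, one can compare the sample skewness before and after the transform (a sketch using scipy.stats, assuming df is loaded; values near 0 indicate symmetry):

from scipy.stats import skew

inc = df['inc'].dropna()
print(skew(inc))            # large and positive: a long right tail
print(skew(np.log10(inc)))  # much closer to 0 after the log transform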

In the cell below, make a distribution plot of inc with the data transformed using np.log10 and kde=True. Be sure to correct the axis label using plt.xlabel. If you want to see the exact counts, just set kde=False.


sns.distplot(np.log10(df['inc']), kde=True)
plt.xlabel(r'$\log_{10}(\mathrm{inc})$')


Question 1d

If we want to examine the relationship between the female adult literacy rate and the gross national income per capita, we need to make a scatter plot.

In the cell below, create a scatter plot of untransformed income per capita and literacy rate using the sns.scatterplot function. Make sure to label both axes using plt.xlabel and plt.ylabel.

sns.scatterplot(x=df['inc'], y=df['lit'])
plt.xlabel('income per capita')
plt.ylabel('literacy rate')


We can better assess the relationship between two variables once the scatter plot has been straightened, because linearity is easy to recognize by eye.

In the cell below, create a scatter plot of log-transformed income per capita against literacy rate. Make sure to label both axes using plt.xlabel and plt.ylabel.

sns.scatterplot(x=np.log10(df['inc']), y=df['lit'])
plt.xlabel('log10(income per capita)')
plt.ylabel('literacy rate')


This scatter plot looks better. The relationship is closer to linear.

We can think of the log-linear relationship between x and y as follows: a fixed percentage (multiplicative) change in income corresponds to a constant additive change in literacy.
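As a hypothetical worked example (the coefficients below are made up for illustration, not fitted to df): if literacy ≈ a + b · log10(inc), then multiplying income by 10 raises log10(inc) by exactly 1, so the predicted literacy rises by exactly b:

# Hypothetical coefficients, for illustration only (not fitted to the data)
a, b = 20.0, 15.0

def predicted_lit(inc):
    return a + b * np.log10(inc)

print(predicted_lit(1_000))   # 20 + 15*3 = 65.0
print(predicted_lit(10_000))  # 20 + 15*4 = 80.0: a 10x income change adds exactly b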

We can also see that the long left tail of literacy shows up in this plot as many points bunched up near 100. Try squaring literacy and taking the log of income. Does the plot look better?

sns.scatterplot(x=np.log10(df['inc']), y=df['lit']**2)
plt.xlabel('log10(income per capita)')
plt.ylabel('literacy rate squared')


Choosing the best transformation for a relationship is often a balance between keeping the model simple and straightening the scatter plot.

Question 2a

As mentioned above, the kernel density estimate (KDE) is just the normalized sum of copies of the kernel, one centered at each of our data points. The default kernel used by the distplot function (as well as kdeplot) is the Gaussian kernel, given by:

$$\Large K_\alpha(x, z) = \frac{1}{\sqrt{2 \pi \alpha^2}} \exp\left(-\frac{(x - z)^2}{2 \alpha ^2} \right) $$

def gaussian_kernel(alpha, x, z):
    # Gaussian density with smoothing parameter (bandwidth) alpha, centered at z
    return 1.0 / np.sqrt(2.0 * np.pi * alpha**2) * np.exp(-(x - z)**2 / (2.0 * alpha**2))

For example, we can plot the Gaussian kernel centered at 9 with $\alpha$ = 0.5 as below:

xs = np.linspace(-2, 12, 200)
alpha = 0.5
kde_curve = [gaussian_kernel(alpha, x, 9) for x in xs]
plt.plot(xs, kde_curve);
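Since the kernel is itself a probability density, the area under this curve should be approximately 1. A quick numerical check (exact up to the truncation of the tails at the plot limits):

print(np.trapz(kde_curve, xs))  # ~1.0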


Question 2c

In your answers above, you hard-coded a lot of your work. In this problem, you'll build a more general kernel density estimator. Implement the kde function below, which computes:

$$\Large f_\alpha(x) = \frac{1}{n} \sum_{i=1}^n K_\alpha(x, z_i) $$

where $z_i$ are the data, $\alpha$ is a parameter that controls the smoothness, and $K_\alpha$ is the kernel function passed as kernel.

def kde(kernel, alpha, x, data):
    """
    Compute the kernel density estimate for the single query point x.

    Args:
        kernel: a kernel function with 3 parameters: alpha, x, data
        alpha: the smoothing parameter to pass to the kernel
        x: a single query point (in one dimension)
        data: a numpy array of data points

    Returns:
        The smoothed estimate at the query point x
    """
    return np.mean(kernel(alpha, x, data))
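For example, one might sanity-check the estimator against the earlier plots (a minimal sketch, assuming df, gaussian_kernel, and the imports from above are in scope; the bandwidth alpha=1.0 is an arbitrary choice, not tuned):

data = df['lit'].dropna().to_numpy()
xs = np.linspace(0, 120, 300)
curve = [kde(gaussian_kernel, 1.0, x, data) for x in xs]
plt.plot(xs, curve)
plt.xlabel('literacy rate')
plt.ylabel('estimated density');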