Open In Colab

Question 1

Question 1a

For a DataFrame d, you can add a column with d['new column name'] = ... and assign a list or array of values to the column. Add a column of integers containing 1, 2, 3, and 4 called rank1 to the fruit_info table which expresses your personal preference about the taste ordering for each fruit (1 is tastiest; 4 is least tasty).

question1a

Question 1b

You can also add a column to d with d.loc[:, 'new column name'] = .... As discussed in lecture, the first parameter is for the rows and second is for columns. The : means change all rows and the new column name indicates the column you are modifying (or in this case, adding).

Add a column called rank2 to the fruit_info table which contains the same values in the same order as the rank1 column.

1
2
fruit_info.loc[:,'rank2'] = fruit_info['rank1']
fruit_info

Question 2

Use the .drop() method to drop both the rank1 and rank2 columns you created. (Make sure to use the axis parameter correctly.) Note that drop does not change a table, but instead returns a new table with fewer columns or rows unless you set the optional inplace parameter.

Hint: Look through the documentation to see how you can drop multiple columns of a Pandas dataframe at once using a list of column names.

1
2
fruit_info_original = fruit_info.drop(['rank1','rank2'],axis=1)
fruit_info_original

Question 3

Use the .rename() method to rename the columns of fruit_info_original so they begin with capital letters. Set this new dataframe to fruit_info_caps.

1
2
fruit_info_caps = fruit_info_original.rename(columns=str.capitalize)
fruit_info_caps

Question 4

dataset

Selecting multiple columns is easy. You just need to supply a list of column names. Select the Name and Year in that order from the baby_names table.

1
2
name_and_year = baby_names[['Name','Year']]
name_and_year[:5]

Question 5

Using a boolean array, select the names in Year 2000 (from baby_names) that have larger than 3000 counts. Keep all columns from the original baby_names dataframe.

Note: Any time you use p & q to filter the dataframe, make sure to use df[(df[p]) & (df[q])] or df.loc[(df[p]) & (df[q])]. That is, make sure to wrap conditions with parentheses.

Remember that both slicing and loc will achieve the same result, it is just that loc is typically faster in production. You are free to use whichever one you would like.

1
2
result = baby_names[(baby_names['Year']==2000)&(baby_names['Count']>3000)]
result.head()

Optionally, repeat the exercise from above, but this time using the query command from lecture.

1
2
result_using_query = baby_names.query('Year==2000&Count>3000')
result_using_query.head()

Question 6

Some names gain/lose popularity because of cultural phenomena such as a political figure coming to power. Below, we plot the popularity of the female name Hillary in Calfiornia over time. What do you notice about this plot? What might be the cause of the steep drop?

1
2
3
4
5
6
hillary_baby_name = baby_names[(baby_names['Name'] == 'Hillary') & (baby_names['State'] == 'CA') & (baby_names['Sex'] == 'F')]
plt.plot(hillary_baby_name['Year'], hillary_baby_name['Count'])
plt.title("Hillary Popularity Over Time")
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()

lab03_q6