Open In Colab

Question 1

dataset

Question 1a

Above, it seems like OFFENSE is more specific than CVLEGEND, e.g. “LARCENY” vs. “THEFT FELONY (OVER $950)”. If you’re unfamiliar with the term, “larceny” is a legal term for theft of personal property.

To get a sense of how many subcategories there are for each OFFENSE, set calls_by_cvlegend_and_offense equal to a multi-indexed series where the data is first indexed on the CVLEGEND and then on the OFFENSE, and the data is equal to the number of offenses in the database that match the respective CVLEGEND and OFFENSE. For example, calls_by_cvlegend_and_offense[“LARCENY”, “THEFT FROM PERSON”] should return 24.

1
2
3

calls_by_cvlegend_and_offense = calls[['CVLEGEND','OFFENSE']].value_counts()
calls_by_cvlegend_and_offense["LARCENY", "THEFT FROM PERSON"]
1
24

Question 1b

In the cell below, set answer1b equal to a list of strings corresponding to the possible values for OFFENSE when CVLEGEND is “LARCENY”. You can type the answer manually, or you can create an expression that automatically extracts the names.

1
2
3
4
5
6
7
8
answer1b = (
    calls_by_cvlegend_and_offense.loc[
        calls_by_cvlegend_and_offense.index.get_level_values("CVLEGEND") == "LARCENY"
    ]
    .index.get_level_values("OFFENSE")
    .tolist()
)
answer1b
1
['THEFT MISD. (UNDER $950)', 'THEFT FELONY (OVER $950)', 'THEFT FROM PERSON']

Question 2

What are the five crime types of CVLEGEND that have the most crime events? You may need to use value_counts to find the answer. Save your results into answer2 as a list of strings.

Hint: The keys method of the Series class might be useful.

1
answer2 = calls['CVLEGEND'].value_counts().head().keys().tolist()

Question 3

Now it is your turn to make some plots using pandas. Let’s start by transforming the data so that it is easier to work with. We then will look at some distributions of the data.

The CVDOW field isn’t named helpfully and it is hard to see the meaning from the data alone. According to the website linked at the top of this notebook, CVDOW is actually indicating the day that events happened. 0->Sunday, 1->Monday … 6->Saturday.

Question 3a

Add a new column Day into the calls dataframe that has the string weekday (eg. ‘Sunday’) for the corresponding value in CVDOW. For example, if the first 3 values of CVDOW are [3, 6, 0], then the first 3 values of the Day column should be ["Wednesday", "Saturday", "Sunday"].

Hint: Try using the Series.map function on calls["CVDOW"]. Can you assign this to the new column calls["Day"]?

1
2
3
4
5
days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
day_indices = range(7)
indices_to_days_dict = dict(zip(day_indices, days)) # Should look like {0:"Sunday", 1:"Monday", ..., 6:"Saturday"}

calls['Day'] = calls['CVDOW'].map(indices_to_days_dict)

Question 3b

Now let’s look at the EVENTTM column which indicates the time for events. Since it contains hour and minute information, let’s extract the hour info and create a new column named Hour in the calls dataframe. You should save the hour as an int.

Hint: Your code should only require one line

1
calls['Hour'] = pd.to_datetime(calls['EVENTTM']).dt.hour

Question 3c

Using pandas, construct a vertical bar plot with the count of the number of calls (entries in the table) for each day of the week ordered by the day of the week (eg. Sunday, Monday, …). Be sure that your axes are labeled and that your plot is titled. There should be approximately 500 calls per day.

Hint: Given a series s, and an array coolIndex that has the same entries as in s.index, s[coolIndex] will return a copy of the series in the same order as coolIndex.

1
2
3
4
5
6
7
8
order_day = ['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
data = calls['Day'].value_counts()[order_day]
ax = data.plot(kind='bar')
ax.set_ylabel("The day of the week")
ax.set_xlabel("Number of Calls")
ax.set_title("Number of Calls By Day of the Week");
# Leave this for grading purposes
ax_3c = plt.gca()

ouput

Question 3d

Using pandas, construct a line plot with the count of the number of calls (entries in the table) for each hour of the day ordered by the time (eg. 12:00 AM, 1:00 AM, …). Please use the provided variable hours in your answer. Be sure that your axes are labeled and that your plot is titled.

1
2
3
4
5
6
7
8
9
hours = list(range(24))
data = calls['Hour'].value_counts()[hours]
ax = data.plot(kind='line')
ax.set_xlabel("The time")
ax.set_ylabel("Number of Calls")
ax.set_title("Number of Calls By the time")

# Leave this for grading purposes
ax_3d = plt.gca()

output

Question 4

It seems weekdays generally have slightly more calls than Saturday or Sunday, but the difference does not look significant.

We can break down into some particular types of events to see their distribution. For example, let’s make a bar plot for the CVLEGEND “ROBBERY”. Which day is the peak for “ROBBERY”?

Question 4a

Use pandas to create a vertical bar plot of the number of total robberies reported on each day of the week, again ordered by the days of the week starting with Sunday. Please use the provided variable filtered in your answer.

Hint: This should be very similar to Question 3b

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
filtered = calls[calls["CVLEGEND"] == "ROBBERY"]

order_day = ['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday']
data = filtered['Day'].value_counts()[order_day]
ax = data.plot(kind='bar')
ax.set_ylabel("The day of the week")
ax.set_xlabel("Number of Robberies")
ax.set_title("Number of Robberies By Day of the Week");

# Leave this for grading purposes
ax_4a = plt.gca()

output

Question 5

In the cell below, generate a boxplot which examines the hour of day of each crime broken down by the CVLEGEND value. To construct this plot use the DataFrame.boxplot documentation. You may want to rotate the CVLEGEND labels for better readability.

Looking at your plot, which crime type appears to have the largest interquartile range? Put your results into answer5 as a string

1
2
calls.boxplot(column='Hour',by='CVLEGEND',rot=90)
answer5 = 'SEX CRIME'

output