Open In Colab

dataset

Question 1

For this question we will use SQL to extract data out of the indiv_sample_nyc table. The schema for this table is given below.

Question 1a

Let’s start by looking at 2016 election contributions made by Donald Trump, who was a NY resident during that year. Write a SQL statement that will return the cmte_id, transaction_amt, and name for every contribution made by any donor with “DONALD” and “TRUMP” in their name.

Hint: You’ll need to use WHERE name LIKE '%TRUMP%' AND name LIKE '%DONALD%'.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
query_q1a='''
SELECT
    cmte_id,
    transaction_amt,
    name
FROM indiv_sample_nyc
WHERE name like '%TRUMP%' and name like '%DONALD%'
'''
res_q1a = pd.read_sql(query_q1a, engine)
res_q1a

Question 1b

If we look at the list above, it appears that some donations were not by Donald Trump himself, but instead by an entity called “DONALD J TRUMP FOR PRESIDENT INC”. If we look at the list of donors, we see that our query only seems to have picked up one such anomalous name.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
query_q1b = '''
SELECT
    cmte_id,
    transaction_amt,
    name
FROM indiv_sample_nyc
WHERE name not like '%DONALD J TRUMP FOR PRESIDENT INC%' and name like '%TRUMP%' and name like '%DONALD%'
'''
res_q1b = pd.read_sql(query_q1b, engine)
res_q1b

Question 1c

As a review of some of the key ideas from CS61A/CS88, let’s refine our query above.

We see that many of the contributions above were to the same committee. Create a new SQL query that returns the total amount that Donald Trump contributed to each committee. Your table should have three columns cmte_id, total_amount, and num_donations, where total_amount is the total amount contributed to that committee, and num_donations is the total number of donations. Your table should be sorted in decreased order of total_amount.

Hint: Use GROUP BY as covered in CS61A and CS88.

Hint: Use COUNT(*) as covered in CS61A and CS88.

Hint: Use SUM() as covered in CS61A and CS88.

Hint: Use ORDER BY as covered in CS61A and CS88.

Hint: Use DESC as covered in CS61A and CS88.

Hint: Your WHERE clause should be the same as question 1b.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
query_q1c = '''
SELECT
    cmte_id,
    sum(transaction_amt) as total_amount,
    count(*) as num_donations
FROM indiv_sample_nyc
WHERE name not like '%DONALD J TRUMP FOR PRESIDENT INC%' and name like '%TRUMP%' and name like '%DONALD%'
group by cmte_id
order by total_amount desc
'''
res_q1c = pd.read_sql(query_q1c, engine)
res_q1c

Question 1d

Now we’d like to know the name of each committee, i.e. we want to know the name of the identity that Donald Trump was making contributions to. Modify your query from question 1c so that it also shows the name of each committee. That is, your table should have four columns cmte_id, total_amount, num_donations, and cmte_nm.

The names of the committees are given in the cmte_nm column of the comm table.

Hint: You’ll need to join the indiv_sample_nyc and comm tables.

Hint: One approach is a nested select statement (a.k.a. subquery) as discussed in lecture.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15

query_q1d = '''
select a.cmte_id, a.total_amount,a.num_donations,b.cmte_nm
from (SELECT
    cmte_id,
    sum(transaction_amt) as total_amount,
    count(*) as num_donations
FROM indiv_sample_nyc
WHERE name not like '%DONALD J TRUMP FOR PRESIDENT INC%' and name like '%TRUMP%' and name like '%DONALD%'
group by cmte_id
order by total_amount desc) a
left join comm b on a.cmte_id = b.cmte_id
'''
res_q1d = pd.read_sql(query_q1d, engine)
res_q1d

Question 2

Question 2a

Let’s select transactions in the indiv_sample_nyc table with a cmte_id (committee id) that ends in 5. Then, groupby the transactions in the indiv_sample_nyc table with the cmte_id column, and count how many transactions and how much total amount of contribution there are for each cmte_id. Select the top 5 committees ordered by the transaction count.

Your output should look similar to:

committee_idtotal_amountcount
0C0057579525099091110657
1C00000935698983536237
2C000356753160193417
3C000405354925673277
4C000122454996592128
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
query_q2a='''
SELECT
    cmte_id AS committee_id,
    sum(transaction_amt) AS total_amount,
    count(*) AS count
FROM indiv_sample_nyc
where cmte_id like '%5'
group by cmte_id
order by count(*) desc limit 5
'''
res_q2a = pd.read_sql(query_q2a, engine)
res_q2a

Question 2b

When you printed the schema for the tables, you may have noticed that both the cand and comm tables have a cand_id column. Let’s try joining these two tables on this column.

List the first 5 candidate names (cand_name) in reverse lexicographic order by cand_name, along with their corresponding committee names. Only select rows that have a matching cand_id in both tables.

Your output should look similar to:

cand_namecmte_nm
0ZUTLER, DANIEL PAUL MRCITIZENS TO ELECT DANIEL P ZUTLER FOR PRESIDENT
1ZUMWALT, JAMESZUMWALT FOR CONGRESS
2ZUKOWSKI, ANDREW GEORGEZUKOWSKI FOR CONGRESS
3ZUCCOLO, JOEJOE ZUCCOLO FOR CONGRESS
4ZORN, ROBERT ERWINCONSTITUTIONAL COMMITTEE
1
2
3
4
5
6
7
query_q2b='''
SELECT c1.cand_name,c2.cmte_nm
FROM cand c1 inner JOIN comm c2 ON c1.cand_id=c2.cand_id
order by  c1.cand_name desc limit 5
'''
res_q2b = pd.read_sql(query_q2b, engine)
res_q2b

Question 2c

As in the previous part, list the first 5 candidate names (cand_name) by reverse lexicographic order and their corresponding committee names (cmte_nm). This time, include all candidate names. If there is no matching cand_id in the comm table, then cmte_nm should be NULL.

Your output should look similar to:

cand_namecmte_nm
0ZUTLER, DANIEL PAUL MRCITIZENS TO ELECT DANIEL P ZUTLER FOR PRESIDENT
1ZUMWALT, JAMESZUMWALT FOR CONGRESS
2ZUKOWSKI, ANDREW GEORGEZUKOWSKI FOR CONGRESS
3ZUCCOLO, JOEJOE ZUCCOLO FOR CONGRESS
4ZORNOW, TODD MRNone
1
2
3
4
5
6
7
8
query_q2c='''
SELECT c1.cand_name,c2.cmte_nm
FROM cand c1 left JOIN comm c2 ON c1.cand_id = c2.cand_id
order by c1.cand_name desc limit 5
'''

res_q2c = pd.read_sql(query_q2c, engine)
res_q2c