https://github.com/dataquestio/solutions/blob/master/Mission210Solution.ipynb
1: Jeopardy Questions
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. If you need help at any point, you can consult the solution notebook linked above.
Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.
The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download here. Here's the beginning of the file:
As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:
- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the number of dollars answering the question correctly is worth.
- Question -- the text of the question.
- Answer -- the text of the answer.
Instructions
- Read the dataset into a Dataframe called jeopardy using Pandas.
- Print out the first 5 rows of jeopardy.
- Print out the columns of jeopardy using jeopardy.columns.
- Some of the column names have spaces in front.
  - Remove the spaces in each item in jeopardy.columns.
  - Assign the result back to jeopardy.columns to fix the column names in jeopardy.
- Make sure you pay close attention to the format of each column.
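The loading and column cleanup steps above can be sketched as follows. This is a minimal illustration, not the course's reference solution: the in-memory CSV and its single placeholder row are made up so the snippet runs without jeopardy.csv, and only the header line mirrors the real file's quirk (a space in front of most column names).

```python
import io

import pandas as pd

# Hypothetical stand-in for jeopardy.csv: same header quirk, one placeholder row.
sample_csv = io.StringIO(
    "Show Number, Air Date, Round, Category, Value, Question, Answer\n"
    "4680,2004-12-31,Jeopardy!,HISTORY,$200,placeholder question text,placeholder answer\n"
)

# With the real data this would be pd.read_csv("jeopardy.csv").
jeopardy = pd.read_csv(sample_csv)
print(jeopardy.head(5))
print(jeopardy.columns)

# Strip leading and trailing whitespace from every column name,
# then assign the cleaned names back.
jeopardy.columns = jeopardy.columns.str.strip()
print(list(jeopardy.columns))
```

Stripping the names programmatically avoids retyping the full column list by hand, which matters if the file ever gains or reorders columns.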
2: Normalizing Text
Before you can start doing analysis on the Jeopardy questions, you need to normalize all of the text columns (the Question and Answer columns). We covered normalization before, but the idea is to ensure that you lowercase words and remove punctuation so Don't and don't aren't considered to be different words when you compare them.
Instructions
- Write a function to normalize questions and answers. It should:
  - Take in a string.
  - Convert the string to lowercase.
  - Remove all punctuation in the string.
  - Return the string.
- Normalize the Question column.
  - Use the Pandas apply method to apply the function to each item in the Question column.
  - Assign the result to the clean_question column.
- Normalize the Answer column.
  - Use the Pandas apply method to apply the function to each item in the Answer column.
  - Assign the result to the clean_answer column.
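As one possible variant of the normalization function described above, the sketch below uses str.translate instead of a regular expression. The name normalize_text_alt is hypothetical, and unlike a regex that keeps only letters, digits, and whitespace, this version removes ASCII punctuation only, so non-ASCII symbols would survive.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text_alt(text):
    """Lowercase the text and strip ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(_PUNCT_TABLE)


# "Don't" and "don't" collapse to the same token after normalization.
print(normalize_text_alt("Don't") == normalize_text_alt("don't"))
print(normalize_text_alt("McDonald's"))

# On the DataFrame, this would be applied per item, for example:
# jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text_alt)
```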
3: Normalizing Columns
Now that you've normalized the text columns, there are also some other columns to normalize.
The Value column should also be numeric, to allow you to manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.
The Air Date column should also be a datetime, not a string, to enable you to work with it more easily.
Instructions
- Write a function to normalize dollar values. It should:
  - Take in a string.
  - Remove any punctuation in the string.
  - Convert the string to an integer.
  - If the conversion has an error, assign 0 instead.
  - Return the integer.
- Normalize the Value column.
  - Use the Pandas apply method to apply the function to each item in the Value column.
  - Assign the result to the clean_value column.
- Use the pandas.to_datetime function to convert the Air Date column to a datetime column.
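The date conversion in the last step can be sketched as below. The two-row frame is a hypothetical miniature that reuses date strings in the format seen in the dataset; it is only there so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical miniature frame with dates stored as text.
jeopardy = pd.DataFrame({"Air Date": ["2004-12-31", "2009-05-14"]})

# Convert the text column to a datetime column in place.
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

print(jeopardy.dtypes)
# Once the column is a datetime, date parts are available through .dt
print(jeopardy["Air Date"].dt.year.tolist())
```

Having a real datetime column is what makes the later "sort by ascending air date" step reliable, since sorting no longer depends on how the strings happen to be formatted.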
4: Answers In Questions
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.
You can answer the second question by seeing how often complex words (6 or more characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.
Instructions
- Write a function that takes in a row in jeopardy, as a Series. It should:
  - Split the clean_answer column on the space character (" "), and assign to the variable split_answer.
  - Split the clean_question column on the space character (" "), and assign to the variable split_question.
  - Create a variable called match_count, and set it to 0.
  - If "the" is in split_answer, remove it using the remove method on lists. "The" is commonly found in answers and questions, but doesn't have any meaningful use in finding the answer.
  - If the length of split_answer is 0, return 0. This prevents a division by zero error later.
  - Loop through each item in split_answer, and see if it occurs in split_question. If it does, add 1 to match_count.
  - Divide match_count by the length of split_answer, and return the result.
- Count how many times terms in clean_answer occur in clean_question.
  - Use the Pandas apply method on Dataframes to apply the function to each row in jeopardy.
  - Pass the axis=1 argument to apply the function across each row.
  - Assign the result to the answer_in_question column.
- Find the mean of the answer_in_question column using the mean method on Series.
- Write up a markdown cell with a short explanation of how finding this mean might influence your studying strategy for Jeopardy.
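The row function described in the steps above might look like the following sketch. The function name count_matches and the three sample rows are hypothetical; plain dicts stand in for DataFrame rows so the example runs without the dataset, and the function works unchanged on a Series because it only uses key lookups.

```python
def count_matches(row):
    """Fraction of answer words (ignoring one "the") found in the question."""
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")  # list.remove drops only the first occurrence
    if len(split_answer) == 0:
        return 0  # avoids dividing by zero below
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)


# Hypothetical rows, already normalized.
rows = [
    {"clean_question": "this red planet is named for the roman god of war",
     "clean_answer": "the planet mars"},
    {"clean_question": "he improved the telescope", "clean_answer": "galileo"},
    {"clean_question": "a common article", "clean_answer": "the"},
]

scores = [count_matches(row) for row in rows]
print(scores)
print(sum(scores) / len(scores))  # mean overlap across the sample rows

# On the DataFrame:
# jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
# jeopardy["answer_in_question"].mean()
```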
5: Recycled Questions
Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.
To do this, you can:
- Sort jeopardy in order of ascending air date.
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy.
- Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
  - If it does, increment a counter.
  - Add each word to terms_used.
This will enable you to check if the terms in questions have been used previously or not. Only looking at words of 6 or more characters enables you to filter out words like "the" and "than", which are commonly used, but don't tell you a lot about a question.
Instructions
- Create an empty list called question_overlap.
- Create an empty set called terms_used.
- Use the iterrows Dataframe method to loop through each row of jeopardy.
  - Split the clean_question column of the row on the space character (" "), and assign to split_question.
  - Remove any words in split_question that are less than 6 characters long.
  - Set match_count to 0.
  - Loop through each word in split_question.
    - If the term occurs in terms_used, add 1 to match_count.
  - Add each word in split_question to terms_used using the add method on sets.
  - If the length of split_question is greater than 0, divide match_count by the length of split_question.
  - Append match_count to question_overlap.
- Assign question_overlap to the question_overlap column of jeopardy.
- Find the mean of the question_overlap column and print it.
- Look at the value, and think about what this might mean for questions being recycled. Write up your thoughts in a markdown cell.
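The overlap loop above can be sketched in plain Python as follows. The helper name question_overlap_scores and the three sample questions are hypothetical; the function takes a list of already-normalized question strings that are assumed to be in ascending air-date order, which is what sorting the DataFrame first would give you.

```python
def question_overlap_scores(clean_questions):
    """Return per-question overlap with earlier questions, plus all terms seen."""
    terms_used = set()
    question_overlap = []
    for question in clean_questions:
        # Keep only words of 6 or more characters.
        split_question = [w for w in question.split(" ") if len(w) >= 6]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count += 1
        for word in split_question:
            terms_used.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
        question_overlap.append(match_count)
    return question_overlap, terms_used


# Hypothetical questions, oldest first.
questions = [
    "this planet is closest to the sun",
    "the largest planet in our solar system",
    "who was he",
]
scores, terms_used = question_overlap_scores(questions)
print(scores)
print(sum(scores) / len(scores))

# On the DataFrame, the same loop body would run inside:
# jeopardy = jeopardy.sort_values("Air Date")
# for i, row in jeopardy.iterrows(): ...
```

Note that the first question can never score above 0, because terms_used starts empty; the mean therefore depends on how much history the sample contains.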
6: Low Value Vs High Value Questions
Let's say you only want to study terms that pertain to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.
You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:
- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.
You'll then be able to loop through each of the terms from the last screen, terms_used, and:
- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.
You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
Instructions
- Create a function that takes in a row from a Dataframe, and:
  - If the clean_value column is greater than 800, assign 1 to value.
  - Otherwise, assign 0 to value.
  - Return value.
- Determine which questions are high and low value.
  - Use the Pandas apply method on Dataframes to apply the function to each row in jeopardy.
  - Pass the axis=1 argument to apply the function across each row.
  - Assign the result to the high_value column.
- Create a function that takes in a word, and:
  - Assigns 0 to low_count.
  - Assigns 0 to high_count.
  - Loops through each row in jeopardy using the iterrows method.
    - Split the clean_question column on the space character (" ").
    - If the word is in the split question:
      - If the high_value column is 1, add 1 to high_count.
      - Else, add 1 to low_count.
  - Returns high_count and low_count. You can return multiple values by separating them with a comma.
- Create an empty list called observed_expected.
- Convert terms_used into a list using the list function, and assign the first 5 elements to comparison_terms.
- Loop through each term in comparison_terms, and:
  - Run the function on the term to get the high value and low value counts.
  - Append the result of running the function (which will be a tuple of the high and low counts) to observed_expected.
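A rough sketch of the two functions described above follows. The names determine_value and count_usage, the rows parameter, and the four sample rows are all hypothetical; the instructions have the second function loop over jeopardy directly with iterrows, whereas here the rows are passed in as a list of dicts so the snippet is self-contained.

```python
def determine_value(row):
    """Return 1 for a high value question (above 800), otherwise 0."""
    return 1 if row["clean_value"] > 800 else 0


def count_usage(word, rows):
    """Count the high and low value questions that contain the word."""
    low_count = 0
    high_count = 0
    for row in rows:
        if word in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count  # returned together as a tuple


# Hypothetical normalized rows.
rows = [
    {"clean_question": "this ancient empire built roads", "clean_value": 200},
    {"clean_question": "the ancient capital of egypt", "clean_value": 1000},
    {"clean_question": "an ancient greek philosopher", "clean_value": 2000},
    {"clean_question": "a small demon", "clean_value": 800},
]
for row in rows:
    row["high_value"] = determine_value(row)

comparison_terms = ["ancient", "demon"]
observed_expected = [count_usage(term, rows) for term in comparison_terms]
print(observed_expected)
```

A question worth exactly 800 lands in the low value group here, because the test is strictly "greater than 800".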
7: Applying The Chi-Squared Test
Now that you've found the observed counts for a few terms, you can compute the expected counts and the chi-squared value.
Instructions
- Find the number of rows in jeopardy where high_value is 1, and assign to high_value_count.
- Find the number of rows in jeopardy where high_value is 0, and assign to low_value_count.
- Create an empty list called chi_squared.
- Loop through each pair of counts in observed_expected.
  - Add up both items in the pair (high and low counts) to get the total count, and assign to total.
  - Divide total by the number of rows in jeopardy to get the proportion across the dataset. Assign to total_prop.
  - Multiply total_prop by high_value_count to get the expected term count for high value rows.
  - Multiply total_prop by low_value_count to get the expected term count for low value rows.
  - Use the scipy.stats.chisquare function to compute the chi-squared value and p-value given the expected and observed counts.
  - Append the results to chi_squared.
- Look over the chi-squared values and the associated p-values. Are there any statistically significant results? Write up your thoughts in a markdown cell.
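To make the arithmetic in these steps concrete, the sketch below computes the statistic by hand rather than calling scipy.stats.chisquare, which is what the instructions actually ask for. The function name and the row counts in the example are hypothetical. With two categories (high and low) the test has one degree of freedom, and for that case the p-value equals erfc(sqrt(chi2 / 2)).

```python
import math


def chi_squared_for_term(high_count, low_count, high_value_count, low_value_count):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
    total = high_count + low_count
    # Number of rows in the dataset is the sum of the two groups.
    total_prop = total / (high_value_count + low_value_count)
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    observed = [high_count, low_count]
    expected = [expected_high, expected_low]
    # Assumes total > 0 and both groups are non-empty, so no expected count is 0.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: a term seen once, in a high value question,
# in a dataset with 5000 high value and 15000 low value rows.
chi2, p_value = chi_squared_for_term(1, 0, 5000, 15000)
print(chi2, p_value)
```

In this example the expected counts are 0.25 and 0.75, far below the sizes at which a chi-squared test is usually considered reliable, which is one reason results for rare terms should be read with caution.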
8: Next Steps
That's it for the guided steps! We recommend exploring the data more on your own.
Here are some potential next steps:
- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
  - Manually create a list of words to remove, like the, than, etc.
  - Find a list of stopwords to remove.
  - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
  - Use the apply method to make the code that calculates frequencies more efficient.
  - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
  - See which categories appear the most often.
  - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (available here) instead of the subset we used in this mission.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
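As an illustration of one idea from the list above (removing words that occur in more than a certain percentage of questions), here is a small sketch. The function name frequent_terms, the max_fraction parameter, and the sample questions are hypothetical, and the 0.5 threshold in the example is chosen only because the sample is tiny.

```python
from collections import Counter


def frequent_terms(clean_questions, max_fraction=0.05):
    """Return terms that appear in more than max_fraction of the questions."""
    doc_freq = Counter()
    for question in clean_questions:
        # A set counts each term at most once per question.
        doc_freq.update(set(question.split()))
    limit = max_fraction * len(clean_questions)
    return {term for term, n in doc_freq.items() if n > limit}


# Hypothetical normalized questions.
questions = [
    "the red planet",
    "the largest ocean",
    "the first president",
    "a small demon",
]
too_common = frequent_terms(questions, max_fraction=0.5)
print(too_common)
```

The resulting set could then be used in place of the 6-character length filter when building terms_used.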
We recommend creating a GitHub repository and placing this project there. It will help other people, including employers, see your work. As you start to put multiple projects on GitHub, you'll have the beginnings of a strong portfolio.
You're welcome to keep working on the project here, but we recommend downloading it to your computer using the download icon above and working on it there.
We hope this guided project has been a good experience, and please email us at hello@dataquest.io if you want to share your work!
In [23]:
import pandas
import csv

jeopardy = pandas.read_csv("jeopardy.csv")
jeopardy
Out[23]:
| | Show Number | Air Date | Round | Category | Value | Question | Answer |
---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |
5 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $200 | In the title of an Aesop fable, this insect sh... | the ant |
6 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $400 | Built in 312 B.C. to link Rome & the South of ... | the Appian Way |
7 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $400 | No. 8: 30 steals for the Birmingham Barons; 2,... | Michael Jordan |
8 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $400 | In the winter of 1971-72, a record 1,122 inche... | Washington |
9 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $400 | This housewares store was named for the packag... | Crate & Barrel |
10 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $400 | "And away we go" | Jackie Gleason |
11 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $400 | Cows regurgitate this from the first stomach t... | the cud |
12 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $600 | In 1000 Rajaraja I of the Cholas battled to ta... | Ceylon (or Sri Lanka) |
13 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $600 | No. 1: Lettered in hoops, football & lacrosse ... | Jim Brown |
14 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $600 | On June 28, 1994 the nat'l weather service beg... | the UV index |
15 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $600 | This company's Accutron watch, introduced in 1... | Bulova |
16 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $600 | Outlaw: "Murdered by a traitor and a coward wh... | Jesse James |
17 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $600 | A small demon, or a mischievous child (who mig... | imp |
18 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $800 | Karl led the first of these Marxist organizati... | the International |
19 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $800 | No. 10: FB/LB for Columbia U. in the 1920s; MV... | (Lou) Gehrig |
20 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $800 | Africa's lowest temperature was 11 degrees bel... | Morocco |
21 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $800 | Edward Teller & this man partnered in 1898 to ... | (Paul) Bonwit |
22 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $2,000 | 1939 Oscar winner: "...you are a credit to you... | Hattie McDaniel (for her role in Gone with the... |
23 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $800 | In geologic time one of these, shorter than an... | era |
24 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $1000 | This Asian political party was founded in 1885... | the Congress Party |
25 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $1000 | No. 5: Only center to lead the NBA in assists;... | (Wilt) Chamberlain |
26 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $1000 | The Kirschner brothers, Don & Bill, named this... | K2 |
27 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $1000 | Revolutionary War hero: "His spirit is in Verm... | Ethan Allen |
28 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $1000 | A single layer of paper, or to perform one's c... | ply |
29 | 4680 | 2004-12-31 | Double Jeopardy! | DR. SEUSS AT THE MULTIPLEX | $400 | <a href="http://www.j-archive.com/media/2004-1... | Horton |
... | ... | ... | ... | ... | ... | ... | ... |
19969 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1200 | In 1960 the last of these locomotives was reti... | steam engines |
19970 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $1200 | Kate: "if I be waspish, best beware my sting";... | Petruchio |
19971 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $1,500 | This private college in Northern California bo... | Stanford University |
19972 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $1200 | She voiced Princess Pea in "The Tale of Desper... | Emma Watson |
19973 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $1200 | It's the name of the long-awaited new White Ho... | Bo |
19974 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $1200 | Langdon in "Angels & Demons" is looking for <a... | an antimatter bomb |
19975 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1600 | In the 1600s most of New York State was occupi... | the Iroquois |
19976 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $1600 | Marina's dad (need a hint? he rules Tyre) | Pericles |
19977 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $1600 | Presidential kids are welcome at this New Orle... | Tulane |
19978 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $1600 | She didn't vamp it up & did a bella job as Em ... | Kristen Stewart |
19979 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $1600 | Third syllable intoned by the giant who smells... | fo |
19980 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $1600 | Much of "Angels & Demons" takes place at one o... | a conclave |
19981 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1,200 | In 1899 Secretary of State John Hay proclaimed... | open-door policy |
19982 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $2000 | Fruity surname of Peter in "A Midsummer Night'... | Quince |
19983 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $2000 | Quincy Jones, Kevin Eubanks & Branford Marsali... | Berklee |
19984 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $2000 | In 2009 she returned to being "Fast & Furious"... | Michelle Rodriguez |
19985 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $2000 | The book of Genesis says this ancient city "of... | Ur |
19986 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $2000 | "Habakkuk and the Angel" is one of a series of... | Bernini |
19987 | 5694 | 2009-05-14 | Final Jeopardy! | SCIENCE TERMS | None | In medieval England, it meant the smallest uni... | atom |
19988 | 3582 | 2000-03-14 | Jeopardy! | U.S. GEOGRAPHY | $100 | This Texas city is the largest in the U.S. to ... | Houston (Lee Brown) |
19989 | 3582 | 2000-03-14 | Jeopardy! | POP MUSIC PAIRINGS | $100 | ...& the Crickets | Buddy Holly |
19990 | 3582 | 2000-03-14 | Jeopardy! | HISTORIC PEOPLE | $100 | In the 990s this son of Erik the Red brought C... | Leif Ericson |
19991 | 3582 | 2000-03-14 | Jeopardy! | 1998 QUOTATIONS | $100 | Concerning a failed Windows 98 demonstration, ... | Bill Gates |
19992 | 3582 | 2000-03-14 | Jeopardy! | LLAMA-RAMA | $100 | This llama product is used to make hats, blank... | Wool |
19993 | 3582 | 2000-03-14 | Jeopardy! | DING DONG | $100 | In 1967 this company introduced its chocolate-... | Hostess |
19994 | 3582 | 2000-03-14 | Jeopardy! | U.S. GEOGRAPHY | $200 | Of 8, 12 or 18, the number of U.S. states that... | 18 |
19995 | 3582 | 2000-03-14 | Jeopardy! | POP MUSIC PAIRINGS | $200 | ...& the New Power Generation | Prince |
19996 | 3582 | 2000-03-14 | Jeopardy! | HISTORIC PEOPLE | $200 | In 1589 he was appointed professor of mathemat... | Galileo |
19997 | 3582 | 2000-03-14 | Jeopardy! | 1998 QUOTATIONS | $200 | Before the grand jury she said, "I'm really so... | Monica Lewinsky |
19998 | 3582 | 2000-03-14 | Jeopardy! | LLAMA-RAMA | $200 | Llamas are the heftiest South American members... | Camels |
19999 rows × 7 columns
In [26]:
jeopardy.columns
Out[26]:
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',' Question', ' Answer'],dtype='object')
In [27]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
In [31]:
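import re

def normalize_text(text):
    # Lowercase, then drop everything except letters, digits, and whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_values(text):
    # Strip the dollar sign and commas, then convert to int.
    # Values that cannot be parsed (such as None in Final Jeopardy rows) become 0.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text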
In [40]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
In [41]:
jeopardy
Out[41]:
| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |
5 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $200 | In the title of an Aesop fable, this insect sh... | the ant | in the title of an aesop fable this insect sha... | the ant | 200 |
6 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $400 | Built in 312 B.C. to link Rome & the South of ... | the Appian Way | built in 312 bc to link rome the south of ita... | the appian way | 400 |
7 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $400 | No. 8: 30 steals for the Birmingham Barons; 2,... | Michael Jordan | no 8 30 steals for the birmingham barons 2306 ... | michael jordan | 400 |
8 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $400 | In the winter of 1971-72, a record 1,122 inche... | Washington | in the winter of 197172 a record 1122 inches o... | washington | 400 |
9 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $400 | This housewares store was named for the packag... | Crate & Barrel | this housewares store was named for the packag... | crate barrel | 400 |
10 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $400 | "And away we go" | Jackie Gleason | and away we go | jackie gleason | 400 |
11 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $400 | Cows regurgitate this from the first stomach t... | the cud | cows regurgitate this from the first stomach t... | the cud | 400 |
12 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $600 | In 1000 Rajaraja I of the Cholas battled to ta... | Ceylon (or Sri Lanka) | in 1000 rajaraja i of the cholas battled to ta... | ceylon or sri lanka | 600 |
13 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $600 | No. 1: Lettered in hoops, football & lacrosse ... | Jim Brown | no 1 lettered in hoops football lacrosse at s... | jim brown | 600 |
14 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $600 | On June 28, 1994 the nat'l weather service beg... | the UV index | on june 28 1994 the natl weather service began... | the uv index | 600 |
15 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $600 | This company's Accutron watch, introduced in 1... | Bulova | this companys accutron watch introduced in 196... | bulova | 600 |
16 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $600 | Outlaw: "Murdered by a traitor and a coward wh... | Jesse James | outlaw murdered by a traitor and a coward whos... | jesse james | 600 |
17 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $600 | A small demon, or a mischievous child (who mig... | imp | a small demon or a mischievous child who might... | imp | 600 |
18 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $800 | Karl led the first of these Marxist organizati... | the International | karl led the first of these marxist organizati... | the international | 800 |
19 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $800 | No. 10: FB/LB for Columbia U. in the 1920s; MV... | (Lou) Gehrig | no 10 fblb for columbia u in the 1920s mvp for... | lou gehrig | 800 |
20 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $800 | Africa's lowest temperature was 11 degrees bel... | Morocco | africas lowest temperature was 11 degrees belo... | morocco | 800 |
21 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $800 | Edward Teller & this man partnered in 1898 to ... | (Paul) Bonwit | edward teller this man partnered in 1898 to s... | paul bonwit | 800 |
22 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $2,000 | 1939 Oscar winner: "...you are a credit to you... | Hattie McDaniel (for her role in Gone with the... | 1939 oscar winner you are a credit to your cra... | hattie mcdaniel for her role in gone with the ... | 2000 |
23 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $800 | In geologic time one of these, shorter than an... | era | in geologic time one of these shorter than an ... | era | 800 |
24 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $1000 | This Asian political party was founded in 1885... | the Congress Party | this asian political party was founded in 1885... | the congress party | 1000 |
25 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $1000 | No. 5: Only center to lead the NBA in assists;... | (Wilt) Chamberlain | no 5 only center to lead the nba in assists tr... | wilt chamberlain | 1000 |
26 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $1000 | The Kirschner brothers, Don & Bill, named this... | K2 | the kirschner brothers don bill named this sk... | k2 | 1000 |
27 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $1000 | Revolutionary War hero: "His spirit is in Verm... | Ethan Allen | revolutionary war hero his spirit is in vermon... | ethan allen | 1000 |
28 | 4680 | 2004-12-31 | Jeopardy! | 3-LETTER WORDS | $1000 | A single layer of paper, or to perform one's c... | ply | a single layer of paper or to perform ones cra... | ply | 1000 |
29 | 4680 | 2004-12-31 | Double Jeopardy! | DR. SEUSS AT THE MULTIPLEX | $400 | <a href="http://www.j-archive.com/media/2004-1... | Horton | a hrefhttpwwwjarchivecommedia20041231dj23mp3be... | horton | 400 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
19969 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1200 | In 1960 the last of these locomotives was reti... | steam engines | in 1960 the last of these locomotives was reti... | steam engines | 1200 |
19970 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $1200 | Kate: "if I be waspish, best beware my sting";... | Petruchio | kate if i be waspish best beware my sting his ... | petruchio | 1200 |
19971 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $1,500 | This private college in Northern California bo... | Stanford University | this private college in northern california bo... | stanford university | 1500 |
19972 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $1200 | She voiced Princess Pea in "The Tale of Desper... | Emma Watson | she voiced princess pea in the tale of despere... | emma watson | 1200 |
19973 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $1200 | It's the name of the long-awaited new White Ho... | Bo | its the name of the longawaited new white hous... | bo | 1200 |
19974 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $1200 | Langdon in "Angels & Demons" is looking for <a... | an antimatter bomb | langdon in angels demons is looking for a hre... | an antimatter bomb | 1200 |
19975 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1600 | In the 1600s most of New York State was occupi... | the Iroquois | in the 1600s most of new york state was occupi... | the iroquois | 1600 |
19976 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $1600 | Marina's dad (need a hint? he rules Tyre) | Pericles | marinas dad need a hint he rules tyre | pericles | 1600 |
19977 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $1600 | Presidential kids are welcome at this New Orle... | Tulane | presidential kids are welcome at this new orle... | tulane | 1600 |
19978 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $1600 | She didn't vamp it up & did a bella job as Em ... | Kristen Stewart | she didnt vamp it up did a bella job as em in... | kristen stewart | 1600 |
19979 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $1600 | Third syllable intoned by the giant who smells... | fo | third syllable intoned by the giant who smells... | fo | 1600 |
19980 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $1600 | Much of "Angels & Demons" takes place at one o... | a conclave | much of angels demons takes place at one of a... | a conclave | 1600 |
19981 | 5694 | 2009-05-14 | Double Jeopardy! | AMERICAN HISTORY | $1,200 | In 1899 Secretary of State John Hay proclaimed... | open-door policy | in 1899 secretary of state john hay proclaimed... | opendoor policy | 1200 |
19982 | 5694 | 2009-05-14 | Double Jeopardy! | MIND YOUR SHAKESPEARE "P"s & "Q"s | $2000 | Fruity surname of Peter in "A Midsummer Night'... | Quince | fruity surname of peter in a midsummer nights ... | quince | 2000 |
19983 | 5694 | 2009-05-14 | Double Jeopardy! | ALMA MATERS | $2000 | Quincy Jones, Kevin Eubanks & Branford Marsali... | Berklee | quincy jones kevin eubanks branford marsalis ... | berklee | 2000 |
19984 | 5694 | 2009-05-14 | Double Jeopardy! | ACTRESSES | $2000 | In 2009 she returned to being "Fast & Furious"... | Michelle Rodriguez | in 2009 she returned to being fast furious as... | michelle rodriguez | 2000 |
19985 | 5694 | 2009-05-14 | Double Jeopardy! | 2-LETTER WORDS | $2000 | The book of Genesis says this ancient city "of... | Ur | the book of genesis says this ancient city of ... | ur | 2000 |
19986 | 5694 | 2009-05-14 | Double Jeopardy! | ANGELS & DEMONS | $2000 | "Habakkuk and the Angel" is one of a series of... | Bernini | habakkuk and the angel is one of a series of a... | bernini | 2000 |
19987 | 5694 | 2009-05-14 | Final Jeopardy! | SCIENCE TERMS | None | In medieval England, it meant the smallest uni... | atom | in medieval england it meant the smallest unit... | atom | 0 |
19988 | 3582 | 2000-03-14 | Jeopardy! | U.S. GEOGRAPHY | $100 | This Texas city is the largest in the U.S. to ... | Houston (Lee Brown) | this texas city is the largest in the us to ha... | houston lee brown | 100 |
19989 | 3582 | 2000-03-14 | Jeopardy! | POP MUSIC PAIRINGS | $100 | ...& the Crickets | Buddy Holly | the crickets | buddy holly | 100 |
19990 | 3582 | 2000-03-14 | Jeopardy! | HISTORIC PEOPLE | $100 | In the 990s this son of Erik the Red brought C... | Leif Ericson | in the 990s this son of erik the red brought c... | leif ericson | 100 |
19991 | 3582 | 2000-03-14 | Jeopardy! | 1998 QUOTATIONS | $100 | Concerning a failed Windows 98 demonstration, ... | Bill Gates | concerning a failed windows 98 demonstration h... | bill gates | 100 |
19992 | 3582 | 2000-03-14 | Jeopardy! | LLAMA-RAMA | $100 | This llama product is used to make hats, blank... | Wool | this llama product is used to make hats blanke... | wool | 100 |
19993 | 3582 | 2000-03-14 | Jeopardy! | DING DONG | $100 | In 1967 this company introduced its chocolate-... | Hostess | in 1967 this company introduced its chocolatec... | hostess | 100 |
19994 | 3582 | 2000-03-14 | Jeopardy! | U.S. GEOGRAPHY | $200 | Of 8, 12 or 18, the number of U.S. states that... | 18 | of 8 12 or 18 the number of us states that tou... | 18 | 200 |
19995 | 3582 | 2000-03-14 | Jeopardy! | POP MUSIC PAIRINGS | $200 | ...& the New Power Generation | Prince | the new power generation | prince | 200 |
19996 | 3582 | 2000-03-14 | Jeopardy! | HISTORIC PEOPLE | $200 | In 1589 he was appointed professor of mathemat... | Galileo | in 1589 he was appointed professor of mathemat... | galileo | 200 |
19997 | 3582 | 2000-03-14 | Jeopardy! | 1998 QUOTATIONS | $200 | Before the grand jury she said, "I'm really so... | Monica Lewinsky | before the grand jury she said im really sorry... | monica lewinsky | 200 |
19998 | 3582 | 2000-03-14 | Jeopardy! | LLAMA-RAMA | $200 | Llamas are the heftiest South American members... | Camels | llamas are the heftiest south american members... | camels | 200 |
19999 rows × 10 columns
In [36]:
jeopardy["Air Date"] = pandas.to_datetime(jeopardy["Air Date"])
In [38]:
jeopardy.dtypes
Out[38]:
Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
In [51]:
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
In [53]:
jeopardy["answer_in_question"].mean()
Out[53]:
0.060493257069335872
Answer terms in the question
On average, only about 6% of the terms in an answer also appear in its question. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.
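To make the metric concrete, here is a minimal sketch of the same idea applied to two made-up question/answer pairs. The strings and the `answer_overlap` helper are hypothetical and not part of the dataset or the notebook. Unlike `count_matches`, which removes only the first "the", this version drops every "the" from the answer.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer terms (ignoring "the") that also occur in the question."""
    answer_terms = [t for t in clean_answer.split(" ") if t != "the"]
    if not answer_terms:
        return 0
    question_terms = set(clean_question.split(" "))
    matches = sum(1 for t in answer_terms if t in question_terms)
    return matches / len(answer_terms)

# Hypothetical, already-normalized question/answer pairs.
examples = [
    ("this author wrote the old man and the sea", "ernest hemingway"),
    ("the title of this novel names the old man and the sea", "the old man and the sea"),
]
scores = [answer_overlap(q, a) for q, a in examples]
print(scores)
```

The first pair scores 0 because neither answer term occurs in the question. The second scores 1 because every remaining answer term does. The 6% figure is the mean of this score across all rows.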
In [54]:
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()
Out[54]:
0.69087373156719623
Question overlap
On average, about 70% of the terms in a question (counting only words longer than five characters) have already appeared in an earlier question. This only looks at a small set of questions, and it looks at single terms rather than phrases, which makes the result relatively insignificant. Still, it does mean that it's worth looking more into the recycling of questions.
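As a rough illustration of how the overlap score builds up, the following sketch runs the same logic over three invented questions using only the standard library. The question strings are hypothetical. The order of the questions matters, because each one is compared only against the terms seen before it.

```python
# Hypothetical, already-normalized questions, processed in order.
questions = [
    "this famous physicist developed general relativity",
    "general relativity predicts these collapsed objects",
    "a famous physicist named this particle",
]

terms_seen = set()
overlaps = []
for question in questions:
    # Keep only words longer than five characters, as in the notebook.
    terms = [w for w in question.split(" ") if len(w) > 5]
    if terms:
        overlaps.append(sum(w in terms_seen for w in terms) / len(terms))
    else:
        overlaps.append(0)
    terms_seen.update(terms)

print(overlaps)
```

The first question always scores 0 because nothing has been seen yet. Later questions score higher as the set of seen terms grows, which is part of why the mean is so high on a large dataset.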
In [62]:
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)
In [84]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected
Out[84]:
[(1, 2), (0, 1), (1, 0), (0, 1), (1, 1)]
In [86]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared
Out[86]:
[(0.031881167234403623, 0.85828871632352932),
 (0.40196284612688399, 0.52607729857054686),
 (2.4877921171956752, 0.11473257634454047),
 (0.40196284612688399, 0.52607729857054686),
 (0.44487748166127949, 0.50477764875459963)]
Chi-squared results
None of the terms had a significant difference in usage between high-value and low-value rows: every p-value is above 0.05. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.
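One way to act on that suggestion is to count each term's usage first and test only the terms that occur often enough. The sketch below is an assumption-laden illustration, not the notebook's method. The rows are invented, the `MIN_COUNT` threshold of 5 is taken from the remark above, and the `chi_squared_1dof` helper is a hypothetical standard-library stand-in for `scipy.stats.chisquare` that is valid only for two categories (one degree of freedom).

```python
import math
from collections import Counter

def chi_squared_1dof(observed, expected):
    """Chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the survival function is erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical (clean_question, high_value) rows standing in for the DataFrame.
rows = (
    [("planet orbit", 1)] * 6
    + [("planet river", 0)] * 2
    + [("river valley", 0)] * 11
    + [("comet valley", 0)]
)
high_total = sum(high for _, high in rows)
low_total = len(rows) - high_total

# Count, per term, how many high-value and low-value rows contain it.
high_counts, low_counts = Counter(), Counter()
for question, high_value in rows:
    target = high_counts if high_value == 1 else low_counts
    target.update(set(question.split(" ")))

MIN_COUNT = 5  # assumed threshold: skip terms seen in fewer rows than this
results = {}
for term in set(high_counts) | set(low_counts):
    observed = (high_counts[term], low_counts[term])
    total = sum(observed)
    if total < MIN_COUNT:
        continue
    proportion = total / len(rows)
    expected = (proportion * high_total, proportion * low_total)
    results[term] = chi_squared_1dof(observed, expected)

for term, (stat, p_value) in sorted(results.items()):
    print(term, round(stat, 3), round(p_value, 4))
```

Counting all terms in a single pass also avoids calling `count_usage` once per term, each of which iterates over the whole dataset.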