Guided Project: Winning Jeopardy

This article introduces the Guided Project: Winning Jeopardy. We hope it serves as a useful reference for solving programming problems.

https://github.com/dataquestio/solutions/blob/master/Mission210Solution.ipynb

1: Jeopardy Questions

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. If you need help at any point, you can consult our solution notebook here.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download here. Here's the beginning of the file:

[Image: the first few rows of jeopardy.csv]

As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

Show Number -- the Jeopardy episode number of the show this question was in.
Air Date -- the date the episode aired.
Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
Category -- the category of the question.
Value -- the number of dollars answering the question correctly is worth.
Question -- the text of the question.
Answer -- the text of the answer.

Instructions

Read the dataset into a Dataframe called jeopardy using Pandas.
Print out the first 5 rows of jeopardy.
Print out the columns of jeopardy using jeopardy.columns.
Some of the column names have spaces in front.
- Remove the spaces in each item in jeopardy.columns.
- Assign the result back to jeopardy.columns to fix the column names in jeopardy.
Make sure you pay close attention to the format of each column.
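
As a minimal sketch of the column clean-up step, the snippet below strips the leading spaces from a plain list of names. The list raw_columns is copied from the jeopardy.columns output shown later in this post; the variable names themselves are only illustrative.

```python
# Column names as printed by jeopardy.columns (note the leading spaces).
raw_columns = ["Show Number", " Air Date", " Round", " Category",
               " Value", " Question", " Answer"]

# str.strip() removes whitespace from both ends of each name.
clean_columns = [name.strip() for name in raw_columns]

print(clean_columns)
```

With the real Dataframe, the same idea would be applied by assigning the cleaned names back, for example jeopardy.columns = [c.strip() for c in jeopardy.columns], or equivalently jeopardy.columns = jeopardy.columns.str.strip().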

2: Normalizing Text

Before you can start doing analysis on the Jeopardy questions, you need to normalize all of the text columns (the Question and Answer columns). We covered normalization before, but the idea is to ensure that you lowercase words and remove punctuation so Don't and don't aren't considered to be different words when you compare them.

Instructions

Write a function to normalize questions and answers. It should:
- Take in a string.
- Convert the string to lowercase.
- Remove all punctuation in the string.
- Return the string.
Normalize the Question column.
- Use the Pandas apply method to apply the function to each item in the Question column.
- Assign the result to the clean_question column.
Normalize the Answer column.
- Use the Pandas apply method to apply the function to each item in the Answer column.
- Assign the result to the clean_answer column.

3: Normalizing Columns

Now that you've normalized the text columns, there are also some other columns to normalize.

The Value column should also be numeric, to allow you to manipulate it more easily. You'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, to enable you to work with it more easily.

Instructions

Write a function to normalize dollar values. It should:
- Take in a string.
- Remove any punctuation in the string.
- Convert the string to an integer.
- If the conversion has an error, assign 0 instead.
- Return the integer.
Normalize the Value column.
- Use the Pandas apply method to apply the function to each item in the Value column.
- Assign the result to the clean_value column.
Use the pandas.to_datetime function to convert the Air Date column to a datetime column.
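
The date conversion can be sketched on a tiny stand-in Dataframe. The three dates below are taken from the preview rows in this post; the frame name sample is hypothetical and only there so the snippet runs without jeopardy.csv.

```python
import pandas as pd

# Tiny stand-in for the real dataset; dates copied from the preview rows.
sample = pd.DataFrame({
    "Air Date": ["2004-12-31", "2009-05-14", "2000-03-14"],
})

# Convert the text column to a datetime column in place.
sample["Air Date"] = pd.to_datetime(sample["Air Date"])

# Datetime columns support comparisons and the .dt accessor.
earliest = sample["Air Date"].min()
years = sample["Air Date"].dt.year.tolist()
print(earliest, years)
```

Once the column holds datetimes, sorting by air date (needed later for the recycled-questions step) orders rows chronologically rather than alphabetically.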

4: Answers In Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

How often the answer is deducible from the question.
How often new questions are repeats of older questions.

You can answer the second question by seeing how often complex words (6 or more characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

Instructions

Write a function that takes in a row in jeopardy, as a Series. It should:
- Split the clean_answer column on the space character (" "), and assign to the variable split_answer.
- Split the clean_question column on the space character (" "), and assign to the variable split_question.
- Create a variable called match_count, and set it to 0.
- If "the" is in split_answer, remove it using the remove method on lists. "The" is commonly found in answers and questions, but doesn't have any meaningful use in finding the answer.
- If the length of split_answer is 0, return 0. This prevents a division by zero error later.
- Loop through each item in split_answer, and see if it occurs in split_question. If it does, add 1 to match_count.
- Divide match_count by the length of split_answer, and return the result.
Count how many times terms in clean_answer occur in clean_question.
- Use the Pandas apply method on Dataframes to apply the function to each row in jeopardy.
- Pass the axis=1 argument to apply the function across each row.
- Assign the result to the answer_in_question column.
Find the mean of the answer_in_question column using the mean method on Series.
Write up a markdown cell with a short explanation of how finding this mean might influence your studying strategy for Jeopardy.
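
The steps above could be sketched as follows. The function name count_matches and the four example rows are hypothetical (the rows are invented for illustration, not taken from the dataset); plain dicts stand in for Dataframe rows because both support row["column"] lookups.

```python
def count_matches(row):
    """Return the fraction of answer words that also appear in the question."""
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    # "the" carries no information, so drop it if present.
    if "the" in split_answer:
        split_answer.remove("the")
    # Guard against dividing by zero when nothing is left.
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)


# Hypothetical rows, invented for this sketch.
rows = [
    {"clean_question": "this road named appian was the way to rome",
     "clean_answer": "the appian way"},
    {"clean_question": "he proposed that the earth orbits the sun",
     "clean_answer": "copernicus"},
    {"clean_question": "this river flows through london",
     "clean_answer": "the river thames"},
    {"clean_question": "a very common article",
     "clean_answer": "the"},
]

scores = [count_matches(row) for row in rows]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

On the real data, the function would be applied with jeopardy.apply(count_matches, axis=1) and the result assigned to the answer_in_question column.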

5: Recycled Questions

Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.

To do this, you can:

Sort jeopardy in order of ascending air date.
Maintain a set called terms_used that will be empty initially.
Iterate through each row of jeopardy.
Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
- If it does, increment a counter.
- Add each word to terms_used.

This will enable you to check if the terms in questions have been used previously or not. Only looking at words with 6 or more characters enables you to filter out words like the and than, which are commonly used, but don't tell you a lot about a question.

Instructions

Create an empty list called question_overlap.
Create an empty set called terms_used.
Use the iterrows Dataframe method to loop through each row of jeopardy.
- Split the clean_question column of the row on the space character (" "), and assign to split_question.
- Remove any words in split_question that are less than 6 characters long.
- Set match_count to 0.
- Loop through each word in split_question.
  - If the term occurs in terms_used, add 1 to match_count.
- Add each word in split_question to terms_used using the add method on sets.
- If the length of split_question is greater than 0, divide match_count by the length of split_question.
- Append match_count to question_overlap.
Assign question_overlap to the question_overlap column of jeopardy.
Find the mean of the question_overlap column and print it.
Look at the value, and think about what this might mean for questions being recycled. Write up your thoughts in a markdown cell.
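
Here is one possible sketch of the overlap loop. It takes a plain list of already-normalized questions, assumed to be in ascending air-date order, so that it runs without the dataset; the helper name compute_overlap and the three example questions are hypothetical.

```python
def compute_overlap(clean_questions):
    """For each question, return the share of its long words seen in earlier questions."""
    question_overlap = []
    terms_used = set()
    for question in clean_questions:
        split_question = question.split(" ")
        # Keep only words with 6 or more characters.
        split_question = [word for word in split_question if len(word) > 5]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count += 1
        # Record this question's words only after counting its matches.
        for word in split_question:
            terms_used.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
        question_overlap.append(match_count)
    return question_overlap, terms_used


# Hypothetical questions, invented for this sketch.
questions = [
    "the ancient empire of persia",
    "an ancient empire in africa",
    "short words only",
]
overlap, terms = compute_overlap(questions)
print(overlap)
```

With the Dataframe, the loop header would instead be for i, row in jeopardy.iterrows(), reading row["clean_question"] on each pass.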

6: Low Value Vs High Value Questions

Let's say you only want to study material that pertains to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

Low value -- Any row where Value is less than or equal to 800.
High value -- Any row where Value is greater than 800.

You'll then be able to loop through each of the terms from the last screen, terms_used, and:

Find the number of low value questions the word occurs in.
Find the number of high value questions the word occurs in.
Find the percentage of questions the word occurs in.
Based on the percentage of questions the word occurs in, find expected counts.
Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

Instructions

Create a function that takes in a row from a Dataframe, and:
- If the clean_value column is greater than 800, assign 1 to value.
- Otherwise, assign 0 to value.
- Return value.
Determine which questions are high and low value.
- Use the Pandas apply method on Dataframes to apply the function to each row in jeopardy.
- Pass the axis=1 argument to apply the function across each row.
- Assign the result to the high_value column.
Create a function that takes in a word, and:
- Assigns 0 to low_count.
- Assigns 0 to high_count.
- Loops through each row in jeopardy using the iterrows method.
  - Split the clean_question column on the space character (" ").
  - If the word is in the split question:
    - If the high_value column is 1, add 1 to high_count.
    - Else, add 1 to low_count.
- Returns high_count and low_count. You can return multiple values by separating them with a comma.
Create an empty list called observed_expected.
Convert terms_used into a list using the list function, and assign the first 5 elements to comparison_terms.
Loop through each term in comparison_terms, and:
- Run the function on the term to get the high value and low value counts.
- Append the result of running the function (which will be a list) to observed_expected.
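
A rough sketch of these two functions is shown below. To stay runnable without the dataset it loops over a list of dicts passed in as rows, which is an assumption of this sketch; the four rows are invented, and the names high_value_flag and count_usage are illustrative.

```python
def high_value_flag(row):
    """Return 1 for questions worth more than 800 dollars, otherwise 0."""
    if row["clean_value"] > 800:
        value = 1
    else:
        value = 0
    return value


def count_usage(term, rows):
    """Count how many high value and low value questions contain the term."""
    low_count = 0
    high_count = 0
    for row in rows:
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count


# Hypothetical rows, invented for this sketch.
rows = [
    {"clean_question": "ancient empire of persia", "clean_value": 200},
    {"clean_question": "an ancient capital city", "clean_value": 1000},
    {"clean_question": "modern capital of persia", "clean_value": 2000},
    {"clean_question": "ancient ruins", "clean_value": 800},
]
for row in rows:
    row["high_value"] = high_value_flag(row)

comparison_terms = ["ancient", "persia", "capital"]
observed_expected = [count_usage(term, rows) for term in comparison_terms]
print(observed_expected)
```

Note that a value of exactly 800 falls in the low value group here, matching the "greater than 800" rule in the instructions. On the Dataframe, the inner loop would read for i, row in jeopardy.iterrows().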

7: Applying The Chi-Squared Test

Now that you've found the observed counts for a few terms, you can compute the expected counts and the chi-squared value.

Instructions

Find the number of rows in jeopardy where high_value is 1, and assign to high_value_count.
Find the number of rows in jeopardy where high_value is 0, and assign to low_value_count.
Create an empty list called chi_squared.
Loop through each list in observed_expected.
- Add up both items in the list (high and low counts) to get the total count, and assign to total.
- Divide total by the number of rows in jeopardy to get the proportion across the dataset. Assign to total_prop.
- Multiply total_prop by high_value_count to get the expected term count for high value rows.
- Multiply total_prop by low_value_count to get the expected term count for low value rows.
- Use the scipy.stats.chisquare function to compute the chi-squared value and p-value given the expected and observed counts.
- Append the results to chi_squared.
Look over the chi-squared values and the associated p-values. Are there any statistically significant results? Write up your thoughts in a markdown cell.
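
To make the arithmetic concrete, here is a sketch that computes the statistic by hand with only the standard library instead of calling scipy.stats.chisquare. With two categories there is one degree of freedom, so the p-value can be written with the complementary error function; the result should agree with what scipy.stats.chisquare returns for the same observed and expected counts. The function name and all counts below are hypothetical.

```python
import math


def chi_squared_for_term(high_count, low_count, high_value_count, low_value_count):
    """Chi-squared statistic and p-value for one term (1 degree of freedom)."""
    total = high_count + low_count
    # Proportion of all rows that contain the term.
    total_prop = total / (high_value_count + low_value_count)
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    chi2 = ((high_count - expected_high) ** 2 / expected_high
            + (low_count - expected_low) ** 2 / expected_low)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 300 high value rows, 700 low value rows.
skewed = chi_squared_for_term(10, 10, 300, 700)
balanced = chi_squared_for_term(3, 7, 300, 700)
print(skewed, balanced)
```

In the first case the term appears as often in high value questions as in low value ones even though high value rows are the minority, so the statistic is large; in the second case the observed counts match the expected counts and the statistic is essentially zero.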

8: Next Steps

That's it for the guided steps! We recommend exploring the data more on your own.

Here are some potential next steps:

Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
- Manually create a list of words to remove, like the, than, etc.
- Find a list of stopwords to remove.
- Remove words that occur in more than a certain percentage (like 5%) of questions.
Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
- Use the apply method to make the code that calculates frequencies more efficient.
- Only select terms that have high frequencies across the dataset, and ignore the others.
Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
- See which categories appear the most often.
- Find the probability of each category appearing in each round.
Use the whole Jeopardy dataset (available here) instead of the subset we used in this mission.
Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.

We recommend creating a Github repository and placing this project there. It will help other people, including employers, see your work. As you start to put multiple projects on Github, you'll have the beginnings of a strong portfolio.

You're welcome to keep working on the project here, but we recommend downloading it to your computer using the download icon above and working on it there.

We hope this guided project has been a good experience, and please email us at hello@dataquest.io if you want to share your work!

In [23]:

import pandas
import csv

jeopardy = pandas.read_csv("jeopardy.csv")
jeopardy

Out[23]:

	Show Number	Air Date	Round	Category	Value	Question	Answer
0	4680	2004-12-31	Jeopardy!	HISTORY	$200	For the last 8 years of his life, Galileo was ...	Copernicus
1	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$200	No. 2: 1912 Olympian; football star at Carlisl...	Jim Thorpe
2	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$200	The city of Yuma in this state has a record av...	Arizona
3	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$200	In 1963, live on "The Art Linkletter Show", th...	McDonald's
4	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$200	Signer of the Dec. of Indep., framer of the Co...	John Adams
5	4680	2004-12-31	Jeopardy!	3-LETTER WORDS	$200	In the title of an Aesop fable, this insect sh...	the ant
6	4680	2004-12-31	Jeopardy!	HISTORY	$400	Built in 312 B.C. to link Rome & the South of ...	the Appian Way
7	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$400	No. 8: 30 steals for the Birmingham Barons; 2,...	Michael Jordan
8	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$400	In the winter of 1971-72, a record 1,122 inche...	Washington
9	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$400	This housewares store was named for the packag...	Crate & Barrel
10	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$400	"And away we go"	Jackie Gleason
11	4680	2004-12-31	Jeopardy!	3-LETTER WORDS	$400	Cows regurgitate this from the first stomach t...	the cud
12	4680	2004-12-31	Jeopardy!	HISTORY	$600	In 1000 Rajaraja I of the Cholas battled to ta...	Ceylon (or Sri Lanka)
13	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$600	No. 1: Lettered in hoops, football & lacrosse ...	Jim Brown
14	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$600	On June 28, 1994 the nat'l weather service beg...	the UV index
15	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$600	This company's Accutron watch, introduced in 1...	Bulova
16	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$600	Outlaw: "Murdered by a traitor and a coward wh...	Jesse James
17	4680	2004-12-31	Jeopardy!	3-LETTER WORDS	$600	A small demon, or a mischievous child (who mig...	imp
18	4680	2004-12-31	Jeopardy!	HISTORY	$800	Karl led the first of these Marxist organizati...	the International
19	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$800	No. 10: FB/LB for Columbia U. in the 1920s; MV...	(Lou) Gehrig
20	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$800	Africa's lowest temperature was 11 degrees bel...	Morocco
21	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$800	Edward Teller & this man partnered in 1898 to ...	(Paul) Bonwit
22	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$2,000	1939 Oscar winner: "...you are a credit to you...	Hattie McDaniel (for her role in Gone with the...
23	4680	2004-12-31	Jeopardy!	3-LETTER WORDS	$800	In geologic time one of these, shorter than an...	era
24	4680	2004-12-31	Jeopardy!	HISTORY	$1000	This Asian political party was founded in 1885...	the Congress Party
25	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$1000	No. 5: Only center to lead the NBA in assists;...	(Wilt) Chamberlain
26	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$1000	The Kirschner brothers, Don & Bill, named this...	K2
27	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$1000	Revolutionary War hero: "His spirit is in Verm...	Ethan Allen
28	4680	2004-12-31	Jeopardy!	3-LETTER WORDS	$1000	A single layer of paper, or to perform one's c...	ply
29	4680	2004-12-31	Double Jeopardy!	DR. SEUSS AT THE MULTIPLEX	$400	<a href="http://www.j-archive.com/media/2004-1...	Horton
...	...	...	...	...	...	...	...
19969	5694	2009-05-14	Double Jeopardy!	AMERICAN HISTORY	$1200	In 1960 the last of these locomotives was reti...	steam engines
19970	5694	2009-05-14	Double Jeopardy!	MIND YOUR SHAKESPEARE "P"s & "Q"s	$1200	Kate: "if I be waspish, best beware my sting";...	Petruchio
19971	5694	2009-05-14	Double Jeopardy!	ALMA MATERS	$1,500	This private college in Northern California bo...	Stanford University
19972	5694	2009-05-14	Double Jeopardy!	ACTRESSES	$1200	She voiced Princess Pea in "The Tale of Desper...	Emma Watson
19973	5694	2009-05-14	Double Jeopardy!	2-LETTER WORDS	$1200	It's the name of the long-awaited new White Ho...	Bo
19974	5694	2009-05-14	Double Jeopardy!	ANGELS & DEMONS	$1200	Langdon in "Angels & Demons" is looking for <a...	an antimatter bomb
19975	5694	2009-05-14	Double Jeopardy!	AMERICAN HISTORY	$1600	In the 1600s most of New York State was occupi...	the Iroquois
19976	5694	2009-05-14	Double Jeopardy!	MIND YOUR SHAKESPEARE "P"s & "Q"s	$1600	Marina's dad (need a hint? he rules Tyre)	Pericles
19977	5694	2009-05-14	Double Jeopardy!	ALMA MATERS	$1600	Presidential kids are welcome at this New Orle...	Tulane
19978	5694	2009-05-14	Double Jeopardy!	ACTRESSES	$1600	She didn't vamp it up & did a bella job as Em ...	Kristen Stewart
19979	5694	2009-05-14	Double Jeopardy!	2-LETTER WORDS	$1600	Third syllable intoned by the giant who smells...	fo
19980	5694	2009-05-14	Double Jeopardy!	ANGELS & DEMONS	$1600	Much of "Angels & Demons" takes place at one o...	a conclave
19981	5694	2009-05-14	Double Jeopardy!	AMERICAN HISTORY	$1,200	In 1899 Secretary of State John Hay proclaimed...	open-door policy
19982	5694	2009-05-14	Double Jeopardy!	MIND YOUR SHAKESPEARE "P"s & "Q"s	$2000	Fruity surname of Peter in "A Midsummer Night'...	Quince
19983	5694	2009-05-14	Double Jeopardy!	ALMA MATERS	$2000	Quincy Jones, Kevin Eubanks & Branford Marsali...	Berklee
19984	5694	2009-05-14	Double Jeopardy!	ACTRESSES	$2000	In 2009 she returned to being "Fast & Furious"...	Michelle Rodriguez
19985	5694	2009-05-14	Double Jeopardy!	2-LETTER WORDS	$2000	The book of Genesis says this ancient city "of...	Ur
19986	5694	2009-05-14	Double Jeopardy!	ANGELS & DEMONS	$2000	"Habakkuk and the Angel" is one of a series of...	Bernini
19987	5694	2009-05-14	Final Jeopardy!	SCIENCE TERMS	None	In medieval England, it meant the smallest uni...	atom
19988	3582	2000-03-14	Jeopardy!	U.S. GEOGRAPHY	$100	This Texas city is the largest in the U.S. to ...	Houston (Lee Brown)
19989	3582	2000-03-14	Jeopardy!	POP MUSIC PAIRINGS	$100	...& the Crickets	Buddy Holly
19990	3582	2000-03-14	Jeopardy!	HISTORIC PEOPLE	$100	In the 990s this son of Erik the Red brought C...	Leif Ericson
19991	3582	2000-03-14	Jeopardy!	1998 QUOTATIONS	$100	Concerning a failed Windows 98 demonstration, ...	Bill Gates
19992	3582	2000-03-14	Jeopardy!	LLAMA-RAMA	$100	This llama product is used to make hats, blank...	Wool
19993	3582	2000-03-14	Jeopardy!	DING DONG	$100	In 1967 this company introduced its chocolate-...	Hostess
19994	3582	2000-03-14	Jeopardy!	U.S. GEOGRAPHY	$200	Of 8, 12 or 18, the number of U.S. states that...	18
19995	3582	2000-03-14	Jeopardy!	POP MUSIC PAIRINGS	$200	...& the New Power Generation	Prince
19996	3582	2000-03-14	Jeopardy!	HISTORIC PEOPLE	$200	In 1589 he was appointed professor of mathemat...	Galileo
19997	3582	2000-03-14	Jeopardy!	1998 QUOTATIONS	$200	Before the grand jury she said, "I'm really so...	Monica Lewinsky
19998	3582	2000-03-14	Jeopardy!	LLAMA-RAMA	$200	Llamas are the heftiest South American members...	Camels

19999 rows × 7 columns

In [26]:

jeopardy.columns

Out[26]:

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value', ' Question', ' Answer'], dtype='object')

In [27]:

jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [31]:

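import re

def normalize_text(text):
    # Lowercase, then drop every character that is not a letter, digit, or whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_values(text):
    # Strip "$" and "," so "$1,500" becomes "1500"; values such as "None" fall back to 0.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text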

In [40]:

jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [41]:

jeopardy

Out[41]:

	Show Number	Air Date	Round	Category	Value	Question	Answer	clean_question	clean_answer	clean_value
0	4680	2004-12-31	Jeopardy!	HISTORY	$200	For the last 8 years of his life, Galileo was ...	Copernicus	for the last 8 years of his life galileo was u...	copernicus	200
1	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$200</td> <td>No. 2: 1912 Olympian; football star at Carlisl...</td> <td>Jim Thorpe</td> <td>no 2 1912 olympian football star at carlisle i...</td> <td>jim thorpe</td> <td>200</td> </tr> <tr> <th>2</th> <td>4680</td> <td>2004-12-31</td> <td>Jeopardy!</td> <td>EVERYBODY TALKS ABOUT IT...</td> <td>$200	The city of Yuma in this state has a record av...	Arizona	the city of yuma in this state has a record av...	arizona	200
3	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$200	In 1963, live on "The Art Linkletter Show", th...	McDonald's	in 1963 live on the art linkletter show this c...	mcdonalds	200
4	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$200	Signer of the Dec. of Indep., framer of the Co...	John Adams	signer of the dec of indep framer of the const...	john adams	200
5	4680	2004-12-31	Jeopardy!	3-LETTER WORDS	$200</td> <td>In the title of an Aesop fable, this insect sh...</td> <td>the ant</td> <td>in the title of an aesop fable this insect sha...</td> <td>the ant</td> <td>200</td> </tr> <tr> <th>6</th> <td>4680</td> <td>2004-12-31</td> <td>Jeopardy!</td> <td>HISTORY</td> <td>$400	Built in 312 B.C. to link Rome & the South of ...	the Appian Way	built in 312 bc to link rome the south of ita...	the appian way	400
7	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$400</td> <td>No. 8: 30 steals for the Birmingham Barons; 2,...</td> <td>Michael Jordan</td> <td>no 8 30 steals for the birmingham barons 2306 ...</td> <td>michael jordan</td> <td>400</td> </tr> <tr> <th>8</th> <td>4680</td> <td>2004-12-31</td> <td>Jeopardy!</td> <td>EVERYBODY TALKS ABOUT IT...</td> <td>$400	In the winter of 1971-72, a record 1,122 inche...	Washington	in the winter of 197172 a record 1122 inches o...	washington	400
9	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$400	This housewares store was named for the packag...	Crate & Barrel	this housewares store was named for the packag...	crate barrel	400
10	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$400	"And away we go"	Jackie Gleason	and away we go	jackie gleason	400
11	4680	2004-12-31	Jeopardy!	3-LETTER WORDS	$400</td> <td>Cows regurgitate this from the first stomach t...</td> <td>the cud</td> <td>cows regurgitate this from the first stomach t...</td> <td>the cud</td> <td>400</td> </tr> <tr> <th>12</th> <td>4680</td> <td>2004-12-31</td> <td>Jeopardy!</td> <td>HISTORY</td> <td>$600	In 1000 Rajaraja I of the Cholas battled to ta...	Ceylon (or Sri Lanka)	in 1000 rajaraja i of the cholas battled to ta...	ceylon or sri lanka	600
13	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$600	No. 1: Lettered in hoops, football & lacrosse ...	Jim Brown	no 1 lettered in hoops football lacrosse at s...	jim brown	600
14	4680	2004-12-31	Jeopardy!	EVERYBODY TALKS ABOUT IT...	$600	On June 28, 1994 the nat'l weather service beg...	the UV index	on june 28 1994 the natl weather service began...	the uv index	600
15	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$600	This company's Accutron watch, introduced in 1...	Bulova	this companys accutron watch introduced in 196...	bulova	600
16	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$600	Outlaw: "Murdered by a traitor and a coward wh...	Jesse James	outlaw murdered by a traitor and a coward whos...	jesse james	600
17	4680	2004-12-31	Jeopardy!	3-LETTER WORDS	$600</td> <td>A small demon, or a mischievous child (who mig...</td> <td>imp</td> <td>a small demon or a mischievous child who might...</td> <td>imp</td> <td>600</td> </tr> <tr> <th>18</th> <td>4680</td> <td>2004-12-31</td> <td>Jeopardy!</td> <td>HISTORY</td> <td>$800	Karl led the first of these Marxist organizati...	the International	karl led the first of these marxist organizati...	the international	800
19	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$800</td> <td>No. 10: FB/LB for Columbia U. in the 1920s; MV...</td> <td>(Lou) Gehrig</td> <td>no 10 fblb for columbia u in the 1920s mvp for...</td> <td>lou gehrig</td> <td>800</td> </tr> <tr> <th>20</th> <td>4680</td> <td>2004-12-31</td> <td>Jeopardy!</td> <td>EVERYBODY TALKS ABOUT IT...</td> <td>$800	Africa's lowest temperature was 11 degrees bel...	Morocco	africas lowest temperature was 11 degrees belo...	morocco	800
21	4680	2004-12-31	Jeopardy!	THE COMPANY LINE	$800	Edward Teller & this man partnered in 1898 to ...	(Paul) Bonwit	edward teller this man partnered in 1898 to s...	paul bonwit	800
22	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$2,000	1939 Oscar winner: "...you are a credit to you...	Hattie McDaniel (for her role in Gone with the...	1939 oscar winner you are a credit to your cra...	hattie mcdaniel for her role in gone with the ...	2000
23	4680	2004-12-31	Jeopardy!	3-LETTER WORDS	$800</td> <td>In geologic time one of these, shorter than an...</td> <td>era</td> <td>in geologic time one of these shorter than an ...</td> <td>era</td> <td>800</td> </tr> <tr> <th>24</th> <td>4680</td> <td>2004-12-31</td> <td>Jeopardy!</td> <td>HISTORY</td> <td>$1000	This Asian political party was founded in 1885...	the Congress Party	this asian political party was founded in 1885...	the congress party	1000
25	4680	2004-12-31	Jeopardy!	ESPN's TOP 10 ALL-TIME ATHLETES	$1000</td> <td>No. 5: Only center to lead the NBA in assists;...</td> <td>(Wilt) Chamberlain</td> <td>no 5 only center to lead the nba in assists tr...</td> <td>wilt chamberlain</td> <td>1000</td> </tr> <tr> <th>26</th> <td>4680</td> <td>2004-12-31</td> <td>Jeopardy!</td> <td>THE COMPANY LINE</td> <td>$1000	The Kirschner brothers, Don & Bill, named this...	K2	the kirschner brothers don bill named this sk...	k2	1000
27	4680	2004-12-31	Jeopardy!	EPITAPHS & TRIBUTES	$1000</td> <td>Revolutionary War hero: "His spirit is in Verm...</td> <td>Ethan Allen</td> <td>revolutionary war hero his spirit is in vermon...</td> <td>ethan allen</td> <td>1000</td> </tr> <tr> <th>28</th> <td>4680</td> <td>2004-12-31</td> <td>Jeopardy!</td> <td>3-LETTER WORDS</td> <td>$1000	A single layer of paper, or to perform one's c...	ply	a single layer of paper or to perform ones cra...	ply	1000
29	4680	2004-12-31	Double Jeopardy!	DR. SEUSS AT THE MULTIPLEX	$400	<a href="http://www.j-archive.com/media/2004-1...	Horton	a hrefhttpwwwjarchivecommedia20041231dj23mp3be...	horton	400
...	...	...	...	...	...	...	...	...	...	...
19969	5694	2009-05-14	Double Jeopardy!	AMERICAN HISTORY	$1200	In 1960 the last of these locomotives was reti...	steam engines	in 1960 the last of these locomotives was reti...	steam engines	1200
19970	5694	2009-05-14	Double Jeopardy!	MIND YOUR SHAKESPEARE "P"s & "Q"s	$1200</td> <td>Kate: "if I be waspish, best beware my sting";...</td> <td>Petruchio</td> <td>kate if i be waspish best beware my sting his ...</td> <td>petruchio</td> <td>1200</td> </tr> <tr> <th>19971</th> <td>5694</td> <td>2009-05-14</td> <td>Double Jeopardy!</td> <td>ALMA MATERS</td> <td>$1,500	This private college in Northern California bo...	Stanford University	this private college in northern california bo...	stanford university	1500
19972	5694	2009-05-14	Double Jeopardy!	ACTRESSES	$1200</td> <td>She voiced Princess Pea in "The Tale of Desper...</td> <td>Emma Watson</td> <td>she voiced princess pea in the tale of despere...</td> <td>emma watson</td> <td>1200</td> </tr> <tr> <th>19973</th> <td>5694</td> <td>2009-05-14</td> <td>Double Jeopardy!</td> <td>2-LETTER WORDS</td> <td>$1200	It's the name of the long-awaited new White Ho...	Bo	its the name of the longawaited new white hous...	bo	1200
19974	5694	2009-05-14	Double Jeopardy!	ANGELS & DEMONS	$1200	Langdon in "Angels & Demons" is looking for <a...	an antimatter bomb	langdon in angels demons is looking for a hre...	an antimatter bomb	1200
19975	5694	2009-05-14	Double Jeopardy!	AMERICAN HISTORY	$1600	In the 1600s most of New York State was occupi...	the Iroquois	in the 1600s most of new york state was occupi...	the iroquois	1600
19976	5694	2009-05-14	Double Jeopardy!	MIND YOUR SHAKESPEARE "P"s & "Q"s	$1600</td> <td>Marina's dad (need a hint? he rules Tyre)</td> <td>Pericles</td> <td>marinas dad need a hint he rules tyre</td> <td>pericles</td> <td>1600</td> </tr> <tr> <th>19977</th> <td>5694</td> <td>2009-05-14</td> <td>Double Jeopardy!</td> <td>ALMA MATERS</td> <td>$1600	Presidential kids are welcome at this New Orle...	Tulane	presidential kids are welcome at this new orle...	tulane	1600
19978	5694	2009-05-14	Double Jeopardy!	ACTRESSES	$1600	She didn't vamp it up & did a bella job as Em ...	Kristen Stewart	she didnt vamp it up did a bella job as em in...	kristen stewart	1600
19979	5694	2009-05-14	Double Jeopardy!	2-LETTER WORDS	$1600	Third syllable intoned by the giant who smells...	fo	third syllable intoned by the giant who smells...	fo	1600
19980	5694	2009-05-14	Double Jeopardy!	ANGELS & DEMONS	$1600	Much of "Angels & Demons" takes place at one o...	a conclave	much of angels demons takes place at one of a...	a conclave	1600
19981	5694	2009-05-14	Double Jeopardy!	AMERICAN HISTORY	$1,200	In 1899 Secretary of State John Hay proclaimed...	open-door policy	in 1899 secretary of state john hay proclaimed...	opendoor policy	1200
19982	5694	2009-05-14	Double Jeopardy!	MIND YOUR SHAKESPEARE "P"s & "Q"s	$2000</td> <td>Fruity surname of Peter in "A Midsummer Night'...</td> <td>Quince</td> <td>fruity surname of peter in a midsummer nights ...</td> <td>quince</td> <td>2000</td> </tr> <tr> <th>19983</th> <td>5694</td> <td>2009-05-14</td> <td>Double Jeopardy!</td> <td>ALMA MATERS</td> <td>$2000	Quincy Jones, Kevin Eubanks & Branford Marsali...	Berklee	quincy jones kevin eubanks branford marsalis ...	berklee	2000
19984	5694	2009-05-14	Double Jeopardy!	ACTRESSES	$2000	In 2009 she returned to being "Fast & Furious"...	Michelle Rodriguez	in 2009 she returned to being fast furious as...	michelle rodriguez	2000
19985	5694	2009-05-14	Double Jeopardy!	2-LETTER WORDS	$2000	The book of Genesis says this ancient city "of...	Ur	the book of genesis says this ancient city of ...	ur	2000
19986	5694	2009-05-14	Double Jeopardy!	ANGELS & DEMONS	$2000	"Habakkuk and the Angel" is one of a series of...	Bernini	habakkuk and the angel is one of a series of a...	bernini	2000
19987	5694	2009-05-14	Final Jeopardy!	SCIENCE TERMS	None	In medieval England, it meant the smallest uni...	atom	in medieval england it meant the smallest unit...	atom	0
19988	3582	2000-03-14	Jeopardy!	U.S. GEOGRAPHY	$100	This Texas city is the largest in the U.S. to ...	Houston (Lee Brown)	this texas city is the largest in the us to ha...	houston lee brown	100
19989	3582	2000-03-14	Jeopardy!	POP MUSIC PAIRINGS	$100	...& the Crickets	Buddy Holly	the crickets	buddy holly	100
19990	3582	2000-03-14	Jeopardy!	HISTORIC PEOPLE	$100	In the 990s this son of Erik the Red brought C...	Leif Ericson	in the 990s this son of erik the red brought c...	leif ericson	100
19991	3582	2000-03-14	Jeopardy!	1998 QUOTATIONS	$100	Concerning a failed Windows 98 demonstration, ...	Bill Gates	concerning a failed windows 98 demonstration h...	bill gates	100
19992	3582	2000-03-14	Jeopardy!	LLAMA-RAMA	$100	This llama product is used to make hats, blank...	Wool	this llama product is used to make hats blanke...	wool	100
19993	3582	2000-03-14	Jeopardy!	DING DONG	$100	In 1967 this company introduced its chocolate-...	Hostess	in 1967 this company introduced its chocolatec...	hostess	100
19994	3582	2000-03-14	Jeopardy!	U.S. GEOGRAPHY	$200	Of 8, 12 or 18, the number of U.S. states that...	18	of 8 12 or 18 the number of us states that tou...	18	200
19995	3582	2000-03-14	Jeopardy!	POP MUSIC PAIRINGS	$200	...& the New Power Generation	Prince	the new power generation	prince	200
19996	3582	2000-03-14	Jeopardy!	HISTORIC PEOPLE	$200	In 1589 he was appointed professor of mathemat...	Galileo	in 1589 he was appointed professor of mathemat...	galileo	200
19997	3582	2000-03-14	Jeopardy!	1998 QUOTATIONS	$200	Before the grand jury she said, "I'm really so...	Monica Lewinsky	before the grand jury she said im really sorry...	monica lewinsky	200
19998	3582	2000-03-14	Jeopardy!	LLAMA-RAMA	$200	Llamas are the heftiest South American members...	Camels	llamas are the heftiest south american members...	camels	200

19999 rows × 10 columns

In [36]:

jeopardy["Air Date"] = pandas.to_datetime(jeopardy["Air Date"])

In [38]:

jeopardy.dtypes

Out[38]:

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

In [51]:

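def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    # "the" is too common to be a meaningful match, so drop one occurrence
    if "the" in split_answer:
        split_answer.remove("the")
    # Avoid dividing by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    # Fraction of answer words that also occur in the question
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)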

In [53]:

jeopardy["answer_in_question"].mean()

Out[53]:

0.060493257069335872

Answer terms in the question

On average, only about 6% of the words in an answer also appear in its question. That is a small share, so we probably can't count on the question itself to give away the answer. We'll probably have to study.
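To make concrete what that average measures, here is a minimal sketch that applies the same matching rule to three made-up question/answer pairs. The function name answer_word_fraction and the sample rows are illustrative assumptions; they are not part of the notebook or of jeopardy.csv.

```python
def answer_word_fraction(clean_question, clean_answer):
    """Fraction of answer words (minus one "the") found in the question."""
    split_answer = clean_answer.split(" ")
    split_question = clean_question.split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)


# Hypothetical (clean_question, clean_answer) pairs, already lowercased
# and stripped of punctuation.
sample_rows = [
    ("this planet is known as the red planet", "mars"),
    ("this river flows through london", "the river thames"),
    ("this bridge spans the golden gate strait", "the golden gate bridge"),
]

fractions = [answer_word_fraction(q, a) for q, a in sample_rows]
mean_fraction = sum(fractions) / len(fractions)
print(fractions, mean_fraction)
```

Each value is a per-row fraction, so the mean over all rows is an average share of matching answer words rather than the share of rows whose full answer appears in the question.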

In [54]:

question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only words of six or more characters to skip common short words
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    # Count words already seen in earlier questions
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

Out[54]:

0.69087373156719623

Question overlap

On average, about 69% of the longer words (six or more characters) in a question have already appeared in an earlier question in the dataset. This only covers a small sample of the full question set, and it compares single terms rather than phrases, so on its own the figure doesn't tell us much. It does suggest that the recycling of questions is worth a closer look.
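The overlap measure is easier to follow on a handful of rows. The sketch below repeats the same single-term logic on four made-up cleaned questions; the function name question_overlaps, its min_length parameter, and the sample questions are assumptions for illustration only.

```python
def question_overlaps(clean_questions, min_length=6):
    """For each question, the share of its long words seen in earlier questions."""
    terms_used = set()
    overlaps = []
    for question in clean_questions:
        # Same filter as len(q) > 5 when min_length is 6
        words = [w for w in question.split(" ") if len(w) >= min_length]
        # Count matches before adding this question's own words
        match_count = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        overlaps.append(match_count / len(words) if words else 0)
    return overlaps


# Hypothetical cleaned questions, in the order they would be processed.
sample_questions = [
    "this american president signed the declaration",
    "this american author wrote about a whale",
    "this president was also an author",
    "it is a big cat",
]

overlaps = question_overlaps(sample_questions)
print(overlaps)
```

Because terms_used only grows, later rows tend to score higher than earlier ones, so the mean depends on the order of the rows as well as on how often terms are reused.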

In [62]:

def determine_value(row):
    # Flag questions worth more than $800 as high value
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [84]:

def count_usage(term):
    # Count how often a term occurs in high-value and low-value questions
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Only the first five terms are tested here to keep the run time short
comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

Out[84]:

[(1, 2), (0, 1), (1, 0), (0, 1), (1, 1)]

In [86]:

from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    # Share of all rows that contain the term
    total_prop = total / jeopardy.shape[0]
    # Expected counts if the term were spread evenly across both groups
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

Out[86]:

[(0.031881167234403623, 0.85828871632352932),
 (0.40196284612688399, 0.52607729857054686),
 (2.4877921171956752, 0.11473257634454047),
 (0.40196284612688399, 0.52607729857054686),
 (0.44487748166127949, 0.50477764875459963)]

Chi-squared results

None of the terms showed a significant difference in usage between high-value and low-value rows. In addition, all of the observed frequencies were below 5, which makes the chi-squared test unreliable here. It would be better to rerun the test using only terms that occur more often.
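One way to act on that last point is to skip rare terms before testing. The sketch below is a standard-library stand-in for the scipy.stats.chisquare call used above, not the notebook's own code: it relies on the fact that with two categories (one degree of freedom) the p-value equals erfc(sqrt(statistic / 2)). The group totals, the term counts, the helper names, and the MIN_EXPECTED threshold are all hypothetical, and the threshold is applied to expected counts, which is a common rule of thumb rather than something the notebook specifies.

```python
import math

MIN_EXPECTED = 5  # assumed rule-of-thumb cutoff for expected counts


def expected_counts(observed, high_total, low_total):
    """Expected (high, low) counts if the term were spread evenly over all rows."""
    total_prop = sum(observed) / (high_total + low_total)
    return total_prop * high_total, total_prop * low_total


def chi_squared_two_bins(observed, expected):
    """Chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical totals and (high_count, low_count) pairs per term.
high_total, low_total = 300, 700
term_counts = {"rare": (1, 0), "common": (30, 20)}

results = {}
for term, observed in term_counts.items():
    expected = expected_counts(observed, high_total, low_total)
    if min(expected) < MIN_EXPECTED:
        continue  # too few expected occurrences for a trustworthy test
    results[term] = chi_squared_two_bins(observed, expected)

print(results)
```

On the real data, the same filter could be applied to the counts returned by count_usage before calling chisquare, so that only terms with enough occurrences are tested.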

Reposted from: https://my.oschina.net/Bettyty/blog/750943