This article covers real vs. fake news classification with PaddleNLP, part 1: exploratory data analysis (EDA).
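I. Classifying US election news as real or fake with PaddleNLP
1. Introduction
News media have become the channel through which people learn what is happening in the world, and people usually assume that everything reported is true. Yet there are cases where even news outlets have admitted that their reporting was not as accurate as they presented it. Some news stories have a major impact not only on people and governments but also on the economy: a single story can move markets up or down, depending on public sentiment and the political situation.
It is therefore important to tell fake news apart from real news. Natural language processing tools have been applied to this problem, and this article shows how to use historical data to identify whether a news item is fake or real.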
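2. Problem description
For both print and digital media, the credibility of information has long been a problem affecting businesses and society. On social networks, information spreads and is amplified so quickly that distorted, inaccurate, or false information has enormous potential to cause real-world impact on millions of users within minutes. Concerns about this problem have been raised recently, and several approaches to mitigate it have been proposed.
Throughout the history of broadcasting there have always been eye-catching, not entirely accurate headlines designed to grab the audience's attention and sell the story. Social networking sites have greatly amplified both the reach and the speed of such content.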
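3. Goals
- The main goal is to classify each news item in the dataset as fake or real.
- Perform a thorough EDA of the news.
- Select and build a strong classification model.
Data
Dataset: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset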
II. Data EDA
1. PaddleNLP environment
!pip install -U paddlenlp
Successfully installed paddlenlp-2.0.5
2. Unzip the data
!unzip data/data27271/真假新闻数据集.zip
3. Import base libraries
# Core data packages: pandas and numpy
import pandas as pd
import numpy as np
# Visualization packages
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 5]
# Import the Paddle libraries
import os
import paddle
import paddle.nn.functional as F
4. Load the data
import pandas as pd
# Read the datasets
fake_news = pd.read_csv('Fake.csv')
true_news = pd.read_csv('True.csv')
# Shape and columns of the fake news dataset
print("Fake news dataset shape (row, column):" + str(fake_news.shape))
fake_news.info()  # info() prints its report directly and returns None
print("\n --------------------------------------- \n")
# Shape and columns of the real news dataset
print("Real news dataset shape (row, column):" + str(true_news.shape))
true_news.info()
Fake news dataset shape (row, column):(23481, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB

 ---------------------------------------

Real news dataset shape (row, column):(21417, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB
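5. Dataset details
The data consists of two CSV files, one containing fake news and the other real news: 23,481 fake items and 21,417 real items. The columns are:
- title: the news headline
- text: the news body/article
- subject: the subject of the news
- date: the date the news was published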
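Before any cleaning, it can be useful to confirm that a file really exposes the four columns listed above and to count missing cells and duplicate rows. The following is a minimal sketch, not part of the original notebook: the helper name `summarize` and the three-row `sample` frame are hypothetical stand-ins for the real `Fake.csv` data.

```python
import pandas as pd

EXPECTED_COLUMNS = ["title", "text", "subject", "date"]

def summarize(df):
    """Return basic quality indicators for a news DataFrame."""
    return {
        "rows": len(df),
        "missing_columns": [c for c in EXPECTED_COLUMNS if c not in df.columns],
        "null_cells": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Hypothetical stand-in for pd.read_csv('Fake.csv')
sample = pd.DataFrame({
    "title": ["A", "B", "B"],
    "text": ["first body", "second body", "second body"],
    "subject": ["News", "News", "News"],
    "date": ["December 31, 2017", "May 10, 2017", "May 10, 2017"],
})

report = summarize(sample)
print(report)
```

On the real data one would call `summarize(fake_news)` and `summarize(true_news)`; a non-zero duplicate count would be a reason to consider `drop_duplicates()` before modelling.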
6. View the data
fake_news.head(5)
|   | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |
true_news.head(5)
|   | title | text | subject | date |
|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 |
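7. Data preprocessing and text cleaning
Before running the EDA and feeding the data to a model, a few preprocessing steps are required:
7.1 Create the target column
Create a target column for both the fake and the real news. The target value is 0 for fake news and 1 for real news.
# Target variable for fake news
fake_news['output']=0
# Target variable for real news
true_news['output']=1
7.2 Concatenate the news title and body
News is classified on the basis of both its title and its text, and handling the two separately is not expected to help here, so the two columns are concatenated in both datasets.
# Merge title and text into a single news column (no separator is inserted between them)
fake_news['news']=fake_news['title']+fake_news['text']
fake_news=fake_news.drop(['title', 'text'], axis=1)
# Merge title and text into a single news column
true_news['news']=true_news['title']+true_news['text']
true_news=true_news.drop(['title', 'text'], axis=1)
# Reorder the columns
fake_news = fake_news[['subject', 'date', 'news','output']]
true_news = true_news[['subject', 'date', 'news','output']]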
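One detail of the concatenation is worth seeing on a small example: adding two string columns with `+` joins them with no separator, so the last word of the title and the first word of the text become a single token. The sketch below uses a hypothetical two-row frame and contrasts `+` with `Series.str.cat` and an explicit space. It is offered as an optional variant; the outputs recorded in this article come from the plain `+` version.

```python
import pandas as pd

# Hypothetical two-row sample with the same column names as the real files
df = pd.DataFrame({
    "title": ["Budget fight looms", "Senate votes"],
    "text": ["WASHINGTON (Reuters) - Lawmakers met.", "The bill passed."],
})

fused = df["title"] + df["text"]                    # plain `+`: no separator
spaced = df["title"].str.cat(df["text"], sep=" ")   # explicit space between the parts

print(fused[1])
print(spaced[1])
```

With `+`, the second row reads `Senate votesThe bill passed.`, so a tokenizer later sees the fused word `votesThe`; the `sep=" "` variant keeps the two words apart.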
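7.3 Convert the date column to datetime format
pd.to_datetime can convert the date column to the required datetime format. The date column of fake_news needs a closer look first, so check value_counts() to see what it contains.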
fake_news['date'].value_counts()
May 10, 2017 46
May 26, 2016 44
May 5, 2016 44
May 6, 2016 44
May 11, 2016 43
..
November 12, 2017 1
October 22, 2017 1
Apr 2, 2015 1
https://100percentfedup.com/video-hillary-asked-about-trump-i-just-want-to-eat-some-pie/ 1
October 9, 2017 1
Name: date, Length: 1681, dtype: int64
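The date column contains links and news headlines, which will cause trouble when converting to datetime format. Remove these records from the column.
# Drop rows whose date contains a link or the string "HOST"
fake_news=fake_news[~fake_news.date.str.contains("http")]
fake_news=fake_news[~fake_news.date.str.contains("HOST")]
# Alternative that keeps only rows whose date contains a month name:
#fake_news=fake_news[fake_news.date.str.contains("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec")]
# Only the date column of the fake news dataset has this problem.
# Convert the date columns to datetime format
fake_news['date'] = pd.to_datetime(fake_news['date'])
true_news['date'] = pd.to_datetime(true_news['date'])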
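The filtering step can also be expressed as "try to parse, and treat anything unparseable as invalid". The sketch below shows that idea using only the standard library. The function name `parse_news_date` and the two format strings are assumptions based on the values visible in the `value_counts()` output ("May 10, 2017", "Apr 2, 2015"); the real column may contain further formats that this list does not cover.

```python
from datetime import datetime

# Formats inferred from the values shown above (an assumption, not an exhaustive list)
DATE_FORMATS = ("%B %d, %Y", "%b %d, %Y")

def parse_news_date(raw):
    """Return a datetime for a recognised date string, or None otherwise."""
    cleaned = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # try the next format
    return None

samples = [
    "December 31, 2017 ",
    "Apr 2, 2015",
    "https://100percentfedup.com/video-hillary-asked-about-trump-i-just-want-to-eat-some-pie/",
]
parsed = [parse_news_date(s) for s in samples]
print(parsed)
```

Rows for which the function returns `None` are the ones to drop. This avoids hard-coding substrings such as "http" and also catches malformed values that nobody anticipated.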
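7.4 Merge the datasets
The model takes a single dataset as input, so it is best to combine the real and fake news now and carry out the remaining preprocessing and EDA on the combined data.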
frames = [fake_news, true_news]
news_dataset = pd.concat(frames)
news_dataset
|   | subject | date | news | output |
|---|---|---|---|---|
| 0 | News | 2017-12-31 | Donald Trump Sends Out Embarrassing New Year’... | 0 |
| 1 | News | 2017-12-31 | Drunk Bragging Trump Staffer Started Russian ... | 0 |
| 2 | News | 2017-12-30 | Sheriff David Clarke Becomes An Internet Joke... | 0 |
| 3 | News | 2017-12-29 | Trump Is So Obsessed He Even Has Obama’s Name... | 0 |
| 4 | News | 2017-12-25 | Pope Francis Just Called Out Donald Trump Dur... | 0 |
| ... | ... | ... | ... | ... |
| 21412 | worldnews | 2017-08-22 | 'Fully committed' NATO backs new U.S. approach... | 1 |
| 21413 | worldnews | 2017-08-22 | LexisNexis withdrew two products from Chinese ... | 1 |
| 21414 | worldnews | 2017-08-22 | Minsk cultural hub becomes haven from authorit... | 1 |
| 21415 | worldnews | 2017-08-22 | Vatican upbeat on possibility of Pope Francis ... | 1 |
| 21416 | worldnews | 2017-08-22 | Indonesia to buy $1.14 billion worth of Russia... | 1 |
44888 rows × 4 columns
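The merged table keeps the row labels of the two source frames: it has 44888 rows, but its last label is 21416, so labels repeat. The sketch below uses two hypothetical miniature frames to show what `pd.concat` does by default, how `ignore_index=True` gives unique labels, and how the rows can be shuffled so that the classes are interleaved. The seed value `42` is arbitrary.

```python
import pandas as pd

# Hypothetical miniature versions of fake_news and true_news
fake = pd.DataFrame({"news": ["f0", "f1"], "output": [0, 0]})
true = pd.DataFrame({"news": ["t0", "t1", "t2"], "output": [1, 1, 1]})

stacked = pd.concat([fake, true])                        # keeps the original labels
reindexed = pd.concat([fake, true], ignore_index=True)   # fresh labels 0..n-1

# Shuffle the rows; random_state is an arbitrary seed chosen for reproducibility
shuffled = reindexed.sample(frac=1, random_state=42).reset_index(drop=True)

print(stacked.index.tolist())
print(reindexed.index.tolist())
print(shuffled["output"].value_counts().to_dict())
```

Repeated labels are harmless for the plots in this article, but label-based access such as `news_dataset.loc[0]` would return two rows, which is easy to overlook.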
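7.5 Text processing
This is an important stage for any text analysis application. News text contains a lot of noise that can hinder a machine learning model, and the model is unlikely to work well unless that noise is removed. Let's go step by step.
7.5.1 Punctuation removal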
import re
import string
clean_news=news_dataset.copy()
def review_cleaning(text):
    '''Make text lowercase, remove text in square brackets, remove links,
    remove punctuation and remove words containing numbers.'''
    text = str(text).lower()
    # Raw strings (r'...') keep the regex escapes intact and avoid
    # "DeprecationWarning: invalid escape sequence"
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text
clean_news['news']=clean_news['news'].apply(lambda x:review_cleaning(x))
clean_news.head()
|   | subject | date | news | output |
|---|---|---|---|---|
| 0 | News | 2017-12-31 | donald trump sends out embarrassing new year’... | 0 |
| 1 | News | 2017-12-31 | drunk bragging trump staffer started russian ... | 0 |
| 2 | News | 2017-12-30 | sheriff david clarke becomes an internet joke... | 0 |
| 3 | News | 2017-12-29 | trump is so obsessed he even has obama’s name... | 0 |
| 4 | News | 2017-12-25 | pope francis just called out donald trump dur... | 0 |
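The cleaned rows above still contain the typographic apostrophe in "year’" and "obama’s", because `string.punctuation` only covers ASCII punctuation. The sketch below is a variant of the cleaning function, not the notebook's own: the name `clean_text`, the extra quote characters, the newline-to-space replacement, and the final whitespace collapse are all assumptions added for illustration.

```python
import re
import string

# Typographic quotes are not part of string.punctuation, so add them explicitly
PUNCT = string.punctuation + "‘’“”"

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r"\[.*?\]", "", text)                 # text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # links
    text = re.sub(r"<.*?>+", "", text)                  # HTML tags
    text = re.sub(r"[%s]" % re.escape(PUNCT), "", text) # punctuation, incl. curly quotes
    text = re.sub(r"\n", " ", text)                     # newline -> space so words do not fuse
    text = re.sub(r"\w*\d\w*", "", text)                # words containing digits
    return re.sub(r"\s+", " ", text).strip()            # collapse leftover whitespace

sample = ("Trump's 2017 Tweet [VIDEO] <b>goes</b> viral: "
          "see https://example.com/x now!\nObama’s reply")
print(clean_text(sample))
# → trumps tweet goes viral see now obamas reply
```

If the newline were replaced with an empty string instead, "now" and "obamas" in this sample would merge into one token, which is why the variant substitutes a space.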
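7.5.2 Stopword removal
A stopword is a commonly used word (such as "the", "a", "an", "in") that search engines are programmed to ignore, both when indexing entries and when retrieving them as the result of a query. Such words should not take up space in the database or consume valuable processing time. They are easy to remove by keeping a list of the words considered stopwords. NLTK (Natural Language Toolkit) in Python ships ready-made stopword lists (for 16 different languages at the time of the original write-up).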
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data] /home/aistudio/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
True
from nltk.corpus import stopwords
# A set makes the per-word membership test fast; the result is the same as with a list
stop = set(stopwords.words('english'))
clean_news['news'] = clean_news['news'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
clean_news.head()
|   | subject | date | news | output |
|---|---|---|---|---|
| 0 | News | 2017-12-31 | donald trump sends embarrassing new year’s eve... | 0 |
| 1 | News | 2017-12-31 | drunk bragging trump staffer started russian c... | 0 |
| 2 | News | 2017-12-30 | sheriff david clarke becomes internet joke thr... | 0 |
| 3 | News | 2017-12-29 | trump obsessed even obama’s name coded website... | 0 |
| 4 | News | 2017-12-25 | pope francis called donald trump christmas spe... | 0 |
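The filtering logic itself is small enough to try without NLTK. The sketch below applies the same whitespace-split-and-filter approach with a tiny, hypothetical stopword set; the notebook uses NLTK's full English list, so its results on real text will differ from this toy list.

```python
# A tiny hypothetical stopword set; the notebook uses NLTK's English list instead
STOP_WORDS = {"the", "a", "an", "in", "is", "so", "he", "has", "out", "just"}

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

sentence = "trump is so obsessed he has obama’s name coded in the website"
print(remove_stopwords(sentence, STOP_WORDS))
# → trump obsessed obama’s name coded website
```

Because the comparison is an exact string match, the text must already be lowercased (as done in 7.5.1) for words such as "The" to be removed.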
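8. Story exploration and visualization of the news
This section completes the exploratory data analysis of the news, including n-gram analysis, to see which words and contexts are most likely to be found in fake news.
8.1 Number of news items per subject
ax = sns.countplot(x="subject", data=clean_news,
                   facecolor=(0, 0, 0, 0),
                   linewidth=5,
                   edgecolor=sns.color_palette("dark", 3))
# Set the axis labels, title and font size
ax.set(xlabel='Type of news', ylabel='Number of news',title='Count of news type')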
ax.xaxis.get_label().set_fontsize(15)
ax.yaxis.get_label().set_fontsize(15)
8.2 News subject counts by real/fake label
g = sns.catplot(x="subject", col="output",data=clean_news, kind="count",height=4, aspect=2)
# Rotate the x-axis tick labels
g.set_xticklabels(rotation=45)
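Conclusions:
- Fake news appears under every subject except the politicsNews and worldnews subjects.
- Real news appears only under politicsNews and worldnews, and in large numbers.
- This is a heavily biased dataset: the subject alone almost reveals the label. Given this poor data quality, a high accuracy can be expected, but that does not mean the model is good.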
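The bias described in these conclusions can be checked numerically with a cross-tabulation of subject against label. The sketch below uses a hypothetical six-row miniature of `clean_news` with the subject names that appear in the tables above; on the real data the same two lines would be applied to `clean_news["subject"]` and `clean_news["output"]`.

```python
import pandas as pd

# Hypothetical miniature of clean_news
df = pd.DataFrame({
    "subject": ["News", "News", "News", "politicsNews", "worldnews", "worldnews"],
    "output":  [0, 0, 0, 1, 1, 1],
})

# Rows: subjects, columns: labels (0 = fake, 1 = real), cells: counts
table = pd.crosstab(df["subject"], df["output"])
print(table)

# A subject that occurs under only one label reveals the label by itself
leaky = table[(table > 0).sum(axis=1) == 1].index.tolist()
print(leaky)
```

If every subject ends up in `leaky`, a model given the subject column could reach perfect accuracy without reading any text, which is a reason to leave that column out of the features.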
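8.3 Extracting new features from the news
Let's extract a few more features from the news text:
- Polarity: a measure of the sentiment of the news
- Review length: the length of the news (number of characters, including spaces)
- Word count: the number of words in the news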
!pip install textblob
Successfully installed textblob-0.15.3
from textblob import TextBlob
# Extract new features from the news
clean_news['polarity'] = clean_news['news'].map(lambda text: TextBlob(text).sentiment.polarity)
clean_news['review_len'] = clean_news['news'].astype(str).apply(len)
clean_news['word_count'] = clean_news['news'].apply(lambda x: len(str(x).split()))
# Distribution of the new features
plt.figure(figsize = (20, 5))
plt.style.use('seaborn-white')
plt.subplot(131)
sns.distplot(clean_news['polarity'])
fig = plt.gcf()
plt.subplot(132)
sns.distplot(clean_news['review_len'])
fig = plt.gcf()
plt.subplot(133)
sns.distplot(clean_news['word_count'])
fig = plt.gcf()
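Conclusions
- Most polarity values are neutral, indicating neither bad news nor happy news.
- Word counts range from 0 to 1,000 and news lengths from 0 to 5,000 characters; items approaching 1,000 words are probably full articles.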
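The two length features are simple enough to verify by hand on one string. The sketch below recomputes them with plain Python; the helper name `text_features` is hypothetical, and polarity is left out because it depends on TextBlob.

```python
def text_features(text):
    """Character length (including spaces) and whitespace-separated word count."""
    text = str(text)
    return {"review_len": len(text), "word_count": len(text.split())}

example = "pope francis called donald trump christmas speech"
print(text_features(example))
# → {'review_len': 49, 'word_count': 7}
```

Since both features are measured on the cleaned text (after stopword removal), they understate the length of the original articles.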
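9. N-gram analysis
9.1 Top 20 words in the news
Let's look at the 20 most frequent words in the news, which gives a quick idea of the dominant topics in the dataset.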
!pip install iplot
!pip install plotly
!pip install cufflinks
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from plotly import tools
import plotly.graph_objs as go
from plotly.offline import iplot
# Get the top-n words
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each vocabulary column over all documents
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
# Get the 20 most common words
common_words = get_top_n_words(clean_news['news'], 20)
# Print the word frequencies
for word, freq in common_words:
    print(word, freq)
# Build a DataFrame of words and their frequencies
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
df1 = pd.DataFrame(common_words, columns = ['news' , 'count'])
df1.groupby('news').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 words in news')
trump 140400
said 130258
us 68081
would 55422
president 53189
people 41718
one 36146
state 33190
new 31799
also 31209
obama 29881
clinton 29003
house 28716
government 27392
donald 27376
reuters 27348
states 26331
republican 25287
could 24356
white 23823
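The counting behind this list can be reproduced with the standard library, which helps when checking the numbers on a small sample. The sketch below uses `collections.Counter` on a hypothetical three-document corpus; the function name `top_n_words` is an assumption. It splits on whitespace only, whereas `CountVectorizer` with default settings also lowercases and ignores single-character tokens, so counts on real data can differ slightly.

```python
from collections import Counter

def top_n_words(corpus, n):
    """Count whitespace-separated tokens across all documents."""
    counts = Counter()
    for doc in corpus:
        counts.update(doc.split())
    return counts.most_common(n)

# Hypothetical miniature corpus
corpus = [
    "trump said us would",
    "president trump said",
    "trump white house",
]
print(top_n_words(corpus, 2))
# → [('trump', 3), ('said', 2)]
```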
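9.2 Conclusions
- All of the top 20 words relate to the US government.
- They are mostly about Trump and the US, followed by Obama.
- The word "reuters" in the list shows that much of the news comes from Reuters.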
9.3 Top bigrams in the news
Now let's extend the exploration to the most frequent two-word combinations (bigrams) in the news.
def get_top_n_bigram(corpus, n=None):
    # ngram_range=(2, 2) counts two-word sequences only
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_bigram(clean_news['news'], 20)
for word, freq in common_words:
    print(word, freq)
df3 = pd.DataFrame(common_words, columns = ['news' , 'count'])
df3.groupby('news').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams in news')
donald trump 25059
united states 18394
white house 15485
hillary clinton 9502
new york 8110
north korea 7053
president donald 6928
image via 6188
barack obama 5603
trump said 4816
prime minister 4753
president trump 4646
supreme court 4595
last year 4560
last week 4512
said statement 4425
fox news 4074
president obama 4065
islamic state 4014
national security 3858
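A bigram is a pair of adjacent tokens within one document, which explains why "president donald" shows up next to "donald trump": the phrase "president donald trump" contributes to both. The sketch below counts bigrams with `zip` and `Counter` on a hypothetical corpus; the function name `top_n_bigrams` is an assumption, and tokenization is plain whitespace splitting rather than `CountVectorizer`'s default pattern.

```python
from collections import Counter

def top_n_bigrams(corpus, n):
    """Count adjacent token pairs; pairs never span two documents."""
    counts = Counter()
    for doc in corpus:
        tokens = doc.split()
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts.most_common(n)

# Hypothetical miniature corpus
corpus = [
    "president donald trump said",
    "donald trump white house",
    "white house said statement",
]
print(top_n_bigrams(corpus, 2))
```

Here "donald trump" and "white house" each occur twice and every other pair once. Pairs such as "said statement" arise only because stopwords were removed earlier ("said in a statement"), so bigrams over cleaned text are not always literal phrases from the articles.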
10. Word cloud
!pip install wordcloud
Successfully installed wordcloud-1.8.1
from wordcloud import WordCloud,STOPWORDS
text = fake_news["news"]
# Note: str(text) passes the printed form of the Series to the word cloud
wordcloud = WordCloud(width = 3000,height = 2000,background_color = 'black',stopwords = STOPWORDS).generate(str(text))
fig = plt.figure(figsize = (40, 30),facecolor = 'k',edgecolor = 'k')
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
Findings
- Most fake news revolves around Donald Trump and the US.
- There is also fake news about privacy, the internet, and similar topics.
text = true_news["news"]
wordcloud = WordCloud(width = 3000,height = 2000,background_color = 'black',stopwords = STOPWORDS).generate(str(text))
fig = plt.figure(figsize = (40, 30),facecolor = 'k',edgecolor = 'k')
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
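When reading these word clouds, keep in mind what `generate(str(text))` receives: `str()` of a pandas Series is its printed summary (index labels plus the first and last rows under the default display options), not the full text of every article. The sketch below illustrates the difference on a hypothetical 200-row Series and shows `" ".join(...)` as one way to build a corpus string from all rows. It assumes pandas' default display settings.

```python
import pandas as pd

# Hypothetical Series standing in for fake_news["news"]
news = pd.Series([f"article{i} body text with several more words" for i in range(200)])

as_repr = str(news)          # printed summary: index labels, middle rows elided
as_corpus = " ".join(news)   # the full text of every row

print("article100" in as_repr)     # a middle row is missing from the printed form
print("article100" in as_corpus)   # but present in the joined corpus
print(len(as_repr), len(as_corpus))
```

Passing the joined string (ideally built from the cleaned `clean_news` text of one class) to `WordCloud.generate` would make the cloud reflect the whole dataset; the resulting picture, and possibly the findings drawn from it, may then differ from the ones reported here.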