Real vs. Fake News Classification with PaddleNLP (Part 1): Data EDA

2023-11-02 20:30


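I. Classifying Real and Fake US Election News with PaddleNLP

1. Introduction

News media have become the channel through which people around the world learn what is happening. People usually assume that everything reported in the news is true. In some cases, even news channels have admitted that their reporting was not as accurate as they presented it. Yet some news stories have a major impact not only on the public or the government but also on the economy. A single story can move economic indicators up or down, depending on public sentiment and the political situation.

It is therefore important to tell fake news apart from real news. Natural language processing tools can address this problem, and this article shows how to identify fake or real news from historical data.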

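2. Problem Description

For both print and digital media, the authenticity of information has become a long-standing problem affecting businesses and society. On social networks, information spreads and is amplified so quickly that distorted, inaccurate, or false information has enormous potential to cause real-world impact on millions of users within minutes. Recently, concerns about this problem have been raised, and several approaches to mitigating it have been proposed.

Throughout the history of information broadcasting, there have always been eye-catching, engaging headlines that are less than accurate, written to grab the audience's attention and sell the story. On social networking sites, however, the reach and impact of information are greatly amplified, and it moves so fast that distorted, inaccurate, or false information can affect millions of users within minutes.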

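3. Goals

  • Our primary goal is to classify the news in the dataset as fake or real.
  • Perform a thorough EDA of the news
  • Select and build a strong classification model

Data

Dataset URL: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

II. Data EDA

1. PaddleNLP Environment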

!pip install -U paddlenlp
Installing collected packages: paddlenlp
  Found existing installation: paddlenlp 2.0.1
    Uninstalling paddlenlp-2.0.1:
      Successfully uninstalled paddlenlp-2.0.1
Successfully installed paddlenlp-2.0.5

2. Unzip the Data

!unzip data/data27271/真假新闻数据集.zip

3. Import Base Libraries

# Core data packages: pandas and numpy
import pandas as pd 
import numpy as np 
# Visualization packages
import matplotlib.pyplot as plt 
from matplotlib import rcParams
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 5]
# Import the Paddle libraries
import os
import paddle
import paddle.nn.functional as F

4. Load the Data

import pandas as pd
# Read the datasets
fake_news = pd.read_csv('Fake.csv')
true_news = pd.read_csv('True.csv')
# Size and fields of the fake news dataset
print("Fake news dataset size and fields (row, column):" + str(fake_news.shape))
print(fake_news.info())
print("\n --------------------------------------- \n")
# Size and fields of the real news dataset
print("Real news dataset size and fields (row, column):" + str(true_news.shape))
print(true_news.info())
Fake news dataset size and fields (row, column):(23481, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB
None

 --------------------------------------- 

Real news dataset size and fields (row, column):(21417, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB
None

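5. Dataset Details

The data consists of two CSV files: one contains fake news and the other real news, with 23,481 fake and 21,417 real articles. The columns in the files are:

  • title: the news headline
  • text: the news content (the article body)
  • subject: the subject of the news
  • date: the date the news was published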

6. Inspect the Data

fake_news.head(5)
  | title | text | subject | date
0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017
1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017
2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017
3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017
4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017

true_news.head(5)
  | title | text | subject | date
0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017
1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017
2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017
3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017
4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017

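7. Data Preprocessing and Text Cleaning

Before running the EDA and feeding the data to a model, we need to carry out a few preprocessing steps:

7.1 Create the Target Column

Let's create a target column for the fake and real news. Here the target value is 0 for fake news and 1 for real news.

# Target variable for fake news
fake_news['output']=0
# Target variable for real news
true_news['output']=1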

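7.2 Concatenate News Title and Body

News will be classified based on both the title and the text. Handling the title and the body separately brings no benefit, so we concatenate the two columns in both datasets.

# Merge title and text into a single news column
# (no separator is inserted, so the last word of the title runs into the first word of the text)
fake_news['news']=fake_news['title']+fake_news['text']
fake_news=fake_news.drop(['title', 'text'], axis=1)
# Merge title and text into a single news column
true_news['news']=true_news['title']+true_news['text']
true_news=true_news.drop(['title', 'text'], axis=1)
# Reorder the columns
fake_news = fake_news[['subject', 'date', 'news','output']]
true_news = true_news[['subject', 'date', 'news','output']]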
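The effect of concatenating without a separator can be seen on a toy pair of strings. This is only an illustrative sketch: the title and body below are made up, and the `spaced` variant is one possible alternative, not what the article's code does.

```python
# Made-up title and body, used only to illustrate the concatenation.
title = "Pope Francis Calls For Peace"
text = "Pope Francis used his annual message"

fused = title + text          # what title + text produces
spaced = title + " " + text   # hypothetical variant with a separating space

# The last title word and the first body word become a single token.
print(fused.split()[4])
print(len(fused.split()), len(spaced.split()))
```

If word-level counts matter for later steps, adding a space between the two columns keeps the boundary words separate; note that doing so would slightly change the word frequencies reported later in this article.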

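7.3 Convert the Date Column to Datetime Format

We can use pd.to_datetime to convert the date column to the required datetime format. The date column of fake_news in particular needs attention, so let's check value_counts() to see what it contains.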

fake_news['date'].value_counts()
May 10, 2017                                                                                46
May 26, 2016                                                                                44
May 5, 2016                                                                                 44
May 6, 2016                                                                                 44
May 11, 2016                                                                                43
                                                                                            ..
November 12, 2017                                                                            1
October 22, 2017                                                                             1
Apr 2, 2015                                                                                  1
https://100percentfedup.com/video-hillary-asked-about-trump-i-just-want-to-eat-some-pie/     1
October 9, 2017                                                                              1
Name: date, Length: 1681, dtype: int64

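The date column contains links and news headlines, which will cause trouble when converting to datetime format. So let's drop those records.

# Drop rows whose date field contains a link or the string "HOST"
fake_news=fake_news[~fake_news.date.str.contains("http")]
fake_news=fake_news[~fake_news.date.str.contains("HOST")]
# Roughly equivalent alternative: keep only rows whose date contains a month name
#fake_news=fake_news[fake_news.date.str.contains("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec")]
# Only the date column of the fake news dataset has this problem.
# Now convert the date columns to datetime format
fake_news['date'] = pd.to_datetime(fake_news['date'])
true_news['date'] = pd.to_datetime(true_news['date'])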
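To make the filtering idea concrete, here is a minimal standard-library sketch that accepts only strings that parse as dates. It assumes just the two formats visible in the value_counts() output above (full and abbreviated English month names); `parse_news_date` and `DATE_FORMATS` are hypothetical names, and the article itself relies on pd.to_datetime instead.

```python
from datetime import datetime

# Assumption: only the two formats seen in the value_counts() output.
DATE_FORMATS = ("%B %d, %Y", "%b %d, %Y")

def parse_news_date(raw):
    """Return a datetime for a recognised date string, or None otherwise."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # try the next format
    return None

samples = [
    "December 31, 2017",
    "Apr 2, 2015",
    "https://100percentfedup.com/video-hillary-asked-about-trump-i-just-want-to-eat-some-pie/",
]
for raw in samples:
    print(parse_news_date(raw))
```

Rows for which such a parser returns None are the ones the "http" and "HOST" filters are meant to catch. Month-name parsing depends on the active locale, so this sketch assumes the default English names.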

7.4 Merge the Datasets

When we feed the dataset to a model, we have to provide it as a single file. It is therefore best to combine the real and fake news data first, then carry out further preprocessing and EDA on the combined set.

frames = [fake_news, true_news]
news_dataset = pd.concat(frames)
news_dataset
      | subject | date | news | output
0 | News | 2017-12-31 | Donald Trump Sends Out Embarrassing New Year’... | 0
1 | News | 2017-12-31 | Drunk Bragging Trump Staffer Started Russian ... | 0
2 | News | 2017-12-30 | Sheriff David Clarke Becomes An Internet Joke... | 0
3 | News | 2017-12-29 | Trump Is So Obsessed He Even Has Obama’s Name... | 0
4 | News | 2017-12-25 | Pope Francis Just Called Out Donald Trump Dur... | 0
... | ... | ... | ... | ...
21412 | worldnews | 2017-08-22 | 'Fully committed' NATO backs new U.S. approach... | 1
21413 | worldnews | 2017-08-22 | LexisNexis withdrew two products from Chinese ... | 1
21414 | worldnews | 2017-08-22 | Minsk cultural hub becomes haven from authorit... | 1
21415 | worldnews | 2017-08-22 | Vatican upbeat on possibility of Pope Francis ... | 1
21416 | worldnews | 2017-08-22 | Indonesia to buy $1.14 billion worth of Russia... | 1

44888 rows × 4 columns
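As a quick sanity check, the row counts reported so far can be reconciled with a few lines of arithmetic. The numbers come from the info() and concat outputs above; the variable names are illustrative, and the calculation assumes that only fake news rows were dropped, since the date filter was applied to fake_news alone.

```python
# Row counts reported earlier in this article.
fake_rows_raw = 23481
true_rows = 21417
merged_rows = 44888   # rows in news_dataset after pd.concat

# Rows removed by the date filter (applied to fake_news only).
rows_dropped = fake_rows_raw + true_rows - merged_rows
fake_rows_clean = fake_rows_raw - rows_dropped
fake_share = fake_rows_clean / merged_rows

print(rows_dropped, fake_rows_clean, round(fake_share * 100, 1))
```

The date filter therefore removed 10 rows, and the merged dataset is roughly balanced, with slightly more than half of the rows labelled fake.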

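7.5 Text Processing

This is an important stage for any text analysis application. The news contains a lot of content that is useless to the model and can hold back a machine learning model. Unless we remove it, the model will not work effectively. Let's go step by step.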

7.5.1 Punctuation Removal

import re
import string

clean_news=news_dataset.copy()

def review_cleaning(text):
    '''Make text lowercase, remove text in square brackets, remove links,
    remove punctuation and remove words containing numbers.'''
    text = str(text).lower()
    # Raw strings avoid the "invalid escape sequence" DeprecationWarning
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

clean_news['news']=clean_news['news'].apply(lambda x:review_cleaning(x))
clean_news.head()
  | subject | date | news | output
0 | News | 2017-12-31 | donald trump sends out embarrassing new year’... | 0
1 | News | 2017-12-31 | drunk bragging trump staffer started russian ... | 0
2 | News | 2017-12-30 | sheriff david clarke becomes an internet joke... | 0
3 | News | 2017-12-29 | trump is so obsessed he even has obama’s name... | 0
4 | News | 2017-12-25 | pope francis just called out donald trump dur... | 0
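To see what each substitution does, the cleaning function can be run on a single made-up string. The sketch below repeats the same regular expressions so that it is self-contained; the sample text and URL are invented for illustration.

```python
import re
import string

def review_cleaning(text):
    """Lowercase, then strip bracketed text, links, tags, punctuation,
    newlines, and any word containing a digit."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# Invented sample containing every kind of noise the function targets.
sample = "Breaking [VIDEO]: Trump's 2017 plan at https://example.com/a?b=1 <b>wins</b>!\nMore soon"
cleaned = review_cleaning(sample)
print(cleaned.split())
```

Two side effects are worth knowing about: newlines are deleted rather than replaced by a space, so words on either side of a line break are joined ("wins" and "more" become "winsmore" here), and string.punctuation covers ASCII punctuation only, which is why the curly apostrophe in "year’" survives in the table above.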

7.5.2 Stop Word Removal

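A stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine is programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We do not want these words taking up space in our database or valuable processing time. We can remove them easily by keeping a list of the words we consider stop words. NLTK (the Natural Language Toolkit) for Python ships stop word lists for 16 different languages.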

import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/aistudio/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
True
from nltk.corpus import stopwords
stop = stopwords.words('english')
clean_news['news'] = clean_news['news'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
clean_news.head()
  | subject | date | news | output
0 | News | 2017-12-31 | donald trump sends embarrassing new year’s eve... | 0
1 | News | 2017-12-31 | drunk bragging trump staffer started russian c... | 0
2 | News | 2017-12-30 | sheriff david clarke becomes internet joke thr... | 0
3 | News | 2017-12-29 | trump obsessed even obama’s name coded website... | 0
4 | News | 2017-12-25 | pope francis called donald trump christmas spe... | 0
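The filtering step itself needs nothing beyond a membership test. The sketch below uses a small hand-picked stop list instead of NLTK's so that it runs without a download; `STOP_WORDS`, `remove_stop_words`, and the sample sentence are all illustrative assumptions.

```python
# Hypothetical mini stop list; NLTK's English list is much longer.
STOP_WORDS = {"is", "so", "he", "has", "into", "his", "out", "the", "a", "an", "in"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

# Illustrative sentence, already lowercased and stripped of punctuation.
sample = "trump is so obsessed he even has obamas name coded into his website"
print(remove_stop_words(sample))
```

Storing the stop words in a set makes each lookup constant time, which can help on a corpus of this size; with a list, as returned by stopwords.words('english'), every lookup scans the list.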

8. Story Evolution and Visualization of the News

In this section we carry out exploratory data analysis on the news, such as n-gram analysis, to understand which words and contexts are most likely to appear in fake news.

8.1 Count of News Subjects

ax = sns.countplot(x="subject", data=clean_news,facecolor=(0, 0, 0, 0),linewidth=5,edgecolor=sns.color_palette("dark", 3))
# Set the labels and font size
ax.set(xlabel='Type of news', ylabel='Number of news',title='Count of news type')
ax.xaxis.get_label().set_fontsize(15)
ax.yaxis.get_label().set_fontsize(15)

[Figure: count of news by subject (output_31_1.png)]

8.2 News Subject Counts by Real/Fake Label

g = sns.catplot(x="subject", col="output",data=clean_news, kind="count",height=4, aspect=2)
# Rotate the x-axis tick labels
g.set_xticklabels(rotation=45)

[Figure: count of news by subject, split by output label (output_33_2.png)]

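Conclusions:
  • Fake news appears under every subject except politics news and world news
  • Real news appears only under politics news and world news, and in large numbers
  • This is a highly biased dataset. Given its poor quality we can expect high accuracy, but that does not mean the model is a good one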
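The last point can be illustrated with a toy calculation. If subjects never overlap between the two classes, as the plots suggest, a rule that looks only at the subject label already separates them perfectly. The subject names below come from the tables above; the rows and the `predict_from_subject` helper are made up for illustration.

```python
# Toy (subject, output) rows; invented for illustration only.
rows = [
    ("News", 0), ("News", 0), ("News", 0),
    ("politicsNews", 1), ("politicsNews", 1),
    ("worldnews", 1), ("worldnews", 1),
]

REAL_SUBJECTS = {"politicsNews", "worldnews"}

def predict_from_subject(subject):
    """Predict 1 (real) purely from the subject label, ignoring the text."""
    return 1 if subject in REAL_SUBJECTS else 0

correct = sum(predict_from_subject(subject) == label for subject, label in rows)
accuracy = correct / len(rows)
print(accuracy)
```

A model given the subject column could reach very high accuracy without learning anything about the text, so one reasonable precaution is to leave subject out of the model inputs.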

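8.3 Extract New Features from the News

Let's extract some additional features from the news, such as:

  • Polarity: a measure of the sentiment of the news
  • Review length: the length of the news (number of letters and spaces)
  • Word count: the number of words in the news

!pip install textblob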
Successfully installed textblob-0.15.3
from textblob import TextBlob
# Extract new features from the news
clean_news['polarity'] = clean_news['news'].map(lambda text: TextBlob(text).sentiment.polarity)
clean_news['review_len'] = clean_news['news'].astype(str).apply(len)
clean_news['word_count'] = clean_news['news'].apply(lambda x: len(str(x).split()))
# Distribution of the new features
plt.figure(figsize = (20, 5))
plt.style.use('seaborn-white')
plt.subplot(131)
sns.distplot(clean_news['polarity'])
fig = plt.gcf()
plt.subplot(132)
sns.distplot(clean_news['review_len'])
fig = plt.gcf()
plt.subplot(133)
sns.distplot(clean_news['word_count'])
fig = plt.gcf()

[Figure: distributions of polarity, review_len, and word_count (output_37_1.png)]

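Conclusions
  • Most polarity values are neutral, indicating neither bad news nor happy news
  • Word counts fall between 0 and 1000 and news length between 0 and 5000 characters; items approaching 1000 words are probably full articles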
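The two length features are simple enough to check by hand. The sketch below computes them for one short cleaned string in plain Python; `length_features` is a hypothetical helper that mirrors the pandas expressions above, and polarity is left out because it requires TextBlob.

```python
def length_features(text):
    """Return (review_len, word_count) for one news string."""
    text = str(text)
    review_len = len(text)           # characters, spaces included
    word_count = len(text.split())   # whitespace-separated tokens
    return review_len, word_count

sample = "pope francis called donald trump"
print(length_features(sample))
```

Because review_len counts characters and word_count counts tokens, the two features are strongly related, which is consistent with the similar shapes of their distributions.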

9. N-gram Analysis

9.1 Top 20 Words in the News

Let's look at the top 20 words in the news, which gives a quick idea of the most common topics in the dataset.

# iplot is imported from plotly.offline below, so no separate iplot package is needed
!pip install plotly 
!pip install cufflinks
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from plotly import tools
import plotly.graph_objs as go
from plotly.offline import iplot
# Get the top n words
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
# Get the 20 most common words
common_words = get_top_n_words(clean_news['news'], 20)
# Print the word frequencies
for word, freq in common_words:
    print(word, freq)
# Build a DataFrame of words and their counts
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
df1 = pd.DataFrame(common_words, columns = ['news' , 'count'])
df1.groupby('news').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 words in news')
trump 140400
said 130258
us 68081
would 55422
president 53189
people 41718
one 36146
state 33190
new 31799
also 31209
obama 29881
clinton 29003
house 28716
government 27392
donald 27376
reuters 27348
states 26331
republican 25287
could 24356
white 23823
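The counting that get_top_n_words performs can also be sketched with the standard library alone, which makes the logic easy to inspect on a tiny corpus. This is an approximation, not the article's method: the token pattern is assumed to match CountVectorizer's default (lowercased tokens of two or more word characters), and the three documents are invented.

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default tokenization.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())

def top_n_words(corpus, n):
    """Count tokens across all documents and return the n most frequent."""
    counts = Counter()
    for doc in corpus:
        counts.update(tokenize(doc))
    return counts.most_common(n)

# Invented mini corpus.
corpus = [
    "trump said president trump would visit",
    "president obama said states could decide",
    "trump white house said trump",
]
print(top_n_words(corpus, 3))
```

On the full dataset the vectorizer-based version is likely the better choice, since it works on a sparse matrix; the Counter version is mainly useful for checking the idea on small inputs.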


9.2 Conclusions

  • All of the top 20 words relate to the US government
  • They are mostly about Trump and the United States, followed by Obama
  • We can also tell that some of the news comes from Reuters

9.3 Top-N Two-Word Combinations (Bigrams) in the News

Now let's extend the exploration to the most common two-word combinations in the news.

def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_words = get_top_n_bigram(clean_news['news'], 20)
for word, freq in common_words:
    print(word, freq)
df3 = pd.DataFrame(common_words, columns = ['news' , 'count'])
df3.groupby('news').sum()['count'].sort_values(ascending=False).iplot(kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams in news')
donald trump 25059
united states 18394
white house 15485
hillary clinton 9502
new york 8110
north korea 7053
president donald 6928
image via 6188
barack obama 5603
trump said 4816
prime minister 4753
president trump 4646
supreme court 4595
last year 4560
last week 4512
said statement 4425
fox news 4074
president obama 4065
islamic state 4014
national security 3858
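A bigram is just a pair of adjacent tokens within one document, which the following standard-library sketch makes explicit. As before, the token pattern is an assumption meant to approximate CountVectorizer's default, and the corpus is invented.

```python
import re
from collections import Counter

# Assumption: approximates CountVectorizer's default tokenization.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bigrams(text):
    """Return the adjacent token pairs of one document as 'word1 word2' strings."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def top_n_bigrams(corpus, n):
    """Count bigrams per document (never across documents) and return the top n."""
    counts = Counter()
    for doc in corpus:
        counts.update(bigrams(doc))
    return counts.most_common(n)

# Invented mini corpus.
corpus = [
    "donald trump visits white house",
    "white house says donald trump agreed",
    "donald trump statement",
]
print(top_n_bigrams(corpus, 2))
```

Because stop words were removed before counting, pairs such as "said statement" in the output above join words that were not adjacent in the original sentences.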


10. Word Cloud

!pip install  wordcloud
Successfully installed wordcloud-1.8.1
from wordcloud import WordCloud,STOPWORDS
text = fake_news["news"]
# Note: str(text) is the printed form of the Series, which pandas truncates for long data
wordcloud = WordCloud(width = 3000,height = 2000,background_color = 'black',stopwords = STOPWORDS).generate(str(text))
fig = plt.figure(figsize = (40, 30),facecolor = 'k',edgecolor = 'k')
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

[Figure: word cloud of fake news (output_47_0.png)]

Findings

  • Most fake news revolves around Donald Trump and the United States
  • There is also fake news about privacy, the internet, and so on

text = true_news["news"]
# Note: str(text) is the printed form of the Series, which pandas truncates for long data
wordcloud = WordCloud(width = 3000,height = 2000,background_color = 'black',stopwords = STOPWORDS).generate(str(text))
fig = plt.figure(figsize = (40, 30),facecolor = 'k',edgecolor = 'k')
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
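One detail of the word cloud code is worth checking: str() applied to a long pandas Series returns its truncated printed form, not the full text. The sketch below shows the difference on an invented Series, assuming pandas' default display options; joining the rows is one possible way to pass the whole corpus to WordCloud.generate, not what the article's code does.

```python
import pandas as pd

# Made-up stand-in for the news column: 100 short documents.
text = pd.Series([f"headline {i} mentions topic{i}" for i in range(100)])

as_repr = str(text)                      # printed form, middle rows elided
as_corpus = " ".join(text.astype(str))   # every row joined into one string

# A word from the middle of the Series is missing from the printed form.
print("topic50" in as_repr, "topic50" in as_corpus)
```

The word clouds described in this section were therefore most likely built from only the first and last few rows of each dataset, so the findings should be read with that in mind.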

[Figure: word cloud of real news (output_49_0.png)]




http://www.chinasem.cn/article/333097
