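Automatic text summarization (automatic summarization/abstracting) is the technique of using a computer to analyze text, condense its content, and generate a summary automatically. With the Internet growing rapidly and the volume of information exploding, this technique is very useful. Tweets, as a typical example of social media content, are of great research value. This article tries applying the classic TF-IDF algorithm to tweets, extracting the most representative sentences from the original text as an automatic summary.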
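0. Getting to know the data
- id. The id that comes with the data downloaded from the Twitter API;
- topic. The result of named entity recognition, used as the topic;
- sentiment. The result of sentiment analysis, not used in this article;
- body. The tweet text, the actual object of summarization;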
id topic sentiment body
628949369883000832 @microsoft negative dear @Microsoft the newOoffice for Mac is grea...
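1. Preprocessing
The second step is cleaning the data. Intuitively, strings such as URLs and "@ …" mentions, along with titles and punctuation, contribute little to the importance of a sentence. In addition, stopwords are usually treated as noise in most NLP tasks. All of these should be removed.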
Number of sentences:158
['dear @Microsoft the newOoffice for Mac is great and all, but no Lync update?',"C'mon.","@Microsoft how about you make a system that doesn't eat my friggin discs.",'This is the 2nd time this has happened and I am so sick of it!',"I may be ignorant on this issue but... should we celebrate @Microsoft's "'parental leave changes?']
Number of unique words after filtering:591
[['dear', 'newooffice', 'mac', 'great', 'lync', 'update'],['cmon'],['microsoft', 'make', 'system', 'doesnt', 'eat', 'friggin', 'discs'],['2nd', 'time', 'happened', 'sick'],['may', 'ignorant', 'issue', 'celebrates', 'parental', 'leave', 'changes']]
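The cleaning step described above can be sketched as follows. This is a minimal illustration, not the code that produced the output shown: the function name `clean_sentence`, the regular expressions, and the tiny `STOPWORDS` set are all assumptions made for the example, and a real run would use a fuller stopword list.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "for", "is", "and", "all", "but", "no", "a", "that",
             "my", "how", "about", "you", "this", "i", "am", "so", "of", "it"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_sentence(sentence):
    """Return the lowercase content words of one tweet sentence."""
    text = URL_RE.sub(" ", sentence)     # drop URLs first, while they are intact
    text = MENTION_RE.sub(" ", text)     # drop "@user" mentions
    text = text.translate(PUNCT_TABLE)   # drop ASCII punctuation
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOPWORDS]


print(clean_sentence(
    "dear @Microsoft the newOoffice for Mac is great and all, but no Lync update?"
))
```

URLs are removed before punctuation so that the URL pattern still sees the "://" and dots it relies on.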
3. Computing the TF-IDF values
    from textacy.vsm import Vectorizer


    def tfidf(data_tokenized):
        '''Calculate the tf-idf matrix.

        :param data_tokenized: A sequence of tokenized documents, where each
            document is a sequence of (str) terms.
        :return: vectorizer, an instance of textacy.vsm.Vectorizer;
            term_matrix, the tf-idf matrix whose rows are documents and whose
            columns are terms.
        '''
        vectorizer = Vectorizer(weighting='tfidf')
        # fit_transform returns a sparse matrix; todense() converts it to a dense one
        term_matrix = vectorizer.fit_transform(data_tokenized).todense()
        return vectorizer, term_matrix
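term_matrix is a term-document matrix, also known as a "bag-of-words" representation. In this case, each row of term_matrix corresponds to one sentence and each column to one term.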
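To make the tf-idf weighting concrete, here is a small pure-Python sketch that treats each tokenized sentence as a document. It assumes the textbook variant, raw term count times log(N / df); textacy may apply smoothing or normalization, so the numbers are illustrative and will not necessarily match its output. The function name `tfidf_weights` and the toy data are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(tokenized_sentences):
    """Return one {term: tf-idf weight} dict per sentence.

    Assumed weighting: raw term count * log(N / df), where N is the number
    of sentences and df is the number of sentences containing the term.
    """
    n_sents = len(tokenized_sentences)
    # Document frequency: count each term at most once per sentence.
    doc_freq = Counter(
        term for sent in tokenized_sentences for term in set(sent)
    )
    weights = []
    for sent in tokenized_sentences:
        counts = Counter(sent)
        weights.append({
            term: count * math.log(n_sents / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy input, made up for illustration.
toy = [["mac", "update"], ["update", "system"], ["system", "discs", "discs"]]
for row in tfidf_weights(toy):
    print(row)
```

Under this variant, a term that occurs in every sentence gets weight 0, while a term that is frequent in one sentence and rare elsewhere gets the largest weight.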
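4. Extracting the most representative sentences as the summary
Because tweets are short, some widely used techniques, such as position weights and biased heading weights, are not suitable for this task. At this stage, the sentences are simply ranked by the sum of the tf-idf values of their words.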
    def rank_sentences(sents, filtered_words, vectorizer, term_matrix, top_n=3):
        '''Select the top n most important sentences.

        :param sents: a list containing sentences.
        :param filtered_words: a list of tokenized sentences, each a list of words
        :param vectorizer: instance of textacy.vsm.Vectorizer
        :param term_matrix: tf-idf matrix whose rows are documents and columns are terms
        :param top_n: the number of sentences to select
        :return: a list containing the top n important sentences
        '''
        # Look up the tf-idf value of every word in each sentence
        tfidf_sent = [[term_matrix[index, vectorizer.vocabulary[token]] for token in sent]
                      for index, sent in enumerate(filtered_words)]
        # Sum the tf-idf values to get one weight per sentence
        sent_values = [sum(sent) for sent in tfidf_sent]
        # Sort the sentences by weight in descending order
        ranked_sent = sorted(zip(sents, sent_values), key=lambda x: x[1], reverse=True)
        return [sent[0] for sent in ranked_sent[:top_n]]
["@eyesonfoxorg @Microsoft I'm still using Vista on one & Win-7 on "'another, Vista is a dinosaur, unfortunately I may use a free 10 with limits','W/ all the $$$ and drones U have working 4 U, maybe U guys could get it ''right the 1st time?',"@Lumia #Lumia @Microsoft 2nd, you guys haven't released a lumia that has a "'QHD screen, or takes video in 2k resolution yet.']
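The ranking rule, scoring each sentence by the sum of its tf-idf weights and keeping the highest scores, can be tried without textacy on hand-written numbers. The sketch below is only an illustration: `rank_by_weight_sum` is a hypothetical helper, the weights are made up, and each term is stored once per sentence, so a repeated word is not counted twice as it would be in a per-token sum.

```python
def rank_by_weight_sum(sentences, sentence_weights, top_n=3):
    """Return the top_n sentences ordered by their summed term weights.

    sentence_weights[i] maps each term of sentences[i] to its tf-idf weight.
    """
    scores = [sum(weights.values()) for weights in sentence_weights]
    ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:top_n]]


# Hypothetical sentences and weights, made up for illustration only.
sentences = ["short one", "a much longer sentence here", "middle sized text"]
weights = [
    {"short": 0.9, "one": 0.2},
    {"much": 0.5, "longer": 0.5, "sentence": 0.4, "here": 0.3},
    {"middle": 0.6, "sized": 0.6, "text": 0.1},
]
print(rank_by_weight_sum(sentences, weights, top_n=2))
```

Since tf-idf weights are non-negative, every extra term can only raise the score, so a plain sum tends to favor longer sentences.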
- Sentence Extraction by tf/idf and Position Weighting from Newspaper Articles
- Automatic Summarization
- 统计自然语言处理 (Statistical Natural Language Processing), 2nd edition