An Introduction to Text Representation

2024-04-02 01:32



Contents

  • 1. Introduction
  • 2. Word Representation
    • 2.1. One-hot Encoding
    • 2.2. Word Embedding
      • 2.2.1. Word2Vec
        • 2.2.1.1. Continuous Bag of Words Model (CBOW)
        • 2.2.1.2. Skip-Gram Model
  • 3. Sentence Representation
    • 3.1. Bag of Words
      • 3.1.1. Sum of One-Hot Word Vectors
      • 3.1.2. Sum of One-Hot Word Vectors Weighted by Word Counts
      • 3.1.3. Sum of One-Hot Word Vectors Weighted by TF-IDF
      • 3.1.4. Bag-of-Words with Word Embeddings
    • 3.2. Probabilistic Language Model
    • 3.3. Neural Network Language Model

1. Introduction

Before feeding text data into a model, we first need to transform it into a numerical format that the model can understand. This is called text representation, and it is a necessary pre-processing step for almost every kind of NLP task.

In terms of language granularity, text representation can be divided into word representation, sentence representation and document representation. Word representation is the foundation of the other two, so we introduce it first, followed by sentence representation in detail. Document representation can often be reduced to sentence representation, so we won't spend much time on it.

2. Word Representation

Word representation includes two categories of models: discrete word representation and distributed word representation. The major difference between the two is whether the relationships among words are considered. Discrete representation assumes that words are independent and fills in the word vector mainly with features extracted from the properties of the word itself, such as occurrence and frequency. As a result, a discrete word vector cannot capture the semantic and syntactic information of the word, and it is usually high-dimensional and sparse. In contrast, distributed representation analyzes the relationships between a word and other words and learns a dense, low-dimensional feature vector to represent the word. We will first introduce one-hot encoding, a classical and simple discrete representation. Then we'll discuss word embedding, the most popular family of distributed representation techniques in recent years.

2.1. One-hot Encoding

Given a fixed vocabulary $V=\{w_1,w_2,\cdots,w_{|V|}\}$, one-hot encoding represents a word $w_i$ with a $|V|$-dimensional vector $X$ whose $i$-th dimension is 1 and whose other dimensions are all zeros.
For example, suppose we have the following corpus (this example will also be used in later sections; we refer to it as Example 1):

"I like mathematics."
"I like science."
"You like computer science."
"I like computer and science."

Then the vocabulary is:

{I, like, mathematics, science, You, computer, and}

For the word "mathematics", the one-hot vector is:

(0,0,1,0,0,0,0)

One-hot encoding only captures a word's occurrence and its position in the vocabulary, ignoring frequency information and the co-occurrence of different words.
In practice, one-hot vectors can be stored simply as word indexes into the vocabulary, so their high dimensionality and sparsity do not cause computational problems. However, one-hot encoding is seldom used directly for word representation because it lacks semantic and syntactic information; instead, it is typically used to index (tokenize) words within other representation methods.
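As a minimal sketch of the procedure (our own illustration, not code from the article; the helper name one_hot is hypothetical), the vocabulary and one-hot vectors of Example 1 can be built in plain Python as follows:

corpus = [
    "I like mathematics",
    "I like science",
    "You like computer science",
    "I like computer and science",
]

# Collect words in order of first appearance to form the vocabulary.
vocab = []
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)
# vocab == ['I', 'like', 'mathematics', 'science', 'You', 'computer', 'and']

def one_hot(word):
    # Return the |V|-dimensional one-hot vector of `word`.
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("mathematics"))  # [0, 0, 1, 0, 0, 0, 0]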

2.2. Word Embedding

Generally, the term "word embedding" has two meanings: (1) the technique of finding a function that maps a word to a dense, multi-dimensional vector; (2) the word vector obtained by (1).
As previously mentioned, one-hot encoding has several problems: high dimensionality, sparsity and lack of semantics. The first two are easily solved in word embedding by choosing an appropriate dimensionality for the resulting vector. What about the third? We have not given an exact definition of semantics, and since the notion comes from linguistics we only offer an intuition: semantics represents the meaning of a word, and words with similar meanings should have similar semantics.
Under the hypothesis of distributional semantics (linguistic items with similar distributions have similar meanings), the problem of learning word semantics can be transformed into modeling the relations between a target word and its context, which is exactly what a statistical language model does. In fact, the earliest word embeddings were a by-product of neural network language models. In the following sections, we introduce some classical word embedding methods.

2.2.1. Word2Vec

Word2Vec is a family of approaches that learn word representations iteratively. The idea is to design a model whose parameters are the word vectors; once the model structure is fixed, we train it on a certain objective over a large corpus, and the learned parameters become our word vectors.
There are two classical algorithms for obtaining word vectors: continuous bag-of-words (CBOW) and skip-gram.

2.2.1.1. Continuous Bag of Words Model (CBOW)

The core idea of CBOW is to predict the center word from the surrounding context words. Because the order of the context words does not matter for this prediction, it is called a bag-of-words model.
First, we represent each word in the vocabulary $V$ by a one-hot vector $x^i, i=1,2,\cdots,|V|$, where $|V|$ is the size of $V$. Denote the center (target) word by $x^{(c)}$ and the context words within a window of size $2m$ by $x^{(c-m)},\cdots,x^{(c-1)},x^{(c+1)},\cdots,x^{(c+m)}$. To build the model, we create two matrices $W\in\mathbb{R}^{n\times|V|}$ and $U\in\mathbb{R}^{|V|\times n}$, where $n$ is the dimensionality of the embedding space and the $i$-th column of $W$ is the embedding vector of the $i$-th word (whose one-hot vector is $x^i$). Taking $W$ as the weights between the input layer and the hidden layer, and $U$ as the weights between the hidden layer and the output layer, we obtain a three-layer neural network.
The following steps show how the model works:

  1. Generate the one-hot vectors of the context words within a window of size $2m$: $x^{(c-m)},\cdots,x^{(c-1)},x^{(c+1)},\cdots,x^{(c+m)}$.
  2. Compute the embedding vectors of the context words: $v_{c-m}=W\cdot x^{(c-m)},\ v_{c-m+1}=W\cdot x^{(c-m+1)},\cdots,v_{c+m}=W\cdot x^{(c+m)}$.
  3. Average these embedding vectors: $\bar v=\frac{v_{c-m}+v_{c-m+1}+\cdots+v_{c+m}}{2m}\in\mathbb{R}^n$.
  4. Generate a score vector $z=U\cdot \bar v\in\mathbb{R}^{|V|}$.
  5. Turn the score vector into probabilities: $\hat y=\mathrm{softmax}(z)\in\mathbb{R}^{|V|}$.
  6. We want $\hat y$ to match the true distribution $y\in\mathbb{R}^{|V|}$, which is exactly the one-hot vector of the center word; cross-entropy can be used as the loss function (a minimal sketch of this forward pass follows the list).
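Below is a minimal NumPy sketch of the CBOW forward pass described above (our own illustration, not the article's code; the vocabulary size, embedding dimension, window size and word indices are arbitrary choices for demonstration):

import numpy as np

V, n, m = 7, 4, 2               # vocabulary size, embedding dim, half window size
rng = np.random.default_rng(0)
W = rng.normal(size=(n, V))     # input embeddings, one column per word
U = rng.normal(size=(V, n))     # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(i, size=V):
    x = np.zeros(size)
    x[i] = 1.0
    return x

context_ids = [0, 1, 3, 5]      # indices of the 2m context words
center_id = 2                   # index of the center word

v_bar = np.mean([W @ one_hot(i) for i in context_ids], axis=0)  # steps 2-3
z = U @ v_bar                                                   # step 4
y_hat = softmax(z)                                              # step 5
loss = -np.log(y_hat[center_id])                                # step 6: cross-entropy
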
2.2.1.2. Skip-Gram Model

The Skip-Gram model is similar to CBOW; the main difference is that we use the center word to predict the surrounding context words.

  1. Generate the one-hot vector of the center word: $x^{(c)}$.
  2. Compute the embedding vector of the center word: $v_c=W\cdot x^{(c)}$.
  3. Generate a score vector $z=U\cdot v_c\in\mathbb{R}^{|V|}$.
  4. Turn the score vector into probabilities: $\hat y=\mathrm{softmax}(z)\in\mathbb{R}^{|V|}$. Note that the same distribution is used to predict each of the $2m$ context positions, written as $(\hat y_{c-m},\hat y_{c-m+1},\cdots,\hat y_{c+m})$.
  5. We want $(\hat y_{c-m},\hat y_{c-m+1},\cdots,\hat y_{c+m})$ to match the true distributions $(y_{c-m},y_{c-m+1},\cdots,y_{c+m})$, which are exactly the one-hot vectors of the context words; cross-entropy can be used as the loss function (a minimal sketch follows the list).
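Reusing the hypothetical W, U, softmax, one_hot, center_id and context_ids from the CBOW sketch above, a skip-gram forward pass could look like this (again only an illustrative sketch):

# Skip-gram: predict every context word from the center word.
v_c = W @ one_hot(center_id)    # step 2: embedding of the center word
z = U @ v_c                     # step 3: score vector
y_hat = softmax(z)              # step 4: one distribution shared by all context positions

# step 5: cross-entropy loss summed over the 2m context words
loss = -sum(np.log(y_hat[j]) for j in context_ids)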

3. Sentence Representation

3.1. Bag of Words

The Bag-of-Words model treats a sentence as an unordered bag of words, so the sentence can be represented as a linear combination of its word vectors.

3.1.1. Sum of One-Hot Word Vectors

With each word of the sentence represented by one-hot encoding, the sentence vector is the sum of all the word vectors:
$$X(s)=\sum_{w_i\in s}{X(w_i)}$$
where $X(w_i)$ is the one-hot vector of word $w_i$ in sentence $s$, and $X(s)$ is the vector of $s$.
In Example 1, the sentence "I like computer and science" has the sentence vector:

(1,1,0,1,0,1,1)
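A minimal sketch of this computation (our own, reusing the hypothetical vocab list from the one-hot example above):

def sentence_vector(sentence):
    # Mark each vocabulary word that occurs in the sentence (presence only).
    vec = [0] * len(vocab)
    for word in sentence.split():
        vec[vocab.index(word)] = 1
    return vec

print(sentence_vector("I like computer and science"))  # [1, 1, 0, 1, 0, 1, 1]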

3.1.2. Sum of One-Hot Word Vectors Weighted by Word Counts

In one-hot encoding, every word has the same value, 1, in the non-zero component of its feature vector, which ignores the differing importance of words. Thus, it cannot distinguish the two sentences "I like computer and science" and "I like computer and computer science".
To account for the importance of a word, a straightforward approach is to multiply its one-hot vector by a weight: the count of the word in the sentence. The vector of "I like computer and computer science" can then be represented as:

(1,1,0,1,0,2,1)
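A small sketch of this count-weighted variant (again our own illustration, using collections.Counter and the same hypothetical vocab):

from collections import Counter

def count_vector(sentence):
    # Weight each vocabulary component by the word's count in the sentence.
    counts = Counter(sentence.split())
    return [counts[word] for word in vocab]

print(count_vector("I like computer and computer science"))  # [1, 1, 0, 1, 0, 2, 1]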

3.1.3. Sum of One-Hot Word Vectors Weighted by TF-IDF

A better choice is to use TF-IDF (Term Frequency-Inverse Document Frequency; in this section, "document" is synonymous with "sentence") as the weight, where
$$TF(t)=\frac{\text{count of word } t \text{ in the doc}}{\text{number of words in the doc}}$$
The original definition of IDF is
$$IDF(t)=\ln\frac{n}{df(t)}$$
where $n$ is the number of documents and $df(t)$ is the number of documents containing the target word $t$. To avoid division by zero, we can apply smoothing: $IDF(t)=\ln\frac{1+n}{1+df(t)}$, which acts as if there were one extra document containing every word. Furthermore, to avoid $IDF=0$, we can add one to the IDF:
$$IDF(t)=\ln\frac{1+n}{1+df(t)}+1$$
Then $\text{TF-IDF}(t)=TF(t)\cdot IDF(t)$, and finally
$$X(s)=\sum_{w_i\in s}{TF(w_i)\cdot IDF(w_i)\cdot X(w_i)}$$
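As a quick worked example under the smoothed, plus-one formula above (our own arithmetic, not from the article): in Example 1, for the word "computer" in the sentence "I like computer and science", we have $TF=\frac{1}{5}=0.2$, $n=4$ and $df=2$, so $IDF=\ln\frac{1+4}{1+2}+1\approx 1.51$ and the TF-IDF weight is about $0.30$.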

TF-IDF representation can easily be computed with the following scikit-learn classes:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["I like mathematics.", "I like science.", "You like computer science.", "I like computer and science."]  # Example 1
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

or

from sklearn.feature_extraction.text import TfidfVectorizer

transformer = TfidfVectorizer()
tfidf2 = transformer.fit_transform(corpus)
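Note that, if we read scikit-learn's documentation correctly, both classes use the smoothed, plus-one IDF given above by default (smooth_idf=True) and additionally L2-normalize each document vector, so the returned values may differ from a hand computation by a normalization factor.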

3.1.4. Bag-of-Words with Word Embeddings

In this method, each word is represented by its word embedding instead of a one-hot vector, and the sentence vector is computed as a weighted average of the word embeddings. For details on the choice of weights, see Arora, Sanjeev, Yingyu Liang, and Tengyu Ma, "A Simple but Tough-to-Beat Baseline for Sentence Embeddings," International Conference on Learning Representations, 2017.
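As a rough sketch of the idea (a plain weighted average only, not the full method of Arora et al.; the embeddings and weights below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical pre-trained embeddings (random here, purely for illustration).
embedding = {w: rng.normal(size=4) for w in ["I", "like", "computer", "and", "science"]}
# Hypothetical per-word weights, e.g. larger for rarer words.
weight = {"I": 0.2, "like": 0.2, "computer": 1.0, "and": 0.1, "science": 1.0}

def sentence_embedding(sentence):
    # Weighted average of the word embeddings of the sentence.
    words = sentence.split()
    total = sum(weight[w] * embedding[w] for w in words)
    return total / sum(weight[w] for w in words)

print(sentence_embedding("I like computer and science"))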

3.2. Probabilistic Language Model

A big drawback of Bag-of-Words models is that they discard word order. A probabilistic language model addresses this by modeling the probability of a whole word sequence, typically factored by the chain rule as $P(w_1,\cdots,w_n)=\prod_{i=1}^{n}P(w_i\mid w_1,\cdots,w_{i-1})$, with an n-gram model approximating each conditional probability using only the preceding $n-1$ words.

3.3. Neural Network Language Model
