Table of Contents
- 1. Introduction
- 2. Word Representation
- 2.1. One-hot Encoding
- 2.2. Word Embedding
- 2.2.1. Word2Vec
- 2.2.1.1. Continuous Bag of Words Model (CBOW)
- 2.2.1.2. Skip-Gram Model
- 3. Sentence Representation
- 3.1. Bag of Words
- 3.1.1. Sum of One-Hot Word Vectors
- 3.1.2. Sum of One-Hot Word Vectors Weighted by Word Counts
- 3.1.3. Sum of One-Hot Word Vectors Weighted by TF-IDF
- 3.1.4. Bag-of-Words with Word Embeddings
- 3.2. Probabilistic Language Model
- 3.3. Neural Network Language Model
1. Introduction
Before feeding our text data into a model, we first need to transform it into a numerical format that the model can understand. This is called text representation, and it is a necessary pre-processing step for almost every kind of NLP task.
From the view of language granularity, text representation can be divided into word representation, sentence representation and document representation. Word representation is the foundation of the other two, so we'll introduce it first. Then we will cover sentence representation in detail. Document representation can often be simplified to sentence representation, so we won't spend much time on it.
2. Word Representation
Word representation includes two categories of models: discrete word representation and distributed word representation. The major difference between them is whether they consider the relationships among words. Discrete representation assumes that words are independent, and it fills in the word vector mainly by extracting features from the properties of the word itself, such as occurrence and frequency. Therefore, a discrete word vector can't capture the semantic and syntactic information of the word, and it is usually high-dimensional and sparse. In contrast, distributed representation analyzes the relationships between the word and other words and learns a dense, low-dimensional feature vector to represent the word. We will first introduce one-hot encoding, a classical and simple method of discrete representation. Then we'll discuss word embedding, the most popular framework for distributed representation in recent years.
2.1. One-hot Encoding
Given a fixed vocabulary $V=\{w_1,w_2,\cdots,w_{|V|}\}$, one-hot encoding encodes a word $w_i$ with a $|V|$-dimensional vector $X$, of which the $i$-th dimension, corresponding to $w_i$, is 1 and all other dimensions are zero.
For example, suppose we have the following corpus (this example will also be used in later chapters, and we mark it as Example 1):
"I like mathematics."
"I like science."
"You like computer science."
"I like computer and science."
Then the vocabulary is:
{I, like, mathematics, science, You, computer, and}
For the word "mathematics", the one-hot vector is:
(0,0,1,0,0,0,0)
One-hot encoding only captures the information of the word’s occurrence and position in the vocabulary, neglecting the frequency information and co-occurrence of different words.
In practice, one-hot encoding can be reduced to storing the indexes of words in the vocabulary, so its high dimensionality and sparsity won't cause computational problems. However, it is seldom used directly for word representation because it lacks semantic and syntactic information. Instead, it is often used to index words in other representation methods.
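As a concrete illustration, here is a minimal sketch that builds the one-hot vectors for Example 1; the vocabulary order is fixed by hand to match the listing above, and the helper names are our own.

import numpy as np

# Vocabulary of Example 1, in the same order as listed above.
vocab = ["I", "like", "mathematics", "science", "You", "computer", "and"]
word2idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # |V|-dimensional vector with a single 1 at the word's index.
    vec = np.zeros(len(vocab), dtype=int)
    vec[word2idx[word]] = 1
    return vec

print(one_hot("mathematics"))  # [0 0 1 0 0 0 0]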
2.2. Word Embedding
Generally, the phrase "word embedding" has two meanings: 1. the technique of finding a function that maps a word to a dense, multi-dimensional vector; 2. the word vector obtained in 1.
As previously mentioned, one-hot encoding has several problems: high dimensionality, sparsity and lack of semantics. The first two are easily solved in word embedding by choosing an appropriate dimensionality for the resulting vector. What about the third? We haven't discussed the exact definition of semantics, and we won't go into it deeply since it is inherited from linguistics; instead we only give some intuition. Semantics represents the meaning of a word, and words with similar meanings should have similar semantics.
Under the distributional hypothesis, which states that linguistic items with similar distributions have similar meanings, the problem of learning word semantics can be transformed into modeling the relations between a target word and its context, which is exactly what a statistical language model does. In fact, the earliest word embeddings were a by-product of neural network language models. In the following sections, we will introduce some classical word embedding methods.
2.2.1. Word2Vec
Word2Vec is a family of approaches that learn word representations iteratively from a model. The idea is to design a model whose parameters are the word vectors. Once the model structure is fixed, we can train it on a certain objective with a large amount of data, thereby learning the word vectors.
There are two classical algorithms to get word vectors: continuous bag-of-words (CBOW) and skip-gram.
2.2.1.1. Continuous Bag of Words Model (CBOW)
The core idea of CBOW is to predict the center word based on surrounding context words. Note that the order of the context words is not important in predicting the center word, so it is called a bag-of-words model.
Firstly, we represent each word in the vocabulary $V$ by a one-hot vector $x^i, i=1,2,\cdots,|V|$, where $|V|$ is the size of $V$. Denote the center word as $x^c$, and the context words within a window of size $2m$ as $x^{(c-m)},\cdots,x^{(c-1)},x^{(c+1)},\cdots,x^{(c+m)}$. To build up the model structure, we create two matrices, $W\in R^{n\times |V|}$ and $U\in R^{|V|\times n}$, where $n$ is the dimensionality of the embedding space, and the $i$-th column of $W$ is the embedding vector of the $i$-th word (whose one-hot vector is $x^i$). Let $W$ be the weights between the input layer and the hidden layer, and $U$ be the weights between the hidden layer and the output layer; we then obtain a three-layer neural network model.
The following steps show how the model works:
- We generate the one-hot vectors of the context words within a window of size $2m$: $x^{(c-m)},\cdots,x^{(c-1)},x^{(c+1)},\cdots,x^{(c+m)}$.
- We calculate the embedding vectors of the context words: $v_{c-m}=W\cdot x^{(c-m)},\ v_{c-m+1}=W\cdot x^{(c-m+1)},\ \cdots,\ v_{c+m}=W\cdot x^{(c+m)}$.
- Average these embedding vectors: $\bar v=\frac{v_{c-m}+v_{c-m+1}+\cdots+v_{c+m}}{2m}\in R^n$.
- Generate a score vector $z=U\cdot \bar v\in R^{|V|}$.
- Turn the score vector into probabilities $\hat y=\mathrm{softmax}(z)\in R^{|V|}$.
- We desire $\hat y$ to match the true distribution $y\in R^{|V|}$, which happens to be the one-hot vector of the center word. We can choose cross-entropy as the loss function.
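These steps can be condensed into a short forward pass. The following is an illustrative NumPy sketch rather than the original word2vec implementation; the toy sizes, random initialization and function names are assumptions made for this example.

import numpy as np

V, n, m = 7, 5, 2                     # vocabulary size, embedding dimension, window size (toy values)
rng = np.random.default_rng(0)
W = rng.normal(size=(n, V))           # input embeddings; column i is the vector of word i
U = rng.normal(size=(V, n))           # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_forward(context_ids, center_id):
    v_bar = W[:, context_ids].mean(axis=1)   # average the 2m context word vectors, shape (n,)
    z = U @ v_bar                            # score vector, shape (|V|,)
    y_hat = softmax(z)                       # predicted distribution over the vocabulary
    loss = -np.log(y_hat[center_id])         # cross-entropy against the one-hot center word
    return y_hat, loss

y_hat, loss = cbow_forward(context_ids=[0, 1, 3, 5], center_id=2)
print(loss)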
2.2.1.2. Skip-Gram Model
The Skip-Gram model is similar to CBOW; the main difference is that we use the center word to predict the surrounding context words. The steps are:
- We generate the one-hot vector of the center word: $x^{(c)}$.
- We calculate the embedding vector of the center word: $v_c=W\cdot x^{(c)}$.
- Generate a score vector $z=U\cdot v_c\in R^{|V|}$.
- Turn the score vector into probabilities $\hat y=\mathrm{softmax}(z)\in R^{|V|}$. Note that this single distribution is used to score each of the $2m$ context positions: $\hat y_{c-m},\hat y_{c-m+1},\cdots,\hat y_{c+m}$.
- We desire $\hat y_{c-m},\hat y_{c-m+1},\cdots,\hat y_{c+m}$ to match the true distributions $y_{c-m},y_{c-m+1},\cdots,y_{c+m}$, which happen to be the one-hot vectors of the context words. We can choose cross-entropy as the loss function.
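For comparison, here is a minimal skip-gram forward pass under the same assumed toy setup; the single softmax distribution is scored against each of the $2m$ context words.

import numpy as np

V, n = 7, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(n, V))           # input embeddings
U = rng.normal(size=(V, n))           # output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_forward(center_id, context_ids):
    v_c = W[:, center_id]                                # embedding of the center word, shape (n,)
    y_hat = softmax(U @ v_c)                             # one distribution over the vocabulary
    loss = -sum(np.log(y_hat[j]) for j in context_ids)   # one cross-entropy term per context word
    return y_hat, loss

y_hat, loss = skipgram_forward(center_id=2, context_ids=[0, 1, 3, 5])
print(loss)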
3. Sentence Representation
3.1. Bag of Words
The Bag-of-Words model treats a sentence as a bag of its words, and the sentence can be represented as a linear combination of its word vectors.
3.1.1. Sum of One-Hot Word Vectors
With each word of the sentence represented by one-hot encoding, the sentence vector is the sum of all of the word vectors:
$X(s)=\sum_{w_i\in s}{X(w_i)}$
where $X(w_i)$ is the one-hot vector of word $w_i$ in sentence $s$, and $X(s)$ is the vector of $s$.
In Example 1, the sentence "I like computer and science" has the sentence vector:
(1,1,0,1,0,1,1)
3.1.2. Sum of One-Hot Word Vectors Weighted by Word Counts
In one-hot encoding, every word has the same value, 1, in the non-zero component of its feature vector, which ignores the importance of different words. Thus, this representation can't distinguish the two sentences "I like computer and science" and "I like computer and computer science".
A straightforward way to reflect the importance of a word is to multiply its one-hot vector by a weight: the count of the word in the sentence. Then, the vector of "I like computer and computer science" can be represented as:
(1,1,0,1,0,2,1)
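Both the plain one-hot sum of Section 3.1.1 and the count-weighted version above can be reproduced in a few lines. The sketch below reuses the hand-built vocabulary order of Example 1, a choice made for this illustration, since a library tokenizer would usually order the vocabulary differently.

import numpy as np
from collections import Counter

vocab = ["I", "like", "mathematics", "science", "You", "computer", "and"]  # Example 1, hand-ordered
word2idx = {w: i for i, w in enumerate(vocab)}

def bow_vector(sentence, use_counts=False):
    counts = Counter(sentence.split())
    vec = np.zeros(len(vocab), dtype=int)
    for word, c in counts.items():
        vec[word2idx[word]] = c if use_counts else 1
    return vec

print(bow_vector("I like computer and science"))                            # [1 1 0 1 0 1 1]
print(bow_vector("I like computer and computer science", use_counts=True))  # [1 1 0 1 0 2 1]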
3.1.3. Sum of One-Hot Word Vectors Weighted by TF-IDF
A better choice is to use TF-IDF (Term Frequency-Inverse Document Frequency; in this section a document is identified with a sentence) as the weight, where $TF(t)=\frac{\text{count of word } t \text{ in the doc}}{\text{number of words in the doc}}$.
The original definition of IDF is:
$IDF(t)=\ln\frac{n}{df(t)}$,
where $n$ is the number of documents and $df(t)$ is the number of documents containing the target word $t$. To avoid division by zero, we can apply smoothing: $IDF(t)=\ln\frac{1+n}{1+df(t)}$, which acts as if there were one extra document containing every word. Furthermore, to avoid $IDF=0$, we can add one to the IDF:
$IDF(t)=\ln\frac{1+n}{1+df(t)}+1$
Then, $\text{TF-IDF}(t)=TF(t)\cdot IDF(t)$.
Finally,
$X(s)=\sum_{w_i\in s}{TF(w_i)\cdot IDF(w_i)\cdot X(w_i)}$
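As a quick worked example with these formulas on Example 1: for the word "computer" in "I like computer and science", $n=4$ and $df(\text{computer})=2$, so $IDF=\ln\frac{1+4}{1+2}+1\approx 1.51$; $TF=\frac{1}{5}=0.2$, so the weight of the "computer" component of the sentence vector is about $0.2\times 1.51\approx 0.30$.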
TF-IDF representation can be easily implemented with the following scikit-learn classes:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["I like mathematics.", "I like science.", "You like computer science.", "I like computer and science."]  # Example 1
vectorizer = CountVectorizer()    # builds the vocabulary and counts terms
transformer = TfidfTransformer()  # re-weights the counts by IDF
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
or
from sklearn.feature_extraction.text import TfidfVectorizer

transformer = TfidfVectorizer()   # combines counting and TF-IDF weighting in one step
tfidf2 = transformer.fit_transform(corpus)
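A small caveat when comparing scikit-learn's output with the hand-derived numbers above: by default the vectorizers lowercase the text, drop single-character tokens such as "I", use raw term counts rather than length-normalized TF, and L2-normalize each row, so the exact values will differ even though the IDF formula matches the smoothed-plus-one version given earlier. To inspect the result (assuming a recent scikit-learn version):

print(transformer.get_feature_names_out())  # learned vocabulary
print(tfidf2.toarray())                     # dense TF-IDF matrix, one row per sentence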
3.1.4. Bag-of-Words with Word Embeddings
In this method, each word is represented by a word embedding instead of a one-hot vector, and the sentence vector is computed as a weighted average of the word embeddings. You can refer to Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. "A Simple but Tough-to-Beat Baseline for Sentence Embeddings." *International Conference on Learning Representations*, 2017, for more details on the weighting factors.
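As a rough sketch of the idea (plain averaging, not the SIF weighting proposed in the paper above), the snippet below averages pre-trained word vectors to obtain a sentence vector; the embedding dictionary and its dimensionality are placeholders rather than real pre-trained vectors.

import numpy as np

# Placeholder embeddings; in practice these would be loaded from word2vec / GloVe / fastText files.
rng = np.random.default_rng(0)
embeddings = {w: rng.random(50) for w in ["I", "like", "science"]}

def sentence_vector(sentence, dim=50):
    vecs = [embeddings[w] for w in sentence.split() if w in embeddings]
    if not vecs:                      # no known words: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vecs, axis=0)      # unweighted average; SIF would down-weight frequent words

print(sentence_vector("I like science").shape)  # (50,)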
3.2. Probabilistic Language Model
A big drawback of Bag-of-Words models is that they discard word order, so different sentences composed of the same words receive identical representations.
3.3. Neural Network Language Model