Word2Vec (Part 2): NLP With Deep Learning with Tensorflow (CBOW)
TensorFlow actually already ships with word2vec example code, but the first time I read it I found it thoroughly confusing. On top of that, the official tutorial only covers the skip-gram implementation of word2vec, so I googled around and found these two good articles. Since there seemed to be no Chinese version, I decided to translate them in the spirit of learning: partly to deepen my own understanding, and partly to make things easier for others. This is my first translation, so corrections are welcome.
Original articles:
Word2Vec (Part 1): NLP With Deep Learning with Tensorflow (Skip-gram)
Word2Vec (Part 2): NLP With Deep Learning with Tensorflow (CBOW)
The previous article, covering the skip-gram model, is here.
Now for the walkthrough of the CBOW model.
What is CBOW?
So what is CBOW? Its full name is continuous bag-of-words. Its architecture is essentially the skip-gram model turned upside down: in skip-gram we predict the context from the target word, whereas in CBOW we predict the target word from its context.
Why use the CBOW model?
Since we already have the skip-gram model, why learn CBOW at all? The reason is that CBOW performs better, partly because its inputs are richer. In other words, given the sentence the dog barked at the mailman, skip-gram trains on pairs such as (input: 'dog', output: 'barked'), whereas CBOW trains on tuples such as (input: ['the', 'barked', 'at'], output: 'dog'). So in CBOW, dog is predicted only when the words [the, barked, at] actually appear together, unlike skip-gram, which can only learn that dog tends to appear near barked.
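To make the contrast concrete, here is a minimal plain-Python sketch (my own illustration, not the article's code; the window size and variable names are made up) that generates the training tuples each model would see for this sentence:

sentence = "the dog barked at the mailman".split()
window = 2  # context words considered on each side

# skip-gram: one (input, output) pair per (target, context) combination
skip_gram_pairs = [
    (sentence[i], sentence[j])
    for i in range(len(sentence))
    for j in range(max(0, i - window), min(len(sentence), i + window + 1))
    if j != i
]

# CBOW: one (context list, target) tuple per target word
cbow_pairs = [
    ([sentence[j]
      for j in range(max(0, i - window), min(len(sentence), i + window + 1))
      if j != i],
     sentence[i])
    for i in range(len(sentence))
]

print(skip_gram_pairs[2])  # ('dog', 'the')
print(cbow_pairs[1])       # (['the', 'barked', 'at'], 'dog')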
The CBOW Model
The conceptual model of CBOW looks like a skip-gram model flipped around. Despite appearances, though, CBOW and skip-gram are not symmetric. (The conceptual diagram appears in the original article.)
Note that the implementation diagram is not shown here, since it is very similar to skip-gram's. To turn the conceptual model into the implementation model, all we have to do is generate batches of (input, output) tuples. In other words, for each context position we process b words at a time (b being the batch size): b x word[t-2], b x word[t-1], b x word[t+1], b x word[t+2].
The idea behind CBOW is to use the average of the word vectors of all the input words as the input to the learning model.
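For example, here is a toy numpy sketch (made-up numbers, purely to show what "average word vector" means here):

import numpy as np

embedding_size = 4                               # toy dimensionality
embeddings = np.random.rand(10, embedding_size)  # hypothetical 10-word vocabulary
context_ids = [0, 3, 7]                          # IDs of the context words

# the model input is the element-wise mean of the context word vectors
avg_vector = embeddings[context_ids].mean(axis=0)
print(avg_vector.shape)  # (4,)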
Generating Data
Now, the data-generating function needs some small modifications to fit the CBOW model. Here is the modified code:
import collections
import numpy as np

# `data` (the corpus as word IDs) and `data_index` come from the same
# preprocessing step as in the skip-gram part
def generate_batch(batch_size, skip_window):
    global data_index

    span = 2 * skip_window + 1  # [ skip_window, target, skip_window ]

    # each row holds the IDs of all context words; the label is the center word
    batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)

    # fill a sliding window with the first `span` words
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    for i in range(batch_size):
        target = skip_window  # the center word is the label

        # copy every word in the window except the center one into the batch
        col_idx = 0
        for j in range(span):
            if j == span // 2:
                continue
            batch[i, col_idx] = buffer[j]
            col_idx += 1
        labels[i, 0] = buffer[target]

        # slide the window one word to the right
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    assert batch.shape[0] == batch_size and batch.shape[1] == span - 1
    return batch, labels
Notice that the shape of batch is now (b x span-1), where it was (b x 1) before, and num_skips is gone, since we use all the words in the span. Intuitively, entry (i, j) of batch can be read as the context word at offset j - skip_window (when j < skip_window) or j - skip_window + 1 (when j >= skip_window) relative to the i-th word of labels. For example, with skip_window = 1 and the input sentence the dog barked at the mailman, we get:
batch: [['the','barked'],['dog','at'],['barked','the'],['at','mailman']]
labels: ['dog','barked','at','the']
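As a sanity check, you can decode a batch back into words with something like this sketch (it assumes data and reverse_dictionary were built exactly as in the skip-gram part, with the example sentence as the corpus):

batch, labels = generate_batch(batch_size=4, skip_window=1)
print([[reverse_dictionary[w] for w in row] for row in batch])
# [['the', 'barked'], ['dog', 'at'], ['barked', 'the'], ['at', 'mailman']]
print([reverse_dictionary[w] for w in labels.reshape(-1)])
# ['dog', 'barked', 'at', 'the']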
Training the Model
Likewise, the training phase needs some adjustments, but nothing complicated: all we have to do is resize the data placeholder and write the correct symbolic operations to average the multiple inputs. Since the training process is the important part, I will break the code into small fragments and explain the key ones.
Variable Initialization
First, we change the train_dataset placeholder to shape (b x 2*skip_window) (remember, span - 1 = 2*skip_window). Everything else stays the same.
import math
import random

import numpy as np
import tensorflow as tf

if __name__ == '__main__':
    batch_size = 128
    embedding_size = 128  # dimensionality of the word vectors
    skip_window = 1       # context words on each side of the target
    valid_size = 16       # number of words used to evaluate similarity
    valid_window = 100
    # pick half the validation words from the most frequent IDs,
    # half from a moderately frequent range
    valid_examples = np.array(random.sample(range(valid_window), valid_size // 2))
    valid_examples = np.append(valid_examples,
                               random.sample(range(1000, 1000 + valid_window), valid_size // 2))
    num_sampled = 64      # number of negative samples for the sampled softmax

    graph = tf.Graph()

    with graph.as_default(), tf.device('/cpu:0'):
        # input data: each row holds the 2*skip_window context word IDs
        train_dataset = tf.placeholder(tf.int32, shape=[batch_size, 2 * skip_window])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

        # model parameters (vocabulary_size comes from the preprocessing step)
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        softmax_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
Embedding Lookup and Averaging
Here is where the big changes happen. We rewrite the embedding lookup and average the looked-up vectors correctly. In short, for each column of train_dataset (which has shape b x 2*skip_window) we look up the embedding vectors for the word IDs in that column, store them in a temporary variable (embedding_i), concatenate them into a composite tensor (embeds) of shape (b x D x 2*skip_window), and then take a reduce mean over axis 2. The result, for each batch of data, is the averaged vector of the context surrounding each word in train_labels.
        # look up each context column separately and stack the results
        embeds = None
        for i in range(2 * skip_window):
            embedding_i = tf.nn.embedding_lookup(embeddings, train_dataset[:, i])
            print('embedding %d shape: %s' % (i, embedding_i.get_shape().as_list()))
            emb_x, emb_y = embedding_i.get_shape().as_list()
            if embeds is None:
                embeds = tf.reshape(embedding_i, [emb_x, emb_y, 1])
            else:
                # TF 0.x argument order; in TF >= 1.0 this is tf.concat([...], 2)
                embeds = tf.concat(2, [embeds, tf.reshape(embedding_i, [emb_x, emb_y, 1])])

        assert embeds.get_shape().as_list()[2] == 2 * skip_window
        print("Concat embedding size: %s" % embeds.get_shape().as_list())
        # average over the context dimension: (b, D, 2*skip_window) -> (b, D)
        avg_embed = tf.reduce_mean(embeds, 2, keep_dims=False)
        print("Avg embedding size: %s" % avg_embed.get_shape().as_list())
Loss Function and Optimization
Now, compared with the skip-gram model, the CBOW model feeds the averaged vector into sampled_softmax_loss. The code barely changes.
        # sampled softmax over the vocabulary, fed with the averaged context vector
        loss = tf.reduce_mean(
            tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, avg_embed,
                                       train_labels, num_sampled, vocabulary_size))
        optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

        # cosine similarity between the validation words and every embedding
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        normalized_embeddings = embeddings / norm
        valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
        similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
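One caveat: the positional arguments above follow the old TensorFlow 0.x signature. From TensorFlow 1.0 on, sampled_softmax_loss swapped the order of labels and inputs, so the safer keyword form would be roughly:

loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=softmax_weights, biases=softmax_biases,
    labels=train_labels, inputs=avg_embed,
    num_sampled=num_sampled, num_classes=vocabulary_size))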
Running the Program
Finally, we let TensorFlow run.
    num_steps = 100001  # training for 100,000 steps, matching the results below

    with tf.Session(graph=graph) as session:
        tf.initialize_all_variables().run()  # TF 0.x; tf.global_variables_initializer() in TF >= 1.0
        print('Initialized')
        average_loss = 0
        for step in range(num_steps):
            batch_data, batch_labels = generate_batch(batch_size, skip_window)
            feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
            _, l = session.run([optimizer, loss], feed_dict=feed_dict)
            average_loss += l
            if step % 2000 == 0:
                if step > 0:
                    # estimate of the loss over the last 2000 batches
                    average_loss = average_loss / 2000
                print('Average loss at step %d: %f' % (step, average_loss))
                average_loss = 0

            if step % 10000 == 0:
                # report the words closest to each validation word
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = reverse_dictionary[valid_examples[i]]
                    top_k = 8  # number of nearest neighbors to display
                    nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                    log = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = reverse_dictionary[nearest[k]]
                        log = '%s %s,' % (log, close_word)
                    print(log)
        final_embeddings = normalized_embeddings.eval()
Results
And here are the final results we get:
Average loss at step 0: 7.687360
Nearest to he: annoying, menachem, publicize, unwise, skinny, attractors, devastating, declination,
Nearest to is: iarc, agrarianism, revoluci, bachman, distinguish, schliemann, carbons, ne,
Nearest to some: routed, oscillations, reverence, collaborating, invitational, murderous, mortimer, migratory,
Nearest to only: walkway, loud, today, headshot, foundational, asceticism, tracked, hare,
...
Nearest to i: intermediates, backed, techs, duly, inefficiencies, ibadi, creole, poured,
Nearest to bbc: mprp, catching, slavic, mol, dorian, mining, inactivity, applet,
Nearest to cost: cakes, voltages, halter, disappeared, poking, buttocks, talents, salle,
Nearest to proposed: prisoners, ecuador, sorghum, complying, saturdays, positioned, probing, observables,
Average loss at step 100000: 2.422888
Nearest to he: she, it, they, there, who, eventually, neighbors, theses,
Nearest to is: was, has, became, remains, be, becomes, seems, cetacean,
Nearest to some: many, several, certain, most, any, all, both, these,
Nearest to only: settling, orchids, commutation, until, either, first, alcohols, rabba,
...
Nearest to i: we, you, ii, iii, iv, they, t, lm,
Nearest to bbc: news, corporation, coffers, inactivity, mprp, formatted, cara, pedestrian,
Nearest to cost: cakes, length, completion, poking, measure, enforcers, parody, figurative,
Nearest to proposed: introduced, discovered, foreground, suggested, dismissed, argued, ecuador, builder,
The full code can be downloaded here:
5_word2vec_cbow.py