Lecture 9 Lexical Semantics

本文主要是介绍Lecture 9 Lexical Semantics，希望对大家解决编程问题提供一定的参考价值，需要的开发者们随着小编来一起学习吧！

- - Introduction: sentiment analysis 引言：情感分析
  - Word Semantics 单词语义
  - Word meanings 单词含义
  - WordNet
  - Synsets 同义词集
  - Noun Relations in WordNet WordNet 中的词汇关系
  - Hypernymy Chain 上位链
  - Word Similarity 单词相似度
  - Word Similarity with Paths 基于路径的单词相似度
  - Beyond Path Length 超越路径长度
  - Abstract Nodes 抽象结点
  - Concept Probability of A Node 概念概率（Concept Probability）
  - Similarity with Information Content 基于信息量的相似度
  - 回顾之前情感分析的例子
  - Word Sense Disambiguation 词义消歧
  - Supervised WSD 有监督 WSD
  - Unsupervised WSD: Lesk 较少监督的方法
  - Unsupervised WSD: Clustering 无监督的WSD：聚类
  - 其他数据库：FrameNet
  - FrameNet
  - 进入语料库

Introduction: sentiment analysis 引言：情感分析

在 NLP 中，我们为什么要关注词汇语义学？我们先来看一个情感分析的例子：假设现在我们有一个情感分析任务，我们需要预测一段给定文本的情感极性。

Bag-of-words, KNN classifier. Training data: 词袋模型，KNN分类器。训练数据
- This is a good movie -> positive
- This is a great movie -> positive
- This is a terrible film -> negative
- This is a wonderful film -> ?
Two problems here: 这里存在两个问题
- The model does not know that movie and film are synonyms. Since film appears only in negative examples, the model learns that it is a negative word. 模型不知道 movie 和 film 是同义词。由于film只在负面示例中出现，模型学习到它是一个负面词语
- wonderful is not in the vocabulary (OOV: Out-Of-Vocabulary) wonderful这个单词在词汇表中并没有出现过（OOV, Out-Of-Vocabulary）
Comparing words directly will not work. How to make sure we compare word meanings? 直接比较单词并不是一种很好的方法。我们应当如何保证我们是在比较单词的含义呢？
Solution: Add this information explicitly through a lexical database 解决方案：通过一个词汇数据库（lexical database）来显式地加入这些信息。

Lexical Database 词汇数据库

Word Semantics 单词语义

Lexical Semantics: 词汇语义
- How th meanings of words connect to one another 单词含义之间如何相互联系
- Manually constructed resources 手动构建的资源：词汇表 (lexicons)、同义词词典 (thesauri)、本体论 (ontologies) 等。
  
  我们可以用文本来描述单词的含义，我们也可以观察不同单词之间是如何相互联系的。例如：单词film和movie实际上是同义词（synonym），所以，假如我们不知道film的意思，但是我们知道movie的意思，并且假如我们还知道两者是同义词关系的话，我们就可以知道单词film的意思。我们将看到如何通过手工构建这样的词汇数据库，这些同义词词典或者本体论捕获了单词含义之间的联系。
Distributional Semantics: 分布语义学
- How words relate to each other in the text 文本中的单词之间如何互相关联。
- Automatically created resources from corpora 从语料库中自动创建资源。
  
  我们也可以用另一种方式完成同样的事情。我们的任务仍然是捕获单词的含义，但是相比雇佣语言学家来手工构建词汇数据库，我们可以尝试从语料库中直接学习单词含义。我们尝试利用机器学习或者语料库的一些统计学方法来观察单词之间是如何互相关联的，而不是从语言学专家那里直接得到相关信息。

Word meanings 单词含义

物理或社交世界中的被引用的对象
- 但通常在文本分析中没有用
  
  回忆你小时候尝试学习一个新单词的场景，对于人类而言，单词的含义包含了对于物理世界的引用。例如：当你学习 dog（狗） 这个单词时，你会问自己，什么是dog？你不会仅通过文本或者口头描述来学习这个单词，而是通过观察真实世界中的狗来认识这个单词，这其中涉及到的信息不止包含语言学，而且还包括狗的叫声、气味等其他信息，所有这些信息共同构成了dog这个单词的含义。但是这些其他的信息通常在文本分析中并没有太大作用，并且我们也不容易对其进行表示。
Dictionary definition: 字典定义
- Dictionary definitions are necessarily circular 字典定义必然是循环的
- Only useful if meaning is already understood 仅在已经理解含义的情况下才有用
  
  因此，我们希望寻找一种其他方法来学习单词的含义：通过查词典学习单词含义。但是，我们会发现词典定义通常带有循环性质，我们用一些其他单词来解释目标单词。
- E.g
  
  red: n. the color of blood or a ruby
  
  blood: n. the red liquid that circulates in the heart, arteries, and veins of animals
  
  Here the word red is described by blood and blood is described by red. Therefore, to understand red and blood both meaning has to be understood
  
  可以看到，在定义red（红色） 这个单词时，我们将其描述为blood（血液）的颜色；然后在定义blood（血液）这个单词时，我们将其描述为心脏中的一种red（红色） 液体。所以，我们用 blood定义red，然后又用red定义 blood。如果我们本身不知道这两个单词的含义，那么我们无法从定义中获得词义。但是，字典定义仍然是非常有用的，因为当我们通过字典学习一个新的单词时，我们通常已经具有了一定的词汇背景，例如当我们学习一门新的语言时，字典可以提供一些非常有用的信息。
Their relationships with other words. 它们与其他单词的关系
- Also circular, but better for text analysis 也是循环的，但更实用
  
  另一种学习词义的方法是查看目标单词和其他单词的关系。同样，这种方法也涉及到循环性的问题，但是，当我们需要结合上下文使用某个单词时，这种方法非常有用，就像之前film和movie的例子。所以，单词之间的关系是另一种非常好的表征词义的方式。
Word sense: A word sense describes one aspect of the meaning of a word 单词义项：单词义项描述了单词含义的一个方面
- E.g. mouse: a quiet animal like a mouse
Polysemous: If a word has multiple senses, it is polysemous. 多义词：如果一个单词有多个义项，那么它就是多义词。
- E.g.
  - mouse¹: a mouse controlling a computer system in 1968
  - mouse²: a quiet animal like a mouse
Gloss: Textual definition of a sense, given by a dictionary 词义释义：由字典给出的一个义项的文本定义
Meaning Through Relations: 通过关系理解含义
- Synonymy(同义): near identical meaning 几乎相同的含义
  - vomit - throw up
  - big - large
- Antonymy(反义): opposite meaning 相反的含义
  - long - short
  - big - little
- Hypernymy(上位关系): is-a relation is-a 关系
  - 前者为下位词 (hyponym)，表示后者的一个更加具体的实例，例如cat。
  - 后者为上位词 (hypernym)，表示比前者更宽泛的一个类别，例如animal。
  - cat - animal
  - mango - fruit
- Meronymy(部分-整体关系): part-whole relation 部分-整体关系
  - 前者为部件词 (meronym)，表示后者的一部分，例如leg。
  - 后者为整体词 (holonym)，表示包含前者的一个整体，例如chair。
  - leg - chair
  - whel - car
Eg:

WordNet

A database of lexical relations 一个词汇关系的数据库
English WordNet includes ~120,000 nouns, ~12,000 verbs, ~21,000 adjectives, ~4,000 adverbs
On average: noun has 1.23 senses, verbs 2.16 平均来说：名词有1.23个义项，动词有2.16个义项
Eg:
可以看到，名词bass的词义基本上可以分为两大类：音乐和鲈鱼。而 WordNet 又将其细分为了 8 个类别。但是，这种分类对于一般的 NLP 任务而言可能太细了，所以，在使用这些词义之前，我们通常会进行一些聚类（clustering）操作。

Synsets 同义词集

Nodes of WordNet are not words or lemmas, but senses WordNet 的节点不是单词或词形，而是义项
There are represented by sets of synonyms, or called synsets 这些都由一组同义词表示，或称为同义词集
E.g. Bass:
- {bass, deep}
- {bass, bass voice, basso}

# 查看某个单词所有的同义词集
>>> nltk.corpus.wordnet.synsets(‘bank’)
[Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), Synset('bank.n.03'),
Synset('bank.n.04'), Synset('bank.n.05'), Synset('bank.n.06'), Synset('bank.n.07'),
Synset('savings_bank.n.02'), Synset('bank.n.09'), Synset('bank.n.10'), Synset('bank.v.01'),
Synset('bank.v.02'), Synset('bank.v.03'), Synset('bank.v.04'), Synset('bank.v.05'),
Synset('deposit.v.02'), Synset('bank.v.07'), Synset('trust.v.01')]# 查看某个单词的某个同义词集的定义
>>> nltk.corpus.wordnet.synsets(‘bank’)[0].definition()
u'sloping land (especially the slope beside a body of water)‘# 查看某个单词的某个同义词集中的所有相关词目
>>> nltk.corpus.wordnet.synsets(‘bank’)[1].lemma_names()
[u'depository_financial_institution', u'bank', u'banking_concern', u'banking_company']

Noun Relations in WordNet WordNet 中的词汇关系

在这里插入图片描述

Hypernymy Chain 上位链

在这里插入图片描述
如果我们观察 bass³ 和 bass⁷ 这两个同义词集（synsets）的上位关系链 (Hypernymy Chain)，我们可以发现一长串的上位关系。首先，我们看到 bass³ 的定义是关于男低音歌唱家的，然后我们看到第一层上位关系是歌手，第二层是音乐人，然后再到表演者、人类、生物、整体、实体等等。同样， bass⁷ 的定义是关于低音乐器的，第一层上位关系是乐器，然后再到设备、人工制品、整体、实体等等。可以看到，尽管这两个同义词集存在某种关联（都涉及低音方面的含义），二者共同的同义词集距离实际上非常远。如果我们观察整个上位关系链，可以发现二者实际上具有共同的同义词集 whole, unit，但是，上位关系链到这一层已经是一个非常宽泛和抽象的概念了。所以，从单词角度来看，二者实际上彼此差异是非常大的，尽管从直觉上我们会觉得二者差异并没有那么大。

Word Similarity

Word Similarity 单词相似度

Synonymy 同义: file - movie 我们已经见过了同义关系，并且我们可以从词汇关系数据库中捕获到这种同义关系。
What about show - file and opera - film? 这些单词的含义并非完全相同，但是非常接近。
Unlike synonymy which is a binary relation, word similarity is a spectrum 与同义词（二元关系，binary relation）不同，单词相似度是一个谱（spectrum）。
如果我们仅仅采用二元关系作为标准，我们将无法囊括上述关系（相似/相关但不相同），因此，我们希望构建一种谱来衡量单词之间的相似度，例如：opera 和 film 之间的相似度要高于 flower 和 film 之间的相似度。
Use lexical database or thesaurus(分类词词典) to estimate word similarity

Word Similarity with Paths 基于路径的单词相似度

首先，我们可以利用词汇数据库中的单词之间的路径信息

Given WordNet, find similarity based on path length 给定 WordNet，基于路径长度寻找相似度
路径长度 pathlen(c₁, c₂) = 1 + edge length in the shortest path between sense c₁ and c₂ 词义 c₁ 和 c₂ 之间的最短路径的边长。
Similarity between two senses: 两个词义（senses）之间的相似度：

$simpath(c_1, c_2) = \frac{1}{pathlen(c_1, c_2)}$
Similarity between two words: 两个单词（words）之间的相似度

$wordsim(w_1, w_2) = max_{c_1 \in senses(w_1), c_2 \in senses(w_2)}simpath(c_1, c_2)$
E.g.

假设现在我们希望计算 nickel和 coin、currency、money、Richter scale 之间的路径相似度：
可以看到，nickel（五分镍币）和 coin（硬币）之间的相似度（0.5）要高于 nickel 和 money (金钱）之间的相似度 (0.17），因为前者的相关性更高。但是，我们会发现 nickel 和 Richter scale （里氏震级）之间的相似度（0.13）和 money (0.17）相比似乎太高了，因为二者几乎毫不相关。因此，我们发现，单纯地使用路径信息并不是一种非常好的评估方式。

Beyond Path Length 超越路径长度

Problem of simple path length: Edges vary widely in actual semantic distance 问题：每条边在实际语义距离上的差异非常大
- E.g. from last example tree:
  - simpath(nickel, money) = 0.17
  - simpath(nickel, Richter scale) = 0.13
  - From the simple path length, similarity of nickel-money and nickel-Richter scale are very close. But in actual meanings nickel is much similar to money then Richter scale 从简单的路径长度来看，“nickel-money”和“nickel-Richter scale”的相似度非常接近。但在实际意义上，镍与里氏规模的货币非常相似
  - 在靠近顶部层级处，同样一条边的语义跳跃要比靠近底层时更大。
  - 因此，单纯依靠路径长度并不能区分这种差异，因为对于每条边，其长度总是为1。
Solution 1: include depth information 增加深度信息（Wu & Palmer）
- Use path to find lowest common subsumer (LCS) 使用路径查找最低公共归类（lowest common subsumer，LCS）
- Compare using depths: 利用深度进行比较
  
  $simwup(c_1, c_2) = \frac{2 * depth(LCS(c_1, c_2))}{depth(c_1) + depth(c_2)}$
  
  High simwup when parent is deep or senses are shallow
  
  LCS简单来说就是数有几个一样的字符
- E.g.
  
  假设standard 的深度为 1。
  可以看到，nickel 和 Richter scale 之间的相似度 (0.22) 是和money (0.44) 之间相似度的一半，这个结果比之前要好很多。但是，对于nickel和 Richter scale 之间的相似度而言，0.22这个数字仍然过高了，因为二者实际上几乎没有关联。

Abstract Nodes 抽象结点

Node depth is still poor semantic distance metric. E.g.: 计算边的数量或者结点深度仍然是一种较差的语义距离度量。
- simwup(nickel, money) = 0.44
- simwup(nickel, Richter scale) = 0.22
Node high in the hierarchy is very abstract or general 层次结构中的高级节点非常抽象或一般
我们如何能够使得通过非常抽象的节点连接的单词之间的相似度大大降低

Concept Probability of A Node 概念概率（Concept Probability）

Intuition: 直觉
- general node -> high concept probability 广泛节点 -> 高概念概率
- narrow node -> low concept probability 狭窄节点 -> 低概念概率
Find all the children of the node, and sum up their unigram probabilities 找到该节点的所有子节点，并将它们的单字概率相加: $P(c) = \frac{\sum_{s \in child(c)}count(s)}{N}$
- P(C)：从语料库中随机选择的一个单词是概念c的一个实例的概率。
- child(c)：c的所有子结点的单词的集合。
E.g.

Abstract nodes in the higher hierarchy has a higher P(c) 层次结构中较高的抽象结点具有较高的 P(c) 说明该结点表示的词义越抽象。

Similarity with Information Content 基于信息量的相似度

我们可以利用概念概率P(c)来定义信息量（information content，IC），P(c)越大，词义越抽象，信息量越小。

Information Content: $IC = -logP(c)$
- general concept = small values 广泛概念 = 小值
- narrow concept = large values 狭义概念 = 大值
simlin : $simlin(c_1, c_2) = \frac{2*IC(LCS(c_1, c_2))}{IC(c_1) + IC(c_2)}$
- High simlin when concept of parent is narrow or concept of senses are general
E.g

可以看到，simlin 和 simwup 的唯一区别就是用 IC 代替depth。

$simlin(hill, coast) = \frac{2*-logP(geolocial-formation)}{-logP(hill) - logP(coast)} = \frac{-2log0.00176}{-log0.0000189-log0.0000216}$

如果ICS 结点在层次结构中位置非常高（例如：P(C)=0.99），那么IC 将非常低（在这种情况下为 0.01)。

Word Sense Disambiguation

回顾之前情感分析的例子

This is a good movie -> positive 正面 😊
This is a wonderful film -> ?
如果我们的分类器是基于单词相似度（例如：KNN），使用 WordNet 路径比较单词效果较好。
但是，如果我们希望将词义（sense）作为一般特征表示，那么我们该如何使用其他分类器？
解决方案：将文本中的单词（words）显式地映射到 WordNet 中的词义（senses）。

Word Sense Disambiguation 词义消歧

Task: Selects the correct sense for words in a sentence 任务：为句子中的单词选择正确的词义。
Baseline: Assume the most popular sense 假设最流行的词义
Good WSD potentially useful for many tasks: 良好的词义消歧 (Word Sense Disambiguation, WSD) 在很多 NLP 任务中都有应用潜力
- Knowing which sense of mouse is used in a sentence is important 在实践中经常被忽略，因为良好的 WSD 是非常困难的
- Less popular nowadays because sense information is implicitly captured by contextual representations 在研究领域中非常活跃

Supervised WSD 有监督 WSD

Apply standard machine classifiers 采用标准的机器学习分类器
Feature vectors are typically words and syntax around target 特征向量通常是目标周围的单词和句法
- 但是上下文也可能带有歧义（ambiguous）
- 上下文窗口应该设置为多大？（通常很小）
Requires sense-tagged corpora 需要带词义标记（sense-tagged）的语料库
- E.g. SENSEVAL, SEMCOR 例如 SENSEVAL 和 SEMCOR（NLTK 中提供）
- Very time-consuming to create 创建过程非常耗时

Unsupervised WSD: Lesk 较少监督的方法

Lesk: Choose sense whose WordNet gloss overlaps most with the context 选择 WordNet 中的字典注解（dictionary gloss）与上下文重叠最大的词义（sense）
E.g.

给定下面的句子，我们希望知道句子中单词bank对应的词义：

Unsupervised WSD: Clustering 无监督的WSD：聚类

Gather usages of the word 收集该词的使用情况
Perform clustering on context words to learn the different senses 对语境词进行聚类，以了解不同的意义
- Rationale: context words of the same sense should be similar 理由：相同意义的语境词应该是相似的。
Disadvantages: 缺点
- Sense cluster not very interpretable 感官聚类的可解释性不强
- Need to align with dictionary senses 需要与字典中的意义保持一致

其他数据库：FrameNet

基于 框架语义（frame semantics）
- Mary bought a car from John. 玛丽从约翰那里买了辆车。
- John sold a car to Mary. 约翰把车卖给了玛丽。
- 相同的情况（语义框架），只是视角不同。
一个 框架（frames） 的词汇数据库，通常是典型情况
- 例如commerce_bug、apply_heat

FrameNet

包含引发框架的 词汇单元（lexical units） 的列表。
- 例如：cook, fry, bake, boil等等
语义角色（semantic roles） 或 框架元素（frame elements） 的列表。
- 例如：“the cook”, “the food”, ithe container", ithe instrument"