Reading notes on Neural Cross-Lingual Named Entity Recognition (2018.9).
Jiateng Xie
Neural Cross-Lingual Named Entity Recognition, CMU
Abstract
The paper proposes two methods to address the challenges of cross-lingual NER under the unsupervised transfer setting: lexical mapping (STEP 1-3) and word ordering (STEP 4).
- STEP 1: train word embeddings (WE) for each language separately on monolingual corpora
- STEP 2: Procrustes problem: use a seed dictionary to optimize the WE alignment, mapping the WEs of the two languages into a shared embedding space
- STEP 3: translate the source-language words by nearest-neighbor search in the shared space under the CSLS similarity metric, and copy the labels directly
- STEP 4: train an NER model on the training data and labels obtained in STEP 3, adding a self-attention layer
Assessment: the real novelty is the self-attention layer; the treatment of the lexical-mapping problem (i.e., translating the source language) is not new and follows exactly the method of Word Translation Without Parallel Data, Alexis Conneau, 2018.1.
Motivation
- Goal
- Perform unsupervised transfer of an NER model trained on a resource-rich language to a language with no annotated resources.
- Difficulties with this approach
- differences in words
- word order
- This paper's goals and contributions
- mapping of lexical items across languages (STEP 1-3)
- find translations based on bilingual word embeddings
- word order (STEP 4)
- self-attention
- Results of this paper: under the cross-lingual setting
- state-of-the-art: Spanish, Dutch
- competitive: German
- much lower resource requirement
- evaluate on Uyghur
Introduction
- NER has made great progress since the introduction of neural architectures, but still falls short on languages with limited amounts of labeled data.
- Cross-lingual NER: transfer knowledge from high-resource to low-resource
- This paper: unsupervised transfer
- Two challenges of unsupervised transfer
- lexical mapping
- word order
- lexical mapping
- M1: use parallel corpora to project annotations through word alignment
- M2: cheap translation: use a bilingual dictionary to perform word-level translation (the cited work achieves decent translation quality with a dictionary alone, mainly by handling words with multiple translation candidates and by exploiting morphological information)
- M3: bilingual word embeddings (BWE)
- map the WEs of the two languages into a shared, consistent embedding space using a small dictionary, or via adversarial training / identical character strings
- This paper: combines the discrete dictionary-based and continuous embedding-based approaches
- map the WEs of the two languages into a shared BWE space (embedding-based)
- learn discrete word translations by nearest-neighbor search in the BWE space (dictionary-based)
- train a model on the translated data.
- word ordering
- no existing work addresses this issue specifically for unsupervised cross-lingual NER transfer
- This paper: alleviates the issue by incorporating an order-invariant self-attention mechanism into the neural architecture
Approach
Problem Setting - Unsupervised cross-lingual NER
- Existing methods: rely on various resources
- parallel corpora
- Wikipedia
- large dictionaries
- Resources required by this paper
- labeled training data in the source language
- monolingual corpora in both languages
- A dictionary: a small pre-existing one, or one induced by unsupervised methods
- Main points of comparison: Mayhew, 2017 and Ni, 2017
Method
- STEP 1: train WEs for each language separately on monolingual corpora
- Implementation: fastText & GloVe
- STEP 2: Procrustes problem: map the WEs of the two languages into a shared embedding space
- Implementation: use a seed dictionary to optimize the WE alignment
- STEP 3: translate each word in the source-language training data
- Implementation: nearest-neighbor search in the shared space
- STEP 4: train an NER model
- Training data: the translated words & the NE tags from the English corpus
STEP 2: learning bilingual embeddings
- embedding alignment:
$$
\max_W \; \mathrm{Tr}(X_D W Y_D^T) \quad \text{s.t.} \quad WW^T = I
$$
$$
Y_D^T X_D = U \Sigma V^T, \qquad W = UV^T
$$
$$
X' = XV, \qquad Y' = YU
$$
- generate a new dictionary using the aligned embeddings (STEP 3)
- use the new dictionary to generate a new set of bilingual embeddings (STEP 1)
- repeat the above procedure k times to obtain the final aligned spaces $X_k'$ and $Y_k'$ used for translation (a minimal sketch of the alignment step follows below)
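A minimal numpy sketch of one alignment step, following the formulas above: `X_D`/`Y_D` are assumed to be the embedding matrices of the seed-dictionary word pairs (rows aligned) and `X`/`Y` the full vocabularies; the function name and the single-shot form (without the k-fold dictionary-refinement loop) are mine, not the authors'.

```python
import numpy as np

def procrustes_align(X_D, Y_D, X, Y):
    """One Procrustes alignment step (the paper iterates this k times,
    regenerating the dictionary from the aligned spaces in between).

    X_D, Y_D: (n_pairs, d) embeddings of the seed-dictionary word pairs.
    X, Y:     full source / target embedding matrices, shape (V, d).
    Returns the aligned spaces X' = X V and Y' = Y U.
    """
    # SVD of Y_D^T X_D solves max_W Tr(X_D W Y_D^T) s.t. W W^T = I.
    U, _, Vt = np.linalg.svd(Y_D.T @ X_D)
    return X @ Vt.T, Y @ U

# Toy usage: random vectors with an identity "dictionary" on the first rows.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(500, 50)), rng.normal(size=(600, 50))
X_aligned, Y_aligned = procrustes_align(X[:200], Y[:200], X, Y)
```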
STEP 3: learning word translations
- nearest-neighbor search in the common space
- distance metric: cross-domain similarity local scaling (CSLS)
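A small numpy sketch of CSLS-based nearest-neighbor translation over the aligned spaces. The CSLS score (twice the cosine similarity minus the mean cosine similarity to the k nearest cross-lingual neighbors on each side) follows the definition in Word Translation Without Parallel Data (Conneau et al., 2018); the dense similarity matrix and k = 10 are simplifications for illustration, and the function name is mine.

```python
import numpy as np

def csls_translate(X_src, Y_tgt, k=10):
    """For each source word, return the index of its CSLS-nearest target word.

    X_src, Y_tgt: aligned source/target embeddings, shapes (n_src, d), (n_tgt, d).
    CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y), where r_T / r_S are the mean
    cosine similarities to the k nearest cross-lingual neighbors.
    """
    X = X_src / np.linalg.norm(X_src, axis=1, keepdims=True)
    Y = Y_tgt / np.linalg.norm(Y_tgt, axis=1, keepdims=True)
    cos = X @ Y.T                                      # dense (n_src, n_tgt); batch this for real vocabularies

    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # r_T(x) for each source word
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # r_S(y) for each target word

    csls = 2 * cos - r_src[:, None] - r_tgt[None, :]
    return csls.argmax(axis=1)                         # best target index per source word
```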
STEP 4: training the NER model
- take English sentences $S = s_1, s_2, \dots, s_n$
- translate $S$ into target sentences $\hat{T} = \hat{t}_1, \hat{t}_2, \dots, \hat{t}_n$
- copy the English labels over to the target language (sketched after this list)
- train an NER model directly using the translated data.
- have access to the surface forms
- can use the character sequences of the target language as part of its input
- Some details worth noting
- Usual practice: normalize the WEs => they lie on the unit ball
- each training pair then contributes equally to the objective
- This paper: does not normalize the WEs
- preliminary experiments gave superior results
- frequency information conveyed by vector length => important for NER
- Reason: normalization concerns the norm of the word vectors, not sentence length; it pushes every point onto the unit sphere, whereas different norms along the same direction carry information that matters for NER
- existing work
- trains NER directly on data using source embeddings
- directly models the shared embedding space
- This paper
- performs nearest-neighbor translation in the shared space and iterates the process
- Advantages: enlarges the parallel data & keeps correcting the shared WE space
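Building the pseudo-target training set is then plain word-by-word substitution with the NE tags copied over unchanged. A minimal sketch, assuming the output of the CSLS step is available as a `dict` translation table and OOV words are kept as-is (function and variable names are mine):

```python
def translate_corpus(labeled_sentences, translation_table):
    """Word-by-word translation with label copying.

    labeled_sentences: list of (words, tags) pairs from the English data,
    e.g. (["George", "visited", "Geneva"], ["B-PER", "O", "B-LOC"]).
    translation_table: dict mapping source words to target words (STEP 3).
    """
    translated = []
    for words, tags in labeled_sentences:
        new_words = [translation_table.get(w, w) for w in words]  # OOV kept as-is
        translated.append((new_words, list(tags)))                # labels copied directly
    return translated
```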
NER model architecture
Hierarchical Neural CRF + Self-Attention Layer
Hierarchical Neural CRF
- Layer 1: a character-level NN
- usually using RNN or CNN
- This paper: bidirectional LSTMs
- to capture subword information: morphological variations & capitalization patterns
- Layer 2: a word-level NN
- usually using an RNN
- This paper: bidirectional LSTMs
- consumes word representations
- to produce context-sensitive hidden representations for each word
- Layer 3: a linear-chain CRF layer
- to model the dependency between labels (defines the joint distribution of all possible output label sequences) & perform inference (Viterbi)
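A minimal PyTorch sketch of layers 1-2, using the hyperparameter values listed later in the Experiment section (char embedding 25, char hidden 50, word embedding 100, word hidden 200). The linear-chain CRF of layer 3 and the self-attention layer of the next section are omitted; this is an illustrative reimplementation, not the authors' code.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Char-level BiLSTM -> word-level BiLSTM -> per-word emission scores
    (the scores would be consumed by a linear-chain CRF layer, omitted here)."""

    def __init__(self, n_chars, n_words, n_tags,
                 char_emb=25, char_hidden=50, word_emb=100, word_hidden=200):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_emb)
        self.char_lstm = nn.LSTM(char_emb, char_hidden,
                                 bidirectional=True, batch_first=True)
        self.word_embed = nn.Embedding(n_words, word_emb)
        self.word_lstm = nn.LSTM(word_emb + 2 * char_hidden, word_hidden,
                                 bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(2 * word_hidden, n_tags)

    def forward(self, char_ids, word_ids):
        # char_ids: (n_words, max_word_len) character ids for one sentence;
        # word_ids: (1, n_words) word ids for the same sentence.
        _, (h_n, _) = self.char_lstm(self.char_embed(char_ids))
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)    # subword features per word
        word_in = torch.cat([self.word_embed(word_ids),
                             char_repr.unsqueeze(0)], dim=-1)
        hidden, _ = self.word_lstm(word_in)                # context-sensitive states
        return self.emissions(hidden)                      # scores for the CRF layer
```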
Self-Attention Layer
- Layer 2.5: a single layer MLP
- to provide each word with a context feature vector
- irrespective of the words’ position
- The model is more likely to “see vectors similar to those seen at training time, which we posit introduces a level of flexibility with respect to the word order”
$$
K = \tanh(HW + b)
$$
$$
H^a = \left(\mathrm{softmax}(QK^T) \odot (E - I)\right) H = [h_1^a, h_2^a, \dots, h_n^a]
$$
(What exactly are the queries Q, and how are they obtained? The sketch below assumes Q = H, i.e., the word-level hidden states attend over themselves.)
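A minimal PyTorch sketch of this layer. Treating Q as the word-level BiLSTM states H themselves and concatenating H with the context vectors H^a are assumptions on my part; E is the all-ones matrix and I the identity, so the element-wise mask (E − I) zeroes each word's attention to its own position.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Order-invariant self-attention: K = tanh(HW + b),
    H^a = (softmax(Q K^T) ⊙ (E − I)) H."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)       # W, b

    def forward(self, H):
        # H: (batch, n_words, hidden_dim) word-level BiLSTM outputs.
        K = torch.tanh(self.proj(H))                        # keys
        Q = H                                               # assumption: queries = H
        scores = torch.softmax(Q @ K.transpose(1, 2), dim=-1)   # (batch, n, n)
        mask = 1.0 - torch.eye(H.size(1), device=H.device)  # E − I: zero the diagonal
        H_attn = (scores * mask) @ H                        # context vector per word
        return torch.cat([H, H_attn], dim=-1)               # assumption: concatenate before the CRF
```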
Experiment
- 4 sets of experiments
- with & without provided dictionaries on a benchmark NER dataset
- CoNLL 2002 & 2003: English (source), German, Dutch, Spanish
- compare against a recently proposed dictionary-based translation baseline
- conduct an ablation study to further understand the proposed methods
- apply the method to Uyghur
- word embedding
- fastText
- GloVe
- vocabulary size of 100,000 for both embedding methods.
- Seed Dictionary
- identical character strings shared by the two vocabularies (this depends on the language pair; distant languages may not yield good results)
- adversarial learning to induce a mapping that aligns the two WEs (Lample, 2018)
- a provided dictionary ([Lample](https://github.com/facebookresearch/MUSE))
- Translation
- for out-of-vocabulary (OOV) words: keep them as-is, i.e., copy the word over unchanged instead of translating it
- German capitalization: capitalize each word according to the probability with which that word is capitalized in Wikipedia
- Network Parameters
- character embedding size: 25
- character level LSTM hidden size: 50
- word level LSTM hidden size: 200
- for OOV words: initialize an unknown embedding by sampling uniformly from $[-\sqrt{3/emb}, \sqrt{3/emb}]$, with emb = 100
- replace each number with 0 when input to the character level Bi-LSTM
- Network training
- SGD with momentum
- 30 epochs => select the best model on the target language
- learning rate
- initial: $\eta_0 = 0.015$
- update: $\eta_t = \frac{\eta_0}{1 + \rho t}$
- t: the number of completed epochs
- $\rho = 0.05$: decay rate
- batch size: 10
- evaluate every 150 batches
- dropout:
- inputs to the word-level Bi-LSTM: rate=0.5
- outputs of the word-level Bi-LSTM: rate=0.5
- outputs of the self-attention layer:
- rate=0.5 when using the translated data
- rate=0.2 when using cheap-translation data
- word embeddings are not fine-tuned during training.
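The learning-rate schedule and the OOV-embedding initialization above are easy to reproduce; a small sketch using the numbers from this list (function names are mine):

```python
import numpy as np

def learning_rate(epoch, eta0=0.015, rho=0.05):
    """eta_t = eta_0 / (1 + rho * t), with t the number of completed epochs."""
    return eta0 / (1.0 + rho * epoch)

def init_unknown_embedding(emb=100, rng=np.random.default_rng()):
    """Sample an OOV embedding uniformly from [-sqrt(3/emb), +sqrt(3/emb)]."""
    bound = np.sqrt(3.0 / emb)
    return rng.uniform(-bound, bound, size=emb)

# e.g. learning_rate(0) == 0.015, learning_rate(10) == 0.010
```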
Result
- best in Spanish & Dutch
- competitive in German
- rich morphology & compound words => embeddings less reliable
- a noisier embedding-space alignment => lowers the quality of BWE-based translation
- Why does translation work better?
- Common Space
- trained with the source WE + source character sequence => applied on the target side
- worst: discrepancy between the two embedding spaces
- Replace
- trained with the target WE + source character sequence
- Translation
- trained with the target WE + target character sequence
- best, especially in German => Capitalization