Reading notes on Neural Cross-Lingual Named Entity Recognition (2018.9).
Jiateng Xie
Neural Cross-Lingual Named Entity Recognition, CMU
Abstract
The paper proposes two methods to address the challenges of cross-lingual NER under the unsupervised transfer setting: lexical mapping (STEP 1-3) and word ordering (STEP 4).
- STEP 1: train word embeddings (WE) for each language separately on monolingual corpora
- STEP 2: Procrustes problem: use a seed dictionary to optimize the WE alignment, mapping the WEs of the two languages into a shared embedding space
- STEP 3: translate the source-language words by nearest-neighbor search in the shared space under the CSLS similarity metric, and copy the labels directly
- STEP 4: train an NER model on the training data and labels obtained in STEP 3, adding a self-attention layer
Assessment: the real novelty is the self-attention layer; the treatment of the lexical-mapping problem (i.e., translating the source language) is not new and follows exactly the method of Word Translation Without Parallel Data, Alexis Conneau, 2018.1.
Motivation
- Goal
- Perform unsupervised transfer of an NER model trained on a resource-rich language to a language with no annotated resources.
- Difficulties with this approach
- differences in words
- word order
- This paper's goals and contributions
- mapping of lexical items across languages (STEP 1-3)
- find translations based on bilingual word embeddings
- word order (STEP 4)
- self-attention
- Results of this paper: under the cross-lingual setting
- state-of-the-art: Spanish, Dutch
- competitive: German
- much lower resource requirement
- evaluate on Uyghur
Introduction
- NER has made great progress since the introduction of neural architectures, but still falls short on languages with limited amounts of labeled data.
- Cross-lingual NER: transfer knowledge from high-resource to low-resource
- This paper: unsupervised transfer
- Two challenges of unsupervised transfer
- lexical mapping
- word order
- lexical mapping
- M1: use parallel corpora to project annotations through word alignment
- M2: cheap translation: use a bilingual dictionary to perform word-level translation (the cited work achieves decent translation quality with a dictionary alone, mainly by handling words with multiple translation candidates and by exploiting morphological information)
- M3: bilingual word embeddings (BWE)
- map the WEs of the two languages into a shared, consistent embedding space using a small dictionary, or via adversarial training / identical character strings
- This paper: combines the discrete dictionary-based and continuous embedding-based approaches
- map the WEs of the two languages into a shared BWE space (embedding-based)
- learn discrete word translations by nearest-neighbor search in the BWE space (dictionary-based)
- train a model on the translated data.
- word ordering
- no existing work addresses this issue specifically for unsupervised cross-lingual NER transfer
- This paper: alleviates the issue by incorporating an order-invariant self-attention mechanism into the neural architecture
Approach
Problem Setting - Unsupervised cross-lingual NER
- Existing methods: rely on various resources
- parallel corpora
- Wikipedia
- large dictionaries
- Resources required by this paper
- labeled training data in the source language
- monolingual corpora in both languages
- A dictionary: a small pre-existing one, or one induced by unsupervised methods
- Main points of comparison: Mayhew, 2017 and Ni, 2017
Method
- STEP 1: train WEs for each language separately on monolingual corpora
- Implementation: fastText & GloVe
- STEP 2: Procrustes problem: map the WEs of the two languages into a shared embedding space
- Implementation: use a seed dictionary to optimize the WE alignment
- STEP 3: translate each word in the source-language training data
- Implementation: nearest-neighbor search in the shared space
- STEP 4: train an NER model
- Training data: the translated words & the NE tags from the English corpus
STEP 2: learning bilingual embeddings
- embedding alignment:
$$
\max_W \; \mathrm{Tr}(X_D W Y_D^T) \quad \text{s.t.} \quad WW^T = I
$$
$$
Y_D^T X_D = U \Sigma V^T, \qquad W = UV^T
$$
$$
X' = XV, \qquad Y' = YU
$$
- generate a new dictionary using the aligned embeddings (STEP 3)
- use the new dictionary to generate a new set of bilingual embeddings (STEP 1)
- repeat the above procedure k times to obtain the final aligned spaces $X_k'$ and $Y_k'$ used for translation (a minimal sketch of the alignment step follows below)
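A minimal numpy sketch of one alignment step, following the formulas above: `X_D`/`Y_D` are assumed to be the embedding matrices of the seed-dictionary word pairs (rows aligned) and `X`/`Y` the full vocabularies; the function name and the single-shot form (without the k-fold dictionary-refinement loop) are mine, not the authors'.

```python
import numpy as np

def procrustes_align(X_D, Y_D, X, Y):
    """One Procrustes alignment step (the paper iterates this k times,
    regenerating the dictionary from the aligned spaces in between).

    X_D, Y_D: (n_pairs, d) embeddings of the seed-dictionary word pairs.
    X, Y:     full source / target embedding matrices, shape (V, d).
    Returns the aligned spaces X' = X V and Y' = Y U.
    """
    # SVD of Y_D^T X_D solves max_W Tr(X_D W Y_D^T) s.t. W W^T = I.
    U, _, Vt = np.linalg.svd(Y_D.T @ X_D)
    return X @ Vt.T, Y @ U

# Toy usage: random vectors with an identity "dictionary" on the first rows.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(500, 50)), rng.normal(size=(600, 50))
X_aligned, Y_aligned = procrustes_align(X[:200], Y[:200], X, Y)
```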
STEP 3: learning word translations
- nearest-neighbor search in the common space
- distance metric: cross-domain similarity local scaling (CSLS)
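A small numpy sketch of CSLS-based nearest-neighbor translation over the aligned spaces. The CSLS score (twice the cosine similarity minus the mean cosine similarity to the k nearest cross-lingual neighbors on each side) follows the definition in Word Translation Without Parallel Data (Conneau et al., 2018); the dense similarity matrix and k = 10 are simplifications for illustration, and the function name is mine.

```python
import numpy as np

def csls_translate(X_src, Y_tgt, k=10):
    """For each source word, return the index of its CSLS-nearest target word.

    X_src, Y_tgt: aligned source/target embeddings, shapes (n_src, d), (n_tgt, d).
    CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y), where r_T / r_S are the mean
    cosine similarities to the k nearest cross-lingual neighbors.
    """
    X = X_src / np.linalg.norm(X_src, axis=1, keepdims=True)
    Y = Y_tgt / np.linalg.norm(Y_tgt, axis=1, keepdims=True)
    cos = X @ Y.T                                      # dense (n_src, n_tgt); batch this for real vocabularies

    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # r_T(x) for each source word
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # r_S(y) for each target word

    csls = 2 * cos - r_src[:, None] - r_tgt[None, :]
    return csls.argmax(axis=1)                         # best target index per source word
```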
STEP 4: training the NER model
- take English sentences $S = s_1, s_2, \dots, s_n$
- translate $S$ into target sentences $\hat{T} = \hat{t}_1, \hat{t}_2, \dots, \hat{t}_n$
- copy the English labels over to the target language (sketched after this list)
- train an NER model directly using the translated data.
- have access to the surface forms
- can use the character sequences of the target language as part of its input
- Some details worth noting
- Usual practice: normalize the WEs => they lie on the unit ball
- each training pair then contributes equally to the objective
- This paper: does not normalize the WEs
- preliminary experiments gave superior results
- frequency information conveyed by vector length => important for NER
- Reason: normalization concerns the norm of the word vectors, not sentence length; it pushes every point onto the unit sphere, whereas different norms along the same direction carry information that matters for NER
- existing work
- trains NER directly on data using source embeddings
- directly models the shared embedding space
- This paper
- performs nearest-neighbor translation in the shared space and iterates the process
- Advantages: enlarges the parallel data & keeps correcting the shared WE space
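Building the pseudo-target training set is then plain word-by-word substitution with the NE tags copied over unchanged. A minimal sketch, assuming the output of the CSLS step is available as a `dict` translation table and OOV words are kept as-is (function and variable names are mine):

```python
def translate_corpus(labeled_sentences, translation_table):
    """Word-by-word translation with label copying.

    labeled_sentences: list of (words, tags) pairs from the English data,
    e.g. (["George", "visited", "Geneva"], ["B-PER", "O", "B-LOC"]).
    translation_table: dict mapping source words to target words (STEP 3).
    """
    translated = []
    for words, tags in labeled_sentences:
        new_words = [translation_table.get(w, w) for w in words]  # OOV kept as-is
        translated.append((new_words, list(tags)))                # labels copied directly
    return translated
```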
NER model architecture
Hierarchical Neural CRF + Self-Attention Layer
Hierarchical Neural CRF
- Layer 1: a character-level NN
- usually using RNN or CNN
- This paper: bidirectional LSTMs
- to capture subword information: morphological variations & capitalization patterns
- Layer 2: a word-level NN
- usually using an RNN
- This paper: bidirectional LSTMs
- consumes word representations
- to produce context-sensitive hidden representations for each word
- Layer 3: a linear-chain CRF layer
- to model the dependency between labels (defines the joint distribution of all possible output label sequences) & perform inference (Viterbi)
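A minimal PyTorch sketch of layers 1-2, using the hyperparameter values listed later in the Experiment section (char embedding 25, char hidden 50, word embedding 100, word hidden 200). The linear-chain CRF of layer 3 and the self-attention layer of the next section are omitted; this is an illustrative reimplementation, not the authors' code.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Char-level BiLSTM -> word-level BiLSTM -> per-word emission scores
    (the scores would be consumed by a linear-chain CRF layer, omitted here)."""

    def __init__(self, n_chars, n_words, n_tags,
                 char_emb=25, char_hidden=50, word_emb=100, word_hidden=200):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_emb)
        self.char_lstm = nn.LSTM(char_emb, char_hidden,
                                 bidirectional=True, batch_first=True)
        self.word_embed = nn.Embedding(n_words, word_emb)
        self.word_lstm = nn.LSTM(word_emb + 2 * char_hidden, word_hidden,
                                 bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(2 * word_hidden, n_tags)

    def forward(self, char_ids, word_ids):
        # char_ids: (n_words, max_word_len) character ids for one sentence;
        # word_ids: (1, n_words) word ids for the same sentence.
        _, (h_n, _) = self.char_lstm(self.char_embed(char_ids))
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)    # subword features per word
        word_in = torch.cat([self.word_embed(word_ids),
                             char_repr.unsqueeze(0)], dim=-1)
        hidden, _ = self.word_lstm(word_in)                # context-sensitive states
        return self.emissions(hidden)                      # scores for the CRF layer
```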
Self-Attention Layer
- Layer 2.5: a single layer MLP
- to provide each word with a context feature vector
- irrespective of the words’ position
- The model is more likely to “see vectors similar to those seen at training time, which we posit introduces a level of flexibility with respect to the word order”
$$
K = \tanh(HW + b)
$$
$$
H^a = \left(\mathrm{softmax}(QK^T) \odot (E - I)\right) H = [h_1^a, h_2^a, \dots, h_n^a]
$$
(What exactly are the queries Q, and how are they obtained? The sketch below assumes Q = H, i.e., the word-level hidden states attend over themselves.)
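A minimal PyTorch sketch of this layer. Treating Q as the word-level BiLSTM states H themselves and concatenating H with the context vectors H^a are assumptions on my part; E is the all-ones matrix and I the identity, so the element-wise mask (E − I) zeroes each word's attention to its own position.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Order-invariant self-attention: K = tanh(HW + b),
    H^a = (softmax(Q K^T) ⊙ (E − I)) H."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)       # W, b

    def forward(self, H):
        # H: (batch, n_words, hidden_dim) word-level BiLSTM outputs.
        K = torch.tanh(self.proj(H))                        # keys
        Q = H                                               # assumption: queries = H
        scores = torch.softmax(Q @ K.transpose(1, 2), dim=-1)   # (batch, n, n)
        mask = 1.0 - torch.eye(H.size(1), device=H.device)  # E − I: zero the diagonal
        H_attn = (scores * mask) @ H                        # context vector per word
        return torch.cat([H, H_attn], dim=-1)               # assumption: concatenate before the CRF
```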
Experiment
- 4 sets of experiments
- with & without provided dictionaries on a benchmark NER dataset
- CoNLL 2002 & 2003: English (source), German, Dutch, Spanish
- compare against a recently proposed dictionary-based translation baseline
- conduct an ablation study to further understand the proposed methods
- apply the method to Uyghur
- word embedding
- fastText
- GloVe
- vocabulary size of 100,000 for both embedding methods.
- Seed Dictionary
- identical character strings shared by the two vocabularies (this depends on the language pair; distant languages may not yield good results)
- adversarial learning to induce a mapping that aligns the two WEs (Lample, 2018)
- a provided dictionary ([Lample](https://github.com/facebookresearch/MUSE))
- Translation
- for out-of-vocabulary (OOV) words: keep them as-is, i.e., copy the word over unchanged instead of translating it
- German capitalization: capitalize each word according to the probability with which that word is capitalized in Wikipedia
- Network Parameters
- character embedding size: 25
- character level LSTM hidden size: 50
- word level LSTM hidden size: 200
- for OOV words: initialize an unknown embedding by sampling uniformly from $[-\sqrt{3/emb}, \sqrt{3/emb}]$, with emb = 100
- replace each number with 0 when input to the character level Bi-LSTM
- Network training
- SGD with momentum
- 30 epochs => select the best model on the target language
- learning rate
- initial: $\eta_0 = 0.015$
- update: $\eta_t = \frac{\eta_0}{1 + \rho t}$
- t: the number of completed epochs
- $\rho = 0.05$: decay rate
- batch size: 10
- evaluate every 150 batches
- dropout:
- inputs to the word-level Bi-LSTM: rate=0.5
- outputs of the word-level Bi-LSTM: rate=0.5
- outputs of the self-attention layer:
- rate=0.5 when using the translated data
- rate=0.2 when using cheap-translation data
- word embeddings are not fine-tuned during training.
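The learning-rate schedule and the OOV-embedding initialization above are easy to reproduce; a small sketch using the numbers from this list (function names are mine):

```python
import numpy as np

def learning_rate(epoch, eta0=0.015, rho=0.05):
    """eta_t = eta_0 / (1 + rho * t), with t the number of completed epochs."""
    return eta0 / (1.0 + rho * epoch)

def init_unknown_embedding(emb=100, rng=np.random.default_rng()):
    """Sample an OOV embedding uniformly from [-sqrt(3/emb), +sqrt(3/emb)]."""
    bound = np.sqrt(3.0 / emb)
    return rng.uniform(-bound, bound, size=emb)

# e.g. learning_rate(0) == 0.015, learning_rate(10) == 0.010
```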
Result
- best in Spanish & Dutch
- competitive in German
- rich morphology & compound words => embeddings less reliable
- a noisier embedding-space alignment => lowers the quality of BWE-based translation
- Why does translation work better?
- Common Space
- trained with the source WE + source character sequence => applied on the target side
- worst: discrepancy between the two embedding spaces
- Replace
- trained with the target WE + source character sequence
- Translation
- trained with the target WE + target character sequence
- best, especially in German => Capitalization