2018.9. Neural Cross-Lingual Named Entity Recognition: Reading Notes

2024-04-24 02:32

Jiateng Xie

Neural Cross-Lingual Named Entity Recognition, CMU


This paper proposes two methods to address the challenges of cross-lingual NER under the unsupervised transfer setting: lexical mapping (STEP 1-3) and word ordering (STEP 4).

  • STEP 1: train word embeddings (WE) for each language separately on monolingual corpora
  • STEP 2: Procrustes problem: refine the WE alignment with a seed dictionary, mapping the WE of the different languages into a shared embedding space.
  • STEP 3: translate the source language in the shared space using the CSLS similarity metric, and copy the labels directly
  • STEP 4: train an NER model on the training data and labels obtained in STEP 3, with an added self-attention layer.

My assessment: the real novelty lies in introducing the self-attention layer. The solution to the lexical mapping problem (i.e. translating the source language) is not new; the method is identical to that of Word Translation Without Parallel Data, Alexis Conneau, 2018.1.


  • Goal
    • Perform unsupervised transfer of an NER model from a resource-rich language, to handle languages with no annotated resources.
  • Difficulties with this approach
    • differences in words
    • word order
  • Goals and contributions of this paper
    • mapping of lexical items across Lan. (STEP 1-3)
      • find translations based on bilingual word embeddings
    • word order (STEP 4)
      • self-attention
  • Results of this paper (under the cross-lingual setting)
    • state-of-the-art: Spanish, Dutch
    • competitive: German
    • much lower resource requirement
    • evaluate on Uyghur


  • NER has made great progress since neural architectures were introduced, but still falls short on Lan. with limited amounts of labeled data.
  • Cross-lingual NER: transfer knowledge from high-resource to low-resource
    • This paper: unsupervised transfer
    • 2 challenges of unsupervised transfer
      • lexical mapping
      • word order
  • lexical mapping
    • M1: use parallel corpora to project annotations through word alignment
    • M2: cheap translation: uses a bilingual dictionary to perform word-level translation (the cited work achieves good translation results with a dictionary; it mainly deals with words that have multiple correspondences, with a focus on morphological information)
    • M3: bilingual word embeddings (BWE)
      • map the WE of different languages into a shared, consistent embedding space using a small dictionary, adversarial training, identical character strings, or similar methods
    • This paper: discrete dictionary-based + continuous embedding-based
      • map the WE of the different Lan. into a shared BWE space. (embedding based)
      • learn discrete word translations by finding nearest neighbors in the BWE space (dictionary based)
      • train a model on the translated data.
  • word ordering
    • There is currently no prior work specifically on unsupervised cross-lingual transfer for NER that addresses this
    • This paper: alleviates the issue by incorporating an order-invariant self-attention mechanism into the neural architecture


Problem Setting - Unsupervised cross-lingual NER

  • Existing methods: rely on various resources
    • parallel corpora
    • Wikipedia
    • large dictionaries
  • Resources this paper requires
    • labeled training data in the source Lan.
    • monolingual corpora in both Lan.
    • A dictionary: a small pre-existing one, or one induced by unsupervised methods
    • Main works compared against: Mayhew, 2017 and Ni, 2017


  • STEP 1: train WE for each language separately on monolingual corpora
    • Implementation: fastText & GloVe
  • STEP 2: Procrustes problem: map the WE of the different languages into a shared embedding space.
    • Implementation: refine the WE alignment with a seed dictionary
  • STEP 3: translate each word in source Lan. training data
    • Implementation: nearest-neighbor search in the shared space
  • STEP 4: train an NER model
    • Training data: the translated words & the NE tags from the English corpus

STEP 2: learning bilingual embeddings

  • embedding alignment:
    $$\max_W \; \mathrm{Tr}(X_D W Y_D^T) \quad \text{s.t.} \quad W W^T = I$$
    $$Y_D^T X_D = U \Sigma V^T, \qquad W = U V^T$$
    $$X' = X V, \qquad Y' = Y U$$
  • generate a new dictionary using the aligned embeddings (STEP 3)
  • use the new dictionary to generate a new set of bilingual embeddings (STEP 1)
  • repeat the above k times to obtain the final aligned embeddings: $X_k'$, $Y_k'$
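
The closed-form alignment above can be sketched in NumPy. This is a minimal sketch, not the authors' code: it assumes rows of `X_D` and `Y_D` are paired dictionary vectors and that `W` acts on column vectors (`W @ x ≈ y`). The function names and the synthetic data are hypothetical.

```python
import numpy as np

def procrustes_align(X_D, Y_D):
    """Solve the orthogonal alignment from paired dictionary vectors (rows)."""
    # SVD of Y_D^T X_D = U Sigma V^T, as written in the notes.
    U, _, Vt = np.linalg.svd(Y_D.T @ X_D)
    V = Vt.T
    W = U @ Vt  # orthogonal map; assumed convention: W @ x ≈ y
    return W, U, V

def to_shared_space(X, Y, U, V):
    """Project both vocabularies into the shared space: X' = XV, Y' = YU."""
    return X @ V, Y @ U

# Synthetic check: the target space is an exact rotation of the source space.
rng = np.random.default_rng(0)
d = 5
X = rng.normal(size=(50, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
Y = X @ Q.T                                   # y_i = Q x_i

W, U, V = procrustes_align(X, Y)
X_shared, Y_shared = to_shared_space(X, Y, U, V)
```

On this noiseless toy data the recovered `W` equals the hidden rotation and the two projected matrices coincide. With real embeddings the match is only approximate, which is why the dictionary is re-induced and the alignment repeated.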

STEP 3: learning word translations

  • nearest-neighbor search in the common space
    • distance metric: cross-domain similarity local scaling (CSLS)
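
A minimal NumPy sketch of CSLS nearest-neighbor search, following the definition in Conneau et al.: CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y), where each r is the mean cosine similarity to the k nearest neighbors in the other language. The function name, the value of `k`, and the toy data are assumptions for illustration.

```python
import numpy as np

def csls_translate(src, tgt, k=10):
    """For each source row, return the index of the best target row under CSLS."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T                                # (n_src, n_tgt) cosine matrix
    k_t = min(k, tgt.shape[0])
    k_s = min(k, src.shape[0])
    # r_T(x): mean cosine of each source word to its k nearest target words.
    r_src = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    # r_S(y): mean cosine of each target word to its k nearest source words.
    r_tgt = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)
    csls = 2 * cos - r_src[:, None] - r_tgt[None, :]
    return csls.argmax(axis=1), csls

# Toy data: the target vocabulary is a shuffled, slightly noisy copy of the source.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(30, 50))
perm = rng.permutation(30)
tgt_emb = src_emb[perm] + 0.01 * rng.normal(size=(30, 50))

best, scores = csls_translate(src_emb, tgt_emb, k=5)
```

Subtracting the two neighborhood terms penalizes "hub" words that are close to everything, which plain cosine nearest-neighbor search tends to over-select.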

STEP 4: training the NER model

  • taking English sentences $S = s_1, s_2, \dots, s_n$
  • translate S to target sentences $\hat{T} = \hat{t}_1, \hat{t}_2, \dots, \hat{t}_n$
  • copy the English label to the target Lan.
  • train an NER model directly using the translated data.
    • have access to the surface forms
      • can use the character sequences of the target language as part of its input
  • Some details worth noting
    • Common practice: normalize the WE => they lie on the unit ball
      • each training pair contributes equally to the objective
    • This paper: does not normalize the WE
      • Preliminary expt. gave superior results.
      • frequency information conveyed by vector length => important to NER
      • Reason: the normalization concerns the norm of the word vector, not sentence length. It pushes every point onto the sphere, whereas different norms along the same direction matter for NER
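
STEP 4 translates each source word and copies its label unchanged. A minimal sketch of that step, assuming the induced dictionary is a plain Python dict; `translate_sentence` and the toy entries are hypothetical, and out-of-vocabulary words are left unchanged.

```python
def translate_sentence(tokens, labels, dictionary):
    """Translate word by word and copy the NE labels position by position."""
    # Words missing from the dictionary (OOV) are kept unchanged.
    translated = [dictionary.get(tok, tok) for tok in tokens]
    return translated, list(labels)

# Hypothetical English -> Spanish entries, for illustration only.
toy_dict = {"lives": "vive", "in": "en"}
tokens = ["Maria", "lives", "in", "Madrid"]
labels = ["B-PER", "O", "O", "B-LOC"]

translated, copied = translate_sentence(tokens, labels, toy_dict)
print(list(zip(translated, copied)))
```

Because translation is strictly one word to one word, the label sequence stays aligned with the tokens. The price is that the translated sentence keeps the source word order, which is the issue the self-attention layer is meant to soften.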

  • existing work
    • train NER directly on data using source embeddings
    • directly modeling the shared embedding space
  • This paper
    • performs nearest-neighbor translation in the shared space, with the associated iterations
    • Advantage: can enlarge the parallel dataset & keep correcting the shared WE space.

NER model architecture

Hierarchical Neural CRF + Self-Attention Layer

Hierarchical Neural CRF
  • Layer 1: a character-level NN
    • usually using RNN or CNN
    • 本文: bidirectional LSTMs
    • to capture subword information: morphological variations & capitalization patterns
  • Layer 2: a word-level NN
    • usually using an RNN
    • 本文: bidirectional LSTMs
    • consumes word representations
    • to produce context sensitive hidden representations for each word
  • Layer 3: a linear-chain CRF layer
    • to model the dependency between labels (defines the joint distribution of all possible output label sequences) & perform inference (Viterbi)
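
Viterbi inference in the linear-chain CRF layer can be sketched as follows. This is a generic textbook version, not the paper's code: `emissions` stands in for the per-word scores from the word-level Bi-LSTM, `transitions[i, j]` for the score of moving from label i to label j, and start/stop transitions are omitted for brevity. All names and numbers are illustrative.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Return the highest-scoring label path and its score.

    emissions: (n_words, n_labels); transitions: (n_labels, n_labels).
    """
    n, _ = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros(emissions.shape, dtype=int)
    for t in range(1, n):
        cand = score[:, None] + transitions   # cand[prev, cur]
        backptr[t] = cand.argmax(axis=0)      # best previous label for each current label
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):             # follow back-pointers to the start
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(score.max())

emissions = np.array([[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
sticky = np.array([[0.0, -5.0], [-5.0, 0.0]])  # switching labels is penalized
free = np.zeros((2, 2))                        # no dependency between labels

path_sticky, score_sticky = viterbi(emissions, sticky)
path_free, score_free = viterbi(emissions, free)
```

With the penalizing transition matrix the decoder keeps label 0 throughout, even though the middle word locally prefers label 1. This is the label dependency the CRF layer is there to model.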
Self-Attention Layer
  • Layer 2.5: a single layer MLP
    • to provide each word with a context feature vector
    • irrespective of the words’ position
    • The model is more likely to “see vectors similar to those seen at training time, which we posit introduces a level of flexibility with respect to the word order”

$$K = \tanh(HW+b)$$
$$H^a = \mathrm{softmax}(QK^T)\odot(E-I)H = [h_1^a, h_2^a,\dots,h_n^a]$$
(What are the queries Q, and how are they obtained?)
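
A minimal NumPy sketch of the self-attention formula. It assumes Q = H, i.e. the queries are the Bi-LSTM hidden states themselves. That is one plausible reading, not something these notes confirm. `E` is the all-ones matrix, so `E - I` zeroes each word's attention to itself. Shapes and names are illustrative.

```python
import numpy as np

def self_attention(H, W, b):
    """H: (n_words, d) hidden states; W: (d, d); b: (d,). Returns H^a of shape (n_words, d)."""
    n = H.shape[0]
    K = np.tanh(H @ W + b)
    Q = H                                          # assumption: queries are the hidden states
    scores = Q @ K.T
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=1, keepdims=True)           # row-wise softmax
    mask = np.ones((n, n)) - np.eye(n)             # E - I: drop each word's own entry
    return (A * mask) @ H

rng = np.random.default_rng(0)
n, d = 6, 4
H = rng.normal(size=(n, d))
W = rng.normal(size=(d, d))
b = rng.normal(size=d)

H_a = self_attention(H, W, b)
perm = rng.permutation(n)
H_a_shuffled = self_attention(H[perm], W, b)
```

Shuffling the input rows only shuffles the output rows in the same way: each word's context vector does not depend on where the other words sit. This is the order-invariance property the notes refer to, at least for this layer taken in isolation.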


  • 4 sets of experiments
    • with & without provided dictionaries on a benchmark NER dataset
      • CoNLL 2002 & 2003: English (source), German, Dutch, Spanish
    • compare against a recently proposed dictionary-based translation baseline
    • conduct an ablation study to further understand the proposed methods
    • apply the method to Uyghur
  • word embedding
    • fastText
    • GloVe
    • vocabulary size of 100,000 for both embedding methods.
  • Seed Dictionary
    • use the identical character strings shared by the two vocabularies (depends on language type; distant lang. may not yield good results)
    • use adversarial learning to induce a mapping that aligns the two WE (Lample, 2018)
    • a provided dictionary ([Lample](https://github.com/facebookresearch/MUSE))
  • Translation
    • for Out-Of-Vocabulary (OOV) words: keep them as-is (what does "as-is" mean?)
    • German capitalization: each word's case is decided by the probability of that word being capitalized in Wikipedia
  • Network Parameters
    • character embedding size: 25
    • character level LSTM hidden size: 50
    • word level LSTM hidden size: 200
    • for OOV: initialize an unknown embedding by uniformly sampling from $[-\sqrt{3/emb}, \sqrt{3/emb}]$, emb=100
    • replace each number with 0 when input to the character level Bi-LSTM
  • Network training
    • SGD with momentum
    • 30 epochs => select the best model on the target Lan.
    • learning rate
      • initial: $\eta_0 = 0.015$
      • update: $\eta_t = \frac{\eta_0}{1+\rho t}$
        • t: the number of completed epochs
        • $\rho = 0.05$: decay rate
    • batch size: 10
    • evaluate per 150 batches
    • dropout:
      • inputs to the word-level Bi-LSTM: rate=0.5
      • outputs of the word-level Bi-LSTM: rate=0.5
      • outputs of the self-attention layer:
        • rate=0.5 when using the translated data
        • rate=0.2 when using cheap-translation data
    • word embeddings are not fine-tuned during training.
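
Two of the settings above are simple formulas and can be checked numerically: the learning-rate decay and the uniform initialization of the unknown-word embedding. The function names and the random seed are hypothetical; the constants are the ones listed in the notes.

```python
import numpy as np

def learning_rate(epoch, eta0=0.015, rho=0.05):
    """eta_t = eta_0 / (1 + rho * t), with t the number of completed epochs."""
    return eta0 / (1.0 + rho * epoch)

def init_unknown_embedding(emb=100, seed=0):
    """Sample the OOV vector uniformly from [-sqrt(3/emb), sqrt(3/emb)]."""
    bound = np.sqrt(3.0 / emb)
    rng = np.random.default_rng(seed)
    return rng.uniform(-bound, bound, size=emb)

schedule = [learning_rate(t) for t in range(30)]  # one value per epoch
unk = init_unknown_embedding()

print(f"epoch 0: {schedule[0]:.4f}, epoch 10: {schedule[10]:.4f}, epoch 29: {schedule[29]:.4f}")
```

After 10 completed epochs the rate has dropped from 0.015 to 0.010, so the decay is gentle over the 30-epoch budget.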


  • best in Spanish & Dutch
  • competitive in German
    • rich morphology & compound words => embeddings less reliable
    • a noisier embedding space alignment => lowers the quality of BWE-based translation.
  • Why does translation work better?
    • Common Space
      • trained with the source WE + source character sequence => applied on the target side
      • worst: discrepancy between the two embedding spaces
    • Replace
      • trained with the target WE + source character sequence
    • Translation
      • trained with the target WE + target character sequence
      • best, especially in German => Capitalization





