Reading notes: Cheap Translation for Cross-Lingual NER (2017), University of Illinois at Urbana-Champaign
- Proposes a cheap translation algorithm built around a translation lexicon (dictionary)
- The algorithm can be combined with wikifier features, Brown cluster features, etc., for better results
- Experiments show that when the source and target languages are highly similar, performance can be improved further
- Investigates the dictionary size cheap translation needs, and how other language-specific techniques affect model performance
Abstract
- 目标
- NER for low-resource languages
- relying only on very minimal resources
- 方法概述
- use a lexicon to translate annotated high-resource language data into the target language (producing training data)
- then train a monolingual NER model on the translated target-language data
- 结果
- when Wikipedia data for the target language is available, the method further improves Wikipedia-based methods and achieves state-of-the-art results
Cheap Translation
- core idea: translate annotated high-resource language data into the target language
- using: a lexicon, rather than parallel text
- challenge: the lexicon maps a source word to several options, like a phrase translation table into the target language
- prominence score: normalized co-occurrence counts within each candidate set
- greedy decoding method (see the sketch at the end of this section)
- choose a candidate and copy the source labels onto it
- options from the lexicon are resolved by a language model score (Stolcke, 2002) multiplied by each option's prominence score
- Caution: no reordering of the translated result
- Result
- annotated data in the target language
- despite translation mistakes, the context around the entities (which matters for NER) is reasonably well preserved
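A minimal sketch of the greedy decoding step, assuming a lexicon that maps each source word to (candidate, prominence) pairs; `lm_score` is a uniform stand-in for the n-gram LM (SRILM per the Stolcke, 2002 citation), and all names are illustrative rather than the authors' code.

```python
from typing import Dict, List, Tuple

# Assumed lexicon shape: source word -> [(target candidate, prominence)],
# where prominence = normalized co-occurrence count within the candidate set.
Lexicon = Dict[str, List[Tuple[str, float]]]

def lm_score(prev_words: List[str], word: str) -> float:
    """Placeholder for an n-gram LM probability (SRILM in the paper)."""
    return 1.0  # uniform; a real LM would score the n-gram here

def cheap_translate(tokens: List[str], labels: List[str],
                    lexicon: Lexicon) -> Tuple[List[str], List[str]]:
    """Greedy left-to-right decoding: pick the candidate maximizing
    lm_score * prominence and copy the source NER label to it.
    No reordering is performed, matching the caveat above."""
    out_tokens, out_labels = [], []
    for tok, lab in zip(tokens, labels):
        candidates = lexicon.get(tok.lower(), [(tok, 1.0)])  # copy OOV through
        best_word, _ = max(
            candidates,
            key=lambda c: lm_score(out_tokens[-2:], c[0]) * c[1])
        out_tokens.append(best_word)
        out_labels.append(lab)  # label is copied verbatim
    return out_tokens, out_labels
```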
NER Model
- Illinois NER system
- standard features: word form, capitalization, affixes, previous word, next word, etc. (see the feature sketch after this list)
- Brown cluster features
- multilingual gazetteers
- wikifier features
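A hedged sketch of what such token-level features might look like (surface form, capitalization, affixes, neighboring words, Brown cluster bit-string prefixes). The feature names and the `brown_clusters` lookup are assumptions for illustration, not the Illinois NER implementation.

```python
from typing import Dict, List

def token_features(tokens: List[str], i: int,
                   brown_clusters: Dict[str, str]) -> Dict[str, str]:
    """Illustrative feature map for token i."""
    w = tokens[i]
    feats = {
        "form": w,
        "is_cap": str(w[:1].isupper()),
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i + 1 < len(tokens) else "<EOS>",
    }
    # Brown cluster features: bit-string prefixes at several depths,
    # so rare words share features with frequent cluster-mates.
    path = brown_clusters.get(w.lower())
    if path:
        for depth in (4, 6, 10):
            feats[f"brown{depth}"] = path[:depth]
    return feats
```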
Experiment
baseline
- train on English => apply the model directly to the target language
- no gazetteers
- result
- doesn't work on non-Latin-script languages
cheap translation
- translate English into the target language (lexicon coverage)
- Yoruba
- normalize the text (remove all pronunciation/diacritic markers) => makes the data less sparse (see the normalization sketch after this list)
- Bengali & Tamil
- omit the word surface form as a feature
- result
- 14.6 F1 points of improvement over the baseline
- the proposed approach is orthogonal to other approaches, and can be combined with them to great effect
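For the Yoruba normalization step, one plausible implementation (an assumption, not necessarily the authors' exact procedure) is to strip combining diacritics via Unicode decomposition:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone/pronunciation diacritics) so variants
    of the same word collapse together, making the data less sparse."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

print(strip_diacritics("Ọjọ́ àjíǹde"))  # -> "Ojo ajinde"
```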
+ Wikifier Features
- obtained by grounding words & phrases to English Wikipedia pages
- using the categories of the linked page as NER features for the surface text (see the toy sketch after this list)
- result
- improves scores for all 7 languages
- improvement on European languages > other languages => it is advantageous to select a source language similar to the target language
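A toy illustration of the wikifier-feature idea: ground a surface phrase to an English Wikipedia page and reuse that page's categories as language-independent features. The hard-coded lookup table is a stand-in for a real cross-lingual wikifier.

```python
from typing import Dict, List

# Toy stand-in for a cross-lingual wikifier: phrase -> page categories.
WIKIFIER: Dict[str, List[str]] = {
    "parís": ["Capitals in Europe", "Cities in France"],
    "unesco": ["United Nations agencies"],
}

def wikifier_features(phrase: str) -> List[str]:
    """Categories of the linked English Wikipedia page, used as NER
    features for the surface text."""
    cats = WIKIFIER.get(phrase.lower(), [])
    return [f"wiki_cat={c.replace(' ', '_')}" for c in cats]

print(wikifier_features("París"))  # ['wiki_cat=Capitals_in_Europe', ...]
```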
Google Translate
- translate English into the other languages with Google Translate
- align the source-target data using fast_align
- the alignments can be noisy given the relatively small amount of text
- project labels across the alignments (see the sketch after this list)
- result
- “As with the other approaches, Brown cluster features are an important signal.”
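A minimal sketch of projecting labels over word alignments in the `i-j` Pharaoh format that fast_align emits; the projection heuristic (unaligned target tokens default to O) is an assumption about a standard setup, not the paper's exact rule.

```python
from typing import List, Tuple

def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse fast_align output like '0-0 1-2 2-1' into (src, tgt) pairs."""
    return [tuple(map(int, pair.split("-"))) for pair in line.split()]

def project_labels(src_labels: List[str], tgt_len: int,
                   alignment: List[Tuple[int, int]]) -> List[str]:
    """Copy each aligned source token's label to its target token;
    unaligned target tokens default to 'O'. Noisy alignments therefore
    yield noisy labels, as the notes point out."""
    tgt_labels = ["O"] * tgt_len
    for src_i, tgt_i in alignment:
        tgt_labels[tgt_i] = src_labels[src_i]
    return tgt_labels

print(project_labels(["B-LOC", "O", "O"], 3, parse_alignment("0-1 1-0 2-2")))
# ['O', 'B-LOC', 'O']
```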
Translation from a Similar Language
- requires 2 new resources
- annotated data in a similar Lan. S
- “it is likely that there exists an annotated dataset in a closer language.”
- a lexicon: S -> T (the target language)
- pivot through English: collect all English words that appear in both dictionaries => two sets of candidate translations (see the pivot sketch after this list)
- too many entries, some incorrect, but the correct entry is usually among them
- choice of the source language: guided by WALS (World Atlas of Language Structures)
- result
- 5.4 points higher than the English-source setting
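A sketch of the pivot idea as I read it: given S-English and T-English lexicons, link an S word to every T word that shares an English translation. The data structures are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Set

def pivot_lexicon(s_to_en: Dict[str, Set[str]],
                  t_to_en: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Build an S -> T candidate lexicon by pivoting through shared
    English translations. This produces many entries, some incorrect,
    but the correct translation is usually among the candidates."""
    en_to_t = defaultdict(set)  # invert the T -> English dictionary
    for t_word, en_words in t_to_en.items():
        for en in en_words:
            en_to_t[en].add(t_word)
    s_to_t: Dict[str, List[str]] = {}
    for s_word, en_words in s_to_en.items():
        candidates: Set[str] = set()
        for en in en_words:
            candidates |= en_to_t.get(en, set())
        if candidates:
            s_to_t[s_word] = sorted(candidates)
    return s_to_t
```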
Translation from a Similar Language + Wikifier Features
- result
- the best scores among the cross-lingual settings
Other Experiments
Dictionary Ablation: what effect does the size of the lexicon have on the end result?
- only a small number of dictionary entries are likely to be useful
- a small but carefully constructed manual dictionary could have a large impact.
Uyghur
- no wikifier features: the Uyghur Wikipedia is too small
- Dictionary: LORELEI + Rolston => 116k entries
- Name Substitution: replace untranslatable entities with a randomly selected NE from the gazetteer list for the corresponding tag => the sentence is not fluent, but the NEs are fluent in the target text (see the sketch after this list)
- Stemming: remove all possible suffixes
- omit the surface form feature for Bengali & Tamil
- result
- stemming helps => makes the features denser
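A hedged sketch of the name-substitution step; the gazetteer format (tag -> list of target-language names) and the `translatable` set are assumed shapes, not the authors' data structures.

```python
import random
from typing import Dict, List, Set

def substitute_names(tokens: List[str], labels: List[str],
                     translatable: Set[str],
                     gazetteer: Dict[str, List[str]]) -> List[str]:
    """Replace entity tokens the lexicon cannot translate with a random
    target-language NE of the same type. The sentence stays disfluent,
    but the entities themselves become fluent target-language names."""
    out = []
    for tok, lab in zip(tokens, labels):
        if lab != "O" and tok not in translatable:
            tag = lab.split("-")[-1]  # e.g. 'B-PER' -> 'PER'
            out.append(random.choice(gazetteer[tag]))
        else:
            out.append(tok)
    return out
```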
+ generate dictionaries using observations over Uyghur and Uzbek
- manual mapping: 100 words
- edit-distance mapping: a modified edit-distance algorithm
- cross-lingual CCA with word vectors
- use CCA to project Uyghur and Uzbek monolingual vectors into a shared semantic space (see the CCA sketch below)
- take the closest Uyghur word to each Uzbek word
- result
- each language-specific technique alone: not much gain
- combining ALL language-specific techniques: 10 points higher
- the different kinds of training data cover many angles
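A sketch of the CCA mapping with scikit-learn, using random toy vectors in place of real Uzbek/Uyghur embeddings and a seed list standing in for the 100 manually mapped word pairs; it illustrates the technique, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_uz = rng.normal(size=(100, 50))  # Uzbek vectors for seed pairs (toy data)
X_ug = rng.normal(size=(100, 50))  # Uyghur vectors for the same pairs

# Fit CCA on the seed translation pairs to learn a shared semantic space.
cca = CCA(n_components=20)
cca.fit(X_uz, X_ug)

def closest_uyghur(uz_vec, ug_vocab_vecs, ug_vocab):
    """Project both sides into the shared CCA space and return the
    Uyghur word nearest (by cosine similarity) to the Uzbek vector."""
    uz_c, ug_c = cca.transform(uz_vec.reshape(1, -1), ug_vocab_vecs)
    uz_c /= np.linalg.norm(uz_c)
    ug_c /= np.linalg.norm(ug_c, axis=1, keepdims=True)
    return ug_vocab[int(np.argmax(ug_c @ uz_c.T))]

ug_vocab = [f"ug_word_{i}" for i in range(200)]
ug_vecs = rng.normal(size=(200, 50))
print(closest_uyghur(rng.normal(size=50), ug_vecs, ug_vocab))
```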