Keynote this morning at #INLG2019: Philipp Koehn with "Challenges for Neural Sequence Generation Models - Insights from Machine Translation"
#MachineTranslation is #NaturalLanguageGeneration, inspired by other ideas
Koehn: Of course we saw headlines about Google's NMT when it was producing nonsense based on the training data (talking about the second coming of Jesus, etc). Now we'll talk a bit more about "hallucinations", when neural MT goes bad, and how this relates to NLG
Koehn: neural MT breaks even on BLEU with SMT after about 100 million words of training data. pic.twitter.com/oADaFP0GUG
See some examples of how limited training data results in much worse NMT translations. pic.twitter.com/X0sxTvqtUi
The green bars here are all a bit shorter than the blue, indicating that NMT does worse out of domain than SMT. pic.twitter.com/WmnRnHwm7u
Koehn: It used to be that expanding the beam for SMT yielded consistent improvements at the cost of more compute time. However, if you expand the beam too much in NMT you start seeing clear quality degradations.
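For reference, a generic beam-search sketch showing the knob in question: `beam_size` trades compute for a wider search. `step_fn` is a hypothetical stand-in for a model that maps a prefix to scored next tokens; none of these names come from the talk.

```python
# Minimal beam-search sketch; step_fn(prefix) is a hypothetical model
# hook returning a list of (next_token, log_prob) candidates.
def beam_search(step_fn, beam_size: int, max_len: int = 20):
    beams = [([], 0.0)]                      # (token prefix, cumulative log prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "</s>":
                candidates.append((prefix, score))   # finished hypothesis
                continue
            for tok, logp in step_fn(prefix):
                candidates.append((prefix + [tok], score + logp))
        # A wider beam keeps more hypotheses per step; the talk's point is
        # that in NMT very wide beams can *hurt* output quality.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]
```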
Koehn: We explored the impact of noise on NMT compared to SMT, and across a variety of noise types we found that NMT is less robust than SMT.
Here's the most extreme example: NMT suffers hugely if you have untranslated sentences in your corpus. pic.twitter.com/jM3QWr68uy
Koehn: To address these problems we created the WMT 2019 filtering task, where the goal is to filter out noisy sentence pairs from the data before training. Looking at systems trained on different amounts of noise we see the same trend across a variety of systems: SMT > NMT
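As an illustration of what such filtering can look like, here's a toy heuristic filter. The rules and thresholds are invented for illustration; the WMT 2019 task itself scored submitted filters by downstream translation quality.

```python
def keep_pair(src: str, tgt: str) -> bool:
    """Toy corpus-cleaning heuristics: drop empty sides, implausible
    length ratios, and likely-untranslated (copied) segments."""
    src_toks, tgt_toks = src.split(), tgt.split()
    if not src_toks or not tgt_toks:
        return False                              # empty side
    ratio = len(src_toks) / len(tgt_toks)
    if ratio > 3.0 or ratio < 1 / 3.0:
        return False                              # implausible length ratio
    overlap = len(set(src_toks) & set(tgt_toks)) / len(set(src_toks))
    return overlap <= 0.6                         # too much copying -> untranslated

pairs = [("ein kleines Haus", "a small house"), ("hello world", "hello world")]
clean = [p for p in pairs if keep_pair(*p)]       # drops the untranslated pair
```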
Koehn: So what is happening, and why? Visualization, probing internal states, and tracing decisions back to inputs are the approaches we're taking to try to understand the problem.
Koehn: In one analysis aiming to detect hallucinations, we looked at the KL divergence between the NMT model's next-word predictions and a language model's. High KL divergence indicates instances where the input matters more than the LM prior. This approach did not really succeed :/
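A minimal sketch of that probe, assuming you have next-word logits over a shared vocabulary from the NMT model and a target-side LM (the names here are illustrative, not from the talk's code):

```python
import torch
import torch.nn.functional as F

def next_word_kl(nmt_logits: torch.Tensor, lm_logits: torch.Tensor) -> torch.Tensor:
    """KL(P_nmt || P_lm) for the next word; a high value suggests the
    source input, rather than the LM prior, is driving the prediction."""
    log_p = F.log_softmax(nmt_logits, dim=-1)     # NMT next-word distribution
    log_q = F.log_softmax(lm_logits, dim=-1)      # LM next-word distribution
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)
```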
Koehn: So the next thing we tried was based on the idea of saliency from vision. Idea: if changes in the input cause changes in the output, then the input mattered. (Makes me wonder how the sizes of the changes in input and output relate to one another.)
Koehn: "Just looking at attention weights often doesn't tell you much", based on Ding et al. 2019 @ WMT, where they showed saliency was more informative wrt word alignments
Koehn: Where to next? We could look at this as an ML problem, arguing that it's overfitting on the data, failing to generalize to data outside the model's comfort zone. Or perhaps arguing that it's 'exposure bias' - the models never see bizarre data during training time...
... so the model doesn't know how to cope with this scenario at test time if it starts a sentence with an odd prefix.
Koehn: So why don't we make the data more diverse? We can do data synthesis (e.g. by vocabulary replacement). I watered my flowers -> I/We/You watered my roses/plants/etc.
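A toy version of that synthesis step (the substitution table is invented, echoing the flowers/roses example; in MT you would apply matching replacements to both sides of the pair):

```python
import itertools
import random

# Hypothetical substitution table for illustration only.
SUBSTITUTIONS = {
    "I": ["I", "We", "You"],
    "flowers": ["flowers", "roses", "plants"],
}

def synthesize(sentence: str, n: int = 5) -> list[str]:
    """Generate up to n new sentences by swapping listed words."""
    options = [SUBSTITUTIONS.get(w, [w]) for w in sentence.split()]
    variants = {" ".join(v) for v in itertools.product(*options)}
    variants.discard(sentence)                # keep only genuinely new ones
    return random.sample(sorted(variants), min(n, len(variants)))

print(synthesize("I watered my flowers"))
```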
Koehn: We can also look at paraphrasing: paraphrase the source, the target, or both, and create new training instances that way.
Koehn: Paraphrasing helps more when you paraphrase the target side (Hu et al. 2019 @ ACL, ongoing work)
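A sketch of that augmentation, with `paraphrase()` as a stand-in for whatever paraphrase model you have (e.g. a round-trip translation system); this is not Hu et al.'s code:

```python
def augment_with_target_paraphrases(corpus, paraphrase, k: int = 2):
    """corpus: list of (src, tgt) pairs -> originals plus pairs whose
    target side has been paraphrased (the setting reported to help most)."""
    augmented = list(corpus)
    for src, tgt in corpus:
        for new_tgt in paraphrase(tgt, num_variants=k):
            augmented.append((src, new_tgt))  # same source, paraphrased target
    return augmented
```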
Koehn: So far we've been talking about changing data, which is nice, but we can also change how we do ML. For example, adding noise is now a standard practice (e.g. dropout and label smoothing)
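For concreteness, the textbook label-smoothing loss (standard formulation, not code from the talk):

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits: torch.Tensor, target: torch.Tensor,
                       eps: float = 0.1) -> torch.Tensor:
    """Cross-entropy where eps of the probability mass is spread
    uniformly over the vocabulary instead of all on the gold token."""
    log_probs = F.log_softmax(logits, dim=-1)                 # (batch, vocab)
    nll = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.mean(dim=-1)          # uniform-prior component
    return ((1.0 - eps) * nll + eps * smooth).mean()
```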
Koehn: So what's the outlook? Open questions: Why are neural models overfitting? Can we detect hallucinations? Can we counteract this behavior, whether with diversity in training data or with better machine learning?
QA Discussion: What are your thoughts on pipelines? Can they help with some of these generalization issues? Koehn: These days the trend is in the other direction: folks even want to do audio-to-audio MT. So far the syntax- and phrase-based approaches we used to use have not helped with NMT.
QA Discussion: On detecting hallucinations in target sentences, have you thought about quality in the input? Koehn: Usually we have to translate what we have to translate.
QA Discussion: One difference between NLG and MT is that we're starting with structured meanings, as opposed to text-to-text where the meaning may be more ambiguous. The parallel would be MT with back-translation similarity to the source text. Some folks have looked at this...
...but usually after realization is complete rather than online, recognizing where things go wrong as they go wrong
QA Discussion: Interested in the saliency aspect, which has been used a bit in image description/captioning. To what extent can you learn more from manipulating the source texts? Images can allow much more subtle manipulations than text.