BERT's success in some benchmark tests may be simply due to the exploitation of spurious statistical cues in the dataset. Without them it is no better than random.


Title: Probing Neural Network Comprehension of Natural Language Arguments

Authors: Timothy Niven, Hung-Yu Kao

Abstract: We are surprised to find that BERT's peak performance of 77% on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. However, we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyze the nature of these cues and demonstrate that a range of models all exploit them. This analysis informs the construction of an adversarial dataset on which all models achieve random accuracy. Our adversarial dataset provides a more robust assessment of argument comprehension and should be adopted as the standard in future work.
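To make "spurious statistical cue" concrete, here is a minimal Python sketch of how one might measure how often a single token such as "not" appears in only one of the two candidate warrants, and how often that warrant turns out to be the correct one. The function name `cue_stats`, the tuple layout, and the toy examples are assumptions made up for illustration; this is not the authors' code or data.

```python
def cue_stats(examples, cue):
    """Return (coverage, productivity) of `cue` over two-choice examples.

    Each example is (warrant0, warrant1, label), where label is the index
    (0 or 1) of the correct warrant. This layout is a hypothetical one.
    """
    applicable = 0  # examples where the cue is in exactly one warrant
    productive = 0  # applicable examples where the cued warrant is correct
    for w0, w1, label in examples:
        in0 = cue in w0.lower().split()
        in1 = cue in w1.lower().split()
        if in0 != in1:
            applicable += 1
            cued = 0 if in0 else 1
            if cued == label:
                productive += 1
    coverage = applicable / len(examples) if examples else 0.0
    productivity = productive / applicable if applicable else 0.0
    return coverage, productivity


# Invented toy data, only to exercise the function.
toy = [
    ("it is not safe", "it is safe", 0),
    ("they do not agree", "they agree", 0),
    ("prices are stable", "prices are not stable", 0),
    ("cats are pets", "dogs are pets", 1),
]
coverage, productivity = cue_stats(toy, "not")
print(coverage, productivity)
```

On a two-way task, a productivity well above 0.5 combined with high coverage means a model could beat chance by looking at that one token and nothing else.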

This isn't too surprising at all. The same thing happened with the first round of VQA models (and the problem still probably persists, despite people's efforts to balance that dataset). Given how bad people are at simply randomly choosing a number, I don't know why we expect them to generate datasets without statistical imbalances.

39 points · 12 hours ago · edited 11 hours ago

Love that paper. Very simple and effective way of showing that these kinds of models don't properly "understand" and only exploit (bad) statistical cues. That said, I think it was clear to most people (maybe besides Elon Musk ;) ) that this is what BERT-like models are doing. Still, I have now seen 3 personal projects where BERT improved a lot over word-embedding-based approaches with extremely few labels (100s). Also, this paper shows you the importance of a good metric.

Original Poster · 23 points · 11 hours ago

Oh no doubt... I do believe BERT has value, I doubt some of these benchmarks do... and when looking at what BERT "accomplishes" on these datasets it looks like we practically solved NLP, which creates a fake hype around these new technologies, that's what worries me.

Do you have any links for such projects? And dealing with low labels in general? I’m currently looking into trying BERT for a project.

Original Poster · 80 points · 13 hours ago

I feel like this should have made more waves than it did... We keep hearing about all of these new advances in NLP, with a new, better model every few months, achieving unrealistic results. But when someone actually probes the dataset it looks like these models haven't really learned anything of any meaning. This should really make us take a step back from optimizing models and take a hard look at those datasets and whether they really mean anything.

All this time these results really didn't make sense to me... as they require such high-level thinking, as well as a lot of world knowledge.

It seems to me that the point you’re making in this post is overgeneralizing the paper. Even in the title of this post you say “some” benchmarks (in this case the paper only talks about BERT's performance on ARCT), but in this post you’re trying to say that new, better NLP models in general haven’t learned anything of meaning. To make your point you’d have to point out some statistical anomaly in all the benchmarks where BERT improved upon the then state-of-the-art systems. I think, however, that just by the eye test BERT does seem more effective in NLU tasks.

I agree with your overall point that, if anything, it’s clear that the benchmarks we use to judge these models correlate imperfectly with human judgment, but this is already a widely known and studied problem. It is, however, quite difficult to come up with better metrics that correlate more closely with human ratings.

I don't think this is a rational conclusion to draw from the paper. If you have some axe to grind with how deep NLP is done, then, sure, start a thread, but your rhetoric certainly isn't supported by the paper.

This is every ml/rl model... they don’t have brains, it’s just self-organizing statistics.

See also the HANS paper which also deserves more attention.

Original Poster · 5 points · 10 hours ago

Wow! Almost exactly the same conclusion, just on another dataset! Looks like a new, and very welcome, trend...

I tried both sizes of OpenAI's GPT-2 on Colab and man, do they spit out some BS for summarization tasks. Even the best non-ML approach doesn't spew out information that isn't in the input passage.

Are you looking for an extractive summarizer?

Not to trivialize the paper (I really like their approach and conclusion) and recent advances in ML and NLP, but I think this simply confirms what many researchers and practitioners have suspected for a while:

that some reported advances are, inadvertently and to a certain degree, the product of overfitting to standardized datasets.

But there's a huge difference between suspecting something and demonstrating it, no?

I feel lots of the commenters may have misinterpreted the paper. It only says these models (BERT etc.) exploit statistical cues (the presence of "not" and others) for a specific task (ARCT) on a specific dataset. With adversarial samples introduced, BERT's performance was reduced to 50%, compared to 80% for untrained humans, which makes sense if we look at BERT vs. humans on other tasks that require deep understanding of text.

In no way did the paper say anything about BERT's ability to learn other tasks, and that makes sense: a learning algorithm never guarantees that the solution it finds is the one you intended.
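For anyone wondering why adversarial samples push a cue-exploiting model to 50%: as I understand the paper, the adversarial set is built by negating each claim and flipping the label, so every warrant-level cue ends up equally often on the right and the wrong side. Below is a rough Python sketch of that idea. The helper names (`productivity`, `negate`, `mirror`), the string-prefix negation, and the toy examples are all hypothetical stand-ins, not the authors' procedure or data.

```python
def productivity(examples, cue):
    """Fraction of applicable examples where the warrant containing `cue` is correct.

    Each example is (claim, warrant0, warrant1, label); layout is an assumption.
    """
    applicable = productive = 0
    for _claim, w0, w1, label in examples:
        in0 = cue in w0.lower().split()
        in1 = cue in w1.lower().split()
        if in0 != in1:  # cue appears in exactly one of the two warrants
            applicable += 1
            productive += int((0 if in0 else 1) == label)
    return productive / applicable if applicable else 0.0


def negate(claim):
    # Hypothetical placeholder; real claim negation needs far more care.
    return "it is not the case that " + claim


def mirror(examples):
    """Add a copy of every example with the claim negated and the label flipped."""
    flipped = [(negate(c), w0, w1, 1 - label) for c, w0, w1, label in examples]
    return examples + flipped


# Invented toy data, only to exercise the functions.
toy = [
    ("we should ban x", "x is not safe", "x is safe", 0),
    ("we should fund y", "y does not work", "y works", 0),
    ("we should keep z", "z is cheap", "z is not cheap", 0),
]
before = productivity(toy, "not")
after = productivity(mirror(toy), "not")
print(before, after)
```

Because each warrant pair now occurs once with each label, any cue that lives only in the warrants has a productivity of exactly 0.5 on the mirrored set, which is chance on a two-way task.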

Text is the representation of broader concepts in a more heuristic, symbolic way.

It makes sense that a system can’t derive an understanding more substantial than basic statistical correlation from purely text input.

I would expect VQA-type systems to eventually prevail over other NLP-type systems.

Next paper: Human success in some benchmark tests may be simply due to the exploitation of spurious statistical cues in the dataset.

1 point · just now

I think the main point of this paper is not to claim that many of BERT's successes are due to the exploitation of spurious cues. The purpose of the paper seems to be to demonstrate the flaw in a particular NLP task, using the strength of BERT. It was clear to everyone from the beginning that BERT or similar models have no chance of achieving such high accuracy on a task that requires deeper logical reasoning. The original BERT paper does not claim success on the ARCT task. The 77% result comes from the authors of this current paper. So the main message is "if BERT can achieve such a high result, then there must be something wrong with the task design."

cant wait to read this on the plane

Original Poster · 3 points · 4 hours ago

I know! As soon as I posted I noticed it but couldn't find where I can edit the title...

