306

BERT's success in some benchmarks tests may be simply due to the exploitation of spurious statistical cues in the dataset. Without them it is no better then random.

33 comments
96% Upvoted
level 1

Title: Probing Neural Network Comprehension of Natural Language Arguments

Authors: Timothy Niven, Hung-Yu Kao

Abstract: We are surprised to find that BERT's peak performance of 77% on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. However, we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyze the nature of these cues and demonstrate that a range of models all exploit them. This analysis informs the construction of an adversarial dataset on which all models achieve random accuracy. Our adversarial dataset provides a more robust assessment of argument comprehension and should be adopted as the standard in future work.

PDF Link | Landing Page | Read as web page on arXiv Vanity

level 1

This isn't too surprising at all. The same thing happened with the first round of VQA models (and the problem still probably persists, despite people's efforts to balance that dataset). Given how bad people are at simply randomly choosing a number, I don't know why we expect them to generate datasets without statistical imbalances.

level 1
39 points · 12 hours ago · edited 11 hours ago

Love that paper. It's a very simple and effective way of showing that these kinds of models don't properly "understand" and only exploit (bad) statistical cues. That said, I think it was already clear to most people (maybe besides Elon Musk ;) ) that this is what BERT-like models are doing. Still, I have now seen three personal projects where BERT improved a lot over word-embedding-based approaches with extremely few labels (hundreds). This paper also shows the importance of a good metric.

level 2
Original Poster23 points · 11 hours ago

Oh, no doubt... I do believe BERT has value; it's some of these benchmarks I doubt. And when you look at what BERT "accomplishes" on these datasets, it seems like we have practically solved NLP, which creates false hype around these new technologies. That's what worries me.

level 2

Do you have any links for such projects? And for dealing with small label sets in general? I'm currently looking into trying BERT for a project.

level 1
Original Poster80 points · 13 hours ago

I feel like this should have made more waves than it did... We keep hearing about all of these new advances in NLP, with a new, better model every few months achieving unrealistic results. But when someone actually probes the dataset, it looks like these models haven't really learned anything of any meaning. This should really make us take a step back from optimizing models and take a hard look at the datasets themselves and whether they really mean anything.

All this time these results didn't really make sense to me, as the tasks require such high-level thinking, as well as a lot of world knowledge.

level 2

It seems to me that the point you're making in this post is overgeneralizing the paper. Even in the title of this post you say "some" benchmarks (the paper only talks about BERT's performance on ARCT), but in the post you're trying to say that the newer, better NLP models in general haven't learned anything of meaning. To make that point you'd have to show some statistical anomaly in all of the benchmarks where BERT improved on the then state-of-the-art systems. Just by the eye test, though, BERT does seem more effective on NLU tasks.

I agree with your overall point that, if anything, it's clear the benchmarks we use to judge these models correlate imperfectly with human judgment, but this is an already widely known and studied problem. It is, however, quite difficult to come up with better metrics that correlate more closely with human ratings.

level 2

I don't think this is a rational conclusion to draw from the paper. If you have some axe to grind with how deep NLP is done, then, sure, start a thread, but your rhetoric certainly isn't supported by the paper.

level 2

This is every ML/RL model... they don't have brains; it's just self-organizing statistics.

level 1

See also the HANS paper, which deserves more attention too. https://arxiv.org/abs/1902.01007

level 2
Original Poster5 points · 10 hours ago

Wow! Almost exactly the same conclusion, just on another dataset! Looks like a new, and very welcome, trend...

level 1

I tried both sizes of OpenAI's GPT-2 on Colab, and man, do they spit out some BS on summarization tasks. Even the best non-ML approach doesn't spew out information that isn't in the input passage.

level 2

Are you looking for an extractive summarizer?

level 1

Not to trivialize the paper (I really like their approach and conclusion) or recent advances in ML and NLP, but I think this simply confirms what many researchers and practitioners have suspected for a while:

That, inadvertently, some reported advances are, to a certain degree, the product of overfitting to standardized datasets.

level 2

But there's a huge difference between suspecting something and demonstrating it, no?

level 1

I feel like a lot of the commenters may have misinterpreted the paper. It only says that these models (BERT etc.) exploit statistical cues (the presence of "not" and others) for a specific task (ARCT) on a specific dataset. With adversarial samples introduced, BERT's performance drops to about 50%, compared to roughly 80% for untrained humans, which makes sense if we compare BERT vs. humans on other tasks that require deep understanding of text.


In no way did the paper say anything about BERT's ability to learn other tasks - and that makes sense: a learning algorithm never guarantees that the solution it finds is the one you intended in the solution space.
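
To make the "statistical cues" point concrete, here is a rough sketch (not the authors' code) of the kind of check the paper runs: how often a single cue such as "not" points at the correct warrant. The file path and column names (warrant0, warrant1, label) are placeholders for the ARCT data format, not verified names.

```python
# Hedged sketch, not the paper's code: estimate how predictive the single
# token "not" is on an ARCT-style file. Path and column names are assumptions.
import pandas as pd

df = pd.read_csv("arct_train.tsv", sep="\t")  # hypothetical file name

def contains_not(text: str) -> bool:
    return "not" in text.lower().split()

applicable = 0   # rows where exactly one warrant contains the cue ("coverage")
correct = 0      # rows where guessing that warrant is right ("productivity")
for _, row in df.iterrows():
    w0, w1 = contains_not(row["warrant0"]), contains_not(row["warrant1"])
    if w0 != w1:
        applicable += 1
        guess = 0 if w0 else 1                 # pick the warrant containing "not"
        correct += int(guess == row["label"])

print(f"coverage: {applicable / len(df):.2f}")
print(f"productivity: {correct / max(applicable, 1):.2f}")
```

If a trivial rule like this scores well above chance on the rows it applies to, a model can reach high accuracy without doing any argument comprehension at all.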

level 1

Text is the representation of broader concepts in a more heuristic, symbolic way.

It makes sense that a system can't derive an understanding more substantial than basic statistical correlation from purely textual input.

I would expect VQA-type systems to eventually prevail over other NLP-type systems.

level 1

next paper: Human success in some benchmarks tests may be simply due to the exploitation of spurious statistical cues in the dataset.

level 1
1 point · just now

I think the main point of this paper is not to claim that many of BERT's successes are due to the exploitation of spurious cues. The purpose of the paper seems to be to demonstrate the flaw in a particular NLP task, using the strength of BERT. It was clear to everyone from the beginning that BERT and similar models have no chance of achieving such high accuracy on a task that requires deeper logical reasoning. The original BERT paper does not claim success on the ARCT task; the 77% result comes from the authors of this paper. So the main message is: "if BERT can achieve such a high result, then there must be something wrong with the task design."

level 1

Can't wait to read this on the plane

level 1

Than*

level 2
Original Poster3 points · 4 hours ago

I know! As soon as I posted I noticed it, but I couldn't find where I could edit the title...

More posts from the MachineLearning community
350

Intel's ultra-efficient AI chips can power prosthetics and self-driving cars. They can crunch deep learning tasks 1,000 times faster than CPUs.

https://www.engadget.com/2019/07/15/intel-neuromorphic-pohoiki-beach-loihi-chips/

Even though the whole 5G thing didn't work out, Intel is still working hard on its Loihi "neuromorphic" deep-learning chips, modeled after the human brain. It unveiled a new system, code-named Pohoiki Beach, made up of 64 Loihi chips and 8 million so-called neurons. It's capable of crunching AI algorithms up to 1,000 times faster and 10,000 times more efficiently than regular CPUs, for use with autonomous driving, electronic robot skin, prosthetic limbs and more.

The Loihi chips are installed on a "Nahuku" board that contains from 8 to 32 Loihi chips. The Pohoiki Beach system contains multiple Nahuku boards that can be interfaced with Intel's Arria 10 FPGA developer's kit, as shown above.

Pohoiki Beach will be very good at neural-like tasks including sparse coding, path planning and simultaneous localization and mapping (SLAM). In layman's terms, those are all algorithms used for things like autonomous driving, indoor mapping for robots and efficient sensing systems. For instance, Intel said that the boards are being used to make certain types of prosthetic legs more adaptable, powering object tracking via new, efficient event cameras, giving tactile input to an iCub robot's electronic skin, and even automating a foosball table.

The Pohoiki system apparently performed just as well as GPU/CPU-based systems, while consuming a lot less power -- something that will be critical for self-contained autonomous vehicles, for instance. "We benchmarked the Loihi-run network and found it to be equally accurate while consuming 100 times less energy than a widely used CPU-run SLAM method for mobile robots," Rutgers professor Konstantinos Michmizos told Intel.

Intel said that the system can easily scale up to handle more complex problems and later this year, it plans to release a Pohoiki Beach system that's over ten times larger, with up to 100 million neurons. Whether it can succeed in the red-hot, crowded AI hardware space remains to be seen, however.

350
125 comments
302

I taught a one-day course on backpropagation & neural networks from scratch today - here are my materials:

https://github.com/ADGEfficiency/teaching-monolith/blob/master/backprop/intro-to-backprop.ipynb


Hopefully it is of some use to someone :)
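
For anyone who wants the gist before opening the notebook, here is a minimal from-scratch sketch of the topic (not taken from the linked materials): one hidden layer, sigmoid activations, squared-error loss, and manual backpropagation in plain NumPy.

```python
# Minimal backprop-from-scratch sketch: two-layer network on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                  # toy inputs
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # toy targets

W1, b1 = rng.normal(size=(3, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(2000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    loss = np.mean((out - y) ** 2)

    # backward pass: chain rule, layer by layer
    d_out = 2 * (out - y) / len(X) * out * (1 - out)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = d_out @ W2.T * h * (1 - h)
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)

    # gradient descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final loss: {loss:.4f}")
```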

302
16 comments
298

Huggingface has released a new version of their open-source library of pretrained transformer models for NLP: PyTorch-Transformers 1.0 (formerly known as pytorch-pretrained-bert).


The library now comprises six architectures:

  • Google's BERT,

  • OpenAI's GPT & GPT-2,

  • Google/CMU's Transformer-XL & XLNet and

  • Facebook's XLM,

and a total of 27 pretrained model weights for these architectures.


The library focuses on:

  • being superfast to learn & use (almost no abstractions),

  • providing SOTA example scripts as starting points (text classification with GLUE, question answering with SQuAD and text generation using GPT, GPT-2, Transformer-XL, XLNet).


It also provides:

  • a unified API for models and tokenizers,

  • access to the hidden-states and attention weights,

  • compatibility with TorchScript...


Install: pip install pytorch-transformers

Quickstart: https://github.com/huggingface/pytorch-transformers#quick-tour

Release notes: https://github.com/huggingface/pytorch-transformers/releases/tag/v1.0.0

Documentation (work in progress): https://huggingface.co/pytorch-transformers/
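
For a sense of what the unified API looks like, here is a minimal usage sketch along the lines of the quick tour (extracting BERT's hidden states); treat it as a sketch and check the linked quickstart and docs for the authoritative version.

```python
# Minimal usage sketch based on the quick tour; see the linked docs for
# the authoritative API. Requires: pip install pytorch-transformers torch
import torch
from pytorch_transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

text = "[CLS] Transformers are taking over NLP leaderboards . [SEP]"
input_ids = torch.tensor([tokenizer.encode(text)])

with torch.no_grad():
    outputs = model(input_ids)       # tuple; first element is the last hidden states
    last_hidden_states = outputs[0]  # shape: (batch, sequence_length, hidden_size)

print(last_hidden_states.shape)
```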

298
22 comments
280
280
53 comments
262


Link: https://github.com/benedekrozemberczki/awesome-graph-classification


The repository covers techniques such as deep learning, graph kernels, statistical fingerprints and factorization. I update it monthly with new papers when something comes out with code.

262
11 comments
234

Hello, I wanted to share something our team has been working on for a while. I work at an early-stage radiology imaging company where we have the blessing and curse of too much medical imaging data. Something we found internally useful to build was a DICOM Decoder Op for TensorFlow. We are making this available open-source here: https://github.com/gradienthealth/gradient_decode_dicom.

DICOM is an extremely broad standard, so we try to cover the 90% case of image formats (PNG, TIFF, BMP, JPEG, JPEG2000) by relying on the past work folks have done for DCMTK. DCMTK is also largely considered an industry standard when it comes to parsing DICOMs. We also support multi-frame and multi-frame color images. Try the images found here: https://barre.dev/medical/samples/. If an unsupported format is found, an empty Tensor is returned, which can be filtered out. Reading the files directly off of bucket storage has allowed us to prevent duplication of .dcm data (a single CT can be 300MB). You can play with the op in this Colab notebook: https://colab.research.google.com/drive/1MdjXN3XkYs_mSyVtdRK7zaCbzkjGub_B

We firmly believe that having open-source resources in healthcare is what will enable its use in practice, not AI trade secrets. We plan on opening more of our work in the future. DM me if there is interest in contributing to upcoming toolkits (the next one we are thinking of creating is an operation to decrypt+decompress gzip files). Also, lmk if there is interest in working with our dataset (~300M DICOMs + notes). The goal of these project collaborations is that they are ultimately open-sourced.

Anyway, give the operation a try. If there are problems with loading a file of interest, please make an issue on GitHub. Right now only Linux based systems are supported, and a Dockerfile example will be coming soon.

234
34 comments
222

I came across this interesting article about whether larger models + more data = progress in ML research.

How the Transformers broke NLP leaderboards

Excerpt:

The focus of this post is yet another problem with the leaderboards that is relatively recent. Its cause is simple: fundamentally, a model may be better than its competitors by building better representations from the available data - or it may simply use more data, and/or throw a deeper network at it. When we have a paper presenting a new model that also uses more data/compute than its competitors, credit attribution becomes hard.

The most popular NLP leaderboards are currently dominated by Transformer-based models. BERT received the best paper award at NAACL 2019 after months of holding SOTA on many leaderboards. Now the hot topic is XLNet that is said to overtake BERT on GLUE and some other benchmarks. Other Transformers include GPT-2, ERNIE, and the list is growing.

The problem we’re starting to face is that these models are HUGE. While the source code is available, in reality it is beyond the means of an average lab to reproduce these results, or to produce anything comparable. For instance, XLNet is trained on 32B tokens, and the price of using 500 TPUs for 2 days is over $250,000. Even fine-tuning this model is getting expensive.

Wait, this was supposed to happen!

On the one hand, this trend looks predictable, even inevitable: people with more resources will use more resources to get better performance. One could even argue that a huge model proves its scalability and fulfils the inherent promise of deep learning, i.e. being able to learn more complex patterns from more information. Nobody knows how much data we actually need to solve a given NLP task, but more should be better, and limiting data seems counter-productive.

On that view - well, from now on top-tier NLP research is going to be something possible only for industry. Academics will have to somehow up their game, either by getting more grants or by collaborating with high-performance computing centers. They are also welcome to switch to analysis, building something on top of the industry-provided huge models, or making datasets.

However, in terms of overall progress in NLP that might not be the best thing to do. The chief problem with the huge models is simply this:

“More data & compute = SOTA” is NOT research news.

If leaderboards are to highlight the actual progress, we need to incentivize new architectures rather than teams outspending each other. Obviously, huge pretrained models are valuable, but unless the authors show that their system consistently behaves differently from its competition with comparable data & compute, it is not clear whether they are presenting a model or a resource.

Furthermore, much of this research is not reproducible: nobody is going to spend $250,000 just to repeat XLNet training. Given the fact that its ablation study showed only 1-2% gain over BERT in 3 datasets out of 4, we don’t actually know for sure that its masking strategy is more successful than BERT’s.

At the same time, the development of leaner models is dis-incentivized, as their task is fundamentally harder and the leaderboard-oriented community only rewards the SOTA. That, in turn, prices academic teams out of the competition, which will not result in students becoming better engineers when they graduate.

Entire article:

https://hackingsemantics.xyz/2019/leaderboards/

222
47 comments
214
214
48 comments
164

100+ Machine Learning Trading Strategies

https://github.com/firmai/machine-learning-asset-management


- Deep Learning

- Reinforcement Learning

- Evolutionary Strategies

- Stacked Models


If you are interested in industry machine learning for python, feel free to sign up to my newsletter: https://mailchi.mp/ec4942d52cc5/firmai

164
7 comments
160

Hi all! We are Noam Brown and Professor Tuomas Sandholm. We recently developed the poker AI Pluribus, which has proven capable of defeating elite human professionals in six-player no-limit Texas hold'em poker, the most widely-played poker format in the world. Poker was a long-standing challenge problem for AI due to the importance of hidden information, and Pluribus is the first AI breakthrough on a major benchmark game that has more than two players or two teams. Pluribus was trained using the equivalent of less than $150 worth of compute and runs in real time on 2 CPUs. You can read our blog post on this result here.

We are happy to answer your questions about Pluribus, the experiment, AI, imperfect-information games, Carnegie Mellon, Facebook AI Research, or any other questions you might have! A few of the pros Pluribus played against may also jump in if anyone has questions about what it's like playing against the bot, participating in the experiment, or playing professional poker.

We are opening this thread to questions now and will be here starting at 10AM ET on Friday, July 19th to answer them.

EDIT: Thanks for the questions everyone! We're going to call it quits now. If you have any additional questions though, feel free to post them and we might get to them in the future.

160
138 comments
139

I enjoy learning about machine learning.

And I enjoy making videos.

So I made a video reviewing The Hundred-Page Machine Learning Book by Andriy Burkov.

I read it from the perspective of a machine learning engineer and still learned a bunch.

If you haven't checked out the book, it's a great concise read. There's nothing like a complex topic explained simply.

If you do watch the video, any advice on ways to improve or future reviews/topics would be greatly appreciated.

https://youtu.be/btLxTTkSZuY

139
14 comments
136

Dear r/MachineLearning, I am a STEM graduate who became interested, after my Master's, in doing research in Machine Learning. I was promised a PhD, the opportunity to do cutting-edge research, real-world applications and "close" cooperation with industrial partners. But after spending a few months reading and discussing with supervisors, a lot of the work I am expected to do is centered around metaheuristic search and evolutionary computation. And although I find it fascinating, there is some application to machine learning / DNNs, and companies like Uber and Cognizant are adopting it, I feel like it is too much of a niche and mainstream interest doesn't seem to be catching up with it - if there is any to begin with.

I thought it might be helpful to ask you guys, to get a neutral outside-of-the-box opinion.

Particularly as, over the last month, I have been living and working in a scientific bubble, and my prior background is not AI/ML or Computer Science to begin with. So anyone could claim just about anything to me without me being able to evaluate their claims or get any decent outside criticism.

Edit: Wow, I am overwhelmed by the response!

136
56 comments
131

Hi there, we're a London-based research team working on clinical applications of machine learning. Recently, we've been dealing a lot with clinical datasets that exceed 1M+ observations and 20K+ features. We found that traditional dimensionality reduction and feature extraction methods don't deal well with this data without subsampling and are actually quite poor at preserving both global and local structures of the data. To address these issues, we've been looking into Siamese Networks for non-linear dimensionality reduction and metric learning applications. We are making our work available through an open-source project: https://github.com/beringresearch/ivis


So far, we've applied ivis to single-cell datasets, images, and free text - we're really keen to see what other applications could be enabled! We've also run a large number of benchmarks looking at both the accuracy of embeddings and processing speed - https://bering-ivis.readthedocs.io/en/latest/timings_benchmarks.html - and can see that ivis begins to stand out on datasets with 250K+ observations. We're really excited to make this project open source - there's so much more to Siamese Networks than one-shot learning!


EDIT: wow - thank you so much for so many wonderful questions, comments, and criticisms! We had a lot of fun addressing them - we're now off to do some barbecuing before the evening is out, but we'll be back tomorrow to answer any further questions!
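
For a sense of the API shape, here is a minimal usage sketch as I understand it from the docs (scikit-learn-style fit/transform); the parameter names are assumptions, so check the linked documentation before relying on them.

```python
# Hedged usage sketch; see https://github.com/beringresearch/ivis and the
# docs for the authoritative API. Parameter names here are assumptions.
import numpy as np
from ivis import Ivis

X = np.random.rand(10000, 50)         # stand-in for a large clinical matrix

model = Ivis(embedding_dims=2, k=15)  # k = number of nearest neighbours per point
embeddings = model.fit_transform(X)   # Siamese network learns the 2-D embedding

print(embeddings.shape)               # expected: (10000, 2)
```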

131
52 comments
124

I recently generated some new watch designs using StyleGAN, and I thought some of you may find it interesting. All 50,000 images I used to train were sourced from the /r/watches subreddit. These are the results I was able to achieve after 48 hours of training on a GTX 980 Ti. Considering the hardware Nvidia recommends, I'm pretty happy with it!

https://evigio.com/post/generating-new-watch-designs-with-stylegan

124
13 comments
95

Hello,


I have been working on implementing the model from the paper [Few-Shot Adversarial Learning of Realistic Neural Talking Head Models](https://arxiv.org/abs/1905.08233v1) (Zakharov et al.) for my own projects and research. It uses a very interesting GAN setup and piqued my interest.


The paper has been out for a couple of months now and some implementations already exist, although the results they show are not quite at the level of what is seen in the paper. For my implementation I incorporated further recommendations from the paper's authors on various details that were unclear, from the paper alone, to me and to the other existing implementations (added more depth to the network, adjusted the AdaIN parameters, ...).


Due to a lack of compute resources at my disposal and the model being very heavy, I only trained it for 5 epochs on a test dataset (15 times fewer epochs than in the paper, and with a dataset 34 times smaller), but the results look promising so far given the relatively small amount of training that went into it.


Here is an example of the fake faces it generated from facial landmarks and embedding vectors:


More examples, along with the original faces from which the landmarks were extracted, can be seen on my GitHub repo.


https://github.com/vincent-thevenin/Realistic-Neural-Talking-Head-Models


I also made a working demo that uses the webcam and an embedding vector to create live fake faces from your own. There is a link to a video of it on my repo, and the code for the demo will be uploaded soon.


If anyone is interested in using the model, training it further, or improving the project, feel free to take a look and contribute :)


I ended up implementing the paper from scratch for learning purposes, but I would like to thank u/MrCaracara for doing the first implementation I know of.
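
Since the AdaIN parameters are mentioned above: for anyone unfamiliar with the layer, here is a generic PyTorch sketch of adaptive instance normalization, not the repo's actual implementation; in this model the style mean/std would be predicted from the embedding vector.

```python
# Generic AdaIN sketch (not the repo's actual layer): normalize the content
# feature map per channel, then rescale/shift with style statistics.
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, eps=1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, content, style_mean, style_std):
        # content: (N, C, H, W); style_mean, style_std: (N, C, 1, 1)
        n, c = content.shape[:2]
        flat = content.view(n, c, -1)
        mean = flat.mean(dim=2).view(n, c, 1, 1)
        std = (flat.var(dim=2) + self.eps).sqrt().view(n, c, 1, 1)
        normalized = (content - mean) / std
        return style_std * normalized + style_mean

# usage: out = AdaIN()(features, predicted_mean, predicted_std)
```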

95
5 comments