A broad ML benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms.
Submission Deadline
July 31st, 2018
Overview
The MLPerf effort aims to build a common set of benchmarks that enables the machine learning (ML) field to measure system performance for both training and inference from mobile devices to cloud services. We believe that a widely accepted benchmark suite will benefit the entire community, including researchers, developers, builders of machine learning frameworks, cloud service providers, hardware manufacturers, application providers, and end users.
Supporting companies
Contributions by researchers from
Historical Inspiration
We are motivated in part by the Standard Performance Evaluation Corporation (SPEC) benchmarks for general-purpose computing and the Transaction Processing Performance Council (TPC) benchmarks for database systems, which drove rapid, measurable performance improvements in both fields for decades starting in the 1980s.
Goals
Learning from the 40-year history of benchmarks, MLPerf has these primary goals:
Accelerate progress in ML via fair and useful measurement
Serve both the commercial and research communities
Enable fair comparison of competing systems yet encourage innovation to improve the state-of-the-art of ML
Enforce replicability to ensure reliable results
Keep benchmarking effort affordable so all can participate
General Approach
Our approach is to select a set of ML problems, each defined by a dataset and quality target, then measure the wall clock time to train a model for each problem.
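As a sketch of what this measurement looks like in practice, the harness below trains until a validation quality target is reached and reports the elapsed wall clock time. The names train_one_epoch, evaluate, and the target value are placeholders standing in for a benchmark's reference implementation, not official MLPerf code.

import time

def time_to_quality(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """Train until validation quality reaches the target; report wall clock time.

    `train_one_epoch` and `evaluate` are placeholders for a benchmark's
    training step and validation-set quality metric (e.g. top-1 accuracy).
    """
    start = time.time()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        quality = evaluate()
        if quality >= target_quality:
            return {"epochs": epoch, "quality": quality, "seconds": time.time() - start}
    raise RuntimeError(f"target quality {target_quality} not reached in {max_epochs} epochs")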
Broad and Representative Problems
Like the Fathom benchmark, the MLPerf suite aims to reflect different areas of ML that are important to the commercial and research communities and where open datasets and models exist. Here is our current list of problems:
Image Classification
Object Detection
Speech-to-Text
Translation
Recommendation
Sentiment Analysis
Reinforcement Learning
Closed and Open Model Divisions
Balancing fairness and innovation is a difficult challenge for all benchmarks. Inspired by the Sort benchmarks, we take a two-pronged approach.
1. The MLPerf Closed Model Division specifies the model to be used and restricts the values of hyperparameters, e.g. batch size and learning rate, with the emphasis on fair comparisons of the hardware and software systems. (The Sort equivalent is called “Daytona,” alluding to the stock cars raced in the Daytona 500.)
2. In the MLPerf Open Model Division, competitors must solve the same problem using the same dataset but with fewer restrictions, with the emphasis on advancing the state of the art of ML. (The Sort equivalent is called “Indy,” alluding to the faster custom race cars built for events like the Indianapolis 500.)
Ideally, the advances developed in the Open division will be incorporated into future generations of the Closed benchmarks.
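To make the distinction concrete, the sketch below contrasts the two divisions as hypothetical configuration. The problem, model, quality target, and hyperparameter values are illustrative examples only, not the official MLPerf specifications.

# Hypothetical illustration of the two divisions; every value here is an
# example for explanation, not an official MLPerf rule.
CLOSED_DIVISION = {
    "problem": "Image Classification",
    "dataset": "ImageNet",
    "target_quality": 0.749,        # example top-1 accuracy target
    "model": "ResNet-50",           # architecture is prescribed
    "hyperparameters": {            # values constrained by the rules
        "batch_size": 256,
        "learning_rate": 0.1,
    },
}

OPEN_DIVISION = {
    "problem": "Image Classification",
    "dataset": "ImageNet",              # same data and quality target...
    "target_quality": 0.749,
    "model": "submitter's choice",      # ...but any model and training recipe
    "hyperparameters": "submitter's choice",
}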
System Performance Metrics
Following the precedent of DAWNBench, the primary MLPerf metric is the wall clock time to train a model to a target quality, which is often a matter of hours or days. The target quality is based on the original publication's result, less a small delta to allow for run-to-run variance.
Following SPEC’s precedent, we will publish a score that summarizes performance for our set of Closed or Open benchmarks: the geometric mean of results for the full suite.
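As a minimal sketch, the suite score can be computed as below. The assumption that each per-benchmark result is first normalized to a reference time (as SPEC ratios are), so that larger is better, is ours for illustration rather than a finalized MLPerf rule.

import math

def suite_score(results):
    """Geometric mean of per-benchmark results.

    Assumes each entry is a ratio such as reference_time / measured_time
    (larger is better); this normalization is an assumption, not an
    official MLPerf rule.
    """
    return math.exp(sum(math.log(r) for r in results) / len(results))

# Example: seven per-benchmark ratios collapse into a single suite score.
print(suite_score([1.8, 2.1, 0.9, 1.5, 2.4, 1.2, 1.7]))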
SPEC also reports power (a useful proxy for cost), and DAWNBench reports cloud cost. MLPerf will report power for mobile or on-premise systems and cost for cloud systems.
Agile Benchmark Development
There are many details to specify to set the ground rules of MLPerf. Agile programming tells us that frequent feedback works better than heavyweight planning. Hence, we will rapidly iterate based on feedback from users in the ML community rather than try to anticipate every potential issue in order to perfect the benchmark beforehand.
Organization
The current MLPerf effort is a collaboration of researchers and engineers representing interested organizations. It is our hope and plan that this effort grows into a much wider collaboration that spans many academic groups, companies, and other organizations. Indeed, the successful SPEC and TPC teams that define and evolve their benchmarks both consist of volunteer representatives of the many stakeholders. We invite the community to join us in making this benchmark suite even better!
Datasets and Model Sources

Image Classification

Dataset: Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M. S.; Berg, A. C. & Li, F.-F. (2014), 'ImageNet Large Scale Visual Recognition Challenge', CoRR abs/1409.0575.

Model: He, K.; Zhang, X.; Ren, S. & Sun, J. (2015), 'Deep Residual Learning for Image Recognition', CoRR abs/1512.03385.

Object Detection

Dataset: Lin, T.-Y.; Maire, M.; Belongie, S. J.; Bourdev, L. D.; Girshick, R. B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P. & Zitnick, C. L. (2014), 'Microsoft COCO: Common Objects in Context', CoRR abs/1405.0312.

Model: He, K.; Gkioxari, G.; Dollár, P. & Girshick, R. B. (2017), 'Mask R-CNN', CoRR abs/1703.06870.

Translation

Dataset: WMT English-German from Bojar, O.; Buck, C.; Federmann, C.; Haddow, B.; Koehn, P.; Monz, C.; Post, M. & Specia, L., ed.  (2014), Proceedings of the Ninth Workshop on Statistical Machine Translation, Association for Computational Linguistics, Baltimore, Maryland, USA.

Model: Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L. & Polosukhin, I. (2017), 'Attention Is All You Need', CoRR abs/1706.03762.

Speech-to-Text

Dataset: Panayotov, V.; Chen, G.; Povey, D. & Khudanpur, S. (2015), Librispeech: An ASR corpus based on public domain audio books, in '2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)', pp. 5206-5210.

Model: Amodei, D.; Anubhai, R.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Chen, J.; Chrzanowski, M.; Coates, A.; Diamos, G.; Elsen, E.; Engel, J.; Fan, L.; Fougner, C.; Han, T.; Hannun, A. Y.; Jun, B.; LeGresley, P.; Lin, L.; Narang, S.; Ng, A. Y.; Ozair, S.; Prenger, R.; Raiman, J.; Satheesh, S.; Seetapun, D.; Sengupta, S.; Wang, Y.; Wang, Z.; Wang, C.; Xiao, B.; Yogatama, D.; Zhan, J. & Zhu, Z. (2015), 'Deep Speech 2: End-to-End Speech Recognition in English and Mandarin', CoRR abs/1512.02595.

Recommendation

Dataset: Harper, F. M. & Konstan, J. A. (2015), 'The MovieLens Datasets: History and Context', ACM Trans. Interact. Intell. Syst. 5(4), 19:1--19:19.

Model: He, X.; Liao, L.; Zhang, H.; Nie, L.; Hu, X. & Chua, T.-S. (2017), 'Neural Collaborative Filtering', CoRR abs/1708.05031.

Sentiment Analysis

Dataset: Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y. & Potts, C. (2011), Learning Word Vectors for Sentiment Analysis, in 'Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies', Association for Computational Linguistics, Portland, Oregon, USA, pp. 142--150.

Model: Johnson, R. & Zhang, T. (2014), 'Effective Use of Word Order for Text Categorization with Convolutional Neural Networks', CoRR abs/1412.1058.

Reinforcement Learning

Dataset: Games from the Iyama Yuta 6-Title Celebration, between contestants Murakawa Daisuke, Sakai Hideyuki, Yamada Kimio, Hyakuta Naoki, Yuki Satoshi, and Iyama Yuta.

Model: Tensorflow/minigo implementation by Andrew Jackson.

Contact
General questions: info@mlperf.org
Technical questions: support@mlperf.org
Join the announce list