Benchmark TensorFlow #66

Open
soumith opened this Issue · 9 comments

9 participants

@soumith
Owner

Google's TensorFlow benchmarks are here!

I've run the benchmarks on the ImageNet winners.
When I saw issues with the numbers, memory usage, etc., I emailed @Yangqing to confirm that what I'm seeing is expected.

With that disclaimer out of the way, here are some things you should know about TensorFlow (as of the pip version I installed today):

  • in-place ReLU seems non-existent in practice (see the short sketch after this list for what in-place saves).
    • Yangqing says: "right now there are little in-place operations in TensorFlow and we pretty much rely on the scheduler and the memory pool to allocate and deallocate memory"
  • Supports CuDNN R2. No R3 support yet; Yangqing says the next version they are likely to support is R4.
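
For readers unfamiliar with the distinction, here is a minimal sketch of what an in-place ReLU buys you (plain numpy standing in; this is not TensorFlow's or Torch's actual API): the out-of-place version allocates a second activation-sized buffer, while the in-place version overwrites its input.

```python
import numpy as np

def relu_out_of_place(x):
    # allocates a fresh buffer the size of the activation
    return np.maximum(x, 0.0)

def relu_in_place(x):
    # overwrites x's storage; no extra activation-sized allocation
    np.maximum(x, 0.0, out=x)
    return x
```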

Coming to the benchmarks:

  • GoogLeNet with batch size 128 runs out of memory. The largest batch size I could fit is 16 (tried 16, 32, 64, 128).
  • VGG with batch size 64 runs out of memory. The largest batch size I could fit is 32 (tried 32, 64).
  • I've also computed Torch7+CuDNN-R2 baselines for these batch sizes.

AlexNet (One Weird Trick paper) - Input 128x3x224x224

| Library | Total time (ms) | forward (ms) | backward (ms) |
| :-- | --: | --: | --: |
| CuDNN[R3]-fp32 | 96 | 32 | 64 |
| Nervana-fp32 | 101 | 32 | 69 |
| CuDNN[R2] | 231 | 70 | 161 |
| TensorFlow | 423 | 101 | 322 |

Overfeat [fast] - Input 128x3x231x231

| Library | Total time (ms) | forward (ms) | backward (ms) |
| :-- | --: | --: | --: |
| CuDNN[R3]-fp32 | 326 | 113 | 213 |
| fbfft | 342 | 114 | 227 |
| CuDNN[R2] * | 810 | 234 | 576 |
| TensorFlow | 1290 | 324 | 966 |

OxfordNet [Model-A] - Input 32x3x224x224

| Library | Total time (ms) | forward (ms) | backward (ms) |
| :-- | --: | --: | --: |
| CuDNN[R2] | 564 | 234 | 576 |
| TensorFlow | 1132 | 280 | 852 |

GoogleNet V1 - Input 16x3x224x224

| Library | Total time (ms) | forward (ms) | backward (ms) |
| :-- | --: | --: | --: |
| CuDNN[R2] | 564 | 174 | 390 |
| TensorFlow | 619 | 54 | 565 |

Note that at a batch size of 16, GoogLeNet with CuDNN-R2 + Torch likely runs into dispatching overhead, so it's an exotic comparison that isn't practically very interesting or encouraging.

There you go.

I'm assuming that the first release of TensorFlow is still quite unpolished, and that they will improve it over time with various memory and time optimizations baked in.

@soumith
Owner

The benchmark scripts and raw outputs are located here: https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow
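
For reference, forward/backward timings like these are typically collected with a small harness along the lines of the sketch below. This is only an illustration, not the actual script at the link above; `run_step` is a placeholder for a callable that executes one forward (or forward+backward) pass.

```python
import time

def time_step(run_step, n_warmup=5, n_iters=10):
    """Return the average wall-clock time per call of run_step, in ms.
    Warm-up iterations absorb one-time allocation/compilation cost."""
    for _ in range(n_warmup):
        run_step()
    start = time.time()
    for _ in range(n_iters):
        run_step()
    return (time.time() - start) / n_iters * 1000.0
```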

@scott-gray

The lack of in-place operations is rather surprising. Once you have the full DAG, it should be rather easy to apply a liveness algorithm to it to optimize tensor allocations. For an example, see this: http://www.diku.dk/hjemmesider/ansatte/torbenm/ICD/Register.pdf (just replace "register" with "tensor").
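
To make the idea concrete, here is a minimal sketch of that kind of liveness-based buffer reuse, under the simplifying assumptions of a fixed execution order and uniformly sized buffers. The `plan_buffers` helper and its tensor names are purely illustrative, not anything from TensorFlow; outputs are also never aliased onto inputs that die at the same step, which is the conservative (non-in-place) choice.

```python
def plan_buffers(schedule):
    # schedule: list of (op, inputs, output) with tensor names as strings,
    # already in execution order; all buffers assumed to be the same size.
    last_use = {}
    for step, (_, inputs, _) in enumerate(schedule):
        for t in inputs:
            last_use[t] = step

    free, assignment, next_id = [], {}, 0
    for step, (_, inputs, output) in enumerate(schedule):
        # give the op's output a recycled buffer if one is free
        if free:
            assignment[output] = free.pop()
        else:
            assignment[output] = next_id
            next_id += 1
        # any input that dies at this step donates its buffer back to the pool
        for t in inputs:
            if last_use.get(t) == step and t in assignment:
                free.append(assignment[t])
    return assignment

# e.g. conv -> relu -> conv: the first and third activations can share storage
sched = [("conv1", ["x"], "a"), ("relu1", ["a"], "b"), ("conv2", ["b"], "c")]
print(plan_buffers(sched))   # {'a': 0, 'b': 1, 'c': 0}
```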

I'm kind of curious whether there's any support for automatically compounding operations together, or for leveraging kernels that have some compounding built in (like the alpha/beta params of gemm). I'm pretty close to maximizing the amount of compounding that's possible in my benchmark networks. And because I write all my own kernels, I can further compound things that aren't possible with closed-source libraries like cuDNN. For example, I'm now able to compute the mean along the PQN dimension inside the conv and gemm kernels at no cost. This cuts down the bandwidth required by batch norm in fprop by a third.
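
As a rough illustration of the alpha/beta style of compounding mentioned above (numpy standing in for the actual kernels; this `gemm` helper is hypothetical, not cuDNN or Nervana code): a BLAS-style C = alpha*A@B + beta*C lets you fold a scale-and-accumulate into the matmul call itself rather than running a separate elementwise pass over C afterwards.

```python
import numpy as np

def gemm(A, B, C, alpha=1.0, beta=1.0):
    # BLAS-style compounded update: C <- alpha * A @ B + beta * C in one call,
    # instead of a matmul followed by a separate scale-and-add pass over C.
    C *= beta
    C += alpha * (A @ B)
    return C

# e.g. accumulating a weight gradient over two chunks of a minibatch without
# an extra elementwise add over dW afterwards:
dW = np.zeros((64, 128))
for _ in range(2):
    a, b = np.random.randn(64, 256), np.random.randn(256, 128)
    gemm(a, b, dW, alpha=1.0, beta=1.0)
```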

Though on the whole I think TensorFlow seems like a great platform to build on. I'd say there's a good chance my kernels will make their way there sooner rather than later. You can find new benchmarks of my latest winograd kernels in the updated paper here: http://arxiv.org/abs/1509.09308

What I'll be working on next is basically taking a lot of what I learned implementing Winograd and refreshing all of my conv/pooling/gemm kernels to support very small minibatches at near-full utilization. This should have a big impact on the level at which you can scale these networks and the speed at which they converge. Here's a great paper exploring this: http://arxiv.org/abs/1509.04210

@yuzcccc

Hi, I strongly recommend adding mxnet (https://github.com/dmlc/mxnet) to the comparison; in my opinion it may be the fastest DL library :)

@mavenlin

+1 for benchmarking mxnet, the fastest now.

@strongbanker

+1 for benchmarking mxnet

@fvisin

I would also love to see a comparison with Theano (http://deeplearning.net/software/theano/), as it is another widely adopted deep learning library.

@nkoumchatzky

Thanks for benchmarking!

@aaronwro

+1, would love to see TensorFlow benchmarked against mxnet, Theano, Autograd for Torch, and Caffe.

@vincentvanhoucke

Thanks @soumith! Yes, our only launch criterion for convnets was 'GoogLeNet within distance from CuDNN[R2]', and we've punted on a lot of performance work, including upgrading CuDNN, until after the initial release. Expect a lot of movement on that front in the coming weeks.
