Deep learning is a field with intense computational requirements, and your choice of GPU will fundamentally determine your deep learning experience. Without a GPU, this might mean months of waiting for an experiment to finish, or running an experiment for a day or more only to see that the chosen parameters were off. With a good, solid GPU, one can quickly iterate over deep learning networks and run experiments in days instead of months, hours instead of days, minutes instead of hours. So making the right choice when it comes to buying a GPU is critical. How do you select the GPU which is right for you? This blog post delves into that question and offers advice that will help you make the choice that is right for you.
TL;DR
Having a fast GPU is a very important aspect when one begins to learn deep learning, as it allows for rapid gains in practical experience, which is key to building the expertise with which you will be able to apply deep learning to new problems. Without this rapid feedback it just takes too much time to learn from one’s mistakes, and it can be discouraging and frustrating to go on with deep learning. With GPUs I quickly learned how to apply deep learning on a range of Kaggle competitions, and I managed to earn second place in the Partly Sunny with a Chance of Hashtags Kaggle competition, where the task was to predict weather ratings for a given tweet. In that competition I used a rather large two-layered deep neural network with rectified linear units and dropout for regularization, and this deep net barely fit into my 6GB of GPU memory.
Should I get multiple GPUs?
Excited by what deep learning can do with GPUs, I plunged myself into multi-GPU territory by assembling a small GPU cluster with a 40Gbit/s InfiniBand interconnect. I was thrilled to see whether even better results could be obtained with multiple GPUs.
I quickly found that it is not only very difficult to parallelize neural networks efficiently across multiple GPUs, but also that the speedup for dense neural networks is only mediocre. Small neural networks could be parallelized rather efficiently using data parallelism, but larger neural networks like the one I used in the Partly Sunny with a Chance of Hashtags Kaggle competition received almost no speedup.
Later I ventured further down that road and developed a new 8-bit compression technique which enables you to parallelize dense or fully connected layers much more efficiently with model parallelism compared to 32-bit methods.
However, I also found that parallelization can be horribly frustrating. I naively optimized parallel algorithms for a range of problems, only to find that even with optimized custom code, parallelism across multiple GPUs does not work well, given the effort that you have to put in. You need to be very aware of your hardware and how it interacts with deep learning algorithms to gauge whether you can benefit from parallelization in the first place.
Since then, parallelism support for GPUs has become more common, but it is still far from universally available and efficient. The only deep learning library which currently implements efficient algorithms across GPUs and across computers is CNTK, which uses Microsoft’s special parallelization algorithms of 1-bit quantization (efficient) and block momentum (very efficient). With CNTK and a cluster of 96 GPUs you can expect a near-linear speedup of about 90x-95x. PyTorch might be the next library to support efficient parallelism across machines, but it is not there yet. If you want to parallelize on one machine, then your options are mainly CNTK, Torch, and PyTorch. These libraries yield good speedups (3.6x-3.8x) and have predefined algorithms for parallelism on one machine across up to 4 GPUs. There are other libraries which support parallelism, but these are either slow (like TensorFlow with 2x-3x), difficult to use for multiple GPUs (Theano), or both.
If you put value on parallelism I recommend using either PyTorch or CNTK.
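To make this concrete, here is a minimal sketch of single-machine data parallelism in PyTorch using torch.nn.DataParallel; the network and layer sizes are made up for the example, and as discussed above the achievable speedup depends strongly on the size of the network.

import torch
import torch.nn as nn

# A made-up fully connected network; real models will differ.
model = nn.Sequential(
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),
)

# Replicate the model on all visible GPUs; each batch is split across them
# and the outputs are gathered back on the first GPU.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.cuda()

x = torch.randn(256, 4096).cuda()   # the batch dimension is what gets split
y = model(x)

Under the hood this is plain data parallelism: gradients are summed across the replicas during the backward pass, which is exactly where the bandwidth bottlenecks described above come into play.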
Using Multiple GPUs Without Parallelism
Another advantage of using multiple GPUs, even if you do not parallelize algorithms, is that you can run multiple algorithms or experiments separately, one on each GPU. You gain no speedups, but you get more information about your performance by trying different algorithms or parameters at once. This is highly useful if your main goal is to gain deep learning experience as quickly as possible, and it is also very useful for researchers who want to try multiple versions of a new algorithm at the same time.
This is psychologically important if you want to learn deep learning. The shorter the interval between performing a task and receiving feedback for that task, the better the brain is able to integrate the relevant memory pieces for that task into a coherent picture. If you train two convolutional nets on separate GPUs on small datasets, you will more quickly get a feel for what it takes to perform well; you will more readily be able to detect patterns in the cross-validation error and interpret them correctly. You will be able to detect patterns which give you hints as to what parameter or layer needs to be added, removed, or adjusted.
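For example, to run two experiments at once on two GPUs, you can pin each training process to its own device with the CUDA_VISIBLE_DEVICES environment variable; a minimal sketch from Python (the second copy of the script would simply set "1"):

import os

# Restrict this process to the first GPU; this must happen before the
# deep learning framework is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # the framework now sees exactly one GPU

Each process then trains its own network completely independently, which is exactly the no-parallelism, more-experiments setup described above.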
So overall, one can say that one GPU should be sufficient for almost any task, but that multiple GPUs are becoming more and more important to accelerate your deep learning models. Multiple cheap GPUs are also excellent if you want to learn deep learning quickly. I personally would rather have many small GPUs than one big one, even for my research experiments.
So what kind of accelerator should I get? NVIDIA GPU, AMD GPU, or Intel Xeon Phi?
NVIDIA’s standard libraries made it very easy to establish the first deep learning libraries in CUDA, while there were no such powerful standard libraries for AMD’s OpenCL. Right now, there are just no good deep learning libraries for AMD cards – so NVIDIA it is. Even if some OpenCL libraries became available in the future I would stick with NVIDIA: the GPU computing (GPGPU) community is very large for CUDA and rather small for OpenCL. Thus, in the CUDA community, good open source solutions and solid advice for your programming are readily available.
Additionally, NVIDIA went all-in on deep learning when deep learning was still in its infancy. This bet paid off. While other companies now put money and effort behind deep learning, they are still far behind due to their late start. Currently, using any software-hardware combination for deep learning other than NVIDIA-CUDA will lead to major frustrations.
In the case of Intel’s Xeon Phi, it is advertised that you will be able to use standard C code and easily transform that code into accelerated Xeon Phi code. This feature might sound quite interesting, because you might think that you can rely on the vast resources of existing C code. However, in reality only very small portions of C code are supported, so this feature is not really useful, and most of the C code you will be able to run will be slow.
I worked on a Xeon Phi cluster with over 500 Xeon Phis and the frustrations were endless. I could not run my unit tests because the Xeon Phi MKL is not compatible with Python NumPy; I had to refactor large portions of code because the Intel Xeon Phi compiler is unable to make proper reductions for templates (for example for switch statements); and I had to change my C interface because some C++11 features are just not supported by the Intel Xeon Phi compiler. All this led to frustrating refactorings which I had to perform without unit tests. It took ages. It was hell.
And then when my code finally executed, everything ran very slowly. There are bugs(?), or just problems in the thread scheduler(?), which cripple performance if the tensor sizes you operate on change in succession. For example, if you have differently sized fully connected layers or dropout layers, the Xeon Phi is slower than the CPU. I replicated this behavior in an isolated matrix-matrix multiplication example and sent it to Intel. I never heard back from them. So stay away from Xeon Phis if you want to do deep learning!
Fastest GPU for a given budget
TL;DR
Your first question might be: what is the most important feature for fast GPU performance for deep learning? Is it CUDA cores? Clock speed? RAM size?
It is none of these. The most important feature for deep learning performance is memory bandwidth.
In short: GPUs are optimized for memory bandwidth while sacrificing memory access time (latency). CPUs are designed to do the exact opposite: CPUs can do quick computations if small amounts of memory are involved, for example multiplying a few numbers (3*6*9), but for operations on large amounts of memory, like matrix multiplication (A*B*C), they are slow. GPUs excel at problems that involve large amounts of memory due to their memory bandwidth. Of course there are more intricate differences between GPUs and CPUs, and if you are interested in why GPUs are such a good match for deep learning, you can read more about it in my Quora answer on this very question.
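If you want to see this effect on your own machine, here is a minimal sketch that times a large matrix multiplication on the CPU and on the GPU with PyTorch (any GPU-capable framework would do); the exact numbers depend entirely on your hardware.

import time
import torch

A = torch.randn(4096, 4096)
B = torch.randn(4096, 4096)

start = time.time()
C_cpu = A @ B                      # large, bandwidth-hungry multiply on the CPU
cpu_time = time.time() - start

A_gpu, B_gpu = A.cuda(), B.cuda()
torch.cuda.synchronize()           # make sure the transfers are finished
start = time.time()
C_gpu = A_gpu @ B_gpu              # the same multiply on the GPU
torch.cuda.synchronize()           # GPU kernels run asynchronously, so wait
gpu_time = time.time() - start

print("CPU: %.3fs  GPU: %.3fs" % (cpu_time, gpu_time))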
So if you want to buy a fast GPU, first and foremost look at the bandwidth of that GPU.
Evaluating GPUs via Their Memory Bandwidth
Bandwidth can be compared directly within an architecture. For example, the performance of Pascal cards like the GTX 1080 and GTX 1070 can be compared by looking at their memory bandwidth alone: a GTX 1080 (320GB/s) is about 25% (320/256) faster than a GTX 1070 (256GB/s). However, across architectures (for example Pascal vs. Maxwell, such as GTX 1080 vs. GTX Titan X) a direct comparison is not possible, because different architectures with different fabrication processes (in nanometers) utilize the given memory bandwidth differently. This makes everything a bit tricky, but overall bandwidth alone will still give you a good overview of roughly how fast a GPU is. To determine the fastest GPU for a given budget, one can use this Wikipedia page and look at the bandwidth in GB/s; the listed prices are quite accurate for newer cards (900 and 1000 series), but older cards are significantly cheaper than the listed prices, especially if you buy them via eBay. For example, a regular GTX Titan X goes for around $550 on eBay.
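Within one architecture the comparison therefore boils down to a simple ratio of bandwidths; a quick sketch using the official specifications quoted above:

# Official memory bandwidth in GB/s for two Pascal cards
bandwidth = {"GTX 1080": 320.0, "GTX 1070": 256.0}

ratio = bandwidth["GTX 1080"] / bandwidth["GTX 1070"]
print("GTX 1080 is roughly %.0f%% faster" % (100 * (ratio - 1)))  # ~25%

Remember that this only works within an architecture; across architectures the same ratio can be badly misleading.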
Another important factor to consider, however, is that not all architectures are compatible with cuDNN. Since almost all deep learning libraries use cuDNN for convolutional operations, this restricts the choice of GPUs to Kepler or better, that is, the GTX 600 series or above. On top of that, Kepler GPUs are generally quite slow. So you should prefer GPUs of the 900 or 1000 series for good performance.
To give a rough estimate of how the cards perform relative to each other on deep learning tasks, I constructed a simple chart of GPU equivalence. How do you read this? For example, one GTX 980 is as fast as 0.35 of a Titan X Pascal, or in other words, a Titan X Pascal is almost three times faster than a GTX 980.
Please note that I do not own all these cards myself and I did not run deep learning benchmarks on all of them. The comparisons are derived from comparisons of the cards’ specs together with compute benchmarks (some cryptocurrency mining workloads are computationally comparable to deep learning). So these are rough estimates. The real numbers could differ a little, but generally the error should be minimal and the ordering of the cards should be correct. Also note that small networks which under-utilize the GPU will make larger GPUs look bad. For example, a small LSTM (128 hidden units; batch size > 64) on a GTX 1080 Ti will not be that much faster than on a GTX 1070. To get the performance differences shown in the chart, one needs to run larger networks, say an LSTM with 1024 hidden units (and batch size > 64). This is also important to keep in mind when choosing the GPU that is right for you.
Generally, I would recommend the GTX 1080 Ti or GTX 1070. They are both excellent cards, and if you have the money for a GTX 1080 Ti you should go ahead with that. The GTX 1070 is a bit cheaper and still faster than a regular GTX Titan X (Maxwell). Both cards should be preferred over the GTX 980 Ti due to their increased memory of 11GB and 8GB respectively (instead of 6GB).
A memory of 8GB might seem a bit small, but for many tasks this is more than sufficient. For example, for Kaggle competitions, most image datasets, deep style transfer, and natural language understanding tasks you will encounter few problems.
The GTX 1060 is the best entry GPU if you want to try deep learning for the first time, or if you want to occasionally use it for Kaggle competitions. I would not recommend the GTX 1060 variant with 3GB of memory, since even the other variant’s 6GB can be quite limiting. However, for many applications 6GB is sufficient. The GTX 1060 is slower than a regular Titan X, but it is comparable in both performance and eBay price to the GTX 980.
In terms of bang for the buck, the 10 series is quite well designed. The GTX 1060, GTX 1070, and GTX 1080 Ti stand out. The GTX 1060 is for beginners, the GTX 1070 is a versatile option for startups and for some parts of research and industry, and the GTX 1080 Ti stands solid as an all-around high-end option.
I generally would not recommend the NVIDIA Titan X (Pascal), as it is too pricey for its performance. Go instead with a GTX 1080 Ti. However, the NVIDIA Titan X (Pascal) still has its place among computer vision researchers who work on large datasets or video data. In these domains every GB of memory counts, and the NVIDIA Titan X has 1GB more than the GTX 1080 Ti, which is an advantage in this case. However, a better option in terms of bang for the buck here is a GTX Titan X (Maxwell) from eBay: a bit slower, but it also sports a big 12GB of memory.
However, most researchers do well with a GTX 1080 Ti. The one extra GB of memory is not needed for most research and most applications.
I personally would go with multiple GTX 1070s for research. I would rather run a few more experiments that are a bit slower than just one experiment that is faster. In NLP the memory constraints are not as tight as in computer vision, so a GTX 1070 is just fine for me. The tasks I work on and how I run my experiments determine the best choice for me, which is a GTX 1070.
You should reason in a similar fashion when you choose your GPU. Think about what tasks you work on and how you run your experiments and then try to find a GPU which suits these requirements.
The options are now more limited for people who have very little money for a GPU. GPU instances on Amazon Web Services are now quite expensive and slow and are no longer a good option if you have less money. I do not recommend a GTX 970, as it is slow, still rather expensive even when bought used ($150 on eBay), and there are memory problems associated with the card to boot. Instead, try to get the additional money to buy a GTX 1060, which is faster, has more memory, and has no memory problems. If you just cannot afford a GTX 1060, I would go with a GTX 1050 Ti with 4GB of RAM. The 4GB can be limiting, but you will be able to play around with deep learning, and with some adjustments to your models you can get good performance. A GTX 1050 Ti would be suitable for most Kaggle competitions, although it might limit your competitiveness in some of them.
Amazon Web Services (AWS) GPU instances
In a previous version of this blog post I recommended AWS GPU spot instances, but I would no longer recommend this option. The GPUs on AWS are now rather slow (one GTX 1080 is four times faster than an AWS GPU) and prices have shot up dramatically in recent months. It now again seems much more sensible to buy your own GPU.
Conclusion
With all the information in this article you should be able to reason about which GPU to choose by balancing the required memory size, the bandwidth in GB/s for speed, and the price of the GPU, and this reasoning will be solid for many years to come. But right now my recommendation is to get a GTX 1080 Ti or GTX 1070 if you can afford them; a GTX 1060 if you are just starting out with deep learning or are constrained by money; if you have very little money, try to afford a GTX 1050 Ti; and if you are a computer vision researcher you might want to get a Titan X Pascal (or stick to existing GTX Titan Xs).
TL;DR advice
Best GPU overall: Titan X Pascal and GTX 1080 Ti
Cost efficient but expensive: GTX 1080 Ti, GTX 1070
Cost efficient and cheap: GTX 1060
I work with data sets > 250GB: Regular GTX Titan X or Titan X Pascal
I have little money: GTX 1060
I have almost no money: GTX 1050 Ti
I do Kaggle: GTX 1060 for any “normal” competition, or GTX 1080 Ti for “deep learning competitions”
I am a competitive computer vision researcher: Titan X Pascal or regular GTX Titan X
I am a researcher: GTX 1080 Ti. In some cases, like natural language processing, a GTX 1070 might also be a solid choice — check the memory requirements of your current models
I want to build a GPU cluster: This is really complicated, you can get some ideas here
I started deep learning and I am serious about it: Start with a GTX 1060. Depending on what area you choose next (startup, Kaggle, research, applied deep learning), sell your GTX 1060 and buy something more appropriate
Update 2017-03-19: Cleaned up blog post; added GTX 1080 Ti
Update 2016-07-23: Added Titan X Pascal and GTX 1060; updated recommendations
Update 2016-06-25: Reworked multi-GPU section; removed simple neural network memory section as no longer relevant; expanded convolutional memory section; truncated AWS section due to not being efficient anymore; added my opinion about the Xeon Phi; added updates for the GTX 1000 series
Update 2015-08-20: Added section for AWS GPU instances; added GTX 980 Ti to the comparison relation
Update 2015-04-22: GTX 580 no longer recommended; added performance relationships between cards
Update 2015-03-16: Updated GPU recommendations: GTX 970 and GTX 580
Update 2015-02-23: Updated GPU recommendations and memory calculations
Update 2014-09-28: Added emphasis for memory requirement of CNNs
Acknowledgements
I want to thank Mat Kelcey for helping me to debug and test custom code for the GTX 970; I want to thank Sander Dieleman for making me aware of the shortcomings of my GPU memory advice for convolutional nets; I want to thank Hannes Bretschneider for pointing out software dependency problems for the GTX 580; and I want to thank Oliver Griesel for pointing out notebook solutions for AWS instances.
[Image source: NVIDIA CUDA/C Programming Guide]
How much slower are mid-level GPUs? For example, I have a Mac with a GeForce 750M; is it suitable for training DNN models?
There is a GT 750M version with DDR3 memory and one with GDDR5 memory; the GDDR5 version will be about three times as fast as the DDR3 version. With a GDDR5 model you will probably run three to four times slower than typical desktop GPUs, but you should still see a good speedup of 5-8x over a desktop CPU. So a GDDR5 750M will be sufficient for running most deep learning models. If you have the DDR3 version, then it might be too slow for deep learning (smaller models might take a day; larger models a week or so).
Is it any good for processing non-mathematical or non-floating-point data via GPU? How about generating hashes and keypairs?
Sometimes it is good, but often it isn’t; it depends on the use case. One application of GPUs for hash generation is bitcoin mining. However, the main measure of success in bitcoin mining (and cryptocurrency mining in general) is to generate as many hashes per watt of energy as possible; GPUs are in the mid-field here, beating CPUs but being beaten by FPGAs and other low-energy hardware.
In the case of keypair generation, e.g. in MapReduce, you often do little computation but many IO operations, so GPUs cannot be utilized efficiently. For many applications GPUs are significantly faster in one case but not in another similar case, e.g. for some but not all regular expressions, and this is the main reason why GPUs are not used in such cases.
Hi, nice writeup! Are you using single or double precision floats? You said divide by 4 for the byte size, which sounds like 32 bit floats, but then you point out that the Fermi cards are better than Kepler, which is more true when talking about double precision than single, as the Fermi cards have FP64 at 1/8th of FP32 while Kepler is 1/24th. Trying to decide myself whether to go with the cheaper Geforce cards or to spring for a Titan.
Thanks for your comment, James. Yes, deep learning is generally done with single-precision computation, as the gains in precision do not improve the results greatly.
It depends on what types of neural network you want to train and how large they are. But I think a good decision would be to go for a 3GB GTX 580 from eBay, and then upgrade to a GTX 1000 series card next year. The GTX 1000 series cards will probably be quite good for deep learning, so waiting for them might be a wise choice.
Thank you for the great post. Could you say something about pairing a new card with an old CPU?
For example, I have a 4-core Intel Q6600 from 2007 with 8GB of RAM (without the possibility to upgrade). Could this be a bottleneck if I choose to buy a new GPU for CUDA and ML?
I’m also not sure which one is the better choice: a GTX 780 with 2GB of RAM, or a GTX 970 with 4GB of RAM. The 780 has more cores, but they are a bit slower…
http://www.game-debate.com/gpu/index.php?gid=2438&gid2=880&compare=geforce-gtx-970-4gb-vs-geforce-gtx-780
A nice list of characteristics, but still, I’m not sure which would be the better choice. I would use the GPU for all kinds of problems, perhaps some with smaller networks, but I wouldn’t shy away from trying something bigger when I feel comfortable enough.
What would you recommend?
Hi enedene, thanks for your question!
Your CPU should be sufficient and should slow you down only slightly (1-10%).
My post is now a bit outdated, as the new Maxwell GPUs have been released. The Maxwell architecture is much better than the Kepler architecture, so the GTX 970 is faster than the GTX 780 even though it has lower bandwidth. So I would recommend getting a GTX 970 over a GTX 780 (of course a GTX 980 would be better still, but a GTX 970 will be fine for most things, even for larger nets).
For low budgets I would still recommend a GTX 580 from eBay.
I will update my post next week to reflect the new information.
Thank you for the quick reply. I will most probably get a GTX 970. Looking forward to your updated post, and to competing against you on Kaggle.
Hi Tim. What open-source package would you recommend if the objective was to classify non-image data? Most packages specifically are designed for classifying images
I have only superficial experience with most libraries, as I usually use my own implementations (which I adjust from problem to problem). However, from what I know, Torch7 is really strong for non-image data, but you will need to learn some Lua to adjust some things here and there. I think pylearn2 is also a good candidate for non-image data, but if you are not used to Theano then you will need some time to learn how to use it in the first place. Libraries like deepnet, which is programmed on top of cudamat, are much easier to use for non-image data, but the available algorithms are partially outdated and some algorithms are not available at all.
I think you always have to change a few things to make a library work for new data, so you might also want to check out libraries like Caffe and see if you like their API better. A neater API might outweigh the cost of needing to change things to make it work in the first place. So the best advice might be to just look at the documentation and examples, try a few libraries, and then settle on something you like and can work with.
Hi Tim. Do you have any references that explain why convolutional kernels need more memory beyond that used by the network parameters? I am trying to figure out why Alex’s net needs just over 3.5GB when the parameters alone only take ~0.4GB… what’s hogging the rest?!?
Thanks for your comment, Monica. This is indeed something I overlooked, and it is actually quite an important issue when selecting a GPU. I hope to address this in an update I aim to write soon.
To answer your question: the increased memory usage stems from memory that is allocated during the computation of the convolutions to increase computational efficiency. Because image patches overlap, one saves a lot of computation by storing some of the image values and then reusing them for an overlapping image patch. Albeit at a cost of device memory, one can achieve tremendous increases in computational efficiency if one does this cleverly, as Alex does in his CUDA kernels. Other solutions that use fast Fourier transforms (FFTs) are said to be even faster than Alex’s implementation, but they need even more memory.
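To make the overhead concrete, here is a rough sketch of how much extra memory an im2col-style lowering (the patch-copying trick described above) needs for a single convolutional layer; the dimensions are made-up, AlexNet-like values, so treat the output as an order-of-magnitude estimate only.

def im2col_bytes(batch, in_channels, out_h, out_w, kernel, dtype_bytes=4):
    # im2col stores one kernel-sized patch per output position, so pixels
    # that belong to overlapping patches are stored many times over.
    column_size = in_channels * kernel * kernel
    positions = batch * out_h * out_w
    return column_size * positions * dtype_bytes

# Made-up AlexNet-like first layer: batch 128, 3 channels, 11x11 kernel, 55x55 output
buffer_bytes = im2col_bytes(batch=128, in_channels=3, out_h=55, out_w=55, kernel=11)
weight_bytes = 96 * 3 * 11 * 11 * 4   # 96 filters of size 3x11x11, 4 bytes each
print("im2col buffer: %.0f MB, weights: %.2f MB" % (buffer_bytes / 2**20, weight_bytes / 2**20))

The buffer comes out in the hundreds of megabytes while the weights of the same layer are well below one megabyte, which is why the memory usage of a convolutional net is dominated by these intermediate buffers and activations rather than by the parameters.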
If you are aiming to train large convolutional nets, then a good option might be to get a normal GTX Titan from eBay. If you use convolutional nets heavily, two or even four GTX 980s (much faster than a Titan) also make sense if you plan to use the convnet2 library, which supports dual-GPU training. However, be aware that NVIDIA might soon release a Maxwell GTX Titan equivalent which would be much better than the GTX 980 for this application.
Hi Tim. Thanks for this very informative post.
Do you know how much of a boost Maxwell gives? I’m trying to decide between a GTX 850M with 4GB DDR3 and a Quadro K1100M with 2GB GDDR5. I understand that the K1100M is roughly equivalent to the 750M. Which gives the bigger boost: going from Kepler to Maxwell or from Geforce to Quadro (including from DDR3 to GDDR5)?
Thanks so much!
Going from DDR3 to GDDR5 is a larger boost than going from Kepler to Maxwell. However, the Quadro K1100M has only slightly higher bandwidth than the GTX 850M, which will probably cancel out the benefits, so both cards will perform at about the same level. If you want to use convolutional neural networks, the 4GB of memory on the GTX 850M might make the difference; otherwise I would go with the cheaper option.
Thanks!
Hi, I am planning to replicate the ImageNet object identification results using CNNs as published in the recent paper by G. Hinton et al. (just as an exercise to learn about deep learning and CNNs).
1. What GPU would you recommend considering I am a student? I heard the original paper used 2 GTX 580s and yet took a week to train the 7-layer deep network. Is this true? Could the same be done using a single GTX 580 or GTX 970? How much time would it take to train the same on a GTX 970 or a single GTX 580? (A week of time is okay for me.)
2. What kind of modifications to the original implementation could I make (like 5 or 6 hidden layers instead of 7, or fewer objects to detect, etc.) to make this little project of mine easier to implement on a lower budget, while at the same time helping me learn about deep nets and CNNs?
3. What kind of libraries would you recommend for this? Torch7 or pylearn2/Theano? (I am fairly proficient in Python but not so much in Lua.)
4. Is there a small-scale implementation of this anywhere on GitHub, etc.?
Also thanks a lot for the wonderful post.
1. All GPUs with 4GB should be able to run the network; you can run somewhat smaller networks on a single GTX 580; these networks will always take more than 5 days to train, even on the fastest GPUs
2. Read about convolutional neural networks, then you will understand what the layers do and how you can use them. This is a good, thorough tutorial: http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/
3. I would try pylearn2, convnet2, and caffe and pick which suits you best
4. The implementations are generally written to be general, i.e. you run small and large networks with the same code; it is only a difference in the parameters passed to a function. If by “small” you mean a less complex API, I have heard good things about the Lasagne library
Hi Tim, super interesting article. What case did you use for the build that had the GPUs vertical?
It looks like it is vertical, but it is not. I took that picture while my computer was lying on the ground. However, I use the Cooler Master HAF X for both of my computers. I bought this tower because it has a dedicated large fan for the GPU slot; in retrospect I am unsure if the fan helps that much. There is another tower I saw that actually has vertical slots, but again I am unsure if that helps much. I would probably opt for liquid cooling for my next system. It is more difficult to maintain, but it has much better performance. With liquid cooling, almost any case that fits the mainboard and GPUs would do.
It looks like there is a bracket supporting the end of the cards, did that come with the case or did you put them in to support the cards?
(Duplicate paragraph: “I quickly found”.)
Thanks, fixed.
Great article!
You did not talk about the number of cores in a graphics card (CUDA cores in the case of NVIDIA). My perception was that a card with more cores will always be better, because more cores lead to better parallelism and hence faster training, given that the memory is enough. Please correct me if my understanding is wrong.
Which card would you suggest for RNNs and a data size of 15-20GB (Wikipedia/Freebase size)? Would a 960 be good enough? Or should I go with a 970? The 580 is not available in my country.
Thanks for your comment. CUDA cores relate more closely to FLOPS and not to bandwidth, but it is the bandwidth that you want for deep learning. So CUDA cores are a bad proxy for performance in deep learning. What you really want is a high memory bus width (e.g. 384 bits) and a high memory clock (e.g. 7000MHz); anything beyond that hardly matters for deep learning.
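Those two numbers multiply directly into the memory bandwidth, so you can check a card's headline figure yourself; a quick sketch for a GTX Titan X-like card (384-bit bus, 7 GHz effective memory clock):

bus_width_bits = 384          # memory bus width
effective_clock_hz = 7.0e9    # effective (data-rate) memory clock of GDDR5

bandwidth_gb_per_s = (bus_width_bits / 8) * effective_clock_hz / 1e9
print("%.0f GB/s" % bandwidth_gb_per_s)   # ~336 GB/s, matching the official spec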
Mat Kelcey did some tests with Theano on the GTX 970 and it seems that the GPU has no memory problems for compute, so the GTX 970 might be a good choice then.
Thanks a lot
Hi Tim,
Thanks for your excellent blog posts. I am a statistician and I want to go into the deep learning area. I have a budget of $1500-2000. Can you recommend a good desktop system for deep learning purposes? From your blog post I know that I will get a GTX 980, but what about the CPU, RAM, and motherboard requirements?
Thanks
Hi Yakup,
I want to write a blog post with detailed advice on this topic sometime in the next two weeks, and if you can wait for that you might get some insights into what hardware is right for you. But I also want to give you some general, less specific advice.
If you might be getting more GPUs in the future, it is better to buy a motherboard with PCIe 3.0 and 7 PCIe x16 slots (one GPU typically takes two slots). If you will use only 1-2 GPUs, then almost any motherboard will do (PCIe 2.0 would also be okay). Plan to get a power supply unit (PSU) which has enough watts to power all the GPUs you will get in the future (e.g. if you will get a maximum of 4, then buy a 1400+ watt PSU). The CPU does not need to be fast or have many cores. Twice as many threads as you have GPUs is almost always sufficient (for Intel CPUs we mostly have: 1 core = 2 threads); any CPU above 3GHz is okay; less than 3GHz might give you a tiny penalty in speed of about 1-3%. Fast memory caches are often more important for CPUs, but in the big picture they also contribute little to overall performance; a typical CPU with slow memory will decrease overall performance by a few percent.
One can work around a small RAM by loading data sequentially from your hard drive into your RAM, but it is often more convenient to have a larger RAM; twice the RAM your GPU has gives you more freedom and flexibility (i.e. 8GB of RAM for a GTX 980). An SSD will make it more comfortable to work, but similarly to the CPU it offers little performance gain (0-2%, depending on the software implementation); an SSD is nice if you need to preprocess large amounts of data and save them into smaller batches, e.g. preprocessing 200GB of data and saving it in batches of 2GB is a situation in which SSDs can save a lot of time. If you decide to get an SSD, a good rule might be to buy one that is twice as large as your largest data set. If you get an SSD, you should also get a large hard drive to which you can move old data sets.
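Those rules of thumb can be put into a tiny helper; note that the 250 watts per GPU and the 400 watts of headroom for CPU, drives, and so on are my own rough assumptions, and you should check the actual power ratings of your cards.

def rough_build_numbers(num_gpus, gpu_watts=250, gpu_memory_gb=4):
    # PSU: all GPUs at full load plus headroom for the rest of the system (assumed values)
    psu_watts = num_gpus * gpu_watts + 400
    # RAM: about twice the memory of a single GPU, as described above
    ram_gb = 2 * gpu_memory_gb
    return psu_watts, ram_gb

print(rough_build_numbers(num_gpus=4, gpu_memory_gb=4))   # (1400, 8) for four GTX 980s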
So the bottom line is: a $1000 system should perform at least at 95% of the level of a $2000 system, but a $2000 system offers more convenience and might save some time during preprocessing.
Hi Tim,
Nice and very informative post. I have a question regarding the processor. Would you suggest building a computer with an AMD processor (for example, an AMD FX-8350 4.0GHz 8-core processor) over an Intel-based processor for deep learning? I also do not know whether AMD processors have PCIe 3.0 support. Could you please give your thoughts on this?
And thanks a lot for the wonderful post.
Thanks for your comment, Dewan. An AMD CPU is just as good as an Intel CPU; in fact I might favor AMD over Intel CPUs because Intel CPUs pack just too much unnecessary punch: one simply does not need so much processing power, as all the computation is done by the GPU. The CPU is only used to transfer data to the GPU and to start kernels (which is little more than a function call). Transferring data means that the CPU should have a high memory clock and a memory controller with many channels. This is often not advertised for CPUs as it is not so relevant for ordinary computation, but you want to choose the CPU with the larger memory bandwidth (memory clock times memory controller channels). The clock rate of the processor itself is less relevant here.
A 4GHz 8-core AMD CPU might be a bit of an overkill. You could definitely settle for less without any degradation in performance. But what you say about PCIe 3.0 support is important (some new Haswell CPUs do not support 40 lanes, but only 32; I think most AMD CPUs support all 40 lanes). As I wrote above, I will write a more detailed analysis in a week or two.
Hey Tim,
Thanks for the excellent detailed post. I look forward to reading your other posts. Keep going
Hey! Thanks for the great post!
I have one question, however:
I’m in the “started DL, serious about it” group and have a decent PC already, although without an NVIDIA GPU. I’m also a 1st-year student, so a GTX 980 is out of the question 😉 The question is: what do you think about Amazon EC2? I could easily buy a GTX 580, but I’m not sure it’s the best way to spend my money. And when I think about more expensive cards (like the 980 or the ones to be released in 2016), it seems like running a spot instance for 10 cents per hour is a much better choice.
What could be the main drawbacks of doing DL on EC2 instead of my own hardware?
I think an Amazon Web Services (AWS) EC2 instance might be a great choice for you. AWS is great if you want to use a single GPU or multiple separate GPUs (one GPU for one deep net). However, you cannot use them for multi-GPU computation (multiple GPUs for one deep net), as the virtualization cripples the PCIe bandwidth; there are rather complicated hacks that improve the bandwidth, but it is still bad. Anything beyond two GPUs will not work on AWS because their interconnect is way too slow for that.
Is the AWS single GPU limitation relevant to the new g2.8xlarge instance? (see https://aws.amazon.com/blogs/aws/new-g2-instance-type-with-4x-more-gpu-power/).
It seems to run the same GPUs as the g2.2xlarge, which would still impede parallelization for neural networks, but I do not know for sure without some hard numbers. I bet that with custom patches 4-GPU parallelism is viable, although still slow (probably one GTX Titan X will be faster than the 4 GPUs on the instance). More than 4 GPUs still will not work due to the poor interconnect.
Thanks for the explanation. Looking forward to read the other post.
Hi Tim,
I am a bit confused between buying your recommended GTX 580 and a new GTX 750 (Maxwell). The models I am finding on eBay are around 120 USD, but they are 1.5GB models. One big problem with the 580 would be buying a new PSU (500 watt). As you stated, the Maxwell architecture is the best, so would the GTX 750 (512 CUDA cores, 1GB GDDR5) be a good choice? It would be about 95 USD and I could also do without an expensive PSU.
My research area is mainly text mining and NLP, not many images. Other than that I would do Kaggle competitions.
A GTX 750 will be a bit slower than a GTX 580, which should be fine and more cost-effective in your case. However, you may want to opt for the 2GB version; with 1GB it will be difficult to run convolutional nets; 2GB will also be limiting of course, but you could use it for most Kaggle competitions, I think.
great posts, Tim!
which deep learning framework do you often use for your work, may I ask?
I programmed my own library for my work on the parallelization of deep learning; otherwise I use Torch7, with which I am much more productive than with Caffe or Pylearn2/Theano.
I know that these are not recommended, but a 580 won’t work for me because of the lack of Torch7 support: will a 660 or 660 Ti work with Torch7? Is this possible to check before purchasing? Thank you!
The cuDNN component of Torch 7 needs a GPU with compute capability 3.5. A 660 or 660Ti will not work; You can find out which GPUs have which compute capability here.
Any comments on this new Maxwell architecture Titan X? $1000 US
http://www.pcworld.com/article/2898093/nvidia-fully-reveals-1000-titan-x-the-most-advanced-gpu-ever.html
seemingly has a massive memory bandwidth bump – for example the gtx 980 specs claim 224 GB/sec with the Maxwell architecture, this has 336 GB/sec (and also comes stock with 12GB VRAM!)
Along that line, are the memory bandwidth specs not apples-to-apples comparisons across different NVIDIA architectures?
i.e. the 780 Ti also claims 336GB/s with the Kepler architecture, but you claim the 980 with 224GB/s of bandwidth can out-benchmark it for basic neural net activities?
Appreciate this post
You can compare bandwidth within a microarchitecture (Maxwell: GTX Titan X vs. GTX 980, or Kepler: GTX 680 vs. GTX 780), but across architectures you cannot do that (Maxwell card X vs. Kepler card Y). Very minute changes in the design of a microchip can make a vast difference in bandwidth, FLOPS, or FLOPS/watt.
Kepler was about FLOPS/watt and double-precision performance for scientific computing (engineering, simulation, etc.), but the complex design led to poor utilization of the bandwidth (memory bus times memory clock). With Maxwell the NVIDIA engineers developed an architecture which has both energy efficiency and good bandwidth utilization, but double precision suffered in turn; you just cannot have everything. Thus Maxwell cards make great gaming and deep learning cards, but poor cards for scientific computing.
The GTX Titan X is so fast, because it has a very large memory bus width (384 bit), an efficient architecture (Maxwell) and a high memory clock rate (7 Ghz) — and all this in one piece of hardware.
“a 6GB GPU is plenty for now” — don’t you get severely limited in the batch size (like, 30 max) for 10^8+ parameter convnets (eg simonyan very deep, googlenet)?
although I think some DL toolkits are starting to come with functionality of updating weights after >1 batch load/unload onto gpu, which I guess would result in theoretically unlimited batch size, though not sure how this would impact speed?
This is a good point, Alex. I think you can also get very good results with conv nets that feature less memory-intensive architectures, but the field of deep learning is moving so fast that 6GB might soon be insufficient. Right now, I think one still has quite a bit of freedom with 6GB of memory.
A batch and activation unload/load procedure would be limited by the ~8GB/s bandwidth between GPU and CPU, so there will definitely be a decrease in performance if you unload/load a majority of the needed activation values. Because the bandwidth bottlenecks are very similar to those in parallelism, one can expect a decrease in performance of about 10-30% if you unload/load the whole net. So this would be an acceptable procedure for very large conv nets; however, smaller nets with fewer parameters would still be more practical, I think.
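For reference, the “update weights only after several batches” idea mentioned above is what is now usually called gradient accumulation; here is a minimal, hedged PyTorch sketch in which the model, data, and accumulation factor are all placeholders.

import torch
import torch.nn as nn

# Placeholder model and data; a real network and data loader will differ.
model = nn.Linear(1024, 10).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(32, 1024), torch.randint(0, 10, (32,))) for _ in range(8)]

accumulation_steps = 4            # effective batch size = 4 x 32
optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    loss = criterion(model(x.cuda()), y.cuda())
    (loss / accumulation_steps).backward()   # gradients add up across the small batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                     # one weight update per accumulated group
        optimizer.zero_grad()

This only saves memory for the batch and its activations; the weights stay on the GPU, so it avoids most of the PCIe traffic discussed above.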
What is your opinion about the different brands (EVGA, ASUS, MSI, GIGABYTE) of the video card for the same model?
Thanks for this post Tim, is very illustrating.
EVGA cards often have many extra features (dual BIOS, extra fan design) and a somewhat higher clock and/or memory, but their cards are more expensive too. However, with respect to price/performance it often depends from card to card which one is best, and one cannot draw general conclusions from a brand. Overall, the fan design is often more important than the clock rates and extra features. The best way to determine the best brand is often to look for reports of how hot one card runs compared to another and then think about whether the price difference justifies the extra money.
Most often, though, one brand will be just as good as the next and the performance gains will be negligible, so going for the cheapest brand is a good strategy in most cases.
hey Tim,
you’ve been a big help. I have included the results from the CUDA bandwidth test (which is included in the samples folder of the basic CUDA install).
This is for a GTX 980 running on 64-bit Linux with an i3770 CPU and PCIe 2.0 lanes on the motherboard.
Does this look reasonable?
Are they indicative of anything?
the device/host and host/device speeds are typically the bottleneck you speak of?
no reply necessary – just learning
thanks again
tim@ssd-tim ~/NVIDIA_CUDA-7.0_Samples/bin/x86_64/linux/release $ ./bandwidthTest
[CUDA Bandwidth Test] – Starting…
Running on…
Device 0: GeForce GTX 980
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12280.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12027.4
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 154402.9
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Looks quite reasonable; the bandwidth from host to device and from device to host is limited by either the RAM or PCIe 2.0, and 12GB/s is faster than expected; 150GB/s is slower than the 224GB/s the GTX 980 is capable of, but this is due to the small transfer size of 32MB, so this looks fine.
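If you want to reproduce the host-to-device number from Python instead of the CUDA samples, a minimal sketch with PyTorch and pinned memory (as in the test above) looks like this; results will vary with your PCIe configuration.

import time
import torch

x = torch.randn(64 * 1024 * 1024).pin_memory()   # 256 MB of pinned host memory (float32)

torch.cuda.synchronize()
start = time.time()
x_gpu = x.cuda(non_blocking=True)                # host-to-device copy over PCIe
torch.cuda.synchronize()                         # wait until the copy has finished
elapsed = time.time() - start

print("%.1f GB/s host to device" % (x.numel() * 4 / elapsed / 1e9))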
Hi Tim, great post! I feel lucky that I chose a 580 a couple of years ago when I started experimenting with neural nets. If there had been an article like this then I wouldn’t have been so nervous!
I’m wondering if you have any quick tips for fresh Ubuntu installs with current nvidia cards? When I got my used system running a couple of years ago it took quite a while and I fought with drivers, my 580 wasn’t recognized, etc.. On the table next to me is a brand new build that I just put together that I’m hoping will streamline my ML work. It’s an intel X99 system with a Titan X (I bought into the hype!). Windows went on fine(although I will rarely use it) and Ubuntu will go on shortly. I’m not looking forward to wrestling with drivers…so any tips would be greatly appreciated. If you have a cheat-sheet or want to do a post, I’m sure it will be warmly welcomed by many…especially me!
Yeah, I also had my troubles with installing the latest drivers on Ubuntu, but soon I got the hang of it. You want to do this:
0. Download driver and remember the path where you saved the file
1. Purge system from nvidia and nouveau driver
2. Blacklist nouveau driver
3. Reboot
4. Ctrl + Alt + F1
5. sudo service lightdm stop
6. chmod +x driver_file
7. sudo ./driver_file
And you should be done. Sometimes I had trouble stopping lightdm; you have two options:
1. try sudo /etc/init.d/lightdm stop
2. killing all lightdm processes (sudo killall lightdm or (1) ps ux | grep lightdm, (2) find process id, (3) sudo kill -9 id)
For me the second option worked.
You can find more details to the first steps here:
http://askubuntu.com/questions/451221/ubuntu-14-04-install-nvidia-driver
Thanks for the reply Tim. I was able to get it all up and running pretty painlessly. Hopefully your response helps somebody else too…it’s nice to have this sort of information in one spot if it’s GPU+DNN related.
On a performance note, my new system with the Titan X is more than 10 times faster on an MNIST training run than my other computer (i5-2500K + GTX 580 3GB). And for fun, I cranked up the mini-batch size on a Caffe example (Flickr fine-tuning) and got 41% validation accuracy in under an hour.
I believe you hit a nerve with a couple of your blog posts…I think the type of information that you’re giving is quite valuable, especially to folks who haven’t done much of this stuff.
One possible information portal could be a wiki where people can outline how they set up various environments (theano, caffe, torch, etc..) and the associated dependencies. Myself, I set up a few and I’m left with a few questions like for example…
-given all the dependencies, which should be built versus installed via apt-get versus pip? A holistic outlook would be a very educational thing. I found myself building the base libraries and using the setup method for many Python packages, but after a while there were so many that I started using apt-get and pip and adding things to my paths… blah blah… at the end everything works, but I admit I lost track of all the details.
I know that I’m not alone!! Having a wiki resource that I could contribute to during the process would be good for me and for others doing the same thing… instead of hunting down disparate sources and answering questions on Stack Overflow.
I mention this because you probably already have a ton of traffic because of a couple key posts that you have. Put a wiki up and I promise I’ll contribute! I’ll consider doing it myself as well…time…need more time!
thanks again.
Thanks, johno, I am glad that you found my blog posts and comments useful. A wiki is a great idea and I am looking into it. Maybe when I move this site to a private host this will be easy to set up. Right now I do not have time for that, but I will probably migrate my blog in two months or so.
I am a bit of a novice but got it done in a few hours.
My thoughts
Try to start with a clean install of an NVIDIA-supported Linux distro (Ubuntu 14.04 LTS is on the list).
I used the Linux distro’s packaged proprietary drivers instead of downloading them from NVIDIA. The xorg-edgers PPA has them and keeps them pretty current. This means you can install the actual NVIDIA driver via sudo apt-get, and also (more importantly) upgrade the driver easily in a few months when NVIDIA releases a new one. It also blacklists Nouveau automatically. You can toggle between driver versions in the software manager, as it shows you all the drivers you have.
Once you have the driver working, you are most of the way there. I ran into a few troubles with the CUDA install, as sometimes your computer may have some libraries missing, or conflicts. But I got CUDA 7.0 going pretty quickly.
these two links helped
http://bikulov.org/blog/2015/02/28/install-cuda-6-dot-5-on-clean-ubuntu-14-dot-04/
http://developer.download.nvidia.com/compute/cuda/6_0/rel/docs/CUDA_Getting_Started_Linux.pdf
there is gonna be some trial and error, so be ready to reinstall Ubuntu and take another try at it.
good luck
Hi Tim-
Does the platform you plan on doing DL on matter? By this I mean X99, Z97, AM3+, etc. X99 is able to utilize more threads and cores than Z97, but I’m not sure if that helps at all, similar to cryptocurrency mining, where hardware besides the GPU doesn’t matter.
Hi Jack-
Please have a look at my full hardware guide for details, but in short, hardware besides the GPU does not matter much (although a bit more than in cryptocurrency mining).
Ok, sure, thanks.
Hi Tim,
I have benefited from this excellent post. I have a question regarding amazon gpu instances. Can you give a rough estimate of the performance of amazon gpu? Like GTX TITAN X = ? amazon gpu.
Thanks,
Thanks, this was a good point; I added it to the blog post. The new AWS GPUs (g2.2xlarge and g2.8xlarge) are about as fast as a GTX 680 (they are based on the same chip, but are slightly different to support virtualization). However, there are still some performance decreases due to virtualization for the memory transfer from CPU to GPU and between GPUs; this is hard to measure and should have little impact if you use just one GPU. If you perform multi-GPU computing, the performance will degrade harshly.
Hi Tim,
Thanks for sharing all this info.
I don’t understand the difference between a GTX 980 from, say, Asus and one from Nvidia.
Obviously it is the same architecture, but are they much different at all?
Why does it seem hard to find Nvidia products in Europe?
Thanks
So this is how a GPU is produced and comes into your hands:
1. NVIDIA designs a circuit for a GPU
2. It makes a contract with a semiconductor producer (currently TSMC in Taiwan)
3. The semiconductor producer produces the GPU and sends it to NVIDIA
4. NVIDIA sends the GPU to companies such as ASUS, EVGA etc.
5. ASUS, EVGA, etc. modify the GPU (clock speeds, fan — nothing fundamental, the chip stays the same)
6. You buy the GPU from either 5. or 4.
So while all GPUs are from NVIDIA, you might buy a branded GPU from, say, ASUS. This GPU is the very same GPU as another GPU from, say, EVGA. Both GPUs run the very same chip. So essentially, all GPUs are the same (for a given chip).
Some GPUs are not available in other countries because of regulations (NVIDIA might have no license, but other brands have one?) and because it might not be profitable to sell them there in the first place (you need a different set of logistics for international trade; NVIDIA might not have the expertise and infrastructure for this, but regular hardware companies like ASUS and EVGA do).
Hi Tim,
Thank you for your advice, which I found very, very useful. I have many questions, so please feel free to answer some of them. I have many choices for buying a powerful laptop or computer; my budget is £4000.00.
I would like to buy a Mac Pro (costing nearly £3400.00), so can I apply deep learning on this machine, given that it uses the OSX operating system and I want to use Torch7 in my implementation? Second, if I buy a Titan X I have two choices: first, install the Titan X GPU in the Mac Pro; second, buy an Alienware Graphics Amplifier (to use the Titan X) with an Alienware 13 laptop. Could you please tell me if this is possible and easy to do, because I am not a computer engineer but I want to use deep learning in my research.
Best regards,
Salem
I googled the Alienware Amplifier and I read that it only has 4GB/s of bandwidth internally, and there might be other problems. If you use a single GPU this is not too much of a concern, but be prepared to deal with performance decreases in the range of 5-25%. If there are technical details that I overlooked, the performance decrease might be much higher; you will need to look into that yourself.
The GTX Titan X in a Mac Pro will do just fine I guess. While most deep learning libraries will work well with OSX there might be a few problems here and there, but I think torch7 will work fine.
However, consider also that you will pay a heavy price for the aesthetics of Apple products. You could buy a normal high-end computer with 2 GTX Titan Xs and it would still be cheaper than a Mac Pro. Ubuntu or any other Linux-based OS needs some time to get comfortable with, but they work just as well as OSX and often make programming easier than OSX does. So it is basically all down to aesthetics vs. performance — that’s your call!
Is it easy to install GTX Titan X in a Mac Pro? Does it need external hardware or power supply or just plug in?
Many thanks Tim
Hi,
Nice article! You recommended all high-end cards. What about mid-range cards for those on a really tight budget? For example, the GT 740 line has a model with 4GB GDDR5, a 5000 MT/s memory clock, and a 128-bit bus width, and is rated at ~750 GFLOPS. Will such a card likely give a nice boost in neural net training (assuming the net fits in the card’s memory) over a mid-range quad-core CPU?
Thanks!
The GT 740 with 4GB GDDR5 is a very good choice for a low budget. Maybe I should even include that option in my post for a very low budget. A GT 740 will definitely be faster than a quad-core CPU (probably anything between 3 to 7 times faster, depending on the CPU and the problem).
Thanks for this great article. What do you think of the upcoming GTX 980 Ti? I have read it has 6GB and clock speed/cores closer to the Titan X. Rumoured to be $650-750. I was about to buy a new PC, but thought I might hold out as it’s coming in June.
The GTX 980 Ti seems to be great. 6GB of RAM is sufficient for most tasks (unless you use super large data sets, do video classification, or use expensive convolutional architectures) and the speed is about the same as the Titan X. If you use Nervana Systems’ 16-bit kernels (which will be integrated into Torch7), then there should be no issues with memory even for these expensive tasks.
So the GTX 980 Ti seems to be the new best choice in terms of cost effectiveness.
Hi,
I am a novice at deep nets and would like to start with some very small convolutional nets. I was thinking of using a GTX 750 Ti (in my part of the world it is not really very cheap for a student). I would convince my advisor to get a more expensive card after I am able to show some results. Will it be sufficient to train a meaningful convolutional net using Theano?
Your best choice in this situation would be to use an Amazon Web Services GPU spot instance. These instances have small costs ($0.1 an hour or so) and you will be able to produce results quickly and cheaply, after which your advisor might be willing to buy an expensive GPU. To save more money, it would be best to prototype your solution on a CPU (just test that the code is working correctly) and then start up an AWS GPU instance and let your code run for a few hours or days. This should be the best solution.
I think there are predefined AWS images which you can load, so that you do not have to install anything — google “AMI AWS + theano” or “AMI AWS + torch” to find more.
Thanks a lot for the suggestion. I will go ahead and try this.
Will the Pascal GPUs have any special requirements, such as X99 or DDR4? I am currently planning a Z97 build with DDR3, but don’t want to be stuck in a year’s time! Thanks, J
According to the information that is available, Pascal will not need X99 or DDR4 (which would be quite limiting for sales); instead, Pascal cards will just be like a normal card in a PCIe slot with NVLink on top (just like SLI), and thus no new hardware is needed.
Sweet, thanks.
About this:
“GTX Titan X = 0.66 GTX 980 = 0.6 GTX 970 = 0.5 GTX Titan = 0.40 GTX 580
GTX Titan X = 0.35 GTX 680 = 0.35 AWS GPU instance (g2.2 and g2.8) = 0.33 GTX 960”
Have you actually measured the times/used these gpus or are you “guessing”?
Thank you for the article!
Very good question!
Because deep learning is bandwidth-bound, the performance of a GPU is determined by its bandwidth. However, this is only true for GPUs with the same architecture (Maxwell, Kepler, Fermi). So, for example, the comparisons between the GTX Titan X and GTX 980 should be quite accurate.
Comparisons across architectures are more difficult and I cannot assess them objectively (because I do not have all the GPUs listed). To provide a relatively accurate measure I sought out information where a direct comparison was made across architectures. Some of these are opinion- or “feeling”-based, other sources of information are not relevant (game performance measures), but there are some sources which are relatively objective (performance measures for bandwidth-bound cryptocurrency mining); so I weighted each piece of information according to its relevance and then rounded everything to neat numbers for comparisons between architectures.
So all in all, these measures are quite opinionated and do not rely on hard evidence. But I think I can make more accurate estimates than people who do not know GPUs well. Therefore I think it is the right thing to include this somewhat inaccurate information here.
Hi,
Thanks a lot for the updated comparison. I bought a 780 Ti a year ago and it’s interesting how it compares to the newer cards. I use it mainly for NLP tasks, including RNNs, starting with LSTMs.
Also, do I get it right that ‘GTX Titan X = 0.66 GTX 980’ means that the 980 is actually 2/3 as fast as the Titan X, or is it the other way round?
A GTX 780 Ti is pretty much the same as a GTX Titan Black in terms of performance (slower than a GTX 980). Exactly, the 980 is about 2/3 the speed of a Titan X.
Can you comment on this note on the cuda-convnet page (https://code.google.com/p/cuda-convnet/wiki/Compiling)?
“Note: A Fermi-generation GPU (GTX 4xx, GTX 5xx, or Tesla equivalent) is required to run this code. Older GPUs won’t work. Newer (Kepler) GPUs also will work, but as the GTX 680 is a terrible, terrible GPU for non-gaming purposes, I would not recommend that you use it. (It will be slow). ”
I am probably in the “started DL, serious about it”-group, and would have probably bought the GTX 680 after reading your (great) article.
This is very much true: the performance of the GTX 680 is just bad. But because the Fermi GPUs (4xx and 5xx) are not compatible with the NVIDIA cuDNN library which is used by many deep learning frameworks, I do not recommend the GTX 5xx. The GTX 7xx series is much faster, but also much more expensive than a GTX 680 (except for the GTX 960, which is about as fast as the GTX 680), so the GTX 680, despite being so slow, is the only viable choice (besides the GTX 960) for a very low budget.
As you can see in the comment by zeecrux, the GTX 960 might actually be better than the GTX 680 by quite a margin. So it is probably better to get a GTX 960 if you can find a cheap one. If this is too expensive, settle for a GTX 580.
Ok, thank you! I can’t see any comment by zeecrux? How bad is the performance of the GTX 960? Is it sufficient if you mainly want to get started with DL, play around with it, and do the occasional Kaggle competition, or is it not even worth spending the money in this case? Buying a Titan X or GTX 980 is quite an investment for a beginner.
Ah I did not realize, the comment of zeecrux was on my other blog post, the full hardware guide. Here is the comment:
A K40 is about as fast as a GTX Titan. So the GTX 960 is definitely faster and better than a GTX 680. It should be sufficient for most Kaggle competitions and is a perfect card to get started with deep learning.
So it makes good sense to buy a GTX 960 and wait for Pascal to arrive in Q3/Q4 2016, instead of buying a GTX 980 Ti or GTX 980 now.
Hi Tim,
Do you think it is better to buy a Titan X now or to wait for the new Pascal cards if I want to invest in just one GPU for the coming 4 years?
The Pascal architecture should be a quite large upgrade compared to Maxwell. However, you have to wait more than a year for those cards to arrive. If your current GPU is okay, I would wait. If you have no GPU at all, you can use AWS GPU instances, or buy a GTX 970 and sell it after one year to buy a Pascal card.
From what I read, GPU Direct RDMA is only available for workstation cards (Quadro/Tesla). But it seems like you were able to get your cluster to work with a few GTX Titans and IB cards here. Not sure what I am missing.
You will need a Mellanox InfiniBand card. For me a ConnectX-2 worked, but usually only ConnectX-3 and ConnectX-IB are supported. I never tested GPU Direct RDMA with Maxwell, so it might not work there.
To get it working on Kepler devices, you will need the patch you find under downloads here (nvidia_peer_memory-1.0-0.tar.gz):
http://www.mellanox.com/page/products_dyn?product_family=116
Even with that I needed quite some time to configure everything, so prepare yourself for a long read of documentation and many Google searches for error messages.
Hi Tim, thank you for posting and updating this, I’ve found it very helpful.
I do have a general question, though, about Quadro cards, which I’ve noticed neither you nor many others discuss using for deep learning. I’m configuring a new machine and, due to some administrative constraints, it is easiest to go with a Quadro K5000.
I had specced out a different machine with a GTX 980, but it’s looking like it will be harder to purchase it. My questions are whether there is anything I should be aware of regarding using Quadro cards for deep learning and whether you might be able to ballpark the performance difference. We will probably be running moderately sized experiments and are comfortable losing some speed for the sake of convenience; however, if there would be a major difference between the 980 and the K5000, then we might need to reconsider. I know it is difficult to make comparisons across architectures, but any wisdom that you might be able to share would be greatly appreciated.
Thanks!
The K5000 is based on a Kepler chip and has 173 GB/s memory bandwidth. Thus it should be a bit slower than a GTX 680.
Hi!
I am in a similar situation. No comparison of quadro and geforce available anywhere. Just curious, which one did you end up buying and how did it work out?
Hi Tim, first I want to say that I’m truly extremely impressed with your blog, it’s very helpful.
Talking about the bandwidth of PCIe, have you ever heard of PLX Technology and their PEX 8747 bridge chip? AnandTech has a good review of how it works and its effect on gaming: http://www.anandtech.com/show/6170/four-multigpu-z77-boards-from-280350-plx-pex-8747-featuring-gigabyte-asrock-ecs-and-evga. They even say that it can replicate four x16 lanes on a CPU which only has 28 lanes.
Someone mentioned it before in the comments, but that was another mainboard with 48x PCIe 3.0 lanes; now that you say you can operate with 16x on all four GPUs I got curious and looked at the details.
It turns out that this chip switches the data in a clever way, so that a GPU will have full bandwidth when it needs high speed. However, when all GPUs need high-speed bandwidth, the chip is still limited by the 40 PCIe lanes that are available at the physical level. When we transfer data in deep learning we need to synchronize gradients (data parallelism) or outputs (model parallelism) across all GPUs to achieve meaningful parallelism; as such, this chip will provide no speedups for deep learning, because all GPUs have to transfer at the same time.
Transferring the data one after the other is most often not feasible, because we need to complete a full iteration of stochastic gradient descent before we can work on the next iteration. Delaying updates would be an option, but one would suffer losses in accuracy and the updates would not be that efficient anymore (4 delayed updates = 2-3 real updates?). This would make the approach rather useless.
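As a rough back-of-the-envelope sketch (the parameter count and per-lane bandwidth are assumptions, not measurements), this is why the shared 40 lanes become the bottleneck when all GPUs synchronize at once:

```python
# Rough estimate of gradient synchronization time when all GPUs transfer
# at the same time. All numbers are assumptions for illustration only.
params = 60e6              # e.g. an AlexNet-sized network
bytes_per_param = 4        # 32-bit gradients
gradient_bytes = params * bytes_per_param

lane_bw = 1.0e9            # ~1 GB/s per PCIe 3.0 lane (approximate)
n_gpus = 4

# Marketing case: every GPU sees "its own" 16 lanes via the PLX switch.
t_ideal = gradient_bytes / (16 * lane_bw)

# Reality during synchronization: 4 GPUs share the 40 physical CPU lanes,
# so each GPU effectively gets 40 / 4 = 10 lanes of bandwidth.
t_shared = gradient_bytes / ((40 / n_gpus) * lane_bw)

print(f"per-GPU transfer, 16 dedicated lanes: {t_ideal*1000:.0f} ms")
print(f"per-GPU transfer, 40 shared lanes:    {t_shared*1000:.0f} ms")
```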
Thanks for your detailed explanation.
Is it possible to use the GTX 960M for Deep Learning? http://www.geforce.com/hardware/notebook-gpus/geforce-gtx-960m/specifications. It has 2.5GB GDDR though. Maybe a pre-built specs with http://t.co/FTmEDrJDwb ?
A GTX 960M will be comparable in performance to a GTX 950. So you should see a good speedup using this GPU, but it will not be a huge speedup compared to other GPUs. However, compared to laptop CPUs the speedup will still be considerable. To do more serious deep learning work on a laptop you need more memory and preferably faster computation; a GTX 970M or GTX 980M should be very good for this.
Hi Tim
I’m planning to build a pc mainly for kaggle and getting started with deep learning.
This is my first time. For my budget I’m thinking of going with:
i7-4790k
GTX 960 4GB
Gigabyte GA-Z97X-UD3H-BK or Asus Z97-A 32GB DDR3 Intel Motherboard
I’m hoping to replace the GTX 960 or add another card later on.
Is this a good build? Please offer your suggestions.
Thanks in advance:)
Looks like a solid cheap build with one GPU. The build will suffice for a Pascal card once it becomes available and thus should last about 4 years with a Pascal upgrade. The GTX 960 is a good choice to try things out and use deep learning on Kaggle. You will not be able to build the best models, but models that are competitive with the top 10% in deep learning Kaggle competitions. Once you get the hang of it, you can upgrade and you will be able to run the models that usually win those Kaggle competitions.
Hi Tim,
Right now I’m between 2 choices: two GTX 690s or a Titan X. Both come at the same price. Which one do you think is better for conv nets or multimodal recurrent neural nets?
I would definitely pick a GTX Titan X over two GTX 690s, mainly because using two GTX 690s for parallelism is difficult and will be slower than a single Titan X. Running multiple algorithms (different algorithms on each GPU) on the two GTX 690s will be good, but a Titan X comes close to this due to its higher processing speed.
Are there any important differences between the GTX 980 and the GTX 980 TI? It seems that we can only get the latter. While it seems faster, I’m not skilled enough in the area to know whether it has any issues related to using it for deep learning.
The GTX 980 Ti is as fast as the GTX Titan X (50% faster than the GTX 980), but has 6GB of memory instead of 12GB. There are no issues with the card; it should work flawlessly.
What do you think of Titan X superclocked vs. regular Titan X? Are the up/down sides noticeable?
The upgrade should be unnoticeable (0-5% increased speed) and I would recommend a superclocked version only if you do not pay any additional money for that.
Possibly (probably) a dumb question, but can you use a superclocked GPU with a non-superclocked GPU? The reason I ask is that a cheap used superclocked Titan Black is for sale on eBay, as well as another cheap Titan Black (non-superclocked). Just want to make sure I wouldn’t be making some mistake by buying the second one if I decided to get two Titan Black GPUs.
p.s. thanks for the blog. Super helpful for all of us noobies.
Yes, this will work without any problem. I myself have been using 3 different kinds of GTX Titans for many months. In deep learning the difference in compute clock hardly matters, so the GPUs will not diverge during parallel computation. So there should be no problems.
Hello Tim
Thank you very much for your in-depth hardware analysis (both this and the other one you did). I basically ended up buying a new computer based only on your ideas.
I chose the GTX 960 and I might upgrade next year if I feel this is something for me.
But in a lot of places I read about this ImageNet database. The problem there seems to be that I need to be a researcher (or in education) to download the data. Do you know anything about this? Is there any way for me as a private person (doing this for fun) to download the data? The reason why I want this dataset is that it is huge and it also would be fun to be able to compare how my nets work compared to other people's.
If not, what other image databases except for CIFAR and MNIST do you recommend?
Thanks again.
Hello Mattias, I am afraid there is no way around the educational email address for downloading the dataset. It really is a shame, but if these images were exploited commercially then the whole system of free datasets would break down, so it is mainly due to legal reasons.
There are other good image datasets like the Google Street View House Numbers dataset; you can also work with Kaggle datasets that feature images, which has the advantage that you get immediate feedback on how well you do, and the forums are excellent for reading up on how the best competitors achieved their results.
Thanks for quick reply,
I will look into both Kaggle and the street view data set then
Hello Tim,
Thank you for your article. I understand that researchers need a good GPU for training a top-performing (convolutional) neural network. Can you share any thoughts on what compute power is required (or what is typically desired) for transfer learning (i.e. fine-tuning of an existing model) and for model deployment?
Thank you!
Tim, such a great article. I’m going back and forth between the Titan Z and the Titan X. I can probably buy the Titan Z for ~$500 from my friend. I’m very confused as to how much memory it actually has. I see that it has 6GB x 2.
I guess my question is: does the Titan Z have the same specs as the Titan X in terms of memory? How does this work from a deep learning perspective (currently using Theano)?
Many Thanks,
One thing I should add is that I’m building RNNs (specifically LSTMs) with this Titan Z or Titan X. I’m also considering the 980 Ti.
Please have a look at my answer on Quora which deals exactly with this topic. Basically, I recommend you go for the GTX Titan X. However, $500 for a GTX Titan Z is also a good deal. Memory-wise, you can think of the GTX Titan Z as two normal GTX Titans with a connection between the two GPUs, so two GPUs with 6GB of memory each.
That makes much more sense. Thanks again — checked out your response on quora. You’ve really changed my views on how to set up deep learning systems. Can’t even begin to express how thankful I am.
Hey Tim, not to bother you too much. I bought a 980 Ti, and things have been great. However, I was just doing some searching and saw that the AMD Radeon R9 390X is ~$400 on Newegg and has 8GB memory and 500GB/s bandwidth. These specs are roughly 30% better than the 980 Ti for $650.
I was wondering what your thoughts are on this? Is AMD compute architecture slower compared to Nvidia Kepler architecture for deep learning? In the next month or so, I’m considering purchasing another card.
Based upon numbers, it seems that the AMD cards are much cheaper compared to Nvidia. I was hoping you could comment on this!
Theoretically the AMD card should be faster, but the problem is the software: Since no good software exists for AMD cards you will have to write most of the code yourself with an AMD card. Even if you manage to implement good convolutions the AMD card will likely perform worse than the NVIDIA one because the NVIDIA convolutional kernels have been optimized by a few dozen researchers for more than 3 years.
NVIDIA Pascal cards will have up to 750-1000 GB/s memory bandwidth, so it is worth waiting for Pascal which probably will be released in about a year.
Yeah, I can’t wait for Pascal. For now I will just rock out with the 980 Tis. Thanks a lot!
Hi Tim,
Coming across this blog while searching the internet for deep learning resources is great for a newbie like me.
I have 2 choices in hand now: one GTX 980 4GB or two GTX 780 Ti 3GB in SLI. Which one do you recommend for the hardware box for my deep learning research?
I am more in favour of the two 780 Tis, going by your writing on CUDA cores and memory bandwidth.
Thank you very much.
Nghia
I would favor the GTX 980, which will be much faster than 2 GTX 780 Ti even if you use the two cards in parallel. However, the 2 GTX 780 Ti will be much better if you run independent algorithms and thus they enable you to learn more quickly how to train deep learning algorithms successfully. On the other hand, their 3GB is rather limiting and will prevent you from training current state-of-the-art convolutional networks. If you want to train convolutional networks I would suggest you choose the GTX 980 rather than 2 GTX 780 Ti due to this.
Thank you very much for the advice.
Is it possible to put all three cards into one machine, and would that give me a good enough environment to learn parallel programming and study deep learning with neural networks (Torch7 & Lua)?
Will a system with those 3 cards (780 Ti x2 + 980 x1) yield better performance overall, or will the hardware disparity and complexity drag it down?
Yes, you could run all three cards in one machine. However, you can only select one type of GPU for your graphics, and for parallelism only the two 780s will work together. There might be problems with the driver though, and it might be that you need to select your Maxwell card (980) to be your graphics output.
In a three-card system you could tinker with parallelism on the 780s and switch to the 980 if you are short on memory. If you run NervanaGPU you could also use 16-bit floating point models, thus doubling your memory; however, NervanaGPU will not work on your Kepler 780 cards.
Thank you very much, Tim.
For the sake of study, from the specs:
+ The GTX 780 Ti has 2880 CUDA cores + 3GB (384-bit memory bus), and double that with SLI
+ The GTX 980 has 2048 CUDA cores + 4GB (256-bit memory bus).
Does the difference in VRAM and CUDA cores make a big deal in deep learning?
I will benchmark and post the results once I get to run the system with the above 2 configurations.
I have access to a NVIDIA Grid K2 card on a virtual machine and I have some questions related to this:
1. How does this card rank compared to the other models?
2. More importantly, are there any issues I should be aware of when using this card or just doing deep learning on a virtual machine in general?
I do not have the option of using any other machine than the one provided.
And of course, thanks for some great articles! They are a big help.
You are welcome! I am glad that it helped!
1. The Grid K2 card will perform roughly as well as a GTX 680, although its PCIe connection might be crippled due to virtualization.
2. Depends highly on the hardware/software setup. Generally there should not be any issue other than problems with parallelism.
Do you know what versions of CUDA it is compatible with? Would it work with CUDA 7.5?
Hi! Fantastic article. Are there any on-demand solutions such as Amazon but with a 980 Ti on board? I can’t find any.
Amazon needs to use special GPUs which are virtualizable. Currently the best cards with such a capability are Kepler cards which are similar to the GTX 680. However, other vendors might have GPU servers for rent with better GPUs (as they do not use virtualization), but these servers are often quite expensive.
Hello Tim,
First of all, I bounced on your blog when looking for Deep Learning configuration and I loved your posts that confirm my thoughts.
I have two questions if you have time to answer them:
(1) For specific problems, I will train my DNN on ImageNet plus some other classes; for this I don’t mind waiting for a while (well, a long while) until the DNN is ready. Do you know whether a configuration of one to four Titan X cards (12GB each) would be fast enough for scene labelling of images? I would like to have answers within seconds, like Clarifai does. I guess this is dependent on the number of hidden layers I could have in my DNN.
(2) Do you have enough long-term use of your configuration to provide feedback on the MTBF for GPU cards? I guess, like disks, running a system on a 24/7 basis will impact the longevity of GPU cards…
Thanks in advance for your answers
mph
(1) Yes, this is highly dependent on the network architecture and it is difficult to say more about this. However, this benchmark page by Soumith Chintala might give you some hint of what you can expect from your architecture given a certain depth and size of the data. Regarding parallelization: you usually use LSTMs for labelling scenes and these can be parallelized easily. However, running image recognition and labelling in tandem is difficult to parallelize. You are highly dependent on the implementations of certain libraries here because it costs just too much time to implement it yourself. So I recommend making your choice for the number of GPUs dependent on the software package you want to use.
(2) I have had no failures so far, but of course this is for a sample size of 1. I have heard from other people that use multiple GPUs that they had multiple failures in a year, but I think this is rather unusual. If you keep the temperatures below 80 degrees your GPUs should be just fine (theoretically).
Awesome work, this article really clears out the questions I had about available GPU options for deep learning.
What can you say about the Jetson series, namely the latest TX1?
Is it recommended as an alternative to a PC rig with desktop GPUs?
I was also thinking about the idea of getting a Jetson TX1 instead of a new laptop, but in the end it is more convenient and more efficient to have a small laptop and SSH into a desktop or an AWS GPU instance. An AWS GPU instance will be quite a bit faster than the Jetson TX1, so the Jetson only makes sense if you really want to do mobile deep learning, or if you want to prototype algorithms for future generations of smartphones that will use the Tegra X1 GPU.
Hi Tim!
Thank you for the excellent blog post.
I use various neural nets (i.e. sometimes large, sometimes small) and hesitate to choose between GTX 970 and GTX 960. What is better if we set price factor aside?
– 970 is ~2x faster than 960, but as you say it has troubles.
– on the other hand, Nvidia had shown that GTX 980 has the same memory troubles > 3.5GB
http://www.pcper.com/news/Graphics-Cards/NVIDIA-Responds-GTX-970-35GB-Memory-Issue
If we take their information for granted, I don’t understand your point regarding the memory troubles of the GTX 970 at all, because you do recommend the GTX 980.
Simply put, is the GTX 970 still faster than the GTX 960 on large nets or not? What concrete troubles do we face using the 970 on large nets?
Thank you again, Alexander
Hi Alexander,
if you look at the screenshots again you see that the bandwidth of the GTX 980 does not drop when we increase memory usage. So the GTX 980 does not have memory problems.
Regarding your question of 960 vs. 970: the 970 is much better if you can stay below 3.5GB of memory, but much worse otherwise. If you sometimes train some large nets, but you are not insisting on very good results (rather you are satisfied with good results), I would go with the GTX 970. If you train something big and hit the 3.5GB barrier, just adjust your neural architecture to be a bit smaller and you should be alright (or you might try different things like 16-bit networks, or aggressive use of 1x1 convolutional kernels (inception) to keep the memory footprint small).
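As a hypothetical illustration of why these tricks help (the layer and batch sizes below are made up, not taken from any specific network), you can estimate the activation memory of a single convolutional layer directly:

```python
# Rough activation-memory estimate for the output of one convolutional layer.
# Layer sizes and batch size are made-up examples.
def activation_mb(batch, channels, height, width, bytes_per_value=4):
    return batch * channels * height * width * bytes_per_value / 1024**2

batch, h, w = 128, 56, 56

print(activation_mb(batch, 256, h, w))      # 32-bit, 256 feature maps -> ~392 MB
print(activation_mb(batch, 256, h, w, 2))   # 16-bit halves it         -> ~196 MB
print(activation_mb(batch, 64, h, w))       # 1x1 conv down to 64 maps -> ~98 MB
```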
Thanks, Tim!
Indeed, I overlooked the first screenshot, it makes a difference.
I still don’t understand Nvidia’s statement; somehow they equated the GTX 980 and 970 above 3.5GB, but no matter.
Hey Tim!
I was thinking about the GTX 970 issue again. According to the test, it loses bandwidth above 3.5GB. But what does that mean exactly?
- Does it start affecting bandwidth for memory below 3.5GB as well? (I guess no)
- Does it decrease GPU computing performance itself? (I guess no)
- What if the input data is allocated in GPU memory below 3.5GB, and only the CNN weights are allocated above 3.5GB? In that case the upper 0.5GB shouldn’t be in use for data exchange and may not affect overall bandwidth? I understand we don’t control this allocation by default, but what about in theory?
Great post.
I bought a GTX 750; considering your article I’m doomed, right?
I have a question though. I haven’t tested this yet, but here it goes.
Do you think I can use VGGNet or Alex Krizhevsky’s net for CIFAR-10? The GTX 750 has 2GB of GDDR5 RAM. CIFAR-10 is only 60K images of size 32*32*3! Maybe it fits?! I’m not sure.
What do you think about this? Would I be able to use PASCAL VOC 2007 as well?
Thanks again
The GTX 750 will be a bit slow, but you should still be able to do some deep learning with it. If you are using libraries that support 16-bit convolutional nets then you should be able to train AlexNet even on ImageNet, so CIFAR-10 should not be a problem. Using VGG on CIFAR-10 should work out, but it might be a bit tight, especially if you use 32-bit networks. I have no experience with the PASCAL VOC 2007 dataset, but the image sizes seem to be similar to ImageNet, thus AlexNet should work out, but probably not VGG, even with 16 bits.
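For a rough sense of the parameter memory involved, here is a small sketch assuming the commonly cited 60M parameter count for AlexNet; activations come on top of this and usually dominate for conv nets.

```python
# Very rough memory estimate for the network parameters (weights) alone.
params = 60e6
for bits, label in [(32, "32-bit"), (16, "16-bit")]:
    weights = params * bits / 8
    # weights + gradients + momentum buffer, a common minimum for SGD with momentum
    total = 3 * weights
    print(f"{label}: weights {weights/1024**2:.0f} MB, "
          f"with gradients+momentum ~{total/1024**2:.0f} MB")
```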
Thank you very much.
By the way, I’m using Caffe and I guess it only supports 32-bit convnets. I’m already hitting my limit using a 4-conv-layer network (1991MB or so) and overall only 2-3MB of GPU memory remains.
Your article and help were of great help to me, sir, and I thank you from the bottom of my heart.
God bless you
Hossein
Hi Tim,
thanks for the great guidelines!
In case somebody’s interested in the numbers: I’ve just bought a GTX 960 (http://www.gigabyte.com/products/product-page.aspx?pid=5400#ov) and I’m getting ~50% better performance than an AWS g2.2 instance (Keras / TensorFlow backend).
Thank you, Pawel. Those are very useful statistics!
Hi Tim
Thanks a lot for this article. I was looking for something like this.
I have a quick question. What would be the expected speedup for ConvNets with a GTX Titan X vs a Core i7-4770 at 3.4 GHz?
A rough idea would do the job.
Best Regards
Wajahat
I wonder what exactly happens when we exceed the 3.5GB limit of the GTX 970?!
Will it crash? If not, how much slower does it get when it passes that limit?
I want to know: if it passes the limit and gets slower, would it still be faster than the GTX 960? If so, that would be great.
Has anyone ever observed or benchmarked this? Have you?
Thanks again
Thanks a lot. Actually I don’t want to play games with this card, I need its bandwidth and its memory to run some applications (a deep learning framework called Caffe).
Currently I have a GTX 750 with 2GB GDDR5; I need 4GB at least, and at the very same time I also need a higher-bandwidth card.
I can’t buy the GTX 980, it’s too expensive for me, so I was torn between the GTX 960 4GB and the GTX 970 4GB (3.5GB).
Basically, the GTX 960 has a 128-bit bus and gives me 112 GB/s of bandwidth, while the GTX 970 has a 256-bit bus and gives me 192+ GB/s of bandwidth.
My current card’s bandwidth is only 80 GB/s!
So I just need to know: do I have access to the whole 4 gigabytes of VRAM, playing games aside?
Does it crash if it exceeds the 3.5GB limit or does it just get slower?
I mistakenly posted this here! (This was supposed to go to TechPowerUp!)
I am using a GTX 970 and two 750s (1GB and 2GB GDDR5 models).
But there is no big difference in speed.
Rather, it seems the 750 is slightly faster than the 970.
Would you tell me the reason?
Thanks.
Hmm this seems strange. It might be that the GTX 970 hit the memory limit and thus is running more slowly so that it gets overtaken by a GTX 750. On what kind of task have you tested this?
Hi Tim, do you know (or can you recommend) any good DL project for graphics card testing on GitHub? Recently I have been cooperating with a hardware retailer, so they lent me a bunch of NVIDIA graphics cards (Titan, Titan X, Titan Black, Titan Z, 980 Ti, 980, 970, 780 Ti, 780…).
What if I install a single gtx 960 to a PCIe 2.0 slot instead of a 3.0?
It will be a bit slower to transfer data to the GPU, but for deep learning this is negligible. So not really a problem.
Thank you for sharing this. Please update the list with the new Tesla P100 and compare it with the Titan X.
I will probably do this on the weekend.
Hi Tim
Thanks a lot for sharing such valuable information.
Do you know if it will be possible to use an external GPU enclosure for deep learning,
such as a Razer Core?
http://www.anandtech.com/show/10137/razer-core-thunderbolt-3-egfx-chassis-499399-amd-nvidia-shipping-in-april
Would there be any compromise on the efficiency?
Best Regards
Wjahat
There will be a penalty to get the data from your CPU to your GPU, but the performance on the GPU will not be impacted. Depending on the software and the network you are training, you can expect a 0-15% decrease in performance. This should still be better than the performance you could get for a good laptop GPU.
What can I expect from a Quadro M2000M (see http://www.notebookcheck.net/NVIDIA-Quadro-M2000M.151581.0.html) with 4GB RAM in a “I started deep learning and I am serious about it” situation?
It will be comparable to a GTX 960.
Hi
I keep coming back to this great article. I was about to buy a 980 Ti when I discovered that NVIDIA today announced the Pascal GTX 1080, to be released at the end of May 2016. Maybe you want to update your article with the fantastic performance and price of the GTX 1080/1070.
I will update the blog post soon. I want to wait until some reliable performance statistics are available.
By the way, the price difference between ASUS, EVGA, etc. vs the original NVIDIA cards seems pretty high. The Titan X on Amazon is priced around 1300 to 1400 USD vs 999 USD in the NVIDIA online store. Do you advise against buying the original NVIDIA card? If yes, why? What is the difference? Which brand do you prefer?
Many thanks Tim. Your posts are unique. We badly need hardware posts for deep learning!
For deep learning the performance of the NVIDIA one will be almost the same as ASUS, EVGA etc (probably about 0-3% difference in performance). The brands like EVGA might also add something like dual-boot BIOS for the card, but otherwise it is the same chip. So definitely go for the NVIDIA one.
I read this interesting discussion about the difference in reliability, heat issues and future hardware failures of the reference design cards vs the OEM design cards:
https://hashcat.net/forum/thread-4386.html
The opinion was strongly against buying the OEM design cards, especially for compute and 24/7 operation of GPUs.
I read all 3 pages and it seems there is no citation or any scientific study backing up the opinion, but the author seems to have first-hand experience, having bought thousands of NVIDIA cards before.
So what is your comment about this? Should we avoid OEM design cards and stick with the original NVidia reference cards?
Answering my own question above:
I asked the same question of the author of this blog post (Matt Bach) from Puget Systems and he was kind enough to answer based on the around 4000 NVIDIA cards that they have installed at his company:
https://www.pugetsystems.com/labs/articles/Most-Reliable-PC-Hardware-of-2016-872/
I will quote the discussion happened in the comments of the above article, in case anybody is interested:
Matt Bach :
Interesting question and one that is a bit hard to answer since we don’t really track individual cards by usage. I will tell you, however, that we lean towards reference cards if the card is expected to be put under a heavy load or if multiple cards will be in a system. Many of the 3rd party designs like the EVGA ACX and ASUS STRIX series don’t have very good rear exhaust so the air tends to stay in the system and you have to vent it with the chassis fans. That is fine for a single card, but as soon as you stack multiple cards into a system it can produce a lot of heat that is hard to get rid of. The Linus video John posted in reply to your comment lines up pretty closely what we have seen in our testing.
I did go ahead and pull some failure numbers from the last two years. This is looking at all the reference cards we sold (EVGA, ASUS, and PNY mostly) versus the EVGA ACX and ASUS STRIX cards (which are the only non-reference cards we tend to sell):
Total Failures: Reference 1.8%, EVGA ACX 5.0%, ASUS STRIX 6.6%
DOA/Shop Failures: Reference 1.0%, EVGA ACX 3.9%, ASUS STRIX 1.5%
Field Failures: Reference .7%, EVGA ACX 1.1%, ASUS STRIX 3.4%
Again, we don’t know the specific usage for each card, but this is looking at about 4,000 cards in total so it should average out pretty well. If anything, since we prefer to use the reference cards in 24/7 compute situations this is making the reference cards look worse than they actually are. The most telling is probably the field failure rate since that is where the cards fail over time. In that case, the reference are only a bit better than the EVGA ACX, but quite a bit better than the ASUS STRIX cards.
Overall, I would definitely advise using the reference style cards for anything that is heavy load. We find them to work more reliably both out of the box and over time, and the fact that they exhaust out the rear really helps keep them cooler – especially when you have more than one card.
Hayder Hussein:
Recently Nvidia began selling their own cards by themselves (with a bit higher price). What will be your preference? The cards that Nvidia are manufacturing and selling by themselves or a third party reference design cards like EVGA or Asus ?
Matt Bach :
As far as I know, NVIDIA is only selling their own of the Titan X Pascal card. I think that was just because supply of the GPU core or memory is so tight that they couldn’t supply all the different manufacturers so they decided to sell it directly. I believe the goal is to get it to the different manufacturers eventually, but who knows when/if that will happen.
If they start doing that for the other models too, there really shouldn’t be much of a difference between an NVIDIA branded card and a reference Asus/EVGA/whatever. Really hard to know if NVIDIA would have a different reliability than other brands but my gut instinct is that the difference would be minimal.
That is really insightful, thank you for your comment!
Your blog posts have become a must-read for anyone starting on deep learning with GPUs. Very well written, especially for newbies.
I was wondering, though, if/when you will write about the new beast: the GTX 1080? I am thinking of putting together a multi-GPU workstation with these cards. If you could compare the 1080 with the Titan or 900 series cards, that would be super useful for me (and I am sure quite a few other folks).
Thank you for this great article. What is your opinion about the new Pascal GPUs? How would you rank the GTX1080 and GTX1070 compared to the GTX Titan X? Is it better to buy the newer GTX 1080 or to buy a Titan X which has more memory?
Both cards are better than the GTX Titan X. I do not have any hard data on this yet, but it seems that the GTX 1080 is just better, especially if you use 16-bit data.
Hey,
Great Writeup.
I have a GTX 970M with i7 6700 (desktop CPU) on a Clevo laptop.
How good is GTX 970m for deep learning?
A GTX 970M is pretty okay; especially the 6GB variant will be enough to explore deep learning and fit some good models on data. However, you will not be able to fit state-of-the-art models, or train medium-sized models in good time.
Great article! I would love to see some benchmarks on actual deep learning tasks.
I was under the impression that single precision could potentially result in large errors. In large networks with small weights/gradients, won’t the limited precision propagate through the net causing a snowballing effect?
I admit I have not experimented with this, or tried calculating it, but this is what I think. I’ve been trying to get my hands on a Titan / Titan Black, but with what you suggest, it would be much better getting the new Pascal cards.
With that being said, how would ‘half precision’ do with deep learning then?
The problem with actual deep learning benchmarks is that you need the actual hardware, and I do not have all these GPUs.
Working with low precision is just fine. The error is not high enough to cause problems. It was even shown that this is true for using single bits instead of floats since stochastic gradient descent only needs to minimize the expectation of the log likelihood, not the log likelihood of mini-batches.
Yes, Pascal will be better than Titan or Titan Black. Half-precision will double performance on Pascal since half-precision computations are supported. This is not true for Kepler or Maxwell, where you can store 16-bit floats but not compute with them (you need to cast them into 32 bits).
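A minimal NumPy sketch of this "store in 16 bits, compute in 32 bits" pattern (the array sizes are arbitrary):

```python
import numpy as np

# Store parameters in 16-bit to halve memory...
weights_fp16 = np.random.randn(4096, 4096).astype(np.float16)   # ~32 MB
inputs_fp16 = np.random.randn(128, 4096).astype(np.float16)

# ...but cast to 32-bit for the actual computation, which is what has to
# happen on Kepler/Maxwell GPUs since they cannot compute in half precision.
out = inputs_fp16.astype(np.float32) @ weights_fp16.astype(np.float32)
print(out.dtype, out.shape)   # float32 (128, 4096)
```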
What about the new NVIDIA GPUs like the GTX 1080 and GTX 1070? Please review these after they are released from the perspective of deep learning. NVIDIA claims that GTX 1080 performance beats the GTX Titan GPU; is that true for deep learning tasks?
I am about to buy a new GPU for deep learning tasks, so please suggest which GPU I should buy considering the budget vs performance ratio.
I was able to use TensorFlow, the latest Google machine learning framework, with an NVIDIA GTX 960 on Ubuntu 16.04. It’s not officially supported but it can be used.
I’ve posted a tutorial about how to install it here:
http://stackoverflow.com/questions/37600808/how-to-install-tensorflow-from-source-with-unofficial-gpus-support-in-ubuntu-16
Hi,
Very nice post! I found it really useful and I felt the GeForce 980 suggestion for Kaggle competitions was really apt. However, I am wondering how good the mobile versions of the GeForce series, such as the 940M, 960M, 980M and so on, are for Kaggle. Any thoughts on this?
I think for Kaggle anything >=6GB of memory will do just fine. If you have a slower 6GB card then you have to wait longer, but it is still much faster than a laptop CPU, and although it is slower than a desktop you still get a nice speedup and a good deep learning experience. Getting one of the fast cards is, however, often a money issue, as laptops that have them are exceptionally expensive. So a laptop card is good for tinkering and getting some good results in Kaggle competitions. However, if you really want to win a deep learning Kaggle competition, computational power is often very important and then only the high-end desktop cards will do.
Tim, what is the InfiniBand 40Gbit/s interconnect card for? Do I absolutely need the card if I am going for a multi-GPU solution? And are all three of your Titan X cards connected using SLI?
You only need InfiniBand if you want to connect multiple computers. For multiple GPUs you just need multiple PCIe slots and a CPU that supports enough lanes. SLI is only used for games, but not for CUDA compute.
Hi Tim, thanks for updating the article! Your blog helped me a lot in increasing my understanding of Machine Learning and the Technologies behind it.
Up to now I mostly used AWS or Azure for my computations, but I am planning a new PC build. Unfortunately I have still some unanswered questions where even the mighty Google could not help!
Anyway, I was wondering whether you use all your GPU(s) for monitor output as well? I read a lot about screen tearing / blank screens / X stopping for a few seconds while running the algorithms.
A possible solution would be to get a dedicated GPU for display output, such as a GTX 950 running at x8 to connect 3 monitors, while having 2 GTX 1080s at x16 speed just for computation. What is your opinion / experience regarding this matter?
Furthermore, my current PC’s CPU only has 16 PCIe lanes but has an iGPU built in. Could I use the iGPU for graphics output while a 1080 is installed for computation? I found a thread on Quora, but the only feedback given was to get a CPU with 40 PCIe lanes. Of course this is true, but cash does not grow on trees, and AMD Zen and the new Skylake Extreme chipsets are on the horizon.
Your feedback is highly appreciated and thanks in advance!
Personally, I never had any problems with video output from GPUs on which I also do computation.
The integrated iGPU is independent of your dedicated GTX 1080 and does not eat up any lanes. So you can easily run graphics from your iGPU and compute with 16 lanes from your GTX 1080.
The only problem you might encounter is with the Intel/NVIDIA driver combination. I would not care too much about the performance reduction (0-5%), and I have yet to see problems with using a GPU for both graphics and compute at the same time; I have worked with about 5-6 different setups.
So I would try the iGPU + GTX 1080 setup and if you run into errors just use the GTX 1080 for everything.
Hi, I am trying to find a Kepler card (CC >= 3.0/3.5) for my research. Could you please suggest one? I had a GeForce GTX Titan X earlier but I could not use it for dual purposes, i.e. for computation and as the display driver; the Titan X did not allow this. So I’m searching for a Kepler card which allows dual use. Kindly suggest one.
Try to recheck your configuration. I have been running deep learning and a display driver on a GTX Titan X for quite some time and it is running just fine.
Thanks for the quick reply!
One final question, which may sound completely stupid.
Reference Cards vs Custom GPUs?
Often the clock speed and sometimes the VRAM speed are overclocked by default on many non-reference cards, but if I look through builds and their included screenshots, it seems that mostly reference cards are used. The only reason I could think of is the "predictable" cooler height.
Thanks again!
Oops, I meant Founders Edition vs reference cards... happens when the mind wanders elsewhere!
Thanks Tim. Great update to this article, as usual.
I was eager to see any info on the support of half precision (16-bit) processing in the GTX 1080. Some articles were speculating a few days before the release that it might be deactivated by NVIDIA, reserving this feature for the future NVIDIA P100 Pascal cards. However, around one month after the release of the GTX 1000 series, nobody seems to mention anything related to this important feature. This alone, if it had been enabled in the GTX cards, would make them up to ~1.5 to 2x faster, with more TFLOPS of processing power in comparison to Maxwell GPUs, including the Titan X. And as you mentioned, it adds the bonus of up to half the memory requirements. However, it is still not clear whether the accuracy of the NN will be the same in comparison to single precision and whether we can use half precision for all the parameters, which of course is important for estimating how much the speedup will be and how much less memory is required for a given task.
Hey there!
Great article! I am currently trying to flicker-train GoogLeNet on 400 of my own images using my SLI 780 Tis, but I keep getting errors such as "cannot find file -> dir to file location" (one of the images I’m training it on), even though the file is there and the correct directory is in the train file. Do you have any idea why this would be? Also, in the guide I followed, the author had 4GB of VRAM and used a batch size of 40 with 256x256 images; I did the same but with a batch size of 30 to account for my 3GB of VRAM. Am I doing something wrong here? How can I optimise the training to work on my video card? I appreciate any help you can give! Thanks, Josh
Hi Tim,
Thanks for a great article, it helped a lot.
I am a beginner in the field of deep learning and have built and used only a couple of architectures on my CPU (I am currently a student, so I decided not to invest in GPUs right away).
I have a question regarding the amount of CUDA programming required if I decide to do some sort of research in this field. I have mostly implemented my vanilla models in Keras and am learning Lasagne so that I can come up with novel architectures.
I know quite a few researchers whose CUDA skills are not the best. You often need CUDA skills to implement efficient implementations of novel procedures or to optimize the flow of operations in existing architectures, but if you want to come up with novel architectures and can live with a slight performance loss, then no or very little CUDA skill is required. Sometimes you will have cases where you cannot make progress due to your lacking CUDA skills, but this is rarely an issue. So do not waste your time on CUDA!
Thanks for the reply
Is a GT 635 capable of cuDNN (conv_dnn) acceleration? In theory it is a GK208 Kepler chip with 2GB of memory. I know it’s a crap card but it’s the only NVIDIA card I had lying around. I have not been able to get GPU acceleration on Windows 8.1 to work, so I wanted to ask if it’s my Theano/CUDA/Keras installation that’s the issue, or if it’s the card, before I throw any money at the problem and buy a better GPU (960+). Should I go to Windows 10?
Your card, although crappy, is a kepler card and should work just fine. Windows could be the issue here. Often it is not well supported by deep learning frameworks. You could try CNTK which has better windows support. If you try CNTK it is important that you follow this install tutorial step-by-step from top to bottom.
I would not recommend Windows for doing deep learning as you will often run into problems. I would encourage you to try to switch to Ubuntu. Although the experience is not as great when you make the switch, you will soon find that it is much superior for deep learning.
Thanks Tim, I did eventually get the GT 635 working under Windows 8.1 on my Dell: about a 2.7x improvement over my Mac Pro’s 6-core Xeons. Getting things going on OS X was much easier. I still don’t think the GT 635 is using cuDNN (cuDNN not available), but I’ll have to play around; I get the sense I could get another 2x with it. The 2GB of VRAM sucks, I really have to limit the batch sizes I can work with.
What are your thoughts on the GTX 1060? An easy replacement for the 960 on the low end? Overclocking the 1060 looks like it can get close to a FE 1070, minus 2GB of memory. Thoughts?
Hello Tim,
I’ve been following your blog for 3 months now and since then I have been waiting to buy a GTX Titan X Pascal. However, there are rumors about NVIDIA releasing the Volta architecture next year with HBM2. What are your thoughts about investing in a Pascal-architecture GPU currently? Thank you.
If you already have a good Maxwell GPU, the wait for Volta might well be worth it. However, this of course depends on your applications, and then of course you can always sell your Pascal GPU once Volta hits the market. Both options have their pros and cons.
I’m curious whether unified memory in CUDA 8 will work for dual 1080s.
Then theoretically a dual 1080 NVLink setup will crush the Titan X in memory and flops?
Unified memory is more a theoretical than a practical concept right now. The CPU and GPU memory are still managed by the same mechanism as before; it is just that the transfers are hidden. Currently you will not see any benefit from this over Maxwell GPUs.
My brother recommended that I might like this blog.
He was totally right. This post actually made my day.
You cannot believe just how much time I had spent looking for this info!
Thanks!
I am glad to hear that you and your brother found my blog post helpful
Thank you!
hi
My confusions are:
1) Is the Quadro series, K2000 and higher, capable enough to begin deep learning?
2) Kepler, Maxwell, Pascal: how much difference does the architecture make in performance for a beginner?
3) Is the GTX Titan X Pascal or Maxwell?
5) Which parameters should be considered when comparing GPUs as far as deep learning is concerned?
4) Please suggest a GPU in your opinion.
1) A K2000 will be okay and you can do some tests, but its memory and performance will not be sufficient for larger datasets
2) Get a Maxwell or Pascal GPU if you have the money; Kepler is slow
3) There is one Titan X for Pascal and one for Maxwell
5) Look at memory bandwidth mostly
4) GTX 1060
Exceptionally excellent blog
Thank you so much for your valuable reply.
Hi Tim,
first of all, thank you for your awesome articles about deep learning. They have been very useful for me. Since I use the Caffe and CNTK frameworks for deep learning and GPU computing speed is very important, encouraged by your last article update (GTX Titan X Pascal = 0.7 GTX 1080 = 0.5 GTX 980 Ti) and very positive reviews on the Internet, I decided to upgrade my GTX 980 Ti (Maxwell) to a brand new GTX 1080 (Pascal). In order to compare the performance of both architectures, new Pascal with old Maxwell (and of course because I just wanted to see how well my new GTX 1080 performs, to justify the expense),
I benchmarked both cards in Caffe (CNTK is not cuDNN 5 ready yet). To my big surprise the new GTX 1080 is about 20% slower in AlexNet training than the old GTX 980 Ti. I ran two benchmarks in order to compare performance on different operating systems, but with practically the same results. The reason why I chose different versions of CUDA and cuDNN is that the Pascal architecture is only supported by CUDA 8RC and cuDNN 5.0, while the Maxwell architecture performs better with CUDA 7.5 and cuDNN 4.0 (otherwise you get poor performance).
Maybe I have done something wrong in my benchmark (but I’m not aware of anything…). Could you give me some advice on how to improve training performance on the GTX 1080 with Caffe? Is there any other framework which supports the Pascal architecture at full speed?
First benchmark:
OS: Windows 7 64-bit
Nvidia drivers: 368.81
Caffe build for GTX 1080: Visual Studio 2013 64-bit, CUDA 8RC, cuDNN 5.0
Caffe build for GTX 980 Ti: Visual Studio 2013 64-bit, CUDA 7.5, cuDNN 4.0
Caffe build for GTX 980 Ti: Visual Studio 2013 64-bit, CUDA 7.5, cuDNN 5.0
Caffe build for GTX 980 Ti: Visual Studio 2013 64-bit, CUDA 7.5, cuDNN 5.1
GTX 1080 performance: 4512 samples/sec.
GTX 980 Ti performance: 5407 samples/sec. (cuDNN 4.0) best performance
GTX 980 Ti performance: 4305 samples/sec. (cuDNN 5.0)
GTX 980 Ti performance: 4364 samples/sec. (cuDNN 5.1)
Second benchmark:
OS: Ubuntu 16.04.1
Nvidia drivers: 367.35
Caffe build for GTX 1080: gcc 5.4.0, CUDA 8RC, cuDNN 5.0
Caffe build for GTX 980 Ti: gcc 5.4.0, CUDA 7.5, cuDNN 4.0
GTX 1080 performance: 4563 samples/sec.
GTX 980 Ti performance: 5385 samples/sec.
Thank you very much,
Ondrej
Thank you Ondrej for sharing — these are some very insightful results!
I am not entirely sure how convolutional algorithm selection works in Caffe, but this might be the main reason for the performance discrepancy. The cards might have better performance for certain kernel sizes and for certain convolutional algorithms. But all in all these are quite some hard numbers and there is little room for arguing. I think I need to update my blog post with some new numbers. Learning that the performance of Maxwell cards is so much better with cuDNN 4.0 is also very valuable. I will definitely add this in an update to the blog post.
Thanks again for sharing all this information!
The best article about choosing GPUs for deep learning I’ve ever read!
As a CNN learner on a small budget, I decided to buy a GTX 1060 to replace my old Quadro K620. Since the GTX 1060 does not support SLI and you wrote about "using the PCIe 3.0 interface for communication in multi-GPU applications", I am a little worried about upgrading later. Should I buy a GTX 1070 instead of a GTX 1060? Thanks.
Maybe this was a bit confusing, but you do not need SLI for deep learning applications. The GPUs communicate via the channels that are imprinted on the motherboard. So you can use multiple GTX 1060 in parallel without any problem.
Thanks for sharing your knowledge about these topics.
Regards.
Dear Tim,
Extremely thankful for the info provided in this post.
We have a GPU server on which CUDA 6.0 is installed and it has two Tesla T10 graphics cards. My question is whether I can use this GPU system for deep learning, as the Tesla T10 is quite old by now. I am facing some hardware issues with installing Caffe on this server. It has Ubuntu 12.04 LTS as its OS.
Thanks in advance
Tharun
The Tesla T10 chip has too low a compute capability, so you will not be able to use cuDNN. Without that you can still run some deep learning libraries, but your options will be limited and training will be slow. You might want to just use your CPU or try to get a better GPU.
Hi Tim
Your site contains such a wealth of knowledge. Thank you.
I am interested in having your opinion on cooling the GPU. I contacted NVIDIA to ask what cooling solutions they would recommend on a GTX Titan X Pascal in regards to deep learning and they suggested that no additional cooling was required. Furthermore, they would discourage adding any cooling devices (such as EK WB) as it would void the warranty. What are your thoughts? Is the new Titan Pascal that cooling efficient? If not, is there a device you would recommend in particular?
Also, I am mostly interested in RNN and I plan on starting with just one GPU. Would you recommend a second GPU in light of the new SLI bridge offered by NVIDIA? Do you think it could deliver increased performance on single experiment?
I would also like to add that, looking at the DevBox components,
no particular cooling is added except for sufficient GPU spacing and upgraded front fans.
http://developer.download.nvidia.com/assets/cuda/secure/DIGITS/DIGITS_DEVBOX_DESIGN_GUIDE.pdf?autho=1471007267_ccd7e14b5902fa555f7e26e1ff2fe1ee&file=DIGITS_DEVBOX_DESIGN_GUIDE.pdf
From my experience, additional case fans are negligible (less than a 5-degree difference; often as low as 1-2 degrees). Increasing the GPU fan speed by 1% often has a larger effect than additional case fans.
If you only run a single Titan X Pascal then you will indeed be fine without any other cooling solution. Sometimes it will be necessary to increase the fan speed to keep the GPU below 80 degrees, but the sound level for that is still bearable. If you use more GPUs air cooling is still fine, but when the workstation is in the same room then noise from the fans can become an issue as well as the heat (it is nice in winter, then you do not need any additional heating in your room, even if it is freezing outside). If you have multiple GPUs then moving the server to another room and just cranking up the GPU fans and accessing your server remotely is often a very practical option. If those options are not for you water cooling offers a very good solution.
Hi Tim,
Thanks for the great article and thanks for continuing to update it!
Am I correct that the Pascal Titan X doesn’t support FP16 computations? So if TensorFlow or Theano (or one’s library of choice) starts fully supporting FP16, would the GTX 1080 then be better than the new Titan X, as it would have larger effective (FP16) memory? But perhaps I am missing something…
Is it clear yet whether FP16 will always be sufficient or might FP32 prove necessary in some cases?
Thanks!
Hey Chad, the GTX 1080 also does not support FP16 which is a shame. We will have to wait for Volta for this I guess. Probably FP16 will be sufficient for most things, since there are already many approaches which work well with lower precision, but we just have to wait.
ah, ok. got it. Thanks a lot!
Which GTX, if any, support int8? Does Tensorflow support int8? Thanks for the great blog.
All GPUs support int8, both signed and unsigned; in CUDA this is just a signed or unsigned char. I think you can do regular computation with it just fine. However, I do not know how good the support in TensorFlow is, but in general most deep learning frameworks do not have support for computations on 8-bit tensors. You might have to work closer to the CUDA code to implement a solution, but it is definitely possible. If you work with 8-bit data on the GPU, you can also input 32-bit floats and then cast them to 8 bits in the CUDA kernel; this is what Torch does in its 1-bit quantization routines, for example.
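For illustration, here is a minimal NumPy sketch of storing a tensor as int8 and casting back for computation; the simple linear scaling used here is just an example, not the scheme of any particular framework.

```python
import numpy as np

x = np.random.randn(1024, 1024).astype(np.float32)

# Quantize: map the float range onto signed 8-bit integers.
scale = np.abs(x).max() / 127.0
x_int8 = np.clip(np.round(x / scale), -127, 127).astype(np.int8)   # 4x less memory

# Dequantize (the "cast back" step a CUDA kernel would do) and compute in float32.
x_restored = x_int8.astype(np.float32) * scale
print("max abs error:", np.abs(x - x_restored).max())
```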
Hi Tim,
Would multiple lower-tier GPUs serve better than a single high-tier GPU at a similar cost?
Eg.
3 x 1070
vs
1 x Titan X Pascal
Which would you recommend?
Here is one of my quora answers which deals exactly with this problem. The cards in that example are different, but the same is true for the new cards.
Thanks for the reply.
I am not sure if I understand the answer correctly: is the bottleneck you are referring to the PCIe bandwidth, which is around 8GB/s when using multiple cards, compared to the single larger-memory Titan X’s bandwidth of 336GB/s?
One more question: does slower DDR RAM bandwidth impact the performance of deep learning?
That is correct: for multiple cards the bottleneck will be the connection between the cards, which in this case is the PCIe connection. Slower DDR RAM bandwidth decreases performance by almost as much as the bandwidth is lower, so it is quite important. This comparison, however, is not valid between different GPU series, e.g. it is invalid for Maxwell vs. Pascal.
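To put the two bandwidth numbers side by side, here is a rough sketch; the gradient size is an assumption chosen purely for illustration:

```python
# Time to move a 60M-parameter gradient (32-bit) across PCIe vs. reading the
# same amount of data from the GPU's own memory. Numbers are approximate.
gradient_gb = 60e6 * 4 / 1e9       # ~0.24 GB

pcie_bw = 8.0       # GB/s, roughly PCIe 3.0 x16 in practice
gpu_mem_bw = 336.0  # GB/s, Titan X on-card memory bandwidth

print(f"over PCIe:      {gradient_gb / pcie_bw * 1000:.1f} ms")
print(f"on-card memory: {gradient_gb / gpu_mem_bw * 1000:.2f} ms")
```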
Hi Tim. Great article.
One question: I have been given a Quadro M6000 24GB. How do you think it compares to a Titan or Titan X for deep learning (specifically Tensorflow)? I’ve used a Titan before and I am hoping that at least it wouldn’t be slower.
Thank you.
The Quadro M6000 is an excellent card! I do not recommend it because it is not very cost-efficient. However, the very large memory and high speed, which is equivalent to a regular GTX Titan X, are quite impressive. On normal cards you do not have more than 12GB of RAM, which means you can train very large models on your M6000. So I would definitely stick with it!
Awesome, thanks for the quick response.
Hey Tim, thank you so muuuch for your article!! I am in the "I started deep learning and I am serious about it" group and will buy a GTX 1060 for it. I am more specifically interested in autonomous vehicles and Simultaneous Localization and Mapping. Your article has helped me clarify my current needs and match them with a GPU and budget.
You have a new follower here!
Thanks!
Thank you for your kind words; I am glad that you found my article helpful!
Hey Tim,
Thank you for this fantastic article. I have learned a lot in these past couple of weeks on how to build a good computer for deep learning.
My question is rather simple, but I have not found an answer yet on the web: should I buy one Titan X Pascal or two GTX 1080s?
Thank you very much for your time,
Tim
Hey Tim,
In the past I would have recommended one faster bigger GPU over two smaller, more cost-efficient ones, but I am not so sure anymore. The parallelization in deep learning software gets better and better and if you do not parallelize your code you can just run two nets at a time. However, if you really want to work on large datasets or memory-intensive domains like video, then a Titan X Pascal might be the way to go. I think it highly depends on the application. If you do not necessarily need the extra memory — that means you work mostly on applications rather than research and you are using deep learning as a tool to get good results, rather than a tool to get the best results — then two GTX 1080 should be better. Otherwise go for the Titan X Pascal.
Tim
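To illustrate the “just run two nets at a time” option from the reply above: one simple pattern is to pin each training process to its own card via CUDA_VISIBLE_DEVICES. This is only a sketch; train.py and the config files are hypothetical placeholders for your own training script:

import os
import subprocess

experiments = ["config_a.yaml", "config_b.yaml"]  # one experiment per GPU

procs = []
for gpu_id, config in enumerate(experiments):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))  # GPU 0 and GPU 1
    procs.append(subprocess.Popen(["python", "train.py", "--config", config], env=env))

for p in procs:
    p.wait()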
Tim!
Thanks so much for your article. It was instrumental in me buying the Maxwell Titan X about a year ago. Now, I’ve upgraded to 4 Pascal Titan X cards, but I’m having some problems getting performance to scale using data parallelism.
I’m trying to chunk a batch of images into 4 chunks and classify them (using caffe) on the 4 cards in parallel using 4 separate processes.
I’ve confirmed the processes are running on the separate cards as expected, but performance degrades as I add new cards. For example, if it takes me 0.4 sec / image on 1 card alone, when I run 2 cards in parallel, they each take about 0.7 sec / image.
Have you had any experience using multiple Pascal Titan X’s in this manner? Am I just missing something about the setup/driver install?
Thanks!
Were you getting better performance on your Maxwell Titan X? It also depends heavily on your network architecture; what kind of architecture were you using? Data parallelism in convolutional layers should yield good speedups, as do deep recurrent layers in general. However, if you are using data parallelism on fully connected layers this might lead to the slowdown that you are seeing — in that case the bandwidth between GPUs is just not high enough.
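A rough back-of-the-envelope calculation shows why fully connected layers are the problem; the layer size and the 8 GB/s effective PCIe bandwidth are assumed numbers, not measurements:

# Assumed: a 4096x4096 fully connected layer, float32 gradients,
# and ~8 GB/s effective PCIe bandwidth between the GPUs.
params = 4096 * 4096
grad_bytes = params * 4
pcie_bandwidth = 8e9  # bytes per second
sync_ms = grad_bytes / pcie_bandwidth * 1000
print(f"~{grad_bytes / 1e6:.0f} MB of gradients, ~{sync_ms:.1f} ms per synchronization")
# A convolutional layer with far fewer parameters produces far less traffic,
# which is why data parallelism scales much better for convolutional nets.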
Hi Tim,
Thank you very much for the fast answer.
I just have one more question related to the CPU. I understand that having more lanes is better when working with multiple GPUs, as the CPU will have enough bandwidth to sustain them. However, in the case of having just one GPU, is it necessary to have more than 16 or 28 lanes? I was looking at the *Intel Core i7-5930K 3.5GHz 6-Core Processor*, which has 40 lanes (and is the cheapest in that category) but also requires an LGA 2011 socket and DDR4 memory, which are expensive. Is this going to be overkill for the Titan X Pascal?
Thank you for your time!
Tim
If you have only 1 card, then 16 lanes will be all that you need. If you upgrade to two GPUs you want to have either 32+ lanes (16 lanes for each) or just stick with 16 lanes (8 lanes for each), since the slowest GPU will always drag down the other one (a 28-lane CPU gives you 16x + 8x, and for parallelism this will be bottlenecked by the 8 lanes). Even if you are using 8 lanes, the drop in performance may be negligible for some architectures (recurrent nets with many time steps; convolutional layers) or some parallel algorithms (1-bit quantization, block momentum). So you should be more than fine with 16 or 28 lanes.
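For reference, a quick sketch of what the lane counts mean in raw bandwidth; the ~985 MB/s per PCIe 3.0 lane is the theoretical figure, and real transfers will be somewhat lower:

per_lane_gbps = 0.985  # PCIe 3.0: ~985 MB/s per lane after encoding overhead
for lanes in (4, 8, 16):
    print(f"x{lanes}: ~{lanes * per_lane_gbps:.1f} GB/s theoretical")
# x8 (~7.9 GB/s) halves the GPU-to-GPU ceiling compared to x16 (~15.8 GB/s),
# which only matters if your parallel algorithm is communication-bound.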
I compared the Quadro K2200 with the M4000.
Surprisingly, the K2200 beat the M4000 on a simple network.
I am looking for a single-slot GPU with higher performance than the K2200.
How about the K4200?
Quadro K4200 (single-slot, single precision = 2,072.4 GFLOPS)
Check your benchmarks and if they are representative of usual deep learning performance. The K2200 should not be faster than a M4000. What kind of simple network were you testing on?
I tested the simple network with the default chainer example below.
python examples/mnist/train_mnist.py --gpu 0
Results:
K2200: avg 14288.94418 images/sec
M4000: avg 13617.58361 images/sec
However, I confirmed that the M4000 is faster than the K2200 on a complex network like AlexNet.
[convnet-benchmarks]
./convnet-benchmarks/chainer/run.sh
Results:
K2200: AlexNet 639ms, OverFeat 2195ms
M4000: AlexNet 315ms, OverFeat 1142ms
I think the GPU clock matters most for the simple network. Is this correct?
GPU clock: K2200 = 1045 MHz; M4000 = 772 MHz
Shading units: K2200 = 640; M4000 = 1664
Hi! Great article, very informative. However, I want to point out that the NVIDIA Geforce GTX Titan X and the NVIDIA Titan X are two different graphics cards (yes the naming is a little bit confusing). The Geforce GTX Titan X has Maxwell microarchitecture, while the Titan X has the newer Pascal microarchitecture. Hence there is no “GTX Titan X Pascal”.
Ah this is actually true. I did not realize that! Thanks for pointing that out! Probably it is still best to add Maxwell/Pascal to not confuse people, but I should remove the GTX part.
I live at a place where 200 kWh costs 19.92 dollars and 600 kWh costs 194.70 dollars; the electricity bill grows steeply with usage. I usually train unsupervised learning algorithms on 8 terabytes of video. Which GPU or GPUs should I get? The Titan X Pascal has the most bandwidth per watt, but it is a lot more expensive for the little gain in performance per watt.
Great article. I am just a noob at this and learning; not a researcher, but an applications guy. I have an old Mac Pro 2008 with 32GB of RAM (FB-DIMM in 8 channels) on dual quad-core Xeons at 2.8GHz (8 cores, 8 threads). I have been using a GTX 750 Ti with 4GB for DeepMask/SharpMask on Torch. The COCO image set took 5 days to train through 300 epochs on DeepMask. I am wondering how much of a performance increase I would see going to a GTX 1070? Or could I instead add a second GTX 750 Ti that matches the one I have, for 8GB of GPU RAM (I have room for 2 GPUs)?
thanks for everything
Adding a GTX 750 Ti will not increase your overall memory, since you will need to use data parallelism, where the same model rests on all GPUs (the model is not distributed among GPUs, so you will see no memory savings). In terms of speed, an upgrade to a GTX 1070 should be better than two GTX 750 Ti and also significantly easier in terms of programming (no multi-GPU programming needed). So I would go with the GTX 1070.
Hi Tim, thanks for an insightful article!
I picked up a new 13″ MacBook with Thunderbolt 3 ports, and I am thinking of a setup using a GTX 1080 in an eGFX enclosure – http://www.anandtech.com/show/10783/powercolor-announces-devil-box-thunderbolt-3-external-gpu-enclosure . What do you think of this idea?
It should work okay. There might be some performance problems when you transfer data from CPU to GPU. For most cases this should not be a problem, but if your software does not buffer data on the GPU (sending the next mini-batch while the current mini-batch is being processed) then there might be quite a performance hit. However, this performance hit is due to software and not hardware, so you should be able to write some code to fix the performance issues. In general you should see around 90% of the usual performance in most cases.
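If your framework of choice does not already do this, the buffering mentioned above is not hard to add yourself. Here is a minimal, framework-agnostic sketch that loads the next mini-batch in a background thread while the current one is being processed; load_batch and train_step are hypothetical placeholders:

import queue
import threading

def prefetched_batches(load_batch, num_batches, buffer_size=2):
    # Fill a small queue from a background thread so loading overlaps with training.
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        for i in range(num_batches):
            q.put(load_batch(i))  # e.g. read from disk, preprocess, copy to the GPU
        q.put(None)               # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            break
        yield batch

# Usage sketch:
# for batch in prefetched_batches(load_batch, num_batches=1000):
#     train_step(batch)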
Hi
I want to test multiple neural networks against each other using Encog. For that I want to get an NVIDIA card. After reading your article I am thinking about getting the 1060, but since most calculations in Encog use double precision, would the 780 Ti be a better fit? The data file will not be large and I do not use images.
Thanks
The GTX 780 Ti would still be slow for double precision. Try to get a GTX Titan (regular) or GTX Titan Black they have excellent double precision performance and work generally quite okay even in 32-bit mode.
Thanks for sharing this- good stuff! Keep up the great work, we look forward to reading more from you in the future!
Hi Tim,
Really useful post, thanks.
I wondered about the Titan Black – looking online, its memory bandwidth, 6GB of memory, and single and double precision are better than a 1060, and at current eBay prices it is about 10-15% cheaper than a 1060.
Other than the lower power of the 1060 and warranty, would there be any reason to choose the 1060 over a Titan Black?
Thank you
The architecture of the GTX 1060 is more efficient than that of the Titan Black. Thus for speed the GTX 1060 should still be faster, but probably not by much. So the GTX Titan Black is a solid choice, especially if you also want to use double precision.
Hi Tim,
thanks for the article, it’s the most useful I found during my 14-hour google-marathon!
I’m very new to deep learning; starting with YOLO I’ve found that my GTX 670 with 2GB is seriously limiting what I can explore. Inferring from your article, if I stack multiple GPUs for CNNs, the memory will in principle add up, right? I’m asking because I will decide between a used Maxwell Titan X and a 1070/1080, and my main concern is memory, so I would like to know if adding a second card later, when they are cheaper, is a reasonable memory upgrade option (for CNNs). Furthermore, if the 1080 and the used Maxwell Titan X are the same price, is this a good deal?
Also, I’m concerned about the FP16x2 feature of the 1070/1080, which adds only one FP16x2 core for every 128 FP32 cores: if I’m using FP16, the driver might report my card as FP16v2 capable, and thus a framework might use these few FP16v2 cores instead of emulating FP16 arithmetic by promoting to FP32. Is this a valid worst-case scenario for e.g. caffe/torch/… or am I confusing something here? Also, I’ve read that before Pascal there is effectively no storage benefit from FP16, as the numbers need to be promoted to FP32 anyway. I can only understand this if the data needs to be promoted before being fetched into the registers for computation; is this right?
Thank you
Hi Markus,
unfortunately the memory will not stack up, since you will probably use data parallelism to parallelize your models (the only form of parallelism which really works well and is fast).
If you can get a used Maxwell Titan X cheap, this is a solid choice. I personally would not mind the minor slowdown compared to the added flexibility, so I would go for the Titan X here as well.
Currently, you do not need to worry about FP16. Current code will make use of FP16 memory, but FP32 computations so that the slow FP16 compute units on the GTX 10 series will not come into play. All of this probably only becomes relevant with the next Pascal generation or even only with Volta.
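Here is a tiny NumPy sketch of that “FP16 storage, FP32 math” pattern, just to make it concrete (the sizes are arbitrary):

import numpy as np

# Keep the parameters in float16 to halve their memory footprint...
weights_fp16 = (np.random.randn(1024, 1024) * 0.01).astype(np.float16)
x = np.random.randn(64, 1024).astype(np.float16)

# ...but cast up to float32 for the actual arithmetic, as current frameworks do.
activations = x.astype(np.float32) @ weights_fp16.astype(np.float32)
print(activations.dtype, weights_fp16.nbytes / 2**20, "MB of weights")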
I hope this answered all of your questions. Let me know if things are still unclear.
Hi Tim, thanks for a great article! I’m just wondering if you had experience with installing the GTX or Titan X on rackmount servers? Or if you have recommendations for articles or providers on the web? (I’m in UK). I am having a long running discussion with IT support about whether it is possible, as we couldn’t find any big providers that would put together such a system. The main issue seems to revolve around cooling as IT says that Teslas are passive cooled while Titan X are active cooled, and may interfere with the server’s cooling system.
I think the passively cooled Teslas are still two PCIe slots wide, so that should not be a problem. If the current Tesla cards were one slot wide, then it would be a problem and Titan Xs would not be an option.
Cooling might indeed also be an issue. If the passively cooled Teslas have intricate cooling fins, then their cooling combined with active server cooling might indeed be much superior to what Titan Xs can offer. Cooling systems for clusters can be quite complicated, and this might lead to Titan Xs breaking the system.
Another issue might be just buying Titan Xs in bulk. NVIDIA does not sell them in bulk, so you will only be able to equip a small cluster with these cards (this is also the reason why you do not find any providers for such a system).
Hope this helps!
Hi Tim, I found an interesting thing recently.
I tried one Keras(both theano and tensorflow were tested) project on three different computing platforms:
A: SSD + i5 3470 (3.2GHz) + GTX 750 Ti (2GB)
B: SSD + E5-2620 v3 + Titan X (12GB)
C: HDD + i5 6300HQ (2.6GHz) + GTX 965M (4GB)
With the same setup of CUDA 8.0 and cuDNN 5.0, A and B got similar GPU performance. However, I cannot understand why C is about 5 times slower than A. Before the experiment I had guessed that C would perform better than A.
As I understand it, Keras might not prefetch data. On certain problems this might introduce some latency when you load data, and loading data from a hard disk is slower than from an SSD. If the data is loaded into memory by your code, however, this is unlikely to be the problem. What strikes me is that A and B should not be equally fast. C could also be slow due to the laptop motherboard, which may have a poor or reduced PCIe connection, but usually this should not be such a big problem. Of course this could still happen for certain datasets.
How about GTX 1070 SLI?
“However, training may also take longer, especially the last stages of training where it becomes more and more important to have accurate gradients.” Why are the last stages of training important? Any justification?
It is easy to improve from a pretty bad solution to an okay solution, but it is very difficult to improve from a good solution to a very good solution. Improving your 100-meter dash time by a second is probably not so difficult, while for an Olympic athlete it is nearly impossible, because they already operate at a very high level. The same goes for neural nets and their solution accuracy.
Does an AMD card support Theano / Keras?
Amazon has introduced a new class of instances: Accelerated Computing Instances (P2), with 12GB K80 GPUs. These are much, much better than the older G2 instances, and go for $0.90/hr. Does this change anything in your analysis?
With such an instance you get one K80 for $0.90/h, which means $21.60/day and $648/month. If you use your GPU for more than about one GPU-month of runtime, then it gets cheaper and cheaper to buy your own GPU. I do not think it really makes sense for most people.
It is probably a good option for people doing Kaggle competitions, since most of the time will still be spent on feature engineering and ensembling. For researchers, startups, and people who are learning deep learning, it is probably still more attractive to buy a GPU.
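A quick break-even sketch for the numbers above; the $700 price for a local card is an assumption (roughly a high-end consumer GPU), and electricity plus the rest of the machine are ignored:

p2_price_per_hour = 0.90   # USD, one K80 GPU on a P2 instance
local_gpu_price = 700.0    # USD, assumed price of a high-end consumer card

break_even_hours = local_gpu_price / p2_price_per_hour
print(f"Break-even after ~{break_even_hours:.0f} GPU-hours "
      f"(~{break_even_hours / 24:.0f} days of continuous training)")
# Roughly 780 hours, i.e. about a month of continuous training, and the local
# card will usually be faster than a single K80 GPU on top of that.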
Hello,
I decided to buy a GTX 1060 or GTX 1070 card to try deep learning, but I am curious whether the RAM size of the GPU or its bandwidth/speed will affect the ACCURACY of the final model, comparing these two specific GPU cards.
In other words, I want to know whether selecting the GTX 1060 will just cause a longer training time compared to the GTX 1070, or whether it will also affect the accuracy of the model.
Hi Hesam, the two cards will yield the same accuracy. There are some elements in the GPU which are non-deterministic for some operations, and thus the results will not be exactly the same, but they will always be of similar accuracy.
Hi, it’s been a pleasure to read this article! Thanks!
Have you done any comparison of 2 x Titan X against 4 x GTX 1080? Or maybe you have some thoughts regarding it?
The speed of 4x GTX 1080 vs 2x Titan X is difficult to compare, because parallelism is still not well supported in most frameworks and the speedups are often poor. If you look at the GPUs separately, however, it depends on how much memory your task needs. If 8GB is enough, 4x GTX 1080 are definitely better than 2x Titan X; if not, then 2x Titan X are better.
Hi Tim,
first of all thank you for this great article.
I understand that the memory clock speed is quite important and depending on which graphics card manufacturer/line I choose, there will be up to a 10% difference.
Here is a good overview [German]:
http://www.pcgameshardware.de/Pascal-Codename-265448/Specials/Geforce-GTX-1080-Custom-Designs-Herstellerkarten-1198846/
I am going to buy a 1080 and I am wondering if it makes sense to get such an OC one.
Do you have any experience with / advice on this?
Thank you for an answer,
Erik
OC GPUs are good for gaming, but they hardly make a difference for deep learning. You are better off buying a GPU with other features such as better cooling. When I tested overclocking on my GPUs it was difficult to measure any improvement. Maybe you will get something in the range of 1-3% improved performance from an OC GPU, so it is not worth much if you need to pay extra for OC.
Could you add AWS’s new P2 instance into comparison? Thank you very much!
The last time I checked, the new GPU instances were not viable due to their pricing. Only in some limited scenarios, where you need deep learning hardware for a very short time, do AWS GPU instances make economic sense. Often it is better to buy a GPU, even if it is a cheaper, slower one. With that you will get many more GPU-accelerated hours for your money compared to AWS instances. If money is less of an issue, AWS instances also make sense for firing up some temporary compute power for a few experiments, or for startups training a new model.
Hi Tim,
With the release of the GTX 1080 Ti and the revamp+reprice of GTX 1060/70/80, would you change anything in your TL;DR section, especially vs Pascal Titan X ?
Links to key points:
– GTX 1080 Ti: http://wccftech.com/nvidia-geforce-gtx-1080-ti-unleash-699-usd/
– Revamp+reprice of GTX 1060/70/80 etc.: http://wccftech.com/nvidia-geforce-gtx-1080-1070-1060-official-price-cut-specs-upgrade/
Cheers,
E.
Thank you so much for the links. I will have to look at the details, make up my mind, and update the blog post. On a first look it seems that the GTX 1070 8GB will really be the way to go for most people; just a lot of bang for the buck. The NVIDIA Titan X seems to become obsolete for 95% of people (only vision researchers who need to squeeze out every last bit of RAM should use it), and the GTX 1080 Ti will be the way to go if you want fast compute.
Hi Tim,
Thank you for the great article and answering our questions.
NVIDIA just announced their new GTX 1080 Ti. I heard that it even outperforms the Titan X Pascal in gaming. I have not read anything about the performance of the GTX 1080 Ti in machine learning / deep learning yet.
I am building a PC at the moment and have some parts already. Since the Titan X was not available over the last few weeks, I could still get the GTX 1080 TI instead.
1.) What is better, the GTX 1080 Ti or the Titan X? If the difference is very small, I would choose the cheaper 1080 Ti and upgrade to Volta in a year or so. Is the only difference the 11 GB instead of 12 and a slightly faster clock, or are some features disabled that could cause problems with deep learning?
2.) Is half precision available on the GTX 1080 Ti and/or the Titan X? I thought that it is only available on the much more expensive Tesla cards, but after reading through the replies here I am not sure anymore. To be more precise, I only care about half precision (float16) if it brings a considerable speed improvement (on Tesla, roughly twice as fast compared to float32). If it is available but at the same speed as float32, I obviously do not need it.
Looking forward to your reply.
Thomas
Hi Tim,
one more question: How much of an issue will the 11GB of the GTX 1080 TI be compared to the 12GB on the Titan X? Does that mean that I cannot run many of the published models that were created by people on a 12GB GPU? Do people usually fill up all of the memory available by creating deep nets that just fit in their GPU memory?
Some of the very state-of-the-art models might not run on some of the datasets. But in general this is a non-issue. You will still be able to run the same models, but instead of, say, 1000 layers you will only have something like 900 layers. If you are not someone who does cutting-edge computer vision research, then you should be fine with the GTX 1080 Ti.
Alternatively, you can always run these models in 16-bit on most frameworks just fine. This works because most published models use 32-bit memory, so storing them in 16-bit halves the memory requirement. It requires a bit of extra work to convert the existing models to 16-bit (usually a few lines of code), but most models should run. I think you will be able to run more than 99% of the state-of-the-art models in deep learning, and about 90% of the state-of-the-art models in computer vision. So definitely go for a GTX 1080 Ti if you can wait that long.
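As a rough illustration of what those “few lines of code” can look like, here is a hedged sketch in PyTorch (not from the original post; other frameworks have similar casts, and it assumes a CUDA-capable GPU is available): the model’s parameters are converted to 16-bit storage and the inputs are cast to match.

import torch
import torchvision.models as models

# Convert a published 32-bit model to 16-bit storage (downloads pretrained weights).
model = models.resnet18(pretrained=True).half().cuda()

# Inputs must be float16 as well before the forward pass.
x = torch.randn(1, 3, 224, 224).half().cuda()
with torch.no_grad():
    out = model(x)
print(out.dtype)  # torch.float16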
You are welcome. I am always happy if my blog posts are useful!
1.) I would definitely go with the GTX 1080 Ti due to price/performance. The extra memory on the Titan X is only useful in very few cases. However, beware: it might take some time between the announcement, the release, and when the GTX 1080 Ti is finally delivered to your doorstep, so make sure you have that spare time. Also make sure you preorder it; when new GPUs are released their supply is usually sold out within a week or less. You do not want to wait until the next batch is produced.
2.) Half precision is implemented in the software layer, but not in the hardware layer for these cards. This means you can use 16-bit storage, but software libraries will instead upcast it to 24-bit to do the computation (which is equivalent to 32-bit computational speed). So you can benefit from the reduced memory size, but not yet from the increased computation speed of 16-bit compute. You only see that on the P100, which hardly anybody can afford, and you will probably only see it in consumer cards with the Volta series, which will be released next year.
Now that GTX1080Ti is out, would you recommend that over Titan X?
Considering the incoming GeForce 100 refresh, should I purchase an entry-level GTX 1060 6GB now, or will there be something more interesting in the near future?
The GTX 1060 will remain a solid choice. I would pick the GTX 1060 if I were you.
Thanks for the brilliant summary! I wish I had read this before purchasing the 1080; I would have bought the 1070 instead, as it seems a better value for the kind of NLP tasks I have at hand.
How come there are no mentions of tesla cards? You can buy them used on eBay for much cheaper than the more mainstream gaming cards. Is the only reason that it requires more hardware work since they are either too big for most regular PC boxes, or require fan assembly since they are often sold air-cooled, or are there some other costs I’m overlooking?
I also ask b/c in the literature I see more mention of GTX than teslas, but I don’t see a reason for the preference other than ease of installation and potential cost saving only at small scales.
Hey Tim, perfect article! Thanks a lot for such a thorough review. Do you know whether it’s possible to leverage the computational power of several GPUs for DNNs by running the cards in CrossFire mode? Wouldn’t this give GPU parallelism “for free”, with no need for complex software adjustments?
If you look at the price of used Tesla cards and their performance, then they are almost always worse than any GTX GPU. I never mentioned them because it is highly unlikely that you can find a used Tesla card which beats other cards on price/performance. The cheapest Tesla K20 on eBay (6 months of data) went for about $1000 and is equivalent to a GTX 680; the cheapest K10 went for $260, but the second cheapest was above $1000, and the K10 is slower than a GTX 680. A GTX 680 now goes for about $180. Other cards from the GTX 900 series are significantly faster than these Tesla cards and most often also cheaper.
Tesla cards from the Fermi generation are affordable, but I would not recommend them, because you cannot use standard deep learning software with them (and they are very slow because they are so dated). On top of that you have of course the cooling issues etc., so Tesla cards only make sense if you are exceptionally lucky and snatch one for a very low price.
AMD CrossFireX as well as NVIDIA SLI are built to exchange framebuffer information between two GPUs. It seems that these interfaces can only be used for that, mainly because they are too slow for ordinary parallel algorithms. This will change with the new NVLink interface NVIDIA is building, which will supersede the PCIe 3.0 interface for GPU computing. However, as of now, you cannot get around using the PCIe 3.0 interface for communication in multi-GPU applications.