
C++ benchmark – std::vector VS std::list VS std::deque

Last week, I wrote a benchmark comparing the performance of std::vector and std::list on different workloads. That article received a lot of comments and several suggestions for improvements. The present article is an improved version of that benchmark.

In this article, I will compare the performance of std::vector, std::list and std::deque on several different workloads and with different data types. Throughout the article, a list refers to std::list, a vector to std::vector and a deque to std::deque.

It is generally said that a list should be used when random inserts and removes are needed (they are performed in O(1), versus O(n) for a vector or a deque). If we look only at the complexity, linear search should scale equivalently in all these data structures, being O(n) in each. When a random insert or remove is performed on a vector or a deque, all the subsequent elements have to be moved, and so each of them is copied. That is why the size of the data type is an important factor when comparing these data structures: the cost of moving the elements depends directly on the cost of copying one element.

However, in practice, there is a huge difference: the usage of the memory caches. All the data in a vector is contiguous, whereas std::list allocates memory separately for each element. How does that change the results in practice? The deque is a data structure that aims at having the advantages of both without their drawbacks; we will see how it performs in practice. Complexity analysis does not take the memory hierarchy into account. I believe that, in practice, memory hierarchy usage is as important as complexity analysis.

Keep in mind that all the tests are performed on vector, list and deque, even if another data structure could be better suited to a given workload.

In the graphs and in the text, n is used to refer to the number of elements of the collection.

All the tests have been performed on an Intel Core i7 Q 820 @ 1.73GHz. The code has been compiled for 64 bits with GCC 4.7.2 with -O2 and -march=native, and with C++11 support (-std=c++11).

For each graph, the vertical axis represents the time necessary to perform the operations, so lower values are better. The horizontal axis is always the number of elements in the collection.

The data types vary in size: they hold an array of longs, and the size of the array is changed to control the size of the data type. The non-trivial data type is made of two longs and has a deliberately stupid assignment operator and copy constructor that just do some math (totally meaningless, but costly). One may argue that this is neither a common copy constructor nor a common assignment operator, and one would be right; however, the important point here is that the operators are costly, which is enough for this benchmark.
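A sketch of what such data types can look like (my own reconstruction from the description above, not the exact code of the benchmark):

```cpp
#include <cstddef>

// Trivial type whose size in bytes is controlled by a template parameter,
// e.g. Trivial<8>, Trivial<128>, Trivial<4096>.
// Assumes a 64-bit long, as on the GCC/64-bit setup described above.
template <std::size_t Size>
struct Trivial {
    // An array of longs; the first element acts as the value being stored.
    long data[Size / sizeof(long)];

    Trivial() { data[0] = 0; }
    explicit Trivial(long x) { data[0] = x; }
    bool operator==(const Trivial& rhs) const { return data[0] == rhs.data[0]; }
    bool operator<(const Trivial& rhs) const { return data[0] < rhs.data[0]; }
};

static_assert(sizeof(Trivial<8>) == 8, "size check");
static_assert(sizeof(Trivial<4096>) == 4096, "size check");

// Non-trivial 16-byte type: copy construction and assignment do some
// meaningless but costly arithmetic, so every copy of an element is expensive.
struct NonTrivial {
    long a = 0;
    long b = 0;

    NonTrivial() = default;
    explicit NonTrivial(long x) : a(x), b(x) {}
    NonTrivial(const NonTrivial& rhs) : a(rhs.a), b(rhs.b) {
        for (int i = 0; i < 100; ++i) { b = (b * 31 + i) % 1000003; } // deliberately costly
    }
    NonTrivial& operator=(const NonTrivial& rhs) {
        a = rhs.a; b = rhs.b;
        for (int i = 0; i < 100; ++i) { b = (b * 31 + i) % 1000003; } // deliberately costly
        return *this;
    }
    bool operator<(const NonTrivial& rhs) const { return a < rhs.a; }
};
```

The exact math inside the copy operations does not matter; only its cost does.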

Fill

The first test fills the data structures by adding elements to the back of the container (using push_back). Two variations of vector are used: vector_pre is a std::vector on which vector::reserve is called at the beginning, resulting in only one memory allocation.

Let's see the results with a very small data type:

[Graph: fill_back - 8 bytes. Time in us vs number of elements.]

n          list     vector   deque    vector_pre
100000     2,545    271      2,012    317
200000     4,927    552      998      334
300000     7,310    944      1,707    595
400000     9,463    936      2,056    1,099
500000     12,591   1,140    2,642    1,058
600000     14,351   1,894    3,125    1,237
700000     16,561   1,995    3,686    1,208
800000     18,820   2,648    4,291    1,365
900000     20,832   2,777    4,962    2,268
1000000    23,430   3,015    5,396    2,585

The pre-allocated vector is the fastest by a small margin over the vector. The deque is about twice as slow as the vector, and the list is several times slower than all the others.

If we consider a larger data type:

[Graph: fill_back - 4096 bytes. Time in us vs number of elements.]

n          list        vector    deque     vector_pre
100000     104,867     55,545    66,852    21,738
200000     226,215     108,289   136,035   42,532
300000     340,910     198,343   153,446   60,317
400000     445,035     217,325   269,316   80,616
500000     559,619     236,576   189,613   101,371
600000     688,422     391,354   303,729   122,447
700000     799,902     405,771   426,373   138,868
800000     921,441     415,707   537,057   160,637
900000     1,006,331   439,635   263,650   177,052
1000000    1,113,690   464,416   372,000   199,434

This time, vector and deque perform at about the same speed, and the pre-allocated vector is clearly the winner. The variations in the results of deque and vector probably come from my system, which doesn't like allocating and freeing so much memory back and forth at this speed.

Finally, if we use a non-trivial data type:

[Graph: fill_back - 16 bytes (non-trivial). Time in us vs number of elements.]

n          list     vector   deque    vector_pre
100000     8,093    8,123    10,251   8,095
200000     15,433   15,305   16,061   13,897
300000     25,964   24,643   24,450   19,954
400000     33,414   30,322   32,148   27,171
500000     40,416   37,817   40,752   35,058
600000     48,991   48,594   48,785   41,049
700000     55,059   55,124   55,092   47,609
800000     63,688   61,360   64,505   55,659
900000     70,550   67,636   72,329   60,952
1000000    79,271   73,533   79,522   67,787

All data structures are performing more or less the same, with vector_pre being the fastest.

For push_back operations, a pre-allocated vector is a very good choice if the size is known in advance. The others perform more or less the same.

I would have expected a better result for the pre-allocated vector. If someone finds an explanation for such a small margin, I'm interested.

Linear Search

The next operation that is tested is search. The container is filled with all the numbers in [0, N] and shuffled. Then, each number in [0, N] is searched for in the container with std::find, which performs a simple linear search. In theory, all the data structures should perform the same if we consider only their complexity.

[Graph: linear_search - 8 bytes. Time in us vs number of elements.]

n        deque    list      vector
1000     593      1,098     318
2000     2,927    5,307     1,271
3000     5,891    12,228    3,020
4000     8,663    24,415    5,081
5000     12,859   36,316    8,066
6000     18,493   55,057    11,463
7000     25,057   74,344    16,022
8000     38,980   99,990    21,051
9000     44,951   127,575   26,650
10000    52,281   158,216   32,557

It is clear from the graph that the list has very poor performance for searching. The growth is much worse for a list than for a vector or a deque.

The only reason is cache usage. When data is accessed, it is fetched from the main memory into the cache, and not only the accessed data: a whole cache line is fetched. As the elements in a vector are contiguous, when you access an element, the following elements are automatically in the cache. As main memory is orders of magnitude slower than the cache, this makes a huge difference. In the list case, the processor spends its whole time waiting for data to be fetched from memory into the cache, and each fetch brings in a lot of data that is almost always useless.

The deque is a bit slower than the vector; that is logical, because there are more cache misses due to its segmented structure.

If we take a bigger data type:

[Graph: linear_search - 128 bytes. Time in us vs number of elements.]

n        deque     list      vector
1000     1,116     2,683     776
2000     4,983     16,675    3,537
3000     12,255    44,379    10,874
4000     23,212    83,026    20,189
5000     37,392    133,353   33,609
6000     55,295    193,428   47,636
7000     74,877    261,314   63,911
8000     100,903   340,157   84,647
9000     126,299   435,816   107,922
10000    156,386   545,160   135,680

The list is still much slower than the others, but what is interesting is that the gap between the deque and the vector is decreasing. Let's try with a 4KB data type:

[Graph: linear_search - 4096 bytes. Time in us vs number of elements.]

n        deque     list        vector
1000     4,258     7,190       4,445
2000     20,584    38,411      19,825
3000     48,236    113,189     55,341
4000     87,475    223,174     118,453
5000     136,945   362,421     191,967
6000     197,856   530,943     281,252
7000     273,359   726,323     387,940
8000     351,223   954,463     511,276
9000     447,525   1,211,581   652,269
10000    551,556   1,497,916   807,161

The performance of the list is still poor, but the gap is decreasing. The interesting point is that the deque is now faster than the vector. I'm not really sure of the reason for this result; it is possible that it only holds for this particular size. One thing is sure: the bigger the data type, the more cache misses the processor will get, because fewer elements fit in a cache line.

For search, the list is clearly slow, while deque and vector have about the same performance. The deque seems faster than the vector for very large data types.

Random Insert (+Linear Search)

In the case of random insert, in theory, the list should be much faster, its insert operation being in O(1) versus O(n) for a vector or a deque.

The container is filled with all the numbers in [0, N] and shuffled. Then, 1000 random values are inserted at random positions in the container. The random position is found by linear search. In all cases, the complexity of the search is O(n); the only difference comes from the insert that follows the search. We saw before that the search performance of the list is poor, so we will see whether its fast insertion can compensate for its slow search.

[Graph: random_insert - 8 bytes. Time in ms vs number of elements.]

n         deque   list   vector
10000     8       27     8
20000     15      45     14
30000     22      63     21
40000     29      74     27
50000     37      87     38
60000     43      105    44
70000     50      114    48
80000     61      130    55
90000     66      139    61
100000    70      155    68

The list is clearly slower than the other two data structures, which exhibit the same performance. This comes from its very slow linear search. Even if the other two data structures have to move a lot of data, the copy is cheap for small data types.

Let's increase the size a bit:

[Graph: random_insert - 32 bytes. Time in ms vs number of elements.]

n         deque   list   vector
10000     21      53     25
20000     39      80     48
30000     57      103    68
40000     71      122    90
50000     88      146    112
60000     102     165    130
70000     124     190    152
80000     140     214    175
90000     157     238    195
100000    174     268    213

The results are interesting. The list is still the slowest, but by a smaller margin. This time, the deque is faster than the vector by a small margin.

Again, increasing the data size:

[Graph: random_insert - 128 bytes. Time in ms vs number of elements.]

n         deque   list    vector
10000     64      80      89
20000     108     128     154
30000     158     182     248
40000     212     248     347
50000     281     348     469
60000     402     443     735
70000     569     643     1,034
80000     767     775     1,347
90000     978     1,002   1,614
100000    1,190   1,202   1,962

This time, the vector is clearly the loser, and the deque and the list have the same performance. We can say that, with a size of 128 bytes, moving a lot of elements is more expensive than searching in the list.

A huge data type gives us clearer results:

[Graph: random_insert - 4096 bytes. Time in ms vs number of elements.]

n         deque    list    vector
10000     4,430    178     8,074
20000     7,918    311     14,121
30000     11,043   444     20,014
40000     13,806   555     26,783
50000     17,421   694     33,519
60000     20,663   904     39,175
70000     23,599   1,147   45,111
80000     26,736   1,470   50,887
90000     29,524   1,940   60,139
100000    32,005   2,534   65,098

The list is more than 20 times faster than the vector and an order of magnitude faster than the deque! The deque is also about twice as fast as the vector.

The reason why the deque is faster than the vector is quite simple. When an insertion is made in a deque, the elements can be moved either towards the end or towards the beginning, and the closer end is chosen. An insert in the middle is the most costly operation, with O(n/2) complexity. It is always more efficient to insert elements in a deque than in a vector, because at most half as many elements have to be moved.

If we look at the non-trivial data type:

[Graph: random_insert - 16 bytes (non-trivial). Time in ms vs number of elements.]

n         deque   list   vector
10000     230     41     425
20000     376     65     705
30000     552     84     1,054
40000     692     101    1,345
50000     862     119    1,661
60000     1,003   141    1,984
70000     1,186   155    2,277
80000     1,358   172    2,681
90000     1,540   186    2,965
100000    1,658   203    3,236

The results are about the same as in the previous graph, even though the data type is only 16 bytes. The cost of the copy constructor and assignment operator matters a lot for the vector and the deque. The list doesn't care, because no copy or assignment of the existing elements is made during an insertion (only the inserted element is copied).

Random Remove

In theory, random remove is the same case as random insert. Now that we've seen the results for random insert, we can expect the same behavior for random remove.

The container is filled with all the numbers in [0, N] and shuffled. Then, 1000 random values are removed from a random position in the container.

If we take the same data sizes as the random insert case:

[Graph: random_remove - 8 bytes. Time in ms vs number of elements.]

n         deque   list   vector
10000     6       19     5
20000     12      41     11
30000     20      55     18
40000     27      68     25
50000     34      81     33
60000     43      101    40
70000     49      113    45
80000     59      126    52
90000     67      138    61
100000    72      157    65

[Graph: random_remove - 128 bytes. Time in ms vs number of elements.]

n         deque   list    vector
10000     40      40      63
20000     85      83      134
30000     127     132     198
40000     181     189     282
50000     245     263     473
60000     363     376     664
70000     524     502     960
80000     743     688     1,343
90000     977     812     1,639
100000    1,228   1,017   2,004

[Graph: random_remove - 4096 bytes. Time in ms vs number of elements.]

n         deque    list    vector
10000     2,906    109     5,649
20000     6,190    233     11,760
30000     9,379    359     18,218
40000     12,840   490     23,634
50000     16,027   585     30,046
60000     18,918   773     36,100
70000     22,213   999     42,453
80000     25,788   1,317   48,793
90000     28,975   1,762   55,043
100000    30,860   2,128   59,791

[Graph: random_remove - 16 bytes (non-trivial). Time in ms vs number of elements.]

n         deque   list   vector
10000     149     27     294
20000     319     50     608
30000     481     68     934
40000     638     89     1,236
50000     794     108    1,547
60000     954     120    1,894
70000     1,101   144    2,185
80000     1,253   160    2,513
90000     1,399   177    2,812
100000    1,595   194    3,108

The behavior of random remove is the same as the behavior of random insert, for the same reasons. The results are not very interesting, so, let's get to the next workload.

Push Front

The next operation that we will compare is inserting elements at the front of the collection. This is the worst case for a vector, because after each insertion, all the previously inserted elements have to be moved and copied. For a list or a deque, it makes no difference compared to pushing to the back.

So let's see the results:

[Graph: fill_front - 8 bytes. Time in ms vs number of elements.]

n         deque   list   vector
10000     0       0      33
20000     0       0      135
30000     0       0      313
40000     0       0      585
50000     0       1      913
60000     0       1      1,327
70000     0       1      1,823
80000     0       1      2,405
90000     0       2      3,107
100000    0       2      4,017

The results are crystal clear and as expected: the vector is very bad at inserting elements at the front. The list and deque results are almost invisible in the graph, because this is practically a free operation for these two data structures. This does not need further explanation. There is no need to change the data size; it would only make the vector much slower and my processor hotter.

Sort

The next test measures the time necessary to sort the data structures. For the vector and the deque, std::sort is used; for the list, the sort member function is used.

[Graph: sort - 8 bytes. Time in ms vs number of elements.]

n          deque   list   vector
100000     9       25     6
200000     19      61     14
300000     29      115    22
400000     40      175    30
500000     50      233    39
600000     60      321    48
700000     71      378    57
800000     85      457    66
900000     95      517    74
1000000    108     593    83

For a small data type, the list is several times slower than the other two data structures. This is again due to the very poor spatial locality of the list during iteration. The vector is slightly faster than the deque, but the difference is not very significant.

If we increase the size:

[Graph: sort - 128 bytes. Time in ms vs number of elements.]

n          deque   list   vector
100000     25      32     20
200000     65      80     48
300000     103     143    80
400000     136     197    113
500000     180     246    149
600000     223     340    181
700000     274     396    222
800000     302     469    266
900000     358     514    303
1000000    395     579    337

The order remains the same, but the difference between the list and the others is decreasing.

With a 1KB data type:

[Graph: sort - 1024 bytes. Time in ms vs number of elements.]

n          deque   list   vector
100000     176     39     168
200000     389     94     376
300000     620     168    595
400000     859     228    823
500000     1,100   285    1,059
600000     1,355   392    1,301
700000     1,609   452    1,555
800000     1,844   539    1,797
900000     2,111   597    2,054
1000000    2,397   670    2,278

The list is more than three times faster than the vector and the deque, which both perform about the same (with a very slight advantage for the vector).

If we use the non-trivial data type:

[Graph: sort - 16 bytes (non-trivial). Time in ms vs number of elements.]

n          deque   list   vector
100000     92      26     89
200000     195     70     188
300000     301     135    296
400000     410     195    399
500000     519     255    510
600000     638     350    623
700000     763     410    729
800000     858     492    846
900000     971     552    954
1000000    1,090   628    1,072

Again, the cost of the copy operators of this type has a strong impact on the vector and the deque.

Destruction

The next test measures the time necessary for the destruction of a container. The containers are dynamically allocated, filled with n numbers, and then their destruction time (via delete) is measured.

[Graph: destruction - 8 bytes. Time in us vs number of elements.]

n          deque   list     vector
100000     34      1,489    0
200000     70      2,838    0
300000     102     4,677    0
400000     142     6,072    0
500000     173     7,737    0
600000     215     8,828    0
700000     353     10,599   1
800000     321     12,115   0
900000     355     13,932   1
1000000    410     15,345   0

The results are already interesting. The vector is almost free to destroy: destruction only incurs freeing a single array and the vector object itself. The deque is slower due to freeing each of its segments. The list is much more costly than the other two, more than an order of magnitude slower. This is expected, because the list has to free the dynamic memory of each node, which requires iterating through all the elements, and we already saw that this iteration is slow.

If we increase the data type:

[Graph: destruction - 128 bytes. Time in us vs number of elements.]

n          deque    list     vector
100000     898      4,403    1
200000     2,488    8,499    1
300000     4,091    12,499   1,430
400000     5,461    16,379   1,909
500000     6,729    21,128   2,459
600000     8,164    25,719   2,729
700000     9,517    31,046   3,227
800000     10,871   34,550   3,756
900000     12,392   37,176   4,163
1000000    13,762   40,119   4,523

This time, we can see that the deque is three times slower than the vector and that the list is still an order of magnitude slower than the vector! However, there is less difference than before.

With our biggest data type, now:

[Graph: destruction - 4096 bytes. Time in us vs number of elements.]

n          deque     list      vector
100000     20,575    22,434    15,499
200000     44,234    47,254    29,848
300000     67,196    69,374    39,818
400000     89,253    91,128    54,229
500000     108,689   112,557   68,090
600000     131,751   135,764   75,063
700000     150,801   155,610   90,761
800000     172,365   176,957   102,830
900000     192,575   193,897   112,728
1000000    211,507   215,274   126,348

There is no longer a difference between the list and the deque. The vector is still about twice as fast as both of them.

Even if the vector is always faster to destroy than the list and the deque, keep in mind that the destruction graphs are in microseconds, so these operations are not very costly. It could make a difference in a very time-sensitive application, but unlikely in most applications. Moreover, destruction happens only once per data structure, so it is generally not a very important operation.

Number Crunching

Finally, we can also test a number crunching workload. Here, random elements are inserted into a container that is kept sorted: the position where each element has to be inserted is first found by iterating through the elements, and then the element is inserted there. As we are talking about number crunching, only 8-byte elements are tested.

[Graph: number_crunching - 8 bytes. Time in ms vs number of elements.]

n         deque   list     vector
10000     39      187      33
20000     150     1,247    134
30000     339     3,380    310
40000     623     6,513    547
50000     958     10,757   864
60000     1,394   16,098   1,257
70000     1,894   22,623   1,713
80000     2,479   30,656   2,249
90000     3,162   39,451   2,858
100000    3,932   49,906   3,576

Even with only 100'000 elements, the list is already an order of magnitude slower than the other two data structures. Looking at the curves of the results, it is easy to see that this will only get worse with larger collections. The list is absolutely not suited for number crunching workloads, due to its poor spatial locality.

Conclusion

To conclude, we can get some facts about each data structure:

  • std::list is very slow to iterate through due to its very poor spatial locality.
  • std::vector and std::deque always perform faster than std::list for very small data types.
  • std::list handles large elements very well.
  • std::deque performs better than std::vector for inserting at random positions (especially at the front, which is constant time).
  • std::deque and std::vector do not handle data types with costly copy/assignment operators very well.

From this, we can draw some simple conclusions about the usage of each data structure:

  • Number crunching: use std::vector or std::deque
  • Linear search: use std::vector or std::deque
  • Random Insert/Remove:
    • Small data size: use std::vector
    • Large element size: use std::list (unless it is intended mainly for searching)
  • Non-trivial data type: use std::list, unless you need the container mainly for searching. But for many modifications of the container, a vector or a deque will be very slow.
  • Push to front: use std::deque or std::list

I have to say that before writing this new version of the benchmark, I did not know std::deque well. It is a very good data structure: it is very good at inserting at both ends, and even in the middle, while exposing good spatial locality. Even if it is sometimes slower than a vector, when the operations involve both searching and inserting in the middle, I would say that this structure should be preferred over a vector, especially for data types of medium size.

If you have the time, the best way to decide in practice is always to benchmark each version, or even to try other data structures. Two operations with the same Big O complexity can perform quite differently in practice.

I hope that you found this article interesting. If you have any comment or an idea about another workload that you would like to see tested, don't hesitate to post a comment ;) If you have a question about the results, don't hesitate either.

The source code of the benchmark is available online: https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp

The older version of the article is still available: C++ benchmark – std::vector VS std::list
