
C++ benchmark – std::vector VS std::list VS std::deque

Last week, I wrote a benchmark comparing the performance of std::vector and std::list on different workloads. That article received a lot of comments and several suggestions for improvements. The present article is an improved version of that benchmark.

In this article, I will compare the performance of std::vector, std::list and std::deque on several different workloads and with different data types. Throughout the article, a list refers to std::list, a vector to std::vector and a deque to std::deque.

It is generally said that a list should be used when random inserts and removes are needed (they are performed in O(1), versus O(n) for a vector or a deque). If we look only at the complexity, linear search should scale equivalently in all these data structures, being O(n) in each. When a random insert or remove is performed on a vector or a deque, all the subsequent elements have to be moved, and so each of them is copied. That is why the size of the data type is an important factor when comparing these data structures: the cost of moving the elements depends directly on the cost of copying one element.

However, in practice, there is a huge difference: the usage of the memory caches. All the data in a vector is contiguous, whereas std::list allocates memory separately for each element. How does that change the results in practice? The deque is a data structure that aims at having the advantages of both without their drawbacks; we will see how it performs in practice. Complexity analysis does not take the memory hierarchy into account. I believe that, in practice, memory hierarchy usage is as important as complexity analysis.

Keep in mind that all the tests are performed on vector, list and deque, even if another data structure could be better suited to a given workload.

In the graphs and in the text, n is used to refer to the number of elements of the collection.

All the tests have been performed on an Intel Core i7 Q 820 @ 1.73GHz. The code has been compiled for 64 bits with GCC 4.7.2 with -O2 and -march=native, and with C++11 support (-std=c++11).

For each graph, the vertical axis represents the time necessary to perform the operations, so lower values are better. The horizontal axis is always the number of elements in the collection.

The data types vary in size: they hold an array of longs, and the size of the array is changed to control the size of the data type. The non-trivial data type is made of two longs and has a deliberately stupid assignment operator and copy constructor that just do some math (totally meaningless, but costly). One may argue that this is neither a common copy constructor nor a common assignment operator, and one would be right; however, the important point here is that the operators are costly, which is enough for this benchmark.
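A sketch of what such data types can look like (my own reconstruction from the description above, not the exact code of the benchmark):

```cpp
#include <cstddef>

// Trivial type whose size in bytes is controlled by a template parameter,
// e.g. Trivial<8>, Trivial<128>, Trivial<4096>.
// Assumes a 64-bit long, as on the GCC/64-bit setup described above.
template <std::size_t Size>
struct Trivial {
    // An array of longs; the first element acts as the value being stored.
    long data[Size / sizeof(long)];

    Trivial() { data[0] = 0; }
    explicit Trivial(long x) { data[0] = x; }
    bool operator==(const Trivial& rhs) const { return data[0] == rhs.data[0]; }
    bool operator<(const Trivial& rhs) const { return data[0] < rhs.data[0]; }
};

static_assert(sizeof(Trivial<8>) == 8, "size check");
static_assert(sizeof(Trivial<4096>) == 4096, "size check");

// Non-trivial 16-byte type: copy construction and assignment do some
// meaningless but costly arithmetic, so every copy of an element is expensive.
struct NonTrivial {
    long a = 0;
    long b = 0;

    NonTrivial() = default;
    explicit NonTrivial(long x) : a(x), b(x) {}
    NonTrivial(const NonTrivial& rhs) : a(rhs.a), b(rhs.b) {
        for (int i = 0; i < 100; ++i) { b = (b * 31 + i) % 1000003; } // deliberately costly
    }
    NonTrivial& operator=(const NonTrivial& rhs) {
        a = rhs.a; b = rhs.b;
        for (int i = 0; i < 100; ++i) { b = (b * 31 + i) % 1000003; } // deliberately costly
        return *this;
    }
    bool operator<(const NonTrivial& rhs) const { return a < rhs.a; }
};
```

The exact math inside the copy operations does not matter; only its cost does.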

Fill

The first test fills the data structures by adding elements to the back of the container (using push_back). Two variations of vector are used: vector_pre is a std::vector on which vector::reserve is called at the beginning, resulting in only one memory allocation.

Let's see the results with a very small data type:

[Graph: fill_back - 8 bytes. Time in us vs number of elements.]

n          list     vector   deque    vector_pre
100000     2,545    271      2,012    317
200000     4,927    552      998      334
300000     7,310    944      1,707    595
400000     9,463    936      2,056    1,099
500000     12,591   1,140    2,642    1,058
600000     14,351   1,894    3,125    1,237
700000     16,561   1,995    3,686    1,208
800000     18,820   2,648    4,291    1,365
900000     20,832   2,777    4,962    2,268
1000000    23,430   3,015    5,396    2,585

The pre-allocated vector is the fastest by a small margin over the vector. The deque is about twice as slow as the vector, and the list is several times slower than all the others.

If we consider a larger data type:

[Graph: fill_back - 4096 bytes. Time in us vs number of elements.]

n          list        vector    deque     vector_pre
100000     104,867     55,545    66,852    21,738
200000     226,215     108,289   136,035   42,532
300000     340,910     198,343   153,446   60,317
400000     445,035     217,325   269,316   80,616
500000     559,619     236,576   189,613   101,371
600000     688,422     391,354   303,729   122,447
700000     799,902     405,771   426,373   138,868
800000     921,441     415,707   537,057   160,637
900000     1,006,331   439,635   263,650   177,052
1000000    1,113,690   464,416   372,000   199,434

This time, vector and deque perform at about the same speed, and the pre-allocated vector is clearly the winner. The variations in the results of deque and vector probably come from my system, which doesn't like allocating and freeing so much memory back and forth at this speed.

Finally, if we use a non-trivial data type:

[Graph: fill_back - 16 bytes (non-trivial). Time in us vs number of elements.]

n          list     vector   deque    vector_pre
100000     8,093    8,123    10,251   8,095
200000     15,433   15,305   16,061   13,897
300000     25,964   24,643   24,450   19,954
400000     33,414   30,322   32,148   27,171
500000     40,416   37,817   40,752   35,058
600000     48,991   48,594   48,785   41,049
700000     55,059   55,124   55,092   47,609
800000     63,688   61,360   64,505   55,659
900000     70,550   67,636   72,329   60,952
1000000    79,271   73,533   79,522   67,787

All data structures are performing more or less the same, with vector_pre being the fastest.

For push_back operations, a pre-allocated vector is a very good choice if the size is known in advance. The others perform more or less the same.

I would have expected a better result for the pre-allocated vector. If someone finds an explanation for such a small margin, I'm interested.

Linear Search

The next operation that is tested is search. The container is filled with all the numbers in [0, N] and shuffled. Then, each number in [0, N] is searched for in the container with std::find, which performs a simple linear search. In theory, all the data structures should perform the same if we consider only their complexity.

[Graph: linear_search - 8 bytes. Time in us vs number of elements.]

n        deque    list      vector
1000     593      1,098     318
2000     2,927    5,307     1,271
3000     5,891    12,228    3,020
4000     8,663    24,415    5,081
5000     12,859   36,316    8,066
6000     18,493   55,057    11,463
7000     25,057   74,344    16,022
8000     38,980   99,990    21,051
9000     44,951   127,575   26,650
10000    52,281   158,216   32,557

It is clear from the graph that the list has very poor performance for searching. The growth is much worse for a list than for a vector or a deque.

The only reason is cache usage. When data is accessed, it is fetched from the main memory into the cache, and not only the accessed data: a whole cache line is fetched. As the elements in a vector are contiguous, when you access an element, the following elements are automatically in the cache. As main memory is orders of magnitude slower than the cache, this makes a huge difference. In the list case, the processor spends its whole time waiting for data to be fetched from memory into the cache, and each fetch brings in a lot of data that is almost always useless.

The deque is a bit slower than the vector; that is logical, because there are more cache misses due to its segmented structure.

If we take a bigger data type:

[Graph: linear_search - 128 bytes. Time in us vs number of elements.]

n        deque     list      vector
1000     1,116     2,683     776
2000     4,983     16,675    3,537
3000     12,255    44,379    10,874
4000     23,212    83,026    20,189
5000     37,392    133,353   33,609
6000     55,295    193,428   47,636
7000     74,877    261,314   63,911
8000     100,903   340,157   84,647
9000     126,299   435,816   107,922
10000    156,386   545,160   135,680

The list is still much slower than the others, but what is interesting is that the gap between the deque and the vector is decreasing. Let's try with a 4KB data type:

[Graph: linear_search - 4096 bytes. Time in us vs number of elements.]

n        deque     list        vector
1000     4,258     7,190       4,445
2000     20,584    38,411      19,825
3000     48,236    113,189     55,341
4000     87,475    223,174     118,453
5000     136,945   362,421     191,967
6000     197,856   530,943     281,252
7000     273,359   726,323     387,940
8000     351,223   954,463     511,276
9000     447,525   1,211,581   652,269
10000    551,556   1,497,916   807,161

The performance of the list is still poor, but the gap is decreasing. The interesting point is that the deque is now faster than the vector. I'm not really sure of the reason for this result; it is possible that it only holds for this particular size. One thing is sure: the bigger the data type, the more cache misses the processor will get, because fewer elements fit in a cache line.

For search, the list is clearly slow, while deque and vector have about the same performance. The deque seems faster than the vector for very large data types.

Random Insert (+Linear Search)

In the case of random insert, in theory, the list should be much faster, its insert operation being in O(1) versus O(n) for a vector or a deque.

The container is filled with all the numbers in [0, N] and shuffled. Then, 1000 random values are inserted at random positions in the container. The random position is found by linear search. In all cases, the complexity of the search is O(n); the only difference comes from the insert that follows the search. We saw before that the search performance of the list is poor, so we will see whether its fast insertion can compensate for its slow search.

[Graph: random_insert - 8 bytes. Time in ms vs number of elements.]

n         deque   list   vector
10000     8       27     8
20000     15      45     14
30000     22      63     21
40000     29      74     27
50000     37      87     38
60000     43      105    44
70000     50      114    48
80000     61      130    55
90000     66      139    61
100000    70      155    68

The list is clearly slower than the other two data structures, which exhibit the same performance. This comes from its very slow linear search. Even if the other two data structures have to move a lot of data, the copy is cheap for small data types.

Let's increase the size a bit:

[Graph: random_insert - 32 bytes. Time in ms vs number of elements.]

n         deque   list   vector
10000     21      53     25
20000     39      80     48
30000     57      103    68
40000     71      122    90
50000     88      146    112
60000     102     165    130
70000     124     190    152
80000     140     214    175
90000     157     238    195
100000    174     268    213

The results are interesting. The list is still the slowest, but by a smaller margin. This time, the deque is faster than the vector by a small margin.

Again, increasing the data size:

[Graph: random_insert - 128 bytes. Time in ms vs number of elements.]

n         deque   list    vector
10000     64      80      89
20000     108     128     154
30000     158     182     248
40000     212     248     347
50000     281     348     469
60000     402     443     735
70000     569     643     1,034
80000     767     775     1,347
90000     978     1,002   1,614
100000    1,190   1,202   1,962

This time, the vector is clearly the loser, and the deque and the list have the same performance. We can say that, with a size of 128 bytes, moving a lot of elements is more expensive than searching in the list.

A huge data type gives us clearer results:

[Graph: random_insert - 4096 bytes. Time in ms vs number of elements.]

n         deque    list    vector
10000     4,430    178     8,074
20000     7,918    311     14,121
30000     11,043   444     20,014
40000     13,806   555     26,783
50000     17,421   694     33,519
60000     20,663   904     39,175
70000     23,599   1,147   45,111
80000     26,736   1,470   50,887
90000     29,524   1,940   60,139
100000    32,005   2,534   65,098

The list is more than 20 times faster than the vector and an order of magnitude faster than the deque! The deque is also about twice as fast as the vector.

The reason why the deque is faster than the vector is quite simple. When an insertion is made in a deque, the elements can be moved either towards the end or towards the beginning, and the closer end is chosen. An insert in the middle is the most costly operation, with O(n/2) complexity. It is always more efficient to insert elements in a deque than in a vector, because at most half as many elements have to be moved.

If we look at the non-trivial data type:

[Graph: random_insert - 16 bytes (non-trivial). Time in ms vs number of elements.]

n         deque   list   vector
10000     230     41     425
20000     376     65     705
30000     552     84     1,054
40000     692     101    1,345
50000     862     119    1,661
60000     1,003   141    1,984
70000     1,186   155    2,277
80000     1,358   172    2,681
90000     1,540   186    2,965
100000    1,658   203    3,236

The results are about the same as in the previous graph, even though the data type is only 16 bytes. The cost of the copy constructor and assignment operator matters a lot for the vector and the deque. The list doesn't care, because no copy or assignment of the existing elements is made during an insertion (only the inserted element is copied).

Random Remove

In theory, random remove is the same case as random insert. Now that we've seen the results for random insert, we can expect the same behavior for random remove.

The container is filled with all the numbers in [0, N] and shuffled. Then, 1000 random values are removed from a random position in the container.

If we take the same data sizes as the random insert case:

[Graph: random_remove - 8 bytes. Time in ms vs number of elements.]

n         deque   list   vector
10000     6       19     5
20000     12      41     11
30000     20      55     18
40000     27      68     25
50000     34      81     33
60000     43      101    40
70000     49      113    45
80000     59      126    52
90000     67      138    61
100000    72      157    65

[Graph: random_remove - 128 bytes. Time in ms vs number of elements.]

n         deque   list    vector
10000     40      40      63
20000     85      83      134
30000     127     132     198
40000     181     189     282
50000     245     263     473
60000     363     376     664
70000     524     502     960
80000     743     688     1,343
90000     977     812     1,639
100000    1,228   1,017   2,004

[Graph: random_remove - 4096 bytes. Time in ms vs number of elements.]

n         deque    list    vector
10000     2,906    109     5,649
20000     6,190    233     11,760
30000     9,379    359     18,218
40000     12,840   490     23,634
50000     16,027   585     30,046
60000     18,918   773     36,100
70000     22,213   999     42,453
80000     25,788   1,317   48,793
90000     28,975   1,762   55,043
100000    30,860   2,128   59,791

[Graph: random_remove - 16 bytes (non-trivial). Time in ms vs number of elements.]

n         deque   list   vector
10000     149     27     294
20000     319     50     608
30000     481     68     934
40000     638     89     1,236
50000     794     108    1,547
60000     954     120    1,894
70000     1,101   144    2,185
80000     1,253   160    2,513
90000     1,399   177    2,812
100000    1,595   194    3,108

The behavior of random remove is the same as the behavior of random insert, for the same reasons. The results are not very interesting, so, let's get to the next workload.

Push Front

The next operation that we will compare is inserting elements at the front of the collection. This is the worst case for a vector, because after each insertion, all the previously inserted elements have to be moved and copied. For a list or a deque, it makes no difference compared to pushing to the back.

So let's see the results:

[Graph: fill_front - 8 bytes. Time in ms vs number of elements.]

n         deque   list   vector
10000     0       0      33
20000     0       0      135
30000     0       0      313
40000     0       0      585
50000     0       1      913
60000     0       1      1,327
70000     0       1      1,823
80000     0       1      2,405
90000     0       2      3,107
100000    0       2      4,017

The results are crystal clear and as expected: the vector is very bad at inserting elements at the front. The list and deque results are almost invisible in the graph, because this is practically a free operation for these two data structures. This does not need further explanation. There is no need to change the data size; it would only make the vector much slower and my processor hotter.

Sort

The next test measures the time necessary to sort the data structures. For the vector and the deque, std::sort is used; for the list, the sort member function is used.

[Graph: sort - 8 bytes. Time in ms vs number of elements.]

n          deque   list   vector
100000     9       25     6
200000     19      61     14
300000     29      115    22
400000     40      175    30
500000     50      233    39
600000     60      321    48
700000     71      378    57
800000     85      457    66
900000     95      517    74
1000000    108     593    83

For a small data type, the list is several times slower than the other two data structures. This is again due to the very poor spatial locality of the list during iteration. The vector is slightly faster than the deque, but the difference is not very significant.

If we increase the size:

[Graph: sort - 128 bytes. Time in ms vs number of elements.]

n          deque   list   vector
100000     25      32     20
200000     65      80     48
300000     103     143    80
400000     136     197    113
500000     180     246    149
600000     223     340    181
700000     274     396    222
800000     302     469    266
900000     358     514    303
1000000    395     579    337

The order remains the same, but the difference between the list and the others is decreasing.

With a 1KB data type:

[Graph: sort - 1024 bytes. Time in ms vs number of elements.]

n          deque   list   vector
100000     176     39     168
200000     389     94     376
300000     620     168    595
400000     859     228    823
500000     1,100   285    1,059
600000     1,355   392    1,301
700000     1,609   452    1,555
800000     1,844   539    1,797
900000     2,111   597    2,054
1000000    2,397   670    2,278

The list is more than three times faster than the vector and the deque, which both perform about the same (with a very slight advantage for the vector).

If we use the non-trivial data type:

[Graph: sort - 16 bytes (non-trivial). Time in ms vs number of elements.]

n          deque   list   vector
100000     92      26     89
200000     195     70     188
300000     301     135    296
400000     410     195    399
500000     519     255    510
600000     638     350    623
700000     763     410    729
800000     858     492    846
900000     971     552    954
1000000    1,090   628    1,072

Again, the cost of the copy operators of this type has a strong impact on the vector and the deque.

Destruction

The next test measures the time necessary for the destruction of a container. The containers are dynamically allocated, filled with n numbers, and then their destruction time (via delete) is measured.

[Graph: destruction - 8 bytes. Time in us vs number of elements.]

n          deque   list     vector
100000     34      1,489    0
200000     70      2,838    0
300000     102     4,677    0
400000     142     6,072    0
500000     173     7,737    0
600000     215     8,828    0
700000     353     10,599   1
800000     321     12,115   0
900000     355     13,932   1
1000000    410     15,345   0

The results are already interesting. The vector is almost free to destroy: destruction only incurs freeing a single array and the vector object itself. The deque is slower due to freeing each of its segments. The list is much more costly than the other two, more than an order of magnitude slower. This is expected, because the list has to free the dynamic memory of each node, which requires iterating through all the elements, and we already saw that this iteration is slow.

If we increase the data type:

[Graph: destruction - 128 bytes. Time in us vs number of elements.]

n          deque    list     vector
100000     898      4,403    1
200000     2,488    8,499    1
300000     4,091    12,499   1,430
400000     5,461    16,379   1,909
500000     6,729    21,128   2,459
600000     8,164    25,719   2,729
700000     9,517    31,046   3,227
800000     10,871   34,550   3,756
900000     12,392   37,176   4,163
1000000    13,762   40,119   4,523

This time, we can see that the deque is three times slower than the vector and that the list is still an order of magnitude slower than the vector! However, there is less difference than before.

With our biggest data type, now:

[Graph: destruction - 4096 bytes. Time in us vs number of elements.]

n          deque     list      vector
100000     20,575    22,434    15,499
200000     44,234    47,254    29,848
300000     67,196    69,374    39,818
400000     89,253    91,128    54,229
500000     108,689   112,557   68,090
600000     131,751   135,764   75,063
700000     150,801   155,610   90,761
800000     172,365   176,957   102,830
900000     192,575   193,897   112,728
1000000    211,507   215,274   126,348

There is no longer a difference between the list and the deque. The vector is still about twice as fast as both of them.

Even if the vector is always faster to destroy than the list and the deque, keep in mind that the destruction graphs are in microseconds, so these operations are not very costly. It could make a difference in a very time-sensitive application, but unlikely in most applications. Moreover, destruction happens only once per data structure, so it is generally not a very important operation.

Number Crunching

Finally, we can also test a number crunching workload. Here, random elements are inserted into a container that is kept sorted: the position where each element has to be inserted is first found by iterating through the elements, and then the element is inserted there. As we are talking about number crunching, only 8-byte elements are tested.

[Graph: number_crunching - 8 bytes. Time in ms vs number of elements.]

n         deque   list     vector
10000     39      187      33
20000     150     1,247    134
30000     339     3,380    310
40000     623     6,513    547
50000     958     10,757   864
60000     1,394   16,098   1,257
70000     1,894   22,623   1,713
80000     2,479   30,656   2,249
90000     3,162   39,451   2,858
100000    3,932   49,906   3,576

Even with only 100'000 elements, the list is already an order of magnitude slower than the other two data structures. Looking at the curves of the results, it is easy to see that this will only get worse with larger collections. The list is absolutely not suited for number crunching workloads, due to its poor spatial locality.

Conclusion

To conclude, we can get some facts about each data structure:

  • std::list is very slow to iterate through due to its very poor spatial locality.
  • std::vector and std::deque always perform faster than std::list for very small data types.
  • std::list handles large elements very well.
  • std::deque performs better than std::vector for inserting at random positions (especially at the front, which is constant time).
  • std::deque and std::vector do not handle data types with costly copy/assignment operators very well.

From this, we can draw some simple conclusions about the usage of each data structure:

  • Number crunching: use std::vector or std::deque
  • Linear search: use std::vector or std::deque
  • Random Insert/Remove:
    • Small data size: use std::vector
    • Large element size: use std::list (unless it is intended mainly for searching)
  • Non-trivial data type: use std::list, unless you need the container mainly for searching. But for many modifications of the container, a vector or a deque will be very slow.
  • Push to front: use std::deque or std::list

I have to say that before writing this new version of the benchmark, I did not know std::deque well. It is a very good data structure: it is very good at inserting at both ends, and even in the middle, while exposing good spatial locality. Even if it is sometimes slower than a vector, when the operations involve both searching and inserting in the middle, I would say that this structure should be preferred over a vector, especially for data types of medium size.

If you have the time, the best way to decide in practice is always to benchmark each version, or even to try other data structures. Two operations with the same Big O complexity can perform quite differently in practice.

I hope that you found this article interesting. If you have any comment or an idea about another workload that you would like to see tested, don't hesitate to post a comment ;) If you have a question about the results, don't hesitate either.

The source code of the benchmark is available online: https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp

The older version of the article is still available: C++ benchmark – std::vector VS std::list
