C++, gcc 3.3.2, Red Hat Linux 7.2, kernel 2.4.20, dual AMD Athlon MP 2200+.

When two threads are simultaneously adding elements to local standard container objects (including map, set and ext/hash_map), I observe a performance degradation by a factor of 6x when the program is running on two CPUs compared to running on a single CPU (whereas you might expect an improvement of 2x). This seems to be a problem with the default allocator, since running the program with GLIBCPP_FORCE_NEW set removes the problem and dramatically speeds up the program on two CPUs (whereas you might expect a slight degradation).

The example program creates a number of threads, each of which runs an infinite loop, repeatedly constructing a std::map<int,int> on the stack, inserting a number of elements, then letting it go out of scope. It takes a single parameter specifying how many threads to create; if unspecified it defaults to two threads.

When the program is run with two threads on my dual-CPU machine, I observe both processors staying about 60% idle, with each at 15% user and 25% system load. Why such a high system load? When I force the program onto a single CPU (by running another compute-bound app), the two threads each consume a full 50% of the CPU with only minimal system load, and the program runs 6x faster.

By the way, I observed the same problem with gcc 3.2.1, and on another machine running a different version of Linux.

Following is the output of the g++ -v -save-temps command. I will attach the compressed preprocessed .ii file after this bug is submitted (since I can't see how to do it on this page). Further below is the unpreprocessed source file, in case that is useful too.

thanks,
Dan Evison
devison at pacificit dot co dot nz

---------------------------------------------------------------------
$ /usr/local/gcc-3.3.2/bin/g++ -v -save-temps -pthread -D _PTHREADS -D _REENTRANT maptest.cpp
Reading specs from /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/specs
Configured with: ../gcc-3.3.2/configure --enable-shared --with-gnu-as --with-gnu-ld --enable-threads=posix --enable-languages=c++,java --disable-nls --prefix=/usr/local/gcc-3.3.2
Thread model: posix
gcc version 3.3.2
 /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/cc1plus -E -D__GNUG__=3 -quiet -v -D__GNUC__=3 -D__GNUC_MINOR__=3 -D__GNUC_PATCHLEVEL__=2 -D_GNU_SOURCE -D_REENTRANT -D _PTHREADS -D _REENTRANT maptest.cpp maptest.ii
ignoring nonexistent directory "/usr/local/gcc-3.3.2/i686-pc-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/gcc-3.3.2/include/c++/3.3.2
 /usr/local/gcc-3.3.2/include/c++/3.3.2/i686-pc-linux-gnu
 /usr/local/gcc-3.3.2/include/c++/3.3.2/backward
 /usr/local/include
 /usr/local/gcc-3.3.2/include
 /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/include
 /usr/include
End of search list.
 /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/cc1plus -fpreprocessed maptest.ii -quiet -dumpbase maptest.cpp -auxbase maptest -version -o maptest.s
GNU C++ version 3.3.2 (i686-pc-linux-gnu)
        compiled by GNU C version 3.3.2.
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
 as -V -Qy -o maptest.o maptest.s
GNU assembler version 2.11.93.0.2 (i386-redhat-linux) using BFD version 2.11.93.0.2 20020207
 /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/collect2 --eh-frame-hdr -m elf_i386 -dynamic-linker /lib/ld-linux.so.2 /usr/lib/crt1.o /usr/lib/crti.o /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/crtbegin.o -L/usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2 -L/usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/../../.. maptest.o -lstdc++ -lm -lgcc_s -lgcc -lpthread -lc -lgcc_s -lgcc /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/crtend.o /usr/lib/crtn.o
--------------------------------------------------------------------
#include <iostream>
#include <map>
#include <pthread.h>

void *calc(void *n)
{
  int *num = static_cast<int *>(n);
  unsigned c = 0;

  while (true)
    {
      std::cerr << "Thread " << *num << ": " << ++c << std::endl;

      std::map<int, int> m;

      for (unsigned i = 0; i < 100000; ++i)
        m[i] = i;
    }

  return 0;
}

int main(int argc, char *argv[])
{
  int n = argc == 2 ? std::max(atoi(argv[1]), 1) : 2;

  pthread_t id;

  for (int i = 0; i < n; ++i)
    pthread_create(&id, 0, calc, new int(i));

  pthread_join(id, 0);

  return 0;
}
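A minimal sketch (not part of the original report) of what the GLIBCPP_FORCE_NEW observation amounts to: give the container an allocator that forwards every request straight to operator new/delete, so the shared pool and its lock never enter the picture. The allocator name new_alloc below is made up for illustration; libstdc++ 3.4 ships a comparable <ext/new_allocator.h>.

#include <cstddef>
#include <map>
#include <new>
#include <utility>

// A trivial C++98 allocator: no pooling, no locking, just new/delete.
template <typename T>
struct new_alloc
{
    typedef T              value_type;
    typedef T*             pointer;
    typedef const T*       const_pointer;
    typedef T&             reference;
    typedef const T&       const_reference;
    typedef std::size_t    size_type;
    typedef std::ptrdiff_t difference_type;

    template <typename U> struct rebind { typedef new_alloc<U> other; };

    new_alloc() throw() { }
    new_alloc(const new_alloc&) throw() { }
    template <typename U> new_alloc(const new_alloc<U>&) throw() { }

    pointer address(reference x) const { return &x; }
    const_pointer address(const_reference x) const { return &x; }

    // Every container node comes straight from the global heap.
    pointer allocate(size_type n, const void* = 0)
    { return static_cast<pointer>(::operator new(n * sizeof(T))); }

    void deallocate(pointer p, size_type)
    { ::operator delete(p); }

    size_type max_size() const throw()
    { return std::size_t(-1) / sizeof(T); }

    void construct(pointer p, const T& v) { new(p) T(v); }
    void destroy(pointer p) { p->~T(); }
};

template <typename T>
bool operator==(const new_alloc<T>&, const new_alloc<T>&) { return true; }
template <typename T>
bool operator!=(const new_alloc<T>&, const new_alloc<T>&) { return false; }

// The per-thread loop from the report, with the pool allocator replaced
// explicitly instead of via the environment variable:
typedef std::map<int, int, std::less<int>,
                 new_alloc<std::pair<const int, int> > > new_alloc_map;

Built with the same g++ -pthread command line as above, a map declared this way allocates its nodes directly with new/delete, which is why forcing GLIBCPP_FORCE_NEW removes the slowdown.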
Created attachment 5556 [details]
compressed preprocessor output, from g++ -v -save-temps

Compressed with bzip2.
Confirmed on the mainline, though it is not as bad as it was in 3.3. The problem is that a mutex is getting locked too much; I do not know if this will ever be fixed except by a per-thread pool_alloc.
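A toy illustration (not the actual libstdc++ pool code) of the contention being described: every node that std::map allocates or frees, in every thread, funnels through one mutex, so on two CPUs the threads mostly wait on each other. A per-thread pool instead keeps one free list per thread and only touches a locked global list when its own list runs dry.

#include <pthread.h>
#include <cstddef>
#include <cstdlib>

// A fixed-size block pool guarded by a single mutex.  With two CPUs the
// threads spend their time bouncing this lock instead of computing.
class single_lock_pool
{
    struct block { block* next; };

    pthread_mutex_t lock_;
    block*          free_;
    std::size_t     block_size_;

public:
    explicit single_lock_pool(std::size_t block_size)
    : free_(0),
      block_size_(block_size < sizeof(block) ? sizeof(block) : block_size)
    { pthread_mutex_init(&lock_, 0); }

    void* allocate()
    {
        pthread_mutex_lock(&lock_);        // <-- every thread queues here
        block* b = free_;
        if (b)
            free_ = b->next;
        pthread_mutex_unlock(&lock_);
        if (!b)                            // pool empty: fall back to malloc
            b = static_cast<block*>(std::malloc(block_size_));
        return b;
    }

    void deallocate(void* p)
    {
        block* b = static_cast<block*>(p);
        pthread_mutex_lock(&lock_);        // <-- and queues here again on free
        b->next = free_;
        free_ = b;
        pthread_mutex_unlock(&lock_);
    }
};

A per-thread design replaces the single free_ list with one list per thread, so the common allocate/deallocate path needs no lock at all; that is the direction the rest of this thread takes.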
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

Thanks for the fast response!

Is there a per-thread pool_alloc currently available that is recommended for use? There looks to be evidence of one in the file include/c++/3.3.2/bits/stl_pthread_alloc.h.

(should I post questions like this via bugzilla?)
No; what I am saying is that this is a way to fix it, and that is all.
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

> Is there a per-thread pool_alloc currently available that is recommended
> for use? There looks to be evidence of one in the file
> include/c++/3.3.2/bits/stl_pthread_alloc.h

Do not use that one (well, the problem was it was never made portable within our library context; it might work for you on any random system). There is <ext/mt_allocator.h> on mainline / 3.4. The author of that work just submitted patches which solve MP contention issues and double the speed on non-MP systems. Please try that work, if you can.

Regards,
Loren

> (should I post questions like this via bugzilla?)

PS, please join us on libstdc++@gcc.gnu.org since we are currently discussing MT/MP allocator issues in the C++ library.
FYI, the initially submitted case misused pthread_join (granted, moot since the threads never exit). Here is a better version for our use (this one exits, thus easier to time):

#include <iostream>
#include <map>
#include <pthread.h>
#include <cstdlib>
#include <algorithm>

void *calc(void *n)
{
  std::map<int, int> m;

  for (unsigned i = 0; i < 100000; ++i)
    m[i] = i;

  return 0;
}

int main(int argc, char *argv[])
{
  int n = argc == 2 ? std::max(atoi(argv[1]), 1) : 2;

  pthread_t id[n];

  for (int i = 0; i < n; ++i)
    pthread_create(&id[i], 0, calc, new int(i));

  for (int i = 0; i < n; ++i)
    pthread_join(id[i], 0);

  return 0;
}
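For anyone reproducing the timings quoted later in the thread, a small wall-clock helper (assuming POSIX gettimeofday; not part of the original exchange) can be wrapped around the create/join loops:

#include <cstdio>
#include <sys/time.h>

// Wall-clock time in seconds, with microsecond resolution.
static double now_seconds()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// Usage inside main(), around the thread create/join loops:
//
//   double start = now_seconds();
//   ...create and join the threads...
//   std::printf("%d threads: %.2fs elapsed\n", n, now_seconds() - start);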
Try this (we might change the default allocator to make it work better for MP at the next major library revision --- ABI change, but 3.4 will support this allocator with very good performance for people that wish to tune their application):

#include <iostream>
#include <map>
#include <ext/mt_allocator.h>
#include <pthread.h>
#include <cstdlib>
#include <algorithm>

void *calc(void *n)
{
  std::map<int, int, std::less<const int>,
           __gnu_cxx::__mt_alloc< std::pair<const int, int> > > m;

  for (unsigned i = 0; i < 100000; ++i)
    m[i] = i;

  return 0;
}

int main(int argc, char *argv[])
{
  int n = argc == 2 ? std::max(atoi(argv[1]), 1) : 2;

  pthread_t id[n];

  for (int i = 0; i < n; ++i)
    pthread_create(&id[i], 0, calc, new int(i));

  for (int i = 0; i < n; ++i)
    pthread_join(id[i], 0);

  return 0;
}
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

thanks - I can see that's better, being able to time it, and it simplifies the program.

I'm no expert here, but perhaps it would still be best to have an outer loop in the calc function - say 5 or 10 iterations. That may exercise the allocator a bit more, repeatedly allocating and deallocating from the pool.

As an aside - I didn't know an array could have a variable as its size. Stroustrup's C++PLv3 says "The number of elements of the array, the array bound, must be a constant expression". But I see it compiles and works with gcc!

(btw I was aware of the misuse :)
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

> I'm no expert here, but perhaps it would still be best to have an outer loop
> in the calc function - say 5 or 10 iterations. That may exercise the
> allocator a bit more, repeatedly allocating and deallocating from the pool.

OK, added to the performance test case I'm adding to libstdc++-v3.

> As an aside - I didn't know an array could have a variable as its size.
> Stroustrup's C++PLv3 says "The number of elements of the array, the array
> bound, must be a constant expression". But I see it compiles and works with
> gcc!

OK, you got me. Add -pedantic to see g++ fail the test case, as I updated it. This is a long-standing gcc extension.

> (btw I was aware of the misuse :)

OK, no problem. Just an FYI to those that follow the thread later.

Thanks for the PR,
Loren
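For completeness, a strictly conforming variant of the test case (not from the original thread) that avoids the variable-length-array extension -pedantic rejects, by keeping the pthread_t handles in a std::vector:

#include <algorithm>
#include <cstdlib>
#include <map>
#include <vector>
#include <pthread.h>

// Same per-thread work as the test case above.
void *calc(void *)
{
    std::map<int, int> m;
    for (unsigned i = 0; i < 100000; ++i)
        m[i] = i;
    return 0;
}

int main(int argc, char *argv[])
{
    int n = argc == 2 ? std::max(std::atoi(argv[1]), 1) : 2;

    std::vector<pthread_t> id(n);      // heap-backed, no VLA extension needed

    for (int i = 0; i < n; ++i)
        pthread_create(&id[i], 0, calc, new int(i));

    for (int i = 0; i < n; ++i)
        pthread_join(id[i], 0);

    return 0;
}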
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

Hi,

I obtained the latest ext/mt_allocator.h file from cvs and reran the test using it. Otherwise I'm still using the gcc 3.3.2 headers. BTW, I needed to make one addition to get it to compile:

namespace __gnu_cxx { using std::__throw_bad_alloc; }

On my machine it showed an improvement of around 30%, which is good, but it is still running into a significant bottleneck (still 4 times slower than on a single cpu). Has anyone seen other results?

By the way, while this example program is a little perverse, I was getting a similar performance hit in my application when I tried to speed things up with a second calculation thread. The calculations I am doing require creating tables of temporary results.

thanks,
Dan
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

> I obtained the latest ext/mt_allocator.h file from cvs, and reran the test
> using it. Otherwise I'm still using the gcc 3.3.2 headers. BTW, I needed
> to make one addition to make it compile:
> namespace __gnu_cxx { using std::__throw_bad_alloc; }

Please apply this patch and rerun your test (it would also be great if you could tell us whether your real application code sees major improvement using this allocator with an MP machine; I know that might be a little more work)...

[This patch is from Stefan, the original author of mt_allocator.h.]

Index: include/ext/mt_allocator.h
===================================================================
RCS file: /cvs/gcc/gcc/libstdc++-v3/include/ext/mt_allocator.h,v
retrieving revision 1.9
diff -c -r1.9 mt_allocator.h
*** include/ext/mt_allocator.h  20 Jan 2004 06:35:21 -0000  1.9
--- include/ext/mt_allocator.h  23 Jan 2004 05:53:48 -0000
***************
*** 77,95 ****
        __mt_alloc() throw()
        {
!         // XXX
        }

        __mt_alloc(const __mt_alloc&) throw()
        {
!         // XXX
        }

        template<typename _Tp1>
          __mt_alloc(const __mt_alloc<_Tp1>&) throw()
          {
!           // XXX
!         }

        ~__mt_alloc() throw() { }
--- 77,95 ----
        __mt_alloc() throw()
        {
!         // XXX
        }

        __mt_alloc(const __mt_alloc&) throw()
        {
!         // XXX
        }

        template<typename _Tp1>
          __mt_alloc(const __mt_alloc<_Tp1>&) throw()
          {
!           // XXX
!         }

        ~__mt_alloc() throw() { }
***************
*** 237,243 ****
            __throw_bad_alloc();
          return static_cast<_Tp*>(__ret);
        }
!     /*
       * Although the test in __gthread_once() would suffice, we
       * wrap test of the once condition in our own unlocked
--- 237,243 ----
            __throw_bad_alloc();
          return static_cast<_Tp*>(__ret);
        }
!     /*
       * Although the test in __gthread_once() would suffice, we
       * wrap test of the once condition in our own unlocked
***************
*** 269,275 ****
        size_t thread_id = 0;
  #endif
!       block_record* block;
        /*
         * Find out if we have blocks on our freelist.
--- 269,275 ----
        size_t thread_id = 0;
  #endif
!       block_record* block = NULL;
        /*
         * Find out if we have blocks on our freelist.
***************
*** 280,288 ****
        {
          /*
           * Are we using threads?
!          * - Yes, lock and check if there are free blocks on the global
!          *   list (and if not add new ones), get the first one
!          *   and change owner.
           * - No, all operations are made directly to global pool 0
           *   no need to lock or change ownership but check for free
           *   blocks on global list (and if not add new ones) and
--- 280,290 ----
        {
          /*
           * Are we using threads?
!          * - Yes, check if there are free blocks on the global
!          *   list. If so, grab up to block_count blocks in one
!          *   lock and change ownership. If the global list is
!          *   empty, we allocate a new chunk and add those blocks
!          *   directly to our own freelist (with us as owner).
           * - No, all operations are made directly to global pool 0
           *   no need to lock or change ownership but check for free
           *   blocks on global list (and if not add new ones) and
***************
*** 291,347 ****
  #ifdef __GTHREADS
        if (__gthread_active_p())
          {
            __gthread_mutex_lock(_S_bin[bin].mutex);

            if (_S_bin[bin].first[0] == NULL)
              {
!               _S_bin[bin].first[0] =
!                 (block_record*)malloc(_S_chunk_size);
!               if (!_S_bin[bin].first[0])
!                 {
!                   __gthread_mutex_unlock(_S_bin[bin].mutex);
!                   __throw_bad_alloc();
!                 }
!               size_t bin_t = 1 << bin;
!               size_t block_count =
!                 _S_chunk_size /(bin_t + sizeof(block_record));
!               _S_bin[bin].free[0] = block_count;
                block_count--;
!               block = _S_bin[bin].first[0];
                while (block_count > 0)
                  {
                    block->next = (block_record*)((char*)block + (bin_t + sizeof(block_record)));
                    block = block->next;
                    block_count--;
                  }
                block->next = NULL;
!               _S_bin[bin].last[0] = block;
              }
!           block = _S_bin[bin].first[0];
!           /*
!            * Remove from list and count down the available counter on
!            * global pool 0.
!            */
!           _S_bin[bin].first[0] = _S_bin[bin].first[0]->next;
!           _S_bin[bin].free[0]--;
!           __gthread_mutex_unlock(_S_bin[bin].mutex);

            /*
!            * Now that we have removed the block from the global
!            * freelist we can change owner and update the used
!            * counter for this thread without locking.
             */
!           block->thread_id = thread_id;
            _S_bin[bin].used[thread_id]++;
          }
        else
--- 293,375 ----
  #ifdef __GTHREADS
        if (__gthread_active_p())
          {
+           size_t bin_t = 1 << bin;
+           size_t block_count =
+             _S_chunk_size /(bin_t + sizeof(block_record));
+
            __gthread_mutex_lock(_S_bin[bin].mutex);

            if (_S_bin[bin].first[0] == NULL)
              {
!               /*
!                * No need to hold the lock when we are adding a
!                * whole chunk to our own list
!                */
!               __gthread_mutex_unlock(_S_bin[bin].mutex);
!               _S_bin[bin].first[thread_id] =
!                 (block_record*)malloc(_S_chunk_size);
!               if (!_S_bin[bin].first[thread_id])
!                 __throw_bad_alloc();
!               _S_bin[bin].free[thread_id] = block_count;
                block_count--;
!               block = _S_bin[bin].first[thread_id];
                while (block_count > 0)
                  {
                    block->next = (block_record*)((char*)block + (bin_t + sizeof(block_record)));
+                   block->thread_id = thread_id;
                    block = block->next;
                    block_count--;
                  }
                block->next = NULL;
!               block->thread_id = thread_id;
!               _S_bin[bin].last[thread_id] = block;
              }
+           else
+             {
+               size_t global_count = 0;
!               while( _S_bin[bin].first[0] != NULL &&
!                      global_count < block_count )
!                 {
!                   block = _S_bin[bin].first[0];
!                   if (_S_bin[bin].first[thread_id] == NULL)
!                     _S_bin[bin].first[thread_id] = block;
!                   else
!                     _S_bin[bin].last[thread_id]->next = block;
!                   _S_bin[bin].last[thread_id] = block;
!
!                   block->thread_id = thread_id;
!
!                   _S_bin[bin].free[thread_id]++;
!
!                   _S_bin[bin].first[0] = _S_bin[bin].first[0]->next;
!
!                   global_count++;
!                 }
!
!               block->next = NULL;
!
!               __gthread_mutex_unlock(_S_bin[bin].mutex);
!             }

            /*
!            * Return the first newly added block in our list and
!            * update the counters
             */
!           block = _S_bin[bin].first[thread_id];
!           _S_bin[bin].first[thread_id] =
!             _S_bin[bin].first[thread_id]->next;
!
!           _S_bin[bin].free[thread_id]--;
            _S_bin[bin].used[thread_id]++;
          }
        else
***************
*** 354,362 ****
            size_t bin_t = 1 << bin;
            size_t block_count =
!             _S_chunk_size / (bin_t + sizeof(block_record));
!
!           _S_bin[bin].free[0] = block_count;
            block_count--;

            block = _S_bin[bin].first[0];
--- 382,388 ----
            size_t bin_t = 1 << bin;
            size_t block_count =
!             _S_chunk_size / (bin_t + sizeof(block_record));
            block_count--;

            block = _S_bin[bin].first[0];
***************
*** 375,386 ****
            block = _S_bin[bin].first[0];

            /*
!            * Remove from list and count down the available counter on
!            * global pool 0 and increase it's used counter.
             */
            _S_bin[bin].first[0] = _S_bin[bin].first[0]->next;
-           _S_bin[bin].free[0]--;
-           _S_bin[bin].used[0]++;
          }
      }
    else
--- 401,409 ----
            block = _S_bin[bin].first[0];

            /*
!            * Remove from list
             */
            _S_bin[bin].first[0] = _S_bin[bin].first[0]->next;
          }
      }
    else
***************
*** 392,399 ****
        block = _S_bin[bin].first[thread_id];
        _S_bin[bin].first[thread_id] = _S_bin[bin].first[thread_id]->next;
!       _S_bin[bin].free[thread_id]--;
!       _S_bin[bin].used[thread_id]++;
      }

      return static_cast<_Tp*>(static_cast<void*>((char*)block + sizeof(block_record)));
--- 415,428 ----
        block = _S_bin[bin].first[thread_id];
        _S_bin[bin].first[thread_id] = _S_bin[bin].first[thread_id]->next;
!
! #ifdef __GTHREADS
!       if (__gthread_active_p())
!         {
!           _S_bin[bin].free[thread_id]--;
!           _S_bin[bin].used[thread_id]++;
!         }
! #endif
      }

      return static_cast<_Tp*>(static_cast<void*>((char*)block + sizeof(block_record)));
***************
*** 424,430 ****
  #endif

        block_record* block = (block_record*)((char*)__p
!                                             - sizeof(block_record));

        /*
         * This block will always be at the back of a list and thus
--- 453,459 ----
  #endif

        block_record* block = (block_record*)((char*)__p
!                                             - sizeof(block_record));

        /*
         * This block will always be at the back of a list and thus
***************
*** 465,471 ****
                  _S_bin[bin].first[thread_id] =
                    _S_bin[bin].first[thread_id]->next;

-                 _S_bin[bin].free[0]++;
                  _S_bin[bin].free[thread_id]--;

                  remove--;
--- 494,499 ----
***************
*** 509,517 ****
                _S_bin[bin].last[0]->next = block;
                _S_bin[bin].last[0] = block;
-
-               _S_bin[bin].free[0]++;
-               _S_bin[bin].used[0]--;
              }
          }
      };
--- 537,542 ----
***************
*** 564,570 ****
  #ifdef __GTHREADS
        if (__gthread_active_p())
          {
!           _S_thread_freelist_first =
              (thread_record*)malloc(sizeof(thread_record) * _S_max_threads);

            if (!_S_thread_freelist_first)
--- 589,595 ----
  #ifdef __GTHREADS
        if (__gthread_active_p())
          {
!           _S_thread_freelist_first =
              (thread_record*)malloc(sizeof(thread_record) * _S_max_threads);

            if (!_S_thread_freelist_first)
***************
*** 578,584 ****
            for (i = 1; i < _S_max_threads; i++)
              {
                _S_thread_freelist_first[i - 1].next =
!                 &_S_thread_freelist_first[i];

                _S_thread_freelist_first[i - 1].id = i;
              }
--- 603,609 ----
            for (i = 1; i < _S_max_threads; i++)
              {
                _S_thread_freelist_first[i - 1].next =
!                 &_S_thread_freelist_first[i];

                _S_thread_freelist_first[i - 1].id = i;
              }
***************
*** 605,656 ****
        if (!_S_bin)
          __throw_bad_alloc();
!       for (size_t bin = 0; bin < _S_no_of_bins; bin++)
!         {
!           std::size_t __n = _S_max_threads + 1;
            _S_bin[bin].first = (block_record**)
!             malloc(sizeof(block_record*) * __n);
            if (!_S_bin[bin].first)
              __throw_bad_alloc();
            _S_bin[bin].last = (block_record**)
!             malloc(sizeof(block_record*) * __n);
            if (!_S_bin[bin].last)
              __throw_bad_alloc();
!           _S_bin[bin].free = (size_t*) malloc(sizeof(size_t) * __n);
!           if (!_S_bin[bin].free)
!             __throw_bad_alloc();
!           _S_bin[bin].used = (size_t*) malloc(sizeof(size_t) * __n);
!           if (!_S_bin[bin].used)
!             __throw_bad_alloc();
! #ifdef __GTHREADS
!           _S_bin[bin].mutex =(__gthread_mutex_t*) malloc(sizeof(__gthread_mutex_t));
  #ifdef __GTHREAD_MUTEX_INIT
!           {
!             // Do not copy a POSIX/gthr mutex once in use.
!             __gthread_mutex_t __tmp = __GTHREAD_MUTEX_INIT;
!             *_S_bin[bin].mutex = __tmp;
!           }
  #else
!           { __GTHREAD_MUTEX_INIT_FUNCTION (_S_bin[bin].mutex); }
  #endif
  #endif
!           for (size_t thread = 0; thread <= _S_max_threads; thread++)
              {
                _S_bin[bin].first[thread] = NULL;
                _S_bin[bin].last[thread] = NULL;
!               _S_bin[bin].free[thread] = 0;
!               _S_bin[bin].used[thread] = 0;
              }
          }
--- 630,694 ----
        if (!_S_bin)
          __throw_bad_alloc();
!       std::size_t __n = 1;
+ #ifdef __GTHREADS
+       if (__gthread_active_p())
+         __n = _S_max_threads + 1;
+ #endif
+
+       for (size_t bin = 0; bin < _S_no_of_bins; bin++)
+         {
            _S_bin[bin].first = (block_record**)
!             malloc(sizeof(block_record*) * __n);
            if (!_S_bin[bin].first)
              __throw_bad_alloc();
            _S_bin[bin].last = (block_record**)
!             malloc(sizeof(block_record*) * __n);
            if (!_S_bin[bin].last)
              __throw_bad_alloc();
! #ifdef __GTHREADS
!           if (__gthread_active_p())
!             {
!               _S_bin[bin].free = (size_t*) malloc(sizeof(size_t) * __n);
!               if (!_S_bin[bin].free)
!                 __throw_bad_alloc();
!               _S_bin[bin].used = (size_t*) malloc(sizeof(size_t) * __n);
!               if (!_S_bin[bin].used)
!                 __throw_bad_alloc();
!               _S_bin[bin].mutex =(__gthread_mutex_t*) malloc(sizeof(__gthread_mutex_t));
  #ifdef __GTHREAD_MUTEX_INIT
!               {
!                 // Do not copy a POSIX/gthr mutex once in use.
!                 __gthread_mutex_t __tmp = __GTHREAD_MUTEX_INIT;
!                 *_S_bin[bin].mutex = __tmp;
!               }
  #else
!               { __GTHREAD_MUTEX_INIT_FUNCTION (_S_bin[bin].mutex); }
  #endif
+             }
  #endif
!           for (size_t thread = 0; thread < __n; thread++)
              {
                _S_bin[bin].first[thread] = NULL;
                _S_bin[bin].last[thread] = NULL;
! #ifdef __GTHREADS
!               if (__gthread_active_p())
!                 {
!                   _S_bin[bin].free[thread] = 0;
!                   _S_bin[bin].used[thread] = 0;
!                 }
! #endif
              }
          }
***************
*** 783,788 ****
--- 821,838 ----
    template<typename _Tp> typename __mt_alloc<_Tp>::bin_record*
    volatile __mt_alloc<_Tp>::_S_bin = NULL;
+
+   template<typename _Tp>
+     inline bool
+     operator==(const __mt_alloc<_Tp>&,
+                const __mt_alloc<_Tp>&)
+     { return true; }
+
+   template<typename _Tp>
+     inline bool
+     operator!=(const __mt_alloc<_Tp>&,
+                const __mt_alloc<_Tp>&)
+     { return false; }
  } // namespace __gnu_cxx

  #endif
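An aside (not from the original thread) on the operator== / operator!= overloads the patch adds at the very end: containers assume that equal allocator objects are interchangeable, i.e. memory obtained through one instance may be released through another. For a stateless allocator such as __mt_alloc that is always true, so operator== returns true unconditionally. A small sketch of where this matters, assuming the patched <ext/mt_allocator.h> is installed:

#include <map>
#include <ext/mt_allocator.h>

typedef std::map<int, int, std::less<const int>,
                 __gnu_cxx::__mt_alloc<std::pair<const int, int> > > mt_map;

void swap_example()
{
    mt_map a, b;
    a[1] = 1;
    b[2] = 2;

    // Containers may exchange nodes only if their allocators are
    // interchangeable; the new operator== says they always are.
    if (a.get_allocator() == b.get_allocator())
        a.swap(b);
}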
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

> Please apply this patch and rerun your test (it would also be great if
> you could tell us whether your real application code sees major
> improvement using this allocator with an MP machine; I know that might
> be a little more work)...

Hi Loren,

well, the patch improved things significantly on the test. For example:

one thread, one cpu: 17.4s
two threads, one cpu: 35.0s
two threads, two cpus: 29.5s

So now, when the two threads are running, the performance is only slightly lower on two cpus than on one, i.e. this is about 4 times faster than the previous version. Of course, it is still unfortunate that it is *slower* with an extra cpu.

I've been working to integrate this into my application to test it properly, and it has been a real hassle! Luckily I use typedefs for every container class type, but I still have 20 std::set types, 58 std::map types, 13 std::hash_map types, 97 std::vector types and 8 std::deque types. I've used the "template typedef" technique described in http://www.gotw.ca/gotw/079.htm, so at least I can now change which allocator is used in one place.

I also make extensive use of std::string, which also (of course) uses the allocator. Redefining a new string type using the mt_allocator has been a major hassle. I'm using lots of other libraries that return and accept ordinary std::strings, and of course the two string types aren't interchangeable. Also I've started defining new instantiations of basic_ostringstream rather than using ostringstream, etc.

There's got to be a better way!

any ideas...?

thanks,
Dan
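A sketch of the "template typedef" technique Dan refers to (the names project_alloc and project_map are illustrative, not from his application): the allocator is spelled in exactly one traits template, and every container typedef goes through it, so switching between __mt_alloc and the default allocator is a one-line change.

#include <map>
#include <utility>
#include <ext/mt_allocator.h>

// Change this one template to switch allocators for the whole project.
template <typename T>
struct project_alloc
{ typedef __gnu_cxx::__mt_alloc<T> type; };

// "Template typedef" for maps: project_map<K, V>::type is the container.
template <typename Key, typename Value>
struct project_map
{
    typedef std::map<Key, Value, std::less<Key>,
                     typename project_alloc<std::pair<const Key, Value> >::type>
        type;
};

// Usage:
//   project_map<int, int>::type m;
//   m[42] = 1;

Analogous traits can wrap set, vector, deque and hash_map; strings are harder, as the next comment discusses, because a basic_string with a different allocator is a distinct type from std::string.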
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

> well, the patch improved things significantly on the test. For example:
> one thread, one cpu: 17.4s
> two threads, one cpu: 35.0s
> two threads, two cpus: 29.5s
[...]

OK, that is good to see. I agree with your analysis; it is unfortunate that some work loads don't scale (with our allocators)...

> I've been working to integrate this into my application to test it properly,
> and it has been a real hassle! Luckily I use typedefs for every container
> class type, but still have 20 std::set types, 58 std::map types, 13
> std::hash_map types, 97 std::vector types and 8 std::deque types. I've used
> the "template typedef" technique described in:
> http://www.gotw.ca/gotw/079.htm so at least I can now change which
> allocator is used in one place.

Sorry about that. All I can say is that reports on whether this allocator helps real applications will help us decide whether it can become the new default allocator.

> I also make extensive use of std::string which also (of course) uses the
> allocator. Redefining a new string type using the mt_allocator has been a
> major hassle. I'm using lots of other libraries that return and accept
> ordinary std::strings, and of course the two strings aren't interchangeable.
> Also I've started defining new instantiations of basic_ostringstream rather
> than using ostringstream, etc.
>
> There's got to be a better way!
>
> any ideas...?

You might try selectively changing allocators in your system. Perhaps try only changing the containers but not strings.

Regards,
Loren