C++, gcc 3.3.2, Red Hat Linux 7.2, kernel 2.4.20, dual AMD Athlon MP 2200+.

When two threads are simultaneously adding elements to local standard container objects (including map, set and ext/hash_map), I observe a performance degradation by a factor of 6x when the program is running on two CPUs compared to running on a single CPU (whereas you might expect an improvement of 2x). This seems to be a problem with the default allocator, since running the program with GLIBCPP_FORCE_NEW set removes the problem and dramatically speeds up the program on two CPUs (whereas you might expect a slight degradation).

The example program creates a number of threads, each of which runs an infinite loop, repeatedly constructing a std::map<int,int> on the stack, inserting a number of elements, then letting it go out of scope. It takes a single parameter specifying how many threads to create; if unspecified it defaults to two threads.

When the program is run with two threads on my dual-CPU machine, I observe both processors staying about 60% idle, with each at 15% user and 25% system load. Why such a high system load? When I force the program onto a single CPU (by running another compute-bound app), the two threads each consume a full 50% of the CPU with only minimal system load, and the program runs 6x faster.

By the way, I observed the same problem with gcc 3.2.1, and on another machine running a different version of Linux.

Following is the output of the g++ -v -save-temps command. I will attach the compressed preprocessed .ii file after this bug is submitted (since I can't see how to do it on this page). Further below is the unpreprocessed source file, in case that is useful too.

thanks,
Dan Evison
devison at pacificit dot co dot nz

---------------------------------------------------------------------
$ /usr/local/gcc-3.3.2/bin/g++ -v -save-temps -pthread -D _PTHREADS -D _REENTRANT maptest.cpp
Reading specs from /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/specs
Configured with: ../gcc-3.3.2/configure --enable-shared --with-gnu-as --with-gnu-ld --enable-threads=posix --enable-languages=c++,java --disable-nls --prefix=/usr/local/gcc-3.3.2
Thread model: posix
gcc version 3.3.2
 /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/cc1plus -E -D__GNUG__=3 -quiet -v -D__GNUC__=3 -D__GNUC_MINOR__=3 -D__GNUC_PATCHLEVEL__=2 -D_GNU_SOURCE -D_REENTRANT -D _PTHREADS -D _REENTRANT maptest.cpp maptest.ii
ignoring nonexistent directory "/usr/local/gcc-3.3.2/i686-pc-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/local/gcc-3.3.2/include/c++/3.3.2
 /usr/local/gcc-3.3.2/include/c++/3.3.2/i686-pc-linux-gnu
 /usr/local/gcc-3.3.2/include/c++/3.3.2/backward
 /usr/local/include
 /usr/local/gcc-3.3.2/include
 /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/include
 /usr/include
End of search list.
 /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/cc1plus -fpreprocessed maptest.ii -quiet -dumpbase maptest.cpp -auxbase maptest -version -o maptest.s
GNU C++ version 3.3.2 (i686-pc-linux-gnu)
        compiled by GNU C version 3.3.2.
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
 as -V -Qy -o maptest.o maptest.s
GNU assembler version 2.11.93.0.2 (i386-redhat-linux) using BFD version 2.11.93.0.2 20020207
 /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/collect2 --eh-frame-hdr -m elf_i386 -dynamic-linker /lib/ld-linux.so.2 /usr/lib/crt1.o /usr/lib/crti.o /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/crtbegin.o -L/usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2 -L/usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/../../.. maptest.o -lstdc++ -lm -lgcc_s -lgcc -lpthread -lc -lgcc_s -lgcc /usr/local/gcc-3.3.2/lib/gcc-lib/i686-pc-linux-gnu/3.3.2/crtend.o /usr/lib/crtn.o
--------------------------------------------------------------------
#include <iostream>
#include <map>
#include <pthread.h>

void *calc(void *n)
{
  int *num = static_cast<int *>(n);
  unsigned c = 0;

  while (true)
    {
      std::cerr << "Thread " << *num << ": " << ++c << std::endl;

      std::map<int, int> m;

      for (unsigned i = 0; i < 100000; ++i)
        m[i] = i;
    }

  return 0;
}

int main(int argc, char *argv[])
{
  int n = argc == 2 ? std::max(atoi(argv[1]), 1) : 2;

  pthread_t id;

  for (int i = 0; i < n; ++i)
    pthread_create(&id, 0, calc, new int(i));

  pthread_join(id, 0);

  return 0;
}
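A minimal sketch (not part of the original report) of what the GLIBCPP_FORCE_NEW observation amounts to: give the container an allocator that forwards every request straight to operator new/delete, so the shared pool and its lock never enter the picture. The allocator name new_alloc below is made up for illustration; libstdc++ 3.4 ships a comparable <ext/new_allocator.h>.

#include <cstddef>
#include <map>
#include <new>
#include <utility>

// A trivial C++98 allocator: no pooling, no locking, just new/delete.
template <typename T>
struct new_alloc
{
    typedef T              value_type;
    typedef T*             pointer;
    typedef const T*       const_pointer;
    typedef T&             reference;
    typedef const T&       const_reference;
    typedef std::size_t    size_type;
    typedef std::ptrdiff_t difference_type;

    template <typename U> struct rebind { typedef new_alloc<U> other; };

    new_alloc() throw() { }
    new_alloc(const new_alloc&) throw() { }
    template <typename U> new_alloc(const new_alloc<U>&) throw() { }

    pointer address(reference x) const { return &x; }
    const_pointer address(const_reference x) const { return &x; }

    // Every container node comes straight from the global heap.
    pointer allocate(size_type n, const void* = 0)
    { return static_cast<pointer>(::operator new(n * sizeof(T))); }

    void deallocate(pointer p, size_type)
    { ::operator delete(p); }

    size_type max_size() const throw()
    { return std::size_t(-1) / sizeof(T); }

    void construct(pointer p, const T& v) { new(p) T(v); }
    void destroy(pointer p) { p->~T(); }
};

template <typename T>
bool operator==(const new_alloc<T>&, const new_alloc<T>&) { return true; }
template <typename T>
bool operator!=(const new_alloc<T>&, const new_alloc<T>&) { return false; }

// The per-thread loop from the report, with the pool allocator replaced
// explicitly instead of via the environment variable:
typedef std::map<int, int, std::less<int>,
                 new_alloc<std::pair<const int, int> > > new_alloc_map;

Built with the same g++ -pthread command line as above, a map declared this way allocates its nodes directly with new/delete, which is why forcing GLIBCPP_FORCE_NEW removes the slowdown.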
Created attachment 5556 [details]
compressed preprocessor output, from g++ -v -save-temps

Compressed with bzip2.
Confirmed on the mainline, though it is not as bad as it was in 3.3. The problem is that a mutex is getting locked too much; I do not know if this will ever be fixed except by a per-thread pool_alloc.
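A toy illustration (not the actual libstdc++ pool code) of the contention being described: every node that std::map allocates or frees, in every thread, funnels through one mutex, so on two CPUs the threads mostly wait on each other. A per-thread pool instead keeps one free list per thread and only touches a locked global list when its own list runs dry.

#include <pthread.h>
#include <cstddef>
#include <cstdlib>

// A fixed-size block pool guarded by a single mutex.  With two CPUs the
// threads spend their time bouncing this lock instead of computing.
class single_lock_pool
{
    struct block { block* next; };

    pthread_mutex_t lock_;
    block*          free_;
    std::size_t     block_size_;

public:
    explicit single_lock_pool(std::size_t block_size)
    : free_(0),
      block_size_(block_size < sizeof(block) ? sizeof(block) : block_size)
    { pthread_mutex_init(&lock_, 0); }

    void* allocate()
    {
        pthread_mutex_lock(&lock_);        // <-- every thread queues here
        block* b = free_;
        if (b)
            free_ = b->next;
        pthread_mutex_unlock(&lock_);
        if (!b)                            // pool empty: fall back to malloc
            b = static_cast<block*>(std::malloc(block_size_));
        return b;
    }

    void deallocate(void* p)
    {
        block* b = static_cast<block*>(p);
        pthread_mutex_lock(&lock_);        // <-- and queues here again on free
        b->next = free_;
        free_ = b;
        pthread_mutex_unlock(&lock_);
    }
};

A per-thread design replaces the single free_ list with one list per thread, so the common allocate/deallocate path needs no lock at all; that is the direction the rest of this thread takes.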
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

Thanks for the fast response!

Is there a per-thread pool_alloc currently available that is recommended for use? There looks to be evidence of one in the file include/c++/3.3.2/bits/stl_pthread_alloc.h.

(should I post questions like this via bugzilla?)
No; what I am saying is that this is a way to fix it, and that is all.
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

> Is there a per-thread pool_alloc currently available that is recommended
> for use? There looks to be evidence of one in the file
> include/c++/3.3.2/bits/stl_pthread_alloc.h

Do not use that one (well, the problem was it was never made portable within our library context; it might work for you on any random system). There is <ext/mt_allocator.h> on mainline / 3.4. The author of that work just submitted patches which solve MP contention issues and double the speed on non-MP systems. Please try that work, if you can.

Regards,
Loren

> (should I post questions like this via bugzilla?)

PS, please join us on libstdc++@gcc.gnu.org since we are currently discussing MT/MP allocator issues in the C++ library.
FYI, the initially submitted case misused pthread_join (granted, moot since the threads never exit). Here is a better version for our use (this one exits, thus easier to time):

#include <iostream>
#include <map>
#include <pthread.h>
#include <cstdlib>
#include <algorithm>

void *calc(void *n)
{
  std::map<int, int> m;

  for (unsigned i = 0; i < 100000; ++i)
    m[i] = i;

  return 0;
}

int main(int argc, char *argv[])
{
  int n = argc == 2 ? std::max(atoi(argv[1]), 1) : 2;

  pthread_t id[n];

  for (int i = 0; i < n; ++i)
    pthread_create(&id[i], 0, calc, new int(i));

  for (int i = 0; i < n; ++i)
    pthread_join(id[i], 0);

  return 0;
}
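For anyone reproducing the timings quoted later in the thread, a small wall-clock helper (assuming POSIX gettimeofday; not part of the original exchange) can be wrapped around the create/join loops:

#include <cstdio>
#include <sys/time.h>

// Wall-clock time in seconds, with microsecond resolution.
static double now_seconds()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

// Usage inside main(), around the thread create/join loops:
//
//   double start = now_seconds();
//   ...create and join the threads...
//   std::printf("%d threads: %.2fs elapsed\n", n, now_seconds() - start);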
Try this (we might change the default allocator to make it work better for MP at the next major library revision --- ABI change, but 3.4 will support this allocator with very good performance for people that wish to tune their application):

#include <iostream>
#include <map>
#include <ext/mt_allocator.h>
#include <pthread.h>
#include <cstdlib>
#include <algorithm>

void *calc(void *n)
{
  std::map<int, int, std::less<const int>,
           __gnu_cxx::__mt_alloc< std::pair<const int, int> > > m;

  for (unsigned i = 0; i < 100000; ++i)
    m[i] = i;

  return 0;
}

int main(int argc, char *argv[])
{
  int n = argc == 2 ? std::max(atoi(argv[1]), 1) : 2;

  pthread_t id[n];

  for (int i = 0; i < n; ++i)
    pthread_create(&id[i], 0, calc, new int(i));

  for (int i = 0; i < n; ++i)
    pthread_join(id[i], 0);

  return 0;
}
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

thanks - I can see that's better, being able to time it, and it simplifies the program.

I'm no expert here, but perhaps it would still be best to have an outer loop in the calc function - say 5 or 10 iterations. That may exercise the allocator a bit more, repeatedly allocating and deallocating from the pool.

As an aside - I didn't know an array could have a variable as its size. Stroustrup's C++PLv3 says "The number of elements of the array, the array bound, must be a constant expression". But I see it compiles and works with gcc!

(btw I was aware of the misuse :)
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

> I'm no expert here, but perhaps it would still be best to have an outer loop
> in the calc function - say 5 or 10 iterations. That may exercise the
> allocator a bit more, repeatedly allocating and deallocating from the pool.

OK, added to the performance test case I'm adding to libstdc++-v3.

> As an aside - I didn't know an array could have a variable as its size.
> Stroustrup's C++PLv3 says "The number of elements of the array, the array
> bound, must be a constant expression". But I see it compiles and works with
> gcc!

OK, you got me. Add -pedantic to see g++ fail the test case, as I updated it. This is a long-standing gcc extension.

> (btw I was aware of the misuse :)

OK, no problem. Just an FYI to those that follow the thread later.

Thanks for the PR,
Loren
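For completeness, a strictly conforming variant of the test case (not from the original thread) that avoids the variable-length-array extension -pedantic rejects, by keeping the pthread_t handles in a std::vector:

#include <algorithm>
#include <cstdlib>
#include <map>
#include <vector>
#include <pthread.h>

// Same per-thread work as the test case above.
void *calc(void *)
{
    std::map<int, int> m;
    for (unsigned i = 0; i < 100000; ++i)
        m[i] = i;
    return 0;
}

int main(int argc, char *argv[])
{
    int n = argc == 2 ? std::max(std::atoi(argv[1]), 1) : 2;

    std::vector<pthread_t> id(n);      // heap-backed, no VLA extension needed

    for (int i = 0; i < n; ++i)
        pthread_create(&id[i], 0, calc, new int(i));

    for (int i = 0; i < n; ++i)
        pthread_join(id[i], 0);

    return 0;
}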
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

Hi,

I obtained the latest ext/mt_allocator.h file from cvs and reran the test using it. Otherwise I'm still using the gcc 3.3.2 headers. BTW, I needed to make one addition to get it to compile:

namespace __gnu_cxx { using std::__throw_bad_alloc; }

On my machine it showed an improvement of around 30%, which is good, but it is still running into a significant bottleneck (still 4 times slower than on a single cpu). Has anyone seen other results?

By the way, while this example program is a little perverse, I was getting a similar performance hit in my application when I tried to speed things up with a second calculation thread. The calculations I am doing require creating tables of temporary results.

thanks,
Dan
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

> I obtained the latest ext/mt_allocator.h file from cvs, and reran the test
> using it. Otherwise I'm still using the gcc 3.3.2 headers. BTW, I needed
> to make one addition to make it compile:
> namespace __gnu_cxx { using std::__throw_bad_alloc; }

Please apply this patch and rerun your test (it would also be great if you could tell us whether your real application code sees major improvement using this allocator with an MP machine; I know that might be a little more work)...

[This patch is from Stefan, the original author of mt_allocator.h.]

Index: include/ext/mt_allocator.h
===================================================================
RCS file: /cvs/gcc/gcc/libstdc++-v3/include/ext/mt_allocator.h,v
retrieving revision 1.9
diff -c -r1.9 mt_allocator.h
*** include/ext/mt_allocator.h  20 Jan 2004 06:35:21 -0000  1.9
--- include/ext/mt_allocator.h  23 Jan 2004 05:53:48 -0000
***************
*** 77,95 ****
        __mt_alloc() throw()
        {
!         // XXX
        }

        __mt_alloc(const __mt_alloc&) throw()
        {
!         // XXX
        }

        template<typename _Tp1>
          __mt_alloc(const __mt_alloc<_Tp1>&) throw()
          {
!           // XXX
!         }

        ~__mt_alloc() throw() { }
--- 77,95 ----
        __mt_alloc() throw()
        {
!         // XXX
        }

        __mt_alloc(const __mt_alloc&) throw()
        {
!         // XXX
        }

        template<typename _Tp1>
          __mt_alloc(const __mt_alloc<_Tp1>&) throw()
          {
!           // XXX
!         }

        ~__mt_alloc() throw() { }
***************
*** 237,243 ****
            __throw_bad_alloc();
          return static_cast<_Tp*>(__ret);
        }
!     /*
       * Although the test in __gthread_once() would suffice, we
       * wrap test of the once condition in our own unlocked
--- 237,243 ----
            __throw_bad_alloc();
          return static_cast<_Tp*>(__ret);
        }
!     /*
       * Although the test in __gthread_once() would suffice, we
       * wrap test of the once condition in our own unlocked
***************
*** 269,275 ****
        size_t thread_id = 0;
  #endif
!       block_record* block;
        /*
         * Find out if we have blocks on our freelist.
--- 269,275 ----
        size_t thread_id = 0;
  #endif
!       block_record* block = NULL;
        /*
         * Find out if we have blocks on our freelist.
***************
*** 280,288 ****
        {
          /*
           * Are we using threads?
!          * - Yes, lock and check if there are free blocks on the global
!          *   list (and if not add new ones), get the first one
!          *   and change owner.
           * - No, all operations are made directly to global pool 0
           *   no need to lock or change ownership but check for free
           *   blocks on global list (and if not add new ones) and
--- 280,290 ----
        {
          /*
           * Are we using threads?
!          * - Yes, check if there are free blocks on the global
!          *   list. If so, grab up to block_count blocks in one
!          *   lock and change ownership. If the global list is
!          *   empty, we allocate a new chunk and add those blocks
!          *   directly to our own freelist (with us as owner).
           * - No, all operations are made directly to global pool 0
           *   no need to lock or change ownership but check for free
           *   blocks on global list (and if not add new ones) and
***************
*** 291,347 ****
  #ifdef __GTHREADS
        if (__gthread_active_p())
          {
            __gthread_mutex_lock(_S_bin[bin].mutex);

            if (_S_bin[bin].first[0] == NULL)
              {
!               _S_bin[bin].first[0] =
!                 (block_record*)malloc(_S_chunk_size);
!               if (!_S_bin[bin].first[0])
!                 {
!                   __gthread_mutex_unlock(_S_bin[bin].mutex);
!                   __throw_bad_alloc();
!                 }
!               size_t bin_t = 1 << bin;
!               size_t block_count =
!                 _S_chunk_size /(bin_t + sizeof(block_record));
!               _S_bin[bin].free[0] = block_count;
                block_count--;
!               block = _S_bin[bin].first[0];
                while (block_count > 0)
                  {
                    block->next = (block_record*)((char*)block + (bin_t + sizeof(block_record)));
                    block = block->next;
                    block_count--;
                  }
                block->next = NULL;
!               _S_bin[bin].last[0] = block;
              }
!           block = _S_bin[bin].first[0];
!           /*
!            * Remove from list and count down the available counter on
!            * global pool 0.
!            */
!           _S_bin[bin].first[0] = _S_bin[bin].first[0]->next;
!           _S_bin[bin].free[0]--;
!           __gthread_mutex_unlock(_S_bin[bin].mutex);

            /*
!            * Now that we have removed the block from the global
!            * freelist we can change owner and update the used
!            * counter for this thread without locking.
             */
!           block->thread_id = thread_id;
            _S_bin[bin].used[thread_id]++;
          }
        else
--- 293,375 ----
  #ifdef __GTHREADS
        if (__gthread_active_p())
          {
+           size_t bin_t = 1 << bin;
+           size_t block_count =
+             _S_chunk_size /(bin_t + sizeof(block_record));
+
            __gthread_mutex_lock(_S_bin[bin].mutex);

            if (_S_bin[bin].first[0] == NULL)
              {
!               /*
!                * No need to hold the lock when we are adding a
!                * whole chunk to our own list
!                */
!               __gthread_mutex_unlock(_S_bin[bin].mutex);
!               _S_bin[bin].first[thread_id] =
!                 (block_record*)malloc(_S_chunk_size);
!               if (!_S_bin[bin].first[thread_id])
!                 __throw_bad_alloc();
!               _S_bin[bin].free[thread_id] = block_count;
                block_count--;
!               block = _S_bin[bin].first[thread_id];
                while (block_count > 0)
                  {
                    block->next = (block_record*)((char*)block + (bin_t + sizeof(block_record)));
+                   block->thread_id = thread_id;
                    block = block->next;
                    block_count--;
                  }
                block->next = NULL;
!               block->thread_id = thread_id;
!               _S_bin[bin].last[thread_id] = block;
              }
+           else
+             {
+               size_t global_count = 0;
!               while( _S_bin[bin].first[0] != NULL &&
!                      global_count < block_count )
!                 {
!                   block = _S_bin[bin].first[0];
!                   if (_S_bin[bin].first[thread_id] == NULL)
!                     _S_bin[bin].first[thread_id] = block;
!                   else
!                     _S_bin[bin].last[thread_id]->next = block;
!                   _S_bin[bin].last[thread_id] = block;
!
!                   block->thread_id = thread_id;
!
!                   _S_bin[bin].free[thread_id]++;
!
!                   _S_bin[bin].first[0] = _S_bin[bin].first[0]->next;
!
!                   global_count++;
!                 }
!
!               block->next = NULL;
!
!               __gthread_mutex_unlock(_S_bin[bin].mutex);
!             }

            /*
!            * Return the first newly added block in our list and
!            * update the counters
             */
!           block = _S_bin[bin].first[thread_id];
!           _S_bin[bin].first[thread_id] =
!             _S_bin[bin].first[thread_id]->next;
!
!           _S_bin[bin].free[thread_id]--;
            _S_bin[bin].used[thread_id]++;
          }
        else
***************
*** 354,362 ****
            size_t bin_t = 1 << bin;
            size_t block_count =
!             _S_chunk_size / (bin_t + sizeof(block_record));
!
!           _S_bin[bin].free[0] = block_count;
            block_count--;

            block = _S_bin[bin].first[0];
--- 382,388 ----
            size_t bin_t = 1 << bin;
            size_t block_count =
!             _S_chunk_size / (bin_t + sizeof(block_record));
            block_count--;

            block = _S_bin[bin].first[0];
***************
*** 375,386 ****
            block = _S_bin[bin].first[0];

            /*
!            * Remove from list and count down the available counter on
!            * global pool 0 and increase it's used counter.
             */
            _S_bin[bin].first[0] = _S_bin[bin].first[0]->next;
-           _S_bin[bin].free[0]--;
-           _S_bin[bin].used[0]++;
          }
      }
    else
--- 401,409 ----
            block = _S_bin[bin].first[0];

            /*
!            * Remove from list
             */
            _S_bin[bin].first[0] = _S_bin[bin].first[0]->next;
          }
      }
    else
***************
*** 392,399 ****
        block = _S_bin[bin].first[thread_id];
        _S_bin[bin].first[thread_id] = _S_bin[bin].first[thread_id]->next;
!       _S_bin[bin].free[thread_id]--;
!       _S_bin[bin].used[thread_id]++;
      }

      return static_cast<_Tp*>(static_cast<void*>((char*)block + sizeof(block_record)));
--- 415,428 ----
        block = _S_bin[bin].first[thread_id];
        _S_bin[bin].first[thread_id] = _S_bin[bin].first[thread_id]->next;
!
! #ifdef __GTHREADS
!       if (__gthread_active_p())
!         {
!           _S_bin[bin].free[thread_id]--;
!           _S_bin[bin].used[thread_id]++;
!         }
! #endif
      }

      return static_cast<_Tp*>(static_cast<void*>((char*)block + sizeof(block_record)));
***************
*** 424,430 ****
  #endif

        block_record* block = (block_record*)((char*)__p
!                                             - sizeof(block_record));

        /*
         * This block will always be at the back of a list and thus
--- 453,459 ----
  #endif

        block_record* block = (block_record*)((char*)__p
!                                             - sizeof(block_record));

        /*
         * This block will always be at the back of a list and thus
***************
*** 465,471 ****
                  _S_bin[bin].first[thread_id] =
                    _S_bin[bin].first[thread_id]->next;

-                 _S_bin[bin].free[0]++;
                  _S_bin[bin].free[thread_id]--;

                  remove--;
--- 494,499 ----
***************
*** 509,517 ****
                _S_bin[bin].last[0]->next = block;
                _S_bin[bin].last[0] = block;
-
-               _S_bin[bin].free[0]++;
-               _S_bin[bin].used[0]--;
              }
          }
      };
--- 537,542 ----
***************
*** 564,570 ****
  #ifdef __GTHREADS
        if (__gthread_active_p())
          {
!           _S_thread_freelist_first =
              (thread_record*)malloc(sizeof(thread_record) * _S_max_threads);

            if (!_S_thread_freelist_first)
--- 589,595 ----
  #ifdef __GTHREADS
        if (__gthread_active_p())
          {
!           _S_thread_freelist_first =
              (thread_record*)malloc(sizeof(thread_record) * _S_max_threads);

            if (!_S_thread_freelist_first)
***************
*** 578,584 ****
            for (i = 1; i < _S_max_threads; i++)
              {
                _S_thread_freelist_first[i - 1].next =
!                 &_S_thread_freelist_first[i];

                _S_thread_freelist_first[i - 1].id = i;
              }
--- 603,609 ----
            for (i = 1; i < _S_max_threads; i++)
              {
                _S_thread_freelist_first[i - 1].next =
!                 &_S_thread_freelist_first[i];

                _S_thread_freelist_first[i - 1].id = i;
              }
***************
*** 605,656 ****
        if (!_S_bin)
          __throw_bad_alloc();
!       for (size_t bin = 0; bin < _S_no_of_bins; bin++)
!         {
!           std::size_t __n = _S_max_threads + 1;
            _S_bin[bin].first = (block_record**)
!             malloc(sizeof(block_record*) * __n);
            if (!_S_bin[bin].first)
              __throw_bad_alloc();
            _S_bin[bin].last = (block_record**)
!             malloc(sizeof(block_record*) * __n);
            if (!_S_bin[bin].last)
              __throw_bad_alloc();
!           _S_bin[bin].free = (size_t*) malloc(sizeof(size_t) * __n);
!           if (!_S_bin[bin].free)
!             __throw_bad_alloc();
!           _S_bin[bin].used = (size_t*) malloc(sizeof(size_t) * __n);
!           if (!_S_bin[bin].used)
!             __throw_bad_alloc();
! #ifdef __GTHREADS
!           _S_bin[bin].mutex =(__gthread_mutex_t*) malloc(sizeof(__gthread_mutex_t));
  #ifdef __GTHREAD_MUTEX_INIT
!           {
!             // Do not copy a POSIX/gthr mutex once in use.
!             __gthread_mutex_t __tmp = __GTHREAD_MUTEX_INIT;
!             *_S_bin[bin].mutex = __tmp;
!           }
  #else
!           { __GTHREAD_MUTEX_INIT_FUNCTION (_S_bin[bin].mutex); }
  #endif
  #endif
!           for (size_t thread = 0; thread <= _S_max_threads; thread++)
              {
                _S_bin[bin].first[thread] = NULL;
                _S_bin[bin].last[thread] = NULL;
!               _S_bin[bin].free[thread] = 0;
!               _S_bin[bin].used[thread] = 0;
              }
          }
--- 630,694 ----
        if (!_S_bin)
          __throw_bad_alloc();
!       std::size_t __n = 1;
+ #ifdef __GTHREADS
+       if (__gthread_active_p())
+         __n = _S_max_threads + 1;
+ #endif
+
+       for (size_t bin = 0; bin < _S_no_of_bins; bin++)
+         {
            _S_bin[bin].first = (block_record**)
!             malloc(sizeof(block_record*) * __n);
            if (!_S_bin[bin].first)
              __throw_bad_alloc();
            _S_bin[bin].last = (block_record**)
!             malloc(sizeof(block_record*) * __n);
            if (!_S_bin[bin].last)
              __throw_bad_alloc();
! #ifdef __GTHREADS
!           if (__gthread_active_p())
!             {
!               _S_bin[bin].free = (size_t*) malloc(sizeof(size_t) * __n);
!               if (!_S_bin[bin].free)
!                 __throw_bad_alloc();
!               _S_bin[bin].used = (size_t*) malloc(sizeof(size_t) * __n);
!               if (!_S_bin[bin].used)
!                 __throw_bad_alloc();
!               _S_bin[bin].mutex =(__gthread_mutex_t*) malloc(sizeof(__gthread_mutex_t));
  #ifdef __GTHREAD_MUTEX_INIT
!               {
!                 // Do not copy a POSIX/gthr mutex once in use.
!                 __gthread_mutex_t __tmp = __GTHREAD_MUTEX_INIT;
!                 *_S_bin[bin].mutex = __tmp;
!               }
  #else
!               { __GTHREAD_MUTEX_INIT_FUNCTION (_S_bin[bin].mutex); }
  #endif
+             }
  #endif
!           for (size_t thread = 0; thread < __n; thread++)
              {
                _S_bin[bin].first[thread] = NULL;
                _S_bin[bin].last[thread] = NULL;
! #ifdef __GTHREADS
!               if (__gthread_active_p())
!                 {
!                   _S_bin[bin].free[thread] = 0;
!                   _S_bin[bin].used[thread] = 0;
!                 }
! #endif
              }
          }
***************
*** 783,788 ****
--- 821,838 ----
    template<typename _Tp> typename __mt_alloc<_Tp>::bin_record*
    volatile __mt_alloc<_Tp>::_S_bin = NULL;
+
+   template<typename _Tp>
+     inline bool
+     operator==(const __mt_alloc<_Tp>&,
+                const __mt_alloc<_Tp>&)
+     { return true; }
+
+   template<typename _Tp>
+     inline bool
+     operator!=(const __mt_alloc<_Tp>&,
+                const __mt_alloc<_Tp>&)
+     { return false; }
  } // namespace __gnu_cxx

  #endif
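An aside (not from the original thread) on the operator== / operator!= overloads the patch adds at the very end: containers assume that equal allocator objects are interchangeable, i.e. memory obtained through one instance may be released through another. For a stateless allocator such as __mt_alloc that is always true, so operator== returns true unconditionally. A small sketch of where this matters, assuming the patched <ext/mt_allocator.h> is installed:

#include <map>
#include <ext/mt_allocator.h>

typedef std::map<int, int, std::less<const int>,
                 __gnu_cxx::__mt_alloc<std::pair<const int, int> > > mt_map;

void swap_example()
{
    mt_map a, b;
    a[1] = 1;
    b[2] = 2;

    // Containers may exchange nodes only if their allocators are
    // interchangeable; the new operator== says they always are.
    if (a.get_allocator() == b.get_allocator())
        a.swap(b);
}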
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

> Please apply this patch and rerun your test (it would also be great if
> you could tell us whether your real application code sees major
> improvement using this allocator with an MP machine; I know that might
> be a little more work)...

Hi Loren,

well, the patch improved things significantly on the test. For example:

one thread, one cpu: 17.4s
two threads, one cpu: 35.0s
two threads, two cpus: 29.5s

So now, when the two threads are running, the performance is only slightly lower on two cpus than on one, i.e. this is about 4 times faster than the previous version. Of course, it is still unfortunate that it is *slower* with an extra cpu.

I've been working to integrate this into my application to test it properly, and it has been a real hassle! Luckily I use typedefs for every container class type, but I still have 20 std::set types, 58 std::map types, 13 std::hash_map types, 97 std::vector types and 8 std::deque types. I've used the "template typedef" technique described in http://www.gotw.ca/gotw/079.htm, so at least I can now change which allocator is used in one place.

I also make extensive use of std::string, which also (of course) uses the allocator. Redefining a new string type using the mt_allocator has been a major hassle. I'm using lots of other libraries that return and accept ordinary std::strings, and of course the two string types aren't interchangeable. Also I've started defining new instantiations of basic_ostringstream rather than using ostringstream, etc.

There's got to be a better way!

any ideas...?

thanks,
Dan
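A sketch of the "template typedef" technique Dan refers to (the names project_alloc and project_map are illustrative, not from his application): the allocator is spelled in exactly one traits template, and every container typedef goes through it, so switching between __mt_alloc and the default allocator is a one-line change.

#include <map>
#include <utility>
#include <ext/mt_allocator.h>

// Change this one template to switch allocators for the whole project.
template <typename T>
struct project_alloc
{ typedef __gnu_cxx::__mt_alloc<T> type; };

// "Template typedef" for maps: project_map<K, V>::type is the container.
template <typename Key, typename Value>
struct project_map
{
    typedef std::map<Key, Value, std::less<Key>,
                     typename project_alloc<std::pair<const Key, Value> >::type>
        type;
};

// Usage:
//   project_map<int, int>::type m;
//   m[42] = 1;

Analogous traits can wrap set, vector, deque and hash_map; strings are harder, as the next comment discusses, because a basic_string with a different allocator is a distinct type from std::string.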
Subject: Re: Significant performance issue with std::map on multiple threads on dual processor - possibly default allocator

> well, the patch improved things significantly on the test. For example:
> one thread, one cpu: 17.4s
> two threads, one cpu: 35.0s
> two threads, two cpus: 29.5s
[...]

OK, that is good to see. I agree with your analysis; it is unfortunate that some work loads don't scale (with our allocators)...

> I've been working to integrate this into my application to test it properly,
> and it has been a real hassle! Luckily I use typedefs for every container
> class type, but still have 20 std::set types, 58 std::map types, 13
> std::hash_map types, 97 std::vector types and 8 std::deque types. I've used
> the "template typedef" technique described in:
> http://www.gotw.ca/gotw/079.htm so at least I can now change which
> allocator is used in one place.

Sorry about that. All I can say is that reports on whether this allocator helps real applications will help us decide whether it can become the new default allocator.

> I also make extensive use of std::string which also (of course) uses the
> allocator. Redefining a new string type using the mt_allocator has been a
> major hassle. I'm using lots of other libraries that return and accept
> ordinary std::strings, and of course the two strings aren't interchangeable.
> Also I've started defining new instantiations of basic_ostringstream rather
> than using ostringstream, etc.
>
> There's got to be a better way!
>
> any ideas...?

You might try selectively changing allocators in your system. Perhaps try only changing the containers but not strings.

Regards,
Loren