| 1 | +nedalloc v1.05 15th June 2008: |
| 2 | +-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= |
| 3 | + |
| 4 | +by Niall Douglas (http://www.nedprod.com/programs/portable/nedmalloc/) |
| 5 | + |
| 6 | +Enclosed is nedalloc, an alternative malloc implementation for multiple |
| 7 | +threads without lock contention based on dlmalloc v2.8.4. It is more |
| 8 | +or less a newer implementation of ptmalloc2, the standard allocator in |
| 9 | +Linux (which is based on dlmalloc v2.7.0) but also contains a per-thread |
| 10 | +cache for maximum CPU scalability. |
| 11 | + |
| 12 | +It is licensed under the Boost Software License which basically means |
| 13 | +you can do anything you like with it. This does not apply to the malloc.c.h |
| 14 | +file which remains copyright to others. |
| 15 | + |
| 16 | +It has been tested on win32 (x86), win64 (x64), Linux (x64), FreeBSD (x64) |
| 17 | +and Apple MacOS X (x86). It works well on all of them and is |
| 18 | +significantly faster than the system allocator on each of these platforms. |
| 19 | + |
| 20 | +Simply dropping in this allocator as a replacement for your system |
| 21 | +allocator can yield real world speedups of up to three times in normal |
| 22 | +code! |
| 23 | + |
| 24 | +To use: |
| 25 | +-=-=-=- |
| 26 | +Drop nedmalloc.h, nedmalloc.c and malloc.c.h into your project. |
| 27 | +Configure using the instructions in nedmalloc.h. Run and enjoy. |
| 28 | + |
| 29 | +To test, compile test.c. It will run a comparison between your system |
| 30 | +allocator and nedalloc and tell you how much faster nedalloc is. It also |
| 31 | +serves as an example of usage. |
| 32 | + |
| 33 | +Notes: |
| 34 | +-=-=-= |
| 35 | +If you want the very latest version of this allocator, get it from the |
| 36 | +TnFOX SVN repository at svn://svn.berlios.de/viewcvs/tnfox/trunk/src/nedmalloc |
| 37 | + |
| 38 | +Because of how nedalloc allocates an mspace per thread, it can cause |
| 39 | +severe bloating of memory usage under certain allocation patterns. |
| 40 | +You can substantially reduce this wastage by setting MAXTHREADSINPOOL |
| 41 | +or the threads parameter to nedcreatepool() to a fraction of the number of |
| 42 | +threads which would normally be in a pool at once. This will reduce |
| 43 | +bloating at the cost of an increase in lock contention. For allocations |
| 44 | +smaller than THREADCACHEMAX, locking is avoided 90-99% of the time, so |
| 45 | +if most of your allocations are below this value, you can safely set |
| 46 | +MAXTHREADSINPOOL to one. |
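
As a rough illustration of the tuning described above, the two settings could be fixed before nedmalloc.c is compiled, for example in a shared configuration header or as equivalent -D compiler options. The values are illustrative assumptions rather than recommendations, and the sanity check at the end is an addition of this sketch, not something nedalloc is known to enforce:

```c
/* Illustrative values only; measure your own allocation sizes first. */
#ifndef THREADCACHEMAX
#define THREADCACHEMAX 8192    /* assumed cutoff: smaller requests use the thread cache */
#endif
#ifndef MAXTHREADSINPOOL
#define MAXTHREADSINPOOL 1     /* fewest mspaces: least bloat, most lock contention */
#endif

/* Sanity check added for this sketch; nedalloc itself may not enforce it. */
#if MAXTHREADSINPOOL < 1
#error "MAXTHREADSINPOOL must be at least 1"
#endif
```

Setting MAXTHREADSINPOOL to one only makes sense if most requests really are smaller than THREADCACHEMAX; otherwise every thread ends up contending for the same lock.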
| 47 | + |
| 48 | +You will suffer memory leakage unless you call neddisablethreadcache() |
| 49 | +per pool for every thread which exits. This is because nedalloc cannot |
| 50 | +portably know when a thread exits and thus when its thread cache can |
| 51 | +be returned for use by other code. Don't forget pool zero, the system pool. |
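
The cleanup described above might be wrapped in a small helper that each thread calls just before it exits. This is a minimal sketch, not code from nedalloc: release_thread_caches() and the HAVE_NEDMALLOC build flag are hypothetical names, the stand-in stub exists only so the sketch compiles without the library, and passing a null pool pointer to mean the system pool is an assumption to check against nedmalloc.h.

```c
#include <assert.h>
#include <stddef.h>

#ifdef HAVE_NEDMALLOC              /* hypothetical flag, not defined by nedalloc */
#include "nedmalloc.h"
#else
/* Stand-ins so the sketch builds on its own; the real ones are in nedmalloc.h. */
typedef struct nedpool_t nedpool;
static int disable_calls = 0;      /* counts calls made to the stub */
static void neddisablethreadcache(nedpool *p) { (void)p; ++disable_calls; }
#endif

/* Hypothetical helper: call at the end of every thread that used nedalloc.
   pools lists each pool the thread allocated from, other than the system pool. */
static void release_thread_caches(nedpool **pools, size_t npools)
{
    size_t i;
    assert(pools != NULL || npools == 0);
    neddisablethreadcache(NULL);   /* assumed to mean pool zero, the system pool */
    for (i = 0; i < npools; ++i)
        neddisablethreadcache(pools[i]);
}
```

A thread function would call release_thread_caches(my_pools, count) as its last step, so the call runs on the exiting thread itself, once per pool plus once for the system pool.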
| 52 | + |
| 53 | +For C++ type allocation patterns (where the same sizes of memory are |
| 54 | +regularly allocated and deallocated as objects are created and destroyed), |
| 55 | +the threadcache always benefits performance. If however your allocation |
| 56 | +patterns are different, searching the threadcache may significantly slow |
| 57 | +down your code - as a rule of thumb, if cache utilisation is below 80% |
| 58 | +(see the source for neddisablethreadcache() for how to enable debug |
| 59 | +printing in release mode) then you should disable the thread cache for |
| 60 | +that thread. You can compile out the threadcache code by setting |
| 61 | +THREADCACHEMAX to zero. |
| 62 | + |
| 63 | +Speed comparisons: |
| 64 | +-=-=-=-=-=-=-=-=-= |
| 65 | +See Benchmarks.xls for details. |
| 66 | + |
| 67 | +The enclosed test.c can do two things: it can be a torture test or a speed |
| 68 | +test. The speed test is designed to be a representative synthetic |
| 69 | +memory allocator test. It works by randomly mixing allocations with frees, |
| 70 | +with half of the allocation sizes being a power of two less than |
| 71 | +512 bytes (to mimic C++ stack instantiated objects) and the other half |
| 72 | +being a simple random value less than 16KB. |
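
One way to read the size mix described above is sketched here. This is not the code from test.c: pick_alloc_size() is a hypothetical helper, and the exact ranges (powers of two from 1 to 256, other sizes from 1 to 16383 bytes) are assumptions drawn from the prose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns one allocation size following the described mix. */
static size_t pick_alloc_size(void)
{
    size_t size;
    if (rand() % 2) {
        /* Half the time: a power of two below 512, i.e. 1, 2, 4, ..., 256. */
        size = (size_t)1 << (rand() % 9);
    } else {
        /* Other half: a plain random size from 1 to 16383 bytes. */
        size = (size_t)(rand() % (16 * 1024 - 1)) + 1;
    }
    assert(size >= 1 && size < 16 * 1024);
    return size;
}
```

A benchmark loop would call srand() once with a fixed seed so that both allocators see the same sequence of sizes, then feed each result of pick_alloc_size() to the allocator under test.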
| 73 | + |
| 74 | +The real world code results are from Tn's TestIO benchmark. This is a |
| 75 | +heavily multithreaded and memory intensive benchmark with a lot of branching |
| 76 | +and other stuff modern processors don't like so much. As you'll note, the |
| 77 | +test doesn't show the benefits of the threadcache mostly due to the saturation |
| 78 | +of the memory bus being the limiting factor. |
| 79 | + |
| 80 | +ChangeLog: |
| 81 | +-=-=-=-=-= |
| 82 | +v1.05 15th June 2008: |
| 83 | + * { 1042 } Added error check for TLSSET() and TLSFREE() macros. Thanks to |
| 84 | +Markus Elfring for reporting this. |
| 85 | + * { 1043 } Fixed a segfault when freeing memory allocated using |
| 86 | +nedindependent_comalloc(). Thanks to Pavel Vozenilek for reporting this. |
| 87 | + |
| 88 | +v1.04 14th July 2007: |
| 89 | + * Fixed a bug with the new optimised implementation that failed to lock |
| 90 | +on a realloc under certain conditions. |
| 91 | + * Fixed lack of thread synchronisation in InitPool() causing pool corruption |
| 92 | + * Fixed a memory leak of thread cache contents on disabling. Thanks to Earl |
| 93 | +Chew for reporting this. |
| 94 | + * Added a sanity check for freed blocks being valid. |
| 95 | + * Reworked test.c into being a torture test. |
| 96 | + * Fixed GCC assembler optimisation misspecification |
| 97 | + |
| 98 | +v1.04alpha_svn915 7th October 2006: |
| 99 | + * Fixed failure to unlock thread cache list if allocating a new list failed. |
| 100 | +Thanks to Dmitry Chichkov for reporting this. Further thanks to Aleksey Sanin. |
| 101 | + * Fixed realloc(0, <size>) segfaulting. Thanks to Dmitry Chichkov for |
| 102 | +reporting this. |
| 103 | + * Made config defines #ifndef so they can be overridden by the build system. |
| 104 | +Thanks to Aleksey Sanin for suggesting this. |
| 105 | + * Fixed deadlock in nedprealloc() due to unnecessary locking of preferred |
| 106 | +thread mspace when mspace_realloc() always uses the original block's mspace |
| 107 | +anyway. Thanks to Aleksey Sanin for reporting this. |
| 108 | + * Made some speed improvements by hacking mspace_malloc() to no longer lock |
| 109 | +its mspace, thus allowing the recursive mutex implementation to be removed |
| 110 | +with an associated speed increase. Thanks to Aleksey Sanin for suggesting this. |
| 111 | + * Fixed a bug where allocating mspaces overran their maximum limit. Thanks to |
| 112 | +Aleksey Sanin for reporting this. |
| 113 | + |
| 114 | +v1.03 10th July 2006: |
| 115 | + * Fixed memory corruption bug in threadcache code which only appeared with >4 |
| 116 | +threads and in heavy use of the threadcache. |
| 117 | + |
| 118 | +v1.02 15th May 2006: |
| 119 | + * Integrated dlmalloc v2.8.4, fixing the win32 memory release problem and |
| 120 | +improving performance still further. Speed is now up to twice the speed of v1.01 |
| 121 | +(average is 67% faster). |
| 122 | + * Fixed win32 critical section implementation. Thanks to Pavel Kuznetsov |
| 123 | +for reporting this. |
| 124 | + * Wasn't locking mspace if all mspaces were locked. Thanks to Pavel Kuznetsov |
| 125 | +for reporting this. |
| 126 | + * Added Apple Mac OS X support. |
| 127 | + |
| 128 | +v1.01 24th February 2006: |
| 129 | + * Fixed multiprocessor scaling problems by removing sources of cache sloshing |
| 130 | + * Earl Chew <earl_chew <at> agilent <dot> com> sent patches for the following: |
| 131 | + 1. size2binidx() wasn't working for default code path (non x86) |
| 132 | + 2. Fixed failure to release mspace lock under certain circumstances which |
| 133 | + caused a deadlock |
| 134 | + |
| 135 | +v1.00 1st January 2006: |
| 136 | + * First release |