nedmalloc is a VERY fast, VERY scalable, multithreaded memory allocator with
little memory fragmentation. If you're running on an older operating system (e.g.
Windows XP, Linux 2.4 series, FreeBSD 6 series, Mac OS X 10.4 or earlier)
you will probably find it significantly improves your application's
performance (Windows 7, Linux 3.x, FreeBSD 8, Mac OS X 10.6 all contain
state-of-the-art allocators and no third party allocator is likely to
significantly improve on them in real world results). Unlike many other
allocators, it is written in C and so can be used anywhere, and it also comes
under the Boost software license which permits commercial usage. It has been
tested on some very high end hardware with more than eight processing cores and
more than 8GB of
RAM. It is in daily use by some of the world's major banks, root DNS
servers, multinational airlines and consumer products (embedded). It also
costs no money (though donations are welcome!). Thanks to work generously
sponsored by Applied Research
Associates, nedmalloc can patch itself into existing binaries to replace
the system allocator on Windows - for example, Microsoft Word on Windows XP is noticeably
quicker for very large documents after the nedmalloc DLL has been injected
into it!
It is more than 125 times faster than the standard Windows XP memory
allocator, 4-10 times faster than the standard FreeBSD 6 memory allocator and
up to twice as fast as ptmalloc2, the standard Linux memory allocator. It
can sustain a minimum of between 7.3m and 8.2m malloc & free pair
operations per second on a 3400 (2.20GHz) AMD Athlon64 machine.
It scales with extra CPUs far better than either the standard Windows XP
memory allocator or ptmalloc2 and can cause significantly less memory
bloating than ptmalloc2. It avoids processor serialisation (locking)
entirely when the requested memory size is in the thread cache leading to
the kind of scalability you can see in the graph on the right. In real world
code:
 
| Allocator | Memory Mapped | Packetised | nedmalloc's Improvement (Memory Mapped) | nedmalloc's Improvement (Packetised) |
|---|---|---|---|---|
| Win32 (default) | 123.72 | 46.29 | 45.38% | 54.03% |
| nedmalloc v1.02 | 179.87 | 71.3 | - | - |
| nedmalloc v1.01 | 172.47 | 67.9 | 4.29% | 5.01% |
| Win32 (low frag) | 164.28 | 58.74 | 9.49% | 21.38% |
| ptmalloc2 | 167.41 | 63.46 | 7.44% | 12.35% |
| Hoard v3.4 | 167.4 | 64.65 | 7.45% | 10.29% |
If you want an explanation of the difference between the Packetised and
Memory Mapped benchmarks, please see the Tn
homepage (but basically, the Packetised involves performing a lot more
memory ops in a more loaded multithreaded environment). As you can see
above, the benefits of nedmalloc translate into real world code: a speed
increase of about 45% (Memory Mapped) and 54% (Packetised) over the default
Win32 allocator. The Tn speed test is very heavy on the memory bus, so you can
expect your own applications to see greater improvements than this.
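The improvement column can be reproduced from the two score columns. The sketch below assumes that higher scores are better and that each percentage is nedmalloc v1.02's score relative to the other allocator's score; both are inferred from the numbers rather than stated on this page, and the function name is purely illustrative.

```c
/* Percentage by which nedmalloc's benchmark score exceeds another
   allocator's score. Assumes "higher is better", which is inferred
   from the table and not stated explicitly in the text. */
double improvement_percent(double nedmalloc_score, double other_score) {
    return (nedmalloc_score / other_score - 1.0) * 100.0;
}
```

For example, `improvement_percent(179.87, 123.72)` gives roughly 45.38, matching the Memory Mapped figure in the Win32 (default) row.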
See below for a Frequently Asked Questions list. Below
and to the right is a series of comparisons between nedmalloc, system
allocators and a number of other replacement memory allocators such as
tcmalloc and Hoard. The graphs below are for v1.00 and still give a good
idea of performance on a wide variety of systems, but note that nedmalloc
has become much faster in recent revisions (as you can see on the right).
The next generation of memory allocator: the v1.2x series
Since v1.10, and given the outstanding default performance of the Windows
7, Apple Mac OS X 10.6 and FreeBSD 7+ system allocators, nedmalloc has taken
a different approach to improve performance: it has begun to implement
changes to the 1970s malloc API and kernel VM design, both of which
increasingly constrain performance on modern systems.
To my knowledge, nedmalloc is among the fastest portable memory allocators
available, and it has many features and outstanding configurability useful
in themselves. However, it cannot consistently beat the excellent system
allocators in Windows 7, Apple Mac OS X 10.6+ or FreeBSD 7+ (and neither can
any other allocator I know of in real world testing). It isn't any slower than
these allocators, but for now we have plateaued with current API and VM design.
For a next generation API design allocator, see the C1X change
proposal N1527 at
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1527.pdf. Two
reference C implementations of N1527 are also available at
http://github.com/ned14/C1X_N1527. This proposed API substantially
reduces whole program memory allocation latencies, and the ISO C1X committee
have not rejected the idea in principle (they are currently considering
whether to make it into a Technical Specification).
v1.10 beta 1 had a first attempt at an improved malloc API. N1527
introduced a second attempt and, following feedback from the March
2011 ISO C1X committee meeting in London, v1.20 intends to introduce
a third attempt at getting the API right. The committee's idea is
attributed arenas: one creates memory pools which have
certain configurable characteristics. This is fairly complex, but solves a
whole load of problems present and future at once.
For an example of a next generation VM design allocator (which by the way
the new malloc API allows you to use directly through the alignment
and size rounding pool attributes i.e. you set both to the page
size), you can try the user mode
page allocator in nedmalloc v1.10 (Windows Vista or later only). It
opens a whole new world of performance and scalability, but requires
Administrator privileges to run. Want to know more? Here are two academic
papers on the subject:
- Douglas, N, (2011-May), 'User Mode Memory Page Management: An old idea
applied anew to the memory wall problem', ArXiv e-prints, vol: 1105.1815.
- Douglas, N, (2011-May), 'User Mode Memory Page Allocation: A Silver Bullet
For Memory Allocation?', ArXiv e-prints, vol: 1105.1811.
Downloads:

ChangeLog
(from GIT). GIT HEAD (both are identical mirrors):
Current bleeding edge:
v1.10 beta 4 in GIT HEAD.
Current betas:
Beta 3 of v1.10 (455Kb). You should use this in preference to any other
(it's a very mature beta).
Previous:
Beta 2 of v1.06 (svn 1159) (963Kb)
Beta 1 of v1.06 (svn 1151) (957Kb)
v1.05 (svn 1078) of nedmalloc (80Kb)
v1.04 (svn 1040) of nedmalloc (80Kb)
v1.03 of nedmalloc (76.4Kb)
v1.02 of nedmalloc (76.3Kb)
v1.01 of nedmalloc (71.9Kb)
v1.00 of nedmalloc (69.7Kb)
Changes last few releases:
v1.10 beta 3 17th July 2012:
- [master 5f26c1a] Due to a bug introduced
in sha 7a9dd5c (17th April 2010), nedmalloc has never allocated more than a
single mspace when using the system pool. This effectively had disabled
concurrency for any allocation > THREADCACHEMAX (8Kb) which no doubt made
nedmalloc v1.10 betas 1 and 2 appear no faster than system allocators. My
thanks to the eagle eyes of Gavin Lambert for spotting this.
v1.10 beta 2 10th July 2012:
- [master 51ab2a2] scons now tests for C++0x
support before turning it on and tries multiple libraries for clock_gettime()
rather than assuming it lives in librt. This ought to fix miscompilation on
Mac OS X. Thanks to Robert D. Blanchet Jr. for reporting this.
- [master b2c3517] Mac defines malloc_size
to be const void *ptr, not void *ptr
- [master 9333e50] Updated to use the new
O(1) Cfind(rounds=1) feature in nedtries
- [master 54c7e44] Avoid overflowing allocation
size. Thanks to Xi Wang for supplying a patch fixing this.
- [master 5b614a0] Removed __try1 and __finally1
from MinGW support as x64 target no longer supports SEH. Thanks to Geri for
reporting this.
- [master 48f1aa9] Tidied up bitrot which
had broken compilation due to mismatched #if...#endif.
v1.10 beta 1 19th May 2011:
- [master 89f1806] Moved from SVN to GIT. Bumped
version to v1.10 as new ARA contract will involve significant further improvements
mainly centering around realloc() performance.
- [master 254fe7c] Added nedmemsize() for API
compatibility with other allocators. Added DEFAULTMAXTHREADSINPOOL and set it
to FOUR which is a BREAKING CHANGE from previous versions of nedalloc (which
set it to 16).
- [nedmalloc_fast_realloc 97d1420] Added win32mremap()
implementation.
- [nedmalloc_fast_realloc 8a1001e] Significantly
improved test.c with new test options TESTCPLUSPLUS, BLOCKSIZE, TESTTYPE and
MAXMEMORY.
- [nedmalloc_fast_realloc 7ea606d] Implemented
two variants of direct mremap() on Windows, one using file mappings and the
other using over-reservation. The former is used on 32 bit and the latter on
64 bit.
- [nedmalloc_fast_realloc 26ff9a7] Added the
malloc2() interface to nedalloc.
- [nedmalloc_fast_realloc 5bc5d97] Rewrote
Readme.txt to become Readme.html which makes it much clearer to read.
- [nedmalloc_fast_realloc 2efa595] Added doxygen
markup to nedmalloc.h and a first go at a policy driven STL allocator class.
- [nedmalloc_fast_realloc d851bde] Added a
CHM documenting the nedalloc API.
- [nedmalloc_fast_realloc dbd3991] Added a
fast malloc operations logger which outputs a CSV log on process exit.
- [nedmalloc_fast_realloc d6a8585] Added stack
backtracing to the logger.
- [master c7ea06d] Finished user mode page
allocator, so merged nedmalloc_fast_realloc branch.
- [master 9a8800f] Fixed small bug which was
preventing the windows patcher from correctly finding the proper MSVCRT.
- [master 37c58b1] Fixed leak of mutexes when
using pthread or win32 mutexes as locks. Thanks to Gavin Lambert for reporting
this.
- [master f67e284] Fixed nedflushlogs() not
actually flushing data and/or causing a segfault. Thanks to Roman Tatkin for
reporting this.
- [master 1324bf3] Finally got round to retiring
the MSVC project files as they were sources of never ending hassle due to being
out of sync with the SConstruct config. Rebuilt scons build system to be fully
compatible with MSVC instead (long overdue!)
- [master 068494e] As the release of v1.10
RC1 approaches, fixed a long standing problem with the binary patcher where
multiple MSVCRT versions in the process weren't handled - everything was sent
to one MSVCRT only, and needless to say that sorta worked sometimes and sometimes
not. Now when nedmalloc passes a foreign block to the system allocator, it runs
a stack backtrace to figure out what MSVCRT in the process it ought to pass
it to. It's slow, but fixes a very common segfault on process exit on VS2010.
- [master 4cca52c] Very embarrassingly, nedmalloc
has been severely but unpredictably broken on POSIX for over a year now when built with DEBUG defined.
This was turning on DEFAULT_GRANULARITY_ALIGNED whose POSIX implementation
was causing random segfaults so mysterious that neither gdb nor valgrind
could pick them up - in other words, the very worst kind of memory
corruption: undetectable, untraceable and undebuggable. I only found them
myself due to a recent bug report for TnFOX on POSIX where due to luck, very
recent Linux kernels just happened by pure accident to cause this bug to
manifest itself as preventing process init right at the very start - so
early that no debugger could attach. After over a week of trial & error I
narrowed it down to being somewhere in nedmalloc, then having something to
do with DEBUG being defined or not, then two hours ago the eureka moment
arrived and I quite literally did a jig around the room in joy. Problem is
now fixed thank the heavens!!!
- [master 3d55a01] Fixed a problem where the
binary patcher was exiting too early and therefore failing to patch all
the binaries properly. It would seem that the Microsoft linker doesn't sort
the import table like I had thought it did - I would guess it sorts per DLL
location, otherwise is unsorted. Thanks to Roman Tatkin for reporting this bug.
- [master 6c74071] Added override of _GNU_SOURCE
for when HAVE_MREMAP is auto-detected. Thanks to Maxim Zakharov for reporting
this issue.
- [master dee2d27] Marked off the v2 malloc API
as deprecated in preparation for beta release. Updated CHM documentation.
- When should I replace my memory allocator?
If you want your program to run at the maximum possible speed on
operating systems older than Windows 7, Apple Mac OS X 10.6 or FreeBSD 7, or
currently on any version of Linux, you should
consider replacing your memory allocator. Fixing up your code to use a new
memory allocator is usually easy for most C and C++ projects, but can
become tricky if you must maintain compatibility with your system
allocator (you must tag each memory block so you can discern between what
has been allocated by the system and your custom allocator). If you are
running on Windows then nedmalloc can binary patch existing binaries thus
avoiding the need to recompile.
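To make the tagging idea concrete, here is a minimal sketch in C. It is not nedmalloc's own mechanism: `tagged_alloc`, `tagged_free`, `tagged_owner` and the bump-arena `custom_alloc` are hypothetical names invented for illustration. It assumes every block passes through the same wrapper, so each one carries a small header recording which allocator produced it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Which allocator produced a block (values are arbitrary tags). */
enum block_owner { OWNER_SYSTEM = 0x5359, OWNER_CUSTOM = 0x4355 };

/* Header stored immediately before every user pointer. The union pads
   the header so the user pointer keeps maximum alignment. */
typedef union block_header {
    struct { enum block_owner owner; size_t size; } info;
    max_align_t align;
} block_header;

/* Hypothetical stand-in for a custom allocator: a simple bump arena
   that never reuses memory. A real allocator would go here. */
static _Alignas(max_align_t) unsigned char arena[1 << 16];
static size_t arena_used;
static size_t custom_frees; /* counts blocks routed back to the arena */

static void *custom_alloc(size_t bytes) {
    size_t a = _Alignof(max_align_t);
    size_t rounded = (bytes + a - 1) / a * a;
    if (rounded < bytes || rounded > sizeof arena - arena_used) return NULL;
    void *p = arena + arena_used;
    arena_used += rounded;
    return p;
}

static void custom_free(void *p) { (void)p; ++custom_frees; }

/* Allocate from either allocator and tag the block with its owner. */
void *tagged_alloc(size_t bytes, int use_custom) {
    if (bytes > SIZE_MAX - sizeof(block_header)) return NULL;
    size_t total = bytes + sizeof(block_header);
    block_header *h = use_custom ? custom_alloc(total) : malloc(total);
    if (!h) return NULL;
    h->info.owner = use_custom ? OWNER_CUSTOM : OWNER_SYSTEM;
    h->info.size = bytes;
    return h + 1;
}

/* Report which allocator owns a block returned by tagged_alloc(). */
enum block_owner tagged_owner(const void *p) {
    return ((const block_header *)p - 1)->info.owner;
}

/* Read the tag and hand the block back to the allocator that made it. */
void tagged_free(void *p) {
    if (!p) return;
    block_header *h = (block_header *)p - 1;
    assert(h->info.owner == OWNER_SYSTEM || h->info.owner == OWNER_CUSTOM);
    if (h->info.owner == OWNER_CUSTOM) custom_free(h);
    else free(h);
}
```

The cost of this approach is one header per block, and it only works for pointers that came from the wrapper; a pointer obtained directly from `malloc()` has no header and must not be passed to `tagged_free()`.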
- Is nedmalloc faster than all other memory allocators?
No, there are faster ones, especially for specialised circumstances, e.g.
tcmalloc, which never returns memory to the system and so can forego free
space coalescing (tcmalloc isn't suitable as a general purpose allocator).
However, nedmalloc is an excellent general-purpose allocator and it is based
on dlmalloc, one of the most tried & tested memory allocators available, as it
is also the basis of ptmalloc2, the standard Linux memory allocator. If you use
nedmalloc, you will never be far from the best performing specialised
allocator. As you
might note in the real world benchmarks above, you get severely
diminishing returns to allocator improvement once they get into a certain
performance range.
- How space-efficient is nedmalloc?
dlmalloc does not fragment the memory space as much as other
allocators, but it does have a sixteen or thirty-two byte minimum
allocation with an eight or sixteen byte granularity. nedmalloc's thread
cache is a simple power-of-two allocator which does cause bloating for items
small enough to enter the thread cache (by default, 8Kb or less) but in
general, this wastage across the entire program is small. You can
configure nedmalloc to use finer grained bins to quarter the average
wastage but this comes at a performance cost. When configured to only
permit one memory space per thread, memory bloating is considerably less
than that of ptmalloc2.
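The bloating from power-of-two binning is easy to quantify. The following is a rough sketch, not nedmalloc's actual code: it assumes each cached request is simply rounded up to the next power of two, with the 8Kb default limit quoted above. The function names are illustrative only.

```c
#include <stddef.h>

#define THREADCACHEMAX 8192 /* default thread cache limit quoted above (8Kb) */

/* Returns the power-of-two bin a request would land in, or 0 when the
   request is empty or too large for the thread cache. */
size_t bin_size_for(size_t request) {
    if (request == 0 || request > THREADCACHEMAX) return 0;
    size_t bin = 1;
    while (bin < request) bin <<= 1;
    return bin;
}

/* Bytes lost to rounding for a cached request (0 if not cached). */
size_t wasted_bytes(size_t request) {
    size_t bin = bin_size_for(request);
    return bin ? bin - request : 0;
}
```

Under these assumptions a 100 byte request lands in a 128 byte bin and wastes 28 bytes, while a request just over a power of two (say 4097 bytes) wastes nearly half of its 8192 byte bin.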
- Is tcmalloc better or worse than nedmalloc?
As you can see in the graph above, nedmalloc is about equal to
tcmalloc for threadcache-only ops and substantially beats it for non-threadcache
ops. nedmalloc is also written in C rather than C++ and v0.5 of tcmalloc
only works on Unix systems and not win32. tcmalloc achieves its speed by
never returning memory to the system - free space reclamation is one of
the slowest parts of any allocator. Therefore tcmalloc should NOT be used
outside long running server processes (and indeed its own docs say the
same).
- Is Hoard better or worse than nedmalloc?
As of v1.01, nedmalloc is close enough to Hoard to make little
difference in real world code (see real world benchmarks above).
nedmalloc's synthetic test seems to trigger a bug in Hoard causing dismal
performance, however I trust its author and its design enough to say that
Hoard may be slightly faster in certain circumstances, e.g. if code
allocates a large block in one thread and frees it in another. However,
Hoard is licensed under the GPL unless you pay which is not the case with
nedmalloc.
- Is ptmalloc3 better or worse than nedmalloc?
ptmalloc3 is also a new implementation of ptmalloc2 and is also based
on a newer dlmalloc. ptmalloc3 currently outperforms nedmalloc for a low
number of threads especially on uniprocessor hardware, but on dual
processor and above or with a lot of threads nedmalloc is faster.
nedmalloc also runs fine on Windows whereas ptmalloc3 would (to my
knowledge) require extra support code.
- Is jemalloc better or worse than nedmalloc?
Good question!
There are many similarities between the designs, and like nedmalloc
jemalloc keeps changing its internals over time so whatever I say
here is likely out of date! Last time I looked, jemalloc uses
red-black trees internally which are considerably slower than
binary bitwise trees. On the other hand,
jemalloc has the big advantage of a fully integrated threadcache
whereas nedmalloc's is literally bolted on on top of dlmalloc and
its lack of integration does cost a few percent of performance (but
eases my maintenance). jemalloc allocates small blocks more tightly
and therefore wastes less memory, but this can introduce cache line
sloshing when multiple CPU cores are writing to the same cache line.
jemalloc is generally developed on Linux and Mac OS X first and
Windows after, whereas I target Windows first due to its
popularity and the others after. nedmalloc definitely is more
experimental with C1X N1527 support (though I'd love it if Jason added this
too - hint hint!). In short, I doubt you'll find ANY performance
difference in real world code.
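The cache line sloshing trade-off mentioned above can be illustrated with two layouts. This is a generic C11 sketch, not code from either allocator, and the 64 byte line size is an assumption that should be checked against the target CPU.

```c
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size; check your target CPU */

/* Tightly packed: both counters usually share one cache line, so two
   cores writing them keep pulling the line away from each other. */
struct packed_counters {
    unsigned long a;
    unsigned long b;
};

/* Padded: each counter is aligned to its own assumed cache line, which
   trades memory for fewer cross-core invalidations. */
struct padded_counters {
    _Alignas(CACHE_LINE) unsigned long a;
    _Alignas(CACHE_LINE) unsigned long b;
};
```

The padded layout occupies two full lines for two small counters, which is the same memory-versus-contention trade-off that tighter small-block packing makes in the other direction.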