MAT is a memory analysis tool that monitors calls to malloc, realloc and free calls. It works with a small shared library libmat.so
that is loaded into the program with the LD_PRELOAD
dynamic linker feature (See the ld.so(8) man page). The library overrides the malloc, realloc and free function to monitor calls to these functions. It then writes or sends probe events which contain enough information for mat to tell what, when, where and by whom the memory allocation was done.
Tag - memory
Using MAT the Memory Analysis Tool
By Stephane Carrez2015-05-25 09:51:00
Optimization with Valgrind Massif and Cachegrind
By Stephane Carrez2013-03-02 22:51:00
Memory optimization reveals sometimes some nice surprise. I was interested to analyze the memory used by the Ada Server Faces framework. For this I've profiled the unit tests program. This includes 130 tests that cover almost all the features of the framework.
Memory analysis with Valgrind Massif
Massif is a Valgrind tool that is used for heap analysis. It does not require the application to be re-compiled and can be used easily. The application is executed by using Valgrind and its tool Massif. The command that I've used was:
valgrind --tool=massif --threshold=0.1 \
--detailed-freq=1 --alloc-fn=__gnat_malloc \
bin/asf_harness -config test.properties
The valgrind tool creates a file massif.out.NNN
which contains the analysis. The massif-visualizer
is a graphical tool that reads the file and allows you to analyze the results. It is launched as follows:
massif-visualizer massif.out.19813
(the number is the pid of the process that was running, replace it accordingly).
The tool provides a graphical representation of memory used over the time. It allows to highlight a given memory snapshot and understand roughly where the memory is used.
While looking at the result, I was intrigued by a 1MB allocation that was made several times and then released (It creates these visual spikes and it correspond to the big red horizontal bar that appears visually). It was within the sax-utils.adb file that is part of the XML/Ada library. Looking at the implementation, it turns out that it allocates a hash table with 65536 entries. This allocation is done each time the sax parser is created. I've reduced the size of this hash table to 1024 entries. If you want to do it, change the following line in sax/sax-symbols.ads
(line 99):
Hash_Num : constant := 2**16;
by:
Hash_Num : constant := 2**10;
After building, checking there is no regression (yes, it works), I've re-run the Massif tool and here are the results.
The peak memory was reduced from 2.7Mb to 2.0Mb. The memory usage is now easier to understand and analyse because the 1Mb allocation is gone. Other memory allocations have more importance now. But wait. There is more! My program is now faster!
Cache analysis with cachegrind
To understand why the program is now faster, I've used Cachegrind that measures processor cache performance. Cachegrind is a cache and branch-prediction profiler provided by Valgrind as another tool. I've executed the tool with the following command:
valgrind --tool=cachegrind \
bin/asf_harness -config test.properties
I've launched it once before the hash table correction and once after. Similar to Massif, Cachegrind generates a file cachgrind.NNN
that contains the analysis. You analyze the result by using either cg_annotate
or kcachegrind
. Having two Cachegrind files, I've used cg_diff
to somehow get diff between the two executions.
cg_diff cachegrind.out.24198 cachegrind.out.23286 > cg.out.1
cg_annotate cg.out.1
Before the fix, we can see in Cachegrind report that the most intensive memory operations are performed by Sax.Htable.Reset
operation and by the GNAT operation that initializes the Sax.Symbols.Symbol_Table_Record
type which contains the big hash table. Dr
is the number of data reads, D1mr
the L1 cache read miss and Dw
is the number of writes with D1mw
representing the L1 cache write miss. Having a lot of cache miss will slow down the execution: L1 cache access requires a few cycles while main memory access could cost several hundreds of them.
--------------------------------------------------------------------------------
Dr D1mr Dw D1mw
--------------------------------------------------------------------------------
212,746,571 2,787,355 144,880,212 2,469,782 PROGRAM TOTALS
--------------------------------------------------------------------------------
Dr D1mr Dw D1mw file:function
--------------------------------------------------------------------------------
25,000,929 2,081,943 27,672 244 sax/sax-htable.adb:sax__symbols__string_htable__reset
508 127 33,293,050 2,080,768 sax/sax-htable.adb:sax__symbols__symbol_table_recordIP
43,894,931 129,786 7,532,775 8,677 ???:???
15,021,128 4,140 5,632,923 0 pthread_getspecific
7,510,564 2,995 7,510,564 10,673 ???:system__task_primitives__operations__specific__selfXnn
6,134,652 41,357 4,320,817 49,207 _int_malloc
4,774,547 22,969 1,956,568 4,392 _int_free
3,753,930 0 5,630,895 5,039 ???:system__task_primitives__operations(short,...)(long, float)
With a smaller hash table, the Cachegrind report indicates a reduction of 24,543,482 data reads and 32,765,323 data writes. The cache read miss was reduced by 2,086,579 (74%) and the cache write miss was also reduced by 2,056,247 (83% reduction!).
With a small hash table, the Sax.Symbols.Symbol_Table_Record
gets initialized quicker and its cleaning needs less memory accesses, hence CPU cycles. By having a smaller hash table, we also benefit from less cache miss: using a 1Mb hash table flushes a big part of the data cache.
--------------------------------------------------------------------------------
Dr D1mr Dw D1mw
--------------------------------------------------------------------------------
188,203,089 700,776 112,114,889 413,535 PROGRAM TOTALS
--------------------------------------------------------------------------------
Dr D1mr Dw D1mw file:function
--------------------------------------------------------------------------------
43,904,760 120,883 7,532,577 8,407 ???:???
15,028,328 98 5,635,623 0 pthread_getspecific
7,514,164 288 7,514,164 9,929 ???:system__task_primitives__operations__specific__selfXnn
6,129,019 39,636 4,305,043 48,446 _int_malloc
4,784,026 18,626 1,959,387 3,261 _int_free
3,755,730 0 5,633,595 4,390 ???:system__task_primitives__operations(short,...)(long, float)
2,418,778 65 2,705,140 14 ???:system__tasking__initialization__abort_undefer
3,839,603 2,605 1,283,289 0 malloc
Conclusion
Running massif and cachegrind is very easy but it may take some time to figure out how to understand and use the results. A big hash table is not always a good thing for an application. By creating cache misses it may in fact slow down the application. To learn more about this subject, I recommend the excellent document What Every Programmer Should Know About Memory written by Ulrich Drepper.
Tags
- Facelet
- NetBSD
- framework
- Mysql
- generator
- files
- application
- gcc
- ReadyNAS
- Security
- binutils
- ELF
- JSF
- Java
- bacula
- Tutorial
- Apache
- COFF
- collaboration
- planning
- project
- upgrade
- AWA
- C
- EL
- J2EE
- UML
- php
- symfony
- Ethernet
- Ada
- FreeBSD
- Go
- KVM
- MDE
- Proxy
- STM32
- Servlet
- backup
- lvm
- multiprocessing
- web
- Bean
- Jenkins
- release
- OAuth
- ProjectBar
- REST
- Rewrite
- Sqlite
- Storage
- USB
- Ubuntu
- bison
- cache
- crash
- Linux
- firefox
- performance
- interview