Blog posts - Tag memory

Tag - memory

Using MAT the Memory Analysis Tool

By Stephane Carrez2015-05-25 09:51:00

MAT is a memory analysis tool that monitors calls to malloc, realloc and free calls. It works with a small shared library libmat.so that is loaded into the program with the LD_PRELOAD dynamic linker feature (See the ld.so(8) man page). The library overrides the malloc, realloc and free function to monitor calls to these functions. It then writes or sends probe events which contain enough information for mat to tell what, when, where and by whom the memory allocation was done.

To add a comment, you must be connected. Login to add a comment

Optimization with Valgrind Massif and Cachegrind

By Stephane Carrez2013-03-02 22:51:00

Memory optimization reveals sometimes some nice surprise. I was interested to analyze the memory used by the Ada Server Faces framework. For this I've profiled the unit tests program. This includes 130 tests that cover almost all the features of the framework.

Memory analysis with Valgrind Massif

Massif is a Valgrind tool that is used for heap analysis. It does not require the application to be re-compiled and can be used easily. The application is executed by using Valgrind and its tool Massif. The command that I've used was:

valgrind --tool=massif --threshold=0.1 \
   --detailed-freq=1 --alloc-fn=__gnat_malloc \
   bin/asf_harness -config test.properties

The valgrind tool creates a file massif.out.NNN which contains the analysis. The massif-visualizer is a graphical tool that reads the file and allows you to analyze the results. It is launched as follows:

massif-visualizer massif.out.19813

(the number is the pid of the process that was running, replace it accordingly).

The tool provides a graphical representation of memory used over the time. It allows to highlight a given memory snapshot and understand roughly where the memory is used.

While looking at the result, I was intrigued by a 1MB allocation that was made several times and then released (It creates these visual spikes and it correspond to the big red horizontal bar that appears visually). It was within the sax-utils.adb file that is part of the XML/Ada library. Looking at the implementation, it turns out that it allocates a hash table with 65536 entries. This allocation is done each time the sax parser is created. I've reduced the size of this hash table to 1024 entries. If you want to do it, change the following line in sax/sax-symbols.ads (line 99):

   Hash_Num : constant := 2**16;

by:

   Hash_Num : constant := 2**10;

After building, checking there is no regression (yes, it works), I've re-run the Massif tool and here are the results.

The peak memory was reduced from 2.7Mb to 2.0Mb. The memory usage is now easier to understand and analyse because the 1Mb allocation is gone. Other memory allocations have more importance now. But wait. There is more! My program is now faster!

Cache analysis with cachegrind

To understand why the program is now faster, I've used Cachegrind that measures processor cache performance. Cachegrind is a cache and branch-prediction profiler provided by Valgrind as another tool. I've executed the tool with the following command:

valgrind --tool=cachegrind \
    bin/asf_harness -config test.properties

I've launched it once before the hash table correction and once after. Similar to Massif, Cachegrind generates a file cachgrind.NNN that contains the analysis. You analyze the result by using either cg_annotate or kcachegrind. Having two Cachegrind files, I've used cg_diff to somehow get diff between the two executions.

cg_diff cachegrind.out.24198 cachegrind.out.23286 > cg.out.1
cg_annotate cg.out.1

Before the fix, we can see in Cachegrind report that the most intensive memory operations are performed by Sax.Htable.Reset operation and by the GNAT operation that initializes the Sax.Symbols.Symbol_Table_Record type which contains the big hash table. Dr is the number of data reads, D1mr the L1 cache read miss and Dw is the number of writes with D1mw representing the L1 cache write miss. Having a lot of cache miss will slow down the execution: L1 cache access requires a few cycles while main memory access could cost several hundreds of them.

--------------------------------------------------------------------------------
         Dr      D1mr          Dw      D1mw 
--------------------------------------------------------------------------------
212,746,571 2,787,355 144,880,212 2,469,782  PROGRAM TOTALS

--------------------------------------------------------------------------------
        Dr      D1mr         Dw      D1mw  file:function
--------------------------------------------------------------------------------
25,000,929 2,081,943     27,672       244  sax/sax-htable.adb:sax__symbols__string_htable__reset
       508       127 33,293,050 2,080,768  sax/sax-htable.adb:sax__symbols__symbol_table_recordIP
43,894,931   129,786  7,532,775     8,677  ???:???
15,021,128     4,140  5,632,923         0  pthread_getspecific
 7,510,564     2,995  7,510,564    10,673  ???:system__task_primitives__operations__specific__selfXnn
 6,134,652    41,357  4,320,817    49,207  _int_malloc
 4,774,547    22,969  1,956,568     4,392  _int_free
 3,753,930         0  5,630,895     5,039  ???:system__task_primitives__operations(short,...)(long, float)

With a smaller hash table, the Cachegrind report indicates a reduction of 24,543,482 data reads and 32,765,323 data writes. The cache read miss was reduced by 2,086,579 (74%) and the cache write miss was also reduced by 2,056,247 (83% reduction!).

With a small hash table, the Sax.Symbols.Symbol_Table_Record gets initialized quicker and its cleaning needs less memory accesses, hence CPU cycles. By having a smaller hash table, we also benefit from less cache miss: using a 1Mb hash table flushes a big part of the data cache.

--------------------------------------------------------------------------------
         Dr    D1mr          Dw    D1mw 
--------------------------------------------------------------------------------
188,203,089 700,776 112,114,889 413,535  PROGRAM TOTALS

--------------------------------------------------------------------------------
        Dr    D1mr        Dw   D1mw  file:function
--------------------------------------------------------------------------------
43,904,760 120,883 7,532,577  8,407  ???:???
15,028,328      98 5,635,623      0  pthread_getspecific
 7,514,164     288 7,514,164  9,929  ???:system__task_primitives__operations__specific__selfXnn
 6,129,019  39,636 4,305,043 48,446  _int_malloc
 4,784,026  18,626 1,959,387  3,261  _int_free
 3,755,730       0 5,633,595  4,390  ???:system__task_primitives__operations(short,...)(long, float)
 2,418,778      65 2,705,140     14  ???:system__tasking__initialization__abort_undefer
 3,839,603   2,605 1,283,289      0  malloc

Conclusion

Running massif and cachegrind is very easy but it may take some time to figure out how to understand and use the results. A big hash table is not always a good thing for an application. By creating cache misses it may in fact slow down the application. To learn more about this subject, I recommend the excellent document What Every Programmer Should Know About Memory written by Ulrich Drepper.