Using memory efficiently in a parallel application can be hard, especially on NUMA machines. This trace visualization is designed to help you understand your application's memory behavior and find ways to improve it.
Before showing any actual visualizations, here are a few reminders on efficient memory access:
Split the memory
If memory is accessed correctly, groups of threads should appear, each containing from 1 thread up to the maximum number of threads per NUMA node of the experimental machine. A group of threads is a set of threads accessing (mostly) the same set of pages. Moreover, the average number of accesses should be roughly the same for every thread and for every page.
On NUMA machines, it is also important to think about how memory is distributed over the nodes: threads working on the same data should be on the same node to avoid remote accesses. Data should be mapped close to the threads using them, and both the data and the threads should be spread over the nodes to avoid memory contention. If a part of the memory is used by every thread, it should be either duplicated (for a relatively small, read-only structure) or interleaved among the nodes, for the same reasons.
If some of your structures are used by (almost) every thread, you should obtain better performance with one of the following solutions:
Divide the structure
Modify your code so that each thread works on a small part of the structure, then pin threads working on neighbouring data to the same NUMA node (or use an automated tool to do so).
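As an illustration, here is a minimal sketch in C of one way to divide an array into contiguous chunks, one per thread, and to decide which node each thread should be pinned to. The names chunk_t, chunk_for_thread and node_for_thread are hypothetical, and the sketch assumes that threads are packed onto nodes in thread-id order; it does not perform the pinning itself.

```c
#include <stddef.h>

/* Half-open index range [begin, end) owned by one thread. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Split n elements into nthreads contiguous chunks whose sizes differ by at
 * most one element. Thread t owns chunk t, so threads with neighbouring ids
 * work on neighbouring data. */
chunk_t chunk_for_thread(size_t n, unsigned nthreads, unsigned t)
{
    size_t base = n / nthreads;   /* minimum chunk size */
    size_t extra = n % nthreads;  /* the first `extra` threads get one more */
    chunk_t c;
    c.begin = (size_t)t * base + (t < extra ? t : extra);
    c.end = c.begin + base + (t < extra ? 1 : 0);
    return c;
}

/* Node on which thread t should be pinned, ASSUMING threads are packed onto
 * nodes in id order, threads_per_node threads on each node. */
unsigned node_for_thread(unsigned t, unsigned threads_per_node)
{
    return t / threads_per_node;
}
```

With 10 elements and 4 threads, the chunks are [0, 3), [3, 6), [6, 8) and [8, 10). The pinning itself depends on the runtime; with OpenMP, for instance, the OMP_PLACES and OMP_PROC_BIND environment variables can be used for that purpose.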
Interleave
You can distribute the pages of the structure among the NUMA nodes (interleaving) to balance the memory bandwidth.
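To make the effect of interleaving concrete, the following C sketch models a round-robin placement of pages over nodes. It is only a model under stated assumptions (the structure starts on a page boundary and page 0 goes to node 0); the function names are hypothetical and nothing is actually allocated. On Linux, an interleaved policy can be requested for real with, for example, numactl --interleave=all or the libnuma function numa_alloc_interleaved.

```c
#include <stddef.h>

/* Node holding a given page under a round-robin interleaving model. */
unsigned interleaved_node(size_t page_index, unsigned num_nodes)
{
    return (unsigned)(page_index % num_nodes);
}

/* Number of pages of a structure of `bytes` bytes that land on `node`,
 * ASSUMING the structure starts on a page boundary. */
size_t pages_on_node(size_t bytes, size_t page_size,
                     unsigned num_nodes, unsigned node)
{
    size_t pages = (bytes + page_size - 1) / page_size;  /* round up */
    size_t count = pages / num_nodes;
    if (node < pages % num_nodes)
        count++;  /* leftover pages go to the lowest-numbered nodes */
    return count;
}
```

In this model, no node holds more than one page more than any other, so accesses spread uniformly over the structure are spread evenly over the memory controllers.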
Duplicate
If the structure is only read and is small enough, each thread can work on a local copy.
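A minimal sketch of duplication in C could look as follows. The function names are hypothetical. The sketch assumes that duplicate_table is called by the thread that will use the copy, and that the allocator returns memory whose pages have not been touched by another thread, so that the copy ends up near the calling thread; neither point is guaranteed by malloc itself.

```c
#include <stdlib.h>
#include <string.h>

/* Make a private copy of a read-only table. Meant to be called by the
 * thread that will read the copy. Returns NULL if allocation fails. */
double *duplicate_table(const double *shared, size_t n)
{
    double *local = malloc(n * sizeof *local);
    if (local != NULL)
        memcpy(local, shared, n * sizeof *local);
    return local;
}

/* Example worker body: sum the table through the local copy, falling back
 * to the shared table if the copy could not be allocated. */
double sum_with_local_copy(const double *shared, size_t n)
{
    double *local = duplicate_table(shared, n);
    const double *table = (local != NULL) ? local : shared;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += table[i];
    free(local);  /* free(NULL) is a no-op */
    return sum;
}
```

The copy costs one pass over the structure per thread, which is why this approach is only worthwhile for small structures that are read many times.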
Mapping policy
By default, recent operating systems map a memory page close to the first thread that accesses it; this policy is called first touch. Therefore, either the first touch should be done by the thread actually using the data, or an explicit mapping should be made (manually or via an external or automated tool).
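The usual way to take advantage of first touch is to initialise the data in parallel, with the same distribution of work as in the compute phase. The C sketch below uses OpenMP for this; the function names are hypothetical, and it assumes that both loops run with the same number of threads and that threads stay pinned between the two phases.

```c
#include <stdlib.h>

/* Allocate n doubles and initialise them in parallel, so that each page is
 * first touched by the thread that will process it later. */
double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);  /* typically not mapped to a node yet */
    if (a == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = 0.0;                     /* the first touch happens here */
    return a;
}

/* Compute phase: same static schedule and same iteration count, hence the
 * same assignment of indices to threads as in the initialisation. */
void scale_add(double *a, size_t n, double x)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = a[i] * 2.0 + x;
}
```

If the array were instead initialised by a single thread, for example in a sequential loop in main, all of its pages would end up on that thread's node and every other node would access them remotely.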
Beware of the stack
The stack is designed for private data; shared data should be on the heap (dynamically allocated) or global. Remote stack accesses can be quite inefficient, and they are hard for automated tools to improve.
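The C sketch below illustrates this separation with OpenMP: the buffer shared by all threads is allocated on the heap, while each thread keeps only its private scratch array on its own stack. The names SCRATCH, sum_block and sum_of_squares are hypothetical, and the computation is deliberately trivial.

```c
#include <stdlib.h>

#define SCRATCH 64

/* Sum of squares of shared[begin..end), with end - begin <= SCRATCH.
 * The scratch array is private and lives on the calling thread's stack. */
static double sum_block(const double *shared, size_t begin, size_t end)
{
    double scratch[SCRATCH];
    double sum = 0.0;
    for (size_t j = 0; j < end - begin; j++)
        scratch[j] = shared[begin + j] * shared[begin + j];
    for (size_t j = 0; j < end - begin; j++)
        sum += scratch[j];
    return sum;
}

/* Fill a shared buffer with ones and return the sum of its squares.
 * Returns -1.0 if the allocation fails. */
double sum_of_squares(size_t n)
{
    /* Shared by every thread: on the heap, not a local array of this frame. */
    double *shared = malloc(n * sizeof *shared);
    long count = (long)n;
    double total = 0.0;
    if (shared == NULL)
        return -1.0;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < count; i++)
        shared[i] = 1.0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (long i = 0; i < count; i += SCRATCH) {
        long end = (i + SCRATCH < count) ? i + SCRATCH : count;
        total += sum_block(shared, (size_t)i, (size_t)end);
    }
    free(shared);
    return total;
}
```

Had the shared buffer been declared as a local array in the function that spawns the threads, it would live in that one thread's stack, and all other threads would reach it through remote stack accesses.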