Table of Contents
Memory Tracing
Participants
Project Goals
Full-system, timing-accurate memory traces
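As a concrete starting point, below is a minimal sketch of what a single full-system, timing-accurate trace record could contain (timestamp, CPU, virtual and physical address, size, access type). The struct layout and field names are hypothetical, not a committed trace format.

```cpp
// Hypothetical layout for one full-system memory trace record.
// Field names and sizes are placeholders, not a committed format.
#include <cstdint>
#include <cstdio>

#pragma pack(push, 1)
struct TraceRecord {
    uint64_t cycle;   // simulated cycle / timestamp, for timing accuracy
    uint64_t vaddr;   // guest virtual address
    uint64_t paddr;   // guest physical address
    uint8_t  cpu;     // virtual CPU index
    uint8_t  size;    // access size in bytes
    uint8_t  type;    // 0 = data read, 1 = data write, 2 = instruction fetch
};
#pragma pack(pop)

// Append one record to a binary trace file.
inline void emit(FILE *out, const TraceRecord &r) {
    fwrite(&r, sizeof r, 1, out);
}
```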
Related Tools
Dinero IV: trace-driven uniprocessor CPU cache simulator (http://pages.cs.wisc.edu/~markhill/DineroIV/)
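To get a feel for feeding our traces to Dinero IV, here is a minimal sketch that writes accesses in the traditional one-access-per-line "din" text format (a type digit followed by a hex address). The Dinero IV command line in the comment is from memory and should be verified against its documentation.

```cpp
// Sketch: dump accesses in Dinero's traditional "din" text format,
// one access per line: "<type> <hex address>", where (by convention)
// 0 = data read, 1 = data write, 2 = instruction fetch.
// A run might then look roughly like (flags to be verified):
//   dineroIV -l1-isize 16k -l1-dsize 16k -l1-ibsize 32 -l1-dbsize 32 -informat d < trace.din
#include <cstdint>
#include <cstdio>

enum class Access : int { Read = 0, Write = 1, Ifetch = 2 };

void write_din(FILE *out, Access type, uint64_t addr) {
    fprintf(out, "%d %llx\n", static_cast<int>(type),
            static_cast<unsigned long long>(addr));
}

int main() {
    // Toy example: a few synthetic accesses.
    FILE *out = fopen("trace.din", "w");
    if (!out) return 1;
    write_din(out, Access::Ifetch, 0x400000);
    write_din(out, Access::Read,   0x7fff0010);
    write_din(out, Access::Write,  0x7fff0018);
    fclose(out);
    return 0;
}
```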
QEMU Technical Details
Instructions for installation, how to get the source from our repository, and other details will be added here.
The repository with all the source can be found at https://doomsday.cs.fiu.edu/gitweb
Meetings
- 01/26/12: Initial discussion
Literature Review
This section is dedicated to papers related to this project.
Here is the original QEMU paper: bellard.pdf
Title: Characterization of Self-Similarity in Memory Workload: Analysis and Synthesis
Authors: Qiang Zou, Jianhui Yue, Bruce Segee, and Yifeng Zhu
Abstract: This paper studies self-similarity of memory I/O accesses in high-performance computer systems. We analyze the auto-correlation functions of memory access arrival intervals with small time scales and present both pictorial and statistical evidence that memory accesses have self-similar-like behavior. For memory I/O traces studied in our experiments, all estimated Hurst parameters are larger than 0.5, which indicate that self-similarity seems to be a general property of memory access behaviors. In addition, we implement a memory access series generator in which the inputs are the measured properties of the available trace data. Experimental results show that this model can accurately emulate the complex access arrival behaviors of real memory systems, particularly the heavy-tail characteristics under both Gaussian and non-Gaussian workloads.
Link: msst12-paper8.pdf
Relevance: Not much
Comments: The authors use the M5 full-system simulator to collect memory traces of the SPEC CPU 2000/2006 benchmarks for a few seconds. They then apply several statistical methods to characterize the traces. After identifying specific characteristics of the studied workloads, the authors are able to create synthetic traces with a high degree of similarity to the real traces. This study is tailored to high-performance computing.
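For reference, a minimal sketch of the textbook rescaled-range (R/S) estimator for the Hurst parameter mentioned in the abstract; this is a generic estimator applied to a series of inter-arrival times, not the authors' exact methodology.

```cpp
// Textbook rescaled-range (R/S) estimate of the Hurst parameter H.
// H > 0.5 suggests long-range dependence / self-similar-like behavior.
// Illustrative only -- not the estimation procedure used in the paper.
#include <cmath>
#include <vector>

// Average R/S statistic over consecutive blocks of length n.
double rescaled_range(const std::vector<double>& x, std::size_t n) {
    double sum_rs = 0.0;
    std::size_t blocks = 0;
    for (std::size_t start = 0; start + n <= x.size(); start += n) {
        double mean = 0.0;
        for (std::size_t i = 0; i < n; ++i) mean += x[start + i];
        mean /= n;
        double cum = 0.0, hi = 0.0, lo = 0.0, var = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double d = x[start + i] - mean;
            cum += d;
            if (cum > hi) hi = cum;
            if (cum < lo) lo = cum;
            var += d * d;
        }
        double s = std::sqrt(var / n);
        if (s > 0.0) { sum_rs += (hi - lo) / s; ++blocks; }
    }
    return blocks ? sum_rs / blocks : 0.0;
}

// H is the least-squares slope of log(R/S) versus log(n) over several block sizes.
double hurst(const std::vector<double>& x) {
    std::vector<double> lx, ly;
    for (std::size_t n = 8; n <= x.size() / 2; n *= 2) {
        double rs = rescaled_range(x, n);
        if (rs > 0.0) { lx.push_back(std::log((double)n)); ly.push_back(std::log(rs)); }
    }
    if (lx.size() < 2) return 0.0;   // not enough points for a slope
    double mx = 0.0, my = 0.0;
    for (std::size_t i = 0; i < lx.size(); ++i) { mx += lx[i]; my += ly[i]; }
    mx /= lx.size(); my /= lx.size();
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < lx.size(); ++i) {
        num += (lx[i] - mx) * (ly[i] - my);
        den += (lx[i] - mx) * (lx[i] - mx);
    }
    return den > 0.0 ? num / den : 0.0;
}
```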
Title: METRIC: Memory tracing via dynamic binary rewriting to identify cache inefficiencies
Authors: Jaydeep Marathe, Frank Mueller, Tushar Mohan, Sally A. McKee, Bronis R. de Supinski, and Andy Yoo
Abstract: With the diverging improvements in CPU speeds and memory access latencies, detecting and removing memory access bottlenecks becomes increasingly important. In this work we present METRIC, a software framework for isolating and understanding such bottlenecks using partial access traces. METRIC extracts access traces from executing programs without special compiler or linker support. We make four primary contributions. First, we present a framework for extracting partial access traces based on dynamic binary rewriting of the executing application. Second, we introduce a novel algorithm for compressing these traces. The algorithm generates constant space representations for regular accesses occurring in nested loop structures. Third, we use these traces for offline incremental memory hierarchy simulation. We extract symbolic information from the application executable and use this to generate detailed source-code correlated statistics including per-reference metrics, cache evictor information, and stream metrics. Finally, we demonstrate how this information can be used to isolate and understand memory access inefficiencies. This illustrates a potential advantage of METRIC over compile-time analysis for sample codes, particularly when interprocedural analysis is required.
Link: p1-marathe.pdf
Relevance: Some
Comments: Memory traces from executing applications are extracted by using dynamic binary rewriting (limited to user space only). Much time is spent on developing an efficient compression algorithm optimized for scientific applications, where nested loop structures are common. This work uses the DynInst framework (http://www.dyninst.org/).
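The basic idea of collapsing regular accesses from nested loops into constant-space descriptors can be sketched as follows; this is only the underlying idea, not METRIC's actual compression algorithm.

```cpp
// Sketch: collapse runs of constant-stride addresses into (base, stride, count)
// descriptors, the basic idea behind compressing regular loop accesses.
// This is an illustration, not METRIC's actual algorithm.
#include <cstdint>
#include <vector>

struct StrideRun {
    uint64_t base;    // first address of the run
    int64_t  stride;  // constant difference between consecutive addresses
    uint64_t count;   // number of addresses in the run
};

std::vector<StrideRun> compress(const std::vector<uint64_t>& addrs) {
    std::vector<StrideRun> runs;
    for (uint64_t a : addrs) {
        if (!runs.empty()) {
            StrideRun &r = runs.back();
            int64_t s = (int64_t)a - (int64_t)(r.base + r.stride * (r.count - 1));
            if (r.count == 1) {          // second element fixes the stride
                r.stride = s; r.count = 2; continue;
            }
            if (s == r.stride) { ++r.count; continue; }
        }
        runs.push_back({a, 0, 1});       // start a new run
    }
    return runs;
}
```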
Title: Memory system characterization of commercial workloads
Authors: Luiz André Barroso, Kourosh Gharachorloo, and Edouard Bugnion
Abstract: Commercial applications such as databases and Web servers constitute the largest and fastest-growing segment of the market for multiprocessor servers. Ongoing innovations in disk subsystems, along with the ever increasing gap between processor and memory speeds, have elevated memory system design as the critical performance factor for such workloads. However, most current server designs have been optimized to perform well on scientific and engineering workloads, potentially leading to design decisions that are non-ideal for commercial applications. The above problem is exacerbated by the lack of information on the performance requirements of commercial workloads, the lack of available applications for widespread study, and the fact that most representative applications are too large and complex to serve as suitable benchmarks for evaluating trade-offs in the design of processors and servers. This paper presents a detailed performance study of three important classes of commercial workloads: online transaction processing (OLTP), decision support systems (DSS), and Web index search. We use the Oracle commercial database engine for our OLTP and DSS workloads, and the AltaVista search engine for our Web index search workload. This study characterizes the memory system behavior of these workloads through a large number of architectural experiments on Alpha multiprocessors augmented with full system simulations to determine the impact of architectural trends. We also identify a set of simplifications that make these workloads more amenable to monitoring and simulation without affecting representative memory system behavior. We observe that systems optimized for OLTP versus DSS and index search workloads may lead to diverging designs, specifically in the size and speed requirements for off-chip caches.
Link: p3-barroso.pdf
Relevance:
Comments:
Title: Exploiting Stability to Reduce Time-Space Cost for Memory Tracing
Authors: Xiaofeng Gao and Allan Snavely
Abstract: Memory traces record the addresses touched by a program during its execution, enabling many useful investigations for understanding and predicting program performance. But complete address traces are time-consuming to acquire and too large to practically store except in the case of short-running programs. Also, memory traces have to be re-acquired each time the input data (and thus the dynamic behavior of the program) changes. We observe that individual load and store instructions typically have stable memory access patterns. Changes in dynamic control-flow of programs, rather than variation in memory access patterns of individual instructions, appear to be the primary cause of overall memory behavior varying both during one execution of a program and during re-execution of the same program on different input data. We are leveraging this observation to enable approximate memory traces that are smaller than full traces, faster to acquire via sampling, much faster to re-acquire for new input data, and have a high degree of verisimilitude relative to full traces. This paper presents an update on our progress.
Link: exploiting_stability_to_reduce_time-space_cost_for_memory_tracing.pdf
Relevance:
Comments:
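A minimal sketch of the per-instruction view behind this paper's observation: track, for each load/store PC, how often it continues an established stride. All names here are illustrative; this is not the authors' tool.

```cpp
// Per-PC stride stability: record each memory instruction's last address,
// last stride, and how often the same stride repeats.
#include <cstdint>
#include <unordered_map>

struct PcState {
    uint64_t last_addr = 0;
    int64_t  last_stride = 0;
    uint64_t stable = 0;   // accesses whose stride matched the previous one
    uint64_t total = 0;    // accesses seen for this PC
};

class StrideStability {
public:
    void record(uint64_t pc, uint64_t addr) {
        PcState &s = table_[pc];
        if (s.total >= 1) {
            int64_t stride = (int64_t)addr - (int64_t)s.last_addr;
            if (s.total >= 2 && stride == s.last_stride) ++s.stable;
            s.last_stride = stride;
        }
        s.last_addr = addr;
        ++s.total;
    }
    // Fraction of this PC's accesses that continued an established stride.
    double stability(uint64_t pc) const {
        auto it = table_.find(pc);
        if (it == table_.end() || it->second.total < 3) return 0.0;
        return (double)it->second.stable / (double)(it->second.total - 2);
    }
private:
    std::unordered_map<uint64_t, PcState> table_;
};
```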
Title: Virtual machine memory access tracing with hypervisor exclusive cache
Authors: Pin Lu and Kai Shen
Abstract: Virtual machine (VM) memory allocation and VM consolidation can benefit from the prediction of VM page miss rate at each candidate memory size. Such prediction is challenging for the hypervisor (or VM monitor) due to a lack of knowledge on VM memory access pattern. This paper explores the approach that the hypervisor takes over the management for part of the VM memory and thus all accesses that miss the remaining VM memory can be transparently traced by the hypervisor.
For online memory access tracing, its overhead should be small compared to the case that all allocated memory is directly managed by the VM. To save memory space, the hypervisor manages its memory portion as an exclusive cache (i.e., containing only data that is not in the remaining VM memory). To minimize I/O overhead, evicted data from a VM enters its cache directly from VM memory (as opposed to entering from the secondary storage). We guarantee the cache correctness by only caching memory pages whose current contents provably match those of corresponding storage locations. Based on our design, we show that when the VM evicts pages in the LRU order, the employment of the hypervisor cache does not introduce any additional I/O overhead in the system.
We implemented the proposed scheme on the Xen para-virtualization platform. Our experiments with microbenchmarks and four real data-intensive services (SPECweb99, index searching, TPC-C, and TPC-H) illustrate the overhead of our hypervisor cache and the accuracy of cache-driven VM page miss rate prediction. We also present the results on adaptive VM memory allocation with performance assurance.
Link: usenix07.pdf
Relevance: Not much
Comments: The idea is to create an exclusive cache at the hypervisor level containing the pages evicted by the VM OS, and to collect statistics about memory misses to construct a miss-rate function that estimates how much memory is used by each VM OS. Using this information, memory is redistributed by removing unused memory from one VM to make it available to another; a simple greedy algorithm is used to do this. The VM OS was modified in order to communicate with the hypervisor.
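The miss-rate curve idea (predicting the page miss rate at each candidate memory size from one pass over the reference stream) can be illustrated with a textbook LRU stack-distance simulation; this is a generic sketch, not the authors' hypervisor cache implementation.

```cpp
// Textbook LRU stack-distance (Mattson) sketch: from one pass over a page
// reference stream, derive the miss count for every candidate memory size.
// Generic illustration, not the paper's hypervisor cache implementation.
#include <algorithm>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

class MissRatioCurve {
public:
    explicit MissRatioCurve(std::size_t max_pages) : hist_(max_pages + 1, 0) {}

    void access(uint64_t page) {
        auto it = pos_.find(page);
        if (it == pos_.end()) {
            ++cold_;                               // first touch: always a miss
        } else {
            // Stack distance = number of distinct pages touched since last use.
            std::size_t depth = 0;
            for (auto l = stack_.begin(); l != it->second; ++l) ++depth;
            ++hist_[std::min(depth, hist_.size() - 1)];
            stack_.erase(it->second);
        }
        stack_.push_front(page);                   // move / insert at MRU position
        pos_[page] = stack_.begin();
    }

    // Misses an LRU-managed memory of `pages` pages would have seen.
    uint64_t misses_at(std::size_t pages) const {
        uint64_t m = cold_;
        for (std::size_t d = pages; d < hist_.size(); ++d) m += hist_[d];
        return m;
    }

private:
    std::list<uint64_t> stack_;                    // MRU at front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos_;
    std::vector<uint64_t> hist_;                   // hist_[d] = reuses at stack distance d
    uint64_t cold_ = 0;
};
```

Calling misses_at() for each candidate size after replaying a trace gives the whole miss-rate curve from a single simulation pass, which is the property that makes this kind of analysis attractive for memory allocation decisions.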
Title: Memory Access Pattern Analysis
Authors: Mary Brown, Roy M. Jenevein, and Nasr Ullah
Abstract: A methodology for analyzing memory behavior has been developed for the purpose of evaluating memory system design. MPAT, a memory pattern analysis tool, has been used to profile memory transactions of dynamic instruction traces. This paper will first describe the memory model and metrics gathered by MPAT. Then the metrics are evaluated in order to determine what hardware and software changes should be made to improve memory system performance.
Link: 00809366.pdf
Relevance:
Comments:
Title: SIGMA: A Simulator Infrastructure to Guide Memory Analysis
Authors: Luiz DeRose, K. Ekanadham, Jeffrey K. Hollingsworth, and Simone Sbaraglia
Abstract: In this paper we present SIGMA (Simulation Infrastructure to Guide Memory Analysis), a new data collection framework and family of cache analysis tools. The SIGMA environment provides detailed cache information by gathering memory reference data using software-based instrumentation. This infrastructure can facilitate quick probing into the factors that influence the performance of an application by highlighting bottleneck scenarios including: excessive cache/TLB misses and inefficient data layouts. The tool can also assist in perturbation analysis to determine performance variations caused by changes to architecture or program. Our validation tests using the SPEC Swim benchmark show that most of the performance metrics obtained with SIGMA are within 1% of the metrics obtained with hardware performance counters, with the advantage that SIGMA provides performance data on a data structure level, as specified by the programmer.
Link: 01592837.pdf
Relevance:
Comments:
Title: Memory Characterization of Workloads Using Instrumentation-Driven Simulation
Authors: Aamer Jaleel
Abstract: There is a growing need for simulation methodologies to understand the memory system requirements of emerging workloads in a reasonable amount of time. This paper presents binary instrumentation-driven simulation as an alternative to conventional execution-driven and trace-driven simulation methodologies. We illustrate the use of instrumentation-driven simulation (IDS) using Pin to determine the memory system requirements of workloads from the SPEC CPU2000 and SPEC CPU2006 benchmark suites. In comparison to SPEC CPU2000, SPEC CPU2006 workloads are an order of magnitude longer (in terms of instruction count). Additionally, SPEC CPU2006 comprises of many more memory intensive workloads that require more than 4MB of cache size for better cache performance.
Link: specanalysis.pdf
Relevance:
Comments:
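To illustrate the kind of model an instrumentation-driven setup can drive directly from its analysis callbacks (instead of first writing a trace to disk), here is a minimal LRU set-associative cache sketch; sizes and structure are illustrative only.

```cpp
// Minimal LRU set-associative cache model that analysis callbacks can feed
// one address at a time. Parameters and structure are illustrative only.
#include <cstddef>
#include <cstdint>
#include <vector>

class LruCache {
public:
    LruCache(std::size_t size_bytes, std::size_t line_bytes, std::size_t ways)
        : line_(line_bytes), ways_(ways),
          sets_(size_bytes / (line_bytes * ways)),
          tags_(sets_ * ways, 0), valid_(sets_ * ways, false) {}

    bool access(uint64_t addr) {                 // returns true on hit
        uint64_t line = addr / line_;
        std::size_t base = (line % sets_) * ways_;
        for (std::size_t w = 0; w < ways_; ++w) {
            if (valid_[base + w] && tags_[base + w] == line) {
                touch(base, w);                  // move hit way to MRU
                ++hits_;
                return true;
            }
        }
        // Miss: shift ways down, dropping the LRU way at the highest index.
        for (std::size_t w = ways_ - 1; w > 0; --w) {
            tags_[base + w] = tags_[base + w - 1];
            valid_[base + w] = valid_[base + w - 1];
        }
        tags_[base] = line;
        valid_[base] = true;
        ++misses_;
        return false;
    }

    uint64_t hits() const { return hits_; }
    uint64_t misses() const { return misses_; }

private:
    void touch(std::size_t base, std::size_t w) { // put way w at MRU (index 0)
        uint64_t t = tags_[base + w];
        bool v = valid_[base + w];
        for (std::size_t i = w; i > 0; --i) {
            tags_[base + i] = tags_[base + i - 1];
            valid_[base + i] = valid_[base + i - 1];
        }
        tags_[base] = t;
        valid_[base] = v;
    }

    std::size_t line_, ways_, sets_;
    std::vector<uint64_t> tags_;
    std::vector<bool> valid_;
    uint64_t hits_ = 0, misses_ = 0;
};
```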
——————–
Title: HMTT: A Platform Independent Full-System Memory Trace Monitoring System
Authors: Yungang Bao, Mingyu Chen, Yuan Ruan, Li Liu, Jianping Fan, Qingbo Yuan, Bo Song, and Jianwei Xu
Abstract: Memory trace analysis is an important technology for architecture research, system software (i.e., OS, compiler) optimization, and application performance improvements. Many approaches have been used to track memory trace, such as simulation, binary instrumentation and hardware snooping. However, they usually have limitations of time, accuracy and capacity.
In this paper we propose a platform independent memory trace monitoring system, which is able to track virtual memory reference trace of full systems (including OS, VMMs, libraries, and applications). The system adopts a DIMM-snooping mechanism that uses hardware boards plugged in DIMM slots to snoop. There are several advantages in this approach, such as fast, complete, undistorted, and portable. Three key techniques are proposed to address the system design challenges with this mechanism: (1) To keep up with memory speeds, the DDR protocol state machine is simplified, and large FIFOs are added between the state machine and the trace transmitting logic to handle burst memory accesses; (2) To reconstruct physical-to-virtual mapping and distinguish one process' address space from others, an OS kernel module, which collects page table information, and a synchronization mechanism, which synchronizes the page table information with the memory trace, are developed; (3) To dump massive trace data, we employ a straightforward method to compress the trace and use Gigabit Ethernet and RAID to send and receive the compressed trace.
We present our implementation of an initial monitoring system, named HMTT (Hyper Memory Trace Tracker). Using HMTT, we have observed that burst bandwidth utilization is much larger than average bandwidth utilization, by up to 5X in desktop applications. We have also confirmed that the stream memory accesses of many applications contribute even more than 40% of L2 Cache misses and OS virtual memory management may decrease stream accesses in view of memory controller (or L2 Cache), by up to 30.2%. Moreover, we have evaluated OS impact on memory performance in real systems. The evaluations and case studies show the feasibility and effectiveness of our proposed monitoring mechanism and techniques.
Link: baoyg_hmtt.pdf
Relevance:
Comments:
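A minimal sketch of the physical-to-virtual reconstruction step described in the abstract: page-table snapshots from a kernel module populate a frame-to-(pid, virtual page) map, which is then used to translate the physical addresses captured on the bus. This is only an illustration of the idea, not HMTT's actual data structures or synchronization mechanism.

```cpp
// Sketch of physical-to-virtual reconstruction: given reported PTEs
// (pid, virtual page, physical frame), translate physical addresses from the
// DIMM-level trace back to (pid, virtual address). Illustrative only.
#include <cstdint>
#include <unordered_map>

constexpr uint64_t kPageShift = 12;               // assume 4 KiB pages

struct Mapping { uint32_t pid; uint64_t vpage; };

class PhysToVirt {
public:
    // Called when the kernel module reports or updates a page table entry.
    void update(uint32_t pid, uint64_t vaddr, uint64_t paddr) {
        frame_to_virt_[paddr >> kPageShift] = {pid, vaddr >> kPageShift};
    }

    // Translate a physical address captured from the memory trace.
    bool translate(uint64_t paddr, uint32_t &pid, uint64_t &vaddr) const {
        auto it = frame_to_virt_.find(paddr >> kPageShift);
        if (it == frame_to_virt_.end()) return false;   // e.g. unmapped frame
        pid = it->second.pid;
        vaddr = (it->second.vpage << kPageShift) | (paddr & ((1ull << kPageShift) - 1));
        return true;
    }

private:
    std::unordered_map<uint64_t, Mapping> frame_to_virt_;
};
```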
——————–
Title: ReTrace: Collecting Execution Trace with Virtual Machine Deterministic Replay
Authors: Min Xu, Vyacheslav Malyugin, Jeffrey Sheldon, Ganesh Venkitachalam, and Boris Weissman
Abstract: Execution trace is an important tool in computer architecture research. Unfortunately, existing trace collection techniques are often slow (due to software tracing overheads) or expensive (due to special tracing hardware requirements). Regardless of the method of collection, detailed trace files are generally large and inconvenient to store and share.
We present ReTrace, a trace collection tool based on the deterministic replay technology of the VMware hypervisor. ReTrace operates in two stages: capturing and expansion. ReTrace capturing accumulates the minimal amount of information necessary to later recreate a more detailed execution trace. It captures (records) only non-deterministic events resulting in low time and space overheads (as low as 5% run-time overhead, as low as 0.5 byte per thousand instructions log growth rate) on supported platforms. ReTrace expansion uses the information collected by the capturing stage to generate a complete and accurate execution trace without any data loss or distortion. ReTrace is an experimental feature of VMware Workstation 6.0 currently available in Windows and Linux flavors for commodity IA32 platforms. No special tracing hardware is required.
We have three key results. First, we find that trace collection can be done both efficiently and inexpensively. Second, deterministic replay is an effective technique for compressing large trace files. Third, performing the trace collection at the hypervisor layer is minimally invasive to the collected trace while enabling tracing of the entire system (user/supervisor level, CPU, peripheral devices). ReTrace is a rapidly evolving technology. We would like to use this paper to solicit feedback on the applicability of ReTrace in computer architecture research to help us refine our future development plans.
Link: retrace.pdf
Relevance:
Comments:
——————–
Title: Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation
Authors: Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin’s rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application’s original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin’s versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
Link: 10.1.1.85.4883.pdf
Relevance:
Comments:
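Since a memory-tracing Pintool is directly relevant to this project, here is a sketch modeled on the "pinatrace" example that ships with the Pin kit; the exact API should be checked against the Pin version used.

```cpp
// Sketch of a memory-tracing Pintool, modeled on the "pinatrace" example
// shipped with the Pin kit; verify the API against the Pin version in use.
#include <cstdio>
#include "pin.H"

static FILE *trace;

static VOID RecordMemRead(VOID *ip, VOID *addr)  { fprintf(trace, "%p: R %p\n", ip, addr); }
static VOID RecordMemWrite(VOID *ip, VOID *addr) { fprintf(trace, "%p: W %p\n", ip, addr); }

// Called for every instruction; insert a callback before each memory operand.
static VOID Instruction(INS ins, VOID *v) {
    UINT32 memOperands = INS_MemoryOperandCount(ins);
    for (UINT32 memOp = 0; memOp < memOperands; memOp++) {
        if (INS_MemoryOperandIsRead(ins, memOp))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemRead,
                                     IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
        if (INS_MemoryOperandIsWritten(ins, memOp))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite,
                                     IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
    }
}

static VOID Fini(INT32 code, VOID *v) { fclose(trace); }

int main(int argc, char *argv[]) {
    if (PIN_Init(argc, argv)) return 1;
    trace = fopen("pinatrace.out", "w");
    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);
    PIN_StartProgram();   // never returns
    return 0;
}
```

Such a tool is built against the Pin kit and launched with something like pin -t mytool.so -- application (exact paths depend on the kit layout). Note that this only covers user-space code of one process, unlike the full-system traces we are after.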
——————–
Title: Execution replay of multiprocessor virtual machines
Authors: George W. Dunlap, Dominic G. Lucchetti, Michael A. Fetterman, and Peter M. Chen
Abstract: Execution replay of virtual machines is a technique which has many important applications, including debugging, fault-tolerance, and security. Execution replay for single processor virtual machines is well-understood, and available commercially. With the advancement of multi-core architectures, however, multiprocessor virtual machines are becoming more important. Our system, SMP-ReVirt, is the first system to log and replay a multiprocessor virtual machine on commodity hardware. We use hardware page protection to detect and accurately replay sharing between virtual cpus of a multi-cpu virtual machine, allowing us to replay the entire operating system and all applications. We have tested our system on a variety of workloads, and find that although sharing under SMP-ReVirt is expensive, for many workloads and applications, including debugging, the overhead is acceptable.
Link: p121-dunlap.pdf
Relevance:
Comments:
——————–
Title: Decoupling dynamic program analysis from execution in virtual environments
Authors: Jim Chow, Tal Garfinkel, and Peter M. Chen
Abstract: Analyzing the behavior of running programs has a wide variety of compelling applications, from intrusion detection and prevention to bug discovery. Unfortunately, the high runtime overheads imposed by complex analysis techniques makes their deployment impractical in most settings. We present a virtual machine based architecture called Aftersight that ameliorates this, providing a flexible and practical way to run heavyweight analyses on production workloads.
Aftersight decouples analysis from normal execution by logging nondeterministic VM inputs and replaying them on a separate analysis platform. VM output can be gated on the results of an analysis for intrusion prevention or analysis can run at its own pace for intrusion detection and best effort prevention. Logs can also be stored for later analysis offline for bug finding or forensics, allowing analyses that would otherwise be unusable to be applied ubiquitously. In all cases, multiple analyses can be run in parallel, added on demand, and are guaranteed not to interfere with the running workload.
We present our experience implementing Aftersight as part of the VMware virtual machine platform and using it to develop a realtime intrusion detection and prevention system, as well as an offline system for bug detection, which we used to detect numerous novel and serious bugs in VMware ESX Server, Linux, and Windows applications.
Link: chow.pdf
Relevance:
Comments: