Proposal: Adding a fixed-width binary format for memory traces. #103

@ziadomalik

Problem

Given that we now have a use for trace files beyond simple debugging (i.e. building our own trace visualizer), the current CSV format output by the existing TraceRecorder introduces some unnecessary inconveniences when trying to parse it efficiently:

  • We need to count and index the lines of the memory-mapped CSV file, which can still take 3-5s on an SSD and ~11s on an HDD, even with numerous optimizations.
    • This is as far as I got optimizing the CSV parsing for my Trace Visualizer; anything more gives diminishing returns: Code
  • Since we're working with characters in a string, finding a row and then a specific entry within that row requires linear-time scans.
  • The current format has odd quirks that could be optimized for file size, like:
    • Spaces after commas in the output CSV. Those add up! The first test in the mess branch produces a 27MB file with the spaces, and removing them gives you something on the order of 23MB, so imagine how much of a 1GB file is spaces.
    • When something is broken/invalid, TraceRecorder outputs -1 as the value of that entry, forcing us to use signed integer types. That leaves half the positive range at a given width, or costs double the width for the same range. Depending on how we define the maximum values for entries in the address vector, one could easily cut the file size in half if it were stored in binary.
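To make the last point concrete, here is a rough sketch of the size difference between keeping -1 in a signed 32-bit field and reserving the maximum unsigned 16-bit value as the "invalid" marker. The field widths, the `INVALID_U16` sentinel, and the sample address vector are assumptions picked for illustration, not what TraceRecorder or the proposed format actually uses.

```python
import struct

# Assumption for illustration: 0xFFFF is reserved as the "invalid" marker,
# so valid entries must fit in 0..0xFFFE.
INVALID_U16 = 0xFFFF


def pack_signed32(entries):
    """Store each entry as a signed 32-bit integer, keeping -1 as-is."""
    return struct.pack(f"<{len(entries)}i", *entries)


def pack_unsigned16(entries):
    """Store each entry as an unsigned 16-bit integer, mapping -1 to the sentinel."""
    mapped = [INVALID_U16 if e == -1 else e for e in entries]
    return struct.pack(f"<{len(mapped)}H", *mapped)


def unpack_unsigned16(buf):
    """Recover the original entries, turning the sentinel back into -1."""
    count = len(buf) // 2
    return [-1 if v == INVALID_U16 else v for v in struct.unpack(f"<{count}H", buf)]


addr_vec = [0, 1, -1, 3, 1024, 7]  # made-up address vector with one invalid entry
wide = pack_signed32(addr_vec)
narrow = pack_unsigned16(addr_vec)
print(len(wide), len(narrow))  # byte counts of the two encodings
```

Whether 16 bits is enough depends entirely on the upper bounds chosen for each address vector entry, which is the open question here.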

Solution

Yesterday, I wanted to validate my idea, so I implemented a binary trace format (+ recorder) with the following characteristics:

  • It is fixed-width, so after memory-mapping the file we get O(1) lookup of any entry on any line.
  • It defines a trace event to be exactly 32B wide, so two trace events fit into one 64B cache line.
    • (i.e., when we look up an even-indexed event i, we get event i+1 loaded into the CPU cache for free, provided the events start on a 64B boundary).
  • It is very friendly to size optimization if you can define tighter upper bounds for the address vector entries.
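As a rough illustration of why fixed-width records give O(1) lookup, the sketch below memory-maps a file of 32-byte events and reads event `index` directly at byte offset `index * 32`. The field layout in `EVENT` (clock, command id, six address-vector entries, padding) and the helper names are hypothetical; the real layout lives in the header-only library linked below, and a real file may also carry a header that shifts the offset.

```python
import mmap
import struct

# Hypothetical 32-byte layout (NOT the layout from the fork):
#   uint64 clock cycle, uint16 command id, 6 x uint16 address-vector entries,
#   10 padding bytes. "<" means little-endian with no implicit alignment.
EVENT = struct.Struct("<QH6H10x")
assert EVENT.size == 32


def write_events(path, events):
    """Write (clk, cmd, addr_vec) tuples back to back as fixed-width records."""
    with open(path, "wb") as f:
        for clk, cmd, addr_vec in events:
            f.write(EVENT.pack(clk, cmd, *addr_vec))


def read_event(path, index):
    """Read one event by index without scanning any earlier events."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        fields = EVENT.unpack_from(mm, index * EVENT.size)
    return fields[0], fields[1], list(fields[2:])
```

A visualizer would normally keep the mapping open instead of reopening the file per lookup; it is reopened here only to keep the sketch short.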

I have forked Ramulator 2.0 and added:

  • Header-only library defining the file format: (Code)
  • New recorder called BinaryTraceRecorder: (Code)

How to test it

Since you guys don't seem to have a set-in-stone way of doing testing right now, I didn't want to add my own unit/integration testing setup without talking to you guys first, so I vibecoded (!) a script that takes the binary file and turns it back into a matching CSV. (Code)
I recommend the following testing procedure:

  1. Add both the TraceRecorder and BinaryTraceRecorder to a project and generate some traces. (Params are the same)
    • You should end up with the usual CSV files {path}.ch{channel_id} and new files that end in {channel_id}.mtrc (the new format)
  2. Pick one channel (e.g. 0) and convert the file ending in {id}.mtrc to a CSV like so:
    • python3 test_mtrc.py <path>.mtrc
    • You will end up with a file of the same name as the input file but ending in .csv
  3. Get the diff of this file and the file produced by TraceRecorder corresponding to the same channel:
    • diff -a visualizer_trace_0.csv visualizer_trace_csv.ch0
    • If the diff is empty, it means our binary format contains the same information as the original trace format!
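If you would rather script step 3 than run diff by hand, something along these lines should do the same job. It is only a sketch: `first_mismatch` is a made-up helper, not part of test_mtrc.py, and it compares lines byte-for-byte just like diff does, so any whitespace difference counts as a mismatch.

```python
from itertools import zip_longest


def first_mismatch(path_a, path_b):
    """Return the 1-based number of the first differing line, or None if the files match."""
    # newline="" keeps line endings untouched so the comparison is exact.
    with open(path_a, newline="") as a, open(path_b, newline="") as b:
        # zip_longest pads the shorter file with None, so a length
        # difference is reported as a mismatch too.
        for line_no, (row_a, row_b) in enumerate(zip_longest(a, b), start=1):
            if row_a != row_b:
                return line_no
    return None
```

Calling it with the converted CSV and the TraceRecorder CSV for the same channel returns None in the same case where the diff would be empty.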

I opened this issue to hear your thoughts on this, and whether you think it's a good change for the trace visualizer project and a good addition to the project in general. If you are interested, I can turn my branch into a PR. (CC: @nisabostanci, @RichardLuo79)
