Problem
Given that we now have a use for trace files beyond simple debugging (e.g. building our own trace visualizer), the current CSV format output by the existing `TraceRecorder` introduces some unnecessary inconveniences when trying to parse it efficiently:
- We need to count & index the memory-mapped CSV file, which can still take 3-5 s on an SSD and ~11 s on an HDD even with numerous optimizations.
- This is how far I got optimizing the parsing of the CSV trace file for my trace visualizer; anything more gives diminishing returns: (Code)
- Since we're working with characters in a string, we need linear-time scans to find a row and then a specific entry within that row.
- The current format has odd quirks that could be optimized for file size, like:
  - Spaces after commas in the output CSV. Those add up! The first test on the `mess` branch produces a 27 MB file with the spaces, and removing them gives you something on the order of 23 MB, so imagine how much of a 1 GB file is spaces.
  - When something is broken/invalid, `TraceRecorder` outputs `-1` as the value in that entry, forcing us to use signed integer types and giving up half the positive number range for double the size. Depending on how we define the maximum values for entries in the address vector, one could easily cut the file size in half if it were stored in binary.
Solution
Yesterday, I wanted to validate my idea, so I implemented a binary trace format (+ recorder) with the following characteristics:
- It is fixed-width, meaning that after memory mapping we get `O(1)` lookup for each entry at any line.
- It defines a trace event to be exactly `32 B` wide, meaning two trace events fit into one `64 B` cache line (i.e. when we look up event `i`, event `i+1` gets loaded into the CPU cache for free).
- It is very friendly to size optimization if you can define tighter upper boundaries for the address vector entries.
I have forked Ramulator 2.0 and added:
- Header-only library defining the file format: (Code)
- New recorder called `BinaryTraceRecorder`: (Code)
How to test it
Since you guys don't seem to have a set-in-stone way of doing testing right now, I didn't want to add my own unit/integration testing setup without talking to you guys first, so I vibecoded (!) a script that takes the binary file and turns it back into a matching CSV. (Code)
I recommend the following testing procedure:
- Add both the `TraceRecorder` and the `BinaryTraceRecorder` to a project and generate some traces. (Params are the same.)
  - You should end up with the usual CSV files `{path}.ch{channel_id}` and new files that end in `{channel_id}.mtrc` (the new format).
- Pick one channel (e.g. `0`) and convert the file ending in `{id}.mtrc` to a CSV like so: `python3 test_mtrc.py <path>.mtrc`
  - You will end up with a file of the same name as the input file but ending in `.csv`.
- Get the diff of this file and the file produced by `TraceRecorder` for the same channel: `diff -a visualizer_trace_0.csv visualizer_trace_csv.ch0`
  - If the diff is empty, it means our binary format contains the same information as the original trace format!
I made this issue to hear you guys' thoughts on this, and whether you think it's a good change for the trace visualizer project and a good addition to the project in general. If you are interested, I can turn my branch into a PR. (CC: @nisabostanci, @RichardLuo79)