[Buffers] Add latency and occupancy balancing LPs.#711
[Buffers] Add latency and occupancy balancing LPs.#711ziadomalik wants to merge 84 commits intomainfrom
Conversation
Also it's more modular now, take the logic out of the constructor, and a data flow graph describes a single thing, not an array of things
…es with hacky stuff
8db2093 to
f4b6d6e
Compare
| /// Extra latency to insert on this channel for balancing (integer). | ||
| CPVar extraLatency; |
There was a problem hiding this comment.
What is the difference between this extraLatency and dataLatency above? Could we use that one instead?
| /// Occupancy contribution of this unit (real). | ||
| CPVar occupancy; |
There was a problem hiding this comment.
We should add a remark for the CPVars that are exclusive to FPGA24
| ReconvergentPath path; | ||
| const ReconvergentPathFinderGraph *graph; |
There was a problem hiding this comment.
Let's think of better names for ReconvergentPath and ReconvergentPathFinderGraph:
ReconvergentPath: this one actually finds you the paths, and it is a graph by itselfReconvergentPathFinderGraph: shall we call it CFGTransitionSequenceSubgraph (to match what is actually is)?
For this one in the previous PR:
/// A reconvergent path is a subgraph where multiple paths diverge from a fork
/// and reconverge at a join. This is important for latency balancing.
struct ReconvergentPath {
NodeIdType forkNodeId; // The divergence point
NodeIdType joinNodeId; // The convergence point
std::set<NodeIdType> nodeIds; // All nodes on paths from fork to join.
ReconvergentPath(NodeIdType fork, NodeIdType join, std::set<NodeIdType> nodes)
: forkNodeId(fork), joinNodeId(join), nodeIds(std::move(nodes)) {}
};
Does it make more sense to put the edges already here (so we'll not need ReconvergentPathFinderGraph here anymore)? This could be implemented as a subgraph as well
| ArrayRef<ReconvergentPathWithGraph> reconvergentPaths; | ||
| ArrayRef<SynchronizingCyclePair> syncCyclePairs; |
There was a problem hiding this comment.
Not a good idea to have a reference as a class member (not clear who owns the reference), could just be std::vector
| ArrayRef<ReconvergentPathWithGraph> reconvergentPaths; | ||
| ArrayRef<CFDFC *> cfdfcs; |
| std::vector<unsigned> pathNumVars; /// Number of L_c vars per path | ||
| std::vector<unsigned> pathNumEdges; /// Total edges per path |
| /// (Paper: Section 4, Equation 4) | ||
| void LatencyBalancingMILP::addStallPropagationConstraints() { | ||
| DenseMap<Value, SmallVector<CPVar *>> channelToPatterns; | ||
|
|
There was a problem hiding this comment.
could add a lambda function here:
auto addPatternToChannelStallProp = [](Value ch, CPVar * imbalanced) { .... };
And directly call this function in the loops:
for (size_t i = 0; i < reconvergentPaths.size(); ++i) {
const ReconvergentPathWithGraph &pathWithGraph = reconvergentPaths[i];
const ReconvergentPath &path = pathWithGraph.path;
const ReconvergentPathFinderGraph *graph = pathWithGraph.graph;
for (NodeIdType nodeId : path.nodeIds) {
for (EdgeIdType edgeId : graph->adjList[nodeId]) {
const auto &edge = graph->edges[edgeId];
if (path.nodeIds.count(edge.dstId)) {
addPatternToChannelStallProp(edge, &vars.reconvergentPathVars[i].imbalanced);
}
}
}
}
| /// Check if any unit in the path has variable latency. | ||
| bool hasVarLatency = false; | ||
| for (NodeIdType nodeId : path.nodeIds) { | ||
| Operation *op = graph->nodes[nodeId].op; | ||
| if (hasVariableLatency(op)) { | ||
| hasVarLatency = true; | ||
| break; | ||
| } | ||
| } | ||
|
|
||
| /// If variable latency exists, there is no opportunity to balance the | ||
| /// pattern that contains it. Thus, we force the pattern to be marked | ||
| /// imbalanced. (Paper: Section 4, Equation 3) | ||
| if (hasVarLatency) { | ||
| model->addConstr(patternImbalanced == 1, | ||
| "varLatency_rp_" + std::to_string(pathIdx)); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Outline the section like this:
// [START Handle pattern with variable latency (mark them as always imbalanced)]
/* code in the middle */
// [END Handle pattern with variable latency (mark them as always imbalanced)]
| pathNumEdges.push_back(simplePath.edges.size()); | ||
| } | ||
|
|
||
| /// Debug: Log path latencies for this reconvergent pattern (only for first |
| /// Add imbalance constraints for each pair of paths: | ||
| /// |L_i - L_j| <= M * s_p. | ||
| /// Which translates to: | ||
| /// L_i - L_j <= M * s_p. | ||
| /// L_j - L_i <= M * s_p. |
There was a problem hiding this comment.
Move to right before the constraints
Here we could leave a comment to explain why "edges to joins" is a vector of vectors (instead of a single vector to edges (BTW, why it can't be a single vector to edges...? |
Status: 17.01.25
This PR is not mergeable yet, in my goal to latency and occupancy balance
fir.c, I only got:~3000cycles normally~2000cycles with this hackContext on the hack that makes me achieve
~2000 cyclesMultiply latency seems to be 5, so accordingly the LP places 5 DV buffers.
When I did the manual placement I used 6 DV, (I think we debugged this together).
So the way I remember this, for
II=1with a 5-cycle component, we need 6 DV Buffers because:Is this the correct reasoning? If yes, with what constraint does the paper account for this? Or is my mistake maybe somewhere else.
Another Problem with Section 5, Equation 11:
The paper says:
When enabled, the solver concentrates occupancy on the wrong channels.
What we want is:
fork5→ N=6,fork3→ N=6 (matching L values)Got:
mux1→ N=5,fork5→ N=4 (4 FIFO + 1 DV instead of 6 DV)The solver satisfies the path sum cheaply by putting occupancy on narrow channels (mux output) instead of where LP1 added latency (fork outputs). How should Equation 11 be implemented correctly? Is there a constraint I am missing that's tying occupancy to specific channels? For now I just disabled it.
Debug Output: