Commit 02c7495

refactor: eliminate pylint duplicate-code violations in diffctx
1 parent 8c43227 commit 02c7495

93 files changed: +56295 -5980 lines changed

docs/Context-Selection-for-Git-Diff/main.tex

Lines changed: 8 additions & 6 deletions
@@ -14,7 +14,7 @@

 \author{
 Nikolay Eremeev \\
-\texttt{your-email@example.com} \\
+\texttt{nikolay.eremeev@outlook.com} \\
 \texttt{https://github.com/nikolay-e/treemapper}
 }

@@ -24,7 +24,7 @@
 \maketitle

 \begin{abstract}
-Retrieving optimal context for Large Language Models to understand code changes is a critical challenge in automated software engineering. Current approaches rely on naive windowing or purely lexical retrieval, missing complex structural dependencies. We formalize diff context selection as budgeted submodular maximization over Code Property Graphs, where relevance propagates via weighted structural dependencies from edited code using Personalized PageRank. This formulation enables: (1) principled optimization grounded in submodular function theory, (2) adaptive stopping based on marginal utility thresholds, and (3) unified treatment of structural and lexical signals as weighted graph edges. We propose a rigorous evaluation methodology using dependency coverage recall and fault localization accuracy as behavioral proxies for the latent notion of ``understanding.'' The framework offers a principled, scalable solution for processing diffs within strict token budgets imposed by LLM context windows.
+Retrieving optimal context for Large Language Models to understand code changes is a critical challenge in automated software engineering. Current approaches rely on naive windowing or purely lexical retrieval, missing complex structural dependencies. We formalize diff context selection as budgeted submodular maximization over Code Property Graphs, where relevance propagates via weighted structural dependencies from edited code using Personalized PageRank. This formulation enables: (1) optimization inspired by submodular function theory with empirically validated performance, (2) adaptive stopping based on marginal utility thresholds, and (3) unified treatment of structural and lexical signals as weighted graph edges. We propose a rigorous evaluation methodology using dependency coverage recall and fault localization accuracy as behavioral proxies for the latent notion of ``understanding.'' The framework offers a scalable solution for processing diffs within strict token budgets imposed by LLM context windows.
 \end{abstract}

 \section{Introduction}
@@ -97,7 +97,7 @@ \subsection{Code Property Graph}
 \label{tab:edges}
 \end{table}

-\textbf{Note on symbol reference edges}: We use lexical name matching within function scope as a heuristic approximation to data-flow analysis. True def-use chains require control-flow graph construction and reaching definitions analysis, which is infeasible for incomplete or unparsable code in git working trees. For dynamic languages (Python, JavaScript), static resolution captures approximately 60--70\% of actual dependencies.
+\textbf{Note on symbol reference edges}: We use lexical name matching within function scope as a heuristic approximation to data-flow analysis. True def-use chains require control-flow graph construction and reaching definitions analysis, which is infeasible for incomplete or unparsable code in git working trees. For dynamic languages (Python, JavaScript), static resolution coverage varies significantly by codebase characteristics: codebases with heavy metaprogramming or dynamic dispatch may achieve 40--60\% resolution, typical application code 50--70\%, and strongly typed code with type hints 70--85\%. These ranges require validation against runtime call graphs.

 Edge weights reflect empirical findings that developers predominantly follow structural links (call/data dependencies) rather than lexical search when navigating code~\cite{ko2006exploratory}. For typed languages, structural edges dominate; for dynamic languages, lexical signals compensate for incomplete static analysis.

@@ -222,7 +222,7 @@ \subsection{Experimental Design}
 \section{Discussion and Limitations}

 \paragraph{Language Coverage.}
-The framework requires language-specific tooling for AST parsing and name resolution. Planned initial support targets Python, TypeScript, Rust, and JavaScript. Dynamic languages present challenges due to incomplete static analysis of dynamic dispatch (estimated 60--70\% resolution rate).
+The framework requires language-specific tooling for AST parsing and name resolution. The current implementation supports Python, TypeScript, JavaScript, Go, Rust, Java, C, C++, Ruby, Kotlin, and Scala via tree-sitter parsing, with Python and JavaScript receiving dedicated semantic analysis. Dynamic languages present challenges due to incomplete static analysis of dynamic dispatch.

 \paragraph{Symbol Reference vs. Data-Flow.}
 Our ``symbol reference'' edges use lexical name matching within function scope as a heuristic. This is \textbf{not} true data-flow analysis, which requires CFG construction and reaching definitions computation---infeasible for incomplete code in git working trees. We explicitly acknowledge this limitation.
@@ -231,7 +231,7 @@ \section{Discussion and Limitations}
 The current PPR formulation propagates relevance along forward edges (callee direction). For understanding changes, backward dependencies (callers of modified functions, usages of modified variables) are often more important. Future work should add reverse edges or run PPR on the graph transpose.

 \paragraph{Hub Node Dominance.}
-Utility classes with high in-degree (Logger, Config, Utils) can accumulate excessive PPR mass. Mitigation: IDF-style damping $w'_{uv} = w_{uv} / \log(1 + \text{in\_degree}(v))$ for nodes exceeding the 95th percentile in-degree.
+Utility classes with high in-degree (Logger, Config, Utils) can accumulate excessive PPR mass. Our implementation applies IDF-style damping $w'_{uv} = w_{uv} / \log(1 + \text{in\_degree}(v))$ for nodes exceeding the 95th percentile in-degree. While this improves PPR distribution in preliminary testing, systematic validation across diverse codebases is required to confirm its effectiveness.

 \paragraph{Configuration Files.}
 Non-code files (YAML, JSON, TOML) lack structural dependencies. For these, lexical signals must dominate, with higher BM25 weights (0.4--0.5).
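
The hub damping rule in the Hub Node Dominance paragraph can be illustrated with a minimal Python sketch. This is not the treemapper implementation: the function names, the `(source, target, weight)` edge tuples, the nearest-rank percentile, and the choice to compute the percentile only over nodes with at least one incoming edge are all assumptions made for illustration.

```python
import math
from collections import Counter


def percentile_threshold(values, pct=95):
    """Nearest-rank percentile of a non-empty list (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def damp_hub_edges(edges, pct=95):
    """Divide w_uv by log(1 + in_degree(v)) when v exceeds the percentile."""
    in_degree = Counter(v for _, v, _ in edges)
    # Assumption: the percentile is taken over nodes that have incoming edges.
    threshold = percentile_threshold(list(in_degree.values()), pct)
    damped = []
    for u, v, w in edges:
        d = in_degree[v]
        # d > threshold >= 1 implies d >= 2, so log(1 + d) > 1 and w shrinks.
        if d > threshold:
            w = w / math.log(1 + d)
        damped.append((u, v, w))
    return damped


# Hypothetical graph: 40 functions that each call Logger and one private helper.
edges = [(f"f{i}", "Logger", 1.0) for i in range(40)]
edges += [(f"f{i}", f"g{i}", 1.0) for i in range(40)]
damped = damp_hub_edges(edges)
```

In this toy graph only the edges into `Logger` are rescaled; edges into the single-caller helpers keep their original weight.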
@@ -298,7 +298,9 @@ \section{Heuristics Annex}

 The following tactics are useful in practice but lack strong theoretical or empirical grounding. They should be evaluated via ablation before deployment.

-\paragraph{Hub Suppression.} Apply IDF-style penalty to edges into high in-degree nodes: $w'_{uv} = w_{uv} / \log(1 + \text{in\_degree}(v))$ when $\text{in\_degree}(v)$ exceeds the 95th percentile.
+\paragraph{Relatedness Bonus.} The implementation includes a minimum marginal gain for fragments with high PPR relevance ($\geq 0.10$) and concept overlap: $\text{gain}_{\min} = \text{rel} \times 0.25 \times \min(|\text{covered}|, 5)$. This ensures semantically related fragments are included even when concepts are already covered by core fragments. Additionally, a fallback bonus of $\text{rel} \times 0.25$ applies to high-PPR fragments regardless of concept overlap, ensuring structurally related fragments (via call graph) are included. \textbf{Warning:} This modification may break strict submodularity of the utility function, invalidating worst-case approximation guarantees. Ablation studies comparing performance with and without the bonus are required.
+
+\paragraph{Hub Suppression.} Apply IDF-style penalty to edges into high in-degree nodes: $w'_{uv} = w_{uv} / \log(1 + \text{in\_degree}(v))$ when $\text{in\_degree}(v)$ exceeds the 95th percentile. This is implemented globally during graph construction.

 \paragraph{Backward Edge Weighting.} Weight caller$\to$callee edges at 0.7--0.8$\times$ of callee$\to$caller edges to prioritize impact analysis over dependency tracing.
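
The Relatedness Bonus formula can be read as a floor on a fragment's marginal gain. The sketch below is one possible reading, not the implementation: it assumes that $|\text{covered}|$ counts the fragment's concepts that are already covered, and that the floor is combined with the ordinary marginal gain by taking the maximum. The function and parameter names are hypothetical.

```python
def gain_with_relatedness_bonus(marginal_gain, rel, fragment_concepts,
                                covered_concepts, rel_threshold=0.10,
                                scale=0.25, cap=5):
    """Return the marginal gain after applying the minimum-gain floor."""
    # Fragments below the PPR relevance threshold get no bonus.
    if rel < rel_threshold:
        return marginal_gain
    # Assumption: |covered| = fragment concepts that are already covered.
    overlap = len(set(fragment_concepts) & set(covered_concepts))
    if overlap > 0:
        floor = rel * scale * min(overlap, cap)
    else:
        # Fallback bonus for high-PPR fragments without concept overlap.
        floor = rel * scale
    # Assumption: the floor replaces the gain only when it is larger.
    return max(marginal_gain, floor)


with_overlap = gain_with_relatedness_bonus(0.0, 0.4, {"a", "b"}, {"a", "b", "c"})
fallback = gain_with_relatedness_bonus(0.0, 0.4, {"x"}, {"a", "b", "c"})
below_threshold = gain_with_relatedness_bonus(0.0, 0.05, {"a"}, {"a"})
```

Because the floor grows with the number of already-covered concepts, a fragment's gain can increase as the selected set grows, which is the submodularity concern the warning in the paragraph refers to.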

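
The Discussion suggests running PPR on the graph transpose so that relevance reaches callers of edited code. A minimal pure-Python sketch of that idea follows. The restart probability `alpha=0.15`, the fixed iteration count, the uniform edge weights, and the handling of dangling nodes (their mass returns to the seeds) are assumptions, and the node names are hypothetical.

```python
from collections import defaultdict


def reverse_edges(edges):
    """Return the transpose: each (u, v, w) becomes (v, u, w)."""
    return [(v, u, w) for (u, v, w) in edges]


def personalized_pagerank(edges, seeds, alpha=0.15, iters=50):
    """Power-iteration PPR with restart probability alpha (assumed value)."""
    nodes = {n for u, v, _ in edges for n in (u, v)} | set(seeds)
    out = defaultdict(list)
    for u, v, w in edges:
        out[u].append((v, w))
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart[n] for n in nodes}
        for u in nodes:
            total = sum(w for _, w in out[u])
            if total == 0:
                # Dangling node: return its mass to the seed distribution.
                for n in nodes:
                    nxt[n] += (1 - alpha) * score[u] * restart[n]
                continue
            for v, w in out[u]:
                nxt[v] += (1 - alpha) * score[u] * w / total
        score = nxt
    return score


# caller -> edited -> helper, with the edited function as the only seed.
edges = [("caller", "edited", 1.0), ("edited", "helper", 1.0)]
forward = personalized_pagerank(edges, {"edited"})
backward = personalized_pagerank(reverse_edges(edges), {"edited"})
```

On the forward graph the caller receives no relevance because nothing points to it; on the transpose the caller receives mass and the helper does not. A combined graph with both directions, weighted as in the Backward Edge Weighting paragraph, would sit between these two extremes.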