[ZIPT Benchmark] ZIPT Benchmark: Z3 c3 branch — 2026-03-15 #9002
Closed
This discussion has been marked as outdated by Qf S Benchmark. A newer discussion is available at Discussion #9031.
Date: 2026-03-15
Branch: c3
Benchmark set: QF_S (50 randomly selected files from tests/QF_S.tar.zst, total pool: 22,172 files)
Timeout: 10 s per benchmark (-T:10 for Z3 solvers; -t:10000 for ZIPT)
Z3 build: Debug mode (commit d53846d) | ZIPT: parikh branch

Summary
Soundness disagreements (any two solvers return conflicting sat/unsat): 1 ⚠️
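The disagreement criterion above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the workflow's actual script: the function name `soundness_disagreements` and the dictionary layout are assumptions. The example verdicts are the ones this report gives for Lehmann-Rabin_sat_non_incre_equiv_trans_16_1.smt2.

```python
from itertools import combinations

# Only these verdicts count as definite claims; unknown, timeouts and
# errors can never produce a soundness disagreement.
DEFINITE = {"sat", "unsat"}


def soundness_disagreements(verdicts):
    """Return the solver pairs whose definite verdicts conflict.

    `verdicts` maps a solver name to its verdict string (case-insensitive).
    This helper is hypothetical and only mirrors the rule in the summary.
    """
    definite = {
        solver: verdict.lower()
        for solver, verdict in verdicts.items()
        if verdict.lower() in DEFINITE
    }
    return [
        (a, b)
        for a, b in combinations(sorted(definite), 2)
        if definite[a] != definite[b]
    ]


# Verdicts reported for the Lehmann-Rabin benchmark in this run.
lehmann_rabin = {"seq": "unsat", "nseq": "unsat", "zipt": "SAT"}
print(soundness_disagreements(lehmann_rabin))
```

Treating every non-definite answer as neutral is the design choice that matters here: a solver that answers unknown never shows up as a disagreement, which is why an unknown from ZIPT would have been harmless on this file.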
Notable Issues
Lehmann-Rabin_sat_non_incre_equiv_trans_16_1.smt2
Both Z3 solvers (seq and nseq) agree on unsat. ZIPT claims sat after hitting an internal exception (Specified method is not supported) mid-solving and then falling back to emitting SAT. This is a ZIPT soundness regression: when an unsupported operation is encountered inside a satisfiability proof attempt, ZIPT should return unknown rather than SAT. The benchmark's :status annotation is unknown (not sat), so seq/nseq's unsat is the stronger claim, but ZIPT's incorrect SAT is the critical issue.

🐛 Unsupported Feature (ZIPT): str.replace_all
ZIPT's parikh branch does not implement str.replace_all. Five benchmarks hit this path and produce no verdict:
- pcp_instance_23.smt2
- benchmark_0294.smt2
- unsolved_pcp_instance_409.smt2
- slog_stranger_149_sink.smt2
- benchmark_0356.smt2

⏱ Slow Benchmarks (any solver > 8 s)
29 files had at least one solver exceed 8 s. nseq timed out on 29 of 50 benchmarks (58%), reflecting a systematic weakness on the automatark-lu and similar str.substr-heavy instances. ZIPT was fast (<1 s) on almost all solvable instances.
Only one file had all three solvers slow: 03_track_58.smt2 (seq 5.0 s, nseq 10.0 s, ZIPT 10.1 s; all three returned unknown).

Trace Analysis: seq-fast / nseq-slow Hypotheses
Definition: seq_time < 1.0 s AND nseq_time > 3 × seq_time AND nseq_time > 0.5 s.
16 candidates were identified (mostly from the 20230329-automatark-lu family).

Pattern common to all 16 candidates
The seq traces are dominated by repeated mk_eq_core entries at seq_rewriter.cpp:5193, resolving equations in which a substr term equals a unit string, followed by enque_axiom calls that add length constraints and concrete character axioms (seq.unit Char[N] for each character in the alphabet). seq's rewriter can directly reduce a substr-equals-unit equation into character and length constraints through its built-in mk_eq_core specialisation, and the resulting arithmetic and character constraints are quickly dispatched.

Hypothesis for nseq slowdown: nseq's Nielsen graph engine is designed for string-equation unification (word equations of the form u = v over concatenation), but the benchmark constraints are dominated by str.substr operations with arithmetic offset expressions such as (str.len X) - 1. The Nielsen graph's simplification and extension rules (ConstNielsen, Det, EqSplit) do not directly decompose substr terms; these must first be axiomatised as auxiliary string equations, creating a large equation set that the iterative-deepening DFS (depth 10, 20, 40, …) explores very slowly. By contrast, seq calls its arithmetic length solver immediately once characters are enumerated. As a result, nseq exhausts its 10 s budget expanding fruitless extensions, while seq terminates in <0.2 s by directly propagating the character and length constraints inferred from mk_eq_core.

Three representative cases with the most extreme ratio:
- instance03410.smt2
- instance04442.smt2
- instance01911.smt2

Per-File Results
- instance01978.smt2
- slog_stranger_4602_sink.smt2
- instance03410.smt2
- instance02157.smt2
- instance13640.smt2
- instance02503.smt2
- instance00368.smt2
- instance04442.smt2
- instance01673.smt2
- instance13383.smt2
- instance08935.smt2
- instance03644.smt2
- instance15968.smt2
- instance15889.smt2
- instance01419.smt2
- pcp_instance_23.smt2
- instance01911.smt2
- 04_track_27.smt2
- instance00496.smt2
- benchmark_0294.smt2
- instance02487.smt2
- unsolved_pcp_instance_409.smt2
- 03_track_58.smt2
- Lehmann-Rabin_sat_non_incre_equiv_trans_16_1.smt2
- instance08441.smt2
- query3429.smt2
- 01_track_40.smt2
- instance00500.smt2
- instance08627.smt2
- instance12864.smt2
- instance11639.smt2
- instance04136.smt2
- instance01318.smt2
- instance08348.smt2
- instance12749.smt2
- slog_stranger_582_sink.smt2
- 03_track_26.smt2
- slog_stranger_2174_sink.smt2
- slog_stranger_149_sink.smt2
- instance15442.smt2
- instance01226.smt2
- instance14019.smt2
- instance10623.smt2
- slog_stranger_410_sink.smt2
- benchmark_0356.smt2
- slog_stranger_4304_sink.smt2
- instance03530.smt2
- slog_stranger_3451_sink.smt2
- instance11803.smt2
- instance01344.smt2

Generated automatically by the ZIPT Benchmark workflow on the c3 branch.