[ZIPT Benchmark] ZIPT Benchmark: Z3 c3 branch — 2026-03-15 #9002

2026-03-15T17:45:45Z

github-actions[bot]
bot Mar 15, 2026

Date: 2026-03-15
Branch: c3
Benchmark set: QF_S (50 randomly selected files from tests/QF_S.tar.zst, total pool: 22,172 files)
Timeout: 10 s per benchmark (-T:10 for Z3 solvers; -t:10000 for ZIPT)
Z3 build: Debug mode (commit d53846d) | ZIPT: parikh branch

Summary

Metric	seq solver	nseq solver	ZIPT solver
sat	25	7	32
unsat	13	8	12
unknown	12	35	1
timeout	0	0	0
bug/crash	0	0	5
Total time (s)	74.153	291.177	24.359
Avg time/benchmark (s)	1.483	5.824	0.487

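Note on nseq "unknown": nseq returned unknown on 35 files. On 29 of these it ran for about 10 s, hitting Z3's internal 10 s wall (-T:10): the solver emits unknown rather than being killed by the outer timeout, so these are effectively timeouts even though the timeout row above shows 0. The remaining 6 returned unknown in under 0.1 s.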
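
One way to make that reclassification explicit is to relabel any unknown whose runtime reaches the limit. This is a minimal sketch, assuming results are available as (verdict, seconds) pairs; the function name and the 0.05 s slack are illustrative choices, not part of the benchmark workflow.

```python
TIME_LIMIT_S = 10.0  # matches -T:10
SLACK_S = 0.05       # assumed tolerance for timer jitter (hypothetical value)

def effective_verdict(verdict: str, seconds: float) -> str:
    """Relabel an 'unknown' that ran up to the time limit as 'timeout'."""
    if verdict == "unknown" and seconds >= TIME_LIMIT_S - SLACK_S:
        return "timeout"
    return verdict

# Three nseq rows copied from the per-file table in this report.
sample = [("unknown", 10.008), ("unknown", 0.052), ("unsat", 0.082)]
print([effective_verdict(v, t) for v, t in sample])
```

With this rule, a fast unknown (such as the 0.052 s row) stays unknown, which keeps genuine "gave up early" cases separate from budget exhaustion.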

Note on ZIPT "bug": All 5 ZIPT bug entries are files that use str.replace_all, which ZIPT's parikh branch does not yet support. It prints Unsupported feature: str.replace_all and exits without a verdict.
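
The "bug" bucket can be derived mechanically from solver output. The sketch below is an assumption about how such a classifier might look, not the workflow's actual script: only the `Unsupported feature:` message is taken from this report, while the function name and the convention that a verdict appears on a line of its own are hypothetical.

```python
def classify_output(stdout: str) -> str:
    """Map raw solver output to one of: sat, unsat, unknown, bug."""
    lines = [ln.strip() for ln in stdout.splitlines() if ln.strip()]
    # An unsupported-feature message means no trustworthy verdict was produced.
    if any(ln.startswith("Unsupported feature:") for ln in lines):
        return "bug"
    # Assumed convention: the verdict is printed alone on a line.
    for ln in lines:
        if ln.lower() in ("sat", "unsat", "unknown"):
            return ln.lower()
    return "bug"  # the solver exited without printing any verdict

print(classify_output("Unsupported feature: str.replace_all\n"))
```

Checking for the error message before looking for a verdict is a deliberate choice here: it prevents a verdict printed after a failure from being counted as a real answer.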

Soundness disagreements (any two solvers return conflicting sat/unsat): 1 ⚠️

Notable Issues

⚠️ Soundness Disagreements (Critical)

File	seq	nseq	ZIPT
`Lehmann-Rabin_sat_non_incre_equiv_trans_16_1.smt2`	unsat	unsat	sat ← ZIPT

Both Z3 solvers (seq and nseq) agree on unsat. ZIPT hits an internal exception (Specified method is not supported) mid-solving, then falls back to emitting SAT. This is a ZIPT soundness regression: when an unsupported operation is encountered during solving, ZIPT should return unknown rather than SAT. The benchmark's :status annotation is unknown (not sat), so the file itself provides no ground truth; the agreement of seq and nseq on unsat, together with ZIPT's exception, makes ZIPT's sat the suspect verdict.
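
The disagreement criterion stated in the summary (any two solvers returning conflicting sat/unsat) can be written as a short check. This is a sketch with a hypothetical function name; it treats unknown, timeout, and bug as non-conflicting.

```python
def has_soundness_disagreement(verdicts: dict[str, str]) -> bool:
    """True if at least one solver says sat and another says unsat."""
    definite = {v for v in verdicts.values() if v in ("sat", "unsat")}
    return len(definite) == 2

# The Lehmann-Rabin row from the table above.
row = {"seq": "unsat", "nseq": "unsat", "ZIPT": "sat"}
print(has_soundness_disagreement(row))
```

A row such as `{"seq": "sat", "nseq": "unknown", "ZIPT": "sat"}` is not flagged, because unknown does not contradict a definite answer.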

🐛 Unsupported Feature (ZIPT) — str.replace_all

ZIPT's parikh branch does not implement str.replace_all. Five benchmarks hit this path and produce no verdict:

pcp_instance_23.smt2
benchmark_0294.smt2
unsolved_pcp_instance_409.smt2
slog_stranger_149_sink.smt2
benchmark_0356.smt2

⏱ Slow Benchmarks (any solver > 8 s)

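29 files had at least one solver exceed 8 s, and nseq was slow on every one of them: it ran to its 10 s limit on 29 of 50 benchmarks (58%), reflecting a systematic weakness on the automatark-lu and similar str.substr-heavy instances. ZIPT returned in under 1 s on 48 of 50 files (including the 5 it could not handle); the exceptions were 04_track_27.smt2 (1.017 s) and 03_track_58.smt2 (10.146 s).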

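Only one file was unsolved by all three solvers: 03_track_58.smt2 (seq 5.0 s, nseq 10.0 s, ZIPT 10.1 s; all three returned unknown). seq's 5.0 s is below the 8 s threshold used in this section, so strictly only nseq and ZIPT exceeded it on this file.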

Trace Analysis: seq-fast / nseq-slow Hypotheses

Definition: seq_time < 1.0 s AND nseq_time > 3 × seq_time AND nseq_time > 0.5 s.

16 candidates were identified (mostly from the 20230329-automatark-lu family).
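
The definition above translates directly into a predicate. The function name is hypothetical; the thresholds are the ones stated in the definition, and the sample timings are copied from the per-file table.

```python
def is_seq_fast_nseq_slow(seq_time: float, nseq_time: float) -> bool:
    """seq under 1 s, nseq more than 3x slower, and nseq above 0.5 s."""
    return seq_time < 1.0 and nseq_time > 3 * seq_time and nseq_time > 0.5

rows = {
    "instance03410.smt2": (0.171, 10.008),  # qualifies
    "instance01978.smt2": (1.145, 10.008),  # seq is not under 1 s
    "instance00496.smt2": (0.035, 0.023),   # nseq is not slow
}
for name, (seq_t, nseq_t) in rows.items():
    print(name, is_seq_fast_nseq_slow(seq_t, nseq_t))
```

The third condition (nseq above 0.5 s) filters out files where both solvers finish in milliseconds and the 3× ratio would only reflect noise.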

Pattern common to all 16 candidates

The seq traces are dominated by repeated mk_eq_core entries at seq_rewriter.cpp:5193, resolving equations of the form:

"\u{a}" == (str.substr X (+ (- 1) (str.len X)) 1)
(str.substr X 1 (+ (- 1) (str.len X))) == "\u{a}"

followed by enque_axiom calls that add length constraints and concrete character axioms (seq.unit Char[N] for each character in the alphabet). seq's rewriter can directly reduce a substr-equals-unit equation into character and length constraints through its built-in mk_eq_core specialisation, and the resulting arithmetic + character constraints are quickly dispatched.
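
To see why such equations collapse into a length constraint plus a character constraint, the two forms above can be checked by brute force against the SMT-LIB definition of str.substr. This is an illustrative sketch only: it does not reproduce seq's rewriter, and the "reduced" right-hand sides are my reading of the SMT-LIB semantics rather than something taken from the traces.

```python
from itertools import product

def smt_substr(s: str, i: int, n: int) -> str:
    """SMT-LIB str.substr: empty unless 0 <= i < len(s) and n > 0."""
    if 0 <= i < len(s) and n > 0:
        return s[i:i + n]  # slicing truncates at the end of s
    return ""

NL = "\n"  # the "\u{a}" literal

def eq_last(x: str) -> bool:
    # "\u{a}" == (str.substr X (+ (- 1) (str.len X)) 1)
    return smt_substr(x, len(x) - 1, 1) == NL

def eq_tail(x: str) -> bool:
    # (str.substr X 1 (+ (- 1) (str.len X))) == "\u{a}"
    return smt_substr(x, 1, len(x) - 1) == NL

def reduced_last(x: str) -> bool:
    return len(x) >= 1 and x[-1] == NL

def reduced_tail(x: str) -> bool:
    return len(x) == 2 and x[1] == NL

# Exhaustive check over all strings of length <= 4 on a two-symbol alphabet.
for length in range(5):
    for chars in product("a\n", repeat=length):
        x = "".join(chars)
        assert eq_last(x) == reduced_last(x)
        assert eq_tail(x) == reduced_tail(x)
```

Both reduced forms mention only str.len and a single character position, which is the kind of constraint an arithmetic and character solver can handle without any word-equation reasoning.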

Hypothesis for nseq slowdown: nseq's Nielsen graph engine is designed for string-equation unification (word equations of the form u = v over concatenation), but the benchmark constraints are dominated by str.substr operations with arithmetic offset expressions ((str.len X) - 1). The Nielsen graph's simplification and extension rules (ConstNielsen, Det, EqSplit) do not directly decompose substr terms. These must first be axiomatised as auxiliary string equations, creating a large equation set that the iterative-deepening DFS (depth 10, 20, 40, …) explores very slowly. By contrast, seq calls its arithmetic length solver immediately once characters are enumerated. The result is that nseq exhausts its 10 s budget expanding fruitless extensions, while seq terminates in under 1 s (under 0.2 s in the three most extreme cases) by directly propagating the character/length constraints inferred from mk_eq_core.
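
For readers unfamiliar with the search strategy named in the hypothesis, here is a generic iterative-deepening DFS with doubling depth bounds. It is a toy sketch on an abstract successor function, not nseq's implementation; all names and the `max_bound` cutoff are hypothetical.

```python
def bounded_dfs(node, is_goal, successors, depth_left: int) -> bool:
    """Depth-first search that gives up below the given depth."""
    if is_goal(node):
        return True
    if depth_left == 0:
        return False
    return any(
        bounded_dfs(child, is_goal, successors, depth_left - 1)
        for child in successors(node)
    )

def iterative_deepening(start, is_goal, successors,
                        first_bound: int = 10, max_bound: int = 40):
    """Return the first depth bound at which a goal is found, else None."""
    bound = first_bound
    while bound <= max_bound:
        if bounded_dfs(start, is_goal, successors, bound):
            return bound
        bound *= 2  # 10, 20, 40, ...
    return None

# Toy search space: a single chain 0 -> 1 -> 2 -> ..., goal at depth 15.
print(iterative_deepening(0, lambda n: n == 15, lambda n: [n + 1]))
```

Each failed round is discarded and the next one restarts from the root, so with a large branching factor (for example, many auxiliary equations to split on) most of the budget can go to re-exploring shallow levels.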

Three representative cases with the most extreme ratio:

File	seq	nseq	ratio
`instance03410.smt2`	0.171 s	10.008 s	59×
`instance04442.smt2`	0.136 s	10.007 s	74×
`instance01911.smt2`	0.103 s	10.008 s	97×
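
The ratio column can be recomputed from the two time columns. The sketch below assumes rounding to the nearest integer, since the report does not state its rounding rule; the function name is hypothetical.

```python
def slowdown_ratio(seq_time: float, nseq_time: float) -> int:
    """nseq time divided by seq time, rounded to the nearest integer."""
    return round(nseq_time / seq_time)

for name, seq_t, nseq_t in [
    ("instance03410.smt2", 0.171, 10.008),
    ("instance04442.smt2", 0.136, 10.007),
    ("instance01911.smt2", 0.103, 10.008),
]:
    print(f"{name}: {slowdown_ratio(seq_t, nseq_t)}x")
```

Because nseq returned unknown at its 10 s limit on all three files, these ratios are lower bounds on the true slowdown rather than exact measurements.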

Per-File Results


#	File	seq verdict	seq time (s)	nseq verdict	nseq time (s)	ZIPT verdict	ZIPT time (s)	Notes
1	`instance01978.smt2`	sat	1.145	unknown	10.008	sat	0.316
2	`slog_stranger_4602_sink.smt2`	unknown	5.009	unknown	0.052	sat	0.749
3	`instance03410.smt2`	sat	0.171	unknown	10.008	sat	0.207
4	`instance02157.smt2`	sat	1.637	unknown	10.008	sat	0.213
5	`instance13640.smt2`	unsat	0.188	unknown	10.008	unsat	0.236
6	`instance02503.smt2`	sat	2.815	unknown	10.008	sat	0.213
7	`instance00368.smt2`	unknown	5.010	unknown	10.008	sat	0.237
8	`instance04442.smt2`	sat	0.136	unknown	10.007	sat	0.226
9	`instance01673.smt2`	sat	0.520	unknown	10.008	sat	0.233
10	`instance13383.smt2`	unsat	2.331	unsat	0.082	unsat	0.231
11	`instance08935.smt2`	unsat	0.413	unsat	0.148	unsat	0.341
12	`instance03644.smt2`	sat	0.806	unknown	10.008	sat	0.259
13	`instance15968.smt2`	unsat	0.134	unsat	0.029	unsat	0.329
14	`instance15889.smt2`	unsat	0.595	unknown	0.054	unsat	0.330
15	`instance01419.smt2`	sat	0.662	unknown	10.008	sat	0.235
16	`pcp_instance_23.smt2`	unknown	0.250	unknown	10.008	bug	0.139	ZIPT: str.replace_all unsupported
17	`instance01911.smt2`	sat	0.103	unknown	10.008	sat	0.220
18	`04_track_27.smt2`	unsat	4.057	unsat	0.090	unsat	1.017
19	`instance00496.smt2`	sat	0.035	sat	0.023	sat	0.306
20	`benchmark_0294.smt2`	unknown	0.898	unknown	0.034	bug	0.133	ZIPT: str.replace_all unsupported
21	`instance02487.smt2`	sat	0.645	unknown	10.008	sat	0.294
22	`unsolved_pcp_instance_409.smt2`	unknown	0.231	unknown	10.009	bug	0.145	ZIPT: str.replace_all unsupported
23	`03_track_58.smt2`	unknown	5.009	unknown	10.020	unknown	10.146
24	`Lehmann-Rabin_sat_non_incre_equiv_trans_16_1.smt2`	unsat	0.030	unsat	0.027	sat	0.260	SOUNDNESS_DISAGREEMENT (ZIPT bug)
25	`instance08441.smt2`	sat	3.350	unknown	10.015	sat	0.329
26	`query3429.smt2`	sat	0.542	sat	0.039	sat	0.244
27	`01_track_40.smt2`	sat	0.172	sat	0.038	sat	0.305
28	`instance00500.smt2`	sat	0.260	unknown	10.008	sat	0.232
29	`instance08627.smt2`	unknown	5.009	unknown	10.009	sat	0.377
30	`instance12864.smt2`	unsat	0.056	unsat	0.025	unsat	0.235
31	`instance11639.smt2`	unsat	0.526	unknown	10.008	unsat	0.327
32	`instance04136.smt2`	sat	0.055	sat	0.027	sat	0.211
33	`instance01318.smt2`	sat	0.815	unknown	10.008	sat	0.195
34	`instance08348.smt2`	sat	1.296	sat	0.027	sat	0.301
35	`instance12749.smt2`	sat	3.619	unknown	10.008	sat	0.314
36	`slog_stranger_582_sink.smt2`	sat	0.182	sat	0.038	sat	0.394
37	`03_track_26.smt2`	unsat	0.033	unsat	0.022	unsat	0.244
38	`slog_stranger_2174_sink.smt2`	sat	0.743	unknown	0.037	sat	0.357
39	`slog_stranger_149_sink.smt2`	unknown	5.009	unknown	10.012	bug	0.164	ZIPT: str.replace_all unsupported
40	`instance15442.smt2`	unknown	5.009	unknown	10.008	sat	0.265
41	`instance01226.smt2`	sat	0.133	unknown	10.008	sat	0.210
42	`instance14019.smt2`	unsat	0.451	unknown	10.008	unsat	0.283
43	`instance10623.smt2`	unsat	1.594	unknown	10.008	unsat	0.307
44	`slog_stranger_410_sink.smt2`	unsat	0.033	unsat	0.024	unsat	0.286
45	`benchmark_0356.smt2`	unknown	0.879	unknown	0.035	bug	0.137	ZIPT: str.replace_all unsupported
46	`slog_stranger_4304_sink.smt2`	sat	0.587	unknown	0.035	sat	0.374
47	`instance03530.smt2`	sat	0.786	unknown	10.008	sat	0.220
48	`slog_stranger_3451_sink.smt2`	sat	0.165	sat	0.035	sat	0.357
49	`instance11803.smt2`	unknown	5.009	unknown	10.008	sat	0.474
50	`instance01344.smt2`	unknown	5.010	unknown	10.008	sat	0.202

Generated automatically by the ZIPT Benchmark workflow on the c3 branch.

AI generated by Qf S Benchmark

expires on Mar 22, 2026, 5:45 PM UTC

2026-03-18T04:40:27Z

github-actions[bot]
bot Mar 18, 2026
Author

This discussion has been marked as outdated by Qf S Benchmark.

A newer discussion is available at Discussion #9031.
