[ZIPT Benchmark] Z3 c3 branch — 2026-03-18 #9031

2026-03-18T04:40:18Z

github-actions[bot]
bot Mar 18, 2026

ZIPT Benchmark Report — Z3 c3 branch

Date: 2026-03-18
Branch: c3
Benchmark set: QF_S (50 randomly selected files from tests/QF_S.tar.zst, total 22,172 available)
Timeout: 5 seconds per benchmark for seq (-T:5, run with -tr:seq tracing); 10 seconds per benchmark for nseq (-T:10) and ZIPT (-t:10000)
Build: Debug mode (CMake + Ninja, CMAKE_BUILD_TYPE=Debug)

Note on ZIPT build: The .NET bindings were built using a custom net8.0-targeted project (NuGet network access was blocked in this environment). ZIPT was compiled from the parikh branch at https://github.com/CEisenhofer/ZIPT.

Summary

Metric	seq solver	nseq solver	ZIPT solver
sat	26	19	30
unsat	15	22	15
unknown	9	9	0
timeout	0	0	0
bug/crash	0	0	5
Total time (s)	40.428	82.651	23.650
Avg time/benchmark (s)	0.809	1.653	0.473

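Times include nseq hitting the 10s wall on 8 benchmarks and seq hitting its 5s wall on 4 benchmarks. These runs are counted as unknown, not as timeout. The ninth nseq unknown (instance13790.smt2) returned in 0.094s.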
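The summary counts can be re-derived from the per-file table. A minimal sketch, assuming each table row has already been parsed into a (file, seq, nseq, ZIPT) tuple; the `tally` helper and the choice of four sample rows are illustrative and not part of the benchmark workflow:

```python
from collections import Counter

# Four rows copied from the per-file table: (file, seq, nseq, ZIPT) verdicts.
ROWS = [
    ("instance14787.smt2", "unknown", "unsat", "sat"),
    ("instance08140.smt2", "sat", "sat", "sat"),
    ("pcp_instance_252.smt2", "unknown", "unknown", "bug"),
    ("slog_stranger_807_sink.smt2", "unsat", "unsat", "unsat"),
]

SOLVERS = ("seq", "nseq", "ZIPT")


def tally(rows):
    """Count verdicts per solver (hypothetical helper)."""
    counts = {name: Counter() for name in SOLVERS}
    for _file, *verdicts in rows:
        for name, verdict in zip(SOLVERS, verdicts):
            counts[name][verdict] += 1
    return counts


summary = tally(ROWS)
for name in SOLVERS:
    print(name, dict(summary[name]))
```

Running the same tally over all 50 rows should reproduce the sat/unsat/unknown/bug rows of the summary table.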

Soundness disagreements (any two solvers return conflicting sat/unsat): 7

Notable Issues

⚠️ Soundness Disagreements (Critical) — 7 files

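All 7 disagreements follow the same pattern: nseq returns unsat while ZIPT returns sat. seq agrees with ZIPT (sat) on 5 of the 7 files and returns unknown on the other 2 (instance14787.smt2 and slog_stranger_4808_sink.smt2). This strongly suggests a soundness bug in the nseq solver.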

File	seq	nseq	ZIPT	Disagreeing pair
instance14787.smt2	unknown	unsat	sat	nseq vs ZIPT
instance05326.smt2	sat	unsat	sat	seq vs nseq, nseq vs ZIPT
instance08321.smt2	sat	unsat	sat	seq vs nseq, nseq vs ZIPT
slog_stranger_770_sink.smt2	sat	unsat	sat	seq vs nseq, nseq vs ZIPT
slog_stranger_4808_sink.smt2	unknown	unsat	sat	nseq vs ZIPT
instance02344.smt2	sat	unsat	sat	seq vs nseq, nseq vs ZIPT
instance11266.smt2	sat	unsat	sat	seq vs nseq, nseq vs ZIPT

The consistent pattern (nseq claims unsat, ZIPT claims sat, and seq never sides with nseq) points to a false-unsatisfiability bug in the nseq (Nielsen-graph-based) solver. The slog_stranger files are real-world web security string constraints; the instance* files are from the AutomatArk benchmark suite. Validating the model that seq or ZIPT returns for one representative file (e.g. instance02344.smt2, sat per both) against that file's assertions would confirm the bug.
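The "Disagreeing pair" column can be computed mechanically. A minimal sketch, assuming a disagreement means one solver answers sat and another unsat, with unknown and bug never counting; the `disagreeing_pairs` helper is hypothetical:

```python
from itertools import combinations

DEFINITE = {"sat", "unsat"}


def disagreeing_pairs(verdicts):
    """Return solver pairs where one says sat and the other unsat.

    `verdicts` maps solver name -> verdict string; unknown, timeout and bug
    results never count as a disagreement.
    """
    pairs = []
    for (a, va), (b, vb) in combinations(verdicts.items(), 2):
        if va in DEFINITE and vb in DEFINITE and va != vb:
            pairs.append(f"{a} vs {b}")
    return pairs


# Verdicts copied from the table above.
print(disagreeing_pairs({"seq": "unknown", "nseq": "unsat", "ZIPT": "sat"}))
print(disagreeing_pairs({"seq": "sat", "nseq": "unsat", "ZIPT": "sat"}))
```

The first call corresponds to instance14787.smt2 and the second to instance05326.smt2.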

🐛 Crashes / Bugs — 5 files (all in ZIPT)

ZIPT fails with an error/unsupported-operation on all 5 PCP-encoding benchmarks. These files use str.replace_all chained through multiple variables to encode Post Correspondence Problem instances. ZIPT apparently does not support str.replace_all.

File	ZIPT output
pcp_instance_252.smt2	bug (error/unsupported)
pcp_instance_160.smt2	bug
pcp_instance_347.smt2	bug
unsolved_pcp_instance_409.smt2	bug
unsolved_pcp_instance_381.smt2	bug

🐢 Slow Benchmarks (> 8s for any solver)

File	Slow solver	Time (s)
pcp_instance_252.smt2	nseq	10.011
pcp_instance_160.smt2	nseq	10.011
pcp_instance_347.smt2	nseq	10.011
unsolved_pcp_instance_409.smt2	nseq	10.011
unsolved_pcp_instance_381.smt2	nseq	10.011
instance14408.smt2	nseq	10.009
slog_stranger_2780_sink.smt2	nseq	10.011
instance08497.smt2	nseq	10.012
slog_stranger_770_sink.smt2	ZIPT	8.809

🔍 Trace Analysis: seq-fast / nseq-slow Hypotheses

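Six files satisfied the criterion seq_time < 1.0s AND nseq_time > 3×seq_time AND nseq_time > 0.5s: the five PCP instances and slog_stranger_2780_sink.smt2. A seventh file, instance08497.smt2, misses the seq_time threshold (1.114s) but is analysed below as well.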
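The selection criterion is easy to state as a predicate. A minimal sketch, with times copied from the per-file table; the function name `seq_fast_nseq_slow` is an assumption made for illustration:

```python
def seq_fast_nseq_slow(seq_time, nseq_time):
    """Criterion quoted above: seq < 1.0 s, nseq > 3x seq, nseq > 0.5 s."""
    return seq_time < 1.0 and nseq_time > 3 * seq_time and nseq_time > 0.5


# (file, seq time, nseq time) copied from the per-file table.
TIMES = [
    ("pcp_instance_252.smt2", 0.233, 10.011),
    ("slog_stranger_2780_sink.smt2", 0.673, 10.011),
    ("instance08497.smt2", 1.114, 10.012),
    ("instance06841.smt2", 1.185, 0.598),
    ("instance08140.smt2", 0.153, 0.031),
]

selected = [name for name, s, n in TIMES if seq_fast_nseq_slow(s, n)]
print(selected)
```

On this sample only the first two files pass; instance08497.smt2 is rejected by the seq_time < 1.0s clause.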

pcp_instance_252.smt2, pcp_instance_160.smt2, pcp_instance_347.smt2, unsolved_pcp_instance_409.smt2, unsolved_pcp_instance_381.smt2
(seq: ~0.22s (unknown), nseq: 10.011s (unknown))

These five benchmarks are PCP instances encoded with chained str.replace_all operations (e.g., x_1 = str.replace_all(x_0, "2", Top_0), x_2 = str.replace_all(x_1, "3", Top_1), etc.). The seq trace shows that seq immediately recognises and simplifies the equations using mk_eq_core (rewriter at seq_rewriter.cpp:5196), reducing each replace_all with a concrete replacement string into an explicit equation within a few hundred trace entries. Although seq ultimately returns unknown (it cannot conclude sat/unsat for the undecidable PCP), it does so quickly because its rewriter handles str.replace_all eagerly.

The nseq solver, by contrast, hits the 10s timeout on all five. The likely explanation is that nseq's Nielsen-graph engine does not natively reduce str.replace_all at the rewrite level — it probably expands it into an auxiliary sequence of equations or recursive constraints, producing a large Nielsen graph that exhausts the depth budget. The absence of specialised replace_all normalisation in nseq means it falls back to a more expensive general-purpose enumeration, rather than the direct symbolic fixpoint that seq's rewriter employs.
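To make the chained-replacement idea concrete, here is a small Python sketch. It assumes (the report does not show the full encoding) that an index word over digits is rewritten once with the top tiles and once with the bottom tiles, and that a PCP solution is an index word for which both results coincide. The tile set is a textbook instance, not one of the benchmark files, and `replace_chain` is a hypothetical helper:

```python
def replace_chain(index_word, tiles):
    """Apply one replace-all per tile index, mirroring chained str.replace_all.

    `tiles` maps a one-character index (e.g. "1") to the word substituted for
    it. Tile words must not contain index characters, otherwise later
    replacements would rewrite earlier output.
    """
    word = index_word
    for index, tile_word in tiles.items():
        # str.replace substitutes every occurrence of `index`.
        word = word.replace(index, tile_word)
    return word


# Textbook PCP instance (not taken from the benchmark files).
TOP = {"1": "a", "2": "ab", "3": "bba"}
BOTTOM = {"1": "baa", "2": "aa", "3": "bb"}

candidate = "3231"
top_word = replace_chain(candidate, TOP)
bottom_word = replace_chain(candidate, BOTTOM)
print(top_word, bottom_word, top_word == bottom_word)
```

A solver has to find such an index word itself, which is why these instances are hard: each intermediate variable x_i depends on the still-unknown x_0.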

slog_stranger_2780_sink.smt2
(seq: 0.673s (sat), nseq: 10.011s (unknown))

The seq trace for this file (5,926 lines) shows seq working through a regex membership constraint of the form x_6 = " <b>" ++ sigmaStar_3. It normalises concrete string literals (the trace repeatedly simplifies the " <b>", "</b> " and "\SCRIPT" strings through mk_eq_core), then calls enque_axiom for each individual character unit, and finally builds the satisfying assignment with mk_value (the trace ends with the full model value being constructed). The approach is essentially symbolic character-level unfolding combined with automata alignment (the seq.align.l and seq.align.r primitives are visible in the trace).

nseq times out at 10s without producing a result. The benchmark contains a str.in_re constraint with a complex regular expression involving str.replace and re.* / re.+ operators encoding HTML tag sanitisation. The nseq Nielsen-graph solver appears to lack the automata-intersection machinery that seq uses to quickly determine membership; instead it likely tries to unroll the re.* into word equations, generating an exponential Nielsen-graph depth.

instance08497.smt2
(seq: 1.114s (sat), nseq: 10.012s (unknown))

This benchmark (from AutomatArk, status sat) asserts that a string X belongs to re.++ "\n" (i.e., an HTML comment containing arbitrary characters) while simultaneously being excluded from two other patterns. The seq trace (19,780 lines) shows it working through mk_eq_core equations where X is immediately collapsed to the concrete value "\n\r replacement string---->\n\n" (the rewriter identifies the only satisfying value that passes the complement-range membership tests), followed by extensive enque_axiom calls for individual character units.

nseq times out. The complement operator re.comp in the regular expression requires computing the complement automaton; nseq's word-equation engine appears not to reduce range complements into a finite character-class representation early in search, instead branching on character membership decisions and generating a combinatorially large search tree. In contrast, seq's automata-theoretic backbone handles re.comp natively via DFA complementation during mk_eq_core simplification.
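For readers unfamiliar with the term, DFA complementation is the textbook construction sketched below: on a complete DFA, swap accepting and non-accepting states. This is a generic illustration over a toy two-letter alphabet and makes no claim about how seq or nseq implement re.comp internally; the dictionary layout and helper names are assumptions:

```python
def complement(dfa):
    """Complement a complete DFA by flipping its accepting states."""
    states = {s for s, _ in dfa["delta"]} | set(dfa["delta"].values())
    states.add(dfa["start"])
    return {**dfa, "accepting": states - dfa["accepting"]}


def accepts(dfa, word):
    """Run the DFA on `word` and report whether it ends in an accepting state."""
    state = dfa["start"]
    for ch in word:
        state = dfa["delta"][(state, ch)]
    return state in dfa["accepting"]


# Complete DFA over {a, b} accepting words that end in "b" (toy example).
ENDS_IN_B = {
    "start": 0,
    "accepting": {1},
    "delta": {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1},
}

NOT_ENDS_IN_B = complement(ENDS_IN_B)
print(accepts(ENDS_IN_B, "ab"), accepts(NOT_ENDS_IN_B, "ab"))
print(accepts(ENDS_IN_B, ""), accepts(NOT_ENDS_IN_B, ""))
```

The construction only works when every state has a transition for every symbol; an incomplete automaton must first be completed with a sink state.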

📋 Per-File Results (all 50 benchmarks)

#	File	seq verdict	seq time (s)	nseq verdict	nseq time (s)	ZIPT verdict	ZIPT time (s)	Notes
1	instance14787.smt2	unknown	5.011	unsat	0.063	sat	0.346	SOUNDNESS_DISAGREEMENT
2	instance08140.smt2	sat	0.153	sat	0.031	sat	0.232
3	instance05326.smt2	sat	0.288	unsat	0.102	sat	0.366	SOUNDNESS_DISAGREEMENT
4	04_track_5.smt2	sat	1.557	sat	0.050	sat	0.457
5	instance02170.smt2	sat	0.031	sat	0.022	sat	0.258
6	slog_stranger_807_sink.smt2	unsat	0.024	unsat	0.023	unsat	0.207
7	instance00644.smt2	sat	0.081	sat	0.032	sat	0.242
8	instance13790.smt2	sat	0.388	unknown	0.094	sat	0.268
9	instance06516.smt2	unsat	0.207	unsat	0.047	unsat	0.370
10	instance03595.smt2	sat	0.464	sat	0.066	sat	0.257
11	pcp_instance_252.smt2	unknown	0.233	unknown	10.011	bug	0.135
12	instance10927.smt2	unsat	0.033	unsat	0.025	unsat	0.357
13	instance14523.smt2	sat	0.165	sat	0.031	sat	0.300
14	instance01101.smt2	sat	0.034	sat	0.023	sat	0.271
15	instance05566.smt2	sat	0.089	sat	0.028	sat	0.239
16	slog_stranger_2571_sink.smt2	unsat	0.082	unsat	0.063	unsat	0.256
17	instance14408.smt2	unknown	5.009	unknown	10.009	sat	0.365
18	instance12485.smt2	unsat	0.071	unsat	0.028	unsat	0.207
19	instance05164.smt2	sat	0.081	sat	0.031	sat	0.246
20	instance06841.smt2	unsat	1.185	unsat	0.598	unsat	1.194
21	instance05858.smt2	unknown	5.012	sat	0.069	sat	0.225
22	instance10283.smt2	unsat	0.042	unsat	0.028	unsat	0.359
23	instance11451.smt2	unsat	0.323	unsat	0.059	unsat	0.339
24	instance06002.smt2	unsat	0.200	unsat	0.037	unsat	0.329
25	instance02277.smt2	sat	0.115	sat	0.033	sat	0.258
26	pcp_instance_160.smt2	unknown	0.232	unknown	10.011	bug	0.135
27	instance10191.smt2	unsat	0.061	unsat	0.034	unsat	0.440
28	instance15980.smt2	unsat	0.212	unsat	0.033	unsat	0.307
29	instance10096.smt2	unsat	0.298	unsat	0.034	unsat	0.289
30	instance06916.smt2	sat	0.196	sat	0.037	sat	0.267
31	instance08321.smt2	sat	0.493	unsat	0.036	sat	0.245	SOUNDNESS_DISAGREEMENT
32	instance08677.smt2	sat	0.570	sat	0.039	sat	0.271
33	instance08588.smt2	sat	0.178	sat	0.035	sat	0.292
34	instance14878.smt2	unsat	0.168	unsat	0.035	unsat	0.336
35	slog_stranger_770_sink.smt2	sat	1.744	unsat	0.037	sat	8.809	SOUNDNESS_DISAGREEMENT
36	instance00127.smt2	sat	0.156	sat	0.031	sat	0.193
37	pcp_instance_347.smt2	unknown	0.215	unknown	10.011	bug	0.132
38	instance11169.smt2	unsat	0.327	unsat	0.041	unsat	0.385
39	slog_stranger_4808_sink.smt2	unknown	5.011	unsat	0.081	sat	0.596	SOUNDNESS_DISAGREEMENT
40	instance02344.smt2	sat	0.477	unsat	0.028	sat	0.194	SOUNDNESS_DISAGREEMENT
41	instance11266.smt2	sat	3.884	unsat	0.302	sat	0.391	SOUNDNESS_DISAGREEMENT
42	instance07530.smt2	unsat	0.140	unsat	0.042	unsat	0.343
43	instance04637.smt2	sat	0.135	sat	0.031	sat	0.357
44	unsolved_pcp_instance_409.smt2	unknown	0.212	unknown	10.011	bug	0.142
45	instance09266.smt2	sat	0.123	sat	0.032	sat	0.251
46	instance05556.smt2	sat	2.681	sat	0.050	sat	0.254
47	unsolved_pcp_instance_381.smt2	unknown	0.219	unknown	10.011	bug	0.136
48	instance00740.smt2	sat	0.031	sat	0.023	sat	0.263
49	slog_stranger_2780_sink.smt2	sat	0.673	unknown	10.011	sat	0.245
50	instance08497.smt2	sat	1.114	unknown	10.012	sat	0.294

Generated automatically by the ZIPT Benchmark workflow on the c3 branch.

AI generated by Qf S Benchmark · history

expires on Mar 25, 2026, 4:40 AM UTC

2026-03-19T23:47:55Z

github-actions[bot]
bot Mar 19, 2026
Author

This discussion has been marked as outdated by Qf S Benchmark.

A newer discussion is available at Discussion #9049.
