[ZIPT Benchmark] ZIPT Benchmark — Z3 c3 branch (2026-03-14) #8991

2026-03-14T20:48:31Z

github-actions[bot]
bot Mar 14, 2026

Date: 2026-03-14
Branch: c3 (commit 4c64c82cef62dfe37e2d98a02fdb70d0e69defa4)
Benchmark set: QF_S — 50 randomly selected files from tests/QF_S.tar.zst (22 172 total files)
Timeout: 10 s per benchmark (-T:10 for Z3 nseq; -T:5 + outer 7 s for seq with tracing; -t:10000 for ZIPT)
Build: Debug (-DCMAKE_BUILD_TYPE=Debug); ZIPT built against freshly compiled Microsoft.Z3.dll (net10.0)
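
The timeout line above combines a solver-side limit with an outer harness limit. The following is a minimal sketch of that two-level arrangement, not the workflow's actual script: `run_solver` and `fake_solver` are hypothetical names, and a small Python child process stands in for the solver so the sketch runs without Z3 installed.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_solver(cmd: List[str], outer_timeout: float) -> Tuple[str, Optional[int]]:
    """Run cmd; return (stdout, exit code), or ("", None) if the outer limit hits."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=outer_timeout
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return "", None
    return done.stdout, done.returncode


# Stand-in for a solver invocation (hypothetical; prints a verdict and exits).
fake_solver = [sys.executable, "-c", "print('sat')"]
stdout, code = run_solver(fake_solver, outer_timeout=7.0)
```

For the seq configuration described above, the command would carry the solver-side `-T:5` flag and the call would pass `outer_timeout=7.0`; the binary name and argument order would be assumptions about the local setup.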

Summary

Metric	seq solver	nseq solver	ZIPT solver
sat	22	5	31
unsat	12	7	16
unknown	15	37	1
timeout	0	0	0
bug/crash	1 ¹	1 ¹	2 ²
Total time (s)	101.319	331.036	13.779
Avg time/benchmark (s)	2.026	6.621	0.276

¹ instance13118.smt2 was flagged as a crash during the benchmark run, but on re-run seq correctly returns unsat and nseq times out. This is likely a harness artifact: the benchmark script's grep -qi "error" appears to have matched Z3's internal trace output printed just before the answer. Neither result is a genuine solver bug.

² Both ZIPT crashes are "Unsupported feature: str.replace_all" on the two PCP-string instances. ZIPT exits with a non-zero code and prints an unsupported-function error.
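
Footnote ¹ traces a false crash flag to a case-insensitive search for "error" anywhere in the output. One way to avoid that class of false positive is to look for the verdict line first and fall back to the exit code only when no verdict is present. The sketch below is an illustration under stated assumptions, not the harness's code: `classify` is a hypothetical helper, and it assumes each solver prints its answer as a line containing only sat, unsat, or unknown (treating ZIPT the same way as Z3 is an assumption).

```python
import re

# Assumption: the final answer appears on a line of its own.
VERDICT_LINE = re.compile(r"^(sat|unsat|unknown)$")


def classify(output: str, exit_code: int) -> str:
    """Return sat/unsat/unknown if a verdict line exists, else bug or unknown."""
    verdict = None
    for line in output.splitlines():
        match = VERDICT_LINE.match(line.strip())
        if match:
            verdict = match.group(1)  # keep the last verdict line seen
    if verdict is not None:
        return verdict
    # No verdict at all: report a non-zero exit code as a bug/crash,
    # anything else as unknown.
    return "bug" if exit_code != 0 else "unknown"
```

With this ordering, trace text that happens to contain the word "error" no longer overrides a printed answer, while a run that exits non-zero without any verdict (as in footnote ²) is still reported as a bug.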

Soundness disagreements (any two solvers return conflicting sat / unsat): 0

⚠️ Note: The benchmark script used declare -A definitive_map inside a while loop without reinitialising the array each iteration. This caused false SOUNDNESS_DISAGREEMENT flags in the TSV. After manual re-analysis all flagged rows are confirmed to have no actual sat-vs-unsat conflict.
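
The stale-state problem described in the note can be avoided by deriving the set of definitive verdicts from scratch for each benchmark. The sketch below shows the idea in Python rather than in the script's own shell; `has_soundness_disagreement` is a hypothetical name, and the two sample rows are taken from the per-file table.

```python
def has_soundness_disagreement(verdicts: dict[str, str]) -> bool:
    """True if one solver says sat and another says unsat for the same file."""
    # The set is rebuilt on every call, so nothing carries over
    # from the previous benchmark.
    definitive = {v for v in verdicts.values() if v in ("sat", "unsat")}
    return len(definitive) == 2


rows = [
    ("01_track_120.smt2", {"seq": "unknown", "nseq": "sat", "ZIPT": "sat"}),
    ("instance08827.smt2", {"seq": "unknown", "nseq": "unknown", "ZIPT": "unsat"}),
]
flagged = [name for name, verdicts in rows if has_soundness_disagreement(verdicts)]
```

If the sat entry from the first row were allowed to persist into the second, the unsat answer there would look like a conflict. With per-row state, `flagged` stays empty.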

Notable Issues

Soundness Disagreements (Critical)

None. No benchmark produced a contradicting definitive verdict between any two solvers.

Crashes / Bugs

File	Solver	Error
`unsolved_pcp_instance_335.smt2`	ZIPT	`Unsupported feature: str.replace_all`
`pcp_instance_187.smt2`	ZIPT	`Unsupported feature: str.replace_all`

str.replace_all is not yet implemented in ZIPT's constraint propagator. Both files come from the 20250403-pcp-string collection.

nseq Coverage Gap (Major Performance Finding)

nseq timed out on 33 of 50 benchmarks, returning unknown at the 10 s wall-clock limit; the Summary table counts these under unknown rather than timeout. It produced a definitive answer on only 12 files, mostly simple string-equality or short-regex benchmarks, five of them from the slog / track families. In contrast, seq solved 34 and ZIPT solved 47.

The gap is almost entirely explained by the AutomataArk (20230329-automatark-lu) family, which dominates the random sample: these are single-variable regex-membership problems of the form (not (str.in_re X (complex-re))). seq's mature regex-automaton engine (theory_seq::propagate_in_re) handles them quickly; nseq appears to lack equivalent regex specialisation and exhausts its budget on general Nielsen-graph search.

Slow Benchmarks (> 8 s for any solver)

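All 33 entries above 8 s are nseq runs hitting the 10 s ceiling. The slowest definitive seq answer is instance14766.smt2 at 4.086 s; seq's slowest runs overall are its 13 unknown results at roughly 5.01 s, which corresponds to its -T:5 limit.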

Trace Analysis: seq-fast / nseq-slow Hypotheses

13 files satisfy seq_time < 1.0 s AND nseq_time > 3 × seq_time AND nseq_time > 0.5 s.
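
The selection rule above is a pure timing filter, which can be sketched as follows. `is_candidate` is a hypothetical helper; the four sample timings are copied from the per-file table.

```python
def is_candidate(seq_time: float, nseq_time: float) -> bool:
    """seq-fast / nseq-slow filter: all three timing conditions must hold."""
    return seq_time < 1.0 and nseq_time > 3 * seq_time and nseq_time > 0.5


# (seq time, nseq time) in seconds, taken from the per-file results.
timings = {
    "instance00166.smt2": (0.103, 10.010),
    "instance12064.smt2": (0.121, 0.039),
    "instance08907.smt2": (1.605, 10.009),
    "instance13483.smt2": (0.829, 10.008),
}
candidates = sorted(
    name for name, (seq_t, nseq_t) in timings.items() if is_candidate(seq_t, nseq_t)
)
```

Here instance12064.smt2 is rejected because nseq is faster than seq, and instance08907.smt2 because seq needs more than 1 s. Note that the rule looks only at times, not at verdicts.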

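Pattern common to most of the 13 candidates

Most candidates are AutomataArk-derived regex satisfiability queries: a single string variable X constrained by one or more str.in_re (or not (str.in_re ...)) assertions over large regexes encoding real-world patterns (HTML tags, HTTP headers, file paths, date formats, etc.). The Lehmann-Rabin benchmark analysed below is an exception, since it is CHC-derived and has two string variables. Because the filter looks only at timings, the two PCP instances also appear to satisfy it, although seq returned unknown on both rather than solving them.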

The seq solver trace consistently shows:

[seq] enque_axiom / deque_axiom — initial character-set axioms for the regex alphabet (dozens of seq.unit Char[n] facts) added in the first milliseconds.
[seq] propagate_in_re — automaton-intersection propagation immediately fires and emits strong length lower-bounds (assert:(>= (str.len ...) k)) and character-range constraints, sometimes directly contradicting the negated regex in a single propagation step.
[seq] propagate_lit — the derived length facts are asserted at scope 0, allowing the SAT solver to close the search without backtracking.

Selected file analyses

instance00166.smt2 — seq=0.103 s (sat), nseq=10.010 s (unknown)
The problem is (not (str.in_re X (HTML-tag-regex))). The trace shows seq immediately enqueues character axioms for the 20+ characters in the regex alphabet (Char[60] = <, Char[62] = >, etc.), then fires propagate_in_re to derive the existence of a short satisfying string. The entire resolution takes ≈100 ms. nseq has no equivalent propagate_in_re path; it must enumerate possible string lengths and character combinations via Nielsen-graph expansion, which does not converge within 10 s on this lexically large regex.

instance04207.smt2 — seq=0.578 s (sat), nseq=10.008 s (unknown)
Problem: (not (str.in_re X (FTP-header-regex))). The trace (~7 200 lines) shows mk_eq_core establishing the substr suffix structure (e.g. str.substr X (str.len X - 28) 28 == "…reaction.txt"…\n"), then propagate_in_re with a regex that fixes a 28-char suffix. seq recognises this as unconditionally satisfiable via a short witness that avoids the wowokay[0-9]+FTP… prefix pattern. nseq cannot exploit the prefix-exclusion without a Parikh/length argument that is apparently not generated.

instance13483.smt2 — seq=0.829 s (unsat), nseq=10.008 s (unknown)
HTTP-header problem with str.substr constraints (suffix must be ":/\n", earlier substr must be "\r\n/smi\n"). The trace shows propagate_in_re deriving (>= (str.len X) 18) and (>= (str.len substr) 32) at scope 0 from a hex-UUID regex (re.loop[32:32] (re.union [a-f] [0-9])). These length facts combined with the fixed suffixes produce an arithmetic conflict that seq discharges in < 1 s. nseq does not generate the length bounds from the regex loop constraint and is left with under-constrained arithmetic.

instance13846.smt2 — seq=0.351 s (unsat), nseq=10.008 s (unknown)
Similar HTTP-header structure. seq derives a lower-bound length (from a re.loop[n:n]) that is incompatible with the fixed prefix/suffix, producing a fast refutation. nseq again appears to miss the loop-induced length inference.
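
Both HTTP-header analyses hinge on a length lower bound that follows from a bounded loop in the regex. The sketch below illustrates that kind of inference on a toy regex AST. It is not Z3's implementation: the node classes and `min_length` are hypothetical, introduced only to show why a re.loop[32:32] over single-character classes forces a length of at least 32.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Chars:
    """Any single character from some class, e.g. [a-f] or [0-9]."""


@dataclass(frozen=True)
class Lit:
    text: str


@dataclass(frozen=True)
class Concat:
    parts: Tuple["Regex", ...]


@dataclass(frozen=True)
class Alt:
    options: Tuple["Regex", ...]


@dataclass(frozen=True)
class Star:
    body: "Regex"


@dataclass(frozen=True)
class Loop:
    body: "Regex"
    lo: int
    hi: int


Regex = Union[Chars, Lit, Concat, Alt, Star, Loop]


def min_length(r: Regex) -> int:
    """Smallest length of any string matched by r."""
    if isinstance(r, Chars):
        return 1
    if isinstance(r, Lit):
        return len(r.text)
    if isinstance(r, Concat):
        return sum(min_length(p) for p in r.parts)
    if isinstance(r, Alt):
        return min(min_length(o) for o in r.options)
    if isinstance(r, Star):
        return 0  # zero repetitions are always allowed
    if isinstance(r, Loop):
        return r.lo * min_length(r.body)
    raise TypeError(f"unknown regex node: {r!r}")


# Shape of (re.loop[32:32] (re.union [a-f] [0-9])) from the analysis above.
hex_block = Loop(Alt((Chars(), Chars())), 32, 32)
bound = min_length(hex_block)  # 32 * min(1, 1)
```

A bound like this can be handed to the arithmetic layer as a constraint of the form (>= (str.len ...) 32); when it clashes with fixed-length prefixes or suffixes, the conflict is purely arithmetic and needs no search over strings.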

Lehmann-Rabin_sat_non_incre_equiv_bad_0_1.smt2 — seq=0.583 s (sat), nseq=10.009 s (unknown)
CHC-derived benchmark with two string variables varout, varin both in (re.* (re.union a b c d H E)) (alphabet-Kleene-star) plus a complex boolean formula involving length and substring equality. seq handles Kleene-star membership trivially (any string over the alphabet satisfies it) and quickly finds a model. nseq's Nielsen-graph must unfold the concatenation equalities without the simplification that re.* over the full alphabet imposes no constraint.

Summary hypothesis: The likely root cause of these slowdowns is the absence in nseq of a dedicated propagate_in_re/automaton-intersection engine. seq converts each str.in_re constraint to a symbolic automaton, intersects automata, and propagates character/length facts directly into the arithmetic layer. nseq instead relies on Nielsen-graph expansion and Parikh-based arithmetic abstraction, which are powerful for string-equality problems but do not extract the strong structural constraints that regex automata yield. Until nseq implements regex-to-automaton propagation (or integrates seq's seq_regex.cpp routines), it is likely to keep timing out on AutomataArk-class benchmarks.

Per-File Results (50 benchmarks)

#	File	seq verdict	seq time (s)	nseq verdict	nseq time (s)	ZIPT verdict	ZIPT time (s)	Notes
1	`instance12064.smt2`	unsat	0.121	unsat	0.039	unsat	0.357
2	`instance08907.smt2`	sat	1.605	unknown	10.009	sat	0.308
3	`instance15675.smt2`	sat	0.057	sat	0.028	sat	0.329
4	`instance14766.smt2`	unsat	4.086	unknown	10.009	unsat	0.367
5	`instance12646.smt2`	unsat	2.733	unknown	10.011	unsat	0.268
6	`unsolved_pcp_instance_335.smt2`	unknown	0.243	unknown	10.009	bug	0.121	💥 ZIPT: str.replace_all unsupported
7	`01_track_120.smt2`	unknown	5.008	sat	0.062	sat	0.372
8	`instance13483.smt2`	unsat	0.829	unknown	10.008	unsat	0.285
9	`instance12188.smt2`	sat	3.213	unknown	10.009	sat	0.237
10	`instance01616.smt2`	sat	2.924	unknown	10.009	sat	0.271
11	`slog_stranger_1706_sink.smt2`	unsat	0.031	unsat	0.024	unsat	0.252
12	`instance01258.smt2`	sat	0.310	sat	0.027	sat	0.221
13	`03_track_129.smt2`	sat	0.774	sat	0.055	sat	0.239
14	`instance11770.smt2`	sat	3.453	unknown	10.009	sat	0.256
15	`instance08827.smt2`	unknown	5.010	unknown	0.048	unsat	0.329
16	`instance13553.smt2`	unknown	5.009	unknown	10.009	unsat	0.429
17	`instance00166.smt2`	sat	0.103	unknown	10.010	sat	0.218
18	`instance02021.smt2`	sat	0.143	unknown	10.008	sat	0.202
19	`instance10989.smt2`	unsat	0.068	unsat	0.039	unsat	0.322
20	`instance07639.smt2`	unknown	5.009	unknown	10.009	sat	0.248
21	`instance04207.smt2`	sat	0.578	unknown	10.008	sat	0.258
22	`instance13309.smt2`	unknown	5.012	unknown	10.009	sat	0.277
23	`instance13182.smt2`	unsat	0.071	unsat	0.025	unsat	0.272
24	`pcp_instance_187.smt2`	unknown	0.234	unknown	10.009	bug	0.121	💥 ZIPT: str.replace_all unsupported
25	`instance04761.smt2`	unknown	5.010	unknown	10.009	sat	0.218
26	`instance07676.smt2`	unknown	5.010	unknown	10.008	sat	0.384
27	`slog_stranger_418_sink.smt2`	unsat	0.027	unsat	0.022	unsat	0.205
28	`instance05341.smt2`	sat	0.951	unknown	10.009	sat	0.242
29	`instance07992.smt2`	sat	0.638	unknown	10.008	sat	0.246
30	`instance04793.smt2`	sat	1.139	unknown	10.008	sat	0.260
31	`instance12730.smt2`	unknown	5.008	unknown	10.011	sat	0.318
32	`instance11360.smt2`	unsat	1.044	unknown	10.009	unsat	0.297
33	`slog_stranger_4135_sink.smt2`	sat	1.602	unknown	0.037	sat	0.370
34	`Lehmann-Rabin_sat_non_incre_equiv_bad_0_1.smt2`	sat	0.583	unknown	10.009	sat	0.270
35	`instance02418.smt2`	sat	0.092	unknown	0.174	sat	0.222
36	`instance13118.smt2`	bug¹	0.007	bug¹	0.007	unknown	0.036	¹ false artifact; re-run: seq=unsat, nseq=timeout
37	`instance13474.smt2`	sat	1.785	unknown	10.008	sat	0.302
38	`instance13846.smt2`	unsat	0.351	unknown	10.008	unsat	0.283
39	`instance15325.smt2`	unsat	0.083	unsat	0.036	unsat	0.335
40	`03_track_128.smt2`	sat	1.770	sat	0.043	sat	0.235
41	`instance09378.smt2`	unknown	5.009	unknown	10.008	sat	0.337
42	`instance15893.smt2`	sat	0.528	unknown	10.008	sat	0.259
43	`instance00151.smt2`	sat	2.737	unknown	10.008	sat	0.260
44	`instance11552.smt2`	unknown	5.010	unknown	0.052	unsat	0.295
45	`instance07312.smt2`	unknown	5.009	unknown	10.011	sat	0.371
46	`instance01990.smt2`	unknown	5.010	unknown	10.008	sat	0.243
47	`query8824.smt2`	sat	0.794	unknown	10.009	sat	0.231
48	`instance11865.smt2`	unknown	5.011	unknown	10.009	unsat	0.333
49	`instance15451.smt2`	sat	0.393	unknown	10.009	sat	0.247
50	`instance14922.smt2`	unsat	0.094	unsat	0.026	unsat	0.421

Generated automatically by the ZIPT Benchmark workflow on the c3 branch.

AI generated by Qf S Benchmark

expires on Mar 21, 2026, 8:48 PM UTC

2026-03-18T04:40:33Z

github-actions[bot]
bot Mar 18, 2026
Author

This discussion has been marked as outdated by Qf S Benchmark.

A newer discussion is available at Discussion #9031.
