
[Benchmark Output Submission]: Verificate HELIX V1.4 #9

@Verificate-Dev

Description


Agent Name

Verificate HELIX V1.4

Maintainer

Craig Atkinson / Verificate

Model(s) Used

Granite 4.0 Small Q4

Agent Description

HELIX v1.4 is a production-grade inference engine built by Verificate, optimized for CPU-only execution of large language models. This submission covers all 223 VAKRA benchmark questions across all 4 capabilities (222/223 successful, 99.6% completion rate).

Metadata (JSON)

{
  "submitter": "Verificate",
  "engine": "HELIX v1.4 CPU Inference Engine",
  "model": "IBM Granite 4.0 Small Q4 (32B parameters)",
  "infrastructure": "OpenShift CPU Pod — AMD EPYC 9254, 24 threads, NUMA node0 (HPC Fusion, llama_v14 profile)",
  "helix_config": {
    "HELIX_SLICE_EXECUTION": 1,
    "HELIX_ACTIVE_SLICE_RATIO": 0.50,
    "HELIX_SLICE_TOPK": 1,
    "HELIX_PROGRESSIVE_MODE": 0,
    "HELIX_GATE_BY_UTS": 1,
    "VERIFICATE_BATCH_SIZE": 1024
  },
  "speed_metrics": {
    "cap1_avg_duration_s": 47.3,
    "cap2_avg_duration_s": 31.4,
    "cap3_avg_duration_s": 42.1,
    "cap4_avg_duration_s": 45.0,
    "overall_avg_duration_s": 40.5,
    "per_llm_call_s": "8–11s (HELIX slicing active)"
  },
  "success_metrics": {
    "total_queries": 223,
    "successful_executions": 222,
    "error_executions": 1,
    "completion_rate": "99.6%"
  },
  "agent_fixes": [
    "JSON-safe tool result truncation (4000 chars, dict handles preserved)",
    "Context message truncation (3000 chars on prior-turn assistant messages)",
    "Raw tool call artifact re-execution (XML and JSON blob formats)",
    "Synthesis fallback (final no-tool LLM call for useless answers)",
    "Extended useless answer detection (dicts, lists, Python reprs, error strings)"
  ],
  "notes": "Two-pass tool pre-selection active for all 4 capabilities. HELIX slice execution confirmed active via per-call latency (8–11s vs ~30s baseline). 222/223 questions answered successfully. Cap 4 multi-turn achieved 0 timeouts. Official VAKRA schema validation passed all 4 capability files."
}
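
As an illustration of the first entry in `agent_fixes` (tool result truncation to 4000 characters with dict handles preserved), a minimal Python sketch of that behavior might look like the following. The function name and the `…[truncated]` marker are hypothetical; the actual HELIX implementation is not included in this submission.

```python
import json

MAX_TOOL_RESULT_CHARS = 4000  # the 4000-char limit named in agent_fixes

def truncate_tool_result(result):
    """Truncate a tool result for the context window while staying JSON-safe.

    Hypothetical sketch: dict results pass through untouched so downstream
    tool calls can still reference their keys ("dict handles preserved");
    only long string payloads are cut.
    """
    if isinstance(result, dict):
        return result  # preserve dict handles intact
    text = result if isinstance(result, str) else json.dumps(result, default=str)
    if len(text) <= MAX_TOOL_RESULT_CHARS:
        return text
    return text[:MAX_TOOL_RESULT_CHARS] + "…[truncated]"
```

Keeping dicts whole while cutting only string payloads is one way to reconcile context-window pressure with tools that need to re-reference earlier results by key.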

ZIP File Link

https://www.dropbox.com/scl/fi/zxc96xuf6ritnjcji4ssg/Verificate_HELIX_v1.4_HELIX_Sliced_Submission_20260409.zip?rlkey=pucw6vflj9crdqzrlxzmaf8hk&st=u0owc77m&dl=0

ZIP Contents Description

SUBMISSION_MANIFEST.json
capability_bi_apis/prediction/chicago_crime.json (79 records)
capability_dashboard_apis/prediction/chicago_crime.json (78 records)
capability_multihop_reasoning/prediction/chicago_crime.json (45 records)
capability_multiturn/prediction/chicago_crime.json (21 records)
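
The per-file record counts above can be cross-checked against the 223 total queries reported in the metadata; a quick sanity check (capability names taken from the listing above):

```python
# Record counts per capability prediction file, as listed in the ZIP contents
records = {
    "capability_bi_apis": 79,
    "capability_dashboard_apis": 78,
    "capability_multihop_reasoning": 45,
    "capability_multiturn": 21,
}

total = sum(records.values())
print(total)  # 223, matching total_queries in the metadata
```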

Validation Checklist

  • JSON files are valid and well-formed
  • ZIP file is accessible via the provided link
  • No sensitive or PII data included
  • Agent has been tested locally
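
A minimal sketch of the first checklist item (confirming every prediction JSON is valid and well-formed), assuming the directory layout shown in the ZIP contents listing. The helper name is hypothetical:

```python
import json
from pathlib import Path

def count_valid_prediction_files(root):
    """Parse every capability_*/prediction/*.json under root.

    Raises json.JSONDecodeError on the first malformed file;
    returns the number of files that parsed cleanly.
    """
    count = 0
    for path in sorted(Path(root).glob("capability_*/prediction/*.json")):
        json.loads(path.read_text())  # raises if the JSON is invalid
        count += 1
    return count
```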

Additional Notes

The standout result is Cap 4 (multi-turn, double-weighted): 0 errors and a 45s average per question. With HELIX slice execution, each LLM call completes in 8–11s, giving the agent a budget of 54+ iterations within the 600s limit instead of roughly 20 at the ~30s baseline.
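
The iteration-budget arithmetic works out as follows, using worst-case integer division and the figures reported in this submission:

```python
TIME_LIMIT_S = 600        # per-question limit
PER_CALL_SLICED_S = 11    # worst case with HELIX slicing active (8–11s)
PER_CALL_BASELINE_S = 30  # approximate unsliced baseline

sliced_budget = TIME_LIMIT_S // PER_CALL_SLICED_S      # 54 iterations
baseline_budget = TIME_LIMIT_S // PER_CALL_BASELINE_S  # 20 iterations
```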

The overall 40.5s average across all 223 questions, achieved on a CPU-only pod (no GPU) running a 32B-parameter model, demonstrates that HELIX sparse activation can deliver practical agentic inference speeds competitive with GPU-accelerated deployments. Two-pass tool isolation further reduces agent iteration counts in the 174-tool capability categories, keeping average durations below 50s even for the most complex multi-hop reasoning chains.
