Commit 4ec58f4

Add benchmark to ARC
1 parent da2de85 commit 4ec58f4

File tree

1 file changed: +20 −3 lines


arc/README.md

Lines changed: 20 additions & 3 deletions
@@ -4,15 +4,15 @@
The Abstraction and Reasoning Corpus (ARC) benchmark, introduced by François Chollet, provides a dataset and evaluation framework designed to test the reasoning abilities of AI models. It tests an AI's ability to demonstrate general intelligence rather than task-specific performance. Unlike traditional benchmarks that focus on training models on large, fixed datasets, ARC assesses how well an AI system can generalize, reason abstractly, and solve novel problems from limited information, skills akin to human cognitive abilities. It challenges AI to learn and adapt without task-specific training, emphasizing flexibility and the capacity to understand and apply abstract relationships. Designed with problems that a human can solve using common sense and basic reasoning, ARC remains difficult for AI, highlighting the gap between current models and human-like cognitive capabilities. The benchmark is seen as a significant step toward Artificial General Intelligence (AGI), as it requires AI systems to exhibit foundational skills such as inductive reasoning, analogy-making, and adaptability, pushing AI research toward more human-like problem-solving.

-https://arcprize.org/
+https://arcprize.org
![ARC Puzzle Example](https://github.com/user-attachments/assets/41701a8e-5639-4f35-96ae-f4815dbd59cc)
## Nucleoid Approach
Nucleoid, aka `nuc`, approaches Neuro-Symbolic AI by introducing an intermediate language. Briefly, Nucleoid is a declarative, logic-based, contextual runtime that tracks each statement in declarative syntax and dynamically creates relationships between logic and data statements in its knowledge graph, which is then used in the decision-making and problem-solving process.
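
The idea of a runtime that tracks declarative statements and their relationships can be sketched as follows. This is a toy illustration, not Nucleoid's actual implementation: each statement is recorded, and the dependencies between statements form a small graph that is resolved when values are needed.

```python
# A toy sketch of a declarative, logic-based runtime (NOT Nucleoid's actual
# implementation): each statement is recorded, and dependency edges between
# statements form a small knowledge graph used during evaluation.

class DeclarativeRuntime:
    def __init__(self):
        self.rules = {}   # name -> (expression, names it depends on)
        self.graph = {}   # name -> set of names that depend on it

    def declare(self, name, expr, deps=()):
        """Record a declarative statement and wire its dependency edges."""
        self.rules[name] = (expr, tuple(deps))
        for dep in deps:
            self.graph.setdefault(dep, set()).add(name)

    def value(self, name):
        """Evaluate a statement on demand, resolving its dependencies first."""
        expr, deps = self.rules[name]
        env = {dep: self.value(dep) for dep in deps}
        return eval(expr, {}, env)

runtime = DeclarativeRuntime()
runtime.declare("a", "1")
runtime.declare("b", "a + 2", deps=("a",))
print(runtime.value("b"))   # 3
runtime.declare("a", "10")  # re-declaring 'a' implicitly changes 'b'
print(runtime.value("b"))   # 12
```

Because `b` is declared as a relationship rather than assigned once, updating `a` changes `b` without any extra statement, which is the declarative behavior described above.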

-> Essential Intelligence is integration of pattern, data and logic.
+> **Essential intelligence** is the integration of pattern, data, and logic.

This concept is also introduced in *Thinking, Fast and Slow* by Daniel Kahneman, where System 1 operates through pattern recognition, while System 2 applies logical reasoning. Data acts as the bridge, enabling collaboration between these systems to yield insights based on both probabilistic and deterministic information. However, the real challenge lies in enabling effective collaboration between the two systems so they can understand and support one another.

@@ -22,7 +22,7 @@ We've found that using an intermediate, ubiquitous language is highly effective
Our approach has two phases: analysis and visualization. In the analysis phase, the AI system generalizes patterns and identifies instances for use in the actual test; in the visualization phase, the extracted abstraction is applied to the given test input.
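
The two phases can be sketched as a minimal pipeline. The function names here are assumptions, not the project's actual API, and the "rule induction" is hard-coded to keep the example runnable; in the real system the LLM performs that step.

```python
# Illustrative sketch of the two-phase flow (names are assumptions, not the
# project's actual API): analysis generalizes a rule from training pairs,
# visualization applies the extracted abstraction to the test input.

def analyze(train_pairs):
    """Analysis phase: generalize a transformation from (input, output) pairs."""
    # Toy abstraction, hard-coded for the sketch: every training output
    # doubles each cell of its input grid.
    return lambda grid: [[cell * 2 for cell in row] for row in grid]

def visualize(rule, test_input):
    """Visualization phase: apply the extracted abstraction to the test input."""
    return rule(test_input)

train = [([[1, 2]], [[2, 4]])]
rule = analyze(train)
print(visualize(rule, [[3, 5]]))  # [[6, 10]]
```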

-> :zap: Instead of prompt engineering, all communications with LLMs are made thru `nuc` language
+> :zap: All communications with the LLM are made through the `nuc` language instead of prompt engineering

![ARC_Flow](https://github.com/user-attachments/assets/0b5132cb-8269-461d-b27b-2b84ec1dd640)

@@ -134,3 +134,20 @@ For identified instances, the LLM represents instances in `nuc` lang, this is pa
For each input instance in the test, the LLM generates an output instance, since it has all the information it needs from **knowledge packets**.

Finally, all instances can be merged to produce the test result.
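
The merge step can be sketched as composing per-instance outputs onto one shared grid. This is an illustration under assumed data shapes; the project's actual merge logic may differ.

```python
# A sketch of merging per-instance outputs into one result grid (the actual
# merge in the project may differ): each instance contributes the cells it
# produced, and all instances are composed onto a shared background grid.

def merge_instances(instances, height, width, background=0):
    """Overlay per-instance cells (row, col, value) onto a background grid."""
    grid = [[background] * width for _ in range(height)]
    for cells in instances:
        for row, col, value in cells:
            grid[row][col] = value
    return grid

square = [(0, 0, 1), (0, 1, 1)]  # hypothetical instance: a 1x2 bar of color 1
dot = [(1, 1, 2)]                # hypothetical instance: a single cell of color 2
print(merge_instances([square, dot], 2, 2))  # [[1, 1], [0, 2]]
```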
## Benchmark
![Benchmark](https://github.com/user-attachments/assets/e9d30330-6d4f-461b-9b9e-c8d1b9eccaf5)
This benchmark demonstrates how LLMs respond to different prompting methods on ARC subtasks. Our observations indicate that natural-language prompts often yield inconsistent and unpredictable results, particularly in chain-of-thought (CoT) reasoning. Switching to a 5GL such as `nuc` lang, however, significantly increases accuracy, with LLM responses approaching deterministic behavior. Notably, `nuc` lang achieves this performance without extensive training requirements. This suggests that structured, high-level programming languages may be more effective than conventional natural-language prompting for certain types of tasks.
---
**Nucleoid**
![3aa6fb7a_nuc](https://github.com/user-attachments/assets/98c5147e-d1b9-4a09-8c38-0c803d99ea55)
**ChatGPT o1**
![3aa6fb7a_chatgpt-o1](https://github.com/user-attachments/assets/84febfe8-4c6d-4390-91f0-0bdca35edab0)
