Commit 0c94660

DEV: improve artifact editing and eval system (#1130)
- Add non-contiguous search/replace support using `...` syntax (sketched below)
- Add judge support for evaluating LLM outputs with ratings
- Improve error handling and reporting in eval runner
- Add full section replacement support without search blocks
- Add fabricators and specs for artifact diffing
- Track failed searches to improve debugging
- Add JS syntax validation for artifact versions in eval system
- Update prompt documentation with clear guidelines
- Improve eval output
- Move error handling
- LLM as a judge
- Fix spec
- Small note on evals
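The diff-based artifact editor applies search/replace blocks emitted by the LLM, and the `...` marker lets a single search match a span whose middle lines are elided. A purely hypothetical sketch of the idea, assuming a conventional SEARCH/REPLACE block format; the exact markers and matching rules live in `DiscourseAi::AiBot::ArtifactUpdateStrategies::Diff` and may differ:

```
<<<<<<< SEARCH
function render() {
...
}
=======
function render() {
  clearCanvas();
  drawScene();
}
>>>>>>> REPLACE
```

Here the search would match from `function render() {` through the closing `}` regardless of what sits between, and the whole matched span is replaced. A search that matches nothing lands in `failed_searches`, which the eval code in `evals/lib/eval.rb` below inspects.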

10 files changed: +591 −27 lines changed

README.md

Lines changed: 23 additions & 0 deletions

````diff
@@ -3,3 +3,26 @@
 **Plugin Summary**
 
 For more information, please see: https://meta.discourse.org/t/discourse-ai/259214?u=falco
+
+### Evals
+
+The directory `evals` contains AI evals for the Discourse AI plugin.
+
+To run them use:
+
+    cd evals
+    ./run --help
+
+```
+Usage: evals/run [options]
+    -e, --eval NAME                  Name of the evaluation to run
+        --list-models                List models
+    -m, --model NAME                 Model to evaluate (will eval all models if not specified)
+    -l, --list                       List evals
+```
+
+To run evals you will need to configure API keys in your environment:
+
+    OPENAI_API_KEY=your_openai_api_key
+    ANTHROPIC_API_KEY=your_anthropic_api_key
+    GEMINI_API_KEY=your_gemini_api_key
````
app/models/ai_artifact_version.rb

Lines changed: 13 additions & 0 deletions

```diff
@@ -4,6 +4,19 @@ class AiArtifactVersion < ActiveRecord::Base
   validates :html, length: { maximum: 65_535 }
   validates :css, length: { maximum: 65_535 }
   validates :js, length: { maximum: 65_535 }
+
+  # used when generating test cases
+  def write_to(path)
+    css_path = "#{path}/main.css"
+    html_path = "#{path}/main.html"
+    js_path = "#{path}/main.js"
+    instructions_path = "#{path}/instructions.txt"
+
+    File.write(css_path, css)
+    File.write(html_path, html)
+    File.write(js_path, js)
+    File.write(instructions_path, change_description)
+  end
 end
 
 # == Schema Information
```
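`write_to` exports a version in exactly the four-file layout that the new `edit_artifact` eval (below) reads back via `css_path`, `js_path`, `html_path`, and `instructions_path`. A minimal usage sketch; the artifact id and target directory are made up for illustration:

```ruby
require "fileutils"

# Hypothetical: export the latest version of artifact 42 as an eval test case.
version = AiArtifact.find(42).versions.last
dir = "evals/cases/example_artifact" # made-up path
FileUtils.mkdir_p(dir)
version.write_to(dir)
# => dir now contains main.css, main.html, main.js and instructions.txt
```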

evals/lib/eval.rb

Lines changed: 151 additions & 8 deletions

```diff
@@ -10,7 +10,17 @@ class DiscourseAi::Evals::Eval
               :vision,
               :expected_output,
               :expected_output_regex,
-              :expected_tool_call
+              :expected_tool_call,
+              :judge
+
+  class EvalError < StandardError
+    attr_reader :context
+
+    def initialize(message, context)
+      super(message)
+      @context = context
+    end
+  end
 
   def initialize(path:)
     @yaml = YAML.load_file(path).symbolize_keys
@@ -27,10 +37,14 @@ def initialize(path:)
       Regexp.new(@expected_output_regex, Regexp::MULTILINE) if @expected_output_regex
     @expected_tool_call = @yaml[:expected_tool_call]
     @expected_tool_call.symbolize_keys! if @expected_tool_call
+    @judge = @yaml[:judge]
+    @judge.symbolize_keys! if @judge
 
-    @args[:path] = File.expand_path(File.join(File.dirname(path), @args[:path])) if @args&.key?(
-      :path,
-    )
+    @args.each do |key, value|
+      if (key.to_s.include?("_path") || key.to_s == "path") && value.is_a?(String)
+        @args[key] = File.expand_path(File.join(File.dirname(path), value))
+      end
+    end
   end
 
   def run(llm:)
@@ -44,6 +58,8 @@ def run(llm:)
         image_to_text(llm, **args)
       when "prompt"
         prompt_call(llm, **args)
+      when "edit_artifact"
+        edit_artifact(llm, **args)
       end
 
     if expected_output
@@ -53,7 +69,7 @@ def run(llm:)
         { result: :fail, expected_output: expected_output, actual_output: result }
       end
     elsif expected_output_regex
-      if result.match?(expected_output_regex)
+      if result.to_s.match?(expected_output_regex)
         { result: :pass }
       else
         { result: :fail, expected_output: expected_output_regex, actual_output: result }
@@ -71,9 +87,13 @@ def run(llm:)
       else
         { result: :pass }
       end
+    elsif judge
+      judge_result(result)
     else
-      { result: :unknown, actual_output: result }
+      { result: :pass }
     end
+  rescue EvalError => e
+    { result: :fail, message: e.message, context: e.context }
   end
 
   def print
@@ -96,14 +116,68 @@ def to_json
 
   private
 
-  def helper(llm, input:, name:)
+  def judge_result(result)
+    prompt = judge[:prompt].dup
+    prompt.sub!("{{output}}", result)
+    prompt.sub!("{{input}}", args[:input])
+
+    prompt += <<~SUFFIX
+
+      Reply with a rating from 1 to 10, where 10 is perfect and 1 is terrible.
+
+      example output:
+
+      [RATING]10[/RATING] perfect output
+
+      example output:
+
+      [RATING]5[/RATING]
+
+      the following failed to preserve... etc...
+    SUFFIX
+
+    judge_llm = DiscourseAi::Evals::Llm.choose(judge[:llm]).first
+
+    DiscourseAi::Completions::Prompt.new(
+      "You are an expert judge tasked at testing LLM outputs.",
+      messages: [{ type: :user, content: prompt }],
+    )
+
+    result = judge_llm.llm_model.to_llm.generate(prompt, user: Discourse.system_user)
+
+    if rating = result.match(%r{\[RATING\](\d+)\[/RATING\]})
+      rating = rating[1].to_i
+    end
+
+    if rating.to_i >= judge[:pass_rating]
+      { result: :pass }
+    else
+      {
+        result: :fail,
+        message: "LLM Rating below threshold, it was #{rating}, expecting #{judge[:pass_rating]}",
+        context: result,
+      }
+    end
+  end
+
+  def helper(llm, input:, name:, locale: nil)
     completion_prompt = CompletionPrompt.find_by(name: name)
     helper = DiscourseAi::AiHelper::Assistant.new(helper_llm: llm.llm_proxy)
+    user = Discourse.system_user
+    if locale
+      user = User.new
+      class << user
+        attr_accessor :effective_locale
+      end
+
+      user.effective_locale = locale
+      user.admin = true
+    end
     result =
       helper.generate_and_send_prompt(
         completion_prompt,
         input,
-        current_user = Discourse.system_user,
+        current_user = user,
         _force_default_locale = false,
       )
 
@@ -169,4 +243,73 @@ def prompt_call(llm, system_prompt:, message:, tools: nil, stream: false)
     end
     result
   end
+
+  def edit_artifact(llm, css_path:, js_path:, html_path:, instructions_path:)
+    css = File.read(css_path)
+    js = File.read(js_path)
+    html = File.read(html_path)
+    instructions = File.read(instructions_path)
+    artifact =
+      AiArtifact.create!(
+        css: css,
+        js: js,
+        html: html,
+        user_id: Discourse.system_user.id,
+        post_id: 1,
+        name: "eval artifact",
+      )
+
+    post = Post.new(topic_id: 1, id: 1)
+    diff =
+      DiscourseAi::AiBot::ArtifactUpdateStrategies::Diff.new(
+        llm: llm.llm_model.to_llm,
+        post: post,
+        user: Discourse.system_user,
+        artifact: artifact,
+        artifact_version: nil,
+        instructions: instructions,
+      )
+    diff.apply
+
+    if diff.failed_searches.present?
+      puts "Eval Errors encountered"
+      p diff.failed_searches
+      raise EvalError.new("Failed to apply all changes", diff.failed_searches)
+    end
+
+    version = artifact.versions.last
+    raise EvalError.new("Invalid JS", version.js) if !valid_javascript?(version.js)
+
+    output = { css: version.css, js: version.js, html: version.html }
+
+    artifact.destroy
+    output
+  end
+
+  def valid_javascript?(str)
+    require "open3"
+
+    # Create a temporary file with the JavaScript code
+    Tempfile.create(%w[test .js]) do |f|
+      f.write(str)
+      f.flush
+
+      File.write("/tmp/test.js", str)
+
+      begin
+        Discourse::Utils.execute_command(
+          "node",
+          "--check",
+          f.path,
+          failure_message: "Invalid JavaScript syntax",
+          timeout: 30, # reasonable timeout in seconds
+        )
+        true
+      rescue Discourse::Utils::CommandError
+        false
+      end
+    end
+  rescue StandardError
+    false
+  end
 end
```
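Reading `initialize` and `judge_result` together, a judged eval file parses into roughly the following structure once `symbolize_keys` runs. This is a sketch reconstructed from the reader code, not a documented schema: the `judge` keys (`llm`, `prompt` with its `{{input}}`/`{{output}}` placeholders, `pass_rating`) and the relative `*_path` expansion come straight from the code above, while the type string, helper name, and every concrete value are invented:

```ruby
# Hypothetical shape of @yaml for a judged helper-style eval.
{
  type: "helper",                    # assumed type string
  args: {
    name: "translate",               # a CompletionPrompt name (assumed)
    input: "Bonjour tout le monde!", # substituted into {{input}}
  },
  judge: {
    llm: "gpt-4o",                   # resolved via DiscourseAi::Evals::Llm.choose (made-up name)
    prompt: <<~PROMPT,
      The user submitted:
      {{input}}

      The model produced:
      {{output}}

      Judge whether the output is a faithful, fluent translation.
    PROMPT
    pass_rating: 8,                  # ratings below this fail the eval
  },
}
```

The judge LLM's reply is scanned for `[RATING]n[/RATING]`; a rating at or above `pass_rating` passes, anything else fails with the rating and the raw judge output attached as context.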

evals/lib/runner.rb

Lines changed: 11 additions & 2 deletions

```diff
@@ -155,9 +155,18 @@ def run!
 
       if result[:result] == :fail
         puts "Failed 🔴"
-        puts "---- Expected ----\n#{result[:expected_output]}"
-        puts "---- Actual ----\n#{result[:actual_output]}"
+        puts "Error: #{result[:message]}" if result[:message]
+        # this is deliberate, it creates a lot of noise, but sometimes for debugging it's useful
+        #puts "Context: #{result[:context].to_s[0..2000]}" if result[:context]
+        if result[:expected_output] && result[:actual_output]
+          puts "---- Expected ----\n#{result[:expected_output]}"
+          puts "---- Actual ----\n#{result[:actual_output]}"
+        end
         logger.error("Evaluation failed with LLM: #{llm.name}")
+        logger.error("Error: #{result[:message]}") if result[:message]
+        logger.error("Expected: #{result[:expected_output]}") if result[:expected_output]
+        logger.error("Actual: #{result[:actual_output]}") if result[:actual_output]
+        logger.error("Context: #{result[:context]}") if result[:context]
       elsif result[:result] == :pass
         puts "Passed 🟢"
         logger.info("Evaluation passed with LLM: #{llm.name}")
```
