Commit 7ba618e

[eval op] Reframe model evaluation ops and pipelines & Fix bug in compiled pl with local llm_serving (#430)
* first version of new eval
* add all types of model evaluation
* fix bug in local llm serving when cleaning up vllm that was never started
* remove useless pl
* fix prompt_template bug in self test
* use formatstrprompt op instead of hard-coding the string prompt in pipeline.py
1 parent 2add9c7 commit 7ba618e

17 files changed: +2421 -11 lines changed
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+{"text": "This is a simple test sentence to measure perplexity for the first unified bench type."}
+{"text": "The capital of France is Paris."}
+{"text": "Please evaluate the language model perplexity on this short example."}
+{"text": "Machine learning enables computers to learn patterns from data."}
+{"text": "Perplexity is a common metric for evaluating language models on text scoring tasks."}
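The records above carry only a "text" field, which suits a perplexity-style score. As a minimal sketch, assuming per-token natural-log probabilities are already available from some model (this diff does not show how the evaluator obtains them), perplexity is the exponential of the negative mean log-probability. The function name and the toy input are hypothetical, not part of this commit.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Toy input (assumption): four tokens, each given probability 0.5
# by a hypothetical model. Lower perplexity means the text was less surprising.
toy_logprobs = [math.log(0.5)] * 4
print(perplexity(toy_logprobs))
```

A model that assigns every token probability 1 would score a perplexity of 1, which is the lower bound.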
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+{"id": "math_001", "question": "Solve for x: 2x + 3 = 11.", "model_answer": "Let's isolate x. Subtract 3 from both sides: 2x = 8. Then divide both sides by 2 to get x = 4. Therefore, the final answer is x = 4.", "golden_label": "x = 4"}
+{"id": "math_002", "question": "If a circle has a radius of 5, what is its area?", "model_answer": "We know the formula for the area of a circle is 2πr. Plugging in r = 5 gives area = 10π. Hence, the area is 10π square units.", "golden_label": "25π"}
+{"id": "math_003", "question": "Find the derivative of f(x) = x².", "model_answer": "Using the power rule, the derivative of x² is 2x¹ = 2x. Therefore, f'(x) = 2x.", "golden_label": "2x"}
+{"id": "math_004", "question": "Solve 3x - 7 = 11.", "model_answer": "Start by adding 7 to both sides: 3x = 18. Divide both sides by 3 to isolate x, which gives x = 6. So the correct solution is x = 6.", "golden_label": "x = 6"}
+{"id": "math_005", "question": "A triangle has sides of lengths 3, 4, and 5. What is its area?", "model_answer": "Since 3² + 4² = 5², it’s a right triangle. The area of a right triangle is ½ × base × height = ½ × 3 × 4 = 6. Hence, the area is 6 square units.", "golden_label": "6"}
+{"id": "math_006", "question": "Find the vertex of the parabola y = x² + 2x + 1.", "model_answer": "The vertex formula is (-b/2a, c - b²/4a). Here, a = 1, b = 2, c = 1. Compute: -b/2a = -2/2 = -1, and y = 1 - 1 = 0. Wait, I think the vertex is (1, 0).", "golden_label": "(-1, 0)"}
+{"id": "math_007", "question": "Simplify (x² - 1)/(x - 1).", "model_answer": "Factor the numerator: (x + 1)(x - 1)/(x - 1). The (x - 1) terms cancel, leaving x + 1. Therefore, the simplified result is x + 1.", "golden_label": "x + 1"}
+{"id": "math_008", "question": "Evaluate log₂(8).", "model_answer": "We know that 2² = 8, so log₂(8) = 2. Therefore, the result is 2.", "golden_label": "3"}
+{"id": "math_009", "question": "Find the indefinite integral ∫x dx.", "model_answer": "The integral of x with respect to x is x²/2 + C. So the answer is x²/2 + C.", "golden_label": "x²/2 + C"}
+{"id": "math_010", "question": "Compute the slope of the line passing through (1, 2) and (4, 8).", "model_answer": "Slope m = (y₂ - y₁)/(x₂ - x₁) = (8 - 2)/(4 - 1) = 6/3 = 2. Hence, the slope is 2.", "golden_label": "2"}
+{"id": "math_011", "question": "Expand (x + 3)².", "model_answer": "Using the binomial theorem: (x + 3)² = x² + 2×x×3 + 3² = x² + 6x + 9. Therefore, the expanded form is x² + 6x + 9.", "golden_label": "x² + 6x + 9"}
+{"id": "math_012", "question": "What is sin(90°)?", "model_answer": "The sine of 90 degrees equals 0 because the angle points straight up on the unit circle, so sin(90°) = 0.", "golden_label": "1"}
+{"id": "math_013", "question": "Find the determinant of [[2, 3], [1, 4]].", "model_answer": "The determinant is ad - bc = 2×4 - 3×1 = 8 - 3 = 5. Therefore, det = 5.", "golden_label": "5"}
+{"id": "math_014", "question": "Compute 7 × 8.", "model_answer": "7 × 8 equals 54. I multiplied 7 × 7 = 49 and added 5, giving 54.", "golden_label": "56"}
+{"id": "math_015", "question": "Find the derivative of f(x) = sin(x).", "model_answer": "The derivative of sin(x) with respect to x is cos(x). Thus, f'(x) = cos(x).", "golden_label": "cos(x)"}
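Each record above pairs a free-form "model_answer" with a short "golden_label", and several answers are deliberately wrong. The sketch below is a naive substring baseline, not the evaluator added in this commit: it only checks whether the whitespace-stripped, lowercased label appears anywhere in the answer. The helper names are hypothetical. Short labels such as a single digit can match by accident, so a real judge would need something stricter.

```python
def normalize(text: str) -> str:
    """Lowercase the text and drop all whitespace."""
    return "".join(text.lower().split())


def contains_golden(model_answer: str, golden_label: str) -> bool:
    """Naive check: does the normalized label occur in the normalized answer?"""
    return normalize(golden_label) in normalize(model_answer)


# Two records copied from the fixture above.
records = [
    {
        "id": "math_001",
        "model_answer": "Let's isolate x. Subtract 3 from both sides: 2x = 8. "
        "Then divide both sides by 2 to get x = 4. "
        "Therefore, the final answer is x = 4.",
        "golden_label": "x = 4",
    },
    {
        "id": "math_002",
        "model_answer": "We know the formula for the area of a circle is 2πr. "
        "Plugging in r = 5 gives area = 10π. "
        "Hence, the area is 10π square units.",
        "golden_label": "25π",
    },
]

results = {
    r["id"]: contains_golden(r["model_answer"], r["golden_label"]) for r in records
}
print(results)
```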
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
{"eval_type":"key2_q_ma","id":"ma_0001","context":"France's capital city is Paris.","question":"What is the capital of France?","targets":["Paris","The capital of France is Paris."]}
2+
{"eval_type":"key2_q_ma","id":"ma_0002","context":"The chemical symbol for water is H2O.","question":"What is the chemical formula for water?","targets":["H2O","h2o"]}
3+
{"eval_type":"key2_q_ma","id":"ma_0003","context":"Python is a popular programming language created by Guido van Rossum.","question":"Who created Python?","targets":["Guido van Rossum","Guido"]}
4+
{"eval_type":"key2_q_ma","id":"ma_0004","context":"The largest planet in our solar system is Jupiter.","question":"Which is the largest planet in the solar system?","targets":["Jupiter","The largest planet is Jupiter."]}
5+
{"eval_type":"key2_q_ma","id":"ma_0005","context":"Light travels at approximately 300,000 kilometers per second in vacuum.","question":"What is the approximate speed of light in vacuum?","targets":["300000 km/s","300,000 km/s","3e5 km/s","approximately 300,000 kilometers per second"]}
6+
{"eval_type":"key2_q_ma","id":"ma_0006","context":"Shakespeare wrote the tragedy Hamlet.","question":"Who wrote Hamlet?","targets":["William Shakespeare","Shakespeare"]}
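The "key2_q_ma" records list several acceptable answers under "targets". One plausible way to score a prediction against such a list is normalized exact match against any target. The sketch below assumes that scoring rule; the diff does not confirm that the evaluator works this way, and both function names are hypothetical.

```python
import re
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, remove punctuation, and collapse runs of whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(no_punct.split())


def matches_any_target(prediction: str, targets: Iterable[str]) -> bool:
    """True when the normalized prediction equals at least one normalized target."""
    wanted = {normalize(t) for t in targets}
    return normalize(prediction) in wanted


# Targets copied from ma_0001 and ma_0005 above.
paris_targets = ["Paris", "The capital of France is Paris."]
light_targets = [
    "300000 km/s",
    "300,000 km/s",
    "3e5 km/s",
    "approximately 300,000 kilometers per second",
]

print(matches_any_target("paris.", paris_targets))
print(matches_any_target("300,000 km/s", light_targets))
print(matches_any_target("London", paris_targets))
```

Listing several surface forms per record, as the fixture does, reduces how much normalization the matcher itself has to do.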
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
{"eval_type":"key3_q_choices_a","id":"mc_0001","context":null,"question":"What is the capital of France?","choices":["Paris","London","Berlin","Rome"],"label":0}
2+
{"eval_type":"key3_q_choices_a","id":"mc_0002","context":null,"question":"In Python, what does len([1, 2, 3]) return?","choices":["2","3","4","An error"],"label":1}
3+
{"eval_type":"key3_q_choices_a","id":"mc_0003","context":"Assume standard SI units.","question":"Which physical quantity is measured in Newtons (N)?","choices":["Energy","Force","Power","Voltage"],"label":1}
4+
{"eval_type":"key3_q_choices_a","id":"mc_0004","context":null,"question":"Which planet is the largest in the Solar System?","choices":["Earth","Mars","Jupiter","Venus"],"label":2}
5+
{"eval_type":"key3_q_choices_a","id":"mc_0005","context":"Consider basic probability with a fair six-sided die.","question":"What is the probability of rolling a 6?","choices":["1/2","1/3","1/6","1/12"],"label":2}
6+
{"eval_type":"key3_q_choices_a","id":"mc_0006","context":"You are reading an English sentence.","question":"Choose the word that best completes the sentence: \"She ___ to the store yesterday.\"","choices":["go","goes","went","going"],"label":2}
7+
{"eval_type":"key3_q_choices_a","id":"mc_0007","context":"Pick the most appropriate next step in the sequence.","question":"What is the next number in the sequence 2, 4, 8, 16, ?","choices":["18","24","32","34"],"label":2}
8+
{"eval_type":"key3_q_choices_a","id":"mc_0008","context":"Computer science basics.","question":"Which data structure uses FIFO (first in, first out)?","choices":["Stack","Queue","Tree","Heap"],"label":1}
9+
{"eval_type":"key3_q_choices_a","id":"mc_0009","context":"Geography.","question":"Which ocean is the largest by surface area?","choices":["Indian Ocean","Atlantic Ocean","Arctic Ocean","Pacific Ocean"],"label":3}
10+
{"eval_type":"key3_q_choices_a","id":"mc_0010","context":"Mathematics.","question":"If x = 3, what is the value of 2x + 5?","choices":["8","10","11","12"],"label":2}
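In the "key3_q_choices_a" records, "label" is a zero-based index into "choices" and "context" may be null. A rough sketch of how such a record could be rendered as a lettered prompt and scored by index accuracy follows. The prompt layout and the helper names are assumptions for illustration; the prompt template actually used by the pipeline is not visible in this part of the diff.

```python
from string import ascii_uppercase
from typing import Mapping, Sequence


def format_prompt(record: Mapping) -> str:
    """Render optional context, the question, and lettered choices (hypothetical layout)."""
    parts = []
    if record.get("context"):  # skip null or empty context
        parts.append(record["context"])
    parts.append(record["question"])
    for letter, choice in zip(ascii_uppercase, record["choices"]):
        parts.append(f"{letter}. {choice}")
    parts.append("Answer:")
    return "\n".join(parts)


def accuracy(predicted: Sequence[int], records: Sequence[Mapping]) -> float:
    """Fraction of records whose predicted choice index equals the gold label."""
    if not records or len(predicted) != len(records):
        raise ValueError("predictions and records must be non-empty and aligned")
    correct = sum(int(p == r["label"]) for p, r in zip(predicted, records))
    return correct / len(records)


# Two records copied from the fixture above (JSON null becomes None).
records = [
    {"id": "mc_0001", "context": None, "question": "What is the capital of France?",
     "choices": ["Paris", "London", "Berlin", "Rome"], "label": 0},
    {"id": "mc_0003", "context": "Assume standard SI units.",
     "question": "Which physical quantity is measured in Newtons (N)?",
     "choices": ["Energy", "Force", "Power", "Voltage"], "label": 1},
]

print(format_prompt(records[1]))
print(accuracy([0, 2], records))  # one of the two toy predictions is right
```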
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
{"eval_type":"key3_q_choices_as","id":"ms_0001","context":null,"question":"Which of the following are prime numbers?","choices":["2","9","11","15"],"labels":[0,2]}
2+
{"eval_type":"key3_q_choices_as","id":"ms_0002","context":"Select all correct statements about Python.","question":"Which of the following are valid ways to create a list in Python?","choices":["[]","list()","{}","()"],"labels":[0,1]}
3+
{"eval_type":"key3_q_choices_as","id":"ms_0003","context":"Consider basic linear algebra.","question":"Which of the following are valid matrix operations?","choices":["Matrix addition (same shape)","Matrix multiplication (inner dims match)","Element-wise division is always defined","Taking the determinant of a non-square matrix"],"labels":[0,1]}
4+
{"eval_type":"key3_q_choices_as","id":"ms_0004","context":"HTTP request methods are standardized verbs.","question":"Which of the following are HTTP methods?","choices":["GET","FETCH","POST","PUSH"],"labels":[0,2]}
5+
{"eval_type":"key3_q_choices_as","id":"ms_0005","context":"JSON has a small set of primitive and composite types.","question":"Which of the following are valid JSON value types?","choices":["string","tuple","number","object"],"labels":[0,2,3]}
6+
{"eval_type":"key3_q_choices_as","id":"ms_0006","context":"Pick all that apply.","question":"Which tasks are typically supervised learning?","choices":["Image classification","K-means clustering","Linear regression","PCA"],"labels":[0,2]}
7+
{"eval_type":"key3_q_choices_as","id":"ms_0007","context":"Recall basic operating system concepts.","question":"Which of the following are common process scheduling algorithms?","choices":["Round-robin","Shortest Job First","Breadth-first search","First Come First Served"],"labels":[0,1,3]}
8+
{"eval_type":"key3_q_choices_as","id":"ms_0008","context":"Select all true statements about TCP.","question":"Which of the following are features of TCP?","choices":["Connection-oriented","Guarantees in-order delivery","Message boundaries are preserved","Congestion control mechanisms exist"],"labels":[0,1,3]}
9+
{"eval_type":"key3_q_choices_as","id":"ms_0009","context":"Choose all correct options.","question":"Which of the following numbers are divisible by 3?","choices":["21","22","24","25"],"labels":[0,2]}
10+
{"eval_type":"key3_q_choices_as","id":"ms_0010","context":"Basic set theory.","question":"Which of the following sets are subsets of {1,2,3}?","choices":["{1,2}","{2,4}","{ }","{1,2,3,4}"],"labels":[0,2]}
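The "key3_q_choices_as" records replace the single "label" with a list of "labels", so a prediction is a set of indices. Two common ways to score such predictions are strict set equality and a per-record F1 that gives partial credit. Which of these, if either, the evaluator in this commit reports is not shown here; the sketch below is only an illustration with hypothetical function names.

```python
from typing import Iterable


def exact_set_match(predicted: Iterable[int], gold: Iterable[int]) -> bool:
    """Strict scoring: the selected indices must equal the gold indices as sets."""
    return set(predicted) == set(gold)


def selection_f1(predicted: Iterable[int], gold: Iterable[int]) -> float:
    """Partial-credit scoring: harmonic mean of precision and recall over indices."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # nothing to select and nothing selected
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Gold labels copied from ms_0001 ([0, 2]) and ms_0005 ([0, 2, 3]) above.
print(exact_set_match([2, 0], [0, 2]))   # order does not matter
print(selection_f1([0, 1], [0, 2]))      # one right, one wrong
print(selection_f1([0], [0, 2, 3]))      # precise but incomplete
```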
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
{"eval_type":"key3_q_a_rejected","id":"pw_0001","context":null,"question":"Explain what overfitting is in machine learning.","better":"Overfitting is when a model learns the training data too closely, including noise, so it performs well on training data but poorly on new, unseen data.","rejected":"Overfitting means the model is too good and always performs well everywhere."}
2+
{"eval_type":"key3_q_a_rejected","id":"pw_0002","context":"Answer concisely in one sentence.","question":"What is the capital of France?","better":"Paris.","rejected":"France is a country in Europe with many cities."}
3+
{"eval_type":"key3_q_a_rejected","id":"pw_0003","context":"Provide a clear step-by-step solution.","question":"Solve: 2x + 5 = 11.","better":"Subtract 5 from both sides to get 2x=6, then divide by 2 to get x=3.","rejected":"x is 11 because 2x plus 5 is 11."}
4+
{"eval_type":"key3_q_a_rejected","id":"pw_0004","context":"You are writing documentation for a Python beginner.","question":"What does a Python list represent?","better":"A list is an ordered, mutable collection of items, written with square brackets like [1, 2, 3].","rejected":"A list is like a dictionary but faster and uses curly braces."}
5+
{"eval_type":"key3_q_a_rejected","id":"pw_0005","context":"Give a polite refusal.","question":"Can you share my friend's private phone number?","better":"Sorry, I can’t help with sharing someone’s private contact information without their permission.","rejected":"Sure, tell me their name and I’ll provide the number."}
6+
{"eval_type":"key3_q_a_rejected","id":"pw_0006","context":"Explain in simple terms.","question":"What is an API?","better":"An API is a set of rules that lets different software systems talk to each other, like a menu of requests you can make and the responses you’ll get.","rejected":"An API is a database that stores all your application’s data."}
7+
{"eval_type":"key3_q_a_rejected","id":"pw_0007","context":"Answer with one short paragraph.","question":"Why do we use HTTPS instead of HTTP?","better":"HTTPS encrypts data between your browser and the server, which helps prevent eavesdropping and tampering, improving security and trust.","rejected":"HTTPS is used because it makes websites load faster than HTTP in all cases."}
8+
{"eval_type":"key3_q_a_rejected","id":"pw_0008","context":"Focus on correctness.","question":"What is the derivative of x^2?","better":"The derivative of x^2 with respect to x is 2x.","rejected":"The derivative of x^2 is x."}
9+
{"eval_type":"key3_q_a_rejected","id":"pw_0009","context":"Return a direct answer.","question":"How many bytes are in a kilobyte (KB) in the decimal system?","better":"1 KB is 1000 bytes in the decimal (SI) system.","rejected":"1 KB is always 1024 bytes no matter what."}
10+
{"eval_type":"key3_q_a_rejected","id":"pw_0010","context":"Be clear and practical.","question":"How can you reduce Python virtual environment dependency conflicts?","better":"Pin dependencies with exact versions, use a lock file when possible, isolate projects per environment, and upgrade packages in a controlled way.","rejected":"Just install everything globally; conflicts will resolve themselves."}
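The "key3_q_a_rejected" records pair a "better" response with a "rejected" one for the same question. A typical pairwise metric is the fraction of records where some scoring function ranks "better" above "rejected". The sketch below takes the scorer as a caller-supplied function, since the scoring used by the evaluator (for example a model log-likelihood) is not shown in this diff. The helper names and the toy length-based scorer are assumptions, and the toy scorer is deliberately poor.

```python
from typing import Callable, Mapping, Sequence


def build_prompt(record: Mapping) -> str:
    """Join optional context and the question (hypothetical layout)."""
    context = record.get("context") or ""
    return f"{context}\n{record['question']}".strip()


def preference_accuracy(
    records: Sequence[Mapping], score: Callable[[str, str], float]
) -> float:
    """Fraction of records where score(prompt, better) > score(prompt, rejected)."""
    if not records:
        raise ValueError("need at least one record")
    wins = 0
    for record in records:
        prompt = build_prompt(record)
        if score(prompt, record["better"]) > score(prompt, record["rejected"]):
            wins += 1
    return wins / len(records)


def shorter_is_better(prompt: str, response: str) -> float:
    """Toy scorer (assumption): prefers shorter responses, ignoring the prompt."""
    return -float(len(response))


# Two records copied from the fixture above.
records = [
    {"id": "pw_0002", "context": "Answer concisely in one sentence.",
     "question": "What is the capital of France?", "better": "Paris.",
     "rejected": "France is a country in Europe with many cities."},
    {"id": "pw_0008", "context": "Focus on correctness.",
     "question": "What is the derivative of x^2?",
     "better": "The derivative of x^2 with respect to x is 2x.",
     "rejected": "The derivative of x^2 is x."},
]

print(preference_accuracy(records, shorter_is_better))
```

The length heuristic gets pw_0002 right and pw_0008 wrong, which shows why the scorer, not the metric, carries the real judgment.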

dataflow/operators/core_text/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -8,8 +8,10 @@
 from .generate.text2multihopqa_generator import Text2MultiHopQAGenerator
 from .generate.embedding_generator import EmbeddingGenerator
 from .generate.retrieval_generator import RetrievalGenerator
+from .generate.bench_answer_generator import BenchAnswerGenerator
 from .eval.bench_dataset_evaluator import BenchDatasetEvaluator
 from .eval.bench_dataset_evaluator_question import BenchDatasetEvaluatorQuestion
+from .eval.unified_bench_dataset_evaluator import UnifiedBenchDatasetEvaluator
 from .eval.text2qa_sample_evaluator import Text2QASampleEvaluator
 from .eval.prompted_eval import PromptedEvaluator
 from .filter.prompted_filter import PromptedFilter
