Merged
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
{"text": "This is a simple test sentence to measure perplexity for the first unified bench type."}
{"text": "The capital of France is Paris."}
{"text": "Please evaluate the language model perplexity on this short example."}
{"text": "Machine learning enables computers to learn patterns from data."}
{"text": "Perplexity is a common metric for evaluating language models on text scoring tasks."}
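For readers unfamiliar with the metric these `text` records are meant to exercise, here is a minimal sketch of perplexity computed from per-token log-probabilities. It is not the evaluator added in this PR; the function name `perplexity` and the example log-probabilities are hypothetical, and a real run would obtain the log-probabilities from the model being scored.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))) over one sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical values: four tokens, each assigned probability 0.25 by the model.
example_logprobs = [math.log(0.25)] * 4
ppl = perplexity(example_logprobs)
```

A model that assigns every token probability 0.25 is as uncertain as a uniform choice among four options, so lower values indicate a better fit to the text.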
15 changes: 15 additions & 0 deletions dataflow/example/core_text_data/unified_bench_eval_type2.jsonl
@@ -0,0 +1,15 @@
{"id": "math_001", "question": "Solve for x: 2x + 3 = 11.", "model_answer": "Let's isolate x. Subtract 3 from both sides: 2x = 8. Then divide both sides by 2 to get x = 4. Therefore, the final answer is x = 4.", "golden_label": "x = 4"}
{"id": "math_002", "question": "If a circle has a radius of 5, what is its area?", "model_answer": "We know the formula for the area of a circle is 2πr. Plugging in r = 5 gives area = 10π. Hence, the area is 10π square units.", "golden_label": "25π"}
{"id": "math_003", "question": "Find the derivative of f(x) = x².", "model_answer": "Using the power rule, the derivative of x² is 2x¹ = 2x. Therefore, f'(x) = 2x.", "golden_label": "2x"}
{"id": "math_004", "question": "Solve 3x - 7 = 11.", "model_answer": "Start by adding 7 to both sides: 3x = 18. Divide both sides by 3 to isolate x, which gives x = 6. So the correct solution is x = 6.", "golden_label": "x = 6"}
{"id": "math_005", "question": "A triangle has sides of lengths 3, 4, and 5. What is its area?", "model_answer": "Since 3² + 4² = 5², it’s a right triangle. The area of a right triangle is ½ × base × height = ½ × 3 × 4 = 6. Hence, the area is 6 square units.", "golden_label": "6"}
{"id": "math_006", "question": "Find the vertex of the parabola y = x² + 2x + 1.", "model_answer": "The vertex formula is (-b/2a, c - b²/4a). Here, a = 1, b = 2, c = 1. Compute: -b/2a = -2/2 = -1, and y = 1 - 1 = 0. Wait, I think the vertex is (1, 0).", "golden_label": "(-1, 0)"}
{"id": "math_007", "question": "Simplify (x² - 1)/(x - 1).", "model_answer": "Factor the numerator: (x + 1)(x - 1)/(x - 1). The (x - 1) terms cancel, leaving x + 1. Therefore, the simplified result is x + 1.", "golden_label": "x + 1"}
{"id": "math_008", "question": "Evaluate log₂(8).", "model_answer": "We know that 2² = 8, so log₂(8) = 2. Therefore, the result is 2.", "golden_label": "3"}
{"id": "math_009", "question": "Find the indefinite integral ∫x dx.", "model_answer": "The integral of x with respect to x is x²/2 + C. So the answer is x²/2 + C.", "golden_label": "x²/2 + C"}
{"id": "math_010", "question": "Compute the slope of the line passing through (1, 2) and (4, 8).", "model_answer": "Slope m = (y₂ - y₁)/(x₂ - x₁) = (8 - 2)/(4 - 1) = 6/3 = 2. Hence, the slope is 2.", "golden_label": "2"}
{"id": "math_011", "question": "Expand (x + 3)².", "model_answer": "Using the binomial theorem: (x + 3)² = x² + 2×x×3 + 3² = x² + 6x + 9. Therefore, the expanded form is x² + 6x + 9.", "golden_label": "x² + 6x + 9"}
{"id": "math_012", "question": "What is sin(90°)?", "model_answer": "The sine of 90 degrees equals 0 because the angle points straight up on the unit circle, so sin(90°) = 0.", "golden_label": "1"}
{"id": "math_013", "question": "Find the determinant of [[2, 3], [1, 4]].", "model_answer": "The determinant is ad - bc = 2×4 - 3×1 = 8 - 3 = 5. Therefore, det = 5.", "golden_label": "5"}
{"id": "math_014", "question": "Compute 7 × 8.", "model_answer": "7 × 8 equals 54. I multiplied 7 × 7 = 49 and added 5, giving 54.", "golden_label": "56"}
{"id": "math_015", "question": "Find the derivative of f(x) = sin(x).", "model_answer": "The derivative of sin(x) with respect to x is cos(x). Thus, f'(x) = cos(x).", "golden_label": "cos(x)"}
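Several `model_answer` values above are deliberately wrong (for example `math_002`, `math_008`, `math_014`), so the file exercises both the match and the mismatch path. The sketch below shows one naive way to compare `model_answer` against `golden_label`. It is an assumption for illustration only: `naive_match` is a hypothetical helper, the records are abridged from the file, and the evaluator in this PR may use a different comparison entirely.

```python
import json

def naive_match(model_answer: str, golden_label: str) -> bool:
    """Whitespace- and case-insensitive substring check (brittle heuristic)."""
    def squash(s: str) -> str:
        return "".join(s.split()).lower()
    return squash(golden_label) in squash(model_answer)

# Abridged records in the same shape as the JSONL lines above.
lines = [
    '{"id": "math_001", "model_answer": "Therefore, the final answer is x = 4.", "golden_label": "x = 4"}',
    '{"id": "math_014", "model_answer": "7 × 8 equals 54.", "golden_label": "56"}',
]
records = [json.loads(line) for line in lines]
results = {r["id"]: naive_match(r["model_answer"], r["golden_label"]) for r in records}
accuracy = sum(results.values()) / len(results)
```

Substring matching produces false positives when a short label such as `2` occurs inside an unrelated number like `12`, which is one reason a semantic or judge-based comparison is usually preferred for free-form answers.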
@@ -0,0 +1,6 @@
{"eval_type":"key2_q_ma","id":"ma_0001","context":"France's capital city is Paris.","question":"What is the capital of France?","targets":["Paris","The capital of France is Paris."]}
{"eval_type":"key2_q_ma","id":"ma_0002","context":"The chemical symbol for water is H2O.","question":"What is the chemical formula for water?","targets":["H2O","h2o"]}
{"eval_type":"key2_q_ma","id":"ma_0003","context":"Python is a popular programming language created by Guido van Rossum.","question":"Who created Python?","targets":["Guido van Rossum","Guido"]}
{"eval_type":"key2_q_ma","id":"ma_0004","context":"The largest planet in our solar system is Jupiter.","question":"Which is the largest planet in the solar system?","targets":["Jupiter","The largest planet is Jupiter."]}
{"eval_type":"key2_q_ma","id":"ma_0005","context":"Light travels at approximately 300,000 kilometers per second in vacuum.","question":"What is the approximate speed of light in vacuum?","targets":["300000 km/s","300,000 km/s","3e5 km/s","approximately 300,000 kilometers per second"]}
{"eval_type":"key2_q_ma","id":"ma_0006","context":"Shakespeare wrote the tragedy Hamlet.","question":"Who wrote Hamlet?","targets":["William Shakespeare","Shakespeare"]}
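Each `key2_q_ma` record lists several acceptable `targets`. A minimal sketch of scoring a prediction against any of them follows; `normalize` and `matches_any` are hypothetical helpers, and the normalization rules (lowercasing, trimming a trailing period, collapsing whitespace) are assumptions rather than the behaviour of the evaluator in this PR.

```python
from typing import Sequence

def normalize(text: str) -> str:
    """Lowercase, drop a trailing period, and collapse runs of whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())

def matches_any(prediction: str, targets: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized target."""
    pred = normalize(prediction)
    return any(pred == normalize(target) for target in targets)

targets = ["Paris", "The capital of France is Paris."]
hit = matches_any("paris.", targets)
miss = matches_any("London", targets)
```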
10 changes: 10 additions & 0 deletions dataflow/example/core_text_data/unified_bench_eval_type4.jsonl
@@ -0,0 +1,10 @@
{"eval_type":"key3_q_choices_a","id":"mc_0001","context":null,"question":"What is the capital of France?","choices":["Paris","London","Berlin","Rome"],"label":0}
{"eval_type":"key3_q_choices_a","id":"mc_0002","context":null,"question":"In Python, what does len([1, 2, 3]) return?","choices":["2","3","4","An error"],"label":1}
{"eval_type":"key3_q_choices_a","id":"mc_0003","context":"Assume standard SI units.","question":"Which physical quantity is measured in Newtons (N)?","choices":["Energy","Force","Power","Voltage"],"label":1}
{"eval_type":"key3_q_choices_a","id":"mc_0004","context":null,"question":"Which planet is the largest in the Solar System?","choices":["Earth","Mars","Jupiter","Venus"],"label":2}
{"eval_type":"key3_q_choices_a","id":"mc_0005","context":"Consider basic probability with a fair six-sided die.","question":"What is the probability of rolling a 6?","choices":["1/2","1/3","1/6","1/12"],"label":2}
{"eval_type":"key3_q_choices_a","id":"mc_0006","context":"You are reading an English sentence.","question":"Choose the word that best completes the sentence: \"She ___ to the store yesterday.\"","choices":["go","goes","went","going"],"label":2}
{"eval_type":"key3_q_choices_a","id":"mc_0007","context":"Pick the most appropriate next step in the sequence.","question":"What is the next number in the sequence 2, 4, 8, 16, ?","choices":["18","24","32","34"],"label":2}
{"eval_type":"key3_q_choices_a","id":"mc_0008","context":"Computer science basics.","question":"Which data structure uses FIFO (first in, first out)?","choices":["Stack","Queue","Tree","Heap"],"label":1}
{"eval_type":"key3_q_choices_a","id":"mc_0009","context":"Geography.","question":"Which ocean is the largest by surface area?","choices":["Indian Ocean","Atlantic Ocean","Arctic Ocean","Pacific Ocean"],"label":3}
{"eval_type":"key3_q_choices_a","id":"mc_0010","context":"Mathematics.","question":"If x = 3, what is the value of 2x + 5?","choices":["8","10","11","12"],"label":2}
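In the `key3_q_choices_a` records, `label` is a zero-based index into `choices` and `context` may be `null`. The sketch below formats one record as a lettered prompt and computes accuracy from predicted indices. `format_prompt` and `mc_accuracy` are hypothetical names, and the prompt layout is an assumption, not the template used by the evaluator in this PR.

```python
import string

def format_prompt(record: dict) -> str:
    """Render optional context, the question, and lettered choices."""
    parts = []
    if record.get("context"):  # context is null for some records
        parts.append(record["context"])
    parts.append(record["question"])
    for letter, choice in zip(string.ascii_uppercase, record["choices"]):
        parts.append(f"{letter}. {choice}")
    return "\n".join(parts)

def mc_accuracy(records, predicted_indices):
    """Fraction of records whose predicted index equals the gold label."""
    if len(records) != len(predicted_indices):
        raise ValueError("one prediction per record is required")
    correct = sum(p == r["label"] for r, p in zip(records, predicted_indices))
    return correct / len(records)

records = [
    {"context": None, "question": "What is the capital of France?",
     "choices": ["Paris", "London", "Berlin", "Rome"], "label": 0},
    {"context": "Assume standard SI units.",
     "question": "Which physical quantity is measured in Newtons (N)?",
     "choices": ["Energy", "Force", "Power", "Voltage"], "label": 1},
]
prompt = format_prompt(records[0])
accuracy = mc_accuracy(records, [0, 3])  # second prediction is wrong on purpose
```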
10 changes: 10 additions & 0 deletions dataflow/example/core_text_data/unified_bench_eval_type5.jsonl
@@ -0,0 +1,10 @@
{"eval_type":"key3_q_choices_as","id":"ms_0001","context":null,"question":"Which of the following are prime numbers?","choices":["2","9","11","15"],"labels":[0,2]}
{"eval_type":"key3_q_choices_as","id":"ms_0002","context":"Select all correct statements about Python.","question":"Which of the following are valid ways to create a list in Python?","choices":["[]","list()","{}","()"],"labels":[0,1]}
{"eval_type":"key3_q_choices_as","id":"ms_0003","context":"Consider basic linear algebra.","question":"Which of the following are valid matrix operations?","choices":["Matrix addition (same shape)","Matrix multiplication (inner dims match)","Element-wise division is always defined","Taking the determinant of a non-square matrix"],"labels":[0,1]}
{"eval_type":"key3_q_choices_as","id":"ms_0004","context":"HTTP request methods are standardized verbs.","question":"Which of the following are HTTP methods?","choices":["GET","FETCH","POST","PUSH"],"labels":[0,2]}
{"eval_type":"key3_q_choices_as","id":"ms_0005","context":"JSON has a small set of primitive and composite types.","question":"Which of the following are valid JSON value types?","choices":["string","tuple","number","object"],"labels":[0,2,3]}
{"eval_type":"key3_q_choices_as","id":"ms_0006","context":"Pick all that apply.","question":"Which tasks are typically supervised learning?","choices":["Image classification","K-means clustering","Linear regression","PCA"],"labels":[0,2]}
{"eval_type":"key3_q_choices_as","id":"ms_0007","context":"Recall basic operating system concepts.","question":"Which of the following are common process scheduling algorithms?","choices":["Round-robin","Shortest Job First","Breadth-first search","First Come First Served"],"labels":[0,1,3]}
{"eval_type":"key3_q_choices_as","id":"ms_0008","context":"Select all true statements about TCP.","question":"Which of the following are features of TCP?","choices":["Connection-oriented","Guarantees in-order delivery","Message boundaries are preserved","Congestion control mechanisms exist"],"labels":[0,1,3]}
{"eval_type":"key3_q_choices_as","id":"ms_0009","context":"Choose all correct options.","question":"Which of the following numbers are divisible by 3?","choices":["21","22","24","25"],"labels":[0,2]}
{"eval_type":"key3_q_choices_as","id":"ms_0010","context":"Basic set theory.","question":"Which of the following sets are subsets of {1,2,3}?","choices":["{1,2}","{2,4}","{ }","{1,2,3,4}"],"labels":[0,2]}
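For the multi-select records, `labels` is a list of zero-based indices, so a prediction can be partially right. One possible scoring scheme, shown here as an assumption rather than as what the evaluator in this PR does, reports both strict set equality and an F1 score over the selected indices. `score_multi_select` is a hypothetical helper.

```python
def score_multi_select(predicted, gold):
    """Return (exact_match, f1) for two collections of choice indices."""
    pred_set, gold_set = set(predicted), set(gold)
    exact = pred_set == gold_set
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return exact, f1

# ms_0001: the primes are at indices 0 and 2.
full_credit = score_multi_select([0, 2], [0, 2])
partial_credit = score_multi_select([0], [0, 2])
```

Exact match is the stricter of the two; F1 gives partial credit when a model selects only some of the correct options.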
10 changes: 10 additions & 0 deletions dataflow/example/core_text_data/unified_bench_eval_type6.jsonl
@@ -0,0 +1,10 @@
{"eval_type":"key3_q_a_rejected","id":"pw_0001","context":null,"question":"Explain what overfitting is in machine learning.","better":"Overfitting is when a model learns the training data too closely, including noise, so it performs well on training data but poorly on new, unseen data.","rejected":"Overfitting means the model is too good and always performs well everywhere."}
{"eval_type":"key3_q_a_rejected","id":"pw_0002","context":"Answer concisely in one sentence.","question":"What is the capital of France?","better":"Paris.","rejected":"France is a country in Europe with many cities."}
{"eval_type":"key3_q_a_rejected","id":"pw_0003","context":"Provide a clear step-by-step solution.","question":"Solve: 2x + 5 = 11.","better":"Subtract 5 from both sides to get 2x=6, then divide by 2 to get x=3.","rejected":"x is 11 because 2x plus 5 is 11."}
{"eval_type":"key3_q_a_rejected","id":"pw_0004","context":"You are writing documentation for a Python beginner.","question":"What does a Python list represent?","better":"A list is an ordered, mutable collection of items, written with square brackets like [1, 2, 3].","rejected":"A list is like a dictionary but faster and uses curly braces."}
{"eval_type":"key3_q_a_rejected","id":"pw_0005","context":"Give a polite refusal.","question":"Can you share my friend's private phone number?","better":"Sorry, I can’t help with sharing someone’s private contact information without their permission.","rejected":"Sure, tell me their name and I’ll provide the number."}
{"eval_type":"key3_q_a_rejected","id":"pw_0006","context":"Explain in simple terms.","question":"What is an API?","better":"An API is a set of rules that lets different software systems talk to each other, like a menu of requests you can make and the responses you’ll get.","rejected":"An API is a database that stores all your application’s data."}
{"eval_type":"key3_q_a_rejected","id":"pw_0007","context":"Answer with one short paragraph.","question":"Why do we use HTTPS instead of HTTP?","better":"HTTPS encrypts data between your browser and the server, which helps prevent eavesdropping and tampering, improving security and trust.","rejected":"HTTPS is used because it makes websites load faster than HTTP in all cases."}
{"eval_type":"key3_q_a_rejected","id":"pw_0008","context":"Focus on correctness.","question":"What is the derivative of x^2?","better":"The derivative of x^2 with respect to x is 2x.","rejected":"The derivative of x^2 is x."}
{"eval_type":"key3_q_a_rejected","id":"pw_0009","context":"Return a direct answer.","question":"How many bytes are in a kilobyte (KB) in the decimal system?","better":"1 KB is 1000 bytes in the decimal (SI) system.","rejected":"1 KB is always 1024 bytes no matter what."}
{"eval_type":"key3_q_a_rejected","id":"pw_0010","context":"Be clear and practical.","question":"How can you reduce Python virtual environment dependency conflicts?","better":"Pin dependencies with exact versions, use a lock file when possible, isolate projects per environment, and upgrade packages in a controlled way.","rejected":"Just install everything globally; conflicts will resolve themselves."}
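The `better`/`rejected` pairs suggest a preference-style check: a model passes a record when it scores the `better` answer above the `rejected` one. A minimal sketch under that assumption follows; `pairwise_accuracy` is a hypothetical helper, the scores are made-up sequence log-likelihoods, and how the evaluator in this PR actually scores each answer is not shown in this diff.

```python
def pairwise_accuracy(score_pairs):
    """score_pairs: iterable of (better_score, rejected_score) tuples."""
    pairs = list(score_pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for better, rejected in pairs if better > rejected)
    return wins / len(pairs)

# Hypothetical log-likelihoods: the model prefers "better" in the first pair only.
scores = [(-12.4, -20.1), (-30.0, -25.5)]
accuracy = pairwise_accuracy(scores)
```

Ties count as losses here; that choice is arbitrary and worth stating explicitly in whatever metric is finally reported.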
2 changes: 2 additions & 0 deletions dataflow/operators/core_text/__init__.py
@@ -8,8 +8,10 @@
from .generate.text2multihopqa_generator import Text2MultiHopQAGenerator
from .generate.embedding_generator import EmbeddingGenerator
from .generate.retrieval_generator import RetrievalGenerator
from .generate.bench_answer_generator import BenchAnswerGenerator
from .eval.bench_dataset_evaluator import BenchDatasetEvaluator
from .eval.bench_dataset_evaluator_question import BenchDatasetEvaluatorQuestion
from .eval.unified_bench_dataset_evaluator import UnifiedBenchDatasetEvaluator
from .eval.text2qa_sample_evaluator import Text2QASampleEvaluator
from .eval.prompted_eval import PromptedEvaluator
from .filter.prompted_filter import PromptedFilter