
Commit 4cfa6b1

docs: moved from ipnb files to md files (#1482)
From now on, documentation is expected to be written as `.md` files; the conversion from notebooks is handled by a script, run via `make ipynb-to-md`.

- All generated `md` files should have the `_` prefix so that we know they are converted files, and **generated md files are not expected to be edited manually.**
- We can now use all the md formats available (though it should have worked otherwise too).

![image](https://github.com/user-attachments/assets/9cf11719-0f5e-45b8-84a1-aebd7bd0d07b)

You can visualize the diff here:

- https://docs.ragas.io/en/latest/howtos/customizations/metrics/write_your_own_metric/ (generated from ipynb)
- https://ragas--1482.org.readthedocs.build/en/1482/howtos/customizations/metrics/_write_your_own_metric/ (generated from .md)

Do compare them side by side before merging.
1 parent 4899f58 commit 4cfa6b1

24 files changed: +3159 −25 lines changed

.readthedocs.yml

Lines changed: 2 additions & 1 deletion
@@ -7,4 +7,5 @@ build:
   commands:
     - pip install -e .[docs]
     - if [ -n "$GH_TOKEN" ]; then pip install git+https://${GH_TOKEN}@github.com/squidfunk/mkdocs-material-insiders.git; fi
-    - mkdocs build --site-dir $READTHEDOCS_OUTPUT/html
+    - make ipynb-to-md
+    - mkdocs build --site-dir $READTHEDOCS_OUTPUT/html

Makefile

Lines changed: 5 additions & 2 deletions
@@ -33,11 +33,14 @@ test-e2e: ## Run end2end tests
 run-ci: format lint type test ## Running all CI checks

 # Docs
-docsite: ## Build and serve documentation
-    @mkdocs serve --dirty
 rewrite-docs: ## Use GPT4 to rewrite the documentation
     @echo "Rewriting the documentation in directory $(DIR)..."
     @python $(GIT_ROOT)/docs/python alphred.py --directory $(DIR)
+ipynb-to-md: ## Convert ipynb files to md files
+    @python $(GIT_ROOT)/scripts/ipynb_to_md.py
+docsite: ## Build and serve documentation
+    @$(MAKE) ipynb-to-md
+    @mkdocs serve

 # Benchmarks
 run-benchmarks-eval: ## Run benchmarks for Evaluation
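
The new `ipynb-to-md` target calls `scripts/ipynb_to_md.py`, which is not included in this diff. As a purely hypothetical sketch of what such a script might do (assuming it uses `nbformat`/`nbconvert`; the real implementation may differ), it could look roughly like this:

```python
from pathlib import Path

import nbformat
from nbconvert import MarkdownExporter

exporter = MarkdownExporter()

# Convert every notebook under docs/ into a Markdown file with a leading
# underscore, the convention this commit uses for generated files.
for nb_path in Path("docs").rglob("*.ipynb"):
    notebook = nbformat.read(str(nb_path), as_version=4)
    body, _resources = exporter.from_notebook_node(notebook)
    out_path = nb_path.with_name("_" + nb_path.stem + ".md")
    out_path.write_text(body)
```

Read the Docs runs `make ipynb-to-md` before `mkdocs build` (see the `.readthedocs.yml` change above), and the `docsite` target now does the same before `mkdocs serve`.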

docs/howtos/applications/_cost.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
# Understand Cost and Usage of Operations

When using LLMs for evaluation and test set generation, cost is an important factor. Ragas provides a few tools to help you track it.

## Understanding `TokenUsageParser`

By default, Ragas does not calculate token usage for `evaluate()`, because langchain LLMs do not always return token-usage information in a uniform way. To get usage data, we have to supply a `TokenUsageParser`.

A `TokenUsageParser` is a function that parses the `LLMResult` or `ChatResult` returned by a langchain model's `generate_prompt()` call and outputs the `TokenUsage` object that Ragas expects.

As an example, here is how to parse OpenAI results with a parser Ragas already provides.


```python
from langchain_openai.chat_models import ChatOpenAI
from langchain_core.prompt_values import StringPromptValue

gpt4o = ChatOpenAI(model="gpt-4o")
p = StringPromptValue(text="hai there")
llm_result = gpt4o.generate_prompt([p])

# let's import a parser for OpenAI
from ragas.cost import get_token_usage_for_openai

get_token_usage_for_openai(llm_result)
```

    TokenUsage(input_tokens=9, output_tokens=9, model='')

You can define your own parser or import one if it is already available. If you would like to suggest a parser for other LLM providers or contribute your own, please check out this [issue](https://github.com/explodinggradients/ragas/issues/1151) 🙂.
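
For illustration, here is a rough sketch of what a hand-rolled parser could look like. This is not part of the committed docs, and the metadata it reads (a `usage_metadata` dict with `input_tokens`/`output_tokens` keys on each generation's message) is an assumption about what your provider's langchain integration returns, so treat it as a starting point only.

```python
from langchain_core.outputs import LLMResult
from ragas.cost import TokenUsage


def my_token_usage_parser(result: LLMResult) -> TokenUsage:
    """Sum the token counts reported for each generation in the result.

    Assumption: chat generations carry `message.usage_metadata` with
    `input_tokens` / `output_tokens` keys (true for several langchain chat
    integrations, but not all).
    """
    input_tokens, output_tokens = 0, 0
    for generations in result.generations:
        for generation in generations:
            usage = getattr(getattr(generation, "message", None), "usage_metadata", None) or {}
            input_tokens += usage.get("input_tokens", 0)
            output_tokens += usage.get("output_tokens", 0)
    return TokenUsage(input_tokens=input_tokens, output_tokens=output_tokens)
```

Such a function can then be passed to `evaluate()` through the same `token_usage_parser` argument used below.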

You can use it for evaluations like so, reusing the example from [get started](get-started-evaluation).


```python
from datasets import load_dataset
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
amnesty_qa
```

    Repo card metadata block was not found. Setting CardData to empty.

    DatasetDict({
        eval: Dataset({
            features: ['question', 'ground_truth', 'answer', 'contexts'],
            num_rows: 20
        })
    })

```python
from ragas import evaluate
from ragas.cost import get_token_usage_for_openai

result = evaluate(
    amnesty_qa["eval"],
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
    llm=gpt4o,
    token_usage_parser=get_token_usage_for_openai,
)
```

    Evaluating:   0%|          | 0/80 [00:00<?, ?it/s]


```python
result.total_tokens()
```

    TokenUsage(input_tokens=116765, output_tokens=39031, model='')

You can compute the cost for each run by passing the cost per token to the `Result.total_cost()` function.

In this case, GPT-4o costs $5 per 1M input tokens and $15 per 1M output tokens.


```python
result.total_cost(cost_per_input_token=5 / 1e6, cost_per_output_token=15 / 1e6)
```

    1.1692900000000002
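
As a quick sanity check on that figure, the cost follows directly from the token counts reported by `result.total_tokens()` above:

```python
# 116,765 input tokens and 39,031 output tokens at $5 / $15 per 1M tokens
input_cost = 116_765 * 5 / 1e6     # 0.583825
output_cost = 39_031 * 15 / 1e6    # 0.585465
print(input_cost + output_cost)    # 1.16929 -- matches result.total_cost(...)
```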

docs/howtos/applications/compare_embeddings.md

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ test_answers = [[item] for item in test_df['answer'].values.tolist()]
 Here I am using llama-index to build a basic RAG pipeline with my documents. The goal here is to collect retrieved contexts and generated answer for each of the test questions from your pipeline. Ragas has integrations with various RAG frameworks which makes evaluating them easier using ragas.

 !!! note
-    refer to [langchain-tutorial](../integrations/langchain.ipynb) see how to evaluate using langchain
+    refer to [langchain-tutorial](../integrations/_langchain.md) see how to evaluate using langchain

 ```python

docs/howtos/applications/compare_llms.md

Lines changed: 1 addition & 1 deletion
@@ -83,7 +83,7 @@ test_answers = [[item] for item in test_df['answer'].values.tolist()]
 Here I am using llama-index to build a basic RAG pipeline with my documents. The goal here is to collect retrieved contexts and generated answer for each of the test questions from your pipeline. Ragas has integrations with various RAG frameworks which makes evaluating them easier using ragas.

 !!! note
-    refer to [langchain-tutorial](../integrations/langchain.ipynb) see how to evaluate using langchain
+    refer to [langchain-tutorial](../integrations/_langchain.md) see how to evaluate using langchain

 ```python
 import nest_asyncio
docs/howtos/customizations/_run_config.md

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# RunConfig

The `RunConfig` lets you pass run parameters to functions like `evaluate()` and `TestsetGenerator.generate()`. Depending on your LLM provider's rate limits, SLAs and traffic, tuning these parameters can improve the speed and reliability of Ragas runs.

How to configure the `RunConfig` in:

- [Evaluate](#evaluate)
- [TestsetGenerator]()

## Rate Limits

Ragas leverages parallelism with async in Python, and the `RunConfig` has a field called `max_workers` that controls the number of concurrent requests allowed at once. Adjust this to the maximum concurrency your provider allows.


```python
from ragas.run_config import RunConfig

# increasing max_workers to 64 and timeout to 60 seconds
my_run_config = RunConfig(max_workers=64, timeout=60)
```

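Beyond `max_workers` and `timeout`, `RunConfig` also exposes retry-related knobs. The field names below are taken from ragas' `RunConfig` around the time of this commit and should be verified against `ragas.run_config.RunConfig` in your installed version; this is a hedged sketch, not an exhaustive reference.

```python
from ragas.run_config import RunConfig

# A more conservative configuration for a heavily rate-limited provider.
patient_run_config = RunConfig(
    timeout=180,      # seconds to wait for a single LLM call before giving up
    max_retries=10,   # retries per failed request
    max_wait=60,      # cap on the backoff wait between retries, in seconds
    max_workers=4,    # fewer concurrent requests to stay under the rate limit
)
```
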
### Evaluate

```python
from ragas import EvaluationDataset, SingleTurnSample
from ragas.metrics import Faithfulness
from datasets import load_dataset
from ragas import evaluate

dataset = load_dataset("explodinggradients/amnesty_qa", "english_v3")

samples = []
for row in dataset["eval"]:
    sample = SingleTurnSample(
        user_input=row["user_input"],
        reference=row["reference"],
        response=row["response"],
        retrieved_contexts=row["retrieved_contexts"],
    )
    samples.append(sample)

eval_dataset = EvaluationDataset(samples=samples)
metric = Faithfulness()

_ = evaluate(
    dataset=eval_dataset,
    metrics=[metric],
    run_config=my_run_config,
)
```
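
The [TestsetGenerator]() link above points at a section not included in this diff. As a rough, hedged sketch of how the same `RunConfig` could be reused for test set generation — the constructor, the `generate_with_langchain_docs` method, its `testset_size` and `run_config` keywords, and the document loader are all assumptions about the ragas/langchain APIs of this era, so double-check against your installed versions:

```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset import TestsetGenerator

# load whatever corpus you want to generate questions from
documents = DirectoryLoader("data/", glob="**/*.md").load()

generator = TestsetGenerator.from_langchain(
    ChatOpenAI(model="gpt-4o"),
    OpenAIEmbeddings(),
)

testset = generator.generate_with_langchain_docs(
    documents,
    testset_size=10,
    run_config=my_run_config,  # reuse the same RunConfig for generation
)
```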

docs/howtos/customizations/index.md

Lines changed: 6 additions & 6 deletions
@@ -5,17 +5,17 @@ How to customize various aspects of Ragas to suit your needs.
 ## General

 - [Customize models](customize_models.md)
-- [Customize timeouts, retries and others](run_config.ipynb)
+- [Customize timeouts, retries and others](./_run_config.md)

 ## Metrics
-- [Modify prompts in metrics](metrics/modifying-prompts-metrics.ipynb)
-- [Write your own metrics](metrics/write_your_own_metric.ipynb)
-- [Adapt metrics to target language](metrics/metrics_language_adaptation.ipynb)
-- [Estimate cost of evaluation with metrics](metrics/cost.ipynb)
+- [Modify prompts in metrics](./metrics/_modifying-prompts-metrics.md)
+- [Write your own metrics](./metrics/_write_your_own_metric.md)
+- [Adapt metrics to target language](./metrics/_metrics_language_adaptation.md)
+- [Estimate cost of evaluation with metrics](metrics/_cost.md)
 - [Tracing evaluations with Observability tools](metrics/tracing.md)


 ## Testset Generation

 - [Add your own test cases](testgenerator/index.md)
-- [Seed generations using production data](testgenerator/index.md)
+- [Seed generations using production data](testgenerator/index.md)
docs/howtos/customizations/metrics/_metrics_language_adaptation.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Adapting metrics to target language

While using ragas to evaluate LLM application workflows, you may have applications in languages other than English. In that case it is best to adapt your LLM-powered evaluation metrics to the target language. One obvious way to do this is to manually change the instruction and demonstrations, but that can be time consuming. Ragas offers automatic language adaptation, where any metric can be adapted to the target language by using the LLM itself. This notebook demonstrates it with a simple example.

For the sake of this example, let's choose a metric and inspect its default prompts.


```python
from ragas.metrics import SimpleCriteriaScoreWithReference

scorer = SimpleCriteriaScoreWithReference(
    name="course_grained_score", definition="Score 0 to 5 by similarity"
)
```


```python
scorer.get_prompts()
```

    {'multi_turn_prompt': <ragas.metrics._simple_criteria.MultiTurnSimpleCriteriaWithReferencePrompt at 0x7fcf409c3880>,
     'single_turn_prompt': <ragas.metrics._simple_criteria.SingleTurnSimpleCriteriaWithReferencePrompt at 0x7fcf409c3a00>}

As you can see, the instruction and demonstrations are both in English. Set up the LLM to be used for this conversion.


```python
from ragas.llms import llm_factory

llm = llm_factory()
```

To view the supported language codes:


```python
from ragas.utils import RAGAS_SUPPORTED_LANGUAGE_CODES

print(list(RAGAS_SUPPORTED_LANGUAGE_CODES.keys()))
```

    ['english', 'hindi', 'marathi', 'chinese', 'spanish', 'amharic', 'arabic', 'armenian', 'bulgarian', 'urdu', 'russian', 'polish', 'persian', 'dutch', 'danish', 'french', 'burmese', 'greek', 'italian', 'japanese', 'deutsch', 'kazakh', 'slovak']

Now let's adapt it to 'hindi' as the target language using the `adapt_prompts` method.
Language adaptation in Ragas works by translating the few-shot examples given along with the prompts to the target language; the instructions remain in English.


```python
adapted_prompts = await scorer.adapt_prompts(language="hindi", llm=llm)
```

Inspect the adapted prompts and make corrections if needed.


```python
adapted_prompts
```

    {'multi_turn_prompt': <ragas.metrics._simple_criteria.MultiTurnSimpleCriteriaWithReferencePrompt at 0x7fcf42bc40a0>,
     'single_turn_prompt': <ragas.metrics._simple_criteria.SingleTurnSimpleCriteriaWithReferencePrompt at 0x7fcf722de890>}
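
To see what actually changed, you can drill into one of the adapted prompt objects. The attribute names used here (`instruction`, `examples`) assume ragas' `PydanticPrompt` interface, so verify them against your installed version; this is an illustrative sketch, not part of the committed docs.

```python
# The few-shot examples should now be in Hindi, while the instruction stays in English.
single_turn = adapted_prompts["single_turn_prompt"]
print(single_turn.instruction)   # still English (assumed attribute)
print(single_turn.examples[0])   # translated demonstration (assumed attribute)
```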

Set the prompts to the newly adapted prompts using the `set_prompts` method.


```python
scorer.set_prompts(**adapted_prompts)
```

Evaluate using the adapted metric.


```python
from ragas.dataset_schema import SingleTurnSample

sample = SingleTurnSample(
    user_input="एफिल टॉवर कहाँ स्थित है?",
    response="एफिल टॉवर पेरिस में स्थित है।",
    reference="एफिल टॉवर मिस्र में स्थित है",
)

scorer.llm = llm
await scorer.single_turn_ascore(sample)
```

    0

Trace of reasoning and score:

`{
    "reason": "प्रतिक्रिया और संदर्भ के उत्तर में स्थान के संदर्भ में महत्वपूर्ण भिन्नता है।",
    "score": 0
}`

(The reason roughly translates to: "There is a significant difference between the response and the reference answer with respect to the location.")

0 commit comments
