Commit a1eae3f

Merge pull request #1637 from stanfordnlp/docs_oct2024
Docs
2 parents 77c2e1c + 8d05895

File tree

2 files changed: +10 −6 lines


docs/docs/quick-start/getting-started-01.md

Lines changed: 3 additions & 3 deletions
@@ -46,7 +46,7 @@ dspy.inspect_history(n=1)
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22)
+See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22).
 
 
 DSPy has various built-in modules, e.g. `dspy.ChainOfThought`, `dspy.ProgramOfThought`, and `dspy.ReAct`. These are interchangeable with basic `dspy.Predict`: they take your signature, which is specific to your task, and they apply general-purpose prompting techniques and inference-time strategies to it.
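Because these modules all consume the same kind of signature, one strategy can be swapped for another without touching the rest of the program. A toy illustration of that interchangeability, with a stand-in `fake_lm` function so it runs offline (all names here are hypothetical sketches, not the dspy API):

```python
# Toy sketch: two "modules" share one task signature and differ only in
# strategy, mirroring how dspy.Predict and dspy.ChainOfThought are swappable.

def fake_lm(prompt: str) -> str:
    # Stand-in for a real language model call.
    return "Paris"

def predict(signature: str, **inputs) -> dict:
    # Direct prediction: prompt with the signature and inputs.
    prompt = signature + "\n" + "\n".join(f"{k}: {v}" for k, v in inputs.items())
    return {"answer": fake_lm(prompt)}

def chain_of_thought(signature: str, **inputs) -> dict:
    # Same signature, different inference-time strategy: elicit reasoning first.
    prompt = (signature + "\nReason step by step, then answer.\n"
              + "\n".join(f"{k}: {v}" for k, v in inputs.items()))
    return {"reasoning": fake_lm(prompt), "answer": fake_lm(prompt)}

# Interchangeable: both strategies accept the same task-specific signature.
for module in (predict, chain_of_thought):
    result = module("question -> answer", question="What is the capital of France?")
    assert result["answer"] == "Paris"
```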
@@ -151,7 +151,7 @@ len(trainset), len(valset), len(devset), len(testset)
 
 What kind of metric can suit our question-answering task? There are many choices, but since the answers are long, we may ask: how well does the system response _cover_ all key facts in the gold response? And conversely, how well does the system response avoid _saying things_ that aren't in the gold response?
 
-That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](/docs/building-blocks/modules) using whatever LM we're working with.
+That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](https://github.com/stanfordnlp/dspy/blob/77c2e1cceba427c7f91edb2ed5653276fb0c6de7/dspy/evaluate/auto_evaluation.py#L21) using whatever LM we're working with.
 
 
 ```python
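The coverage/precision framing above is just an F1 computation. A toy token-overlap version shows the arithmetic (the real `SemanticF1` instead asks an LM to judge semantic coverage, so treat this purely as a sketch):

```python
# Toy sketch of the F1 idea using token overlap; DSPy's SemanticF1 uses an
# LM judge rather than exact tokens, but the arithmetic is the same.

def token_f1(system: str, gold: str) -> float:
    sys_tokens = set(system.lower().split())
    gold_tokens = set(gold.lower().split())
    overlap = sys_tokens & gold_tokens
    if not overlap:
        return 0.0
    recall = len(overlap) / len(gold_tokens)    # covering the gold facts
    precision = len(overlap) / len(sys_tokens)  # not saying extra things
    return 2 * precision * recall / (precision + recall)

print(token_f1("high memory is kernel space", "high memory is kernel space"))  # 1.0
```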
@@ -192,7 +192,7 @@ dspy.inspect_history(n=1)
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8)
+See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8).
 
 For evaluation, you could use the metric above in a simple loop and just average the score. But for nice parallelism and utilities, we can rely on `dspy.Evaluate`.
 
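Roughly speaking, such an evaluation harness scores each dev example in parallel and averages the results. A simplified stand-in, with hypothetical names (the real `dspy.Evaluate` adds progress display, result tables, and error handling on top of this):

```python
# Sketch of a parallel evaluation loop: score every example, then average.
from concurrent.futures import ThreadPoolExecutor

def evaluate(program, devset, metric, num_threads=8):
    def score(example):
        prediction = program(example["question"])
        return metric(example, prediction)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        scores = list(pool.map(score, devset))
    return sum(scores) / len(scores)

# Usage with toy stand-ins for the program and metric:
devset = [{"question": "2+2?", "answer": "4"}, {"question": "3+3?", "answer": "6"}]
answers = {"2+2?": "4", "3+3?": "6"}
program = lambda q: answers[q]
metric = lambda example, pred: float(example["answer"] == pred)
print(evaluate(program, devset, metric))  # 1.0
```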

docs/docs/quick-start/getting-started-02.md

Lines changed: 7 additions & 3 deletions
@@ -93,7 +93,11 @@ class RAG(dspy.Module):
     def forward(self, question):
         context = search(question, k=self.num_docs)
         return self.respond(context=context, question=question)
-
+```
+
+Let's use the RAG module.
+
+```python
 rag = RAG()
 rag(question="what are high memory and low memory on linux?")
 ```
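The `RAG` module above relies on a `search(question, k)` helper that the guide backs with a real retriever. For intuition only, here is a keyword-overlap stand-in over a tiny in-memory corpus (the corpus and ranking are illustrative, not the guide's retriever):

```python
# Toy stand-in for the retrieval step: rank documents by how many question
# tokens they share, and return the top k.

CORPUS = [
    "High memory is the part of physical memory not permanently mapped into kernel space.",
    "Low memory is directly mapped into the kernel's address space at all times.",
    "Swap space lets Linux move inactive pages to disk.",
]

def search(question: str, k: int = 2) -> list[str]:
    query_tokens = set(question.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc: -len(query_tokens & set(doc.lower().split())),
    )
    return ranked[:k]

print(search("what are high memory and low memory on linux?", k=2))
```

A real retriever would use embeddings or a search index instead of raw token overlap, but the module's contract is the same: a question in, `k` passages out.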
@@ -111,7 +115,7 @@ dspy.inspect_history()
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c)
+See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c).
 
 
 In the previous guide with a CoT module, we got nearly 40% in terms of semantic F1 on our `devset`. Would this `RAG` module score better?
@@ -151,7 +155,7 @@ optimized_rag = tp.compile(RAG(), trainset=trainset, valset=valset,
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb)
+See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb).
 
 
 The prompt optimization process here is pretty systematic; you can learn about it, for example, in this paper. Importantly, it's not a magic button: it's very possible for it to overfit your training set and not generalize well to a held-out set, which makes it essential that we iteratively validate our programs.
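The overfitting risk called out here is easy to demonstrate: an "optimizer" that merely memorizes its training examples scores perfectly on them and collapses on held-out data. A toy illustration of why held-out validation matters (this is not how DSPy's optimizers work; it only shows the failure mode):

```python
# A degenerate "optimizer" that memorizes training answers: perfect on the
# train set, useless on anything it has not seen.

trainset = [("2+2?", "4"), ("3+3?", "6")]
heldout = [("5+5?", "10")]

memorized = dict(trainset)
program = lambda q: memorized.get(q, "")

train_acc = sum(program(q) == a for q, a in trainset) / len(trainset)
heldout_acc = sum(program(q) == a for q, a in heldout) / len(heldout)
print(train_acc, heldout_acc)  # 1.0 0.0
```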
