diff --git a/src/langsmith/analyze-single-experiment.mdx b/src/langsmith/analyze-single-experiment.mdx deleted file mode 100644 index 18b13e955..000000000 --- a/src/langsmith/analyze-single-experiment.mdx +++ /dev/null @@ -1,94 +0,0 @@ ---- -title: Analyze a single experiment -sidebarTitle: Analyze a single experiment ---- - -After running an experiment, you can use LangSmith's experiment view to analyze the results and draw insights about your experiment's performance. - -This guide will walk you through viewing experiment results and highlight the features available in the experiments view. - -## Open the experiment view - -To open the experiment view, select the relevant Dataset from the **Dataset & Experiments** page and then select the experiment you want to view. - -![Open experiment view](/langsmith/images/select-experiment.png) - -## View experiment results - -### Customizing columns - -By default, the experiment view shows the input, output, and reference output for each [example](/langsmith/evaluation-overview#examples) in the dataset, feedback scores from evaluations and experiment metrics like cost, token counts, latency and status. - -You can customize the columns using the **Display** button to make it easier to interpret experiment results: - -- **Break out fields from inputs, outputs, and reference outputs** into their own columns. This is especially helpful if you have long inputs/outputs/reference outputs and want to surface important fields -- **Hide and reorder columns** to create focused views for analysis -- **Control decimal precision on feedback scores**. By default, we surface numerical feedback scores with a decimal precision of 2, but you can customize this setting to be up to 6 decimals - - - You can set default configurations for an entire dataset or temporarily save settings just for yourself. - - -![Experiment view](/langsmith/images/column-config.gif) - -You can also set the high, middle, and low thresholds for numeric feedback scores in your experiment, which affects the threshold at which score chips render as red or green: - -![Column heatmap configuration](/langsmith/images/column-heat-map.png) - -### Sort and filter - -To sort or filter feedback scores, you can use the actions in the column headers. - -![Sort and filter](/langsmith/images/sort-filter.png) - -### Table views - -Depending on the view most useful for your analysis, you can change the formatting of the table by toggling between a compact view, a full, view, and a diff view. - -* The `Compact` view shows each run as a one-line row, for ease of comparing scores at a glance. -* The `Full` view shows the full output for each run for digging into the details of individual runs. -* The `Diff` view shows the text difference between the reference output and the output for each run. - -![Diff view](/langsmith/images/diff-mode.png) - -### View the traces - -Hover over any of the output cells, and click on the trace icon to view the trace for that run. This will open up a trace in the side panel. - -To view the entire tracing project, click on the "View Project" button in the top right of the header. - -![View trace](/langsmith/images/view-trace.png) - -### View evaluator runs - -For evaluator scores, you can view the source run by hovering over the evaluator score cell and clicking on the arrow icon. This will open up a trace in the side panel. If you're running a LLM-as-a-judge evaluator, you can view the prompt used for the evaluator in this run. 
If your experiment has [repetitions](/langsmith/evaluation-overview#repetitions), you can click on the aggregate average score to find links to all of the individual runs. - -![View evaluator runs](/langsmith/images/evaluator-run.png) - -## Group results by metadata - -You can add metadata to examples to categorize and organize them. For example, if you're evaluating factual accuracy on a question answering dataset, the metadata might include which subject area each question belongs to. Metadata can be added either [via the UI](/langsmith/manage-datasets-in-application#edit-example-metadata) or [via the SDK](/langsmith/manage-datasets-programmatically#update-single-example). - -To analyze results by metadata, use the "Group by" dropdown in the top right corner of the experiment view and select your desired metadata key. This displays average feedback scores, latency, total tokens, and cost for each metadata group. - - - You will only be able to group by example metadata on experiments created after February 20th, 2025. Any experiments before that date can still be grouped by metadata, but only if the metadata is on the experiment traces themselves. - - -![Group by](/langsmith/images/group-by.gif) - -## Repetitions - -If you've run your experiment with [repetitions](/langsmith/evaluation-overview#repetitions), there will be arrows in the output results column so you can view outputs in the table. To view each run from the repetition, hover over the output cell and click the expanded view. - -When you run an experiment with repetitions, LangSmith displays the average for each feedback score in the table. Click on the feedback score to view the feedback scores from individual runs, or to view the standard deviation across repetitions. - -![Repetitions](/langsmith/images/repetitions.png) - -## Compare to another experiment - -In the top right of the experiment view, you can select another experiment to compare to. This will open up a comparison view, where you can see how the two experiments compare. To learn more about the comparison view, see [how to compare experiment results](/langsmith/compare-experiment-results). - -![Compare](/langsmith/images/compare-to-another.png) - - diff --git a/src/langsmith/bind-evaluator-to-dataset.mdx b/src/langsmith/bind-evaluator-to-dataset.mdx index 7ae86813a..2f818aa1d 100644 --- a/src/langsmith/bind-evaluator-to-dataset.mdx +++ b/src/langsmith/bind-evaluator-to-dataset.mdx @@ -50,5 +50,5 @@ def perform_eval(run, example): ## Next steps -* Analyze your experiment results in the [experiments tab](/langsmith/analyze-single-experiment) +* Analyze your experiment results in the [experiments tab](/langsmith/work-with-experiments) * Compare your experiment results in the [comparison view](/langsmith/compare-experiment-results) diff --git a/src/langsmith/download-experiment-results-as-csv.mdx b/src/langsmith/download-experiment-results-as-csv.mdx deleted file mode 100644 index 64b9459cc..000000000 --- a/src/langsmith/download-experiment-results-as-csv.mdx +++ /dev/null @@ -1,13 +0,0 @@ ---- -title: How to download experiment results as a CSV -sidebarTitle: Download experiment results as a CSV ---- - -LangSmith lets you download experiment results as a CSV file, making it easy to analyze and share your results. - - -## Download experiment results as a CSV - -To download your experiment results as a CSV, click the download icon at the top of the experiment view. 
The icon is directly to the left of the ["Compact" toggle](/langsmith/compare-experiment-results#adjust-the-table-display). - -![Download CSV](/langsmith/images/download-experiment-results-as-csv.png) \ No newline at end of file diff --git a/src/langsmith/evaluate-pairwise.mdx b/src/langsmith/evaluate-pairwise.mdx index c07ed63b3..cc56dd1d5 100644 --- a/src/langsmith/evaluate-pairwise.mdx +++ b/src/langsmith/evaluate-pairwise.mdx @@ -78,7 +78,7 @@ Note that you should choose a feedback key that is distinct from standard feedba The following example uses [a prompt](https://smith.langchain.com/hub/langchain-ai/pairwise-evaluation-2) which asks the LLM to decide which is better between two AI assistant responses. It uses structured output to parse the AI's response: 0, 1, or 2. - In the Python example below, we are pulling [this structured prompt](https://smith.langchain.com/hub/langchain-ai/pairwise-evaluation-2) from the [LangChain Hub](/langsmith/langchain-hub) and using it with a LangChain chat model wrapper. + In the Python example below, we are pulling [this structured prompt](https://smith.langchain.com/hub/langchain-ai/pairwise-evaluation-2) from the [LangChain Hub](/langsmith/manage-prompts) and using it with a LangChain chat model wrapper. **Usage of LangChain is totally optional.** To illustrate this point, the TypeScript example uses the OpenAI SDK directly. diff --git a/src/langsmith/evaluation-overview.mdx b/src/langsmith/evaluation-overview.mdx index 86df0f9b9..190ba26da 100644 --- a/src/langsmith/evaluation-overview.mdx +++ b/src/langsmith/evaluation-overview.mdx @@ -122,7 +122,7 @@ Learn [how run pairwise evaluations](/langsmith/evaluate-pairwise). ## Experiment -Each time we evaluate an application on a dataset, we are conducting an experiment. An experiment contains the results of running a specific version of your application on the dataset. To understand how to use the LangSmith experiment view, see [how to analyze experiment results](/langsmith/analyze-single-experiment). +Each time we evaluate an application on a dataset, we are conducting an experiment. An experiment contains the results of running a specific version of your application on the dataset. To understand how to use the LangSmith experiment view, see [how to analyze experiment results](/langsmith/work-with-experiments). ![Experiment view](/langsmith/images/experiment-view.png) diff --git a/src/langsmith/home.mdx b/src/langsmith/home.mdx index 6ec4f08fd..147c8118e 100644 --- a/src/langsmith/home.mdx +++ b/src/langsmith/home.mdx @@ -40,7 +40,7 @@ The quality and development speed of AI applications depends on high-quality eva * Get started by [creating your first evaluation](/langsmith/run-evaluation-from-prompt-playground). * Quickly assess the performance of your application using our [off-the-shelf evaluators](https://docs.smith.langchain.com/langsmith/prebuilt-evaluators) as a starting point. -* [Analyze results](/langsmith/analyze-single-experiment) of evaluations in the LangSmith UI and [compare results](https://docs.smith.langchain.com/langsmith/compare-experiment-results) over time. +* [Analyze results](/langsmith/work-with-experiments) of evaluations in the LangSmith UI and [compare results](https://docs.smith.langchain.com/langsmith/compare-experiment-results) over time. * Easily collect [human feedback](/langsmith/annotation-queues) on your data to improve your application. 
## Prompt Engineering diff --git a/src/langsmith/manage-datasets-in-application.mdx b/src/langsmith/manage-datasets-in-application.mdx index dfcf3b299..c20371d3b 100644 --- a/src/langsmith/manage-datasets-in-application.mdx +++ b/src/langsmith/manage-datasets-in-application.mdx @@ -121,7 +121,7 @@ In order to create and manage splits in the app, you can select some examples in ### Edit example metadata -You can add metadata to your examples by clicking on an example and then clicking "Edit" on the top righthand side of the popover. From this page, you can update/delete existing metadata, or add new metadata. You may use this to store information about your examples, such as tags or version info, which you can then [group by](/langsmith/analyze-single-experiment#group-results-by-metadata) when analyzing experiment results or [filter by](/langsmith/manage-datasets-programmatically#list-examples-by-metadata) when you call `list_examples` in the SDK. +You can add metadata to your examples by clicking on an example and then clicking "Edit" on the top righthand side of the popover. From this page, you can update/delete existing metadata, or add new metadata. You may use this to store information about your examples, such as tags or version info, which you can then [group by](/langsmith/work-with-experiments#group-results-by-metadata) when analyzing experiment results or [filter by](/langsmith/manage-datasets-programmatically#list-examples-by-metadata) when you call `list_examples` in the SDK. ![Add Metadata](/langsmith/images/add-metadata.gif) diff --git a/src/langsmith/manage-prompts-programmatically.mdx b/src/langsmith/manage-prompts-programmatically.mdx index 884aaf971..7c2416c13 100644 --- a/src/langsmith/manage-prompts-programmatically.mdx +++ b/src/langsmith/manage-prompts-programmatically.mdx @@ -214,7 +214,7 @@ Similar to pushing a prompt, you can also pull a prompt as a RunnableSequence of -When pulling a prompt, you can also specify a specific commit hash or [commit tag](/langsmith/prompt-tags) to pull a specific version of the prompt. +When pulling a prompt, you can also specify a specific commit hash or [commit tag](/langsmith/manage-prompts) to pull a specific version of the prompt. diff --git a/src/langsmith/renaming-experiment.mdx b/src/langsmith/renaming-experiment.mdx deleted file mode 100644 index 88e1d3d90..000000000 --- a/src/langsmith/renaming-experiment.mdx +++ /dev/null @@ -1,23 +0,0 @@ ---- -title: How to rename an experiment -sidebarTitle: Rename an experiment ---- - -This guide outlines the available methods to rename an experiment in the LangSmith UI. There are two ways to rename an experiment: - -1. **Within the Playground** -2. **Within the Experiments View** - -Note that experiment names must be unique per workspace. - -## 1. Renaming an experiment in the Playground - -When running experiments in the Playground, a default name with the format `pg::prompt-name::model::uuid` (eg. `pg::gpt-4o-mini::897ee630`) is automatically assigned. - -You can rename an experiment immediately after running it by editing its name in the Playground table header. ![Edit name in playground](/langsmith/images/rename-in-playground.png) - -## 2. Renaming an experiment in the experiments view - -When viewing results in the experiments view, you can rename an experiment by using the pencil icon beside the experiment name. 
- -![Edit name in experiments view](/langsmith/images/rename-in-experiments-view.png) \ No newline at end of file diff --git a/src/langsmith/work-with-experiments.mdx b/src/langsmith/work-with-experiments.mdx new file mode 100644 index 000000000..130028a33 --- /dev/null +++ b/src/langsmith/work-with-experiments.mdx @@ -0,0 +1,124 @@ +--- +title: Work with experiments +sidebarTitle: Work with experiments +--- + +This page describes some of the essential tasks for working with [_experiments_](/langsmith/evaluation-overview#experiment) in LangSmith: + +- **[Analyze a single experiment](#analyze-a-single-experiment)**: View and interpret experiment results, customize columns, filter data, and compare runs. +- **[Download experiment results as a CSV](#download-experiment-results-as-a-csv)**: Export your experiment data for external analysis and sharing. +- **[Rename an experiment](#rename-an-experiment)**: Update experiment names in both the Playground and the Experiments view. + +## Analyze a single experiment + +After running an experiment, you can use LangSmith's experiment view to analyze the results and draw insights about your experiment's performance. + +### Open the experiment view + +To open the experiment view, select the relevant [_dataset_](/langsmith/evaluation-overview#datasets) from the **Dataset & Experiments** page and then select the experiment you want to view. + +![Open experiment view](/langsmith/images/select-experiment.png) + +### View experiment results + +#### Customize columns + +By default, the experiment view shows the input, output, and reference output for each [example](/langsmith/evaluation-overview#examples) in the dataset, feedback scores from evaluations, and experiment metrics like cost, token counts, latency, and status. + +You can customize the columns using the **Display** button to make it easier to interpret experiment results: + +- **Break out fields from inputs, outputs, and reference outputs** into their own columns. This is especially helpful if you have long inputs/outputs/reference outputs and want to surface important fields. +- **Hide and reorder columns** to create focused views for analysis. +- **Control decimal precision on feedback scores**. By default, LangSmith surfaces numerical feedback scores with a decimal precision of 2, but you can customize this setting to be up to 6 decimals. + + +You can set default configurations for an entire dataset or temporarily save settings just for yourself. + + +![Experiment view](/langsmith/images/column-config.gif) + +You can also set the high, middle, and low thresholds for numeric feedback scores in your experiment, which determines when score chips render as red or green: + +![Column heatmap configuration](/langsmith/images/column-heat-map.png) + +#### Sort and filter + +To sort or filter feedback scores, use the actions in the column headers. + +![Sort and filter](/langsmith/images/sort-filter.png) + +#### Table views + +Depending on the view most useful for your analysis, you can change the formatting of the table by toggling between a compact view, a full view, and a diff view. + +- The **Compact** view shows each run as a one-line row, for ease of comparing scores at a glance. - The **Full** view shows the full output for each run for digging into the details of individual runs. - The **Diff** view shows the text difference between the reference output and the output for each run. 
+ +![Diff view](/langsmith/images/diff-mode.png) + +#### View the traces + +Hover over any of the output cells, and click on the trace icon to view the trace for that run. This will open up a trace in the side panel. + +To view the entire tracing project, click on the **View Project** button in the top right of the header. + +![View trace](/langsmith/images/view-trace.png) + +#### View evaluator runs + +For evaluator scores, you can view the source run by hovering over the evaluator score cell and clicking on the arrow icon. This will open up a trace in the side panel. If you're running an [LLM-as-a-judge evaluator](/langsmith/llm-as-judge), you can view the prompt used for the evaluator in this run. If your experiment has [repetitions](/langsmith/evaluation-overview#repetitions), you can click on the aggregate average score to find links to all of the individual runs. + +![View evaluator runs](/langsmith/images/evaluator-run.png) + +### Group results by metadata + +You can add metadata to examples to categorize and organize them. For example, if you're evaluating factual accuracy on a question answering dataset, the metadata might include which subject area each question belongs to. Metadata can be added either [via the UI](/langsmith/manage-datasets-in-application#edit-example-metadata) or [via the SDK](/langsmith/manage-datasets-programmatically#update-single-example). + +To analyze results by metadata, use the **Group by** dropdown in the top right corner of the experiment view and select your desired metadata key. This displays average feedback scores, latency, total tokens, and cost for each metadata group. + + + You will only be able to group by example metadata on experiments created after February 20th, 2025. Any experiments before that date can still be grouped by metadata, but only if the metadata is on the experiment traces themselves. + + +![Group by](/langsmith/images/group-by.gif) + +### Repetitions + +If you've run your experiment with [_repetitions_](/langsmith/evaluation-overview#repetitions), there will be arrows in the output results column so you can view each repetition's output in the table. To view each run from the repetition, hover over the output cell and click the expanded view. + +When you run an experiment with repetitions, LangSmith displays the average for each feedback score in the table. Click on the feedback score to view the feedback scores from individual runs, or to view the standard deviation across repetitions. + +![Repetitions](/langsmith/images/repetitions.png)
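+
+If it helps to see where repetitions come from in the first place, the following is a minimal sketch that produces them with the Python SDK's `evaluate()` function and its `num_repetitions` parameter. The dataset name, target function, and evaluator below are placeholders for illustration, not part of this page's example, and the exact import path may vary with your SDK version.
+
+```python
+# Minimal sketch; assumes LANGSMITH_API_KEY is set in the environment.
+# Depending on your SDK version, this may be `from langsmith.evaluation import evaluate`.
+from langsmith import evaluate
+
+def my_app(inputs: dict) -> dict:
+    # Placeholder target: call your real application here.
+    return {"answer": f"You asked: {inputs['question']}"}
+
+def correctness(run, example) -> dict:
+    # Placeholder evaluator: exact match against the reference output.
+    score = int(run.outputs["answer"] == example.outputs["answer"])
+    return {"key": "correctness", "score": score}
+
+# Run every example 3 times. The experiment view then shows the average
+# feedback score per example, expandable into the individual repetitions.
+evaluate(
+    my_app,
+    data="my-dataset",  # placeholder dataset name
+    evaluators=[correctness],
+    num_repetitions=3,
+)
+```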
+ +### Compare to another experiment + +In the top right of the experiment view, you can select another experiment to compare to. This will open up a comparison view, where you can see how the two experiments compare. To learn more about the comparison view, see [how to compare experiment results](/langsmith/compare-experiment-results). + +![Compare](/langsmith/images/compare-to-another.png) + +## Download experiment results as a CSV + +LangSmith lets you download experiment results as a CSV file, making it easy to analyze and share your results. + +To download as a CSV, click the download icon at the top of the experiment view. The icon is directly to the left of the [Compact toggle](/langsmith/compare-experiment-results#adjust-the-table-display). + +![Download CSV](/langsmith/images/download-experiment-results-as-csv.png) + +## Rename an experiment + + +Experiment names must be unique per workspace. + + +You can rename an experiment in the LangSmith UI in: + +- The **Playground**. When running experiments in the Playground, a default name with the format `pg::prompt-name::model::uuid` (e.g., `pg::gpt-4o-mini::897ee630`) is automatically assigned. + + You can rename an experiment immediately after running it by editing its name in the Playground table header. + + ![Edit name in playground](/langsmith/images/rename-in-playground.png) + +- The **Experiments view**. When viewing results in the experiments view, you can rename an experiment using the pencil icon beside the experiment name. + + ![Edit name in experiments view](/langsmith/images/rename-in-experiments-view.png) \ No newline at end of file