Conversation

summerscope (Contributor)

Fleshing out details, adding a bit more opinion and explanation. Pulls from:
https://www.notion.so/Evals-Data-Model-2488e1f5f27380768e1ad5824036db9f?source=copy_link

Note that I was thinking about updating the docstrings for the API reference section a bit too, but I haven't had time to do this and didn't want to block these other changes if you like them.

@dmontagu question I have here is about the concept of Experiments - do we have this in the library or is it only a Logfire thing?


github-actions bot commented Sep 10, 2025

Docs Preview

commit: 31067f3
Preview URL: https://fd6415bc-pydantic-ai-previews.pydantic.workers.dev


We've designed Pydantic Evals to be useful while not being too opinionated since we (along with everyone else) are still figuring out best practices. We'd love your [feedback](help.md) on the package and how we can improve it.

!!! note "In Beta"
Pydantic Evals support was [introduced](https://github.com/pydantic/pydantic-ai/pull/935) in v0.0.47 and is currently in beta. The API is subject to change and the documentation is incomplete.

## Code-First Evaluation

Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code. This differs from platforms with fully web-based configuration.

Suggested change
Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code. This differs from platforms with fully web-based configuration.
Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code, or as serialized data loaded by Python code. This differs from platforms with fully web-based configuration.



When you run an _Experiment_ you'll see results appear wherever you run your python code (IDE, terminal, etc) - send this data to any notebook or application for further visualisation and analysis.

Suggested change
When you run an _Experiment_ you'll see results appear wherever you run your python code (IDE, terminal, etc) - send this data to any notebook or application for further visualisation and analysis.
When you run an _Experiment_ you'll see a progress indicator and can print the results wherever you run your Python code (IDE, terminal, etc). You also get a report object back that you can serialize and store, or send to a notebook or other application for further visualization and analysis.


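For concreteness, here is a minimal sketch of that code-first flow, assuming the documented `pydantic_evals` `Case`/`Dataset` API; the `answer_question` task is a hypothetical stand-in for your own function:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance

# A Case pairs inputs with an (optional) expected output.
dataset = Dataset(
    cases=[Case(name='capital', inputs='What is the capital of France?', expected_output='Paris')],
    evaluators=[IsInstance(type_name='str')],  # dataset-wide evaluator
)


async def answer_question(question: str) -> str:
    # Hypothetical task: call your model or agent here.
    return 'Paris'


# Running the experiment returns a report object you can print, store, or analyze elsewhere.
report = dataset.evaluate_sync(answer_question)
report.print(include_input=True, include_output=True)
```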

If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as a read-only observability layer - you write and run evals in code, then view and analyze results in the web UI.

Suggested change
If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as a read-only observability layer - you write and run evals in code, then view and analyze results in the web UI.
If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as an observability layer - you write and run evals in code, then view and analyze results in the web UI.

(We intend to offer more-than-read-only functionality related to this in the immediate future.)

Comment on lines +59 to +66
1. **Dataset → Cases**: One Dataset contains many Cases (composition)
2. **Dataset → Experiments**: One Dataset can be used in many Experiments
over time (aggregation)
3. **Experiment → Case results**: One Experiment generates results by
executing each Case
4. **Experiment → Task**: One Experiment evaluates one defined Task
5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators (dataset-level + case-specific)
6. **Case results → Evaluators**: Individual Case results are scored by both dataset-level evaluators and case-specific evaluators (if they exist)

Suggested change
1. **Dataset → Cases**: One Dataset contains many Cases (composition)
2. **Dataset → Experiments**: One Dataset can be used in many Experiments
over time (aggregation)
3. **Experiment → Case results**: One Experiment generates results by
executing each Case
4. **Experiment → Task**: One Experiment evaluates one defined Task
5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators (dataset-level + case-specific)
6. **Case results → Evaluators**: Individual Case results are scored by both dataset-level evaluators and case-specific evaluators (if they exist)
1. **Dataset → Cases**: One Dataset contains many Cases
2. **Dataset → Experiments**: One Dataset can be used across many Experiments
over time
3. **Experiment → Case results**: One Experiment generates results by
executing each Case
4. **Experiment → Task**: One Experiment evaluates one defined Task
5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators. Dataset-wide Evaluators are run against all cases, and case-specific Evaluators against their respective cases

Sort of struggled to make the points but I feel like if nothing else the (composition) and (aggregation) labels should be dropped.


### Data Flow

1. **Dataset creation**: Define case templates and evaluators in YAML/JSON

Suggested change
1. **Dataset creation**: Define case templates and evaluators in YAML/JSON
1. **Dataset creation**: Define cases and evaluators in YAML/JSON, or directly in Python

You don't have to use JSON/YAML for this

2. **Experiment execution**: Run `dataset.evaluate_sync(task_function)`
3. **Cases run**: Each Case is executed against the Task
4. **Evaluation**: Evaluators score the Task outputs for each Case
5. **Results**: Experiment collects all Case results and returns a summary report

Suggested change
5. **Results**: Experiment collects all Case results and returns a summary report
5. **Results**: All Case results are collected into a summary report

Sounds weird to me to read "Experiment collects ..."
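As a sketch of step 1 above, cases can be defined directly in Python and round-tripped through YAML (or JSON); this assumes the `Dataset.to_file`/`Dataset.from_file` helpers, and the file name is made up:

```python
from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[
        Case(name='capital', inputs='What is the capital of France?', expected_output='Paris'),
    ],
)

# Persist the cases to YAML so they can be reviewed or edited outside Python...
dataset.to_file('capital_cases.yaml')

# ...and load them back later, e.g. in CI or a different script.
loaded_dataset = Dataset.from_file('capital_cases.yaml')
```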

Comment on lines +89 to +93
- **Experiments** are like running your entire test suite and getting a
coverage report. When you execute `dataset.evaluate_sync(my_ai_function)`,
you're running all your cases against your AI system and
collecting the results - just like running `pytest` and getting a
summary of passes, failures, and performance metrics.

Suggested change
- **Experiments** are like running your entire test suite and getting a
coverage report. When you execute `dataset.evaluate_sync(my_ai_function)`,
you're running all your cases against your AI system and
collecting the results - just like running `pytest` and getting a
summary of passes, failures, and performance metrics.
- **Experiments** are like running your entire test suite and getting a
report. When you execute `dataset.evaluate_sync(my_ai_function)`,
you're running all your cases against your AI system and
collecting the results - just like running `pytest` and getting a
summary of passes, failures, and performance metrics.

I feel like coverage is more about how comprehensive the test suite is rather than whether the tests passed or failed, which is both what this is analogous to and the point already being made in the final sentence here.

The key difference from traditional unit testing is that AI systems are
probabilistic. If you're type checking you'll still get a simple pass/fail,
but scores for text outputs are likely qualitative and/or categorical,
and more open to interpretation. Keep in mind that unlike unit test coverage reports, we are looking at model behavior over the probabilistic space of user inputs, **not** coverage of source code.

Suggested change
and more open to interpretation. Keep in mind that unlike unit test coverage reports, we are looking at model behavior over the probabilistic space of user inputs, **not** coverage of source code.
and more open to interpretation.

I don't think we need to make this point if we don't mention coverage above.

Pydantic Evals includes several built-in evaluators and allows you to create custom evaluators.
These can be classic unit tests: deterministic, code-based checks, such as testing model output format with a regex, or checking for the appearance of PII or sensitive data. Alternatively, Evaluators can assess non-deterministic model outputs for qualities like accuracy, precision/recall, hallucinations or instruction-following.

While both kinds of testing are necessary and useful in LLM systems, classical code-based tests are cheaper and easier than tests which require either human or machine review of model outputs. We encourage you to look for quick wins of this type, when setting up a test framework for your system.

Suggested change
While both kinds of testing are necessary and useful in LLM systems, classical code-based tests are cheaper and easier than tests which require either human or machine review of model outputs. We encourage you to look for quick wins of this type, when setting up a test framework for your system.
While both kinds of testing are useful in LLM systems, classical code-based tests are cheaper and easier than tests which require either human or machine review of model outputs. We encourage you to look for quick wins of this type, when setting up a test framework for your system.

🤷‍♂️
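To make the two kinds of checks concrete, here is a sketch of a deterministic, code-based Evaluator next to a model-graded one; the class name, regex, and rubric are illustrative, and it assumes the documented `Evaluator`/`EvaluatorContext`/`LLMJudge` APIs:

```python
import re
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext, LLMJudge


@dataclass
class ContainsNoEmail(Evaluator[str, str]):
    """Deterministic, code-based check: flag outputs that appear to leak an email address."""

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        return re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', ctx.output) is None


# A qualitative, model-graded check for properties a regex can't capture.
instruction_following = LLMJudge(
    rubric='The response answers the question directly and follows the given instructions.',
)
```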

Comment on lines +165 to +169
## Run your Experiment

The evaluation process involves running a task against all cases in a dataset:
<!-- TODO: check this renaming makes sense? Do we need to add something about naming experiments with commit message in the config? IDK if this is anticipating a change that hasn't arrived in the library yet? -->

This involves running a task against all cases in a dataset:
@dmontagu commented Sep 10, 2025

Suggested change
## Run your Experiment
The evaluation process involves running a task against all cases in a dataset:
<!-- TODO: check this renaming makes sense? Do we need to add something about naming experiments with commit message in the config? IDK if this is anticipating a change that hasn't arrived in the library yet? -->
This involves running a task against all cases in a dataset:
## Running Experiments
Performing evaluations involves running a task against all cases in a dataset, also known as running an "experiment":
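
With that wording, the section could then continue with something like the sketch below, reusing the hypothetical `dataset` and `answer_question` from earlier and assuming `evaluate_sync` is the synchronous counterpart of the async `evaluate`:

```python
# Run the experiment: every Case is executed against the task and scored by the Evaluators.
report = dataset.evaluate_sync(answer_question)

# From async code you could instead use: report = await dataset.evaluate(answer_question)

# Print a summary of the results per case.
report.print(include_input=True, include_output=True)
```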
