Conversation

summerscope (Contributor)

Fleshing out details, adding a bit more opinion and explanation. Pulls from:
https://www.notion.so/Evals-Data-Model-2488e1f5f27380768e1ad5824036db9f?source=copy_link

Note that I was thinking about updating the docstrings for the API reference section a bit too, but I haven't had time to do this and didn't want to block these other changes if you like them.

@dmontagu question I have here is about the concept of Experiments - do we have this in the library or is it only a Logfire thing?


github-actions bot commented Sep 10, 2025

Docs Preview

commit: 31067f3
Preview URL: https://fd6415bc-pydantic-ai-previews.pydantic.workers.dev


We've designed Pydantic Evals to be useful while not being too opinionated since we (along with everyone else) are still figuring out best practices. We'd love your [feedback](help.md) on the package and how we can improve it.

!!! note "In Beta"
Pydantic Evals support was [introduced](https://github.com/pydantic/pydantic-ai/pull/935) in v0.0.47 and is currently in beta. The API is subject to change and the documentation is incomplete.

## Code-First Evaluation

Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code. This differs from platforms with fully web-based configuration.

Suggested change
Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code. This differs from platforms with fully web-based configuration.
Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code, or as serialized data loaded by Python code. This differs from platforms with fully web-based configuration.



When you run an _Experiment_ you'll see results appear wherever you run your python code (IDE, terminal, etc) - send this data to any notebook or application for further visualisation and analysis.

Suggested change
When you run an _Experiment_ you'll see results appear wherever you run your python code (IDE, terminal, etc) - send this data to any notebook or application for further visualisation and analysis.
When you run an _Experiment_ you'll see a progress indicator and can print the results wherever you run your Python code (IDE, terminal, etc). You also get a report object back that you can serialize and store, or send to a notebook or other application for further visualization and analysis.


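For concreteness, here is a minimal sketch of that code-first flow, assuming the documented `pydantic_evals` `Case`/`Dataset` API; the `answer_question` task is a hypothetical stand-in for your own function:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance

# A Case pairs inputs with an (optional) expected output.
dataset = Dataset(
    cases=[Case(name='capital', inputs='What is the capital of France?', expected_output='Paris')],
    evaluators=[IsInstance(type_name='str')],  # dataset-wide evaluator
)


async def answer_question(question: str) -> str:
    # Hypothetical task: call your model or agent here.
    return 'Paris'


# Running the experiment returns a report object you can print, store, or analyze elsewhere.
report = dataset.evaluate_sync(answer_question)
report.print(include_input=True, include_output=True)
```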

If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as a read-only observability layer - you write and run evals in code, then view and analyze results in the web UI.

Suggested change
If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as a read-only observability layer - you write and run evals in code, then view and analyze results in the web UI.
If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as an observability layer - you write and run evals in code, then view and analyze results in the web UI.

(We intend to offer more-than-read-only functionality related to this in the immediate future.)

Comment on lines +59 to +66
1. **Dataset → Cases**: One Dataset contains many Cases (composition)
2. **Dataset → Experiments**: One Dataset can be used in many Experiments
over time (aggregation)
3. **Experiment → Case results**: One Experiment generates results by
executing each Case
4. **Experiment → Task**: One Experiment evaluates one defined Task
5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators (dataset-level + case-specific)
6. **Case results → Evaluators**: Individual Case results are scored by both dataset-level evaluators and case-specific evaluators (if they exist)

Suggested change
1. **Dataset → Cases**: One Dataset contains many Cases (composition)
2. **Dataset → Experiments**: One Dataset can be used in many Experiments
over time (aggregation)
3. **Experiment → Case results**: One Experiment generates results by
executing each Case
4. **Experiment → Task**: One Experiment evaluates one defined Task
5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators (dataset-level + case-specific)
6. **Case results → Evaluators**: Individual Case results are scored by both dataset-level evaluators and case-specific evaluators (if they exist)
1. **Dataset → Cases**: One Dataset contains many Cases
2. **Dataset → Experiments**: One Dataset can be used across many Experiments
over time
3. **Experiment → Case results**: One Experiment generates results by
executing each Case
4. **Experiment → Task**: One Experiment evaluates one defined Task
5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators. Dataset-wide Evaluators are run against all cases, and case-specific Evaluators against their respective cases

Sort of struggled to make the points but I feel like if nothing else the (composition) and (aggregation) labels should be dropped.


### Data Flow

1. **Dataset creation**: Define case templates and evaluators in YAML/JSON

Suggested change
1. **Dataset creation**: Define case templates and evaluators in YAML/JSON
1. **Dataset creation**: Define cases and evaluators in YAML/JSON, or directly in Python

You don't have to use JSON/YAML for this

2. **Experiment execution**: Run `dataset.evaluate_sync(task_function)`
3. **Cases run**: Each Case is executed against the Task
4. **Evaluation**: Evaluators score the Task outputs for each Case
5. **Results**: Experiment collects all Case results and returns a summary report

Suggested change
5. **Results**: Experiment collects all Case results and returns a summary report
5. **Results**: All Case results are collected into a summary report

Sounds weird to me to read "Experiment collects ..."
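As a sketch of step 1 above, cases can be defined directly in Python and round-tripped through YAML (or JSON); this assumes the `Dataset.to_file`/`Dataset.from_file` helpers, and the file name is made up:

```python
from pydantic_evals import Case, Dataset

dataset = Dataset(
    cases=[
        Case(name='capital', inputs='What is the capital of France?', expected_output='Paris'),
    ],
)

# Persist the cases to YAML so they can be reviewed or edited outside Python...
dataset.to_file('capital_cases.yaml')

# ...and load them back later, e.g. in CI or a different script.
loaded_dataset = Dataset.from_file('capital_cases.yaml')
```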

Comment on lines +89 to +93
- **Experiments** are like running your entire test suite and getting a
coverage report. When you execute `dataset.evaluate_sync(my_ai_function)`,
you're running all your cases against your AI system and
collecting the results - just like running `pytest` and getting a
summary of passes, failures, and performance metrics.

Suggested change
- **Experiments** are like running your entire test suite and getting a
coverage report. When you execute `dataset.evaluate_sync(my_ai_function)`,
you're running all your cases against your AI system and
collecting the results - just like running `pytest` and getting a
summary of passes, failures, and performance metrics.
- **Experiments** are like running your entire test suite and getting a
report. When you execute `dataset.evaluate_sync(my_ai_function)`,
you're running all your cases against your AI system and
collecting the results - just like running `pytest` and getting a
summary of passes, failures, and performance metrics.

I feel like coverage is more about how comprehensive the test suite is rather than whether the tests passed or failed, which is both what this is analogous to and the point already being made in the final sentence here.

The key difference from traditional unit testing is that AI systems are
probabilistic. If you're type checking you'll still get a simple pass/fail,
but scores for text outputs are likely qualitative and/or categorical,
and more open to interpretation. Keep in mind that unlike unit test coverage reports, we are looking at model behavior over the probabilistic space of user inputs, **not** coverage of source code.

Suggested change
and more open to interpretation. Keep in mind that unlike unit test coverage reports, we are looking at model behavior over the probabilistic space of user inputs, **not** coverage of source code.
and more open to interpretation.

I don't think we need to make this point if we don't mention coverage above.

Pydantic Evals includes several built-in evaluators and allows you to create custom evaluators.
These can be classic unit tests: deterministic, code-based checks, such as testing model output format with a regex, or checking for the appearance of PII or sensitive data. Alternatively, Evaluators can assess non-deterministic model outputs for qualities like accuracy, precision/recall, hallucinations or instruction-following.

While both kinds of testing are necessary and useful in LLM systems, classical code-based tests are cheaper and easier than tests which require either human or machine review of model outputs. We encourage you to look for quick wins of this type, when setting up a test framework for your system.

Suggested change
While both kinds of testing are necessary and useful in LLM systems, classical code-based tests are cheaper and easier than tests which require either human or machine review of model outputs. We encourage you to look for quick wins of this type, when setting up a test framework for your system.
While both kinds of testing are useful in LLM systems, classical code-based tests are cheaper and easier than tests which require either human or machine review of model outputs. We encourage you to look for quick wins of this type, when setting up a test framework for your system.

🤷‍♂️
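To make the two kinds of checks concrete, here is a sketch of a deterministic, code-based Evaluator next to a model-graded one; the class name, regex, and rubric are illustrative, and it assumes the documented `Evaluator`/`EvaluatorContext`/`LLMJudge` APIs:

```python
import re
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext, LLMJudge


@dataclass
class ContainsNoEmail(Evaluator[str, str]):
    """Deterministic, code-based check: flag outputs that appear to leak an email address."""

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        return re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', ctx.output) is None


# A qualitative, model-graded check for properties a regex can't capture.
instruction_following = LLMJudge(
    rubric='The response answers the question directly and follows the given instructions.',
)
```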

Comment on lines +165 to +169
## Run your Experiment

The evaluation process involves running a task against all cases in a dataset:
<!-- TODO: check this renaming makes sense? Do we need to add something about naming experiments with commit message in the config? IDK if this is anticipating a change that hasn't arrived in the library yet? -->

This involves running a task against all cases in a dataset:
@dmontagu commented Sep 10, 2025

Suggested change
## Run your Experiment
The evaluation process involves running a task against all cases in a dataset:
<!-- TODO: check this renaming makes sense? Do we need to add something about naming experiments with commit message in the config? IDK if this is anticipating a change that hasn't arrived in the library yet? -->
This involves running a task against all cases in a dataset:
## Running Experiments
Performing evaluations involves running a task against all cases in a dataset, also known as running an "experiment":
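
With that wording, the section could then continue with something like the sketch below, reusing the hypothetical `dataset` and `answer_question` from earlier and assuming `evaluate_sync` is the synchronous counterpart of the async `evaluate`:

```python
# Run the experiment: every Case is executed against the task and scored by the Evaluators.
report = dataset.evaluate_sync(answer_question)

# From async code you could instead use: report = await dataset.evaluate(answer_question)

# Print a summary of the results per case.
report.print(include_input=True, include_output=True)
```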
