Evals docs revisions #2849
base: main
Conversation
Docs Preview
We've designed Pydantic Evals to be useful while not being too opinionated since we (along with everyone else) are still figuring out best practices. We'd love your [feedback](help.md) on the package and how we can improve it.

!!! note "In Beta"
    Pydantic Evals support was [introduced](https://github.com/pydantic/pydantic-ai/pull/935) in v0.0.47 and is currently in beta. The API is subject to change and the documentation is incomplete.

## Code-First Evaluation

Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code. This differs from platforms with fully web-based configuration.

Suggested change:

Pydantic Evals follows a **code-first approach** where you define all evaluation components (datasets, experiments, tasks, cases and evaluators) in Python code, or as serialized data loaded by Python code. This differs from platforms with fully web-based configuration.
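
To make the code-first point concrete, it might be worth showing a minimal end-to-end sketch right here. This is only illustrative - it assumes the current `pydantic_evals` names (`Case`, `Dataset`, `evaluate_sync`, `report.print`) and a made-up `answer_question` task:

```python
from pydantic_evals import Case, Dataset

# Each Case pairs inputs with an (optional) expected output and metadata.
cases = [
    Case(
        name='capital_of_france',
        inputs='What is the capital of France?',
        expected_output='Paris',
        metadata={'difficulty': 'easy'},
    ),
]

# The Dataset groups the cases - everything is plain Python, no web UI required.
dataset = Dataset(cases=cases)


async def answer_question(question: str) -> str:
    """The task under evaluation; in practice this would call your agent or model."""
    return 'Paris'


# Running the dataset against the task is what produces an experiment's results.
report = dataset.evaluate_sync(answer_question)
report.print()
```
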
When you run an _Experiment_ you'll see results appear wherever you run your python code (IDE, terminal, etc) - send this data to any notebook or application for further visualisation and analysis. |

Suggested change:

When you run an _Experiment_ you'll see a progress indicator and can print the results wherever you run your Python code (IDE, terminal, etc). You also get a report object back that you can serialize and store, or send to a notebook or other application for further visualization and analysis.

If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as a read-only observability layer - you write and run evals in code, then view and analyze results in the web UI.

Suggested change:

If you are using [Pydantic Logfire](https://logfire.pydantic.dev/docs/guides/web-ui/evals/), your experiment results automatically appear in the Logfire web interface for visualization, comparison, and collaborative analysis. Logfire serves as an observability layer - you write and run evals in code, then view and analyze results in the web UI.

(We intend to offer more-than-read-only functionality related to this in the immediate future.)
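
It could also help to show what "getting results back" looks like in code, including the Logfire hook. Sketch only - `logfire.configure(send_to_logfire='if-token-present')` is the usual setup call, and the `report.print` options are from memory, so worth double-checking:

```python
import logfire

from pydantic_evals import Case, Dataset

# With Logfire configured, experiment runs are exported automatically;
# without it you still get the report object locally.
logfire.configure(send_to_logfire='if-token-present')

dataset = Dataset(cases=[Case(inputs='2 + 2', expected_output='4')])


async def calculator(expression: str) -> str:
    # Stand-in task for the sketch - replace with your agent/model call.
    return str(eval(expression))


report = dataset.evaluate_sync(calculator)

# Print a results table wherever the code runs (IDE, terminal, CI, notebook).
report.print(include_input=True, include_output=True)
```
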
1. **Dataset → Cases**: One Dataset contains many Cases (composition)
2. **Dataset → Experiments**: One Dataset can be used in many Experiments over time (aggregation)
3. **Experiment → Case results**: One Experiment generates results by executing each Case
4. **Experiment → Task**: One Experiment evaluates one defined Task
5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators (dataset-level + case-specific)
6. **Case results → Evaluators**: Individual Case results are scored by both dataset-level evaluators and case-specific evaluators (if they exist)

Suggested change:

1. **Dataset → Cases**: One Dataset contains many Cases
2. **Dataset → Experiments**: One Dataset can be used across many Experiments over time
3. **Experiment → Case results**: One Experiment generates results by executing each Case
4. **Experiment → Task**: One Experiment evaluates one defined Task
5. **Experiment → Evaluators**: One Experiment uses multiple Evaluators. Dataset-wide Evaluators are run against all cases, and case-specific Evaluators against their respective cases

Sort of struggled to make the points but I feel like if nothing else the (composition) and (aggregation) labels should be dropped.
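
A tiny example might make point 5 clearer than the labels do - dataset-wide evaluators on the `Dataset` plus a case-specific one on a `Case`. Illustrative only; `IsInstance` and `EqualsExpected` are my recollection of the built-in evaluator names:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected, IsInstance

dataset = Dataset(
    cases=[
        # Scored by the dataset-wide evaluators *plus* its own evaluator.
        Case(
            name='strict_case',
            inputs='What is the capital of France?',
            expected_output='Paris',
            evaluators=(EqualsExpected(),),  # case-specific
        ),
        # Scored only by the dataset-wide evaluators.
        Case(name='loose_case', inputs='Say hello'),
    ],
    # Dataset-wide evaluators run against every case in every experiment.
    evaluators=[IsInstance(type_name='str')],
)
```
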
### Data Flow
1. **Dataset creation**: Define case templates and evaluators in YAML/JSON |

Suggested change:

1. **Dataset creation**: Define cases and evaluators in YAML/JSON, or directly in Python

You don't have to use JSON/YAML for this

2. **Experiment execution**: Run `dataset.evaluate_sync(task_function)`
3. **Cases run**: Each Case is executed against the Task
4. **Evaluation**: Evaluators score the Task outputs for each Case
5. **Results**: Experiment collects all Case results and returns a summary report |

Suggested change:

5. **Results**: All Case results are collected into a summary report

Sounds weird to me to read "Experiment collects ..."
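
Since step 1 now says "YAML/JSON, or directly in Python", a short round-trip sketch might be worth including - `to_file`/`from_file` are the serialization helpers as I remember them, so verify the names:

```python
from pydantic_evals import Case, Dataset

# 1. Dataset creation - defined in Python here, but it round-trips through YAML.
dataset = Dataset(cases=[Case(inputs='What is 2 + 2?', expected_output='4')])
dataset.to_file('math_cases.yaml')              # serialize for review/versioning
dataset = Dataset.from_file('math_cases.yaml')  # ...or load an existing file


async def solve(question: str) -> str:
    return '4'


# 2-5. Experiment execution: each case runs against the task, evaluators score
# the outputs, and everything is collected into a summary report.
report = dataset.evaluate_sync(solve)
report.print()
```
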
- **Experiments** are like running your entire test suite and getting a coverage report. When you execute `dataset.evaluate_sync(my_ai_function)`, you're running all your cases against your AI system and collecting the results - just like running `pytest` and getting a summary of passes, failures, and performance metrics.

Suggested change:

- **Experiments** are like running your entire test suite and getting a report. When you execute `dataset.evaluate_sync(my_ai_function)`, you're running all your cases against your AI system and collecting the results - just like running `pytest` and getting a summary of passes, failures, and performance metrics.

I feel like coverage is more about how comprehensive the test suite is than about whether the tests passed or failed - pass/fail is both what this is analogous to and the point already being made in the final sentence here.

The key difference from traditional unit testing is that AI systems are probabilistic. If you're type checking you'll still get a simple pass/fail, but scores for text outputs are likely qualitative and/or categorical, and more open to interpretation. Keep in mind that unlike unit test coverage reports, we are looking at model behavior over the probabilistic space of user inputs, **not** coverage of source code.

Suggested change:

and more open to interpretation.

I don't think we need to make this point if we don't mention coverage above.

Pydantic Evals includes several built-in evaluators and allows you to create custom evaluators:
These can be classic unit tests: deterministic, code-based checks, such as testing model output format with a regex, or checking for the appearance of PII or sensitive data. Alternatively, Evaluators can assess the non-deterministic model outputs for qualities like accuracy, precision/recall, hallucinations or instruction-following.

While both kinds of testing are necessary and useful in LLM systems, classical code-based tests are cheaper and easier than tests which require either human or machine review of model outputs. We encourage you to look for quick wins of this type, when setting up a test framework for your system.

Suggested change:

While both kinds of testing are useful in LLM systems, classical code-based tests are cheaper and easier than tests which require either human or machine review of model outputs. We encourage you to look for quick wins of this type, when setting up a test framework for your system.

🤷♂️
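
To back up the "quick wins" advice, the docs could contrast the two styles directly. Rough sketch - it assumes custom evaluators are dataclasses subclassing `Evaluator`, and that `LLMJudge` takes a `rubric` argument:

```python
import re
from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, LLMJudge


@dataclass
class NoEmailAddresses(Evaluator):
    """Deterministic, code-based check: fail if the output leaks an email address."""

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        return re.search(r'[\w.+-]+@[\w-]+\.[\w.]+', str(ctx.output)) is None


dataset = Dataset(
    cases=[Case(inputs='Summarize our refund policy for a customer.')],
    evaluators=[
        NoEmailAddresses(),  # cheap, deterministic "quick win"
        LLMJudge(rubric='The response is accurate and follows the instructions.'),  # model-graded
    ],
)
```
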

## Run your Experiment

The evaluation process involves running a task against all cases in a dataset:
<!-- TODO: check this renaming makes sense? Do we need to add something about naming experiments with commit message in the config? IDK if this is anticipating a change that hasn't arrived in the library yet? -->

This involves running a task against all cases in a dataset:

Suggested change:

## Running Experiments

Performing evaluations involves running a task against all cases in a dataset, also known as running an "experiment":

Fleshing out details, adding a bit more opinion and explanation. Pulls from:
https://www.notion.so/Evals-Data-Model-2488e1f5f27380768e1ad5824036db9f?source=copy_link
Note that I was thinking about updating the docstrings for API reference section a bit too, but I haven't had time to do this and didn't want to block these other changes if you like them.

@dmontagu question I have here is about the concept of Experiments - do we have this in the library or is it only a Logfire thing?