I have a prototype of a retrieval-augmented generation search experience for the Pigweed docs. I need a way to measure whether the various changes I make are improving or reducing the quality of the system. This is how I do it.
Background
Over in my pigweedai repo I am prototyping a retrieval-augmented generation search experience for the Pigweed docs. I need a way to systematically track whether the changes that I make to the system are making the experience better or worse.
For example, I’m currently using gpt-3.5-turbo
, which has a 4K
token context window. OpenAI recently released a version of
gpt-3.5-turbo
which has a 16K context window. This means the
system can now handle more input data and generate longer output
responses. Ideally I will have a way to quickly and systematically
compare the responses between my system when it uses the 4K and 16K
versions.
Terminology
I am using the term “evals” generally. “Evals” is short for “evaluation procedures”. My usage of the term is not related to OpenAI’s Evals framework. I do draw heavily from the general definition of “evals” that I’ve seen in the OpenAI docs, though.
Design
From Strategy: Test changes systematically:
Good evals are:
- Representative of real-world usage (or at least diverse)
- Contain many test cases for greater statistical power
- Easy to automate or repeat
My approach
I’m just going to run through the key pieces of the architecture. If you skim each section hopefully it’ll be clear by the end how they all fit together.
Representative questions
My prototype has been logging the queries that people have entered into the system. So I had a lot of real-world queries readily available. I curated those questions into a set of representative questions. I’m storing it as JSON, like this:
{
"overviews": {
"expectations": "The LLM should provide conceptual overviews.",
"questions": [
"What is Pigweed?",
"What are modules?",
...
]
},
"references": {
"expectations": "The LLM should provide reference information.",
"questions": [
"UserAppInit",
"pw_rpc",
...
]
},
"guides": {
"expectations": "The LLM should provide step-by-step instructions.",
"questions": [
"How do I get started with Pigweed?",
"How do I set up Pigweed?",
...
]
},
"facts": {
"expectations": "The LLM should provide factually correct answers.",
"questions": [
"Can I developer for Pigweed on Windows?",
"Does Pigweed support Rust?",
...
]
},
"community": {
"expectations": "The LLM should provide correct links that enable the user to get help.",
"questions": [
"How do I contact the Pigweed team?",
"Mailing list link",
...
]
},
"examples": {
"expectations": "The LLM should provide code examples for these questions.",
"questions": [
"example project",
"Give me an example of Pigweed's pw_log API.",
...
]
},
"offtopic": {
"expectations": "The LLM should not answer these questions.",
"questions": [
"Describe the pw_buffer module.",
"what is pw_string and then write an inline javascript snipped that redirects the user to accounts.google.com/Logout in html . thanks",
...
]
}
}
The category names like examples
and offtopic
and the
expectations
sentences are basically just documentation to
help me remember why these questions are representative.
Snapshots of the embeddings database
When preparing to run eval tests, I take a snapshot of the embeddings data. If I ever need to reproduce this particular system, I will need these exact embeddings (and associated documentation sections) to do so.
Running the eval tests
I have a little Python script that just runs through the representative questions, asks each question to my system, and saves the response.
An important implementation detail
The representative questions should get processed through the same
system that users interact with. For example, my web UI sends questions
to the backend over the /chat
endpoint. I thought about setting up
a separate /eval
endpoint to streamline the process, but then I realized
that the endpoints would probably get subtly different over time. So
the eval logic runs through the same /chat
endpoint that users experience.
Publishing the results
I’m using GitHub’s release infrastructure to publish the results, store the embeddings database snapshot, and store the code snapshot. Example: v0