I have a prototype of a retrieval-augmented generation search experience for the Pigweed docs. I need a way to measure whether the various changes I make are improving or reducing the quality of the system. This is how I do it.
Over in my pigweedai repo I am prototyping a retrieval-augmented generation search experience for the Pigweed docs. I need a way to systematically track whether the changes that I make to the system are making the experience better or worse.
For example, I’m currently using `gpt-3.5-turbo`, which has a 4K token context window. OpenAI recently released a version of `gpt-3.5-turbo` with a 16K context window, which means the system can handle more input data and generate longer output responses. Ideally I will have a way to quickly and systematically compare the responses my system generates when it uses the 4K model versus the 16K model.
I am using the term “evals” loosely. “Evals” is short for “evaluation procedures”. My usage of the term is not related to OpenAI’s Evals framework, though I do draw heavily from the general definition of “evals” that I’ve seen in the OpenAI docs.
From Strategy: Test changes systematically:
Good evals are:
- Representative of real-world usage (or at least diverse)
- Contain many test cases for greater statistical power
- Easy to automate or repeat
I’m just going to run through the key pieces of the architecture. If you skim each section, it should hopefully be clear by the end how they all fit together.
My prototype has been logging the queries that people have entered into the system. So I had a lot of real-world queries readily available. I curated those questions into a set of representative questions. I’m storing it as JSON, like this:
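A sketch of the shape (the specific questions, field names, and category labels beyond `offtopic` are illustrative, not the actual curated set):

```json
{
  "questions": [
    {
      "category": "offtopic",
      "question": "What's the weather like today?",
      "expectations": "The system should explain that it only answers questions about Pigweed."
    },
    {
      "category": "setup",
      "question": "How do I set up a Pigweed development environment?",
      "expectations": "The response should point to the environment setup docs."
    }
  ]
}
```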
The category names like `offtopic` and the `expectations` sentences are basically just documentation to help me remember why these questions are representative.
Snapshots of the embeddings database
When preparing to run eval tests, I take a snapshot of the embeddings data. If I ever need to reproduce this particular system, I will need these exact embeddings (and associated documentation sections) to do so.
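As a sketch of the snapshot step, assuming the embeddings (and associated documentation sections) live in a single database file, it can be as simple as a timestamped copy. The paths and filenames here are hypothetical:

```python
import shutil
import time
from pathlib import Path


def snapshot_embeddings(db_path, snapshot_dir="snapshots"):
    """Copy the embeddings database to a timestamped snapshot file.

    Assumes the embeddings live in a single file (e.g. a SQLite
    database); the source path and snapshot directory are up to you.
    """
    src = Path(db_path)
    dest_dir = Path(snapshot_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # e.g. embeddings-20230701-120000.db
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 preserves file metadata
    return dest
```

Keeping the snapshot alongside the eval results means each run is reproducible against the exact embeddings it was generated from.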
Running the eval tests
I have a little Python script that just runs through the representative questions, asks each question to my system, and saves the response.
An important implementation detail
The representative questions should get processed through the same system that users interact with. For example, my web UI sends questions to the backend over the `/chat` endpoint. I thought about setting up an `/eval` endpoint to streamline the process, but then I realized that the two endpoints would probably drift apart subtly over time. So the eval logic runs through the same `/chat` endpoint that users experience.
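A minimal sketch of that script, assuming a JSON list of question entries and a JSON request/response shape for `/chat` (the payload fields and base URL are assumptions, not the prototype’s actual API):

```python
import json
import urllib.request


def ask_chat(question, base_url="http://localhost:8000"):
    """Send one question to the /chat endpoint and return the raw response.

    The payload shape and base URL are assumptions; adjust them to
    match the real backend.
    """
    data = json.dumps({"message": question}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as response:
        return response.read().decode("utf-8")


def run_evals(questions, ask=ask_chat):
    """Ask each representative question and collect the responses.

    `ask` is injectable so the loop can be exercised in tests without
    a live backend.
    """
    results = []
    for entry in questions:
        answer = ask(entry["question"])
        results.append({
            "category": entry["category"],
            "question": entry["question"],
            "response": answer,
        })
    return results
```

Injecting `ask` keeps the HTTP call separate from the loop itself, which makes it easy to verify the eval logic without standing up the backend, while the real runs still go through the same `/chat` endpoint that users hit.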
Publishing the results
I’m using GitHub’s release infrastructure to publish the results, store the embeddings database snapshot, and store the code snapshot. Example: v0