Visualising LLM evaluation data

At Fora Health we are exploring the use of LLM-powered systems to answer patients’ health questions. As we build, part of the challenge is evaluating whether our system answers questions reliably and accurately.

We run evaluations on our system by passing in a set of 1500 questions and then analysing the responses.

I built a little tool using the Observable Plot library that visualises the evaluation dataset. I originally wanted to use Plot’s new waffle mark, but ended up needing more flexibility.

Each dot represents one question and its response. The dots can be sorted, stacked, grouped, and split to help us understand how the system is performing.
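
To give a feel for the grouping and stacking, here is a minimal sketch of how dot positions might be computed. It is not the tool's actual code: the `layoutDots` function, the record fields (`id`, `category`, `score`), and the sample values are all made up for illustration.

```javascript
// Hypothetical record shape: { id, category, score }. These field names and
// values are assumptions for illustration, not the real evaluation schema.
function layoutDots(records, { groupBy, sortBy, columns = 10 }) {
  // Bucket records by group key, preserving first-seen group order.
  const groups = new Map();
  for (const r of records) {
    const key = r[groupBy];
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(r);
  }

  const dots = [];
  for (const [group, members] of groups) {
    // Sort a copy (ascending) so the input array is left untouched.
    const sorted = [...members].sort((a, b) => a[sortBy] - b[sortBy]);
    // Fill each group's grid left to right, then stack upwards row by row.
    sorted.forEach((record, i) => {
      dots.push({
        group,
        col: i % columns,
        row: Math.floor(i / columns),
        record,
      });
    });
  }
  return dots;
}

// Made-up sample data.
const sample = [
  { id: 1, category: "dosage", score: 0.9 },
  { id: 2, category: "symptoms", score: 0.4 },
  { id: 3, category: "dosage", score: 0.2 },
  { id: 4, category: "dosage", score: 0.7 },
];

const dots = layoutDots(sample, { groupBy: "category", sortBy: "score", columns: 2 });
```

Each entry in `dots` carries a `col` and `row` within its group, so it could in principle be handed to a dot mark with `x: "col"`, `y: "row"`, and a facet on `group`. Computing positions by hand like this is one way to get flexibility that a ready-made mark might not offer.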