How to Evaluate LLM Applications with DeepEval — Part 1
Measuring Success in RAG-powered Chatbots
Overview
In this article, I complete and critique the work illustrated in the ‘tutorials’ series for DeepEval provided by Confident AI. I’ll work through the tutorial’s medical chatbot (powered by our own OpenAI API key) to demonstrate a replicable process for evaluating LLM-driven applications.
This approach provides a framework that is adaptable to a wide range of other use cases (see the code sketch after this list). The key steps involve:
- Defining Evaluation Criteria: Choose specific metrics or criteria relevant to your use case.
- Using Evaluation Tools: Utilize DeepEval to assess your system’s performance based on the chosen criteria.
- Iterating on Results: Refine prompts and model configurations to improve outcomes.
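To make these steps concrete, here is a minimal sketch using DeepEval’s AnswerRelevancyMetric and FaithfulnessMetric. The flu question, answer, and retrieval context are invented placeholders rather than the tutorial’s actual data, the threshold values are illustrative, and the example assumes an OPENAI_API_KEY is available in the environment since DeepEval uses an LLM as the evaluator.

```python
# Minimal sketch of the three steps above; placeholder data, not the tutorial's exact setup.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# 1. Define evaluation criteria: relevancy and faithfulness suit a RAG chatbot.
relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)

# 2. Use the evaluation tool: wrap one chatbot exchange as a test case.
test_case = LLMTestCase(
    input="What are common symptoms of the flu?",
    actual_output="Common flu symptoms include fever, cough, and fatigue.",
    retrieval_context=[
        "Influenza typically presents with fever, cough, sore throat, and fatigue."
    ],
)

# 3. Iterate on results: run the metrics, inspect the scores and reasons in the
#    report, then refine prompts or model configuration and re-run.
evaluate(test_cases=[test_case], metrics=[relevancy, faithfulness])
```

The same loop scales from one hand-written test case to a full dataset of real chatbot exchanges; only the test cases change, not the evaluation code.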
NOTE: The complete Google Colab notebook can be found here.
This article is published under my Repl:it series, where I identify articles, new or old, that I want to ‘replicate’: a kind of Read, Evaluate, Print (interpret) & Loop. In these attempts I often make changes to suit my own architectural and development habits (e.g., using Terraform to build cloud resources or Google Colab as the development and runtime platform). I provide a step-by-step (or play-by-play) report of what I did to replicate the article in focus, including missteps and errors encountered along the way.