# AI testing, benchmarks and evals

![Cover](https://wsrv.nl/?url=https%3A%2F%2Fstatic.libsyn.com%2Fp%2Fassets%2Fa%2F1%2F7%2F1%2Fa171dd11ef408014e55e3c100dce7605%2FTWTP_icon_2021.jpg&w=500&h=500)

## Episode metadata

- Episode title: AI testing, benchmarks and evals
- Show: Thoughtworks Technology Podcast
- Owner / Host: Thoughtworks
- Guests: [Shayan Mohanty](https://share.snipd.com/person/d44405fe-5e5d-4b03-98d4-d4863b139dca), [John Singleton](https://share.snipd.com/person/1cd70aeb-1558-4c3b-84db-722dfd926ab0)
- Episode publish date: 2025-01-23
- Episode AI description: Join Shayan Mohanty, Head of AI Research at Thoughtworks, and John Singleton, Program Manager at the AI Lab, as they dive into the complexities of generative AI. They discuss the vital role of evals, benchmarks, and guardrails in ensuring AI reliability. The duo outlines the differences between testing and evaluations, highlighting their significance for businesses. Additionally, they explore mechanistic interpretability and the need for robust frameworks to enhance trust in AI applications. This conversation is essential for anyone navigating the evolving AI landscape.
- Duration: 36:03
- Episode URL: [Open in Snipd](https://share.snipd.com/episode/d2143b1f-8c1c-4825-8e66-cb0fc1d733c3)
- Show URL: [Open in Snipd](https://share.snipd.com/show/e24071d9-1cd4-4bd2-8168-5df9c3cf49f9)
- Export date: 2026-05-08T10:38:03

## Snips

### [LLM Evaluation and KPIs](https://share.snipd.com/snip/895c07fa-cdcf-43b0-8fc9-4348d4aff692)

🎧 13:53 - 15:28 (01:35)

<iframe src="https://share.snipd.com/embed/obsidian-player/snip/cbf3ced5-37d3-4809-96ca-1132e129e5c1" width="100%" height="100" style="border: none; border-radius: 12px;" sandbox="allow-scripts allow-same-origin allow-forms allow-popups allow-clipboard-write" ></iframe>

- Ensure LLM outputs strictly adhere to pre-defined KPIs.
- Establish a robust, transparent framework for measuring performance in production.

#### 💬 Quote

> We have to make sure that all of these things are brutally adhering to those, those KPIs that we have defined and cared about and actually make quote unquote juice worth the squeeze. And we have to have a robust framework [...] to understand how we're measuring performance within that context.
> — John Singleton

John Singleton on evaluating LLM performance

#### 📚 Transcript

**John Singleton:** Uh, we have to make sure that all of these things are brutally adhering to those, those KPIs that we have defined and cared about and actually make quote unquote juice worth the squeeze. And we have to have a robust framework that we all, that is transparent and that we all agree upon, uh, to understand how we're measuring performance within that context. So lots of words for focus, really, on what is the actual key problem in using the right tools at the right time and understanding that if you choose this tool, which is a great fit for a set of problems today in the enterprise, that comes with a new set of muscles and new set of actions that you kind of need to learn to adopt this into production.

**Lilly Ryan:** I'm always interested in the ways that people get this wrong, because with my role in security, I'm always looking at those edge cases and the failure cases and those kinds of things. We're talking about moving into production, and I've seen quite a lot of apps that have reached the proof of concept stage that don't go beyond that point. And there's, I think, a lot we can learn from the failures of those proofs of concept that can really inform what does make it to production. What should people be looking at when it comes to what goes wrong and where do you actually see it going wrong?

**Shayan Mohanty:** I'm going to jump in with a couple things. So when you think about evals at the moment, when you think about kind of like what the industry is pointing to and what is currently being done, there are some intrinsic metrics, things like perplexity that are being used.

---

Created with [Snipd](https://www.snipd.com) | Highlight & Take Notes from Podcasts