Hello, I'm Lena Shakurova, Conversational AI Advisor and CEO & Founder of parslabs.org and chatbotly.co
Need help setting up an evaluation workflow for your AI product? Send me a DM on LinkedIn or book a free intro call
Last updated on 10.05.2025
I will try to keep this resource updated as I learn about new eval tools. New tools are marked "WIP" until I have time to test them.
How to evaluate LLM-based apps: Build with evidence, monitor what matters
To know if you're improving, you must measure.
During development, evaluation shows whether prompt changes or model tweaks help or harm. Guessing slows you down.
But LLM evals are hard. Change one word in a prompt: does it help? Add a new instruction: do past use cases still pass?
LLMs are non-deterministic: the same input might produce five different outputs. How do you decide what's “good” or “bad”? Do you rely on human judgment, unit tests, or automatic scoring? And how do you catch silent regressions when nothing breaks, but quality slips?
In production, monitoring becomes critical. You need alerts when something fails, like the bot refusing basic tasks or drifting off-topic. Test sets help prevent this. Cover edge cases, simulate unexpected inputs: incomplete data, foreign languages, or hostile users.
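Before reaching for a dedicated tool, it helps to see how small the core loop is: run a fixed test set through your app and score each answer, failing the build if the pass rate drops. Below is a minimal sketch, assuming the OpenAI Python SDK; the run_bot wrapper, the model name, and the test cases are placeholders for your own setup, not a recommendation of any specific stack.

```python
# Minimal regression-check sketch: run a fixed test set through your app,
# score each answer with an LLM judge, and fail if the pass rate drops.
# `run_bot`, the model name, and TEST_SET are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TEST_SET = [
    {"input": "How do I reset my password?", "must_mention": "reset link"},
    {"input": "¿Puedo pagar con PayPal?", "must_mention": "PayPal"},
]

def run_bot(user_input: str) -> str:
    """Placeholder: call your own chatbot / LLM pipeline here."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; swap in your own
        messages=[{"role": "user", "content": user_input}],
    )
    return response.choices[0].message.content

def judge(user_input: str, answer: str, must_mention: str) -> bool:
    """LLM-as-a-judge: ask a second model for a strict YES/NO verdict."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"User question: {user_input}\nBot answer: {answer}\n"
                f"Does the answer address the question and mention "
                f"'{must_mention}'? Reply with exactly YES or NO."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")

if __name__ == "__main__":
    passed = sum(
        judge(case["input"], run_bot(case["input"]), case["must_mention"])
        for case in TEST_SET
    )
    pass_rate = passed / len(TEST_SET)
    print(f"Pass rate: {pass_rate:.0%}")
    assert pass_rate >= 0.9, "Regression: pass rate dropped below 90%"
```

The tools below take this same idea and add what a script like this lacks: versioned test sets, dashboards, alerts, and production monitoring.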
This page lists tools to evaluate, test, and monitor LLMs, through every stage of development and deployment.
This document includes:
No-code tools for LLM evaluation (including open source)
Python libraries
Moderation checks
Voice evaluation tools
No-code tools for LLM evaluation
Check the "Gallery" view for screenshots and the "Open Source" view to see all open-source tools.
ID
Name
URL
Github URL
From Lena
Demo link
Price
Features
Note
GitHub stars
1
Agenta
Of all the IDEs I tested, Agenta was the most user-friendly, and I like it most for production use cases. You can add structured test sets for both completion and conversation models, there is version control for your prompts, you can compare different prompts and models, define an LLM as a judge, and use a set of pre-defined metrics to run your tests. So far it's my favourite.
Free | $49/month | $399/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Open Source
Key Features:
• LLM Engineering Platform with tools for prompt engineering, versioning, evaluation, and observability
• Prompt Registry for version control and collaboration
• Systematic evaluation capabilities
Limitations:
• LLM simulations are not explicitly mentioned
• Lack of clarity on latency/cost tracking and hallucination evaluation
• No mention of jailbreak detection
2
PromptKnit
Simple UI: you can compare how your prompt performs across different LLM models, but beyond that the use cases are limited. It's more of a playground than evaluation software.
Free | $7/month | $21/month
Compare LLM models
Compare prompts side by side
Version control
Key Features:
• AI playground for prompt designers
• Supports various LLMs
• Prompt management for storing, editing, and running prompts
Limitations:
• Lacks explicit mention of monitoring/alerts, latency/cost tracking, hallucination evaluation, or jailbreak detection
• The ability to add custom test sets and add LLM as a judge is not specified
3
PromptHub
This is more of a prompt management tool for keeping your prompt library organised, though it also offers evaluation metrics. Good for personal use but not for production use cases.
Free | $9/month | $15/user/month
Compare LLM models
Compare prompts side by side
Version control
Add your own test sets
Add LLM as judge
Key Features:
• Community-driven platform for managing, versioning, and deploying prompts
• Git-based version control
• LLM-based evaluation
Limitations:
• Lacks mention of monitoring/alerts, LLM simulations, latency/cost tracking, or jailbreak detection
• Hallucination evaluation is not explicitly mentioned
4
promptmetheus
My favourite feature of promptmetheus is that you can split your prompt into blocks and show/hide them to see how different parts of the prompt affect the output. When doing prompt engineering we often add extra instructions that barely change the output and then forget to clean them up, which makes the prompt too long and hard for the LLM to follow. With promptmetheus you can fix this.
Free trial | $29/month | $99/month
Compare LLM models
Compare prompts side by side
Version control
Add your own test sets
Latency/cost tracking
Key Features:
• Modular prompt composition using LEGO-like blocks
• Supports multiple LLMs and inference APIs
• Robust testing tools with datasets and completion ratings
• Team collaboration features with shared workspaces
• Prompt templates library
• Analytics dashboard for performance insights
• Cost estimation for inference under different configurations
Limitations:
• Lacks explicit mention of monitoring/alerts, LLM simulations, hallucination evaluation, or jailbreak detection.
• The ability to add LLM as a judge is not specified.
5
helicone
Free | $20/month | $200/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Latency/cost tracking
Evaluate hallucinations
Add LLM as judge
Jailbreak detection
Open Source
Key Features:
• Observability platform for monitoring, debugging, and improving production-ready LLM applications
• Supports testing prompt variations, real-time monitoring, and regression detection
• Offers LLM-as-a-judge evaluations
• Provides an API Cost Calculator
Limitations:
• LLM simulations are not explicitly mentioned
6
vellum ai
Custom plans
Compare LLM models
Monitoring/alerts
Add your own test sets
Latency/cost tracking
Add LLM as judge
Key Features:
• GUI and SDK for AI development
• Supports various models
• Offers tools for testing, evaluation, and monitoring
• Provides AI specialists for support
Limitations:
• LLM simulations are not explicitly mentioned
• Hallucination evaluation and jailbreak detection are not mentioned
7
PromptPerfect
Free | $19/month | $99/month
Compare LLM models
Compare prompts side by side
Add your own test sets
Add LLM as judge
Key Features:
• AI-powered platform for prompt engineering
• Supports multi-model testing and prompt A/B testing
• Enables data-driven prompt optimization
• Allows fine-grained control with custom scoring functions
Limitations:
• Lacks explicit mention of monitoring/alerts, LLM simulations, latency/cost tracking, or jailbreak detection
• Details on version control are limited
8
promptfoo
Free
Compare LLM models
Compare prompts side by side
Add your own test sets
Latency/cost tracking
Add LLM as judge
Jailbreak detection
Open Source
Key Features:
• Open-source tool for testing and evaluating prompts
• Supports multiple LLM providers
• Allows defining assertions to validate prompt outputs
• Tracks token usage and cost
• Has a plugin for red teaming and jailbreak detection
Limitations:
• Lacks native monitoring and version control features
• Hallucination evaluation is not explicitly mentioned
• Does not offer LLM simulations
9
Vercel playground
Simple UI to quickly compare different prompts and models side by side
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Latency/cost tracking
Jailbreak detection
Key Features:
• Offers a secure platform for experimenting with multiple LLMs.
• Provides A/B testing for prompt optimization.
• Includes advanced monitoring and protection against abuse, bots, and unauthorized use via integration with Kasada and Vercel's middleware.
Limitations:
• Hallucination evaluation, LLM simulations, and explicit support for custom test sets are not mentioned.
• Version control remains unclear based on available information.
10
LangWatch
Free | $59/month | $199/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Add your own test sets
Evaluate hallucinations
Add LLM as judge
Latency/cost tracking
Human evaluation
LLM simulations
Version control
Open Source
Key Features:
• Open source Observability and Evaluation platform for debugging, evaluating, and optimizing LLM applications
• Library of 40+ evaluation metrics for LLM pipelines
• Run / build custom evaluations, collect user-feedback
• Human-in-the-loop feedback integration for annotation and dataset building
• Alerting system to notify users about risks like hallucinations in real-time
• Automated evaluations in CI/CD pipelines and historical tracking of metrics
• LLM simulations, latency/cost tracking, and prompt version control
• Supports all major LLMs and LLM frameworks
• Enterprise-ready: ISO27001, GDPR compliant
• On-premise, self-hosted and hybrid solutions (use Cloud but keep customer data on customer side)
11
Custom plans
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Latency/cost tracking
Add LLM as judge
Key Features:
• End-to-end platform for managing the lifecycle of LLM apps
• Supports prompt engineering, RAG, deployment, data management, observability, and evaluation
• Offers an AI Gateway for accessing multiple AI models
• Provides tools for experimentation, tracing, and monitoring LLM app performance
Limitations:
• Explicit mention of LLM simulations, hallucination evaluation, and jailbreak detection is lacking
12
Langtail
Free | $99/month | $499/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Add your own test sets
Evaluate hallucinations
Key Features:
• Testing and debugging of AI applications
• Support for multiple LLM providers
• SDK and OpenAPI for integration
• Designed for use across product, engineering, and business teams
• Beautiful visualizations and powerful testing tools
Limitations:
• Version control, LLM simulations, latency/cost tracking, adding LLM as a judge, and jailbreak detection are not explicitly mentioned
13
OpenLIT
Free
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Latency/cost tracking
Open Source
Key Features:
• Open-source platform for AI engineering
• Application and request tracing
• Manages prompts in a centralized prompt repository, with version control
• Vault offers a secure way to store and manage secrets
• Granular Usage Insights for LLM, Vectordb & GPU performance and costs.
Limitations:
• Adding custom test sets, LLM simulations, hallucination evaluation, adding LLM as a judge, and jailbreak detection are not explicitly mentioned
14
Laminar
Free | $25/month | $50/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Latency/cost tracking
Evaluate hallucinations
Add LLM as judge
Open Source
Key Features:
• Open-source platform for observability, tracing, and evaluation of AI applications
• Supports real-time monitoring of latency, cost, token usage, and input/output
• Provides tools for dynamic few-shot examples to improve prompts
• Allows pipeline versioning and iterative development with Git-like commits
• Facilitates offline and online evaluations with human-in-the-loop feedback or automated systems
Limitations:
• LLM simulations and jailbreak detection are not explicitly mentioned
15
Datadog LLM Observability
Different packages
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Add your own test sets
Latency/cost tracking
Evaluate hallucinations
Jailbreak detection
Key Features:
• End-to-end visibility into LLM chains with detailed tracing of input-output behavior
• Real-time monitoring of operational metrics like latency, cost, token usage, and security risks
• Out-of-the-box quality evaluations for functional performance, topic relevance, toxicity, and security
• Detects hallucinations, drifts, and prompt injections to enhance accuracy and security
• Seamless integration with Datadog Application Performance Monitoring (APM) for comprehensive observability
Limitations:
• Explicit mention of version control and LLM simulations is lacking.
• The ability to add LLM as a judge is not specified
16
Confident AI
Free - Custom pricing
A/B test
Add LLM as judge
LLM simulations
Compare LLM models
Open Source
Key Features:
• A single platform to collect, manage, and test LLM datasets
• Lets you run tests on LLMs with metrics you can customize
• Connects to your CI/CD pipeline for automatic testing
• Tracks real-world LLM outputs and updates your dataset
• Supports LLM-as-a-judge for scoring results the way you want
• Built on DeepEval — open-source and trusted
17
LangFuse
A bit of a cluttered UI, but it has most of the important production features: LLM as a judge, human evaluation, in-production monitoring, a playground for experimentation, prompt management, etc.
Free | $59/month | $199/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add LLM as judge
Human evaluation
Open Source
Key Features:
• Full-stack LLM engineering platform for debugging, evaluating, and improving LLM applications.
• Built-in prompt management with versioning, deployment, and low-latency retrieval.
• Playground for testing prompts and models across different providers within the UI.
• Evaluation toolkit: collect user feedback, run custom evaluation functions, and manually annotate responses.
• Metrics tracking: monitor cost, latency, and quality metrics of your LLM workflows.
• Supports all major LLM frameworks (OpenAI, LangChain, LlamaIndex, etc.) with SDKs for Python and JS/TS.
• Enterprise-ready: SOC 2 Type II, ISO 27001, and GDPR compliant.
• Flexible hosting: Langfuse Cloud (managed) or self-hosted.
Limitations:
• No built-in tooling for hallucination detection or jailbreak testing
• UI is a bit complex
18
Hangar5
Unknown
LLM simulations
Evaluate hallucinations
Key Features:
• Simulation of real-user conversations with chatbots to test behavior in realistic scenarios before deployment
• Automated evaluation of chatbot outputs for relevance, business alignment, and consistency with problem statements
• Detection of hallucinations, inaccuracies, and inconsistent behavior to boost chatbot reliability and customer satisfaction
19
Maxim AI
Free | $29/month | $49/month | Custom
Version control
Monitoring/alerts
Add your own test sets
Synthetic test sets
Add LLM as judge
LLM simulations
Key Features:
- Simulate AI agents using AI-generated test scenarios.
- Evaluate agent performance with built-in and custom metrics, including LLM-as-a-judge.
- Track and version prompts, models, tools, and context without code changes.
- Monitor agent behavior with real-time traces, logs, and quality metrics.
- Trigger alerts based on performance or safety regressions.
- Generate analytics and reports for experiment tracking.
- Automate evaluations via CI/CD and manage human review workflows.
- Enterprise-ready: in-VPC deployment, SSO, SOC 2 compliance, and role-based access control.
20
Uptrain
Unknown
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Add your own test sets
Add LLM as judge
Evaluate hallucinations
Jailbreak detection
A/B test
Synthetic test sets
Open Source
Key Features:
• Compare multiple LLM models and prompts side by side
• Automated monitoring with alerts for performance issues
• Add custom and synthetic test sets for evaluation
• Use various LLMs as automated judges for output quality
• Support for A/B testing different models and prompts
21
Giskard
Free | Custom
Monitoring/alerts
Add your own test sets
Add LLM as judge
Evaluate hallucinations
Jailbreak detection
A/B test
Human evaluation
Synthetic test sets
Open Source
Key Features
• Continuous and automated testing of AI models to detect quality, security, and compliance risks
• Exhaustive risk detection including hallucinations, prompt injections, and harmful content
• Custom test generation using business data and synthetic scenarios
• Collaboration tools for business users and technical teams, including annotation and red-teaming playground
• Use of LLMs as automated judges to evaluate AI outputs and detect vulnerabilities
• Metric-driven comparisons to avoid regressions and support A/B testing
• Enterprise-grade security with on-premise/cloud deployment, role-based access, and GDPR compliance
22
Literal AI
Free | Custom
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Latency/cost tracking
Evaluate hallucinations
A/B test
Key Features
• Prompt Management: Version prompts, test variables, and deploy templates via API
• Evaluation Suite: Automated scoring and human-in-the-loop evaluation for outputs.
• Observability: Real-time monitoring of LLM performance, latency, and costs.
• Experiment Tracking: Run experiments against datasets to avoid regressions.
• Collaboration Tools: Streamline workflows between engineers, product teams, and SMEs
23
Galileo AI
$16/month | $32/month | Custom
Compare LLM models
Compare prompts side by side
Add your own test sets
Evaluate hallucinations
A/B test
Human evaluation
Key Features:
• Multi-LLM Testing & Comparison: Easily test and compare outputs from various large language models side by side.
• Custom Dataset Support: Upload and use your own datasets to evaluate model performance on specific tasks.
• Hallucination & Factuality Evaluation: Tools to assess and reduce hallucinations in model outputs.
• Human-in-the-Loop Evaluation: Incorporates human feedback and labeling to improve evaluation quality.
• A/B Testing Framework: Enables systematic A/B testing to optimize prompts and model choices.
24
Prompt Safe
Not specified
Monitoring/alerts
Jailbreak detection
Key Features:
• Prompt Injection & Jailbreak Detection: Actively detects and alerts on prompt injection and jailbreak attempts.
• Real-time Monitoring & Alerts: Provides monitoring and alerting for suspicious prompt activity.
• API Integration: Offers API-based integration to protect LLM endpoints.
• Security Focus: Specializes in securing LLM applications against adversarial prompts.
• Dashboard Analytics: Presents analytics and reporting on detected threats and prompt security events.
25
Bespoken
$2000-Custom
Monitoring/alerts
Add your own test sets
• Automated Testing: Fully automated functional and exploratory testing for IVR, chatbots, and conversational AI.
• Monitoring & Alerts: 24/7 monitoring with real-time alerts for outages, defects, and performance issues.
• Custom Test Creation: Users can create and run their own test cases to validate conversational experiences.
• Load Testing: Supports load testing to ensure scalability and reliability of conversational systems.
• Defect Identification & Analytics: Identifies and helps triage defects, with analytics to optimize customer journeys.
26
Opik by Comet
Free | $39/month | Custom
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Add LLM as judge
Evaluate hallucinations
A/B test
Human evaluation
Open Source
• Open Source LLM Evaluation: Fully open-source, allowing local deployment and customization.
• Trace Logging & Analysis: Records detailed traces and spans of LLM app responses for deep debugging and understanding.
• Built-in & Custom Metrics: Comes with pre-configured evaluation metrics and allows defining custom metrics via an SDK.
• CI/CD Integration: Enables LLM unit tests and comprehensive test suites to be integrated into continuous deployment pipelines.
• LLM Judges for Quality: Includes automated judges for hallucination detection, factuality, and moderation to ensure output quality.
27
HumanLoop
Free Trial | Custom
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Add LLM as judge
Evaluate hallucinations
A/B test
Human evaluation
• Prompt & Model Management: Develop, version, and compare prompts and models in code or UI, with full version control.
• Automated & Human Evaluation: Supports both code/AI-based automatic evaluations and expert human review.
• Monitoring & Alerting: Real-time alerts and guardrails to detect and notify about issues before they affect users.
• CI/CD Integration: Incorporate evaluations into deployment pipelines to catch regressions early.
• Data Privacy & Security: Enterprise-grade security, including VPC, RBAC, SSO/SAML, SOC-2, GDPR, and HIPAA compliance.
28
Deep Checks
Pay as you go | Basic $1000/month
Compare LLM models
Monitoring/alerts
Version control
Add your own test sets
Add LLM as judge
Evaluate hallucinations
A/B test
Human evaluation
Open Source
• Automated & Manual Evaluation: Combines automated “estimated annotation” with optional human review for robust LLM output validation.
• Golden Set Management: Enables the creation and management of custom test sets (“Golden Sets”) for comprehensive evaluation.
• Continuous Monitoring & Alerts: Provides real-time monitoring and alerting for model and data drift in production environments.
• Open Source & Open Core: Offers an open-source framework for ML and LLM testing, with a robust, widely tested product.
• Hallucination & Policy Compliance Detection: Systematically detects hallucinations, policy deviations, bias, and harmful content in LLM outputs.
29
Arize
$50/month | Custom
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
LLM simulations
Add LLM as judge
Latency/cost tracking
Evaluate hallucinations
A/B test
Human evaluation
Open Source
• Unified Observability & Evaluation: End-to-end monitoring, tracing, and evaluation from development through production for LLMs and AI agents.
• Automated & Human Evaluation: Combines LLM-as-a-Judge, code-based tests, and human-in-the-loop annotation for robust output quality assessment.
• Real-Time Monitoring & Alerts: Provides instant visibility, anomaly detection, and customizable alerts to catch issues early.
• Experimentation & Versioning: Supports A/B testing, model/prompt/version comparison, and continuous improvement workflows.
• Open Source & Interoperability: Built on open standards (OpenTelemetry, OpenInference), with no data lock-in and strong integration capabilities.
30
Haize Labs
Not specified
Monitoring/alerts
Add LLM as judge
Evaluate hallucinations
Jailbreak detection
Synthetic test sets
• Automated Red-Teaming: Generates adversarial prompts and conducts rigorous red-teaming to uncover vulnerabilities and jailbreaking risks.
• Customizable Judges: Allows configuration and deployment of automated judges tailored to specific use cases for evaluation and safety.
• Continuous Monitoring: Provides ongoing monitoring and alerting to ensure AI systems remain robust and safe in production.
• Cascade & Edge Case Testing: Dynamically tests AI systems for edge cases and cascading failures to improve reliability.
• Actionable Reporting & Analytics: Delivers clear, actionable insights and reports to help teams quickly address issues and optimize models.
31
PromptLayer
Start for Free
Version control
Latency/cost tracking
Compare LLM models
WIP
Python libraries for LLM evaluation
ID
Name
URL
From Lena
Open Source or Commercial
Price
Tags
Note
3
Deep Eval
This is the library I personally use the most, because it has most of the features I need and it’s easy to install
Open Source
Free
Compare different LLM models
Add test sets
Evaluate hallucinations
Add LLM as judge
• Specializes in hallucination evaluation using context-vs-output comparison
• Flexible LLM integration for judgment capabilities
• Focused on test case evaluation rather than production monitoring or versioning
• No evidence of jailbreak detection or latency/cost tracking in shown documentation
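For reference, here is roughly what a DeepEval test looks like. This is a minimal sketch based on DeepEval's documented pytest-style interface; exact class and metric names can differ between versions, so treat it as illustrative.

```python
# Sketch of a DeepEval-style unit test for an LLM answer.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_password_reset_answer():
    test_case = LLMTestCase(
        input="How do I reset my password?",
        # In a real test this output would come from your app, not a string.
        actual_output="Click 'Forgot password' on the login page to get a reset link.",
    )
    # AnswerRelevancyMetric uses an LLM-as-a-judge under the hood.
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

Tests like this can be run through DeepEval's CLI or plain pytest and wired into CI so prompt changes are checked automatically.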
1
Ragas
Open Source
Free
Compare different LLM models
Evaluate hallucinations
Add LLM as judge
Key strengths include LLM comparison capabilities, hallucination evaluation through faithfulness metrics, and LLM judge integration. Monitoring is limited to performance tracking without explicit alert features, and test set customization focuses on synthetic data rather than user-uploaded sets.
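A minimal Ragas run over a tiny RAG test set might look like the sketch below. Ragas' API has changed across versions; this follows the widely documented Dataset-based interface, so check the version you install.

```python
# Sketch of a Ragas evaluation on a one-row RAG dataset.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["How do I reset my password?"],
    "answer": ["Use the 'Forgot password' link to request a reset email."],
    "contexts": [["Users can reset passwords via the 'Forgot password' link."]],
})

# faithfulness checks the answer against the retrieved contexts (hallucination
# proxy); answer_relevancy checks it against the question.
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)
```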
8
Evidently AI
Open Source library + Commercial platform
Free + commercial use
Compare different LLM models
Monitoring/alerts from production
Add test sets
Evaluate hallucinations
Add LLM as judge
Open Source
Key features:
• Open-source Python library with 100+ pre-made metrics.
• Framework for custom ML, LLM, and data evaluations.
• Ad hoc reports, automated pipeline checks, and monitoring dashboards.
• LLM and text evaluations, data quality, data drift, etc.
Limitations:
• Does not offer LLM simulations or native version control.
• Jailbreak detection is not a listed feature.
• No side-by-side prompt comparison feature.
4
TruthfulQA
Open Source
Free
Compare different LLM models
Add LLM as judge
Evaluate hallucinations
Key features:
• Specialized benchmark for truthfulness evaluation
• Multiple metrics (GPT-judge, BLEURT, ROUGE, BLEU)
• Supports both generation and multiple-choice tasks
• Pre-defined dataset with human-aligned falsehood detection
Limitations:
• No production monitoring capabilities
• Fixed question set (no custom test uploads)
• Focused solely on truthfulness rather than other operational aspects
6
LangSmith
Open Source library + Commercial platform
Free + commercial use
Compare different LLM models
Monitoring/alerts from production
Add test sets
Observe latency/costs
Evaluate hallucinations
Add LLM as judge
Key Features:
• Tools to debug, test, and monitor AI application performance, whether you're building with LangChain or not.
• Trace every agent run to find bottlenecks.
• Experiment with models and prompts in the Playground, and compare outputs across different prompt versions.
• Track latency, cost, and issues with quality before your users do.
7
Langfuse
Open Source library + Commercial platform
Free + commercial use
Compare different LLM models
Compare prompts side by side
Monitoring/alerts from production
Version control
Add test sets
Observe latency/costs
Evaluate hallucinations
Add LLM as judge
Key features:
• Comprehensive platform for LLMOps.
• Designed for tracking, monitoring, and improving LLM applications.
• Integrates with various tools for evaluations, including hallucination detection and LLM-based judging.
Limitations:
• Does not appear to offer LLM simulation capabilities directly.
• Lacks explicit mention of jailbreak detection.
9
saga-llm-evaluation
Open Source
Free
Compare different LLM models
Add test sets
Evaluate hallucinations
Add LLM as judge
Key Features:
• Versatile Python library for evaluating LLMs.
• Metrics divided into embedding-based, language-model-based, and LLM-based categories.
• Hallucination score, relevance, correctness, and faithfulness metrics, among others.
• Uses Hugging Face Transformers and LangChain.
• Supports LLM-based evaluation and custom LLMs.
Limitations:
• No native version control.
• Does not offer LLM simulations.
• No built-in latency/cost tracking.
• No side-by-side prompt comparison.
• No monitoring or alerts.
• Jailbreak detection is not mentioned.
11
lm-evaluation-harness
Open Source
Free
Compare different LLM models
Add test sets
Add LLM as judge
Key Features:
• Framework for benchmarking language models across numerous tasks.
• Extensible with custom tasks, datasets, and metrics.
• Integration with visualization tools like Weights & Biases and Zeno.
Limitations:
• Lacks native production monitoring or version control.
• No built-in latency/cost tracking.
• Does not focus on hallucination evaluation or jailbreak detection.
• No side-by-side prompt comparison.
12
Trulens
Open Source
Free
Answer relevance
Fairness and bias
Sentiment
Key Features:
• Evaluate LLM apps with feedback functions (e.g. groundedness, relevance, safety, sentiment).
13
Hugging Face Evaluate
Open Source
Free
Key Features:
- Standard NLP metrics (BLEU, ROUGE, METEOR, BERTScore, etc.)
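As a quick illustration, computing a standard metric with Hugging Face Evaluate takes only a few lines; evaluate.load fetches the metric implementation and compute returns the scores.

```python
# Minimal Hugging Face Evaluate example: ROUGE between model outputs
# and reference answers.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on the mat."],
)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum scores
```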
14
Distilabel
Open Source
Free
Synthetic test data
Compare different LLM models
Add test sets
Add LLM as judge
Human Evaluation
• Synthetic Data Generation: Programmatic creation of high-quality, diverse synthetic datasets to accelerate AI development.
• AI Feedback Integration: Unified API to incorporate feedback from any LLM provider to judge and improve dataset quality.
• Scalable Pipelines: Designed for scalable, fault-tolerant pipelines for data synthesis and AI feedback loops.
• Human + AI Feedback Loops: Combines automated AI judging with human-in-the-loop review for robust dataset curation.
• Research-Backed Methodologies: Implements verified research techniques for data synthesis and evaluation to ensure quality.
15
Giskard
Open Source
Free
Answer relevance
Add LLM as judge
Add test sets
Evaluate hallucinations
Synthetic test data
• Automated Test Set Generation (RAGET): Automatically generates test sets to evaluate RAG agents on various components and question types.
• LLM-as-a-Judge Evaluation: Uses LLMs to automatically assess answer correctness and quality against reference answers.
• Component-wise Scoring: Breaks down evaluation scores by RAG components like Generator, Retriever, Rewriter, Router, and Knowledge Base.
• RAGAS Metrics Integration: Supports advanced evaluation metrics such as context precision, faithfulness (hallucination detection), and answer relevancy to deeply analyze output quality.
• Flexible Evaluation API: Allows wrapping any RAG agent with a simple function interface and outputs detailed reports for programmatic or visual analysis.
16
Opper
N/A
Guardrails
Answer relevance
Sentiment
Compare different LLM models
Monitoring/alerts from production
Add test sets
Add LLM as judge
Human Evaluation
• Online & Offline Evaluation: Real-time evaluation on every function call and systematic testing against curated datasets.
• Custom & LLM-based Evaluators: Create tailored evaluators in code or use LLMs to assess tone, relevance, correctness, and other metrics.
• Integration with Tracing: Evaluation results are linked to trace spans, enabling deep insights into function performance and output quality.
• Automated Feedback & Guardrails: Online evaluation provides instant feedback to catch issues early, acting as guardrails during deployment.
• Flexible SDKs: Supports Python and JavaScript SDKs for seamless integration and programmatic evaluation workflows.
Moderation checks
ID
Name
URL
Price
Features
Note
1
OpenAI Moderation
Free
Compare LLM models
Jailbreak detection
Key Features:
• Classifies text and images for harmful content
• Supports multiple categories like hate speech, violence, and self-harm
• Provides a confidence score for each category
• Offers multi-modal input support (text and images) with the omni-moderation-latest model
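A typical pattern is to run the moderation endpoint on user input before it reaches your LLM (and optionally on the model's output before it reaches the user). A minimal sketch with the OpenAI Python SDK:

```python
# Sketch of an OpenAI moderation check on a user message.
from openai import OpenAI

client = OpenAI()

resp = client.moderations.create(
    model="omni-moderation-latest",
    input="I want to hurt someone.",
)
result = resp.results[0]
print(result.flagged)          # True if any category was triggered
print(result.categories)       # per-category booleans (hate, violence, ...)
print(result.category_scores)  # per-category confidence scores
```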
7
Guardrails AI
Free
Key Features:
• Framework for validating and correcting LLM inputs/outputs using "guardrails"
2
Azure OpenAI Content Filtering
Free
Compare LLM models
Monitoring/alerts
Evaluate hallucinations
Jailbreak detection
Key Features:
• Multi-class classification models to detect and filter harmful content
• Covers hate, sexual content, violence, and self-harm categories.
• Configurable severity levels for filtering
• Optional models for jailbreak risk and known content detection
• Groundedness detection for non-streaming scenarios
• Detects user prompt attacks and indirect attacks
3
Groq's content moderation
Free
Compare LLM models
Jailbreak detection
Key Features:
• Uses Llama Guard 3 to classify content safety in LLM inputs and outputs
• Identifies 14 harmful categories based on the MLCommons taxonomy
• Supports multiple languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai
• Provides a simple "safe" or "unsafe" classification with specific category violations listed when content is unsafe
• Easy to integrate via the Groq API
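Since Groq exposes an OpenAI-compatible chat API, a content check is a single chat call. A minimal sketch, assuming the groq Python SDK and the "llama-guard-3-8b" model id (verify the current id in Groq's model list):

```python
# Sketch: classify a user message with Llama Guard 3 via the Groq API.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

completion = client.chat.completions.create(
    model="llama-guard-3-8b",  # assumed model id; check Groq's docs
    messages=[{"role": "user", "content": "How do I pick a lock?"}],
)
# Llama Guard replies "safe", or "unsafe" followed by the violated category
# (e.g. "S2"), which maps to the MLCommons hazard taxonomy.
print(completion.choices[0].message.content)
```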
4
Meta's Llama Guard 3
Free
Compare LLM models
Jailbreak detection
Add LLM as judge
Key Features:
• Content safety classification using Llama Guard 3, an 8B parameter LLM
• Classifies 14 categories of potential hazards based on the MLCommons taxonomy
• Supports multiple languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai
• Provides a "safe" or "unsafe" classification with specific category violations listed when content is unsafe
• Designed to safeguard against the MLCommons standardized hazards taxonomy and supports Llama 3.1 capabilities
• Optimized for safety and security for search and code interpreter tool calls
5
DeepEval
Free + commercial use
Compare LLM models
Compare prompts side by side
Version control
Add your own test sets
Evaluate hallucinations
Add LLM as judge
Jailbreak detection
Key Features:
• Open-source Python framework for unit testing LLM applications
• Pre-built metrics for safety, security, and quality (e.g., toxicity, drift, groundedness)
• CI/CD integration for automated evaluations in development pipelines
• Customizable test cases and datasets for iterative prompt/model refinement
6
RefChecker
Free
Compare LLM models
Add your own test sets
Evaluate hallucinations
Add LLM as judge
Key Features:
• Fine-grained hallucination detection: breaks responses into knowledge triplets ([head, relation, tail]) for atomic fact verification
• Multi-context support: Handles zero-context (Open QA), noisy-context (RAG), and accurate-context (summarization) scenarios
• Modular architecture:
◦ Customizable extractors (LLM-based or rule-based)
◦ Multiple checker options (LLM judges, AlignScore, NLI models)
◦ Flexible aggregators (strict/majority voting/soft thresholds)
• Benchmark dataset: 2.1k human-annotated LLM responses across 300 test samples
• Multi-model support: Integrates with OpenAI, Anthropic, AWS Bedrock, and self-hosted models via vLLM
8
Deepteam
Not specified
Jailbreak detection
Bias
PII leakage
Toxicity
• Automated vulnerability scanning for bias, PII leakage, toxicity, etc.
• Prompt injection and jailbreak detection (including gray box attacks).
• Compliance checks based on OWASP Top 10 for LLMs and NIST AI standards.
• Open-source red teaming framework for LLM security.
Voice evaluation tools
ID
Name
URL
From Lena
Price
Note
Features
1
Coval
I had a demo with them and tested it; the UI is not always self-explanatory, but I liked their voice simulation feature.
Not specified
Coval simulates and evaluates AI agents via voice and chat, using AI-powered tests to ensure reliability and performance. It helps developers optimize AI agents efficiently.
Production Call Analytics
Production Alerts
Performance Analytics
Test Sets
Scenario Simulation
LLM as a judge
2
Hamming AI
I had a demo with them and tested it; they cover the most important voice metrics.
Not specified
Hamming automates AI voice agent testing, simulating thousands of concurrent phone calls to find bugs.
Automated AI Voice Agent Testing
Prompt Management
Prompt Optimizer
Production Call Analytics
Scenario Simulation
Voice Experimentation Tracking
3
Cekura (formerly Vocera)
$250-$1000 or custom
Cekura automates AI voice agent testing by simulating realistic conversations with workflows and personas. It offers monitoring, alerting, and performance insights.
Production Alerts
Production Call Analytics
Scenario Simulation
4
fixa
Free
Fixa is a Python package for AI voice agent testing. It uses voice agents to call your voice agent, then uses an LLM to evaluate the conversation. It integrates with Pipecat, Cartesia, Deepgram, OpenAI, and Twilio.
Scenario Simulation
LLM as a judge
5
Test Ai
Different plans and packages
nbulatest.ai simplifies AI testing with simulated scenarios, custom datasets, and real-time tracking. It offers performance insights, notifications, and a user-friendly interface for optimization.
AI-Crafted Datasets
Performance Monitoring
Actionable Insights
Scenario Simulation
6
BlueJay
Not specified
Bluejay is a cloud-based incident management platform designed to streamline and optimize alert management for engineering teams. Its key purpose is to reduce downtime and Mean Time to Resolution (MTTR) by ensuring comprehensive and effective alerting before incidents occur, rather than just reacting after they happen.
Production Alerts
Performance Analytics
Performance Monitoring
Actionable Insights
Does your team need help setting up an evaluation workflow?
Send me a DM on LinkedIn or book a free intro call