Hello, I'm Lena Shakurova, Conversational AI Advisor and CEO & Founder of parslabs.org and chatbotly.co
Need help setting up an evaluation workflow for your AI product? Send me a DM on LinkedIn or book a free intro call
Last updated on 10.05.2025
I will try to keep this resource updated as I learn about new eval tools. New tools are marked "WIP" until I have time to test them.
How to evaluate LLM-based apps: Build with evidence, monitor what matters
To know if you're improving, you must measure.
During development, evaluation shows whether prompt changes or model tweaks help or harm. Guessing slows you down.
But LLM evals are hard. Change one word in a prompt: does it help? Add a new instruction: do past use cases still pass?
LLMs are non-deterministic: the same input might produce five different outputs. How do you decide what's “good” or “bad”? Do you rely on human judgment, unit tests, or automatic scoring? And how do you catch silent regressions when nothing breaks, but quality slips?
In production, monitoring becomes critical. You need alerts when something fails, like the bot refusing basic tasks or drifting off-topic. Test sets help prevent this. Cover edge cases, simulate unexpected inputs: incomplete data, foreign languages, or hostile users.
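Before reaching for a dedicated tool, it helps to see how small the core loop is: run a fixed test set through your app and score each answer, failing the build if the pass rate drops. Below is a minimal sketch, assuming the OpenAI Python SDK; the run_bot wrapper, the model name, and the test cases are placeholders for your own setup, not a recommendation of any specific stack.

```python
# Minimal regression-check sketch: run a fixed test set through your app,
# score each answer with an LLM judge, and fail if the pass rate drops.
# `run_bot`, the model name, and TEST_SET are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TEST_SET = [
    {"input": "How do I reset my password?", "must_mention": "reset link"},
    {"input": "¿Puedo pagar con PayPal?", "must_mention": "PayPal"},
]

def run_bot(user_input: str) -> str:
    """Placeholder: call your own chatbot / LLM pipeline here."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; swap in your own
        messages=[{"role": "user", "content": user_input}],
    )
    return response.choices[0].message.content

def judge(user_input: str, answer: str, must_mention: str) -> bool:
    """LLM-as-a-judge: ask a second model for a strict YES/NO verdict."""
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"User question: {user_input}\nBot answer: {answer}\n"
                f"Does the answer address the question and mention "
                f"'{must_mention}'? Reply with exactly YES or NO."
            ),
        }],
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")

if __name__ == "__main__":
    passed = sum(
        judge(case["input"], run_bot(case["input"]), case["must_mention"])
        for case in TEST_SET
    )
    pass_rate = passed / len(TEST_SET)
    print(f"Pass rate: {pass_rate:.0%}")
    assert pass_rate >= 0.9, "Regression: pass rate dropped below 90%"
```

The tools below take this same idea and add what a script like this lacks: versioned test sets, dashboards, alerts, and production monitoring.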
This page lists tools to evaluate, test, and monitor LLMs, through every stage of development and deployment.
This document includes:
No-code tools for LLM evaluation (including open source)
Python libraries
Moderation checks
Voice evaluation tools
No-code tools for LLM evaluation
Check the "Gallery" view for screenshots and the "Open Source" view to see all open-source tools.
ID
Name
URL
Github URL
From Lena
Demo link
Price
Features
Note
GitHub stars
1
Agenta
Of all the IDEs I tested, Agenta was the most user-friendly, and I like it most for production use cases. You can add structured test sets for both completion and conversation models, there is version control for your prompts, you can compare different prompts and models, define an LLM as a judge, and use a set of pre-defined metrics to run your tests. So far it's my favourite.
Free | $49/month | $399/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Open Source
Key Features:
• LLM Engineering Platform with tools for prompt engineering, versioning, evaluation, and observability
• Prompt Registry for version control and collaboration
• Systematic evaluation capabilities
Limitations:
• LLM simulations are not explicitly mentioned
• Lack of clarity on latency/cost tracking and hallucination evaluation
• No mention of jailbreak detection
2
PromptKnit
Simple UI: you can compare how your prompt performs across different LLM models, but beyond that the use cases are limited. It's more of a playground than evaluation software.
Free | $7/month | $21/month
Compare LLM models
Compare prompts side by side
Version control
Key Features:
• AI playground for prompt designers
• Supports various LLMs
• Prompt management for storing, editing, and running prompts
Limitations:
• Lacks explicit mention of monitoring/alerts, latency/cost tracking, hallucination evaluation, or jailbreak detection
• The ability to add custom test sets and add LLM as a judge is not specified
3
PromptHub
This is more of a prompt management tool for keeping your prompt library organised, though it also offers evaluation metrics. Good for personal use but not for production use cases.
Free | $9/month | $15/user/month
Compare LLM models
Compare prompts side by side
Version control
Add your own test sets
Add LLM as judge
Key Features:
• Community-driven platform for managing, versioning, and deploying prompts
• Git-based version control
• LLM-based evaluation
Limitations:
• Lacks mention of monitoring/alerts, LLM simulations, latency/cost tracking, or jailbreak detection
• Hallucination evaluation is not explicitly mentioned
4
promptmetheus
My favourite feature of promptmetheus is that you can split your prompt into blocks and show/hide them to see how different parts of the prompt affect the output. When doing prompt engineering we often add extra instructions that barely change the output and then forget to clean them up, which makes the prompt too long and hard for the LLM to follow. With promptmetheus you can fix this.
Free trial | $29/month | $99/month
Compare LLM models
Compare prompts side by side
Version control
Add your own test sets
Latency/cost tracking
Key Features:
• Modular prompt composition using LEGO-like blocks
• Supports multiple LLMs and inference APIs
• Robust testing tools with datasets and completion ratings
• Team collaboration features with shared workspaces
• Prompt templates library
• Analytics dashboard for performance insights
• Cost estimation for inference under different configurations
Limitations:
• Lacks explicit mention of monitoring/alerts, LLM simulations, hallucination evaluation, or jailbreak detection.
• The ability to add LLM as a judge is not specified.
5
helicone
Free | $20/month | $200/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Latency/cost tracking
Evaluate hallucinations
Add LLM as judge
Jailbreak detection
Open Source
Key Features:
• Observability platform for monitoring, debugging, and improving production-ready LLM applications
• Supports testing prompt variations, real-time monitoring, and regression detection
• Offers LLM-as-a-judge evaluations
• Provides an API Cost Calculator
Limitations:
• LLM simulations are not explicitly mentioned
6
vellum ai
Custom plans
Compare LLM models
Monitoring/alerts
Add your own test sets
Latency/cost tracking
Add LLM as judge
Key Features:
• GUI and SDK for AI development
• Supports various models
• Offers tools for testing, evaluation, and monitoring
• Provides AI specialists for support
Limitations:
• LLM simulations are not explicitly mentioned
• Hallucination evaluation and jailbreak detection are not mentioned
7
PromptPerfect
Free | $19/month | $99/month
Compare LLM models
Compare prompts side by side
Add your own test sets
Add LLM as judge
Key Features:
• AI-powered platform for prompt engineering
• Supports multi-model testing and prompt A/B testing
• Enables data-driven prompt optimization
• Allows fine-grained control with custom scoring functions
Limitations:
• Lacks explicit mention of monitoring/alerts, LLM simulations, latency/cost tracking, or jailbreak detection
• Details on version control are limited
8
promptfoo
Free
Compare LLM models
Compare prompts side by side
Add your own test sets
Latency/cost tracking
Add LLM as judge
Jailbreak detection
Open Source
Key Features:
• Open-source tool for testing and evaluating prompts
• Supports multiple LLM providers
• Allows defining assertions to validate prompt outputs
• Tracks token usage and cost
• Has a plugin for red teaming and jailbreak detection
Limitations:
• Lacks native monitoring and version control features
• Hallucination evaluation is not explicitly mentioned
• Does not offer LLM simulations
9
Vercel playground
Simple UI to quickly compare different prompts and models side by side
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Latency/cost tracking
Jailbreak detection
Key Features:
• Offers a secure platform for experimenting with multiple LLMs.
• Provides A/B testing for prompt optimization.
• Includes advanced monitoring and protection against abuse, bots, and unauthorized use via integration with Kasada and Vercel's middleware.
Limitations:
• Hallucination evaluation, LLM simulations, and explicit support for custom test sets are not mentioned.
• Version control remains unclear based on available information.
10
LangWatch
Free | $59/month | $199/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Add your own test sets
Evaluate hallucinations
Add LLM as judge
Latency/cost tracking
Human evaluation
LLM simulations
Version control
Open Source
Key Features:
• Open source Observability and Evaluation platform for debugging, evaluating, and optimizing LLM applications
• Library of 40+ evaluation metrics for LLM pipelines
• Run / build custom evaluations, collect user-feedback
• Human-in-the-loop feedback integration for annotation and dataset building
• Alerting system to notify users about risks like hallucinations in real-time
• Automated evaluations in CI/CD pipelines and historical tracking of metrics
• LLM simulations, latency/cost tracking, and prompt version control
• Supports all major LLMs and LLM frameworks
• Enterprise-ready: ISO27001, GDPR compliant
• On-premise, self-hosted and hybrid solutions (use Cloud but keep customer data on customer side)
11
Custom plans
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Latency/cost tracking
Add LLM as judge
Key Features:
• End-to-end platform for managing the lifecycle of LLM apps
• Supports prompt engineering, RAG, deployment, data management, observability, and evaluation
• Offers an AI Gateway for accessing multiple AI models
• Provides tools for experimentation, tracing, and monitoring LLM app performance
Limitations:
• Explicit mention of LLM simulations, hallucination evaluation, and jailbreak detection is lacking
12
Langtail
Free | $99/month | $499/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Add your own test sets
Evaluate hallucinations
Key Features:
• Testing and debugging of AI applications
• Support for multiple LLM providers
• SDK and OpenAPI for integration
• Designed for use across product, engineering, and business teams
• Beautiful visualizations and powerful testing tools
Limitations:
• Version control, LLM simulations, latency/cost tracking, adding LLM as a judge, and jailbreak detection are not explicitly mentioned
13
OpenLIT
Free
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Latency/cost tracking
Open Source
Key Features:
• Open-source platform for AI engineering
• Application and request tracing
• Manages prompts in a centralized prompt repository, with version control
• Vault offers a secure way to store and manage secrets
• Granular Usage Insights for LLM, Vectordb & GPU performance and costs.
Limitations:
• Adding custom test sets, LLM simulations, hallucination evaluation, adding LLM as a judge, and jailbreak detection are not explicitly mentioned
14
Laminar
Free | $25/month | $50/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Latency/cost tracking
Evaluate hallucinations
Add LLM as judge
Open Source
Key Features:
• Open-source platform for observability, tracing, and evaluation of AI applications
• Supports real-time monitoring of latency, cost, token usage, and input/output
• Provides tools for dynamic few-shot examples to improve prompts
• Allows pipeline versioning and iterative development with Git-like commits
• Facilitates offline and online evaluations with human-in-the-loop feedback or automated systems
Limitations:
• LLM simulations and jailbreak detection are not explicitly mentioned
15
Datadog LLM Observability
Different packages
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Add your own test sets
Latency/cost tracking
Evaluate hallucinations
Jailbreak detection
Key Features:
• End-to-end visibility into LLM chains with detailed tracing of input-output behavior
• Real-time monitoring of operational metrics like latency, cost, token usage, and security risks
• Out-of-the-box quality evaluations for functional performance, topic relevance, toxicity, and security
• Detects hallucinations, drifts, and prompt injections to enhance accuracy and security
• Seamless integration with Datadog Application Performance Monitoring (APM) for comprehensive observability
Limitations:
• Explicit mention of version control and LLM simulations is lacking.
• The ability to add LLM as a judge is not specified
16
Confident AI
Free - Custom pricing
A/B test
Add LLM as judge
LLM simulations
Compare LLM models
Open Source
Key Features:
• A single platform to collect, manage, and test LLM datasets
• Lets you run tests on LLMs with metrics you can customize
• Connects to your CI/CD pipeline for automatic testing
• Tracks real-world LLM outputs and updates your dataset
• Supports LLM-as-a-judge for scoring results the way you want
• Built on DeepEval — open-source and trusted
17
LangFuse
A bit of a cluttered UI, but it has most of the important production features: LLM as a judge, human evaluation, in-production monitoring, a playground for experimentation, prompt management, etc.
Free | $59/month | $199/month
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add LLM as judge
Human evaluation
Open Source
Key Features:
• Full-stack LLM engineering platform for debugging, evaluating, and improving LLM applications.
• Built-in prompt management with versioning, deployment, and low-latency retrieval.
• Playground for testing prompts and models across different providers within the UI.
• Evaluation toolkit: collect user feedback, run custom evaluation functions, and manually annotate responses.
• Metrics tracking: monitor cost, latency, and quality metrics of your LLM workflows.
• Supports all major LLM frameworks (OpenAI, LangChain, LlamaIndex, etc.) with SDKs for Python and JS/TS.
• Enterprise-ready: SOC 2 Type II, ISO 27001, and GDPR compliant.
• Flexible hosting: Langfuse Cloud (managed) or self-hosted.
Limitations:
• No built-in tooling for hallucination detection or jailbreak testing
• UI is a bit complex
18
Hangar5
Unknown
LLM simulations
Evaluate hallucinations
Key Features:
• Simulation of real-user conversations with chatbots to test behavior in realistic scenarios before deployment
• Automated evaluation of chatbot outputs for relevance, business alignment, and consistency with problem statements
• Detection of hallucinations, inaccuracies, and inconsistent behavior to boost chatbot reliability and customer satisfaction
19
Maxim AI
Free | $29/month | $49/month | Custom
Version control
Monitoring/alerts
Add your own test sets
Synthetic test sets
Add LLM as judge
LLM simulations
Key Features:
- Simulate AI agents using AI-generated test scenarios.
- Evaluate agent performance with built-in and custom metrics, including LLM-as-a-judge.
- Track and version prompts, models, tools, and context without code changes.
- Monitor agent behavior with real-time traces, logs, and quality metrics.
- Trigger alerts based on performance or safety regressions.
- Generate analytics and reports for experiment tracking.
- Automate evaluations via CI/CD and manage human review workflows.
- Enterprise-ready: in-VPC deployment, SSO, SOC 2 compliance, and role-based access control.
20
Uptrain
Unknown
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Add your own test sets
Add LLM as judge
Evaluate hallucinations
Jailbreak detection
A/B test
Synthetic test sets
Open Source
Key Features:
• Compare multiple LLM models and prompts side by side
• Automated monitoring with alerts for performance issues
• Add custom and synthetic test sets for evaluation
• Use various LLMs as automated judges for output quality
• Support for A/B testing different models and prompts
21
Giskard
Free | Custom
Monitoring/alerts
Add your own test sets
Add LLM as judge
Evaluate hallucinations
Jailbreak detection
A/B test
Human evaluation
Synthetic test sets
Open Source
Key Features
• Continuous and automated testing of AI models to detect quality, security, and compliance risks
• Exhaustive risk detection including hallucinations, prompt injections, and harmful content
• Custom test generation using business data and synthetic scenarios
• Collaboration tools for business users and technical teams, including annotation and red-teaming playground
• Use of LLMs as automated judges to evaluate AI outputs and detect vulnerabilities
• Metric-driven comparisons to avoid regressions and support A/B testing
• Enterprise-grade security with on-premise/cloud deployment, role-based access, and GDPR compliance
22
Literal AI
Free | Custom
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Latency/cost tracking
Evaluate hallucinations
A/B test
Key Features
• Prompt Management: Version prompts, test variables, and deploy templates via API
• Evaluation Suite: Automated scoring and human-in-the-loop evaluation for outputs.
• Observability: Real-time monitoring of LLM performance, latency, and costs.
• Experiment Tracking: Run experiments against datasets to avoid regressions.
• Collaboration Tools: Streamline workflows between engineers, product teams, and SMEs
23
Galileo AI
$16/month | $32/month | Custom
Compare LLM models
Compare prompts side by side
Add your own test sets
Evaluate hallucinations
A/B test
Human evaluation
Key Features:
• Multi-LLM Testing & Comparison: Easily test and compare outputs from various large language models side by side.
• Custom Dataset Support: Upload and use your own datasets to evaluate model performance on specific tasks.
• Hallucination & Factuality Evaluation: Tools to assess and reduce hallucinations in model outputs.
• Human-in-the-Loop Evaluation: Incorporates human feedback and labeling to improve evaluation quality.
• A/B Testing Framework: Enables systematic A/B testing to optimize prompts and model choices.
24
Prompt Safe
Not specified
Monitoring/alerts
Jailbreak detection
Key Features:
• Prompt Injection & Jailbreak Detection: Actively detects and alerts on prompt injection and jailbreak attempts.
• Real-time Monitoring & Alerts: Provides monitoring and alerting for suspicious prompt activity.
• API Integration: Offers API-based integration to protect LLM endpoints.
• Security Focus: Specializes in securing LLM applications against adversarial prompts.
• Dashboard Analytics: Presents analytics and reporting on detected threats and prompt security events.
25
Bespoken
$2000-Custom
Monitoring/alerts
Add your own test sets
• Automated Testing: Fully automated functional and exploratory testing for IVR, chatbots, and conversational AI.
• Monitoring & Alerts: 24/7 monitoring with real-time alerts for outages, defects, and performance issues.
• Custom Test Creation: Users can create and run their own test cases to validate conversational experiences.
• Load Testing: Supports load testing to ensure scalability and reliability of conversational systems.
• Defect Identification & Analytics: Identifies and helps triage defects, with analytics to optimize customer journeys.
26
Opik by Comet
Free | $39/month | Custom
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Add LLM as judge
Evaluate hallucinations
A/B test
Human evaluation
Open Source
• Open Source LLM Evaluation: Fully open-source, allowing local deployment and customization.
• Trace Logging & Analysis: Records detailed traces and spans of LLM app responses for deep debugging and understanding.
• Built-in & Custom Metrics: Comes with pre-configured evaluation metrics and allows defining custom metrics via an SDK.
• CI/CD Integration: Enables LLM unit tests and comprehensive test suites to be integrated into continuous deployment pipelines.
• LLM Judges for Quality: Includes automated judges for hallucination detection, factuality, and moderation to ensure output quality.
27
HumanLoop
Free Trial | Custom
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
Add LLM as judge
Evaluate hallucinations
A/B test
Human evaluation
• Prompt & Model Management: Develop, version, and compare prompts and models in code or UI, with full version control.
• Automated & Human Evaluation: Supports both code/AI-based automatic evaluations and expert human review.
• Monitoring & Alerting: Real-time alerts and guardrails to detect and notify about issues before they affect users.
• CI/CD Integration: Incorporate evaluations into deployment pipelines to catch regressions early.
• Data Privacy & Security: Enterprise-grade security, including VPC, RBAC, SSO/SAML, SOC-2, GDPR, and HIPAA compliance.
28
Deep Checks
Pay as you go | Basic $1000/month
Compare LLM models
Monitoring/alerts
Version control
Add your own test sets
Add LLM as judge
Evaluate hallucinations
A/B test
Human evaluation
Open Source
• Automated & Manual Evaluation: Combines automated “estimated annotation” with optional human review for robust LLM output validation.
• Golden Set Management: Enables the creation and management of custom test sets (“Golden Sets”) for comprehensive evaluation.
• Continuous Monitoring & Alerts: Provides real-time monitoring and alerting for model and data drift in production environments.
• Open Source & Open Core: Offers an open-source framework for ML and LLM testing, with a robust, widely tested product.
• Hallucination & Policy Compliance Detection: Systematically detects hallucinations, policy deviations, bias, and harmful content in LLM outputs.
29
Arize
$50/month | Custom
Compare LLM models
Compare prompts side by side
Monitoring/alerts
Version control
Add your own test sets
LLM simulations
Add LLM as judge
Latency/cost tracking
Evaluate hallucinations
A/B test
Human evaluation
Open Source
• Unified Observability & Evaluation: End-to-end monitoring, tracing, and evaluation from development through production for LLMs and AI agents.
• Automated & Human Evaluation: Combines LLM-as-a-Judge, code-based tests, and human-in-the-loop annotation for robust output quality assessment.
• Real-Time Monitoring & Alerts: Provides instant visibility, anomaly detection, and customizable alerts to catch issues early.
• Experimentation & Versioning: Supports A/B testing, model/prompt/version comparison, and continuous improvement workflows.
• Open Source & Interoperability: Built on open standards (OpenTelemetry, OpenInference), with no data lock-in and strong integration capabilities.
30
Haize Labs
Not specified
Monitoring/alerts
Add LLM as judge
Evaluate hallucinations
Jailbreak detection
Synthetic test sets
• Automated Red-Teaming: Generates adversarial prompts and conducts rigorous red-teaming to uncover vulnerabilities and jailbreaking risks.
• Customizable Judges: Allows configuration and deployment of automated judges tailored to specific use cases for evaluation and safety.
• Continuous Monitoring: Provides ongoing monitoring and alerting to ensure AI systems remain robust and safe in production.
• Cascade & Edge Case Testing: Dynamically tests AI systems for edge cases and cascading failures to improve reliability.
• Actionable Reporting & Analytics: Delivers clear, actionable insights and reports to help teams quickly address issues and optimize models.
31
PromptLayer
Start for Free
Version control
Latency/cost tracking
Compare LLM models
WIP
Python libraries for LLM evaluation
ID
Name
URL
From Lena
Open Source or Commercial
Price
Tags
Note
3
Deep Eval
This is the library I personally use the most, because it has most of the features I need and it’s easy to install
Open Source
Free
Compare different LLM models
Add test sets
Evaluate hallucinations
Add LLM as judge
• Specializes in hallucination evaluation using context-vs-output comparison
• Flexible LLM integration for judgment capabilities
• Focused on test case evaluation rather than production monitoring or versioning
• No evidence of jailbreak detection or latency/cost tracking in shown documentation
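For reference, here is roughly what a DeepEval test looks like. This is a minimal sketch based on DeepEval's documented pytest-style interface; exact class and metric names can differ between versions, so treat it as illustrative.

```python
# Sketch of a DeepEval-style unit test for an LLM answer.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_password_reset_answer():
    test_case = LLMTestCase(
        input="How do I reset my password?",
        # In a real test this output would come from your app, not a string.
        actual_output="Click 'Forgot password' on the login page to get a reset link.",
    )
    # AnswerRelevancyMetric uses an LLM-as-a-judge under the hood.
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

Tests like this can be run through DeepEval's CLI or plain pytest and wired into CI so prompt changes are checked automatically.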
1
Ragas
Open Source
Free
Compare different LLM models
Evaluate hallucinations
Add LLM as judge
Key strengths include LLM comparison capabilities, hallucination evaluation through faithfulness metrics, and LLM judge integration. Monitoring is limited to performance tracking without explicit alert features, and test set customization focuses on synthetic data rather than user-uploaded sets.
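A minimal Ragas run over a tiny RAG test set might look like the sketch below. Ragas' API has changed across versions; this follows the widely documented Dataset-based interface, so check the version you install.

```python
# Sketch of a Ragas evaluation on a one-row RAG dataset.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["How do I reset my password?"],
    "answer": ["Use the 'Forgot password' link to request a reset email."],
    "contexts": [["Users can reset passwords via the 'Forgot password' link."]],
})

# faithfulness checks the answer against the retrieved contexts (hallucination
# proxy); answer_relevancy checks it against the question.
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)
```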
8
Evidently AI
Open Source library + Commercial platform
Free + commercial use
Compare different LLM models
Monitoring/alerts from production
Add test sets
Evaluate hallucinations
Add LLM as judge
Open Source
Key features:
• Open-source Python library with 100+ pre-made metrics.
• Framework for custom ML, LLM, and data evaluations.
• Ad hoc reports, automated pipeline checks, and monitoring dashboards.
• LLM and text evaluations, data quality, data drift, etc.
Limitations:
• Does not offer LLM simulations or native version control.
• Jailbreak detection is not a listed feature.
• No side-by-side prompt comparison feature.
4
TruthfulQA
Open Source
Free
Compare different LLM models
Add LLM as judge
Evaluate hallucinations
Key features:
• Specialized benchmark for truthfulness evaluation
• Multiple metrics (GPT-judge, BLEURT, ROUGE, BLEU)
• Supports both generation and multiple-choice tasks
• Pre-defined dataset with human-aligned falsehood detection
Limitations:
• No production monitoring capabilities
• Fixed question set (no custom test uploads)
• Focused solely on truthfulness rather than other operational aspects
6
LangSmith
Open Source library + Commercial platform
Free + commercial use
Compare different LLM models
Monitoring/alerts from production
Add test sets
Observe latency/costs
Evaluate hallucinations
Add LLM as judge
Key Features:
• Tools to debug, test, and monitor AI application performance, whether you're building with LangChain or not.
• Trace every agent run to find bottlenecks.
• Experiment with models and prompts in the Playground, and compare outputs across different prompt versions.
• Track latency, cost, and issues with quality before your users do.
7
Langfuse
Open Source library + Commercial platform
Free + commercial use
Compare different LLM models
Compare prompts side by side
Monitoring/alerts from production
Version control
Add test sets
Observe latency/costs
Evaluate hallucinations
Add LLM as judge
Key features:
• Comprehensive platform for LLMOps.
• Designed for tracking, monitoring, and improving LLM applications.
• Integrates with various tools for evaluations, including hallucination detection and LLM-based judging.
Limitations:
• Does not appear to offer LLM simulation capabilities directly.
• Lacks explicit mention of jailbreak detection.
9
saga-llm-evaluation
Open Source
Free
Compare different LLM models
Add test sets
Evaluate hallucinations
Add LLM as judge
Key Features:
• Versatile Python library for evaluating LLMs.
• Metrics divided into embedding-based, language-model-based, and LLM-based categories.
• Hallucination score, relevance, correctness, and faithfulness metrics, among others.
• Uses Hugging Face Transformers and LangChain.
• Supports LLM-based evaluation and custom LLMs.
Limitations:
• No native version control.
• Does not offer LLM simulations.
• No built-in latency/cost tracking.
• No side-by-side prompt comparison.
• No monitoring or alerts.
• Jailbreak detection is not mentioned.
11
lm-evaluation-harness
Open Source
Free
Compare different LLM models
Add test sets
Add LLM as judge
Key Features:
• Framework for benchmarking language models across numerous tasks.
• Extensible with custom tasks, datasets, and metrics.
• Integration with visualization tools like Weights & Biases and Zeno.
Limitations:
• Lacks native production monitoring or version control.
• No built-in latency/cost tracking.
• Does not focus on hallucination evaluation or jailbreak detection.
• No side-by-side prompt comparison.
12
Trulens
Open Source
Free
Answer relevance
Fairness and bias
Sentiment
Key Features:
• Evaluate LLM apps with feedback functions (e.g. groundedness, relevance, safety, sentiment).
13
Hugging Face Evaluate
Open Source
Free
Key Features:
- Standard NLP metrics (BLEU, ROUGE, METEOR, BERTScore, etc.)
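As a quick illustration, computing a standard metric with Hugging Face Evaluate takes only a few lines; evaluate.load fetches the metric implementation and compute returns the scores.

```python
# Minimal Hugging Face Evaluate example: ROUGE between model outputs
# and reference answers.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["The cat sat on the mat."],
    references=["A cat was sitting on the mat."],
)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum scores
```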
14
Distilabel
Open Source
Free
Synthetic test data
Compare different LLM models
Add test sets
Add LLM as judge
Human Evaluation
• Synthetic Data Generation: Programmatic creation of high-quality, diverse synthetic datasets to accelerate AI development.
• AI Feedback Integration: Unified API to incorporate feedback from any LLM provider to judge and improve dataset quality.
• Scalable Pipelines: Designed for scalable, fault-tolerant pipelines for data synthesis and AI feedback loops.
• Human + AI Feedback Loops: Combines automated AI judging with human-in-the-loop review for robust dataset curation.
• Research-Backed Methodologies: Implements verified research techniques for data synthesis and evaluation to ensure quality.
15
Giskard
Open Source
Free
Answer relevance
Add LLM as judge
Add test sets
Evaluate hallucinations
Synthetic test data
• Automated Test Set Generation (RAGET): Automatically generates test sets to evaluate RAG agents on various components and question types.
• LLM-as-a-Judge Evaluation: Uses LLMs to automatically assess answer correctness and quality against reference answers.
• Component-wise Scoring: Breaks down evaluation scores by RAG components like Generator, Retriever, Rewriter, Router, and Knowledge Base.
• RAGAS Metrics Integration: Supports advanced evaluation metrics such as context precision, faithfulness (hallucination detection), and answer relevancy to deeply analyze output quality.
• Flexible Evaluation API: Allows wrapping any RAG agent with a simple function interface and outputs detailed reports for programmatic or visual analysis.
16
Opper
N/A
Guardrails
Answer relevance
Sentiment
Compare different LLM models
Monitoring/alerts from production
Add test sets
Add LLM as judge
Human Evaluation
• Online & Offline Evaluation: Real-time evaluation on every function call and systematic testing against curated datasets.
• Custom & LLM-based Evaluators: Create tailored evaluators in code or use LLMs to assess tone, relevance, correctness, and other metrics.
• Integration with Tracing: Evaluation results are linked to trace spans, enabling deep insights into function performance and output quality.
• Automated Feedback & Guardrails: Online evaluation provides instant feedback to catch issues early, acting as guardrails during deployment.
• Flexible SDKs: Supports Python and JavaScript SDKs for seamless integration and programmatic evaluation workflows.
Moderation checks
ID
Name
URL
Price
Features
Note
1
OpenAI Moderation
Free
Compare LLM models
Jailbreak detection
Key Features:
• Classifies text and images for harmful content
• Supports multiple categories like hate speech, violence, and self-harm
• Provides a confidence score for each category
• Offers multi-modal input support (text and images) with the omni-moderation-latest model
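A typical pattern is to run the moderation endpoint on user input before it reaches your LLM (and optionally on the model's output before it reaches the user). A minimal sketch with the OpenAI Python SDK:

```python
# Sketch of an OpenAI moderation check on a user message.
from openai import OpenAI

client = OpenAI()

resp = client.moderations.create(
    model="omni-moderation-latest",
    input="I want to hurt someone.",
)
result = resp.results[0]
print(result.flagged)          # True if any category was triggered
print(result.categories)       # per-category booleans (hate, violence, ...)
print(result.category_scores)  # per-category confidence scores
```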
7
Guardrails AI
Free
Key Features:
• Framework for validating and correcting LLM inputs/outputs using "guardrails"
2
Azure OpenAI Content Filtering
Free
Compare LLM models
Monitoring/alerts
Evaluate hallucinations
Jailbreak detection
Key Features:
• Multi-class classification models to detect and filter harmful content
• Covers hate, sexual content, violence, and self-harm categories.
• Configurable severity levels for filtering
• Optional models for jailbreak risk and known content detection
• Groundedness detection for non-streaming scenarios
• Detects user prompt attacks and indirect attacks
3
Groq's content moderation
Free
Compare LLM models
Jailbreak detection
Key Features:
• Uses Llama Guard 3 to classify content safety in LLM inputs and outputs
• Identifies 14 harmful categories based on the MLCommons taxonomy
• Supports multiple languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai
• Provides a simple "safe" or "unsafe" classification with specific category violations listed when content is unsafe
• Easy to integrate via the Groq API
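Since Groq exposes an OpenAI-compatible chat API, a content check is a single chat call. A minimal sketch, assuming the groq Python SDK and the "llama-guard-3-8b" model id (verify the current id in Groq's model list):

```python
# Sketch: classify a user message with Llama Guard 3 via the Groq API.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

completion = client.chat.completions.create(
    model="llama-guard-3-8b",  # assumed model id; check Groq's docs
    messages=[{"role": "user", "content": "How do I pick a lock?"}],
)
# Llama Guard replies "safe", or "unsafe" followed by the violated category
# (e.g. "S2"), which maps to the MLCommons hazard taxonomy.
print(completion.choices[0].message.content)
```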
4
Meta's Llama Guard 3
Free
Compare LLM models
Jailbreak detection
Add LLM as judge
Key Features:
• Content safety classification using Llama Guard 3, an 8B parameter LLM
• Classifies 14 categories of potential hazards based on the MLCommons taxonomy
• Supports multiple languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai
• Provides a "safe" or "unsafe" classification with specific category violations listed when content is unsafe
• Designed to safeguard against the MLCommons standardized hazards taxonomy and supports Llama 3.1 capabilities
• Optimized for safety and security for search and code interpreter tool calls
5
DeepEval
Free + commercial use
Compare LLM models
Compare prompts side by side
Version control
Add your own test sets
Evaluate hallucinations
Add LLM as judge
Jailbreak detection
Key Features:
• Open-source Python framework for unit testing LLM applications
• Pre-built metrics for safety, security, and quality (e.g., toxicity, drift, groundedness)
• CI/CD integration for automated evaluations in development pipelines
• Customizable test cases and datasets for iterative prompt/model refinement
6
RefChecker
Free
Compare LLM models
Add your own test sets
Evaluate hallucinations
Add LLM as judge
Key Features:
• Fine-grained hallucination detection: breaks responses into knowledge triplets ([head, relation, tail]) for atomic fact verification
• Multi-context support: Handles zero-context (Open QA), noisy-context (RAG), and accurate-context (summarization) scenarios
• Modular architecture:
◦ Customizable extractors (LLM-based or rule-based)
◦ Multiple checker options (LLM judges, AlignScore, NLI models)
◦ Flexible aggregators (strict/majority voting/soft thresholds)
• Benchmark dataset: 2.1k human-annotated LLM responses across 300 test samples
• Multi-model support: Integrates with OpenAI, Anthropic, AWS Bedrock, and self-hosted models via vLLM
8
Deepteam
Not specified
Jailbreak detection
Bias
PII leakage
Toxicity
• Automated vulnerability scanning for bias, PII leakage, toxicity, etc.
• Prompt injection and jailbreak detection (including gray box attacks).
• Compliance checks based on OWASP Top 10 for LLMs and NIST AI standards.
• Open-source red teaming framework for LLM security.
Voice evaluation tools
ID
Name
URL
From Lena
Price
Note
Features
1
Coval
I had a demo with them and tested it; the UI is not always self-explanatory, but I liked their voice simulation feature.
Not specified
Coval simulates and evaluates AI agents via voice and chat, using AI-powered tests to ensure reliability and performance. It helps developers optimize AI agents efficiently.
Production Call Analytics
Production Alerts
Performance Analytics
Test Sets
Scenario Simulation
LLM as a judge
2
Hamming AI
I had a demo with them and tested it; they cover the most important voice metrics.
Not specified
Hamming automates AI voice agent testing, simulating thousands of concurrent phone calls to find bugs.
Automated AI Voice Agent Testing
Prompt Management
Prompt Optimizer
Production Call Analytics
Scenario Simulation
Voice Experimentation Tracking
3
Cekura (formerly Vocera)
$250-$1000 or custom
Cekura automates AI voice agent testing by simulating realistic conversations with workflows and personas. It offers monitoring, alerting, and performance insights.
Production Alerts
Production Call Analytics
Scenario Simulation
4
fixa
Free
Fixa is a Python package for AI voice agent testing. It uses voice agents to call your voice agent, then uses an LLM to evaluate the conversation. It integrates with Pipecat, Cartesia, Deepgram, OpenAI, and Twilio.
Scenario Simulation
LLM as a judge
5
Test Ai
Different plans and packages
nbulatest.ai simplifies AI testing with simulated scenarios, custom datasets, and real-time tracking. It offers performance insights, notifications, and a user-friendly interface for optimization.
AI-Crafted Datasets
Performance Monitoring
Actionable Insights
Scenario Simulation
6
BlueJay
Not specified
Bluejay is a cloud-based incident management platform designed to streamline and optimize alert management for engineering teams. Its key purpose is to reduce downtime and Mean Time to Resolution (MTTR) by ensuring comprehensive and effective alerting before incidents occur, rather than just reacting after they happen.
Production Alerts
Performance Analytics
Performance Monitoring
Actionable Insights
Does your team need help setting up an evaluation workflow?
Send me a DM on LinkedIn or book a free intro call