LiveMedBench

A Live Medical Benchmark for Evaluating Large Language Models

LiveMedBench measures both the overall quality of a model's medical responses and its robustness over time, using a rubric-based evaluation framework built on real-world, continuously updated medical data.

🤗 Hugging Face 📦 GitHub 📄 arXiv

Key Features

🕒 Live & Time-stamped

Real-world medical cases with temporal information, enabling evaluation of model robustness over time.

📋 Rubric-based Evaluation

Objective, criterion-specific evaluation framework aligned with physician assessment standards.

🌐 Real-world Cases

Authentic medical consultation scenarios rather than static exam-style questions.

📊 Comprehensive Metrics

Per-month and overall performance metrics for temporal trend analysis.
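To make the rubric-based and per-month metrics above more concrete, the following is a minimal sketch of how criterion-level judgments could be rolled up into a case score, then into per-month and overall averages. It is not the benchmark's actual scoring code. The class names (`CriterionResult`, `CaseResult`), the `weight` field, the weighted-fraction formula, and the unweighted mean over cases are all assumptions made for illustration; the page does not specify any of them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CriterionResult:
    """One rubric criterion judged against a model response (hypothetical schema)."""
    weight: float  # assumed importance weight; not specified by the source
    met: bool      # whether the grader judged the criterion satisfied


@dataclass
class CaseResult:
    """All criterion judgments for one medical case (hypothetical schema)."""
    month: str                       # case timestamp truncated to "YYYY-MM"
    criteria: List[CriterionResult]


def case_score(case: CaseResult) -> float:
    """Fraction of rubric weight earned on one case (assumed formula)."""
    total = sum(c.weight for c in case.criteria)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in case.criteria if c.met)
    return earned / total


def monthly_and_overall(cases: List[CaseResult]) -> Tuple[Dict[str, float], float]:
    """Return (mean case score per month, mean case score over all cases)."""
    by_month: Dict[str, List[float]] = defaultdict(list)
    for case in cases:
        by_month[case.month].append(case_score(case))
    # Sort by month string so the trend reads chronologically.
    per_month = {m: sum(s) / len(s) for m, s in sorted(by_month.items())}
    all_scores = [s for scores in by_month.values() for s in scores]
    overall = sum(all_scores) / len(all_scores) if all_scores else 0.0
    return per_month, overall


if __name__ == "__main__":
    # Made-up judgments purely for demonstration; not benchmark data.
    demo = [
        CaseResult("2024-01", [CriterionResult(2.0, True),
                               CriterionResult(1.0, False),
                               CriterionResult(1.0, True)]),
        CaseResult("2024-01", [CriterionResult(1.0, False),
                               CriterionResult(1.0, True)]),
        CaseResult("2024-02", [CriterionResult(1.0, True)]),
    ]
    per_month, overall = monthly_and_overall(demo)
    print(per_month, overall)
```

Keeping the per-month averages separate from the overall average is what allows a temporal trend to be read off: a model whose score drops on more recent months may be relying on cases it saw during training.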

LiveMedBench Pipeline

Data Collection

Multi-Agent Curation

Rubric Generation

Model Evaluation

Human Assessment

Dataset Overview

2,756 Medical Cases

Medical Specialties

2023-2026 Time Range

16,702 Evaluation Criteria

Model Performance


Rank	Model	Type	Overall Score