LiveMedBench

A Live Medical Benchmark for Evaluating Large Language Models

LiveMedBench measures not only the overall quality of a model's medical responses but also its robustness over time, using a rubric-based evaluation framework on real-world medical data that is updated continuously.

Key Features

🕒 Live & Time-stamped
Real-world medical cases with temporal information, enabling evaluation of model robustness over time.

📋 Rubric-based Evaluation
Objective, criterion-specific evaluation framework aligned with physician assessment standards.

🌐 Real-world Cases
Authentic medical consultation scenarios rather than static exam-style questions.

📊 Comprehensive Metrics
Per-month and overall performance metrics for temporal trend analysis.
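
To make the rubric-based scoring and the per-month versus overall metrics concrete, here is a minimal sketch. It is not LiveMedBench's actual scoring code: the `ScoredCase` schema, the unweighted met/unmet criteria, and the plain mean aggregation are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ScoredCase:
    """One evaluated case (hypothetical schema, not the benchmark's own)."""
    month: str                 # time-stamp bucket, e.g. "2024-03"
    criteria_met: list[bool]   # one flag per rubric criterion


def case_score(case: ScoredCase) -> float:
    """Fraction of rubric criteria satisfied (assumes unweighted criteria)."""
    if not case.criteria_met:
        raise ValueError("a case needs at least one rubric criterion")
    return sum(case.criteria_met) / len(case.criteria_met)


def summarize(cases: list[ScoredCase]) -> tuple[dict[str, float], float]:
    """Return (mean score per month, mean score over all cases)."""
    if not cases:
        raise ValueError("no cases to summarize")
    by_month: dict[str, list[float]] = defaultdict(list)
    for case in cases:
        by_month[case.month].append(case_score(case))
    # Sort months so the trend reads chronologically.
    per_month = {m: sum(s) / len(s) for m, s in sorted(by_month.items())}
    all_scores = [s for scores in by_month.values() for s in scores]
    overall = sum(all_scores) / len(all_scores)
    return per_month, overall


# Toy data, invented purely for the example.
cases = [
    ScoredCase("2024-01", [True, True, False, True]),
    ScoredCase("2024-01", [True, False]),
    ScoredCase("2024-02", [False, False, True, False]),
]
per_month, overall = summarize(cases)
print(per_month)  # {'2024-01': 0.625, '2024-02': 0.25}
print(overall)    # 0.5
```

Reading the per-month values in chronological order is what exposes a temporal trend: a model whose score drops on more recent months may be leaning on cases that resemble its training data.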

LiveMedBench Pipeline

1. Data Collection
2. Multi-Agent Curation
3. Rubric Generation
4. Model Evaluation
5. Human Assessment

Dataset Overview

LiveMedBench Dataset Statistics
Medical Cases: 2,756
Medical Specialties: 38
Time Range: 2023-2026
Evaluation Criteria: 16,702

Model Performance

Last updated:

Rank | Model | Type | Overall Score
LiveMedBench Benchmark Results