Model Leaderboard

AIWF Medium Context Benchmark: a 30-turn conversation evaluation with a ~12K-token knowledge base. Models are scored on tool-use accuracy, instruction following, and knowledge-base grounding.

View benchmark on GitHub

Judged by Claude Opus 4.5
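
As a rough illustration of how per-turn judgments on the three scored dimensions might be rolled up into per-model numbers, here is a minimal sketch. It assumes the judge emits one pass/fail verdict per dimension per turn; the `TurnVerdict` type, its field names, and the `pass_rates` helper are hypothetical and are not taken from the benchmark's actual code or output schema.

```python
from dataclasses import dataclass


@dataclass
class TurnVerdict:
    """Hypothetical per-turn judge output (illustrative field names)."""
    tool_use_correct: bool
    instruction_followed: bool
    kb_grounded: bool


def pass_rates(verdicts):
    """Return the fraction of turns that pass each scored dimension."""
    n = len(verdicts)
    if n == 0:
        raise ValueError("no verdicts to aggregate")
    return {
        "tool_use": sum(v.tool_use_correct for v in verdicts) / n,
        "instruction_following": sum(v.instruction_followed for v in verdicts) / n,
        "kb_grounding": sum(v.kb_grounded for v in verdicts) / n,
    }


# Made-up example data: a 30-turn conversation with one tool-use miss
# and two grounding misses. These are not real benchmark results.
verdicts = [TurnVerdict(True, True, True) for _ in range(30)]
verdicts[4].tool_use_correct = False
verdicts[9].kb_grounded = False
verdicts[19].kb_grounded = False

rates = pass_rates(verdicts)
for dimension, rate in rates.items():
    print(f"{dimension}: {rate:.1%}")
```

Reporting each dimension separately, rather than as one blended score, keeps a model that is strong on grounding but weak on tool calls from being hidden behind an average. Whether the real leaderboard aggregates this way is an assumption.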