Model Leaderboard
AIWF Medium Context Benchmark: a 30-turn conversation evaluation with a ~12K-token knowledge base. Models are scored on tool-use accuracy, instruction following, and knowledge-base grounding.
View benchmark on GitHub