Turtle Benchmark

A crowdsourced, cheat-proof benchmark for evaluating LLM reasoning & understanding capabilities.

We first discovered in an online lateral thinking puzzle game that, when acting as judges, many LLMs fall far short of humans in both reasoning and the accuracy of their judgments on player questions. We hope the Turtle Benchmark can serve as a standardized metric for evaluating LLMs' reasoning and understanding abilities, helping researchers and AI companies improve their models.
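
To make the judging task concrete, here is a minimal sketch of how such an evaluation might be scored. It assumes the judge is shown the puzzle's surface story, the hidden story, and a player question, and must reply with one label that is compared against a human gold label. The `call_model` wrapper, the prompt wording, and the label set are hypothetical illustrations, not this repository's actual code.

```python
from typing import Callable

# Hypothetical label set: the judge answers each player question with one of
# "Correct", "Incorrect", or "Unknown" relative to the hidden story.
LABELS = {"Correct", "Incorrect", "Unknown"}

# Hypothetical prompt template for the LLM judge.
JUDGE_PROMPT = (
    "You are the judge in a lateral thinking puzzle.\n"
    "Surface story: {surface}\n"
    "Hidden story: {bottom}\n"
    "Player question: {question}\n"
    "Answer with exactly one word: Correct, Incorrect, or Unknown."
)

def score_judge(
    call_model: Callable[[str], str],  # assumed wrapper around an LLM API
    cases: list[dict],                 # each: surface, bottom, question, gold
) -> float:
    """Return the judge's accuracy against human gold labels."""
    correct = 0
    for case in cases:
        prompt = JUDGE_PROMPT.format(
            surface=case["surface"],
            bottom=case["bottom"],
            question=case["question"],
        )
        prediction = call_model(prompt).strip()
        # Only an exact, valid label matching the human annotation counts.
        if prediction in LABELS and prediction == case["gold"]:
            correct += 1
    return correct / len(cases) if cases else 0.0
```

Because the gold labels come from real player questions and human judgments gathered in the game, a model cannot memorize its way to a high score, which is what makes this style of benchmark hard to cheat.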