Unlike exam-style benchmarks, MENTAT captures the real ambiguities psychiatrists face daily across five critical tasks: diagnosis, treatment, monitoring, triage, and documentation. Each question has five answer options. We remove all patient demographic information that is not relevant to the decision, which allows detailed studies of how demographic attributes (age, gender, ethnicity, nationality, …) affect model performance.
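As a rough illustration of how demographic-neutral questions make such studies possible, the sketch below fills a question template with every combination of a few demographic attributes. The template text, the attribute values, and the function name `expand_demographics` are invented for this example and are not taken from the MENTAT dataset or code.

```python
from itertools import product


def expand_demographics(template, attributes):
    """Yield (assignment, question_text) for every combination of attribute values.

    `template` is a hypothetical question with named placeholders;
    `attributes` maps each placeholder name to the values to test.
    """
    keys = list(attributes)
    for values in product(*(attributes[k] for k in keys)):
        assignment = dict(zip(keys, values))
        yield assignment, template.format(**assignment)


# Toy template, invented for illustration only (not a MENTAT question).
template = (
    "A {age}-year-old {gender} patient reports two weeks of low mood "
    "and poor sleep. What is the most appropriate next step?"
)
attributes = {"age": [25, 70], "gender": ["female", "male", "non-binary"]}

variants = list(expand_demographics(template, attributes))
for assignment, text in variants:
    print(assignment, "->", text)
```

Scoring a model on each variant and comparing accuracy across assignments is one simple way to check whether demographic attributes alone change its answers.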
The questions in the triage and documentation categories are deliberately ambiguous, reflecting the challenges and nuances of these tasks. For these questions, we collect annotations and build a preference dataset using a hierarchical Bradley-Terry model, which enables more nuanced analysis with soft labels and uncertainties.
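For readers unfamiliar with Bradley-Terry models, here is a minimal sketch of the plain (non-hierarchical) version, fitted with the classic minorization-maximization update. It is not the hierarchical model used for MENTAT. The function names, the toy annotations, and the reading of normalized strengths as soft labels are assumptions made for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_items, iters=200):
    """Fit plain Bradley-Terry strengths with the MM update.

    comparisons: list of (winner, loser) index pairs.
    Returns strengths normalized to sum to 1.
    """
    wins = [0.0] * n_items
    pair_counts = defaultdict(float)
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[(min(winner, loser), max(winner, loser))] += 1.0

    strengths = [1.0 / n_items] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n_ij in pair_counts.items():
            shared = n_ij / (strengths[i] + strengths[j])
            denom[i] += shared
            denom[j] += shared
        # Items that never appear in a comparison keep their current strength.
        updated = [
            wins[i] / denom[i] if denom[i] > 0 else strengths[i]
            for i in range(n_items)
        ]
        total = sum(updated)
        strengths = [s / total for s in updated]
    return strengths


def preference_probability(strengths, i, j):
    """Model probability that option i is preferred over option j."""
    return strengths[i] / (strengths[i] + strengths[j])


# Toy annotations over five answer options (invented, not MENTAT data):
# each pair means "annotator preferred the first option over the second".
toy_annotations = [
    (0, 1), (0, 1), (1, 0), (0, 2), (2, 3), (3, 2),
    (1, 2), (3, 4), (4, 3), (2, 4), (0, 3),
]
soft_labels = fit_bradley_terry(toy_annotations, n_items=5)
print([round(p, 3) for p in soft_labels])
```

Because the strengths are normalized, they can be read as a soft label distribution over the five options instead of a single correct answer. A hierarchical variant would add structure on top of this, for example per-annotator effects, which the sketch leaves out.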
We find that:
- models show significant bias based on patient demographics (gender, ethnicity, age)
- high multiple-choice accuracy does not guarantee consistent free-form responses
- even top models struggle with ambiguous real-world scenarios
MENTAT was created by nine practicing psychiatrists without LLM assistance. Its expert-annotated questions are designed to expose fairness issues that only surface at scale.
Check out the paper and code here: https://openreview.net/forum?id=tSy7OtONsg
Cite the paper [APA]:
Lamparth, M., Grabb, D., Franks, A., Gershan, S., Kunstman, K. N., Lulla, A., Roots, M. D., Sharma, M., Shrivastava, A., Vasan, N., & Waickman, C. (2026). Moving beyond medical exams: A clinician-annotated fairness dataset of real-world tasks and ambiguity in mental healthcare. In Proceedings of the Fourteenth International Conference on Learning Representations (ICLR 2026).