Psychiatry-Bench: A Multi-Task Benchmark for LLMs in Psychiatry
Related articles
arXiv – cs.AI • Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective
arXiv – cs.AI • HardcoreLogic: Benchmark tests logic models with rare puzzle variants
arXiv – cs.AI • FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation
arXiv – cs.AI • TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation
arXiv – cs.AI • Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards
arXiv – cs.AI • Radiology's Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology