Live Leaderboards

MathArena: Uncontaminated Math Competitions
BaxBench: Secure and Correct Backends
SWT-Bench: Assessing Test-writing Capabilities
EU AI Act Compliance Leaderboard

Publications

2025

BaxBench: Can LLMs Generate Secure and Correct Backends?
Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev
ICML 2025 Spotlight
MathConstruct: Challenging LLM Reasoning with Constructive Proofs
Mislav Balunović*, Jasper Dekoninck*, Nikola Jovanović, Ivo Petrov, Martin Vechev
ICML 2025 (* equal contribution)
The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs
Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunović, Martin Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Milenov Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar D. Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, Margulan Ismoldayev
arXiv 2025
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Mislav Balunović, Jasper Dekoninck, Nikola Jovanović, Ivo Petrov, Martin Vechev
arXiv 2025
Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
Jasper Dekoninck, Maximilian Baader, Martin Vechev
ICLR 2025

2024

A Synthetic Dataset for Personal Attribute Inference
Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev
NeurIPS Datasets and Benchmarks 2024
ConStat: Performance-Based Contamination Detection in Large Language Models
Jasper Dekoninck, Mark Niklas Müller, Martin Vechev
NeurIPS 2024
SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents
Niels Mündler, Mark Niklas Müller, Jingxuan He, Martin Vechev
NeurIPS 2024
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act
Philipp Guldimann, Alexander Spiridonov, Robin Staab, Nikola Jovanović, Mark Vero, Velko Vechev, Anna Gueorguieva, Mislav Balunović, Nikola Konstantinov, Pavol Bielik, Petar Tsankov, Martin Vechev
arXiv 2024
Evading Data Contamination Detection for Language Models is (too) Easy
Jasper Dekoninck, Mark Niklas Müller, Maximilian Baader, Marc Fischer, Martin Vechev
arXiv 2024