Publications
2025
BaxBench: Can LLMs Generate Secure and Correct Backends?
Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev
ICML
2025
Spotlight

MathConstruct: Challenging LLM Reasoning with Constructive Proofs
Mislav Balunović*, Jasper Dekoninck*, Nikola Jovanović, Ivo Petrov, Martin Vechev
ICML
2025
* Equal contribution
The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs
Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunović, Martin Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Milenov Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar D. Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, Margulan Ismoldayev
arXiv
2025
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Mislav Balunović, Jasper Dekoninck, Nikola Jovanović, Ivo Petrov, Martin Vechev
arXiv
2025
Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
Jasper Dekoninck, Maximilian Baader, Martin Vechev
ICLR
2025
2024
A Synthetic Dataset for Personal Attribute Inference
Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev
NeurIPS Datasets and Benchmarks
2024
ConStat: Performance-Based Contamination Detection in Large Language Models
Jasper Dekoninck, Mark Niklas Müller, Martin Vechev
NeurIPS
2024
SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents
Niels Mündler, Mark Niklas Müller, Jingxuan He, Martin Vechev
NeurIPS
2024
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act
Philipp Guldimann, Alexander Spiridonov, Robin Staab, Nikola Jovanović, Mark Vero, Velko Vechev, Anna Gueorguieva, Mislav Balunović, Nikola Konstantinov, Pavol Bielik, Petar Tsankov, Martin Vechev
arXiv
2024
Evading Data Contamination Detection for Language Models is (too) Easy
Jasper Dekoninck, Mark Niklas Müller, Maximilian Baader, Marc Fischer, Martin Vechev
arXiv
2024