Teaser

Introduction

To advance the field of multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.9K test samples across 20 programming languages and covers the automated program repair (APR) task, the bug localization (BL) task, and the bug identification (BI) task.
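For concreteness, one test sample can be pictured as a structured record pairing buggy code with its reference fix and metadata. The Python sketch below is only an illustration; the field names (language, buggy_code, fixed_code, error_type, bug_line, example_tests) are our own assumptions, not MdEval's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DebugSample:
    """Hypothetical shape of one MdEval test sample (field names are illustrative)."""
    language: str          # one of the 20 programming languages
    buggy_code: str        # collected code snippet containing a bug
    fixed_code: str        # reference repaired version (used for APR evaluation)
    error_type: str        # generic or language-specific error category
    bug_line: int          # line index of the bug (used for bug localization)
    example_tests: List[str] = field(default_factory=list)  # tests used to check repairs

sample = DebugSample(
    language="Python",
    buggy_code="def add(a, b):\n    return a - b\n",
    fixed_code="def add(a, b):\n    return a + b\n",
    error_type="operator misuse",
    bug_line=2,
    example_tests=["assert add(1, 2) == 3"],
)
```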

Benchmark Overview


Comparison between MdEval and other code debugging benchmarks. MdEval provides a significantly more comprehensive multilingual view by expanding the variety of programming languages and covering language-specific error types.

Error types


Error types in MdEval. Part (a) shows generic error types, and Part (b) lists language-specific error types.

Human Annotation & Quality Control


We collected and filtered code snippets from GitHub. Before annotation, we summarized the error types, and annotators then labeled the code according to these types. To ensure quality, GPT-4o was used to evaluate each annotation on four criteria: difficulty, ambiguity, error type, and solvability. Finally, annotators exchanged data with each other to minimize bias and errors.
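As a rough sketch of the GPT-4o quality check described above, the snippet below asks the model to assess one annotated sample against the four criteria. It assumes the official openai Python client with an API key in the environment; the concrete prompt and scoring rubric are illustrative placeholders, not the exact ones used for MdEval.

```python
from openai import OpenAI  # assumes the official openai client

client = OpenAI()

CRITERIA = ["difficulty", "ambiguity", "error type", "solvability"]

def review_annotation(buggy_code: str, error_type: str) -> str:
    """Ask GPT-4o to assess one annotated sample on the four quality criteria (illustrative prompt)."""
    prompt = (
        "You are reviewing an annotation for a code-debugging benchmark.\n"
        f"Labeled error type: {error_type}\n"
        f"Buggy code:\n{buggy_code}\n\n"
        "Assess the annotation on these criteria and flag any problems: "
        + ", ".join(CRITERIA) + "."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```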

Examples of three tasks


Examples of multilingual automated program repair (APR), bug localization (BL), and bug identification (BI).
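Conceptually, the three tasks can be derived from the same annotated sample by changing the question asked of the model. The sketch below shows hypothetical prompt templates for APR, BL, and BI; the actual MdEval instructions may be worded differently.

```python
def build_prompt(task: str, language: str, buggy_code: str) -> str:
    """Build an illustrative prompt for one of the three tasks (templates are assumptions)."""
    header = f"The following {language} code contains a bug.\n\n{buggy_code}\n\n"
    if task == "APR":   # automated program repair: produce a corrected version
        return header + "Return the repaired code."
    if task == "BL":    # bug localization: point to the faulty line(s)
        return header + "Identify the line number(s) where the bug occurs."
    if task == "BI":    # bug identification: name the error category
        return header + "Identify the type of error in this code."
    raise ValueError(f"unknown task: {task}")
```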

Performance across error types


Model performance on the automated program repair task varies across error types, highlighting each model's strengths and weaknesses in handling specific bug categories.
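One simple way to obtain such a per-error-type breakdown is to group first-attempt execution results by error category before computing Pass@1. The sketch below assumes a flat list of (error_type, passed) records; MdEval's actual evaluation harness may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def pass_at_1_by_error_type(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute Pass@1 (%) per error type from (error_type, first_attempt_passed) records."""
    totals, passed = defaultdict(int), defaultdict(int)
    for error_type, ok in results:
        totals[error_type] += 1
        passed[error_type] += int(ok)
    return {t: 100.0 * passed[t] / totals[t] for t in totals}

# Example: two categories with different repair success rates.
print(pass_at_1_by_error_type([
    ("missing semicolon", True),
    ("missing semicolon", True),
    ("operator misuse", False),
    ("operator misuse", True),
]))
```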

Another Two Automated Program Repair Settings


Two additional automated program repair settings are designed to simulate realistic user queries. Part (a) presents results for the scenario where models are provided with the buggy code along with example test cases, while Part (b) illustrates results for the scenario where only the buggy code is provided to the models.
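The two settings differ only in what accompanies the buggy code in the user query, as sketched below with an assumed template (not MdEval's exact prompt).

```python
from typing import List, Optional

def build_user_query(buggy_code: str, example_tests: Optional[List[str]] = None) -> str:
    """Simulate the two realistic APR query settings: with example tests (a) or code only (b)."""
    query = f"Please fix the bug in the following code:\n\n{buggy_code}\n"
    if example_tests:  # setting (a): buggy code plus example test cases
        query += "\nThe fixed code should pass these example tests:\n"
        query += "\n".join(example_tests)
    # setting (b): only the buggy code, with no tests or error description
    return query
```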

Code Review Task


Besides automated program repair, code review also plays a crucial role in software development. To analyze how different models perform on the code review task, we conducted experiments on MdEval.
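Assuming the code review task is framed as a binary judgment of whether a snippet would pass review, a minimal sketch of the query and its accuracy scoring might look as follows; the prompt wording and scoring rule are illustrative assumptions, not MdEval's exact setup.

```python
from typing import List

def review_query(code: str) -> str:
    """Illustrative code-review prompt: ask whether the snippet contains a bug."""
    return f"Review the following code and answer 'yes' if it contains a bug, otherwise 'no':\n\n{code}"

def review_accuracy(predictions: List[str], labels: List[bool]) -> float:
    """Score binary review decisions (model replies) against ground-truth buggy/clean labels."""
    correct = sum(p.strip().lower().startswith("yes") == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)
```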

Effect of Bug Location for Automated Program Repair


To verify whether bug location information has a positive impact when large language models perform automated program repair, we designed and conducted a series of comparative experiments. The figure shows Pass@1 (%) scores when only the buggy code is provided versus when additional bug location information is supplied.
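A minimal sketch of the two conditions, assuming the bug location hint is given as a line number appended to the repair prompt (the hint format is our assumption):

```python
from typing import Optional

def apr_prompt(buggy_code: str, bug_line: Optional[int] = None) -> str:
    """Build an APR prompt, optionally adding a bug location hint (format is illustrative)."""
    prompt = f"Fix the bug in the following code:\n\n{buggy_code}\n"
    if bug_line is not None:  # condition with additional bug location information
        prompt += f"\nHint: the bug is located on line {bug_line}."
    return prompt

# The two experimental conditions differ only in whether the hint is present.
code = "def add(a, b):\n    return a - b\n"
print(apr_prompt(code))               # buggy code only
print(apr_prompt(code, bug_line=2))   # buggy code + bug location
```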