Teaser

Introduction

To advance the field of multilingual debugging with LLMs, we propose the first massively multilingual debugging benchmark, which includes 3.9K test samples across 20 programming languages and covers the automated program repair (APR) task, the bug localization (BL) task, and the bug identification (BI) task.
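For concreteness, one test sample can be pictured as a structured record pairing buggy code with its reference fix and metadata. The Python sketch below is only an illustration; the field names (language, buggy_code, fixed_code, error_type, bug_line, example_tests) are our own assumptions, not MdEval's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DebugSample:
    """Hypothetical shape of one MdEval test sample (field names are illustrative)."""
    language: str          # one of the 20 programming languages
    buggy_code: str        # collected code snippet containing a bug
    fixed_code: str        # reference repaired version (used for APR evaluation)
    error_type: str        # generic or language-specific error category
    bug_line: int          # line index of the bug (used for bug localization)
    example_tests: List[str] = field(default_factory=list)  # tests used to check repairs

sample = DebugSample(
    language="Python",
    buggy_code="def add(a, b):\n    return a - b\n",
    fixed_code="def add(a, b):\n    return a + b\n",
    error_type="operator misuse",
    bug_line=2,
    example_tests=["assert add(1, 2) == 3"],
)
```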

Benchmark Overview


Comparison between MdEval and other code debugging benchmarks. MdEval provides a significantly more comprehensive multilingual view by expanding the variety of programming languages and covering language-specific error types.

Error types


Error types in MdEval. Part (a) shows generic error types, and Part (b) lists language-specific error types.

Human Annotation & Quality Control


We collected and filtered code snippets from GitHub. Before annotation, we summarized the error types, and annotators then labeled the code according to these types. To ensure quality, GPT-4o was used to evaluate each annotation on four criteria: difficulty, ambiguity, error type, and solvability. Finally, annotators exchanged data with each other to minimize bias and errors.
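As a rough sketch of the GPT-4o quality check described above, the snippet below asks the model to assess one annotated sample against the four criteria. It assumes the official openai Python client with an API key in the environment; the concrete prompt and scoring rubric are illustrative placeholders, not the exact ones used for MdEval.

```python
from openai import OpenAI  # assumes the official openai client

client = OpenAI()

CRITERIA = ["difficulty", "ambiguity", "error type", "solvability"]

def review_annotation(buggy_code: str, error_type: str) -> str:
    """Ask GPT-4o to assess one annotated sample on the four quality criteria (illustrative prompt)."""
    prompt = (
        "You are reviewing an annotation for a code-debugging benchmark.\n"
        f"Labeled error type: {error_type}\n"
        f"Buggy code:\n{buggy_code}\n\n"
        "Assess the annotation on these criteria and flag any problems: "
        + ", ".join(CRITERIA) + "."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```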

Examples of three tasks


Examples of multilingual automated program repair (APR), bug localization (BL), and bug identification (BI).
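Conceptually, the three tasks can be derived from the same annotated sample by changing the question asked of the model. The sketch below shows hypothetical prompt templates for APR, BL, and BI; the actual MdEval instructions may be worded differently.

```python
def build_prompt(task: str, language: str, buggy_code: str) -> str:
    """Build an illustrative prompt for one of the three tasks (templates are assumptions)."""
    header = f"The following {language} code contains a bug.\n\n{buggy_code}\n\n"
    if task == "APR":   # automated program repair: produce a corrected version
        return header + "Return the repaired code."
    if task == "BL":    # bug localization: point to the faulty line(s)
        return header + "Identify the line number(s) where the bug occurs."
    if task == "BI":    # bug identification: name the error category
        return header + "Identify the type of error in this code."
    raise ValueError(f"unknown task: {task}")
```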

Performance across error types


Model performance on the automated program repair task varies across error types, highlighting each model's strengths and weaknesses in handling specific bug categories.
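One simple way to obtain such a per-error-type breakdown is to group first-attempt execution results by error category before computing Pass@1. The sketch below assumes a flat list of (error_type, passed) records; MdEval's actual evaluation harness may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def pass_at_1_by_error_type(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute Pass@1 (%) per error type from (error_type, first_attempt_passed) records."""
    totals, passed = defaultdict(int), defaultdict(int)
    for error_type, ok in results:
        totals[error_type] += 1
        passed[error_type] += int(ok)
    return {t: 100.0 * passed[t] / totals[t] for t in totals}

# Example: two categories with different repair success rates.
print(pass_at_1_by_error_type([
    ("missing semicolon", True),
    ("missing semicolon", True),
    ("operator misuse", False),
    ("operator misuse", True),
]))
```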

Another Two Automated Program Repair Settings


Two additional automated program repair settings are designed to simulate realistic user queries. Part (a) presents results for the scenario where models are provided with the buggy code along with example test cases, while Part (b) illustrates results for the scenario where only the buggy code is provided to the models.
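The two settings differ only in what accompanies the buggy code in the user query, as sketched below with an assumed template (not MdEval's exact prompt).

```python
from typing import List, Optional

def build_user_query(buggy_code: str, example_tests: Optional[List[str]] = None) -> str:
    """Simulate the two realistic APR query settings: with example tests (a) or code only (b)."""
    query = f"Please fix the bug in the following code:\n\n{buggy_code}\n"
    if example_tests:  # setting (a): buggy code plus example test cases
        query += "\nThe fixed code should pass these example tests:\n"
        query += "\n".join(example_tests)
    # setting (b): only the buggy code, with no tests or error description
    return query
```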

Code Review Task


Besides automated program repair, code review also plays a crucial role in software development. To analyze how different models perform on the code review task, we conducted experiments on MdEval.
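Assuming the code review task is framed as a binary judgment of whether a snippet would pass review, a minimal sketch of the query and its accuracy scoring might look as follows; the prompt wording and scoring rule are illustrative assumptions, not MdEval's exact setup.

```python
from typing import List

def review_query(code: str) -> str:
    """Illustrative code-review prompt: ask whether the snippet contains a bug."""
    return f"Review the following code and answer 'yes' if it contains a bug, otherwise 'no':\n\n{code}"

def review_accuracy(predictions: List[str], labels: List[bool]) -> float:
    """Score binary review decisions (model replies) against ground-truth buggy/clean labels."""
    correct = sum(p.strip().lower().startswith("yes") == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)
```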

Effect of Bug Location for Automated Program Repair


To verify whether bug location information has a positive impact when large language models perform automated program repair, we designed and conducted a series of comparative experiments. The figure shows Pass@1 (%) scores when only the buggy code is provided versus when additional bug location information is supplied.
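A minimal sketch of the two conditions, assuming the bug location hint is given as a line number appended to the repair prompt (the hint format is our assumption):

```python
from typing import Optional

def apr_prompt(buggy_code: str, bug_line: Optional[int] = None) -> str:
    """Build an APR prompt, optionally adding a bug location hint (format is illustrative)."""
    prompt = f"Fix the bug in the following code:\n\n{buggy_code}\n"
    if bug_line is not None:  # condition with additional bug location information
        prompt += f"\nHint: the bug is located on line {bug_line}."
    return prompt

# The two experimental conditions differ only in whether the hint is present.
code = "def add(a, b):\n    return a - b\n"
print(apr_prompt(code))               # buggy code only
print(apr_prompt(code, bug_line=2))   # buggy code + bug location
```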