I'm currently working in the research field of Software Engineering. I'm working on new patch generation and fault localization techniques. I'm particularly interested in the integration of the patch generation technique in the production environment.
- PDF • Experiment Results- Proceedings of the 18th International Conference on Mining Software Repositories (MSR '21) - Research in automatic program repair has shown that real bugs can be automatically fixed. However, there are several challenges involved in such a task that are not yet fully addressed. As an example, consider that a test-suite-based repair tool performs a change in a program to fix a bug spotted by a failing test case, but then the same or another test case fails. This could mean that the change is a partial fix for the bug or that another bug was manifested. However, the repair tool discards the change and possibly performs other repair attempts. One might wonder if the applied change should be also applied in other locations in the program so that the bug is fully fixed. In this paper, we are interested in investigating the extent of bug fix changes being cloned by developers within patches. Our goal is to investigate the need of multi-location repair by using identical or similar changes in identical or similar contexts. To do so, we analyzed 3,049 multi-hunk patches from the ManySStuBs4J dataset, which is a large dataset of single statement bug fix changes. We found out that 68% of the multi-hunk patches contain at least one change clone group. Moreover, most of these patches (70%) are strictly-cloned ones, which are patches fully composed of changes belonging to one single change clone group. Finally, most of the strictly-cloned patches (89%) contain change clones with identical changes, independently of their contexts. We conclude that automated solutions for creating patches composed of identical or similar changes can be useful for fixing bugs.
Acceptance rate: 66%, 8/12.
- PDF • Dataset- Proceedings of the 18th International Conference on Mining Software Repositories (MSR '21) - Software engineering researchers look for software artifacts to study their characteristics or to evaluate new techniques. In this paper, we introduce DUETS, a new dataset of software libraries and their clients. This dataset can be exploited to gain many different insights, such as API usage, usage inputs, or novel observations about the test suites of clients and libraries. DUETS is meant to support both static and dynamic analysis. This means that the libraries and the clients compile correctly, they are executable and their test suites pass. The dataset is composed of open-source projects that have more than five stars on GitHub. The final dataset contains 395 libraries and 2,874 clients. Additionally, we provide the raw data that we use to create this dataset, such as 34,560 pom.xml files or the complete file list from 34,560 projects. This dataset can be used to study how libraries are used by their clients or as a list of software projects that succesfully build. The client’s test suite can be used as an additional verification step for code transformation techniques that modify the libraries.
- PDF • Source code- IEEE Transactions on Software Engineering - Automatic program repair (APR) aims to reduce the cost of manually fixing software defects. However, APR suffers from generating a multitude of overfitting patches, those patches that fail to correctly repair the defect beyond making the tests pass. This paper presents a novel overfitting patch detection system called ODS to assess the correctness of APR patches. ODS first statically compares a patched program and a buggy program in order to extract code features at the abstract syntax tree (AST) level. Then, ODS uses supervised learning with the captured code features and patch correctness labels to automatically learn a probabilistic model. The learned ODS model can then finally be applied to classify new and unseen program repair patches. We conduct a large-scale experiment to evaluate the effectiveness of ODS on patch correctness classification based on 10,302 patches from Defects4J, Bugs.jar and Bears benchmarks. The empirical evaluation shows that ODS is able to correctly classify 71.9% of program repair patches from 26 projects, which improves the state-of-the-art. ODS is applicable in practice and can be employed as a post-processing procedure to classify the patches generated by different APR systems.
- PDF • Experiment Results- Journal of Systems and Software - Automatic program repair papers tend to repeatedly use the same benchmarks. This poses a threat to the external validity of the findings of the program repair research community. In this paper, we perform an empirical study of automatic repair on a benchmark of bugs called QuixBugs, which has been little studied. In this paper, (1) We report on the characteristics of QuixBugs; (2) We study the effectiveness of 10 program repair tools on it; (3) We apply three patch correctness assessment techniques to comprehensively study the presence of overfitting patches in QuixBugs. Our key results are: (1) 16/40 buggy programs in QuixBugs can be repaired with at least a test suite adequate patch; (2) A total of 338 plausible patches are generated on the QuixBugs by the considered tools, and 53.3% of them are overfitting patches according to our manual assessment; (3) The three automated patch correctness assessment techniques, RGTEvosuite, RGTInputSampling and GTInvariants, achieve an accuracy of 98.2%, 80.8% and 58.3% in overfitting detection, respectively. To our knowledge, this is the largest empirical study of automatic repair on QuixBugs, combining both quantitative and qualitative insights. All our empirical results are publicly available on GitHub in order to facilitate future research on automatic program repair.
- Source code • WebSite- Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE) - Over the last few years, there has been substantial research on automated analysis, testing, and debugging of Ethereum smart contracts. However, it is not trivial to compare and reproduce that research. To address this, we present an empirical evaluation of 9 state-of-the-art automated analysis tools using two new datasets: i) a dataset of 69 annotated vulnerable smart contracts that can be used to evaluate the precision of analysis tools; and ii) a dataset with all the smart contracts in the Ethereum Blockchain that have Solidity source code available on Etherscan (a total of 47,518 contracts). The datasets are part of SmartBugs, a new extendable execution framework that we created to facilitate the integration and comparison between multiple analysis tools and the analysis of Ethereum smart contracts. We used SmartBugs to execute the 9 automated analysis tools on the two datasets. In total, we ran 428,337 analyses that took approximately 564 days and 3 hours, being the largest experimental setup to date both in the number of tools and in execution time. We found that only 42% of the vulnerabilities from our annotated dataset are detected by all the tools, with the tool Mythril having the higher accuracy (27%). When considering the largest dataset, we observed that 97% of contracts are tagged as vulnerable, thus suggesting a considerable number of false positives. Indeed, only a small number of vulnerabilities (and of only two categories) were detected simultaneously by four or more tools.
Acceptance rate: 62%, 18/29.
- PDF • Source code • Experiment Results- Proceedings of the 17th International Conference on Mining Software Repositories (MSR '20) - Continuous Integration (CI) is a development practice where developers frequently integrate code into a common codebase. After the code is integrated, the CI server runs a test suite and other tools to produce a set of reports (e.g., output of linters and tests). If the result of a CI test run is unexpected, developers have the option to manually restart the build, re-running the same test suite on the same code; this can reveal build flakiness, if the restarted build outcome differs from the original build. In this study, we analyze restarted builds, flaky builds, and their impact on the development workflow. We observe that developers restart at least 1.72% of builds, amounting to 56,522 restarted builds in our Travis CI dataset. We observe that more mature and more complex projects are more likely to include restarted builds. The restarted builds are mostly builds that are initially failing due to a test, network problem, or a Travis CI limitations such as execution timeout. Finally, we observe that restarted builds have a major impact on development workflow. Indeed, in 54.42% of the restarted builds, the developers analyze and restart a build within an hour of the initial failure. This suggests that developers wait for CI results, interrupting their workflow to address the issue. Restarted builds also slow down the merging of pull requests by a factor of three, bringing median merging time from 16h to 48h.
Acceptance rate: 30%, 41/138.
- PDF • Source code • Experiment Results • WebSite- Proceedings of the 42nd International Conference on Software Engineering (ICSE '20) - Over the last few years, there has been substantial research on automated analysis, testing, and debugging of Ethereum smart contracts. However, it is not trivial to compare and reproduce that research. To address this, we present an empirical evaluation of 9 state-of-the-art automated analysis tools using two new datasets: i) a dataset of 69 annotated vulnerable smart contracts that can be used to evaluate the precision of analysis tools; and ii) a dataset with all the smart contracts in the Ethereum Blockchain that have Solidity source code available on Etherscan (a total of 47,518 contracts). The datasets are part of SmartBugs, a new extendable execution framework that we created to facilitate the integration and comparison between multiple analysis tools and the analysis of Ethereum smart contracts. We used SmartBugs to execute the 9 automated analysis tools on the two datasets. In total, we ran 428,337 analyses that took approximately 564 days and 3 hours, being the largest experimental setup to date both in the number of tools and in execution time. We found that only 42% of the vulnerabilities from our annotated dataset are detected by all the tools, with the tool Mythril having the higher accuracy (27%). When considering the largest dataset, we observed that 97% of contracts are tagged as vulnerable, thus suggesting a considerable number of false positives. Indeed, only a small number of vulnerabilities (and of only two categories) were detected simultaneously by four or more tools.
Acceptance rate: 20%, 129/617.
- PDF • Slide • Source code • Benchmark- Proceedings of the 35th IEEE International Conference on Software Maintenance and Evolution (ICSME'19) - Travis CI handles automatically thousands of builds every day to, amongst other things, provide valuable feedback to thousands of open-source developers. In this paper, we investigate Travis CI to firstly understand who is using it, and when they start to use it. Secondly, we investigate how the developers use Travis CI and finally, how frequently the developers change the Travis CI configurations. We observed during our analysis that the main users of Travis CI are corporate users such as Microsoft. And the programming languages used in Travis CI by those users do not follow the same popularity trend than on GitHub, for example, Python is the most popular language on Travis CI, but it is only the third one on GitHub. We also observe that Travis CI is set up on average seven days after the creation of the repository and the jobs are still mainly used (60%) to run tests. And finally, we observe that 7.34% of the commits modify the Travis CI configuration. We share the biggest benchmark of Travis CI jobs (to our knowledge): it contains 35,793,144 jobs from 272,917 different GitHub projects.
Acceptance rate: 56%, 26/46.
- Source code- ACM Ubiquity - Repairnator is a bot. It constantly monitors software bugs discovered during continuous integration of open-source software and tries to fix them automatically. If it succeeds in synthesizing a valid patch, Repairnator proposes the patch to the human developers, disguised under a fake human identity. To date, Repairnator has been able to produce patches that were accepted by the human developers and permanently merged into the code base. This is a milestone for human-competitiveness in software engineering research on automatic program repair.
- PDF • Slide • Source code • Experiment Results • WebSite
Empirical Review of Java Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts/ 2019- Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '19) - In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are evaluated on a single benchmark of bugs, which are also rarely reproduced by other researchers. In this paper, we present a large-scale experiment using 11 Java test-suite-based repair tools and 5 benchmarks of bugs. Our goal is to have a better understanding of the current state of automatic program repair tools on a large diversity of benchmarks. Our investigation is guided by the hypothesis that the repairability of repair tools might not be generalized across different benchmarks of bugs. We found that the 11 tools 1) are able to generate patches for 21% of the bugs from the 5 benchmarks, and 2) have better performance on Defects4J compared to other benchmarks, by generating patches for 47% of the bugs from Defects4J compared to 10-30% of bugs from the other benchmarks. Our experiment comprises 23,551 repair attempts in total, which we used to find the causes of non-patch generation. These causes are reported in this paper, which can help repair tool designers to improve their techniques and tools.
Acceptance rate: 24%, 73/303.
- PDF • Source code- Proceedings in International Workshop on Intelligent Bug Fixing (IBF 2019), co-located with SANER2019 - Automatic program repair papers tend to repeatedly use the same benchmarks. This poses a threat to the external validity of the findings of the program repair research community. In this paper, we perform an automatic repair experiment on a benchmark called QuixBugs that has never been studied in the context of program repair. In this study, we report on the characteristics of QuixBugs, and study five repair systems, Arja, Astor, Nopol, NPEFix and RSRepair, which are representatives of generate-and-validate repair techniques and synthesis repair techniques. We propose three patch correctness assessment techniques to comprehensively study overfitting and incorrect patches. Our key results are: 1) 15 / 40 buggy programs in the QuixBugs can be repaired with a test-suite adequate patch; 2) a total of 64 plausible patches for those 15 buggy programs in the QuixBugs are present in the search space of the considered tools; 3) the three patch assessment techniques discard in total 33 / 64 patches that are overfitting. This sets a baseline for future research of automatic repair on QuixBugs. Our experiment also highlights the major properties and challenges of how to perform automated correctness assessment of program repair patches. All experimental results are publicly available on Github in order to facilitate future research on automatic program repair.
- PDF • Slide • Source code- Proceedings of the VI Workshop on Software Visualization, Evolution and Maintenance (VEM 2018) - The characterization of bug datasets is essential to support the evaluation of automatic program repair tools. In a previous work, we manually studied almost 400 human-written patches (bug fixes) from the Defects4J dataset and annotated them with properties, such as repair patterns. However, manually finding these patterns in different datasets is tedious and time-consuming. To address this activity, we designed and implemented PPD, a detector of repair patterns in patches, which performs source code change analysis at abstract-syntax tree level. In this paper, we report on PPD and its evaluation on Defects4J, where we compare the results from the automated detection with the results from the previous manual analysis. We found that PPD has overall precision of 91% and overall recall of 92%, and we conclude that PPD has the potential to detect as many repair patterns as human manual analysis.
Acceptance rate: 61%, 24/39.
Acceptance rate: 24%, 23/96.
- PDF • Source code
Alleviating Patch Overfitting with Automatic Test Generation: A Study of Feasibility and Effectiveness for the Nopol Repair System/ 2018- Proceedings at Empirical Software Engineering (EMSE) - Among the many different kinds of program repair techniques, one widely studied family of techniques is called test suite based repair. However, test suites are in essence input-output specifications and are thus typically inadequate for completely specifying the expected behavior of the program under repair. Consequently, the patches generated by test suite based repair techniques can just overfit to the used test suite, and fail to generalize to other tests. We deeply analyze the overfitting problem in program repair and give a classification of this problem. This classification will help the community to better understand and design techniques to defeat the overfitting problem. We further propose and evaluate an approach called UnsatGuided, which aims to alleviate the overfitting problem for synthesis-based repair techniques with automatic test case generation. The approach uses additional automatically generated tests to strengthen the repair constraint used by synthesis-based repair techniques. We analyze the effectiveness of UnsatGuided: 1) analytically with respect to alleviating two different kinds of overfitting issues; 2) empirically based on an experiment over the 224 bugs of the Defects4J repository. The main result is that automatic test generation is effective in alleviating one kind of overfitting issue–regression introduction, but due to oracle problem, has minimal positive impact on alleviating the other kind of overfitting issue–incomplete fixing.
- PDF • Slide • Source code- Proceedings of the 11th IEEE Conference on Software Testing, Validation and Verification (ICST'18) - High-availability of software systems requires automated handling of crashes in presence of errors. Failure-oblivious computing is one technique that aims to achieve high availability. We note that failure-obliviousness has not been studied in depth yet, and there is very few study that helps understand why failure-oblivious techniques work. In order to make failure-oblivious computing to have an impact in practice, we need to deeply understand failure-oblivious behaviors in software. In this paper, we study, design and perform an experiment that analyzes the size and the diversity of the failure-oblivious behaviors. Our experiment consists of exhaustively computing the search space of 16 field failures of large-scale open-source Java software. The outcome of this experiment is a much better understanding of what really happens when failure-oblivious computing is used, and this opens new promising research directions.
Acceptance rate: 25%, 30/119.
- PDF • Source code • WebSite- Proceedings of the 25th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER'18) - Well-designed and publicly available datasets of bugs are an invaluable asset to advance research fields such as fault localization and program repair as they allow directly and fairly comparison between competing techniques and also the replication of experiments. These datasets need to be deeply understood by researchers: the answer for questions like 'which bugs can my technique handle?' and 'for which bugs is my technique effective?'' depends on the comprehension of properties related to bugs and their patches. However, such properties are usually not included in the datasets, and there is still no widely adopted methodology for characterizing bugs and patches. In this work, we deeply study 395 patches of the Defects4J dataset. Quantitative properties (patch size and spreading) were automatically extracted, whereas qualitative ones (repair actions and patterns) were manually extracted using a thematic analysis-based approach. We found that 1) the median size of Defects4J patches is four lines, and almost 30% of the patches contain only addition of lines; 2) 92% of the patches change only one file, and 38% has no spreading at all; 3) the top-3 most applied repair actions are addition of method calls, conditionals, and assignments, occurring in 77% of the patches; and 4) nine repair patterns were found for 95% of the patches, where the most prevalent, appearing in 43% of the patches, is on conditional blocks. These results are useful for researchers to perform advanced analysis on their techniques' results based on Defects4J. Moreover, our set of properties can be used to characterize and compare different bug datasets.
Acceptance rate: 27%, 39/146.
- PDF • Slide- Proceeding of ICSE NIER - We present an original concept for patch generation: we propose to do it directly in production. Our idea is to generate patches on-the-fly based on automated analysis of the failure context. By doing this in production, the repair process has complete access to the system state at the point of failure. We propose to perform live regression testing of the generated patches directly on the production traffic, by feeding a sandboxed version of the application with a copy of the production traffic, the 'shadow traffic'. Our concept widens the applicability of program repair because it removes the requirements of having a failing test case.
Acceptance rate: 16%, 14/85.
- PDF • Slide • Source code • Experiment Results- Proceedings of the 25th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER'17) - Null pointer exceptions (NPE) are the number one cause of uncaught crashing exceptions in production. In this paper, we aim at exploring the search space of possible patches for null pointer exceptions with metaprogramming. Our idea is to transform the program under repair with automated code transformation, so as to obtain a metaprogram. This metaprogram contains automatically injected hooks, that can be activated to emulate a null pointer exception patch. This enables us to perform a fine-grain analysis of the runtime context of null pointer exceptions. We set up an experiment with 16 real null pointer exceptions that have happened in the field. We compare the effectiveness of our metaprogramming approach against simple templates for repairing null pointer exceptions.
Acceptance rate: 24%, 34/135.
- PDF • Slide • Experiment Results- Proceedings at Empirical Software Engineering (EMSE) - Defects4J is a large, peer-reviewed, structured dataset of real-world Java bugs. Each bug in Defects4J comes with a test suite and at least one failing test case that triggers the bug. In this paper, we report on an experiment to explore the effectiveness of automatic test-suite based repair on Defects4J. The result of our experiment shows that the considered state-of-the-art repair methods can generate patches for 47 out of 224 bugs. However, those patches are only test-suite adequate, which means that they pass the test suite and may potentially be incorrect beyond the test-suite satisfaction correctness criterion. We have manually analyzed 84 different patches to assess their real correctness. In total, 9 real Java bugs can be correctly repaired with a test-suite based repair. This analysis shows that test-suite based repair suffers from under-specified bugs, for which trivial or incorrect patches still pass the test suite. With respect to practical applicability, it takes on average 14.8 minutes to find a patch. The experiment was done on a scientific grid, totaling 17.6 days of computation time. All the repair systems and experimental results are publicly available on Github in order to facilitate future research on automatic repair.
- PDF • Source code • Experiment Results- IEEE Transactions on Software Engineering, Institute of Electrical and Electronics Engineers (TSE) - We propose Nopol, an approach to automatic repair of buggy conditional statements (i.e., if-then-else statements). This approach takes a buggy program as well as a test suite as input and generates a patch with a conditional expression as output. The test suite is required to contain passing test cases to model the expected behavior of the program and at least one failing test case that reveals the bug to be repaired. The process of Nopol consists of three major phases. First, Nopol employs angelic fix localization to identify expected values of a condition during the test execution. Second, runtime trace collection is used to collect variables and their actual values, including primitive data types and objected-oriented features (e.g., nullness checks), to serve as building blocks for patch generation. Third, Nopol encodes these collected data into an instance of a Satisfiability Modulo Theory (SMT) problem; then a feasible solution to the SMT instance is translated back into a code patch. We evaluate Nopol on 22 real-world bugs (16 bugs with a buggy if conditions and six bugs with missing preconditions) on two large open-source projects, namely Apache Commons Math and Apache Commons Lang. Empirical analysis on these bugs shows that our approach can effectively fix bugs with buggy if conditions and missing preconditions. We illustrate the capabilities and limitations of Nopol using case studies of real bug fixes.
- PDF • Slide • Source code- 11th International Workshop in Automation of Software Test (AST 2016) - Automatic software repair is the process of automatically fixing bugs. The Nopol repair system repairs Java code using code synthesis. We have designed a new code synthesis engine for Nopol based on dynamic exploration, it is called DynaMoth. The main design goal is to be able to generate patches with method calls. We evaluate DynaMoth over 224 of the Defects4J dataset. The evaluation shows that Nopol with DynaMoth is capable of synthesizing patches and enables Nopol to repair new bugs of the dataset.
- PDF • Benchmark- Reproducible and comparative research requires well-designed and publicly available benchmarks. We present IntroClassJava, a benchmark of 297 small Java programs, specified by JUnit test cases, and usable by any fault localization or repair system for Java. The dataset is based on the IntroClass benchmark and is publicly available on Github.
- Best Paper: DUETS: A Dataset of Reproducible Pairs of Java Library-Clients, MSR 2021
- Distinguished Paper: Empirical Review of Java Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts, ESEC/FSE'19
- Best Thesis: Accessit price for the Best Thesis GDR GPL 2018 (French Software Engenerring Group)
- Best Paper: A Comprehensive Study of Automatic Program Repair on the QuixBugs Benchmark, IBF 2019
- Best Paper: Towards an automated approach for bug fix pattern detection, VEM 2018
Title: From Runtime Failures to Patches: Study of Patch Generation in Production
Directors: Martin Monperrus and Lionel Seinturier
Started in: September 2015, Defended: 25th September 2018
Slide: Defense slide
OPEN-SOURCE RESEARCH TOOLS
- Github- Automatic Patch Generation Technique for Java Server
- Github- Maven Plugin to Automatic Generate Patches for your Projects
- Github- Automatic Patch Generation for Java
- Github- Automatic Patch Synthesizer for Java
- Github- Automatic Patch Generation for Null Pointer Exception