AI Replacing My Job Part 1: Automatic Program Repair
When I told my friends that I was working on machine learning this summer, even I was astonished by what I was saying. Of all people, I, the biggest machine learning hater, decided of my own free will to work on machine learning. In my defense, I was intrigued by the security angle of this project and by the intersection of two of the hottest fields in computer science.
Over the past week, I have read just a few of the papers in the vast literature on automatic program repair (APR). No, I have not read through all 179 references in the SoK draft that my professor is publishing, and no, I still do not believe I’m cut out for machine learning. Despite my reservations, I have compiled this working webpage of the higher-level ideas I took away from those papers.
Task
After reading these papers, I believe that the goal they are trying to achieve is fundamentally different from ours. The task of these papers can be loosely summarized as: identifying security vulnerabilities and accurately repairing them.
According to Wikipedia, APR is the automatic repair of software bugs without the intervention of a human programmer. By this definition, the task these papers tackle falls within a subset of the tasks that APR can encompass.
Our task is to create a more complete APR technology. The key difference we are proposing is to let the user specify the expected behavior. In many previously published papers, the authors provided the specification for the code to satisfy. Now that LLMs have become more prevalent and accurate, we have the technology to bridge the gap between user intention and code. This line of work is just getting started: in 2024, VSP repaired security vulnerabilities using chain-of-thought (CoT) prompting.
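To make CoT prompting concrete, here is a rough sketch of what a VSP-style prompt could look like. This is my own illustration rather than the paper’s actual prompt, and `query_llm` stands in for whatever model API is in use:

```python
# Hypothetical chain-of-thought prompt for vulnerability repair.
COT_TEMPLATE = """You are a security analyst.

Code:
{code}

Step 1: Identify which vulnerability class (CWE), if any, this code exhibits.
Step 2: Explain, step by step, how the vulnerability could be triggered.
Step 3: Propose a minimal patch that preserves the intended behavior.
"""

def repair_with_cot(code: str, query_llm) -> str:
    """Ask the model to reason through the bug before patching it."""
    return query_llm(COT_TEMPLATE.format(code=code))
```

The point of the intermediate steps is to force the model to commit to an analysis before it writes a patch, rather than jumping straight to code.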
At its core, security is more than just attacking and defending code. People often picture security as perfectly aligned neon-green text against a black background, or some mysterious anonymous guy in a hoodie in the middle of the desert actively running five monitors. In reality, security is about intention: what the programmer intends is how the code should run. Attacks do make code run other than intended, but there’s more to security than just that.
Process
This raises the question: how do we go about building a technology to repair code? Everyone seems to use a different suite of technologies to accomplish this task, but everyone more or less follows the same general process, sketched in code after the list:
- Detect: analyze the code for potential errors.
- Localize: determine where the buggy code is located.
- Repair: generate code to patch the problem.
- Validate: verify that the patch passes the test cases.
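Here is a minimal Python sketch of that loop. Every helper is a hypothetical stand-in for real machinery (static analyzers or learned detectors, fault localizers, search- or LLM-based patch generators, and test suites):

```python
from typing import Callable, Iterable, Optional

def repair_pipeline(
    code: str,
    detect: Callable[[str], bool],                 # is the code buggy?
    localize: Callable[[str], int],                # which line is suspect?
    propose: Callable[[str, int], Iterable[str]],  # candidate patched programs
    validate: Callable[[str], bool],               # does a candidate pass the tests?
) -> Optional[str]:
    """Return the first candidate patch that validates, or None."""
    if not detect(code):
        return code  # nothing to repair
    suspect_line = localize(code)
    for candidate in propose(code, suspect_line):
        if validate(candidate):
            return candidate
    return None  # search exhausted without a validated patch
```

Note that passing the test cases only shows a patch is plausible, not correct: a patch can pass every test and still break behavior the tests never specified.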
Considerations
Many aspects need to be carefully considered when formulating a solution to this problem. One of the biggest issues is search space explosion: the search space here is the set of all possible patches that could be applied to the buggy code.
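To get a feel for how quickly this blows up, here is a back-of-envelope calculation with made-up but plausible numbers: if a fault localizer flags 50 suspicious lines, the patch generator knows 3 kinds of edits, and each edit can draw from 100 donor snippets, then single-edit patches number in the thousands and two-edit patches in the hundreds of millions:

```python
def patch_space(lines: int, edit_kinds: int, ingredients: int, k: int) -> int:
    """Rough count of k-edit patches: (lines * edit kinds * ingredients) ** k."""
    return (lines * edit_kinds * ingredients) ** k

print(patch_space(50, 3, 100, k=1))  # 15,000
print(patch_space(50, 3, 100, k=2))  # 225,000,000
```

This is why every system needs some way to prune or rank candidates, whether fix templates, learned models, or beam search.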
A running collection of questions and todos:
- How does each paper deal with the search space explosion problem?
- What are the limitations of not using an LLM? What are the benefits?
- Is our plan future-proof?
- LLMs should be better at learning code syntax and semantics than a deep learning model trained from scratch, since so much pretraining has already been done.
- Good vs. bad synthetic data? How much data is required?
- Fix templates?
- Machine learning only has the potential to succeed as long as it performs better than humans.
- Correctness should have a concrete representation instead of just being eyeballed.
- What’s the point of looking for and fixing an already-known bug?
- How do machine-generated patches compare to human ones?
- Overfitting?
Summaries
- VRepair employs transfer learning on a dataset of C vulnerabilities mined from GitHub commits.
- SVEN enhances the production of secure code through security hardening and adversarial testing without changing the LM’s pretrained weights. Instead, the model learns prefixes, which keeps the approach cost-effective.
- CURE proposes a new neural machine translation (NMT) approach: it pre-trains a model on a specific programming language to learn code syntax, then uses beam search to find the best patch.
- GenProg can be analogized to biological evolution, where the variant with the highest fitness is evolved until it passes all test cases. GenProg is used to repair legacy C code. (A toy version of this loop is sketched after the list.)
- PAR builds on GenProg by training on human-written patches, focusing on semantics instead of syntax.
- DLFix is a two-layer approach that mainly fixes a code-misalignment problem, ensuring that context remains in the code. It employs a Context Learning Layer (CLL) and a Transformation Learning Layer (TLL).
- SymlogRepair translates the buggy code into a Datalog database; queries over the database provide the constraints for an SMT solver.
- VSP uses CoT prompting to perform three vulnerability analyses: identify, discover, and patch. CoT provides intermediate reasoning steps, allowing VSP to learn as it goes.
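As a toy illustration of the GenProg-style evolutionary loop above (my own simplification, not GenProg itself: real GenProg mutates by copying, deleting, and swapping statements, adds crossover, and scores fitness by weighted test passes; `mutate` and `fitness` here are placeholders):

```python
import random

def evolve(seed: str, mutate, fitness, pop_size=40, generations=10):
    """Toy genetic repair loop: fitness of 1.0 means every test passes."""
    population = [mutate(seed) for _ in range(pop_size)]
    for _ in range(generations):
        # rank variants by how many test cases they pass
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == 1.0:
            return population[0]  # a variant that passes all test cases
        # keep the fittest half, refill by mutating random survivors
        survivors = population[: pop_size // 2]
        population = survivors + [
            mutate(random.choice(survivors))
            for _ in range(pop_size - len(survivors))
        ]
    return None  # evolution budget exhausted without a full repair
```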