Ira Ceka

Research

Understanding APR Agents Through the Lens of Traceability

Ira Ceka*, Hailie Mitchell*, Saurabh Pujar, Luca Buratti, Shyam Ramji, Junfeng Yang, Gail Kaiser, Baishakhi Ray

ISSTA '26

Automated Program Repair (APR) agents leverage Large Language Models (LLMs) to autonomously diagnose and fix software bugs through reasoning, planning, and tool use. Despite impressive leaderboard gains on benchmarks such as SWE-bench, little is understood about how these agents take actions, where they fail, and how their behavior compares to that of human developers. This paper presents the first systematic analysis of five state-of-the-art APR agents across 500 real-world repair tasks, tracing their full decision-making pipelines from issue description to patch validation. Our study reveals that while agents excel at simple fixes, they struggle with logic-intensive bugs, often producing verbose or overfitted patches that merely satisfy existing tests. We find that test generation and regression test selection remain major bottlenecks, with agents frequently failing to reproduce issues or run relevant regression tests. Moreover, most agents operate with primitive tooling (e.g., bash scripts) and lack access to debuggers or program analyzers, which constrains their reasoning and patch quality. These findings highlight key limitations in current APR systems and motivate a shift-left approach emphasizing early, high-quality test generation and validation to reduce spurious fixes and improve semantic correctness. We further outline concrete directions for next-generation APR design: (1) richer and more integrated tool ecosystems, (2) diversified agentic architectures that combine complementary strengths, and (3) benchmarks that prioritize semantic repair quality and test generation fidelity over surface-level success metrics.

Code Reasoning for Software Engineering Tasks: A Survey and A Call to Action

Saurabh Pujar, Ira Ceka, Irene L Manotas, Gail Kaiser, Baishakhi Ray, Shyam Ramji

arXiv '25

The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks. Their performance on certain tasks can be further enhanced by incorporating test-time reasoning techniques. These inference-time advances have been adopted into the code domain, enabling complex software engineering (SWE) tasks such as code generation, test generation and issue resolution. However, the impact of different reasoning techniques on code-centric SWE tasks has not been systematically explored. In this work, we survey code reasoning techniques that underpin these capabilities, with a focus on test-time compute and inference-time reasoning paradigms. We examine a variety of code-specific reasoning methods and progressively build up to SWE agents, which combine planning, tool use, and multi-step interaction. We also compare the impact of different techniques on coding tasks, highlighting their relative importance and outlining open challenges and future research directions. Across commonly used models and benchmarks, we find that approaches exploiting code-specific signals (e.g., structure and execution feedback) are frequently associated with improved performance, motivating a dedicated study of code reasoning beyond natural-language reasoning.

Can LLM Prompting Serve as a Proxy for Static Analysis in Vulnerability Detection

Ira Ceka, Feitong Qiao*, Anik Dey*, Aastha Valecha, Gail Kaiser, Baishakhi Ray

arXiv '24

Despite their remarkable success, large language models (LLMs) have shown limited ability on safety-critical code tasks such as vulnerability detection. Typically, static analysis (SA) tools such as CodeQL and CodeGuru Security are used for vulnerability detection. SA relies on predefined, manually crafted rules for flagging various vulnerabilities. Thus, the effectiveness of SA in detecting vulnerabilities depends on human experts, and SA tools are known to report high error rates. In this study, we investigate whether LLM prompting can be an effective alternative to these static analyzers in the partial code setting. We propose prompting strategies that integrate natural language instructions of vulnerabilities with contrastive chain-of-thought reasoning, augmented using contrastive samples from a synthetic dataset. Our findings demonstrate that security-aware prompting techniques can be effective alternatives to the laborious, hand-crafted rules of static analyzers, which often result in high false negative rates in the partial code setting. When leveraging SOTA reasoning models such as DeepSeek-R1, each of our prompting strategies exceeds the static analyzer baseline, with the best strategies improving accuracy by as much as 31.6%, F1-scores by 71.7%, pairwise accuracies by 60.4%, and reducing FNR by as much as 37.6%.

Towards Causal Deep Learning for Vulnerability Detection

Md Mahbubur Rahman, Ira Ceka, Chengzhi Mao, Saikat Chakraborty, Baishakhi Ray, Wei Le

ICSE '24

Deep learning vulnerability detection has shown promising results in recent years. However, an important challenge that still keeps it from being useful in practice is that the model is not robust under perturbation and cannot generalize well to out-of-distribution (OOD) data, e.g., when a trained model is applied to unseen projects in the real world. We hypothesize that this is because the model learned non-robust features, e.g., variable names, that have spurious correlations with labels. When the perturbed and OOD datasets no longer have the same spurious features, the model prediction fails. To address this challenge, in this paper we introduced causality into deep learning vulnerability detection. Our approach, CausalVul, consists of two phases. First, we designed novel perturbations to discover spurious features that the model may use to make predictions. Second, we applied causal learning algorithms, specifically do-calculus, on top of existing deep learning models to systematically remove the use of spurious features and thus promote causality-based prediction. Our results show that CausalVul consistently improved model accuracy, robustness, and OOD performance for all the state-of-the-art models and datasets we experimented with.
