Rethinking How AI Agents Work in Practice

Recent work from the Data, Agents, and Processes Lab (DAPLab) explores a core challenge in AI agents: how to move from impressive demos to systems that actually work in real-world settings. Across three projects, PhD students worked alongside professors and industry partners to examine reliability, optimization, and evaluation, arguing that today’s agents often fail not because of raw capability but because of how they’re designed, deployed, and measured.

Reya Vir’s project, Your AI Agent Doesn’t Care About Your README, highlights a key disconnect between how humans document systems and how agents actually operate. README files are written for people, but agents don’t reliably interpret or follow them, leading to breakdowns in execution. This work points to a deeper issue: aligning agent behavior with real workflows requires more than instructions; it requires structured, machine-interpretable processes and better grounding in how systems actually function in practice. To address this gap, Vir developed a tool called the Hierarchical Context Compressor, which automatically generates directory-specific guidance. The system prioritizes concise, relevant metadata over generic advice, ensuring agents receive only the specific instructions needed for their current workspace.
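The core idea, selecting the most workspace-specific guidance first and trimming generic advice to fit a context budget, can be sketched roughly as follows. This is an illustrative reconstruction, not the actual tool: the `AGENT_NOTES.md` filename, the character budget, and the traversal heuristic are all assumptions made for demonstration.

```python
# Hypothetical sketch of directory-specific context generation, inspired by
# the Hierarchical Context Compressor described above. File names and
# heuristics here are invented for illustration.
from pathlib import Path

def collect_context(workdir: str, root: str, max_chars: int = 500) -> str:
    """Gather per-directory guidance files from the agent's working
    directory up to the repo root, giving the most specific notes first
    claim on the character budget so generic advice gets truncated."""
    root_path = Path(root).resolve()
    levels = list(Path(workdir).resolve().relative_to(root_path).parts)
    # Visit directories from most specific (workdir) to most generic (root).
    dirs = [root_path.joinpath(*levels[:i]) for i in range(len(levels), -1, -1)]
    pieces, budget = [], max_chars
    for d in dirs:
        guide = d / "AGENT_NOTES.md"  # hypothetical per-directory guidance file
        if guide.exists() and budget > 0:
            text = guide.read_text()[:budget]  # deeper notes spend budget first
            pieces.append(f"# {d.name or 'root'}\n{text}")
            budget -= len(text)
    return "\n\n".join(pieces)
```

Under this sketch, an agent working in a deeply nested directory would see that directory's notes in full, with ancestor directories' notes included only as budget allows.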

The second project, titled Why Your Agent Needs a Model Combo Optimizer, is a collaboration between researchers at Microsoft Frontier AI and DAPLab. The work shifts focus to the performance and cost of agentic workflow execution. The research shows that agents perform better when different models are assigned specialized roles (like planning vs. solving) and systematically evaluated. In response, the team released a drop-in library called AgentOpt that automatically finds the most effective model choices based on the developer’s specific needs for quality, speed, and budget. AgentOpt selects the best combination of models for a given task, reflecting a broader trend toward modular, multi-model systems rather than one-size-fits-all approaches.
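To make the optimization idea concrete, here is a minimal sketch of searching over per-role model assignments under a cost budget. The model names, quality scores, and costs are invented, and this brute-force search is an assumption about the general technique, not AgentOpt's actual API or algorithm.

```python
# Illustrative model-combo search: score every (planner, solver) assignment
# and keep the highest-quality one that fits the per-task budget.
# All profile data below is fabricated for demonstration.
from itertools import product

# Hypothetical per-role benchmark results: (quality on a 0-1 scale, $ per task).
PROFILES = {
    "planner": {"model-large": (0.92, 0.030), "model-small": (0.78, 0.004)},
    "solver":  {"model-large": (0.90, 0.050), "model-small": (0.81, 0.008)},
}

def best_combo(budget_per_task: float) -> tuple[dict, float]:
    """Exhaustively evaluate every role-to-model assignment and return
    the best in-budget combination with its aggregate quality."""
    best, best_quality = None, -1.0
    for planner, solver in product(PROFILES["planner"], PROFILES["solver"]):
        q_p, c_p = PROFILES["planner"][planner]
        q_s, c_s = PROFILES["solver"][solver]
        quality, cost = (q_p + q_s) / 2, c_p + c_s  # simple aggregates
        if cost <= budget_per_task and quality > best_quality:
            best, best_quality = {"planner": planner, "solver": solver}, quality
    return best, best_quality
```

With a generous budget the search picks the large model for both roles; as the budget tightens, it trades quality for cheaper assignments, which is the quality/speed/budget trade-off the project targets.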

Finally, Benchmarking Mortgage Underwriting Agents by Matthew Toles, a collaboration with DAPLab partner Tidalwave, tackles evaluation in a high-stakes domain. The project introduces one of the first realistic benchmarks for AI agents in mortgage origination, showing that domain-specific agents significantly outperform general-purpose models on compliance-heavy tasks. While the agents demonstrate human-level accuracy in identifying transaction patterns and verifying consistency, the researchers noted a conservative bias: the models occasionally omit relevant financial transactions. The research highlights the potential for AI to assist loan officers while emphasizing the necessity of policy enforcement and human oversight to ensure reliability.
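The conservative bias described above is the kind of failure mode that standard precision/recall scoring surfaces directly: omitted transactions lower recall while leaving precision intact. The sketch below is a generic evaluation helper with invented sample data; it is not the benchmark's actual scoring code or schema.

```python
# Generic precision/recall scoring for set-valued extraction tasks, such as
# comparing an agent's identified transactions against a ground-truth set.
def precision_recall(predicted: set, actual: set) -> tuple:
    """Precision penalizes hallucinated items; recall penalizes omissions,
    the conservative-bias failure mode noted in the text."""
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(actual) if actual else 1.0
    return precision, recall
```

An agent that only reports transactions it is highly confident about would score near-perfect precision but reduced recall, matching the observed behavior.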

Taken together, these projects emphasize a shift in AI research: success is no longer just about model capability, but about system design, including how agents interpret instructions, how they coordinate multiple models, and how their performance is measured in real-world contexts.