Publications
2026
- Safeguarding LLM Agents from Misalignment through Provenance Analysis
  Yining She, Yiliang Liang, and Eunsuk Kang
  Under Review, 2026
As LLM agents gain increasing access to powerful tools, ensuring that their actions are aligned with the user’s intent becomes critical. When an agent’s proposed tool invocation deviates from the user’s intent—a phenomenon called misalignment—it may lead to harmful consequences that are difficult to undo. Existing runtime guardrails rely on an LLM-as-a-judge paradigm that lacks a systematic framework for reasoning about alignment, often producing judgments that are inconsistent or difficult to audit. Motivated by provenance analysis, we propose a provenance-based conceptual framework that formalizes misalignment detection as determining whether a proposed tool call is supported by traceable evidence in the agent’s context. Building on this framework, we propose ProvenanceGuard, a multi-stage pipeline that analyzes the agent’s action for three types of misalignment before the selected tool is executed and only allows the action to take place when it is considered aligned with the user’s input query. We evaluated our proposed approach on two different benchmarks, Agent-SafetyBench and WorkBench, across 10 backbone LLMs. Compared to the LLM-as-a-judge baseline, ProvenanceGuard reduces error rate on misaligned traces from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench, while reducing intervention burden on task-successful traces from 30.5% to 12.8% and introducing no statistically significant increase in unnecessary interventions on aligned traces. These results demonstrate that structured, provenance-based reasoning provides an effective and practical foundation for safeguarding LLM agents from misalignment.
@article{she2025provenanceguard,
  title = {Safeguarding LLM Agents from Misalignment through Provenance Analysis},
  author = {She, Yining and Liang, Yiliang and Kang, Eunsuk},
  journal = {Under Review},
  year = {2026},
}
- Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
  Yining Hong, Yining She, Eunsuk Kang, Christopher S Timperley, and Christian Kästner
  Under Review, 2026
AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, such as training-based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three-part study includes a systematic review of 80 state-of-the-art agent safety and security benchmarks to identify the policies they evaluate, an analysis of which policy requirements can be guaranteed by symbolic guardrails, and an evaluation of how symbolic guardrails affect safety, security, and agent success on Tau2-Bench, CAR-bench, and MedAgentBench. We find that 85% of benchmarks lack concrete policies, relying instead on underspecified high-level goals or common sense. Among the specified policies, 74% of policy requirements can be enforced by symbolic guardrails, often using simple, low-cost mechanisms. These guardrails improve safety and security without sacrificing agent utility. Overall, our results suggest that symbolic guardrails are a practical and effective way to guarantee some safety and security requirements, especially for domain-specific AI agents. We release all code and artifacts at https://github.com/hyn0027/agent-symbolic-guardrails.
@article{hong2026symbolicguardrails,
  title = {Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility},
  author = {Hong, Yining and She, Yining and Kang, Eunsuk and Timperley, Christopher S and Kästner, Christian},
  journal = {Under Review},
  year = {2026},
}
2025
- RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts
  Yining She, Daniel W Peterson, Marianne Menglin Liu, Vikas Upadhyay, Mohammad Hossein Chaghazardi, Eunsuk Kang, and Dan Roth
  Under Review, 2025
With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs that are vulnerable to data distribution shifts. In this paper, taking Retrieval-Augmented Generation (RAG) as a case study, we investigated how robust LLM-based guardrails are against additional information embedded in the context. Through a systematic evaluation of 3 Llama Guards and 2 GPT-oss models, we confirmed that inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11% and 8% of cases, respectively, making them unreliable. We separately analyzed the effect of each component in the augmented context: retrieved documents, user query, and LLM-generated response. The two mitigation methods we tested bring only minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.
@article{she2025rag,
  title = {RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts},
  author = {She, Yining and Peterson, Daniel W and Liu, Marianne Menglin and Upadhyay, Vikas and Chaghazardi, Mohammad Hossein and Kang, Eunsuk and Roth, Dan},
  journal = {Under Review},
  year = {2025},
}
- FairSense: Long-Term Fairness Analysis of ML-Enabled Systems
  Yining She, Sumon Biswas, Christian Kästner, and Eunsuk Kang
  In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), 2025
Algorithmic fairness of machine learning (ML) models has raised significant concern in recent years. Many testing, verification, and bias mitigation techniques have been proposed to identify and reduce fairness issues in ML models. The existing methods are model-centric and designed to detect fairness issues under static settings. However, many ML-enabled systems operate in a dynamic environment where the predictive decisions made by the system impact the environment, which in turn affects future decision-making. Such a self-reinforcing feedback loop can cause fairness violations in the long term, even if the immediate outcomes are fair. In this paper, we propose a simulation-based framework called FairSense to detect and analyze long-term unfairness in ML-enabled systems. Given a fairness requirement, FairSense performs Monte-Carlo simulation to enumerate evolution traces for each system configuration. Then, FairSense performs sensitivity analysis on the space of possible configurations to understand the impact of design options and environmental factors on the long-term fairness of the system. We demonstrate FairSense’s potential utility through three real-world case studies: loan lending, opioid risk scoring, and predictive policing.
@inproceedings{she2025fairsense,
  title = {FairSense: Long-Term Fairness Analysis of ML-Enabled Systems},
  author = {She, Yining and Biswas, Sumon and K{\"a}stner, Christian and Kang, Eunsuk},
  booktitle = {2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE)},
  year = {2025},
  organization = {IEEE Computer Society},
}
2023
- Towards Safe ML-Based Systems in Presence of Feedback Loops
  Sumon Biswas, Yining She, and Eunsuk Kang
  In SE4SafeML workshop in ESEC/FSE’2023: The 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, California, Dec 2023
Machine learning (ML) based software is increasingly being deployed in a myriad of socio-technical systems, such as drug monitoring, loan lending, and predictive policing. Although not commonly considered safety-critical, these systems have the potential to cause serious, long-lasting harm to users and the environment due to their close proximity to, and effect on, society. One type of emerging problem in these systems is unintended side effects from a feedback loop: the decision of an ML-based system induces certain changes in the environment, which, in turn, generate observations that are fed back into the system for further decision-making. When this cyclic interaction between the system and the environment repeats over time, its effect may be amplified and ultimately result in an undesirable outcome. In this position paper, we bring attention to the safety risks that are introduced by feedback loops in ML-based systems, and the challenges of identifying and addressing them. In particular, due to their gradual and long-term impact, we argue that feedback loops are difficult to detect and diagnose using existing techniques in software engineering. We propose a set of research problems in modeling, analyzing, and testing ML-based systems to identify, monitor, and mitigate the effects of an undesirable feedback loop.
@inproceedings{biswas2023towards,
  author = {Biswas, Sumon and She, Yining and Kang, Eunsuk},
  booktitle = {SE4SafeML workshop in ESEC/FSE'2023: The 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
  title = {Towards Safe ML-Based Systems in Presence of Feedback Loops},
  location = {San Francisco, California},
  month = dec,
  year = {2023},
}
2022
- Stable Interaction of Autonomous Vehicle Platoons with Human-Driven Vehicles
  Mohammad Pirani, Yining She, Renzhi Tang, Zhihao Jiang, and Yash Vardhan Pant
  In 2022 American Control Conference (ACC), Dec 2022
@inproceedings{9867210,
  author = {Pirani, Mohammad and She, Yining and Tang, Renzhi and Jiang, Zhihao and Vardhan Pant, Yash},
  booktitle = {2022 American Control Conference (ACC)},
  title = {Stable Interaction of Autonomous Vehicle Platoons with Human-Driven Vehicles},
  year = {2022},
  pages = {633--640},
  doi = {10.23919/ACC53348.2022.9867210},
}