Proactive Defense Against LLM Jailbreak
Weiliang Zhao, JinJun Peng, Daniel Ben-Levi, Junfeng Yang, Chengzhi Mao
Under Review
A proactive defense framework that injects strategically crafted spurious outputs to mislead attackers’ optimization loops, causing multi-turn jailbreak searches to collapse prematurely and reducing LLM vulnerability.
|
|
|
Diversity Helps Jailbreak Large Language Models
Weiliang Zhao, Daniel Ben-Levi, Junfeng Yang, Chengzhi Mao
NAACL, 2025, Oral
arXiv
A generalized jailbreaking technique that encourages higher levels of diversification and adjacent obfuscated prompting to evaluate the vulnerabilities of LLMs.
|
|
|
Learning to Rewrite: Generalized LLM-Generated Text Detection
Wei Hao, Ran Li, Weiliang Zhao, Junfeng Yang, Chengzhi Mao
ACL, 2025
arXiv
We propose a method that enhances the detection of LLM-generated text by learning to rewrite more on LLM-generated inputs and less on human-generated inputs.
|
|
🤝
Acknowledgement
I would like to acknowledge support from the Thinker Research Grants from Thinking Machine.
|
|
Feel free to steal this website's source code. Do not scrape the HTML from this page itself, as it includes analytics tags that you do not want on your own website; use the GitHub code instead. Also, consider using Leonid Keselman's Jekyll fork of this page.
|
|