
ydyjya/Awesome-LLM-Safety - GitHub
We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on large language model safety (llm-safety). Beyond papers, the collection also includes relevant talks, tutorials, conferences, news, and articles.
LLM Safety Latest Paper Recommendations - 2025.03.12 - Zhihu Column
3 days ago · Keywords: Long-Context Safety & LLM Evaluation & Safety Benchmark. Abstract: As large language models (LLMs) advance on long-text understanding and generation tasks, the safety issues introduced by long contexts are becoming increasingly apparent. However, research on the safety of long-context tasks is still at an early stage, lacking systematic evaluation methods and improvement strategies.
LLM Safety Latest Paper Recommendations - 2024.3.22 - Zhihu Column
March 22, 2024 · This article introduces EasyJailbreak, a unified framework that simplifies the construction and evaluation of jailbreak attacks against LLMs. It assembles jailbreak attacks from four components: a selector, a mutator, a constraint, and an evaluator, so that researchers can easily build attacks by combining new and existing components. EasyJailbreak currently supports 11 different jailbreak methods and facilitates safety validation across a broad range of LLMs. Validation on 10 different LLMs revealed significant safety vulnerabilities, with an average attack success rate of 60%. Notably, even advanced models such as GPT-3.5-Turbo and GPT-4 exhibit …
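The four-component design can be pictured as a simple search loop over candidate prompts. The sketch below is a minimal, hypothetical illustration of how a selector, mutator, constraint, and evaluator might be composed; the class and function names are assumptions for illustration and are not EasyJailbreak's actual API.

```python
# Hypothetical selector -> mutator -> constraint -> evaluator pipeline.
# Names are illustrative only, not EasyJailbreak's real interface.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    prompt: str          # current jailbreak prompt
    score: float = 0.0   # attack-success score assigned by the evaluator


def attack_loop(
    seeds: List[Candidate],
    select: Callable[[List[Candidate]], List[Candidate]],    # picks promising candidates
    mutate: Callable[[Candidate], List[Candidate]],          # rewrites/expands a prompt
    constrain: Callable[[List[Candidate]], List[Candidate]], # filters invalid prompts
    evaluate: Callable[[Candidate], float],                  # queries the target LLM and scores it
    rounds: int = 5,
) -> Candidate:
    pool = list(seeds)
    for _ in range(rounds):
        chosen = select(pool)                              # 1. selector
        mutated = [m for c in chosen for m in mutate(c)]   # 2. mutator
        valid = constrain(mutated)                         # 3. constraint
        for cand in valid:                                 # 4. evaluator
            cand.score = evaluate(cand)
        pool.extend(valid)
    return max(pool, key=lambda c: c.score)
```

Swapping any one of the four callables yields a new attack variant, which is the combinatorial reuse the framework description emphasizes.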
tjunlp-lab/Awesome-LLM-Safety-Papers - GitHub
This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks.
GitHub - thu-coai/SafetyBench: Official github repo for …
SafetyBench is a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. SafetyBench also incorporates both Chinese and English data, …
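Because SafetyBench is multiple-choice, evaluation reduces to comparing the model's chosen option letter against the gold answer, aggregated per safety category. The following is a minimal sketch under the assumption of a JSON-lines file with `question`, `options`, `answer`, and `category` fields; the field names and the `ask_model` stub are assumptions, not the benchmark's actual schema or tooling.

```python
# Minimal sketch: per-category accuracy on multiple-choice safety questions.
import json
from collections import defaultdict


def ask_model(question: str, options: list[str]) -> str:
    """Stub: send the question and options to an LLM, return its option letter."""
    return "A"  # replace with a real model call


def evaluate(path: str) -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            pred = ask_model(item["question"], item["options"])
            total[item["category"]] += 1
            if pred == item["answer"]:
                correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```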
Agent-SafetyBench: Evaluating the Safety of LLM Agents
2024年12月19日 · In this paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions.
LLM Safety Paper Recommendations - Zhihu
2 days ago · This series posts Safety-related papers from arXiv and is updated at irregular intervals, aiming to bring the latest research progress to researchers in the LLM Safety field so they can get up to speed quickly. In addition, we also maintain a Safety-related repo on GitHub, which collects classic LLM Safety papers and other materials and is kept in sync with the latest …
[2412.17686] Large Language Model Safety: A Holistic Survey
2024年12月23日 · This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks.
LLM Safety Latest Paper Recommendations - 2024.1.10 - Zhihu Column
January 10, 2024 · This series posts Safety-related papers from arXiv and is updated at irregular intervals, aiming to bring the latest research progress to researchers in the LLM Safety field so they can get up to speed quickly. In addition, we also maintain a Safety-related repo on GitHub, which collects classic LLM Safety papers and other materials and is kept in sync with the latest paper information; the address is below ⬇️
Improving LLM Safety Alignment with Dual-Objective Optimization
2025年3月6日 · Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and ...
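For context, the loss the abstract criticizes is the standard DPO preference objective shown below (the paper's dual-objective refinement is not reproduced here). Here $\pi_\theta$ is the policy being aligned, $\pi_{\mathrm{ref}}$ a frozen reference model, $(y_w, y_l)$ a preferred/dispreferred response pair (for safety alignment, typically a refusal versus a harmful completion), and $\beta$ a temperature:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```

The loss only increases the margin between the two responses, so the probability of the unsafe completion can stay high in absolute terms, which is one reason a single preference margin can be suboptimal for refusal learning.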