
ydyjya/Awesome-LLM-Safety - GitHub
We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on large language model safety (llm-safety). But we don't stop there; included are also relevant …
LLM Safety 最新论文推介 - 2025.03.12 - 知乎 - 知乎专栏
3 天之前 · 关键词: Long-Context Safety&LLM Evaluation&Safety Benchmark. 摘要:随着大型语言模型(LLMs)在长文本理解和生成任务上的进步,长上下文带来的安全问题也逐渐显现。然 …
LLM Safety 最新论文推介 - 2024.3.22 - 知乎 - 知乎专栏
2024年3月22日 · 本文介绍了EasyJailbreak,一个统一框架,简化了对LLMs进行越狱攻击的构建和评估过程。 它通过四个组成部分——选择器、变异器、约束器和评估器——构建越狱攻击, …
tjunlp-lab/Awesome-LLM-Safety-Papers - GitHub
This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and …
GitHub - thu-coai/SafetyBench: Official github repo for …
SafetyBench is a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety …
Agent-SafetyBench: Evaluating the Safety of LLM Agents
2024年12月19日 · In this paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349 …
LLM Safety论文推介 - 知乎
2 天之前 · 该系列将定期更新arxiv上有关Safety的paper,将会不定时更新,旨在帮助为LLM Safety领域的研究者推送最新的研究进展,并进行快速了解 。 此外,我们也将会在GitHub上 …
[2412.17686] Large Language Model Safety: A Holistic Survey
2024年12月23日 · This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial …
LLM Safety 最新论文推介 - 2024.1.10 - 知乎 - 知乎专栏
2024年1月10日 · 该系列将定期更新arxiv上有关Safety的paper,将会不定时更新,旨在帮助为 LLM Safety 领域的研究者推送最新的研究进展,并进行快速了解 。 此外,我们也将会在 …
Improving LLM Safety Alignment with Dual-Objective Optimization
2025年3月6日 · Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely …
- 某些结果已被删除