
ydyjya/Awesome-LLM-Safety - GitHub
We've curated a collection of the latest 😋, most comprehensive 😎, and most valuable 🤩 resources on large language model safety (llm-safety). Beyond papers, the collection also includes relevant talks, tutorials, conferences, news, and articles.
LLM Safety Latest Paper Recommendations - 2025.03.12 - Zhihu Column
3 days ago · Keywords: Long-Context Safety & LLM Evaluation & Safety Benchmark. Abstract: As large language models (LLMs) advance on long-text understanding and generation tasks, the safety issues introduced by long contexts are becoming increasingly apparent. However, research on the safety of long-context tasks is still at an early stage, lacking systematic evaluation methods and improvement strategies.
LLM Safety Latest Paper Recommendations - 2024.3.22 - Zhihu Column
March 22, 2024 · This article introduces EasyJailbreak, a unified framework that simplifies the construction and evaluation of jailbreak attacks against LLMs. It assembles jailbreak attacks from four components: a selector, a mutator, a constraint, and an evaluator, so that researchers can easily build attacks by combining new and existing components. EasyJailbreak currently supports 11 different jailbreak methods and facilitates safety validation across a broad range of LLMs. Validation on 10 different LLMs revealed significant safety vulnerabilities, with an average attack success rate of 60%. Notably, even advanced models such as GPT-3.5-Turbo and GPT-4 exhibit …
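The four-component design can be pictured as a simple search loop over candidate prompts. The sketch below is a minimal, hypothetical illustration of how a selector, mutator, constraint, and evaluator might be composed; the class and function names are assumptions for illustration and are not EasyJailbreak's actual API.

```python
# Hypothetical selector -> mutator -> constraint -> evaluator pipeline.
# Names are illustrative only, not EasyJailbreak's real interface.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    prompt: str          # current jailbreak prompt
    score: float = 0.0   # attack-success score assigned by the evaluator


def attack_loop(
    seeds: List[Candidate],
    select: Callable[[List[Candidate]], List[Candidate]],    # picks promising candidates
    mutate: Callable[[Candidate], List[Candidate]],          # rewrites/expands a prompt
    constrain: Callable[[List[Candidate]], List[Candidate]], # filters invalid prompts
    evaluate: Callable[[Candidate], float],                  # queries the target LLM and scores it
    rounds: int = 5,
) -> Candidate:
    pool = list(seeds)
    for _ in range(rounds):
        chosen = select(pool)                              # 1. selector
        mutated = [m for c in chosen for m in mutate(c)]   # 2. mutator
        valid = constrain(mutated)                         # 3. constraint
        for cand in valid:                                 # 4. evaluator
            cand.score = evaluate(cand)
        pool.extend(valid)
    return max(pool, key=lambda c: c.score)
```

Swapping any one of the four callables yields a new attack variant, which is the combinatorial reuse the framework description emphasizes.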
tjunlp-lab/Awesome-LLM-Safety-Papers - GitHub
This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks.
GitHub - thu-coai/SafetyBench: Official github repo for …
SafetyBench is a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. SafetyBench also incorporates both Chinese and English data, …
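Because SafetyBench is multiple-choice, evaluation reduces to comparing the model's chosen option letter against the gold answer, aggregated per safety category. The following is a minimal sketch under the assumption of a JSON-lines file with `question`, `options`, `answer`, and `category` fields; the field names and the `ask_model` stub are assumptions, not the benchmark's actual schema or tooling.

```python
# Minimal sketch: per-category accuracy on multiple-choice safety questions.
import json
from collections import defaultdict


def ask_model(question: str, options: list[str]) -> str:
    """Stub: send the question and options to an LLM, return its option letter."""
    return "A"  # replace with a real model call


def evaluate(path: str) -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            pred = ask_model(item["question"], item["options"])
            total[item["category"]] += 1
            if pred == item["answer"]:
                correct[item["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```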
Agent-SafetyBench: Evaluating the Safety of LLM Agents
2024年12月19日 · In this paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions.
LLM Safety Paper Recommendations - Zhihu
2 days ago · This series posts Safety-related papers from arXiv and is updated at irregular intervals, aiming to bring the latest research progress to researchers in the LLM Safety field so they can get up to speed quickly. In addition, we also maintain a Safety-related repo on GitHub, which collects classic LLM Safety papers and other materials and is kept in sync with the latest …
[2412.17686] Large Language Model Safety: A Holistic Survey
2024年12月23日 · This survey provides a comprehensive overview of the current landscape of LLM safety, covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks.
LLM Safety Latest Paper Recommendations - 2024.1.10 - Zhihu Column
January 10, 2024 · This series posts Safety-related papers from arXiv and is updated at irregular intervals, aiming to bring the latest research progress to researchers in the LLM Safety field so they can get up to speed quickly. In addition, we also maintain a Safety-related repo on GitHub, which collects classic LLM Safety papers and other materials and is kept in sync with the latest paper information; the address is below ⬇️
Improving LLM Safety Alignment with Dual-Objective Optimization
2025年3月6日 · Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and ...
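For context, the loss the abstract criticizes is the standard DPO preference objective shown below (the paper's dual-objective refinement is not reproduced here). Here $\pi_\theta$ is the policy being aligned, $\pi_{\mathrm{ref}}$ a frozen reference model, $(y_w, y_l)$ a preferred/dispreferred response pair (for safety alignment, typically a refusal versus a harmful completion), and $\beta$ a temperature:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```

The loss only increases the margin between the two responses, so the probability of the unsafe completion can stay high in absolute terms, which is one reason a single preference margin can be suboptimal for refusal learning.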