
lm-evaluation-harness/lm_eval/tasks/mmlu/README.md at main
mmlu: the original multiple-choice MMLU benchmark; mmlu_continuation: MMLU with continuation-style prompts; mmlu_generation: MMLU with generation-style prompts; MMLU is the original benchmark ...
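For context, a minimal sketch of running one of these task variants through the harness's Python API; it assumes lm-evaluation-harness is installed as the `lm_eval` package, and the model name is only a placeholder:

```python
# Hedged sketch: assumes the `lm_eval` package (lm-evaluation-harness) and a
# HuggingFace-format checkpoint; "gpt2" is only a placeholder model here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",
    tasks=["mmlu"],   # swap in the continuation/generation variants listed above
    num_fewshot=5,    # MMLU is conventionally reported 5-shot
    batch_size=8,
)
print(results["results"])  # per-subject and aggregate MMLU scores
```

The same run can be launched from the command line with `lm_eval --model hf --model_args pretrained=... --tasks mmlu --num_fewshot 5`; exact task names should be checked against the task README above.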
lm-evaluation-harness/lm_eval/tasks/mmlu_pro/README.md at …
Title: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. Abstract: In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, …
Measuring Massive Multitask Language Understanding - GitHub
This is the repository for Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This repository contains OpenAI API evaluation code, and the test is available for download here ...
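The benchmark data can also be loaded programmatically. A minimal sketch, assuming the community mirror `cais/mmlu` on the Hugging Face Hub (an assumption; the repository itself links a downloadable tarball):

```python
# Hedged sketch: assumes the `datasets` library and the community mirror
# "cais/mmlu" on the Hugging Face Hub; the original repo links a tarball instead.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all", split="test")
example = mmlu[0]
# Each row carries a question, four answer choices, and the index of the correct one.
print(example["question"], example["choices"], example["answer"])
```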
GitHub - aryopg/mmlu-redux
We fine-tune Llama-3 (8B-Instruct) using the LabelChaos datasets. To balance the distribution, where most instances are labelled as "correct", we adjust the label distribution to: 0.1 (Wrong Ground Truth), 0.1 (Poor Question Clarity), 0.1 (No Correct Answers), 0.1 (Unclear Options), 0.1 (Multiple Correct Answers), and 0.5 (correct). The training involves 2048 steps, with a batch …
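To make the rebalancing concrete, here is a hedged sketch of resampling a labelled pool to the stated target mix; the label strings, field names, and helper are hypothetical and not taken from the mmlu-redux code:

```python
# Hypothetical sketch of rebalancing a labelled pool to the target distribution
# quoted above; labels and field names are illustrative, not from mmlu-redux.
import random

TARGET = {
    "wrong_ground_truth": 0.1,
    "poor_question_clarity": 0.1,
    "no_correct_answers": 0.1,
    "unclear_options": 0.1,
    "multiple_correct_answers": 0.1,
    "correct": 0.5,
}

def rebalance(pool, n_total, seed=0):
    """Sample n_total items so each label roughly hits its target share."""
    rng = random.Random(seed)
    by_label = {}
    for item in pool:
        by_label.setdefault(item["label"], []).append(item)
    sampled = []
    for label, share in TARGET.items():
        candidates = by_label.get(label, [])
        k = min(len(candidates), round(share * n_total))
        sampled.extend(rng.sample(candidates, k))
    rng.shuffle(sampled)
    return sampled
```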
MMLU_Chinese/README.md at master - GitHub
Measuring Massive Multitask Language Understanding | ICLR 2021 - MMLU_Chinese/README.md at master · chaoswork/MMLU_Chinese
GitHub - VILA-Lab/Mobile-MMLU: Mobile-MMLU: A Mobile …
Mobile-MMLU is a comprehensive benchmark designed to evaluate mobile-compatible Large Language Models (LLMs) across 80 diverse fields including Education, Healthcare, and Technology. Our benchmark is redefining mobile intelligence evaluation for a smarter future, with a focus on real-world ...
GitHub - MoonshotAI/Moonlight
Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but its scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully …
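As a rough illustration only (the snippet is truncated, and this is not the MoonshotAI implementation), a toy Muon-style step: approximately orthogonalize the momentum of a 2-D weight matrix with Newton-Schulz iterations and apply the update together with decoupled weight decay:

```python
# Toy sketch of a Muon-style update (NOT the Moonlight implementation):
# orthogonalize the momentum of a 2-D weight matrix, then step with
# decoupled weight decay. Real implementations use a tuned quintic iteration
# and additional per-matrix scaling.
import numpy as np

def newton_schulz_orthogonalize(m, steps=5, eps=1e-7):
    """Approximately map m to an orthogonal(-ish) matrix via Newton-Schulz."""
    x = m / (np.linalg.norm(m) + eps)     # normalize so singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        a = x @ x.T
        x = 1.5 * x - 0.5 * a @ x         # basic cubic Newton-Schulz iteration
    return x.T if transposed else x

def muon_style_step(w, grad, momentum, lr=0.02, beta=0.95, weight_decay=0.1):
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    w = w - lr * (update + weight_decay * w)   # decoupled weight decay
    return w, momentum
```

This is only meant to show where the weight-decay term enters the orthogonalized update; the second scaling technique is truncated in the snippet above and is not reflected here.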
TIGER-AI-Lab/MMLU-Pro - GitHub
We introduce MMLU-Pro, an enhanced benchmark designed to evaluate language understanding models across broader and more challenging tasks. Building on the Massive Multitask Language Understanding (MMLU) dataset, MMLU-Pro integrates more challenging, reasoning-focused questions and increases the answer choices per question from four to ten, significantly raising …
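A hedged loading sketch for the ten-option format, assuming the dataset is published as "TIGER-Lab/MMLU-Pro" on the Hugging Face Hub (the dataset ID and field names are assumptions for illustration):

```python
# Hedged sketch: assumes the `datasets` library and the "TIGER-Lab/MMLU-Pro"
# dataset on the Hugging Face Hub; field names are assumptions for illustration.
from datasets import load_dataset
from string import ascii_uppercase

mmlu_pro = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
row = mmlu_pro[0]

# Format a prompt with up to ten lettered options (A-J) instead of MMLU's four.
prompt = row["question"] + "\n" + "\n".join(
    f"{ascii_uppercase[i]}. {opt}" for i, opt in enumerate(row["options"])
)
print(prompt)
print("Gold answer:", row["answer"])
```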
GitHub - deepseek-ai/DeepSeek-V3
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2.