
lm-evaluation-harness/lm_eval/tasks/mmlu/README.md at main
mmlu: the original multiple-choice MMLU benchmark; mmlu_continuation: MMLU with continuation-style prompts; mmlu_generation: a generative variant of MMLU ...
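The multiple-choice style above can be illustrated with a toy scorer. This is a minimal sketch only: the loglikelihood function here is a dummy stand-in for a real model call, not the harness's actual implementation.

```python
# Sketch of harness-style multiple-choice scoring for one MMLU item.
# `loglikelihood` is a hypothetical stand-in for a real model query.

def loglikelihood(prompt: str, continuation: str) -> float:
    # Toy stand-in: fewer character mismatches with a fixed gold string
    # scores higher. A real harness would return log P(continuation | prompt).
    gold = " (B) Paris"
    return -sum(a != b for a, b in zip(continuation.ljust(len(gold)), gold))

def score_multiple_choice(question: str, choices: list[str]) -> int:
    # Multiple-choice style: compare the log-likelihood the model assigns
    # to each lettered answer option, then pick the argmax.
    scores = [loglikelihood(question, f" ({chr(65 + i)}) {c}")
              for i, c in enumerate(choices)]
    return max(range(len(choices)), key=scores.__getitem__)

question = "What is the capital of France?"
choices = ["London", "Paris", "Berlin", "Madrid"]
pred = score_multiple_choice(question, choices)
print(pred)  # → 1 ("Paris")
```

A continuation-style task differs only in what string is scored: instead of lettered options appended to a formatted question, the raw answer text is scored as a continuation of the passage.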
lm-evaluation-harness/lm_eval/tasks/mmlu_pro/README.md at …
Title: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. Abstract: In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, …
Measuring Massive Multitask Language Understanding - GitHub
This is the repository for Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This repository contains OpenAI API evaluation code, and the test is available for download here ...
Qwen/eval/EVALUATION.md at main · QwenLM/Qwen · GitHub
The official repository of Qwen (通义千问), the chat & pretrained large language model proposed by Alibaba Cloud.
MMLU-Pro/README.md at main · TIGER-AI-Lab/MMLU-Pro - GitHub
We introduce MMLU-Pro, an enhanced benchmark designed to evaluate language understanding models across broader and more challenging tasks. Building on the Massive Multitask Language Understanding (MMLU) dataset, MMLU-Pro integrates more challenging, reasoning-focused questions and increases the answer choices per question from four to ten, significantly raising …
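One concrete effect of expanding from four to ten answer choices is that the random-guess floor drops sharply, which widens the usable score range. A quick check (plain arithmetic, not taken from the repository):

```python
# Random-guess accuracy floor: 4-choice MMLU vs 10-choice MMLU-Pro.
mmlu_chance = 1 / 4        # original MMLU: four options per question
mmlu_pro_chance = 1 / 10   # MMLU-Pro: ten options per question

print(f"MMLU chance accuracy:     {mmlu_chance:.0%}")      # 25%
print(f"MMLU-Pro chance accuracy: {mmlu_pro_chance:.0%}")  # 10%
```

A lower chance floor means that scores just above 25%, which are indistinguishable from guessing on MMLU, become meaningful signal on MMLU-Pro.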
MMLU_Chinese/README.md at master - GitHub
Measuring Massive Multitask Language Understanding (ICLR 2021): a Chinese translation of the MMLU benchmark.
GitHub - VILA-Lab/Mobile-MMLU: Mobile-MMLU: A Mobile …
Mobile-MMLU is a comprehensive benchmark designed to evaluate mobile-compatible Large Language Models (LLMs) across 80 diverse fields, including Education, Healthcare, and Technology, with a focus on real-world ...
GitHub - nlp-waseda/JMMLU: Japanese Massive Multitask Language Understanding Bench …
Japanese Massive Multitask Language Understanding Benchmark. JMMLU consists of questions translated into Japanese from a subset of the multitask language understanding benchmark MMLU (Paper, GitHub) (translated questions), together with questions grounded in Japan's own cultural background (Japan-specific questions) ...
GitHub - aryopg/mmlu-redux
We fine-tune Llama-3 (8B-Instruct) using the LabelChaos datasets. Because most instances are labelled "correct", we rebalanced the label distribution to: 0.1 (Wrong Ground Truth), 0.1 (Poor Question Clarity), 0.1 (No Correct Answers), 0.1 (Unclear Options), 0.1 (Multiple Correct Answers), and 0.5 (Correct). The training involves 2048 steps, with a batch …
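That rebalancing can be reproduced with simple weighted resampling. A minimal sketch, assuming the six label names and target proportions quoted above; this is illustrative only, not the repository's actual preprocessing code:

```python
import random
from collections import Counter

# Target label distribution from the mmlu-redux / LabelChaos description.
target = {
    "Wrong Ground Truth": 0.1,
    "Poor Question Clarity": 0.1,
    "No Correct Answers": 0.1,
    "Unclear Options": 0.1,
    "Multiple Correct Answers": 0.1,
    "Correct": 0.5,
}

random.seed(0)
labels = list(target)
weights = [target[label] for label in labels]

# Resample 10,000 training labels so the mix matches the target distribution.
sample = random.choices(labels, weights=weights, k=10_000)
counts = Counter(sample)
print(round(counts["Correct"] / len(sample), 2))  # close to 0.5
```

In practice the same weights would drive which training instances are drawn from each error category, so the fine-tuned model sees far more non-"Correct" examples than their natural frequency would provide.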
Wang-ML-Lab/MMLU-SR - GitHub
This is the official repository for "MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models ...