- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 192
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  Paper • 2311.16502 • Published • 35
- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 26
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  Paper • 2404.06654 • Published • 35
Collections
Collections including paper arxiv:2502.13595
- MMTEB: Massive Multilingual Text Embedding Benchmark
  Paper • 2502.13595 • Published • 32
- MTEB: Massive Text Embedding Benchmark
  Paper • 2210.07316 • Published • 6
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding
  Paper • 2406.02396 • Published
- Extending the Massive Text Embedding Benchmark to French
  Paper • 2405.20468 • Published • 2

- Self-Boosting Large Language Models with Synthetic Preference Data
  Paper • 2410.06961 • Published • 16
- Qwen2.5 Technical Report
  Paper • 2412.15115 • Published • 352
- SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
  Paper • 2412.13649 • Published • 20
- NeoBERT: A Next-Generation BERT
  Paper • 2502.19587 • Published • 38

- CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
  Paper • 2406.08587 • Published • 16
- Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
  Paper • 2406.09170 • Published • 27
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
  Paper • 2407.18901 • Published • 33
- Benchmarking Agentic Workflow Generation
  Paper • 2410.07869 • Published • 26

- LoRA+: Efficient Low Rank Adaptation of Large Models
  Paper • 2402.12354 • Published • 6
- The FinBen: An Holistic Financial Benchmark for Large Language Models
  Paper • 2402.12659 • Published • 21
- TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
  Paper • 2402.13249 • Published • 13
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 69