
ben burtenshaw

burtenshaw

Recent Activity

updated a dataset 4 minutes ago
agents-course/certificates
updated a dataset 31 minutes ago
reasoning-course/certificates
updated a dataset 37 minutes ago
reasoning-course/certificates

Organizations

Hugging Face, Hugging Face Course, Argilla, Blog-explorers, MLX Community, distilabel-internal-testing, Data Is Better Together, Social Post Explorers, Hugging Face Discord Community, argilla-internal-testing, Open Human Feedback, Argilla Warehouse, uplimit, open/ acc, Data Is Better Together Contributor, Open Source AI Research Community, FeeL (Feedback Loop), Hugging Face Agents Course, Agents Course Students, Agents Course Finishers, Open R1, Hugging Face Reasoning Course

Posts 24

Post
The Open LLM Leaderboard is completed, retired, dead, ‘ascended to a higher plane’. And in its shadow we have an amazing range of leaderboards built and maintained by the community.

In this post, I just want to list some of those great leaderboards that you should bookmark for staying up to date:

- Chatbot Arena LLM Leaderboard is the first port of call for checking out the best model. It’s not the fastest to update, because humans need to use the models before scores are available, but it’s worth the wait. lmarena-ai/chatbot-arena-leaderboard

- OpenVLM Leaderboard is great for getting scores on vision language models. opencompass/open_vlm_leaderboard

- Ai2 are doing a great job on RewardBench and I hope they keep it up because reward models are the unsexy workhorse of the field. allenai/reward-bench

- The GAIA leaderboard is great for evaluating agent applications. gaia-benchmark/leaderboard

🤩 This seems like such a sustainable way of building for the long term, where rather than leaning on a single company to evaluate all LLMs, we share the load.

Articles 13

Article

Making Gemma 3 think