Publications

Llemma: An Open Language Model For Mathematics

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck
16 Oct 2023
ICLR 2024 (poster) and MATH-AI Workshop at NeurIPS 2023 (poster)
One year ago, Google published Minerva, a powerful LLM capable of impressive mathematical reasoning. Minerva is not publicly accessible, preventing researchers from building on these advances.
Llemma is a family of open-source language models for mathematics, including Llemma 34B, an LLM that reaches performance comparable to Minerva 62B. The Llemma models are obtained by continuing the training of Code Llama models on Proof-Pile-2, a 55B-token dataset of mathematical text and code. Llemma exceeds Minerva’s problem-solving performance at equal parameter counts, while covering a wider distribution of tasks, including tool use and formal mathematics.
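Since the checkpoints are openly released, trying the model takes only a few lines. Below is a minimal sketch of prompting Llemma through Hugging Face transformers; the repository id EleutherAI/llemma_7b is our best guess at the published checkpoint name, so check the model link below for the exact identifier.

```python
# Minimal sketch: prompting Llemma via Hugging Face transformers.
# The repo id "EleutherAI/llemma_7b" is an assumption; a 34B checkpoint
# is also described in the paper. Verify ids via the model link below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/llemma_7b"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit on a single GPU
    device_map="auto",           # requires the `accelerate` package
)

prompt = "Problem: What is the derivative of sin(x)?\nSolution:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```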
For more information on Llemma, please read the paper and the Twitter thread!
📄 Paper | 🤗 Model | 🤗 Dataset | Code | Thread

OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, Jimmy Ba
10 Oct 2023
ICLR 2024 (poster) and MATH-AI Workshop at NeurIPS 2023 (oral & poster)
OpenWebMath is a large dataset of high-quality mathematical text extracted from the web.
One of the key ingredients in Google’s Minerva model is a closed dataset of mathematical documents scraped from the web, something that isn’t available to the academic community.
OpenWebMath is an open-source replication of this dataset, bringing 14.7B new math tokens to the community. It includes text from popular reference sites (Wikipedia, nLab), forums (MathHelpForum, MathOverflow), blogs (WordPress, Blogspot), and more!
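For readers who want to poke at the data without downloading the full corpus, here is a minimal sketch that streams a few documents with the Hugging Face datasets library; the dataset id open-web-math/open-web-math and the url/text field names are assumptions, so check the dataset link below for the exact layout.

```python
# Minimal sketch: streaming a few OpenWebMath documents.
# The dataset id "open-web-math/open-web-math" and the "url"/"text"
# field names are assumptions; verify via the dataset link below.
from datasets import load_dataset

# streaming=True iterates over records without fetching the whole corpus
ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)

for i, doc in enumerate(ds):
    print(doc["url"])         # source page the text was extracted from
    print(doc["text"][:200])  # first characters of the math document
    if i >= 2:
        break
```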
For more information on OpenWebMath, please read the paper and the Twitter thread!
📄 Paper | 🤗 Dataset | Code | Thread