Apple has entered the open-source large language model arena with a more open approach than most of its peers: it released a 7B-parameter model that performs comparably to Llama 3 8B while open-sourcing the entire training process and resources.
This move comes after recent criticism from Nature editor Elizabeth Gibney that many supposedly open-source AI models lack transparency in data and training methods, failing to meet true scientific research needs. Apple's release addresses these concerns head-on.
The NLP scientist who created AutoAWQ expressed amazement, noting that Apple not only released a model that beats Mistral 7B but also open-sourced everything, including the pre-training dataset.
The significance of this open-source release was highlighted by a community member:
> For anyone looking to train models from scratch or fine-tune existing ones, the data management process is essential to study.
In addition to Apple's release, Mistral AI partnered with NVIDIA to launch a 12B parameter small model last week. The HuggingFace founder declared it "small model week".
Apple's new small model shows impressive capabilities:
- 7B base model trained on open datasets using 2.5T tokens
- Primarily English data, with a 2048-token context window
- Datasets include DCLM-BASELINE, StarCoder and ProofPile2
- MMLU score approaching that of Llama 3 8B
- Trained with the PyTorch and OpenLM frameworks (see the loading sketch below)
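For readers who want to try the checkpoint, here is a minimal loading sketch, assuming the model is published on Hugging Face under a repo id like "apple/DCLM-7B" and loads through the standard transformers API; the official release may additionally require Apple's OpenLM package, so check the model card:

```python
# Minimal sketch: loading the DCLM 7B checkpoint with Hugging Face transformers.
# The repo id "apple/DCLM-7B" is an assumption, and the release may require
# installing the OpenLM package first -- consult the official model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "apple/DCLM-7B"  # assumed repo id, not confirmed by this article

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = "Open-sourcing a pre-training dataset matters because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```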
The research team introduced DCLM (DataComp for Language Models), a new benchmark for comparing language model training data. Their findings suggest that automatically filtering and selecting high-quality data from larger pools with machine-learning models may be the key to building strong training sets.
Using DCLM, they built a high-quality dataset, DCLM-BASELINE, and used it to train the 7B-parameter DCLM-7B model from scratch.
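The core idea, keeping only the documents a learned quality model scores highly, can be sketched as follows. This is an illustrative sketch only: the label names, training file, and 10% keep-rate are invented for the example, and the actual DCLM pipeline details are described in the paper.

```python
# Illustrative sketch of model-based quality filtering: train a lightweight
# fastText classifier on "high-quality" vs "ordinary" text, score every
# candidate document, and keep only the top-scoring fraction.
# File names, label names, and the keep fraction are made up for this example.
import fasttext

# quality_train.txt contains lines like:
#   __label__hq <text of a high-quality document>
#   __label__lq <text of an ordinary web document>
clf = fasttext.train_supervised(input="quality_train.txt")

def hq_score(doc: str) -> float:
    """Probability the classifier assigns to the high-quality label."""
    labels, probs = clf.predict(doc.replace("\n", " "), k=2)
    return float(dict(zip(labels, probs)).get("__label__hq", 0.0))

docs = [
    "A step-by-step derivation of the attention mechanism with examples.",
    "BUY NOW!!! click here click here best deals best deals",
    "Notes on numerical stability when implementing softmax.",
]
keep_fraction = 0.10  # illustrative threshold
ranked = sorted(docs, key=hq_score, reverse=True)
kept = ranked[: max(1, int(keep_fraction * len(ranked)))]
print(kept)
```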
DCLM-7B achieved 64% 5-shot accuracy on the MMLU benchmark, comparable to Mistral-7B-v0.3 (63%) and Llama 3 8B (66%). It also matched Llama 3 8B's average performance across 53 natural language understanding tasks while requiring only 1/6 of the compute.
Among similarly sized models, DCLM-7B's MMLU score surpasses that of Mistral-7B and approaches that of Llama 3 8B.
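The "1/6 of the compute" figure is consistent with a back-of-envelope calculation using the common C ≈ 6·N·D approximation for training FLOPs, assuming the roughly 15T-token training budget Meta reported for Llama 3 against DCLM-7B's 2.5T tokens:

```python
# Back-of-envelope check of the compute comparison using C ~= 6 * N * D.
# The 15T-token figure for Llama 3 8B is Meta's reported training budget;
# both numbers are rough, so treat the ratio as an order-of-magnitude check.
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

dclm_7b   = train_flops(7e9, 2.5e12)   # ~1.1e23 FLOPs
llama3_8b = train_flops(8e9, 15e12)    # ~7.2e23 FLOPs

print(f"DCLM-7B  : {dclm_7b:.2e} FLOPs")
print(f"Llama3-8B: {llama3_8b:.2e} FLOPs")
print(f"ratio    : {dclm_7b / llama3_8b:.2f}")  # ~0.15, roughly 1/6
```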
To test the new dataset's effectiveness, an industry professional trained GPT-2 1.5B using llm.c to compare DCLM-Baseline with FineWeb-Edu. Results showed DCLM-Baseline achieved higher average scores, performing better on tasks like ARC, HellaSwag, and MMLU.
The trend towards smaller models has been gaining momentum:
- HuggingFace launched the "SmolLM" family of small models (135M, 360M, 1.7B)
- OpenAI released GPT-4o mini, approaching GPT-4's capabilities at a lower cost
- Mistral AI and NVIDIA released the 12B parameter Mistral NeMo model
The shift towards smaller models is driven by their ability to match larger models on many tasks at a fraction of the cost. As the smol AI founder demonstrated, models like GPT-4o mini are priced well below larger alternatives.