The Truth About Open Source in the Large Language Model Field

How much open source is really happening in the large language model domain?

When discussing open-source large language models, we mean artificial intelligence language systems that can be freely obtained, used, and modified. These models are trained on massive amounts of text data, can understand and generate human language, and provide a foundation for a wide range of applications. The discussion that follows covers their technical characteristics, development trends, application potential, and impact on the field of artificial intelligence.

Open source software development typically follows the principles of reciprocal cooperation and peer production, driving improvements in shared code, communication channels, and the surrounding developer communities. Typical examples include Linux and Mozilla Firefox.

Closed source software (proprietary software) does not disclose its source code, for commercial or other reasons, providing only machine-readable programs (such as binaries). The source code is possessed and controlled solely by the developer. Typical examples include Windows and macOS.

Open source is a software development model based on openness, sharing, and collaboration, encouraging everyone to participate in software development and improvement, driving continuous technological progress and widespread application.

Software developed under a closed source model is more likely to become a stable, focused product, but closed source software usually costs money, and if it has bugs or missing features, users can only wait for the vendor to fix them.

As for what counts as an open source large model, the industry has not reached the kind of clear consensus that exists for open source software.

The open sourcing of large language models and software open sourcing are similar in concept, both based on openness, sharing, and collaboration, encouraging community participation in development and improvement, promoting technological progress and increasing transparency.

However, there are significant differences in implementation and requirements.

Software open sourcing mainly targets applications and tools, with lower resource requirements for open sourcing, while open sourcing large language models involves large amounts of computational resources and high-quality data, and may have more usage restrictions. Therefore, although both aim to promote innovation and technology dissemination, open sourcing large language models faces more complexity, and the forms of community contribution also differ.

Baidu CEO Li Yanhong (Robin Li) also emphasized the difference between the two, stating that model open sourcing is not equivalent to code open sourcing: "Model open sourcing only provides a set of parameters, which still require SFT (supervised fine-tuning) and safety alignment. Even with the corresponding source code, one does not know what proportion and what kinds of data were used to train those parameters, making it impossible to pool everyone's efforts. Having these things does not let you stand on the shoulders of giants for iterative development."

Full-process open sourcing of a large language model means making the entire development pipeline transparent, from data collection and model design through training and deployment. This includes not only disclosing the datasets and opening the model architecture, but also sharing the training code and releasing the pre-trained model weights.
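The components listed above can be sketched as a simple checklist. This is only an illustration; the `ReleaseChecklist` fields and the `open_fraction` helper are made-up names, not any standard schema:

```python
from dataclasses import dataclass, fields

@dataclass
class ReleaseChecklist:
    """One boolean per component of a full-process open-source release."""
    training_data: bool   # datasets disclosed
    architecture: bool    # model design/architecture published
    training_code: bool   # training-process code shared
    weights: bool         # pre-trained weights released
    deployment: bool      # inference/deployment code available

def open_fraction(release: ReleaseChecklist) -> float:
    """Fraction of components that are actually open (0.0 to 1.0)."""
    flags = [getattr(release, f.name) for f in fields(release)]
    return sum(flags) / len(flags)

# A typical "open weights" release: weights and model code are public,
# but training data and training code are not.
open_weights_only = ReleaseChecklist(
    training_data=False, architecture=True,
    training_code=False, weights=True, deployment=True,
)
print(open_fraction(open_weights_only))  # 0.6
```

Framed this way, "open source" claims that cover only some of these components score well below a full-process release, which is exactly the gap the research below examines.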

Over the past year, the number of large language models has increased significantly, with many claiming to be open source, but how open are they really?

AI researcher Andreas Liesenfeld and computational linguist Mark Dingemanse, both of Radboud University in the Netherlands, similarly found that although the term "open source" is widely used, many models are at best "open weights," with most other aspects of how the systems were built kept hidden.

For example, tech giants like Meta and Microsoft label their large language models as "open source" but do not disclose important information about the underlying technology. Surprisingly, AI companies and institutions with far fewer resources often did considerably better.

The research team analyzed a series of popular "open source" large language model projects, evaluating their actual openness from multiple aspects such as code, data, weights, API, and documentation. The study also used OpenAI's ChatGPT as a closed-source reference point to highlight the true state of "open source" projects.

Legend: ✔ = open, ~ = partially open, X = closed

The results show significant differences between projects. By this ranking, the Allen Institute for AI's OLMo is the most open model, followed by BigScience's BloomZ; both were developed by non-profit organizations.
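To make the ranking concrete, the study's symbols can be mapped to numbers and averaged per project. The marks below are illustrative placeholders, not the paper's actual ratings:

```python
# Map the legend's symbols to numbers: ✔ = 1 (open), ~ = 0.5 (partial), X = 0 (closed).
SCORES = {"✔": 1.0, "~": 0.5, "X": 0.0}

DIMENSIONS = ["code", "data", "weights", "api", "documentation"]

def openness_score(marks: dict) -> float:
    """Average openness of one project across the evaluated dimensions."""
    return sum(SCORES[marks[d]] for d in DIMENSIONS) / len(DIMENSIONS)

# Illustrative marks only -- not the study's real data.
models = {
    "model_a": {"code": "✔", "data": "✔", "weights": "✔", "api": "✔", "documentation": "✔"},
    "model_b": {"code": "~", "data": "X", "weights": "✔", "api": "✔", "documentation": "~"},
    "model_c": {"code": "X", "data": "X", "weights": "X", "api": "✔", "documentation": "~"},
}

ranking = sorted(models, key=lambda m: openness_score(models[m]), reverse=True)
print(ranking)  # most open first: ['model_a', 'model_b', 'model_c']
```

An "open weights" release scores well on the weights dimension alone, which is why averaging across all dimensions separates genuinely open projects from ones that open only a single component.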

The paper states that although Meta's Llama and Google DeepMind's Gemma claim to be open source or open, they are actually only open weights. External researchers can access and use pre-trained models but cannot inspect or customize the models, nor do they know how the models are fine-tuned for specific tasks.

The recent releases of Llama 3 and Mistral Large 2 have attracted widespread attention. In terms of openness, Llama 3's model weights are public, and users can access and use both the pre-trained and instruction-tuned weights. Meta has also released some basic code for pre-training and instruction fine-tuning, but not the complete training code, and the training data for Llama 3 has likewise not been made public. However, this time Meta did publish a 93-page technical report on Llama 3.1 405B.

The situation with Mistral Large 2 is similar, maintaining a high degree of openness in terms of model weights and API, but with a lower degree of openness in complete code and training data. It has adopted a strategy that balances commercial interests and openness, allowing research use but with restrictions on commercial use.

Google stated that the company is "very precise in language" when describing models, referring to Gemma as open rather than open source. "Existing open source concepts don't always directly apply to AI systems," they said.

An important backdrop to this research is the EU's AI Act: once in effect, it will apply more lenient rules to models classified as open, which makes the definition of "open source" all the more consequential.

The researchers say that the only way to innovate on top of these systems is by adjusting the models, which requires enough information to build one's own version. Models must also be open to scrutiny: if a model was trained on a large number of test samples, for example, then passing a particular test is no real achievement.

They are also delighted by the emergence of so many open source alternatives. ChatGPT is so popular that it's easy to forget that we know nothing about its training data or other behind-the-scenes methods. This is a hidden danger for those who want to better understand the model or build applications based on it, while open source alternatives make critical foundational research possible.

Silicon Humans also compiled statistics on the open source status of several Chinese large language models:

From the table, we can see that, similar to the overseas situation, models that are more thoroughly open source are basically led by research institutions. This is mainly because research institutions aim to promote scientific research progress and industry development, and are more inclined to open up their research results.

Commercial companies, on the other hand, use their resource advantages to develop more powerful models and gain advantages in competition through appropriate open source strategies.

From BERT to Llama, open source has brought important momentum to the large model ecosystem.

By making their architectures and training methods public, researchers and developers can further explore and improve on these foundations, giving rise to more cutting-edge technologies and applications.

The emergence of open source large models has significantly lowered the threshold for development. Developers and small and medium-sized enterprises can utilize these advanced AI technologies without having to build models from scratch, thus saving a lot of time and resources. This has enabled more innovative projects and products to be quickly implemented, driving the development of the entire industry. Developers actively sharing optimization methods and application cases on open source platforms have also promoted technology maturity and application.

For education and scientific research, open source large language models provide valuable resources. Students and novice developers can quickly master advanced AI technologies by studying and using these models, shortening the learning curve and injecting fresh blood into the industry.

However, the openness of a large language model is not a simple binary property. Transformer-based systems and their training processes are extremely complex and hard to classify as simply open or closed. "Open source" for large models is not a single label but a spectrum, ranging from fully open to partially open in varying degrees.

Open sourcing large language models is a complex and meticulous task, and not all models must be open source.

Full open sourcing should not be demanded as if it were a moral obligation: it involves substantial technical, resource, and security considerations, and requires balancing openness against security, and innovation against responsibility. As in other areas of technology, diverse forms of contribution can build a richer ecosystem.

The relationship between open source and closed source models may be analogous to the coexistence of open source and closed source software in the software industry.

Open source models promote the widespread dissemination and innovation of technology, while closed source models provide more professional and secure solutions in specific fields. The two complement each other and jointly promote the development of artificial intelligence technology.

In the future, we may see more hybrid models emerge, such as partial open source or conditional open source, to balance technology sharing and commercial interests.

Whether open source or closed source, it is important to ensure the safety, reliability, and ethics of the model. This requires joint efforts from the industry, academia, and regulatory agencies to develop appropriate standards and norms to ensure the healthy development of AI technology.

Overall, open source and closed source large language models each have their advantages and limitations, and their coexistence and competition will drive the entire AI industry forward, bringing more choices and better experiences to users.
