An "oil crisis" is now sweeping the AI industry, with almost every AI company desperately hunting for new sources of language data. Yet no amount of data seems to satisfy the appetite of large models. Moreover, more and more content platforms have realized the value of their data and are beginning to guard it jealously. As a result, "synthetic data" has become a new direction of exploration for the entire AI industry.
For quite a long time, it was unclear whether synthetic data could actually work, until Dr. Thomas Scialom, an AI researcher at Meta, recently provided an answer. According to him, the post-training of Meta's open-source Llama 3 model relied on no human-written answers at all: it was based entirely on synthetic data generated by Llama 2.
When introducing the training details of Llama 3, Scialom described how extensively synthetic data was used across different scenarios: feedback on code execution, translation between programming languages, back-translation of documentation, question-answering over long texts, summarization of long documents, and reasoning about code repositories. This also helps explain how Meta's Llama 3, launched this spring, grew to a version exceeding 400 billion parameters and was trained on roughly seven times as much data as Llama 2.
Synthetic data generally refers to new data produced by algorithms that mimic the characteristics of real-world data. So how is this "bootstrapping" operation actually achieved? Two papers, from research teams at Meta and Microsoft, shed light on how large models can be trained with synthetic data. Meta calls large models trained this way "self-rewarding language models": the model itself generates training data, evaluates the quality of that data, and then uses it to train itself.
Self-rewarding language models are in effect an application of Reinforcement Learning from AI Feedback (RLAIF). Meta's specific approach is to first train an initial model on a small amount of manually annotated seed data, then have that model generate multiple candidate responses to each question and use the LLM-as-a-Judge technique to have the model score its own responses. New training data is assembled from these scores and used to continue training the model.
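To make the loop concrete, here is a minimal runnable sketch of one self-rewarding round, written in Python. Everything in it, the ToyLM class, its sample/judge/train_dpo methods, and the 0-to-5 rubric, is illustrative scaffolding rather than Meta's actual code; the published recipe runs several such rounds, training on the resulting preference pairs with iterative DPO.

```python
import random

class ToyLM:
    """A stand-in for a real LLM. Every name here is illustrative, not Meta's code."""
    def sample(self, prompt: str) -> str:
        # Pretend to generate a response; a real model would decode text here.
        return f"answer-{random.randint(0, 999)} to: {prompt[:30]}"

    def judge(self, prompt: str, response: str) -> float:
        # LLM-as-a-Judge: in the real recipe the model scores its own output
        # against a rubric prompt; here we fake a score on a 0-5 scale.
        return random.uniform(0.0, 5.0)

    def train_dpo(self, pairs: list) -> None:
        # Placeholder for a DPO-style preference-optimization step.
        print(f"training on {len(pairs)} preference pairs")

def self_reward_round(model: ToyLM, prompts: list[str], n_candidates: int = 4) -> None:
    """One self-rewarding round: generate candidates, self-score, train on pairs."""
    pairs = []
    for prompt in prompts:
        candidates = [model.sample(prompt) for _ in range(n_candidates)]
        ranked = sorted(candidates, key=lambda r: model.judge(prompt, r))
        # The highest- and lowest-scored answers form one (chosen, rejected)
        # preference pair, the raw material for the next training step.
        pairs.append((prompt, ranked[-1], ranked[0]))
    model.train_dpo(pairs)

self_reward_round(ToyLM(), ["What is 12 * 7?", "Explain photosynthesis."])
```

The key design point is that the same model plays both roles, generator and judge, so each round of training can in principle improve both its answers and its ability to evaluate them.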
The most important part of this process is enabling the large model to generate and evaluate new instructions from examples and add them to its own training set. Since computers operate on numbers rather than human language, researchers first convert text into numerical vectors that capture its meaning, known as "text embeddings." Microsoft's research team, for example, defined a series of text-embedding tasks and designed specific prompts for those tasks to guide large language models in generating the corresponding data.
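To make "text embedding" concrete, the toy sketch below hashes words into a fixed-size vector, so that texts sharing words end up with similar vectors. It is a deliberately crude illustration: production embedding models, including those in Microsoft's paper, learn these vectors with neural networks rather than hashing.

```python
import math

def toy_embed(text: str, dim: int = 64) -> list[float]:
    """A toy hashing embedding. Real systems learn dense neural vectors, but
    the principle is the same: text becomes a fixed-size numeric vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0           # bucket each word into one slot
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]             # unit length, ready for cosine

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

query = toy_embed("how to train a language model")
near = toy_embed("training a language model")
far = toy_embed("best pizza recipes in naples")
print(cosine(query, near) > cosine(query, far))  # similar texts land closer
```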
The prompts the researchers craft combine two key elements: a question type and a role. Combining the role of a driver with math problems, for example, yields questions pitched at a primary- or secondary-school level, guiding the model to synthesize data from that perspective. This is the secret of self-rewarding language models. Afterwards, researchers only need to clean and format the generated data, removing duplicate content and correcting formatting errors so that it meets training needs.
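The sketch below illustrates that combination step, along with the cleaning pass described above. The role and task lists are invented for illustration; the taxonomies in Microsoft's paper are far larger and are themselves brainstormed by the model.

```python
import itertools

# Invented example lists: the actual role/task taxonomies in Microsoft's
# paper are far larger and are themselves generated by the model.
roles = ["a bus driver", "a nurse", "a primary-school teacher"]
tasks = ["a two-step arithmetic word problem",
         "a reading-comprehension question"]

def build_prompt(role: str, task: str) -> str:
    """Combine one role with one question type into a generation prompt."""
    return (f"You are {role}. Write {task} drawn from your daily work, then "
            f"give the correct answer. Reply as JSON with keys 'question' "
            f"and 'answer'.")

prompts = [build_prompt(r, t) for r, t in itertools.product(roles, tasks)]

def clean(samples: list[dict]) -> list[dict]:
    """The post-processing step the article describes: drop malformed rows
    and duplicate questions, keep only well-formed question/answer pairs."""
    seen, kept = set(), []
    for s in samples:
        q = s.get("question", "").strip()
        if q and s.get("answer") and q not in seen:  # format check + dedup
            seen.add(q)
            kept.append(s)
    return kept

print(prompts[0])  # one of the six role-by-task prompts sent to the generator
```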
The advantage of synthetic data is that it can mirror the mathematical and statistical properties of real data, and because it requires no manual annotation, it greatly reduces the human error introduced by data collection, data transfer, and inconsistent annotation standards. So the question arises: if synthetic data can solve the problem of scarce training data and the high costs that scarcity brings, why do many AI companies still prefer to mine or purchase human-generated data?
The most critical reason is that, despite carefully designed prompts and supervised training, the inherent biases and hallucinations of large language models can still introduce noise into the dataset. A model trained on erroneous, hallucinated, or biased synthetic data will fail to generalize to real-world scenarios. Models trained on synthetic data must avoid being "contaminated" by machine-generated content, a failure mode researchers call "model collapse"; the higher the proportion of synthetic data in the training set, the harder it becomes to improve natural language understanding.
Stanford professor Percy Liang, for example, has pointed out that synthetic data lacks the precious "humanity" of real data, so large models trained on it will not be enough to reach AGI. More importantly, synthetic data can verify or extend what humans already know, but it cannot reveal anything that does not exist in the initial dataset; its boundary is the boundary of that initial dataset.
So while it is entirely plausible that Meta trained Llama 3 on synthetic data generated by Llama 2, the company has not disclosed how much manpower and time the process actually cost. Synthetic data is indeed cheaper to produce than real data, but the cost of weeding out unqualified synthetic samples remains unknown.
If synthetic data were really cheaper than real data in all aspects, even with the problems of hallucinations and AI ethics, there would be no reason for major AI companies to continue focusing on human-generated data.