In the AI era, data has become a key resource. As human-generated data is gradually exhausted, synthetic data is widely seen as the way forward, but it must be used with caution.
A recent paper featured on the cover of Nature found that training models exclusively on AI-generated content can lead to model collapse. This sparked widespread discussion in the AI community, with many arguing that the core issue is data quality rather than synthetic data itself.
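The collapse dynamic can be illustrated with a toy simulation (a sketch of the general idea, not the Nature paper's actual experiment): repeatedly fit a simple Gaussian "model" to a dataset, then throw the dataset away and keep only the model's own samples. Over many generations the estimated spread shrinks, and the tails of the original distribution are forgotten.

```python
import random
import statistics

random.seed(0)

def fit(data):
    # "Train" a toy model: estimate the mean and spread of the data.
    return statistics.mean(data), statistics.pstdev(data)

def generate(mu, sigma, n):
    # "Sample" synthetic data from the fitted model.
    return [random.gauss(mu, sigma) for _ in range(n)]

data = generate(0.0, 1.0, 50)        # generation 0: "real" data from N(0, 1)
for generation in range(2000):
    mu, sigma = fit(data)            # retrain on the current dataset
    data = generate(mu, sigma, 50)   # discard it; keep only model output

_, final_sigma = fit(data)
print(final_sigma)  # the spread collapses toward 0 over generations
```

The mechanism is simple: each refit loses a little of the distribution's variance to sampling error, and with no fresh real data there is nothing to restore it, so the losses compound.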
To avoid model collapse, experts have proposed the following suggestions:
- Use hybrid data. The CEO of Scale AI argues that purely synthetic data brings no information gain, and recommends combining real-world data, human expert involvement, and formal logic engines.
- Adopt reinforcement learning methods. Researchers from Meta and other institutions proposed a "rank-prune feedback" method to recover and enhance model performance.
- Utilize human supervision. Research shows that filtering for high-quality data under human supervision is more effective and cost-efficient than direct manual labeling.
- Incorporate real data. In experiments, relying solely on generated data led to performance degradation, while combining real data with feedback improved performance.
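The last point can be demonstrated with the same kind of toy Gaussian setup (again a hedged sketch, not any paper's exact protocol): if each generation retrains on a hybrid dataset of half fresh real samples and half model-generated samples, the fitted spread stays anchored near the true value instead of collapsing.

```python
import random
import statistics

random.seed(0)

REAL_MU, REAL_SIGMA = 0.0, 1.0

def fit(data):
    # "Train" a toy model: estimate the mean and spread of the data.
    return statistics.mean(data), statistics.pstdev(data)

def sample_real(n):
    # Fresh draws from the true distribution (stands in for real-world data).
    return [random.gauss(REAL_MU, REAL_SIGMA) for _ in range(n)]

def sample_synthetic(mu, sigma, n):
    # Draws from the current fitted model (stands in for generated data).
    return [random.gauss(mu, sigma) for _ in range(n)]

data = sample_real(100)
for generation in range(2000):
    mu, sigma = fit(data)
    # Hybrid data: 50% real, 50% model-generated, every generation.
    data = sample_real(50) + sample_synthetic(mu, sigma, 50)

_, final_sigma = fit(data)
print(final_sigma)  # stays close to the true spread of 1.0
```

Because half of each training set is re-anchored to the real distribution, sampling losses no longer compound; the estimate is pulled back toward the truth every generation.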
Overall, synthetic data does have potential, but it must be used cautiously and combined with other methods to avoid model collapse and deliver real performance gains. The likely path forward is a combination of hybrid data and reinforcement learning.