Research has found that using AI-generated data to train AI models may lead to a phenomenon called "model collapse". The main conclusions are as follows:
- If a large amount of AI-generated content is used as training data, the model develops irreversible defects: low-probability events in the original content distribution disappear.
- This effect is called "model collapse", likened to inbreeding producing low-quality offspring.
- The researchers trained an initial model on Wikipedia articles, then trained successive generations of models on text generated by the previous generation's model.
- The results showed that output quality degraded rapidly as the generations accumulated:
  - Generation 0 already showed factual errors and strange symbols
  - By generation 5, output drifted into increasingly irrelevant content and garbled text
  - By generation 9, output had degenerated into complete gibberish
- This indicates that training models on AI-generated data leads to degradation over successive generations and eventual collapse.
- To avoid this outcome, training must draw on more high-quality human-generated data.
- As AI-generated content floods the internet, genuine human data will become harder to obtain, and therefore more valuable.
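The generational setup described above can be mimicked with a toy model. The sketch below is only an illustration, not the paper's actual method: a tiny bigram model stands in for the language model, and the corpus is invented. Each generation is trained solely on the previous generation's output, and the vocabulary tends to shrink across generations as rare word transitions stop being sampled.

```python
import random
from collections import defaultdict

random.seed(1)

# Hypothetical stand-in corpus; the real study used Wikipedia articles.
corpus = ("the quick brown fox jumps over the lazy dog "
          "the dog barks at the quick fox and the fox runs").split()

def train_bigram(words):
    """Record, for each word, the words observed to follow it."""
    model = defaultdict(list)
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return model

def generate(model, start, length=60):
    """Sample a word sequence by walking the bigram table."""
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:  # dead end: no observed successor
            break
        out.append(random.choice(successors))
    return out

words = corpus
for generation in range(10):
    model = train_bigram(words)
    # The next generation trains only on the previous generation's output.
    words = generate(model, start=words[0])

# Transitions never sampled in some generation can never reappear later.
print(f"original vocabulary: {len(set(corpus))}, "
      f"generation-10 vocabulary: {len(set(words))}")
```

Because each generation can only emit words its training data contained, the vocabulary can shrink but never grow, which is a miniature version of the irreversibility the study describes.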
In conclusion, this research warns of the risks of over-relying on AI-generated data for model training and underscores the importance of high-quality human data.
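The core mechanism behind the disappearance of low-probability events noted above can be seen in a few lines: repeatedly sampling from a distribution and refitting it from the samples means rare outcomes eventually land on a zero count, and once a probability is estimated as zero it can never recover. The token names and probabilities below are invented for illustration.

```python
import random

random.seed(0)

# Hypothetical token distribution: one common token, several rare ones.
dist = {"common": 0.90, "rare_a": 0.04, "rare_b": 0.03, "rare_c": 0.03}

def resample_and_refit(dist, n=50):
    """Draw n samples from dist, then re-estimate it from the sample counts."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    sample = random.choices(tokens, weights=weights, k=n)
    return {t: sample.count(t) / n for t in tokens}

for generation in range(20):
    dist = resample_and_refit(dist)

# A token whose estimated probability hits zero is gone for good.
survivors = [t for t, p in dist.items() if p > 0]
print(survivors)
```

With a small sample size per generation, the rare tokens typically die out within a few iterations while the common token persists, mirroring how the tails of the original data distribution vanish under repeated self-training.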