Model Collapse, also known as AI collapse or Habsburg AI, refers to a phenomenon in which machine learning models gradually degrade due to errors arising from uncurated training on synthetic data, that is, the outputs of another model, including prior versions of itself.[1][2][3]

Shumailov et al.[1] coined the term and described two stages of the degradation: early model collapse and late model collapse. In early model collapse, the model begins losing information about the tails of the distribution, mostly affecting minority data. Later work highlighted that early model collapse is hard to notice, since overall performance may appear to improve while the model loses performance on minority data.[4] In late model collapse, the model loses a significant proportion of its performance, confusing concepts and losing most of its variance.

Mechanism

Although synthetic data can in principle be indistinguishable from real data, in practice it is often biased, inaccurate, poorly representative of the real data, harmful, or presented out of context.[5][6] Using such data as training data leads to issues with the quality and reliability of the trained model.[7][8]

Model collapse occurs for three main reasons: functional approximation errors, sampling errors, and learning errors.[1] Importantly, it happens even in the simplest models, where not all of the error sources are present. In more complex models the errors often compound, leading to faster collapse.
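The sampling-error component alone is enough to drive collapse in a toy setting. The following minimal sketch (an illustration under assumed sample sizes, not the experimental setup of [1]) repeatedly refits a one-dimensional Gaussian to a finite sample drawn from its own previous fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Start from a standard normal "model". Each generation draws a finite sample
# from the previous generation's fitted model and refits by maximum likelihood,
# so sampling error accumulates across generations.
mu, sigma = 0.0, 1.0
sample_size = 100      # samples per generation (assumed value)
generations = 200

for g in range(generations):
    data = rng.normal(mu, sigma, size=sample_size)  # purely synthetic data
    mu, sigma = data.mean(), data.std()             # refit on that data alone
    if g % 50 == 0:
        print(f"generation {g:3d}: mean={mu:+.3f}, std={sigma:.3f}")

# Typical outcome: the fitted standard deviation drifts toward zero while the
# mean performs a random walk; information in the tails is lost first.
```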

Inevitability

Even the simplest models make clear that model collapse is not inevitable, but avoiding it carries a cost. In the Gaussian model,[1] for example, a superlinearly increasing amount of data is needed to curb model collapse. Later work[9] showed that collapse can also be bounded in other settings, though at a significant training cost: data must be accumulated and tracked over time.
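The difference between replacing the training data each generation and accumulating it can be seen in the same toy Gaussian setting (a sketch under assumed sample sizes, not the experiments of [9]):

```python
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(0.0, 1.0, size=1000)   # the original real data

def run(strategy, generations=500, per_gen=200):
    """Iteratively refit a Gaussian, either replacing the data with the latest
    synthetic sample or accumulating real plus all synthetic samples."""
    pool = real.copy()
    mu, sigma = pool.mean(), pool.std()
    for _ in range(generations):
        synthetic = rng.normal(mu, sigma, size=per_gen)
        pool = synthetic if strategy == "replace" else np.concatenate([pool, synthetic])
        mu, sigma = pool.mean(), pool.std()
    return sigma

print("replace:    final std =", round(run("replace"), 3))     # typically shrinks markedly
print("accumulate: final std =", round(run("accumulate"), 3))  # typically stays near 1
```

Accumulation keeps the real data (and every earlier generation) in the training pool, which is what bounds the drift, at the cost of an ever-growing dataset.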

An alternative branch of the literature investigates the use of machine learning detectors and watermarking to identify model-generated data and filter it out.[10][11]
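In such a pipeline the filtering step itself is simple; the difficulty lies in the detector. A hypothetical sketch (`detector_score` is a placeholder for any real detector or watermark verifier, not a specific tool):

```python
def filter_training_corpus(docs, detector_score, threshold=0.5):
    """Keep only documents whose estimated probability of being
    model-generated falls below `threshold`."""
    return [doc for doc in docs if detector_score(doc) < threshold]

# Illustration with a trivial stand-in detector (assumption, for demonstration only):
corpus = ["written by a person", "generated text generated text"]
toy_detector = lambda doc: 0.9 if "generated" in doc else 0.1
print(filter_training_corpus(corpus, toy_detector))  # ['written by a person']
```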

Mathematical models of the phenomenon

Multidimensional Gaussian Model

In the case of a multidimensional Gaussian model trained on fully synthetic data, exact collapse can be shown.[7][1]
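Stated schematically for the one-dimensional special case (the notation here is illustrative; the multidimensional treatment is in [1][7]): if each generation draws M samples from the previously fitted Gaussian and re-estimates the mean and variance, then

```latex
\mu_{n+1} = \mu_n + \frac{\sigma_n}{\sqrt{M}}\, Z_{n+1},
\qquad
\sigma_{n+1}^2 = \frac{\sigma_n^2}{M-1}\, S_{n+1},
\qquad
Z_{n+1} \sim \mathcal{N}(0,1),\quad S_{n+1} \sim \chi^2_{M-1}.
```

The estimated mean thus performs a random walk while the logarithm of the variance has negative drift, so the fitted distribution collapses to a point unless the sample size M grows with the number of generations.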

Linear Regression

In the case of a linear regression model,[12][13] scaling laws and bounds on learning can be derived.
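In schematic form (the precise constants, regimes, and assumptions are given in [12][13]; the expression below is a simplified summary, not the papers' exact result), re-fitting a linear model for n generations on labels produced by its predecessor adds a test-error term that grows roughly linearly in n:

```latex
E_{\text{test}}^{(n)} \;\approx\; E_{\text{test}}^{(0)} \;+\; n \cdot \frac{\sigma^2 d}{T},
```

where d is the input dimension, T the number of samples per generation, and \sigma^2 the label-noise variance, so the usual scaling law acquires a penalty proportional to the number of model-fitting generations.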

Statistical Language Model

In the case of a linear softmax classifier for next-token prediction,[14] exact bounds on learning with partially synthetic data can be found.

 
Figure: Model collapse in generative models can be curbed by accumulating data.

Impact on large language models

In the context of large language models, research has found that training LLMs on predecessor-generated text (synthetic data produced by previous models) causes a consistent decrease in the lexical, syntactic, and semantic diversity of the model outputs through successive iterations, an effect that is especially pronounced for tasks demanding high levels of creativity.[15]

Data poisoning for artists

Data poisoning is a form of adversarial machine learning in which the data of an image or text is altered so that a model cannot train on it accurately. There are two main types of data poisoning: defensive, where an image's data is altered to protect the integrity of the work by preventing copying and look-alikes, and offensive, where an image is altered to degrade the reliability of generative artificial intelligence image generators.[16] However, it is unknown how much data poisoning affects training data and generative artificial intelligence at large scale.
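As a conceptual illustration only, an offensive-style perturbation can be sketched as nudging an image's features toward those of an unrelated "decoy" concept while keeping pixel changes small. The feature extractor below is a stand-in random linear map (an assumption); real tools such as Nightshade perturb images against the feature space of actual generative models:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "feature extractor": a fixed random linear map (assumption).
d_pixels, d_features = 64 * 64, 128
W = rng.normal(size=(d_features, d_pixels)) / np.sqrt(d_pixels)
features = lambda x: W @ x

image = rng.uniform(0.0, 1.0, size=d_pixels)   # the artwork to alter
decoy = rng.uniform(0.0, 1.0, size=d_pixels)   # an image of an unrelated concept

eps, lr, steps = 0.03, 0.1, 200                # imperceptibility budget (L-infinity) and step size
delta = np.zeros(d_pixels)
target = features(decoy)

for _ in range(steps):
    # Gradient of ||features(image + delta) - target||^2 with respect to delta.
    grad = 2.0 * W.T @ (features(image + delta) - target)
    delta -= lr * grad                          # move the features toward the decoy's
    delta = np.clip(delta, -eps, eps)           # keep the pixel change small

poisoned = np.clip(image + delta, 0.0, 1.0)
print("max pixel change:", np.abs(poisoned - image).max())
print("feature shift:", np.linalg.norm(features(poisoned) - features(image)))
```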

References

  1. ^ a b c d e Shumailov, Ilia; Shumaylov, Zakhar; Zhao, Yiren; Gal, Yarin; Papernot, Nicolas; Anderson, Ross (2023-05-31). "The Curse of Recursion: Training on Generated Data Makes Models Forget". arXiv:2305.17493 [cs.LG].
  2. ^ Ozsevim, Ilkhan (2023-06-20). "Research finds ChatGPT & Bard headed for 'Model Collapse'". Retrieved 2024-03-06.
  3. ^ Mok, Aaron. "A disturbing AI phenomenon could completely upend the internet as we know it". Business Insider. Retrieved 2024-03-06.
  4. ^ Wyllie, Sierra; Shumailov, Ilia; Papernot, Nicolas (2024-06-05). "Fairness Feedback Loops: Training on Synthetic Data Amplifies Bias". Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. FAccT '24. New York, NY, USA: Association for Computing Machinery: 2113–2147. arXiv:2403.07857. doi:10.1145/3630106.3659029. ISBN 979-8-4007-0450-5.
  5. ^ De Rosa, Micholas (May 31, 2024). "How the new version of ChatGPT generates hate and disinformation on command". CBC. Retrieved June 13, 2024.
  6. ^ Orland, Kyle (May 24, 2024). "Google's "AI Overview" can give false, misleading, and dangerous answers". Ars Technica. Retrieved June 13, 2024.
  7. ^ a b Alemohammad, Sina; Casco-Rodriguez, Josue; Luzi, Lorenzo; Humayun, Ahmed Imtiaz; Babaei, Hossein; LeJeune, Daniel; Siahkoohi, Ali; Baraniuk, Richard G. (July 4, 2023). "Self-Consuming Generative Models Go MAD". arXiv:2307.01850 [cs.LG].
  8. ^ Self-Consuming Generative Models Go MAD. The Twelfth International Conference on Learning Representations.
  9. ^ Gerstgrasser, Matthias; Schaeffer, Rylan; Dey, Apratim; Rafailov, Rafael; Sleight, Henry; Hughes, John; Korbak, Tomasz; Agrawal, Rajashree; Pai, Dhruv; Gromov, Andrey; Roberts, Daniel A.; Yang, Diyi; Donoho, David L.; Koyejo, Sanmi (2024-04-01). "Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data". arXiv:2404.01413 [cs.LG].
  10. ^ Kirchenbauer, John; Geiping, Jonas; Wen, Yuxin; Katz, Jonathan; Miers, Ian; Goldstein, Tom (2023-07-03). "A Watermark for Large Language Models". Proceedings of the 40th International Conference on Machine Learning. PMLR: 17061–17084.
  11. ^ "My AI Safety Lecture for UT Effective Altruism". Shtetl-Optimized. 2022-11-29. Retrieved 2024-06-22.
  12. ^ Dohmatob, Elvis; Feng, Yunzhen; Kempe, Julia (2024-02-12). "Model Collapse Demystified: The Case of Regression". arXiv.org. Retrieved 2024-06-22.
  13. ^ Dohmatob, Elvis; Feng, Yunzhen; Yang, Pu; Charton, Francois; Kempe, Julia (2024-02-10), A Tale of Tails: Model Collapse as a Change of Scaling Laws, doi:10.48550/arXiv.2402.07043, retrieved 2024-06-22
  14. ^ Seddik, Mohamed El Amine; Chen, Suei-Wen; Hayou, Soufiane; Youssef, Pierre; Debbah, Merouane (2024-04-07), How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse, doi:10.48550/arXiv.2404.05090, retrieved 2024-06-22
  15. ^ Guo, Yanzhu; Shang, Guokan; Vazirgiannis, Michalis; Clavel, Chloé (2024-04-16). "The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text". arXiv:2311.09807 [cs.CL].
  16. ^ The Nightshade Team. "What is Nightshade". Nightshade. University of Chicago. Retrieved June 13, 2024.