AIs can generate near-verbatim copies of novels from training data

Ars Technica
by Melissa Heikkilä, Financial Times
February 23, 2026
AI-Generated Deep Dive Summary
Recent research reveals that leading AI models from OpenAI, Google, Meta, Anthropic, and xAI can generate near-verbatim copies of bestselling novels, raising concerns that they store copyrighted material. The finding challenges the industry's long-standing defense that large language models (LLMs) "learn" from copyrighted works without storing exact copies. Studies show these models memorize significantly more training data than previously believed, potentially undermining AI companies' legal arguments in ongoing copyright lawsuits.

The memorization appears to occur during fine-tuning, when models may retain specific text sequences from their training data. Although developers often claim the data is not stored verbatim, the evidence suggests otherwise. This could have serious implications for AI firms facing lawsuits, since it calls into question their ability to avoid infringing on copyrighted works.

The issue also raises questions about how users and developers interact with AI models. If these systems can reproduce content so closely, unintended consequences may follow for those who rely on them. Legal experts warn that this development could increase the risk of copyright violations, even unintentional ones, making it harder for companies to defend against claims of infringement.

For readers interested in tech and science, this story highlights a critical intersection of AI capabilities and intellectual property law. As LLMs become more powerful, understanding their limitations and potential liabilities is essential for both developers and users. The finding underscores the need for clearer guidelines and stronger safeguards to prevent misuse while balancing innovation. Ultimately, it challenges assumptions about how AI learns and retains information.