Meta AI Model Can Recreate 42% of the First Harry Potter Book

Gaurav

A new study shows that Meta’s Llama 3.1 70B AI model can reproduce verbatim excerpts covering 42% of Harry Potter and the Sorcerer’s Stone. The finding could reshape how courts weigh copyright claims against generative AI.

For years, copyright holders, like news publishers and authors, have sued AI companies for using copyrighted material to train their models without permission. One of the main questions has been whether AI models really remember and repeat protected text. A recent study by scientists at Stanford, Cornell, and West Virginia University says that the answer is yes for some books and models.

Key Findings

  • Meta’s mid-sized Llama 3.1 70B model, released in 2024, reproduced 50-token sequences from 42% of Harry Potter and the Sorcerer’s Stone, the highest memorization rate among the five models tested.
  • Older models memorized far less: Llama 1 65B reproduced only 4.4% of the same book.
  • The models memorized popular books such as The Hobbit and 1984 far more readily than obscure ones; Sandman Slim, for instance, had a recall rate of just 0.13%.
  • The researchers measured memorization by splitting books into overlapping passages and computing how likely a model was to generate the next 50 tokens given a prompt.
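The measurement in the last bullet can be sketched in a few lines. This is a simplified illustration, not the paper’s code: the function names, the toy per-token probabilities, and the 50% decision threshold are all assumptions introduced here. The core idea is that the probability of a whole continuation is the product of each token’s conditional probability, which an open-weight model exposes directly.

```python
import math

def continuation_probability(per_token_probs):
    """Probability of a whole continuation: the product of each token's
    conditional probability given the preceding context (summed in
    log space for numerical stability)."""
    return math.exp(sum(math.log(p) for p in per_token_probs))

def is_memorized(per_token_probs, threshold=0.5):
    """Illustrative criterion: count a 50-token excerpt as memorized
    when the model assigns the exact continuation probability above
    the threshold. The 0.5 cutoff here is an assumption."""
    return continuation_probability(per_token_probs) > threshold

# Toy example: even very high per-token probabilities compound quickly
# over 50 tokens, so only near-certain predictions clear the bar.
confident = [0.99] * 50   # overall probability ≈ 0.605
uncertain = [0.90] * 50   # overall probability ≈ 0.005

print(continuation_probability(confident))
print(is_memorized(confident))   # True
print(is_memorized(uncertain))   # False
```

The multiplicative compounding is why the 42% figure is striking: reproducing a 50-token run essentially requires the model to predict every single token of the original text with near certainty.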

What This Means for Copyright Lawsuits

The study cuts both ways in the AI copyright debate. Critics say the results show that large language models (LLMs) don’t just learn patterns; they also memorize verbatim text from their training data. That undercuts the industry’s claim that such copying is rare or unintentional.

But for AI companies like Meta, the results could also support the argument that memorization varies widely from work to work, which could weaken class-action lawsuits that try to bundle thousands of authors under a single claim.

Why Harry Potter?

The study does not establish how the Harry Potter text entered Llama’s training data, whether directly or through derivative sources such as fan forums and book reviews. The researchers do note, however, that larger training corpora correlate with more memorization: Llama 3.1 was trained on 15 trillion tokens, more than ten times as many as its predecessor.

The Legal Problem: When Is AI Breaking the Law?

There are three main theories of copyright infringement in AI training, according to legal experts:

  1. Copying during training: using copyrighted works in training datasets.
  2. Derivative models: treating an AI model that stores training data as a derivative work.
  3. Output infringement: the model reproducing protected content word for word.

The results from Harry Potter support the second and third theories. Some legal scholars think that Llama 3.1 may contain copies of copyrighted works because it can recreate large parts of them.

A Legal Trade-off Between Open and Closed Models

Researchers could test memorization directly because Llama is an open-weight model, so its internal token probabilities can be inspected. The closed models from OpenAI, Google, and Anthropic cannot be probed the same way.

This openness could lead to a perverse outcome where companies are punished for transparency while those with closed models escape scrutiny. Some judges might view open-source releases as a public benefit; others might see them as riskier because they give anyone unfiltered access.

Conclusion

This study makes it harder for the AI industry to claim that large models only “learn patterns” rather than memorize text. It also raises doubts about whether existing precedents, such as the Google Books ruling, still cover this territory. Empirical research of this kind could prove decisive as copyright lawsuits against AI companies move forward.

Gaurav

Gaurav is the founder of FARLI.org, a platform dedicated to making sense of the rapidly evolving AI ecosystem. With a focus on practical innovation, he explores how AI can simplify work, spark creativity, and drive smarter decisions. Through FARLI, he aims to build a definitive resource for everything AI.
