Alignment whack-a-mole: Finetuning activates recall of copyrighted books in LLMs

GitHub - cauchy221/Alignment-Whack-a-Mole-Code: The official code repo of Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

github.com

April 30, 2026

Summary

The GitHub repository for Alignment Whack-a-Mole contains the data preprocessing pipeline, finetuning scripts, memorization evaluation code, and analysis scripts for a study of how finetuning activates verbatim recall of copyrighted books in large language models. It ships example files with excerpts and model generations from "The Road" by Cormac McCarthy, but does not include full book content.

Key Takeaways

The repository provides code for studying how finetuning activates verbatim recall of copyrighted books in large language models, with example excerpts and generations drawn from "The Road" by Cormac McCarthy.
The repository includes scripts for data preprocessing, finetuning, and evaluation, along with instructions for setting up dependencies and API keys for OpenAI and Tinker.
The preprocessing pipeline converts EPUB files into JSON format, segments text into excerpts, and generates plot summaries for each chunk using GPT-4o.
Users can finetune models via OpenAI, Vertex AI, and Tinker APIs, with specific commands provided for each platform to handle training and generation tasks.
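
As a rough illustration of the segmentation step in the preprocessing pipeline described above, the sketch below splits already-extracted book text into fixed-size excerpts and serializes them as JSON records. It is not taken from the repository: the function name segment_into_excerpts, the 300-word excerpt length, and the record fields (excerpt_id, text, summary) are all assumptions chosen for illustration, and the EPUB parsing and GPT-4o summary calls are left out.

```python
import json


def segment_into_excerpts(text: str, words_per_excerpt: int = 300) -> list[dict]:
    """Split text into consecutive excerpts of at most `words_per_excerpt` words.

    Hypothetical helper: the real pipeline may segment by tokens, paragraphs,
    or chapters instead of a fixed word count.
    """
    words = text.split()
    excerpts = []
    for start in range(0, len(words), words_per_excerpt):
        chunk = words[start:start + words_per_excerpt]
        excerpts.append({
            "excerpt_id": len(excerpts),   # assumed field name
            "text": " ".join(chunk),
            "summary": None,               # to be filled by a later summarization step
        })
    return excerpts


if __name__ == "__main__":
    # Placeholder text standing in for the contents of a converted EPUB.
    sample_text = "word " * 700
    records = segment_into_excerpts(sample_text, words_per_excerpt=300)
    print(json.dumps([len(r["text"].split()) for r in records]))
```

In a pipeline like the one summarized here, each record's summary field would then be filled in by a separate call to a summarization model before the records are used for finetuning.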

Community Sentiment

Mixed

Positives

The potential for LLMs to provide useful output based on well-OCRed scans of obscure texts could significantly enhance research capabilities, fostering greater access to specialized knowledge.
The excitement around being able to query LLMs for niche information indicates a growing appreciation for the practical applications of AI in academic and research settings.

Concerns

Worries about copyright infringement suggest the industry may face significant legal challenges that could hinder innovation and access to AI-generated content.
The fear that AI companies may eventually charge users for access to information raises ethical questions about the commercialization of knowledge and its impact on funding for libraries.