The plagiarism argument against LLMs gets its sharpest f...

What happened

An essay titled *AI is just unauthorised plagiarism at a bigger scale* hit 678 points on Hacker News, written by Estonian developer Axel Kee. The thesis is simple and uncomfortable: when a model ingests copyrighted writing, code, or images without permission and then emits derivative output without attribution, the only thing separating it from a freshman caught copying Wikipedia is scale and a marketing budget.

Kee's argument leans on the standard a working developer already uses. If a human engineer pasted Stack Overflow code into production without citation, that's a fireable offense at most companies; if a model trained on that same code reproduces it verbatim, it's called 'emergent capability.' The essay doesn't engage with the deeper jurisprudence — fair use, transformative work, the four-factor test — and that's the point. It argues the legal framing has been a distraction from the obvious ethical asymmetry.

The HN thread that formed underneath was, predictably, a war. The top-rated counter-argument: humans also learn from copyrighted material without paying licensing fees every time, and we don't call that plagiarism. The top-rated agreement: humans get sued constantly for derivative work (see: every music copyright case of the last 40 years), and the model vendors have simply purchased themselves out of that liability with scale and lawyers.

Why it matters

The essay is not novel as legal theory. Authors Guild v. OpenAI, the NYT lawsuit, Getty v. Stability AI, and the now-settled Anthropic class action have been arguing this for two years. What's novel is the framing landing with developers — the people building on these models — rather than with rights-holders trying to extract a settlement.

That shift matters because developers are the ones who will or won't ship provenance features. The legal question — 'is this allowed?' — has dominated the discourse, but the engineering question — 'can we even tell where this came from?' — is the one that actually constrains product design. Right now the answer is mostly no. Retrieval-augmented generation gives you a citation for the retrieved chunk, but not for the parametric knowledge the base model contributes. Tools like Perplexity have made citations a UX primitive, but the citations point to *post-hoc* web search results, not to the training data that shaped the model's prior.

The technical state of the art for true training-data attribution is grim. Influence functions (the Anthropic 2023 paper) can identify which training examples most affected a given output, but the compute cost is roughly quadratic in dataset size, which is why no production system uses them. Membership inference attacks can confirm whether a specific document was in training data, but they don't give you 'this paragraph derives from that paragraph.' Watermarking schemes (Google's SynthID, Meta's Stable Signature) flag *AI-generated* content; they don't tag the human content the AI consumed.

The community reaction split along a familiar line. The pragmatist camp: attribution is a distribution problem, not a technical one — pay the publishers, license the training corpus, move on. This is the Anthropic-class-action model: $1.5B reportedly settled in the U.S. authors case in late 2025, with the precedent that bulk licensing is cheaper than litigation. The purist camp: licensing money flows to large rights-aggregators (Reddit, Stack Overflow, the big publishers) and never reaches the individual contributors whose work actually trained the model. The Stack Overflow contributor revolt of 2023, when the site licensed its corpus to OpenAI without sharing the proceeds, is the canonical example.

There's a third camp worth steelmanning: the people who think the essay is just wrong. Their argument is that 'plagiarism' requires intent and the appearance of authorship, and a probabilistic token generator has neither. A model that emits 'To be or not to be' isn't claiming to have written it; the user prompting the model is the author of the artifact, and the user's responsibility for attribution is unchanged from the pre-AI era. This is the position most model vendors take publicly, and it's not unreasonable — but it conveniently externalizes the attribution burden to the people least equipped to discharge it.

What this means for your stack

If you're shipping anything that puts model output in front of users, three concrete things are changing in the next two quarters.

First, provenance metadata is becoming a procurement requirement, not a nice-to-have. Enterprise buyers in regulated industries — legal, medical, financial — are already asking for training-data declarations as part of vendor security reviews. The C2PA standard (Adobe, Microsoft, BBC, et al.) gives you a cryptographic chain of custody for media; expect equivalent specs for text generation by year-end. If your AI feature can't answer 'where did this sentence come from,' it's going to fail RFPs it would have won six months ago.

Second, RAG architectures are eating fine-tuning specifically because retrieval gives you defensible citations. A fine-tuned model that memorizes proprietary documents is a lawsuit waiting to happen; a retrieval system that quotes those documents with a link to the source is defensible craft. This is why nearly every enterprise AI launch in 2026 leads with 'grounded in your data' rather than 'fine-tuned on your data.' The legal posture is the actual product differentiator.

Third, 'No-Slop'–style filters and provenance scoring are going to be table-stakes UI. Expect to see content tools ship a confidence-and-source panel by default — not because users demand it, but because the alternative is a takedown notice from someone whose Substack got regurgitated. The teams that build this now will be selling it to everyone else by mid-2027.

Looking ahead

The essay won't change a single court case. What it might do is shift the *engineering* conversation from 'how do we train on more data faster' to 'how do we train on data we can stand behind in a deposition.' That's a healthier place for the field to be — and it's a place where the small teams with clean, licensed corpora and explicit attribution can actually compete with the frontier labs on something other than parameter count. The plagiarism framing is uncomfortable because it's *almost* right; the productive move is to build the tooling that makes it wrong.

The plagiarism argument against LLMs gets its sharpest framing yet

// tldr

// viewpoints

// deep dive

What happened

Why it matters

What this means for your stack

Looking ahead

// read from source

AI is just unauthorised plagiarism at a bigger scale

// community takes

The plagiarism argument against LLMs gets its sharpest framing yet

// tldr

// viewpoints

// deep dive

What happened

Why it matters

What this means for your stack

Looking ahead

// read from source

AI is just unauthorised plagiarism at a bigger scale

// community takes

// share this