The plagiarism argument against LLMs gets its sharpest framing yet

├── "AI training is industrial-scale plagiarism dressed up in technical language"
│  ├── Axel Kee (axelk.ee)

Kee argues the double standard is the giveaway: a developer who pasted Stack Overflow code without citation would be fired, but when a model trained on that same code reproduces it verbatim, the industry rebrands the behavior as 'emergent capability.' He frames the legal debate over fair use as a distraction from a clear ethical asymmetry — ingesting copyrighted work without permission and emitting derivative output without attribution is plagiarism regardless of scale.

│  └── @speckx (Hacker News, 678 pts)

speckx submitted Kee's essay to Hacker News, where it reached 678 points. The top-rated agreeing comment in the thread argues that AI vendors have effectively purchased themselves out of the copyright liability any individual creator would face: humans get sued constantly for derivative work, music copyright cases being the canonical example, while model vendors avoid the same scrutiny through scale and legal budgets.

├── "Humans learn from copyrighted material too — calling AI training plagiarism proves too much"
│  └── @HN top counter-argument (Hacker News)

The highest-rated rebuttal in the thread holds that every human writer, coder, and artist absorbs copyrighted material throughout their training without paying licensing fees, and we don't call the resulting work plagiarism. By this logic, a model that learns statistical patterns from text is doing what humans have always done, and the plagiarism framing collapses into a demand that machines be held to a stricter standard than people.

└── "The real unsolved problem is provenance — engineering, not law"
  └── top10.dev editorial (top10.dev)

The editorial argues the legal question ('is this allowed?') has dominated discourse while the engineering question ('can we even tell where this came from?') has been ignored. Because developers building on these models are the ones who will or won't ship provenance features, the essay's significance is that the plagiarism framing is finally landing with the people who could technically fix attribution — not just with rights-holders seeking settlements.

What happened

An essay by Estonian developer Axel Kee, titled *AI is just unauthorised plagiarism at a bigger scale*, hit 678 points on Hacker News. The thesis is simple and uncomfortable: when a model ingests copyrighted writing, code, or images without permission and then emits derivative output without attribution, the only thing separating it from a freshman caught copying Wikipedia is scale and a marketing budget.

Kee's argument leans on the standard a working developer already uses. If a human engineer pasted Stack Overflow code into production without citation, that's a fireable offense at most companies; if a model trained on that same code reproduces it verbatim, it's called 'emergent capability.' The essay doesn't engage with the deeper jurisprudence — fair use, transformative work, the four-factor test — and that's the point. It argues the legal framing has been a distraction from the obvious ethical asymmetry.

The HN thread that formed underneath was, predictably, a war. The top-rated counter-argument: humans also learn from copyrighted material without paying licensing fees every time, and we don't call that plagiarism. The top-rated agreement: humans get sued constantly for derivative work (see: every music copyright case of the last 40 years), and the model vendors have simply purchased themselves out of that liability with scale and lawyers.

Why it matters

The essay is not novel as legal theory. Authors Guild v. OpenAI, the NYT lawsuit, Getty v. Stability AI, and the now-settled Anthropic class action have been arguing this for two years. What's novel is the framing landing with developers — the people building on these models — rather than with rights-holders trying to extract a settlement.

That shift matters because developers are the ones who will or won't ship provenance features. The legal question — 'is this allowed?' — has dominated the discourse, but the engineering question — 'can we even tell where this came from?' — is the one that actually constrains product design. Right now the answer is mostly no. Retrieval-augmented generation gives you a citation for the retrieved chunk, but not for the parametric knowledge the base model contributes. Tools like Perplexity have made citations a UX primitive, but the citations point to *post-hoc* web search results, not to the training data that shaped the model's prior.
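The gap between retrieval citations and parametric knowledge can be made concrete with a minimal sketch. It assumes a toy word-overlap retriever in place of a real embedding model, and the `Chunk` type, function names, and example URLs are all hypothetical. The point it illustrates is that the citation list is built from the retrieved chunks alone, so nothing the base model adds from its weights can appear in it.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source_url: str  # hypothetical URL, used only for illustration
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, corpus: list[Chunk], k: int = 1) -> list[Chunk]:
    """Return up to k chunks that share words with the query, best match first."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(c.text)), c) for c in corpus]
    matches = [(score, c) for score, c in scored if score > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in matches[:k]]


def build_prompt(query: str, chunks: list[Chunk]) -> tuple[str, list[str]]:
    """Build a grounded prompt and the citation list that goes with it."""
    numbered = [f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1)]
    prompt = "\n".join(
        ["Answer using only the numbered sources.", *numbered, f"Question: {query}"]
    )
    # The citations cover retrieved text only. Anything the model contributes
    # from its parameters has no entry here.
    citations = [c.source_url for c in chunks]
    return prompt, citations


corpus = [
    Chunk("https://example.com/fair-use", "Fair use is decided with a four-factor test."),
    Chunk("https://example.com/c2pa", "C2PA records a chain of custody for media files."),
    Chunk("https://example.com/rag", "Retrieval augmented generation cites the retrieved chunk."),
]
query = "What is the four-factor test for fair use?"
prompt, citations = build_prompt(query, retrieve(query, corpus))
```

Whatever the model writes in response to `prompt`, the only provenance the application can show is `citations`, which is why retrieval citations do not answer the training-data question.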

The technical state of the art for true training-data attribution is grim. Influence functions (the Anthropic 2023 paper) can estimate which training examples most affected a given output, but each query needs an approximate inverse-Hessian-vector product plus a gradient for every candidate training example, a cost that grows with both model size and corpus size, which is why no production system uses them. Membership inference attacks can estimate whether a specific document was in the training data, but they don't give you 'this paragraph derives from that paragraph.' Watermarking schemes (Google's SynthID, Meta's Stable Signature) flag *AI-generated* content; they don't tag the human content the AI consumed.
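To give a feel for why per-example attribution is expensive, here is a toy sketch of a much simpler relative of influence functions: ranking training examples by how well their loss gradient aligns with the gradient of a query example. This is a first-order gradient-similarity heuristic on a tiny linear model, not the method from the paper mentioned above, and it skips the inverse-Hessian term entirely. All names and numbers are made up for illustration.

```python
Vector = list[float]
Example = tuple[Vector, float]  # (features, target)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def loss_gradient(w: Vector, example: Example) -> Vector:
    """Gradient of the squared loss 0.5 * (w.x - y)^2 with respect to w."""
    x, y = example
    residual = dot(w, x) - y
    return [residual * xi for xi in x]


def rank_by_gradient_similarity(
    w: Vector, train: list[Example], query: Example
) -> list[tuple[int, float]]:
    """Score every training example against one query, highest score first.

    Note the loop: one gradient per training example, per query. On a real
    corpus that loop is the part that does not scale.
    """
    g_query = loss_gradient(w, query)
    scores = [(i, dot(g_query, loss_gradient(w, ex))) for i, ex in enumerate(train)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


w = [1.0, 0.0]  # hypothetical model weights
train = [
    ([1.0, 0.0], 2.0),
    ([0.0, 1.0], 1.0),
    ([1.0, 1.0], 1.0),
]
query = ([2.0, 0.0], 4.0)
ranked = rank_by_gradient_similarity(w, train, query)
```

Here the first training example ranks highest because its gradient points the same way as the query's. Even in this stripped-down form, one query touches every training example, which hints at why the full technique is impractical at pretraining scale.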

The community reaction split along a familiar line. The pragmatist camp holds that attribution is a distribution problem, not a technical one: pay the publishers, license the training corpus, move on. This is the Anthropic-class-action model: a reported $1.5B settlement in the U.S. authors case in late 2025, with the takeaway that bulk licensing is cheaper than litigation. The purist camp counters that licensing money flows to large rights-aggregators (Reddit, Stack Overflow, the big publishers) and never reaches the individual contributors whose work actually trained the model. The Stack Overflow contributor revolt of 2024, when the site licensed its corpus to OpenAI without sharing the proceeds, is the canonical example.

There's a third camp worth steelmanning: the people who think the essay is just wrong. Their argument is that 'plagiarism' requires intent and the appearance of authorship, and a probabilistic token generator has neither. A model that emits 'To be or not to be' isn't claiming to have written it; the user prompting the model is the author of the artifact, and the user's responsibility for attribution is unchanged from the pre-AI era. This is the position most model vendors take publicly, and it's not unreasonable — but it conveniently externalizes the attribution burden to the people least equipped to discharge it.

What this means for your stack

If you're shipping anything that puts model output in front of users, three concrete things are changing in the next two quarters.

First, provenance metadata is becoming a procurement requirement, not a nice-to-have. Enterprise buyers in regulated industries — legal, medical, financial — are already asking for training-data declarations as part of vendor security reviews. The C2PA standard (Adobe, Microsoft, BBC, et al.) gives you a cryptographic chain of custody for media; expect equivalent specs for text generation by year-end. If your AI feature can't answer 'where did this sentence come from,' it's going to fail RFPs it would have won six months ago.
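As a rough illustration of what a "chain of custody" means in practice, the sketch below builds a hash-chained log of content events using only the standard library. It is not C2PA and does not follow its manifest format; the field names are hypothetical, and there is no signing step, so it only detects edits to an existing log and does not prove who wrote it.

```python
import hashlib
import json


def record_hash(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_record(chain: list[dict], action: str, content: str) -> list[dict]:
    """Return a new chain with one more entry linked to the previous entry's hash."""
    body = {
        "action": action,  # hypothetical labels such as "retrieved" or "generated"
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "prev_hash": chain[-1]["hash"] if chain else None,
    }
    return chain + [{**body, "hash": record_hash(body)}]


def verify_chain(chain: list[dict]) -> bool:
    """Check that every entry hashes correctly and points at its predecessor."""
    prev = None
    for entry in chain:
        body = {key: value for key, value in entry.items() if key != "hash"}
        if entry["prev_hash"] != prev or entry["hash"] != record_hash(body):
            return False
        prev = entry["hash"]
    return True


chain = append_record([], "retrieved", "Source paragraph text")
chain = append_record(chain, "generated", "Model output text")

# Editing any field after the fact breaks verification.
tampered = [dict(chain[0], action="licensed"), chain[1]]
```

A production scheme would add signatures and identity on top of this, which is the part standards bodies spend most of their effort on.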

Second, RAG architectures are eating fine-tuning specifically because retrieval gives you defensible citations. A fine-tuned model that memorizes proprietary documents is a lawsuit waiting to happen; a retrieval system that quotes those documents with a link to the source is defensible craft. This is why nearly every enterprise AI launch in 2026 leads with 'grounded in your data' rather than 'fine-tuned on your data.' The legal posture is the actual product differentiator.

Third, 'No-Slop'-style filters and provenance scoring are going to be table-stakes UI. Expect content tools to ship a confidence-and-source panel by default, not because users demand it but because the alternative is a takedown notice from someone whose Substack got regurgitated. The teams that build this now will be selling it to everyone else by mid-2027.
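One way such a panel could be fed is sketched below, under heavy assumptions: each output sentence is matched against a set of known sources by word overlap (Jaccard similarity), and sentences below a threshold are marked as unsupported. The function names, the threshold, and the source IDs are hypothetical. Lexical overlap is a crude proxy, and this measures similarity to the sources you supply, not to the model's training data.

```python
import re


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def source_panel(output: str, sources: dict[str, str], threshold: float = 0.5) -> list[dict]:
    """For each sentence, report the best-matching source ID or None if unsupported."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", output) if s.strip()]
    panel = []
    for sentence in sentences:
        best_id, best_score = None, 0.0
        for source_id, text in sources.items():
            score = jaccard(tokens(sentence), tokens(text))
            if score > best_score:
                best_id, best_score = source_id, score
        panel.append(
            {
                "sentence": sentence,
                "source": best_id if best_score >= threshold else None,
                "score": round(best_score, 2),
            }
        )
    return panel


sources = {"doc-a": "Fair use is decided with a four-factor test."}
output = "Fair use is decided with a four-factor test. Models never memorize text."
panel = source_panel(output, sources)
```

In this example the first sentence maps to `doc-a` and the second has no supporting source, which is the kind of row a UI could flag for review.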

Looking ahead

The essay won't change a single court case. What it might do is shift the *engineering* conversation from 'how do we train on more data faster' to 'how do we train on data we can stand behind in a deposition.' That's a healthier place for the field to be — and it's a place where the small teams with clean, licensed corpora and explicit attribution can actually compete with the frontier labs on something other than parameter count. The plagiarism framing is uncomfortable because it's *almost* right; the productive move is to build the tooling that makes it wrong.

Hacker News 678 pts 562 comments

danorama · Hacker News

There’s a fallacy that gets used a whole lot to justify things like this (not just with LLMs), and I see it in many of the comments here: If it’s OK (or at least negligible on a small scale), then it must be OK on a large scale. It usually goes something like: If I can make money by learning somethin

dvduval · Hacker News

The broader problem of original sources not being given credit in a way that rewards them remains. Websites owners are paying to host their content so that spiders can come and crawl them and index it into the AI and then if they’re lucky, they might get a citation, but otherwise there’s very little

deaton · Hacker News

"Steal an apple and you're a thief. Steal a kingdom and you're a statesman." - Literal Disney villain

tancop · Hacker News

if theres just one good thing coming out of ai its breaking copyright law forever. no one should be able to "own" ideas. royalties for commercial use is another thing and i support it but what we know as (non commercial) piracy and unlicensed fan art should be 100% legal

storus · Hacker News

This is really not so clear cut as "fair use" might cover 99% of all data scrapping; you are not reproducing the originals just use them to estimate probabilistic distribution of tokens in pre-training. You are never going to get the exact book word-for-word using LLMs.
