btw this should be obvious but if you are upset about the internet archive losing the appeal, but are also anti-ai training, you are holding two extremely contradictory views - either the law protects your intellectual property to the point where nobody is allowed to download it, or it doesn't - and you should try to examine where exactly the conflict is.
i've had the worst day in a long time so im not elaborating for now - if one of my many esteemed followers felt like getting more clear on the Ramifications and Dialectic here in a reblog i would greatly appreciate it but i did just want to strike while the thing was hot. either you think more IP protection is better, or you don't. you can't have it both ways.
Uhhhh, no? The Internet Archive should be allowed to download materials as part of a non-profit archival effort, with reasonable effort put in to restrict public access where necessary to protect intellectual property. And for-profit downloading of intellectual property for ML training should be fully illegal because it's for-profit theft.
How is this not obviously different?
This argument doesn't work for me for a bunch of reasons.
One, I never see ideological opponents of LLMs distinguish between for-profit and not-for-profit use. Insofar as copyright violation is a legal question and not a moral one, and no copyright attaches to purely LLM-generated content, even selling LLM services for a profit to cover the cost of servers and training does not seem to me to be repackaging others' work in a form that resembles copyright violation as we usually think of it.
If I took a novel under copyright--say, Lord of the Rings--and did a bunch of linear algebra to the text to create an abstract work of art that was procedurally related to the text, but from which it would be fiendishly difficult to actually recover the text, and hung it up in a gallery, I think it would be generally considered to be both fair use and legitimate "art" (though not necessarily good art, depending on how you felt about abstract representations of large matrices of floating-point numbers). I would almost certainly be in the clear to legally sell that artwork to someone else, and I would certainly be in the clear to sell access to the computer program I used to create that art.
The biggest difference between that and LLM-generated text seems to me to be aesthetics: the latter is tainted by its association with Big Corporations and Taking Jobs from Hardworking Independent Writers and Artists, while the former invokes the spirit of the individual producing Fine Art alone with a creative (although perhaps slightly incomprehensible) purpose. But aesthetics is a terrible basis on which to make policy. And nobody is ever even honest enough to frame this as "the underlying technology is neither moral nor immoral, but I specifically dislike and want to hurt LLM companies," or even "the government must intervene to protect the incomes of struggling freelancers." It's always "plagiarism machine water use energy."
As a side note--calling something "theft" isn't actually an argument. Libertarians have been using "taxation is theft" as a slogan for decades and nobody grants them the point because plainly they are using a definition of "theft" that is different from anybody else's. We spent like ten or fifteen years in the aughts and teens arguing that filesharing and digital piracy weren't "theft"--even if they were copyright violation--because "theft" in its ordinary meaning implies that it deprives the current owner of their use of the thing. And now tons of people on the internet are overnight turning into copyright maximalists and are arguing for an insanely narrow redefinition of "fair use" at which even Mickey Mouse would blush, and I'm sitting over here thinking "what the hell is everyone smoking?"
Like I'm not even arguing for the utility of LLMs--you can freely acknowledge that as a commercial product they are obviously overhyped and undercapable and there is a real social cost to making it possible to automate the production of low-quality text and images--but arguments about the utility of a technology are not arguments about its morality, and arguments about its morality are not arguments about its legality. As a matter of law, LLMs seem plainly not to be a violation of copyright (any actual copyright lawyers feel free to tell me if I'm wrong, but as of making this post I am aware of no court decisions that selling LLM services for a profit constitutes a violation of copyright). As a matter of policy, trying to change the law so that LLMs are a violation of copyright seems difficult, because "you aren't allowed to do linear algebra on large corpuses of text you scrape from the internet" is not in practice a workable rule. As a matter of social impact, LLMs seem mixed at best, but they're an interesting advance in computer science. As a matter of ethics, it's pretty obviously totally ethically neutral whether you do linear algebra to large amounts of text you scrape from the internet or not. Uses you might put the results of that linear algebra to may or may not be ethical, but the underlying technology is simply another tool.
Seems like the Internet Archive had argued that it should be fine for it to distribute digitised copyrighted books because it was doing so under fair use. Here's the excerpt from the ruling on why this didn't fly with the courts:
In evaluating the four statutory fair use factors [...], the district court found that:
IA’s use of the [books] is (a) nontransformative because IA reproduces the Works in full and its digital copies serve the same purpose as the originals; and (b) commercial because, despite IA’s nonprofit status, it exploits the Works by soliciting donations on its website and taking a cut of the proceeds when users buy a physical book from BWB using a link embedded on IA’s website;
the Works are original fiction and nonfiction books “close to the core of intended copyright protection”;
IA copies the books wholesale; and
IA “brings to the marketplace a competing substitute” for library eBook licenses, “usurping a market that properly belongs to the copyright-holder.”
So it could be possible to do archiving without copyright infringement, but this apparently wasn't it.
Also, consider that not all generative AI/LLMs are developed via for-profit activities; some are research projects funded by public money (like Horizon Europe projects) that can then be built upon/open source/released to the market. Blurry lines are blurry and policy is hard.