this post was submitted on 14 Jan 2024

382 points (98.5% liked)

Piracy: ꜱᴀɪʟ ᴛʜᴇ ʜɪɢʜ ꜱᴇᴀꜱ


⚓ Dedicated to the discussion of digital piracy, including ethical problems and legal advancements.



Meta Admits Use of 'Pirated' Book Dataset to Train AI (torrentfreak.com)

submitted 9 months ago by [email protected] to c/[email protected]

49 comments

[–] [email protected] 88 points 9 months ago (2 children)

That's fine, just let the rest of us do the same.

[–] [email protected] 6 points 9 months ago

Actually, I'd prefer that piracy by individual users be considered fair use, but piracy by corporations not be. So them pirating is not fine, but us pirating should be.

[–] [email protected] -2 points 9 months ago

Yeah, too much of this thread is hypocritical. Either stuff should be free to copy or it shouldn't.

[–] [email protected] 72 points 9 months ago

"We didn't do it, and if we did, it was fair use, and if it wasn't, progress will be hampered if rules and regulations are too strict."

[–] [email protected] 48 points 9 months ago (1 children)

Nationalize AI or tax it to fund UBI, and none of this is an issue.

[–] [email protected] 13 points 9 months ago* (last edited 9 months ago)

Best idea I've heard in a year. Automation should benefit humanity as a whole.

[–] [email protected] 32 points 9 months ago (1 children)

I do wonder how it shakes out. If the case establishes that a license must be acquired to use copyrighted material, then maybe the license I'm setting on my comments might land commercial AI companies in hot water too, which I'd love. Open-source AI models FTW

CC BY-NC-SA 4.0

[–] [email protected] 9 points 9 months ago

That license would require the AI model to only output content under the same license. Not sure if you realize, but allowing commercial use is part of the Open Source Definition:

https://opensource.org/osd/

Your content would just get filtered out from any training dataset.

As for going against commercial companies... maybe you are a lawyer, otherwise good luck paying the fees.

[–] [email protected] 30 points 9 months ago* (last edited 9 months ago)

AI is just overhyped. Every company invests millions into AI and all new products need to "have AI". And then everybody also needs to file lawsuits. I mean, rightly so if Meta just pirated the books, but that's not a problem with AI; it's plain old piracy.

I was pretty sure OpenAI or Meta didn't license gigabytes of books correctly for use in their commercial products. Nice that Meta has now admitted to it. I hope their "Fair Use" argument works and in the future we can all "train AI" with our "research dataset" of 40 GB of ebooks. Maybe I'm even going to buy another hard disk and see if I can train an AI on 6 TB of TV series, all the Marvel movies and a broad MP3 collection.

Btw, there was no denying it anyway. Meta wrote a scientific paper about their LLaMA model in March of last year, and they clearly listed all of their sources, including Books3. Other companies aren't that transparent, and even less so as of today.

[–] [email protected] 26 points 9 months ago* (last edited 9 months ago)

Welp, whole trained dataset got DMCAed, right? And a nonsensical fine, right?

[–] [email protected] 17 points 9 months ago (1 children)

ohno my copyright!!!! How will the publisher megacorps now make a record quarter??? Think of the shareholders!

[–] [email protected] 48 points 9 months ago (17 children)

That's not the takeaway you should be having here. It's that a megacorp felt that they should be allowed to create new content from someone else's work, both without their permission and without paying.

[–] [email protected] 17 points 9 months ago (1 children)

ok, fair; but do consider the context that the models are open weight. You can download them and use them for free.

There is a slight catch, though, which I'm very annoyed at: it's not actually Apache. It's this weird license where you can use the model commercially up until you have 700M monthly users, at which point you have to request a custom license from Meta. OK, I kinda understand them not wanting companies like ByteDance or Google using their models just like that, but Mistral has their open-weight models under Apache-2.0, so the licensing should definitely be reconsidered, especially for Llama 3.

It's kind of a thing right now: publishers don't want models trained on their books "because it breaks copyright", even though the model doesn't actually remember copyrighted passages from the book. Many arguments hinge on the publishers being mad that you can prompt the model to repeat a copyrighted passage, which it can do. IMO this is a bullshit reason.

Anyway, it will be an interesting two years as (hopefully) copyright will get turned inside out :)

[–] [email protected] 4 points 9 months ago

I really have to thank you for an educated response


[–] [email protected] 10 points 9 months ago

Nope. Yer can feck off Zuck! Yer ain't comin' aboard my ship! 🏴‍☠️

[–] [email protected] 8 points 9 months ago

I'm pretty sure "admits" implies an attempt to hide it. They've explicitly said in the model's initial publication that the training set includes Books3.

[–] [email protected] 4 points 9 months ago (1 children)

In the age of the internet, nothing is truly yours.

Just look at NFTs.

[–] [email protected] 12 points 9 months ago (2 children)

How are NFTs relevant?

[–] [email protected] 5 points 9 months ago

they aren't, except perhaps as a counterexample of some dubious sort

[–] [email protected] 3 points 9 months ago (3 children)

They were supposedly anchors to claim ownership of things in the real world.

CC BY-NC-SA 4.0

[–] [email protected] 5 points 9 months ago

They're fancy receipts, and if people thought of them as just that, it might be a technology with some limited non-monetary uses. But the crypto grift was too strong.

[–] [email protected] 3 points 9 months ago

Marking all your comments CC BY-NC-SA is a good bit.

The point of NFTs (beyond the pyramid scheme) was to enforce artificial digital scarcity at the individual level.

[–] [email protected] 2 points 9 months ago

They sold snake oil, nothing else.

[–] [email protected] 1 points 9 months ago

This is the least shocking revelation.