lemmy.ml meta

Anything about the lemmy.ml instance and its moderation.

For discussion about the Lemmy software project, go to [email protected].

founded 3 years ago

Should lemmy.ml block chatgpt scraping in robots.txt? (lemmy.ml)

submitted 1 year ago by [email protected] to c/[email protected]

Some context about this here: https://arstechnica.com/information-technology/2023/08/openai-details-how-to-keep-chatgpt-from-gobbling-up-website-data/

The robots.txt would be updated with this entry:

User-agent: GPTBot
Disallow: /

Obviously this is meaningless against non-OpenAI scrapers or anyone who just doesn't give a shit.
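
For anyone wondering what that entry does in practice, here is a minimal sketch of how a well-behaved crawler might evaluate it, using Python's standard `urllib.robotparser`. The helper name `is_allowed`, the second user agent `SomeOtherBot`, and the example post URL are made up for illustration; this is not how GPTBot itself is known to be implemented.

```python
from urllib.robotparser import RobotFileParser

# The proposed entry from the post, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits `user_agent` to fetch `url`.

    Hypothetical helper: a compliant crawler would run a check like this
    before each request. Nothing on the server side enforces the result.
    """
    parser = RobotFileParser()
    # Parse the rules from memory instead of fetching them over HTTP.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://lemmy.ml/post/1"  # illustrative URL, not a real reference
    print("GPTBot allowed:", is_allowed("GPTBot", url))
    print("SomeOtherBot allowed:", is_allowed("SomeOtherBot", url))
```

With only this entry present, the parser reports GPTBot as blocked everywhere and any agent without a matching entry as allowed, which matches the caveat above: the rule only affects crawlers that identify as GPTBot and choose to run a check like this.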

[–] [email protected] 2 points 1 year ago (2 children)

Wouldn't they theoretically be able to set up their own instance, federate with all the larger ones, and scrape the data that way? I'm not sure blocking them via robots.txt is the most effective barrier if they really want the data.

[–] [email protected] 11 points 1 year ago* (last edited 1 year ago)

Robots.txt is more of an honor system. If they respect it, they won't do that trick.

[–] [email protected] 5 points 1 year ago

Robots.txt is just a notice anyways. Your scraper could just ignore it, no workaround necessary.