…and (in case it’s not apparent) I propose that this argument can be extrapolated to the entire rest-of-the-net; if we don’t accept that LLMs may scrape data, then we need to popularise the throwing-up of paywalls and identity being required for all web-fetches.
Because “trust” ain’t gonna cut it.
Thread
The number of devs from <major platforms> who spend their days browsing @StackOverflow to solve <problems> for <commercial gain> is huge; StackOverflow see some Ad-revenue (see illustration) & users see ~nil.
If an LLM ingests StackOverflow, should it click the Ad-links? Because otherwise, if it does not, the question is: who loses?
The information [in the StackOverflow postings] is out there, and it’s regularly put to indirect/derivative commercial gain. To enforce otherwise would oblige the use, even the *mandate*, of identity checks to (in short) “paywall” the entire net. Such would be illiberal.
So my friends in the pro-fair-use *yet* pro-creator, “copyright is okay so long as it’s licensed appropriately” community (@jimkillock?) had better soon decide whether they prefer an open net with occasional LLMs roving it, or bulk authenticated access to content with EULAs.
But of course it goes without saying that if the LLM-scrapers are not honouring robots.txt, then they should be excoriated, because we already have standards for that sort of thing.
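For the avoidance of doubt about what "honouring robots.txt" means in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The bot name `ExampleLLMBot`, the `example.com` URLs, the sample rules, and the helper `may_fetch` are all hypothetical, invented for illustration; this is not how any particular scraper is known to work.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is shut out entirely,
# everyone else is only kept out of /private/.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would instead call
    # set_url() and read() to download the site's own robots.txt.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleLLMBot", "SomeOtherBot"):
        for url in ("https://example.com/questions/1",
                    "https://example.com/private/notes"):
            verdict = "allowed" if may_fetch(EXAMPLE_ROBOTS_TXT, agent, url) else "disallowed"
            print(f"{agent} -> {url}: {verdict}")
```

A well-behaved scraper would run a check like this before every fetch and simply skip anything disallowed; the standard is voluntary, which is rather the point of the excoriation.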
Originally tweeted by Alec Muffett (@AlecMuffett) on 2023/04/01.
