Why armchair (and professional) lawyers who want to argue the OpenAI vs. New York Times copyright lawsuit should look towards the Dead Sea Scrolls…

This is written up in a bunch[1] of places, but basically: the Dead Sea Scrolls were discovered circa 1946, then excavated, and then hoarded by academics who wanted to release them only piecemeal and slowly, after consideration.

Alas for this plan: to assist researchers who wanted to run statistical analyses of which words were where on what pages, they released a concordance — an index of all of the texts…


[…] organized like a dictionary, listing every use of every word that appears in a text, identifying the places where the word is found and also giving the context in which the word is used. 

Source: NYT

But what they never bet upon was that someone would take the concordance, feed it into a computer, and use the computer to reconstruct (or “reverse-engineer”) a copy of the original text, allowing that there may be errors in the construction of the concordance or in the process of reconstruction. The reconstruction was subsequently published, effectively forcing the scroll-keepers into actual publication of the originals.
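For the curious, here is a toy sketch of why a complete index gives the game away. It assumes a much simpler concordance than the real one: each word is recorded with its exact position in the text, whereas the scrolls concordance recorded locations plus surrounding context, and the real reconstruction had to piece those together. The function names `build_concordance` and `reconstruct` are made up for illustration.

```python
from collections import defaultdict


def build_concordance(text):
    """Map each distinct word to the list of positions where it occurs.

    Assumption: positions are exact word offsets, which is simpler than
    the location-plus-context entries of a real concordance.
    """
    concordance = defaultdict(list)
    for position, word in enumerate(text.split()):
        concordance[word].append(position)
    return dict(concordance)


def reconstruct(concordance):
    """Invert the concordance: put every word back at its recorded positions."""
    placed = {}
    for word, positions in concordance.items():
        for position in positions:
            placed[position] = word
    # Reading the positions in order yields the running text again.
    return " ".join(placed[i] for i in sorted(placed))


original = "in the beginning was the word and the word was written"
concordance = build_concordance(original)
rebuilt = reconstruct(concordance)
```

In this toy case `rebuilt` comes out identical to `original`: an index that lists every use of every word, with its place, holds the same information as the text, just sorted differently.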

Key Questions

Ignoring that the original texts were so old as to not be in copyright…

  1. Was the concordance a derivative work of the original texts?
  2. Was analysing the concordance in order to reconstruct the original in keeping with the intention of providing the concordance for analysis? Would this matter?
  3. Would an LLM built from all the scrape-able texts on the internet (including, for instance, the Internet Archive copy of the 1991 NYT article re: this matter) be a derivative work of the NYT article, or of the entire internet? Should we distinguish?
  4. If someone, for instance a lawyer representing the NYT, prompts an LLM with text that will provoke it to roughly reconstruct an NYT article (allowing that there may be errors in the construction of the model or in the process of reconstruction), is the result a breach of copyright? Is it a mechanically derived work (the reconstruction) of (perhaps) another derived work (the model)?

These are fun things to think about. I’m not going to suggest answers because I am not trying to win an argument one way or another, but I do believe that the future of knowledge-accessibility is at stake.

[1] Sources
