This is written up in a bunch[1] of places, but basically: the Dead Sea Scrolls were discovered circa 1946, then excavated, and then hoarded by academics who wanted to release them only piecemeal and slowly, after consideration.
Alas for this plan: to assist researchers who wanted to run statistical analyses of which words were where on what pages, they released a concordance — an index of all of the texts…
[…] organized like a dictionary, listing every use of every word that appears in a text, identifying the places where the word is found and also giving the context in which the word is used.
Source: NYT
But what they never bet on was that someone would take the concordance, feed it into a computer, and use the computer to reconstruct (or: “reverse-engineer”) a copy of the original text (allowing that there may be errors in the construction of the concordance or in the process of reconstruction). That reconstruction was subsequently published, effectively forcing the scroll-keepers into actual publication of the originals.
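To make the reverse-engineering concrete, here is a toy sketch in Python. It is not the method or the data format actually used on the scrolls, which I have not seen described in detail; it assumes a hypothetical concordance that records each word's location as a (line number, word position) pair, and the function names `build_concordance` and `reconstruct` are my own. The point is only that an index which says where every word sits is, in effect, the text itself in a different order.

```python
from collections import defaultdict


def build_concordance(lines):
    """Index every word by its (line number, word position) locations.

    Assumption: a simplified, hypothetical concordance format, not the
    layout of the real Dead Sea Scrolls concordance.
    """
    concordance = defaultdict(list)
    for line_no, line in enumerate(lines):
        for pos, word in enumerate(line.split()):
            concordance[word].append((line_no, pos))
    return dict(concordance)


def reconstruct(concordance):
    """Invert the index by putting each word back at its recorded location."""
    placed = {}
    for word, locations in concordance.items():
        for location in locations:
            placed[location] = word

    # Walk the locations in reading order and rebuild each line.
    rebuilt = defaultdict(list)
    for line_no, pos in sorted(placed):
        rebuilt[line_no].append(placed[(line_no, pos)])
    return [" ".join(rebuilt[n]) for n in sorted(rebuilt)]


# Made-up sample text, purely for illustration.
text = ["in the beginning was the word", "and the word was indexed"]
index = build_concordance(text)
assert reconstruct(index) == text
```

A real concordance is messier: entries give surrounding context rather than tidy coordinates, and the index may contain mistakes, which is where the errors allowed for above would creep in.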
Key Questions
Ignoring that the original texts were so old as to not be in copyright…
- Was the concordance a derivative work of the original texts?
- Was analysis of the concordance, in order to reconstruct the original, in keeping with the intention of providing the concordance for analysis? Would this matter?
- Would an LLM built from all the scrapeable texts on the internet (including, for instance, the Internet Archive copy of the 1991 NYT article re: this matter) be a derivative work of the NYT article, or of the entire internet? Should we distinguish?
- If someone, for instance a lawyer representing the NYT, prompts an LLM with text that will provoke it to roughly reconstruct (allowing that there may be errors in the construction of the model or in the process of reconstruction) an NYT article, is the result a breach of copyright? Is it a mechanically derived work (the reconstruction) of (perhaps) another derived work (the model)?
These are fun things to think about. I’m not going to suggest answers because I am not trying to win an argument one way or another, but I do believe that the future of knowledge accessibility is at stake.