We understand copium, but to properly describe the stupidity I read in an ArsTechnica post I sought a more refined term, so I asked some LLMs: “…is there a word to describe the state of self-delusion that one is being subversive and challenging authority through disruptive means?”
ChatGPT suggested a neologism:
Kakistodicy [is] a blend I created based on kakistocracy (rule by the worst) and theodicy (justification of a system despite its flaws)
…which is not quite there, but (with a little poking) Claude did a better job:
“Puerversive” could work – combining “puerile” (childishly foolish and immature) with “subversive.” This would capture both the genuine attempt at subversion and its childish, ineffective nature. This construction follows similar patterns to words like “perversive” or “subversive” while incorporating the Latin “puer” (meaning child/boyish), suggesting behavior that’s rebellious in an immature, naive way rather than being purely fake or performative. It would describe someone who is genuinely trying to be subversive, but in ways that are foolishly inadequate or betray a childish understanding of power structures – like vandalizing a bathroom stall and genuinely believing it’s a blow against the system.
And so we come to ArsTechnica‘s reportage of Nepenthes:
Last summer, Anthropic inspired backlash when its ClaudeBot AI crawler was accused of hammering websites a million or more times a day. […] Building on an anti-spam cybersecurity tactic known as tarpitting, [“Aaron”, a pseudonym] created Nepenthes, malicious software named after a carnivorous plant that will “eat just about anything that finds its way inside.”
Aaron clearly warns users that Nepenthes is aggressive malware. It’s not to be deployed by site owners uncomfortable with trapping AI crawlers and sending them down an “infinite maze” of static files with no exit links, where they “get stuck” and “thrash around” for months, he tells users. Once trapped, the crawlers can be fed gibberish data, aka Markov babble, which is designed to poison AI models. That’s likely an appealing bonus feature for any site owners who, like Aaron, are fed up with paying for AI scraping and just want to watch AI burn.
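For scale of the threat: “Markov babble” is nothing exotic, just a random walk over word-successor statistics, producible in a dozen lines of awk. A toy sketch follows, assuming a seed file corpus.txt; this is illustrative only, not Aaron’s actual code:

```
# Toy Markov babble generator (illustrative only, not Nepenthes' code).
# Builds a word-bigram successor table from corpus.txt (assumed to exist)
# and emits ~200 words of plausible-looking gibberish by random walk.
awk '
  {
    for (i = 1; i <= NF; i++) {
      words[++n] = $i                               # pool of all words, for restarts
      if (i < NF) succ[$i] = succ[$i] " " $(i + 1)  # successor list per word
    }
  }
  END {
    srand()
    w = words[int(rand() * n) + 1]                  # random starting word
    for (out = 0; out < 200; out++) {
      printf "%s ", w
      m = split(succ[w], cand, " ")                 # candidate successors
      if (m == 0) w = words[int(rand() * n) + 1]    # dead end: jump anywhere
      else        w = cand[int(rand() * m) + 1]     # else pick a random successor
    }
    print ""
  }
' corpus.txt
```

Each run emits text that is locally plausible and globally meaningless, which is the whole trick.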
Let’s call this what it is: a column-filler article reporting a puerile hack that purports to stick it to the man but is in actuality a one-off speedbump with negligible impact, unworthy of the attention. Sometimes entire universities engage in the same game (Glaze, Nightshade, Fawkes), but the impact is about the same.
As for Nepenthes, it turns out that all you have to do is ask an AI for the general principles of how best to configure curl (see below); everything else you will need to avoid Nepenthes is on the curl manual page, and having implemented it once for your scraper, you are done forever.
DeepSeek
consider: you are a malware analyst scraping a potentially hostile website with curl on the command line; what options and features of curl do you enable in order to ensure that your scrape completes in a reasonable amount of time, avoids traps and ratholes and excessively large files and content that is being trickled out slowly so that individual fetches may take artificially long times to complete, if ever? think like an infosec redteamer who is willing to sacrifice thorough coverage of the website scrape in favour of getting a substantive sample of many URLs in a reasonable time
To effectively use curl for scraping a potentially hostile website while avoiding traps and ensuring efficiency, consider the following structured approach:
curl Command Setup
- Timeouts and Limits:
  - Connection Timeout: Set with `--connect-timeout 5` to limit the initial connection time to 5 seconds.
  - DNS Timeout: Use `--dns-timeout 5` to cap DNS resolution at 5 seconds.
  - Maximum Time: Apply `--max-time 10` to ensure the entire operation doesn’t exceed 10 seconds.
- Data Management:
  - Max Filesize: Limit downloads to 1MB using `--max-filesize 1M` to avoid large files.
  - Speed Limit: Abort slow transfers with `--speed-limit 100 --speed-time 20`, stopping if under 100 bytes/s for 20 seconds.
- Redirection and HTTP/2:
  - Redirect Limit: Prevent endless loops with `-L --max-redirs 5`, allowing up to 5 redirects.
  - HTTP/2: Enable for efficiency with `--http2`.
- User-Agent and Output:
  - User-Agent: Spoof a browser with `-A "Chrome/XX.X.XXXX.XX"` to avoid detection.
  - Silent Output: Use `-s` to suppress progress meter clutter.
  - Output File: Save results with `-o output.html`.
- Security Considerations:
  - HTTPS Verification: Skip with `-k` if necessary, though use cautiously.
Final Command
curl -L --max-redirs 5 --max-time 10 --max-filesize 1M --speed-limit 100 --speed-time 20 --connect-timeout 5 --dns-timeout 5 --http2 -A "Chrome/XX.X.XXXX.XX" -s -k -o output.html http://example.com
Optional Enhancements
- Proxy Usage: Add `-x PROXY:PORT` for anonymity.
- Resuming Downloads: Use `-C -` if partial downloads are useful.
- HTTP Version: Force with `--http1.1` if server issues arise.
This setup balances thoroughness with efficiency, ensuring a swift and safe scraping process.
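Which supports the “done forever” point above: wrap the flags once and reuse. A sketch under my own assumptions (the function name, output naming, and urls.txt are mine, not DeepSeek’s; I have dropped `-k`, since disabling TLS verification is rarely wise, and `--dns-timeout`, since `--connect-timeout` already bounds the name-resolution phase):

```
#!/usr/bin/env bash
# safe_fetch: curl with the anti-tarpit flags baked in; one URL per call.
# Flag values mirror the DeepSeek answer above; tune to taste.
safe_fetch() {
  curl -L --max-redirs 5 \
       --connect-timeout 5 --max-time 10 \
       --max-filesize 1M \
       --speed-limit 100 --speed-time 20 \
       --http2 -s -A "Mozilla/5.0" \
       -o "$(printf '%s' "$1" | tr -c 'A-Za-z0-9' '_').html" \
       "$1" \
    || printf 'skipped (curl exit %s): %s\n' "$?" "$1" >&2   # 28 = timeout, 63 = too large
}

# Worst case is bounded: --max-time guarantees no fetch runs past 10 seconds,
# so N URLs complete in at most ~10*N seconds even against a tarpit.
while IFS= read -r url; do safe_fetch "$url"; done < urls.txt
```

To watch the anti-trickle defense fire, point it at httpbin’s deliberately slow /drip endpoint (e.g. https://httpbin.org/drip?duration=30&numbytes=100); curl should bail out with exit code 28 when `--max-time` expires instead of waiting out the dribble.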