#LLMs are a fucking scourge. Perceiving their training infrastructure as anything but a horrific all-consuming parasite destroying the internet (and wasting real-life resources at a grand scale) is delusional.
#ChatGPT isn't a fun toy or a useful tool, it's _someone else's_ utility, built with complete disregard for human creativity and craft, mixed with malicious intent masquerading as "progress", and it should be treated as such.
That's fucking insane!
@khobochka Can confirm. They are a huge fraction of my website traffic and they don't honour any constraints I put on them short of hard blocks
@khobochka So, we need to find ways to penalize them without them noticing.
Rate-limiting and UA filtering did not do the trick, so we can assume that trying to feed poisoned content to them will also be detected easily.
Maybe the only way to protect against automated LLM bot groping is to put expensive functionality and special content behind login walls, going back to an internet driven by small communities, except that this time they are mostly closed to discovery from the outside.
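A minimal nginx sketch of that "login wall" idea, for what it's worth; the location, realm text, and file paths below are made up for illustration:

# put the expensive endpoints behind HTTP basic auth so anonymous crawlers get a 401
location /search/ {
    auth_basic           "members only";
    auth_basic_user_file /etc/nginx/.htpasswd;   # e.g. created with: htpasswd -c /etc/nginx/.htpasswd someuser
    proxy_pass http://127.0.0.1:8080;            # or whatever actually serves the content
}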
@khobochka we are having similar problems where I work (public digital libraries). In the last year we've seen a huge increase in traffic from bots, including bots that don't respect session cookies, so our servers were burning a lot of memory creating throwaway sessions for every request. Luckily we found a Tomcat valve config that avoids the session problem.
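I don't know which valve they used, but Tomcat does ship a CrawlerSessionManagerValve that forces every request matching a crawler user-agent regex onto one shared session instead of creating a new one each time; a sketch (the UA regex and interval here are just examples):

<!-- in context.xml (or inside <Host> in server.xml) -->
<Valve className="org.apache.catalina.valves.CrawlerSessionManagerValve"
       crawlerUserAgents=".*(GPTBot|ClaudeBot|CCBot|Bytespider|[bB]ot|[cC]rawler).*"
       sessionInactiveInterval="60" />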
@khobochka and the crawlers aren't just stupid for repeatedly crawling the same identical pages over and over, they also do completely nonsensical stuff. At my work we noticed that GPTBot had found our search button and made what looks like 100,000 requests per day to /search/ followed by "amp;" repeated 100 times, and then our support phone number or a URL to a random image followed by more amp;'s.
@khobochka I rescued an old forum whose administrator decided they weren’t interested anymore. I was aghast at the traffic coming from the big AI companies.
The first few weeks, I spent an hour every day blocking huge netblocks at the firewall. I wrote a script that summarized the heaviest hitters against the web server, investigated each one manually, then added it to the firewall's blocklist. At the end of the first month, I was blocking nearly 1% of all IPv4 addresses.
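Not their actual script, but the gist of that kind of heavy-hitter report fits in a couple of lines of shell; the log path, nftables table, and set name below are placeholders:

# top 20 client IPs in a combined-format access log
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

# after investigating an IP and looking up its netblock, drop the whole range,
# assuming an nftables set "crawlers" of type ipv4_addr with "flags interval" already exists
nft add element inet filter crawlers { 203.0.113.0/24 }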
@JustinDerrick @khobochka this matches my experience, yeah.
I'm also convinced those companies have dedicated crawlers for specific applications like mediawiki or phpbb.
I haven't seen the same aggressive behavior on any other web thing I host, and the coverage and approach to crawling the wiki almost makes me believe this is a mediawiki-specific crawler.
@denschub @khobochka This forum runs on SMF (Simple Machines Forum), but it has existed for 10+ years and has thousands of posts and tens of thousands of replies, precisely the type of thing LLMs need to be fed.
@khobochka there's clearly a free rider problem, where these LLMs benefit from other people's work to train their models. In my view, when you use other people's data to train your model that should give them a partial copyright over the model, as the model becomes a derivative work.
It's one thing for a search engine to crawl, since that sends traffic your way, but LLM crawling is all take, no give.
That one goes straight to the top of my list:
https://social.saarland/@fedithom/112455121136790239
@fedithom it's probably better to put the original post https://pod.geraspora.de/posts/17342163 to that list, not my cross-post
@khobochka
Perfect, thank you
@khobochka
robots.txt is just _asking nicely, pretty please_, who should not wander into which territories.
For real blocks you have to actually block such systems, by IP or other indicators.
Depending on the server it's comparatively easy to filter out user agents (on lighttpd, for example, you can filter by user agent; see the sketch at the end of this post).
While we're at it: does anyone have ideas for nice, resource-light adversarial-learning garbage that I could feed those suckers?
(though the zip bomb is a nice idea, too)
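A minimal lighttpd sketch of that user-agent filter; the bot list is only an example, and mod_access needs to be loaded:

# deny everything to requests whose User-Agent matches known AI crawlers
$HTTP["useragent"] =~ "(GPTBot|ClaudeBot|CCBot|Bytespider|Amazonbot)" {
    url.access-deny = ( "" )
}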
@khobochka straight up evil
@khobochka damn, didn't know the damn crawlers got aggressive to the point of changing their fucking user agent now...
@khobochka the fact they switch IPs and user agents is so scummy. I've been thinking data poisoning is the only real defense; it doesn't save costs, but fuck 'em, I'll burn CPU generating and maintaining wrong content for them to ingest.
@khobochka LLMs, by themselves, aren't the issue. The issue is human greed -- to be among the few winners in this new technology S-curve. That is what's turning people into assholes, and causing all of these problems. Greed leads to cut-throat competition, which in turn leads to people ignoring rules (or creative interpretations/advocacy to suit their greed/ambition).
@jay_nakrani @khobochka @GeekAndDad that greed is also what gave us the phones, software, servers, satellites, and networks that allow us to shitpost, and also to create wonderful support networks around the world
progress is a messy affair
@khobochka the power costs of this across the entire internet might be approaching the power of running the LLMs themselves.
It's a weapon.
@khobochka I remember back in the aughts there was talk that getting around an IP or UA block violated the anti-circumvention clause in the DMCA. Wonder if we could test that here.
@khobochka They are like pirates, or rather piranhas...
I really can't say how bored I am of this fucking hype around LLMs.
AI is mentioned everywhere. Even at work, people act like pirates, trying to circumvent compliance rules to dig for the gold they believe is there. And risking everything...
Someone needs to write a scanner that checks the logs for such misuse and blocks each such activity.
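One ready-made way to get that "scan the logs and block" behaviour is fail2ban with a custom filter; the filter name, UA list, log path, and ban time below are only an illustrative sketch:

# /etc/fail2ban/filter.d/ai-bots.conf
[Definition]
failregex = ^<HOST> .*"[^"]*(?:GPTBot|ClaudeBot|CCBot|Bytespider)[^"]*"$

# added to /etc/fail2ban/jail.local
[ai-bots]
enabled  = true
filter   = ai-bots
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400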
@khobochka ...that's consistent with some weird periodic slowdown behaviour I've started seeing on my little personal website over the last year or so.
So far I've just waited for it to go away, as the ssh console is unresponsive during the problem, but next time it happens I guess I'll check the access log for signs of Abominable Intelligence.
I've noticed that the amount of this behaviour has been increasing since October 2024.
It's making websites everywhere slower.
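When you do get to the access log, a quick way to spot them (assuming a combined-format log at this made-up path) is to rank user agents by request count:

# top 20 user agents by number of requests
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20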
@khobochka Change your UA. Put them on notice (with a legal letter) that certain of their bots will be charged $ per visit. Wait a reasonable amount of time (e.g. 4 weeks). Start invoicing. Sue in small claims. Donate the money to FOSS search/archive/fedi etc.
@khobochka guess why I maintain a #Scraper #blocklist?
http://hil-speed.hetzner.com/10GB.bin as an extra middle finger!
@khobochka OFC contributions are welcome!
https://github.com/greyhat-academy/lists.d/blob/main/scrapers.ipv4.block.list.tsv
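A rough shell sketch for feeding a list like that into an nftables set. I haven't verified the exact column layout of the TSV, so the grep/cut is an assumption (one address or CIDR in the first column, comment lines starting with #), and LIST_URL, table, and set names are placeholders:

# LIST_URL should point at the raw version of the file linked above
curl -fsSL "$LIST_URL" | grep -v '^#' | cut -f1 | while read -r net; do
    nft add element inet filter scrapers "{ $net }"
done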
@tdelmas @khobochka that is due to #GitHub parsing stuff...
@kkarhan may i also add i personally redirect these bots to gz.niko.lgbt which returns a innocent looking 100MB HTTP response with content-encoding: gzip that decompresses to 100GB
and the only way to find out is to actually decompress it so if they want anything they gotta go through it
@niko Okay, that's way cooler...
That's some #NextLevel #AssetDenial ...
@drwho @kkarhan i use nginx and a few if blocks (yes i'm aware if is evil but to the best of my knowledge if is the only way to get a boolean AND expression)
set $redir_to_gz 1;

if ($host = gz.niko.lgbt) {
    set $redir_to_gz 0;
}

if ($http_user_agent !~* (claudebot|ZoominfoBot|GPTBot|SeznamBot|DotBot|Amazonbot|DataForSeoBot|2ip|paloaltonetworks.com|SummalyBot|incestoma)) {
    set $redir_to_gz 0;
}

if ($redir_to_gz) {
    return 301 https://gz.niko.lgbt/;
}
as for the actual stuff behind gz.niko.lgbt
server {
    # SSL and listen -- snipped

    # static files
    root /var/www/gz.niko.lgbt;

    location / {
        add_header Content-Encoding gzip;
        try_files /42.gz =404;
        gunzip off;
        types { text/html gz; }
    }

    # additional config -- snipped
}
gunzip off is very important, because without it, if the client doesn't support gzip encoding, nginx will try to decompress the whole response on the fly and blow its own foot off
42.gz is generated with dd if=/dev/zero bs=1M count=102400 | gzip -c - > 42.gz (100 GiB of zeros, which gzip squeezes down to roughly 100MB)
@kkarhan @niko I use both. Belt and suspenders.
What I want to do is maintain one set of files for the stuff I have on shared hosting. I keep the ones for my website up to date and plan to write a script that copies them into the other webdirs.
The nginx servers are all behind http basic auth, so they can't see them anyway.
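A trivial sketch of that copy script; the source directory, file names, and webdir glob are all made up:

# push the canonical robots.txt and blocklist-enforcing .htaccess into every webdir
for dir in /var/www/*/htdocs; do
    cp /srv/canonical/robots.txt /srv/canonical/.htaccess "$dir"/
done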
@khobochka @niko yeah...
Bonus points if it's served as an alternative to an archive file or sth.
@khobochka
Doing ultra-wide dumb crawls without proper per-domain rate limits, caching, and the like just sounds like severe developer incompetence.
@ide I don't think this is a question of competence. Competence could be summed up as the ability to efficiently achieve goals, which involves feedback leading to anticipation of problems (e.g. use a per-domain rate limit -> avoid getting blocked; use a cache -> lower compute by using storage). However, there's neither a defined goal nor any feedback here: we can't harm an LLM crawler that causes harm, its developer won't bother caching the entire internet, and they have access to near-unlimited compute. This is a question of ethics.
@khobochka pulling the entire internet over the network multiple times is more expensive than caching the unchanging parts of it on disk, right? Even disregarding the ethics, there's a clear optimization these developers don't appear to make, meaning they don't qualify for the "efficiently" part of "an ability to efficiently achieve goals". 'Incompetent' seems like a fair conclusion to me so far. Am I missing something?
@khobochka @ide They literally do not care. They don't have to. It's the narcissist billionaire mentality that everything exists to serve them.
@khobochka We need an international co-operative system of making these parties pay for scraping. It includes legislative changes. At the same time it can become a real-time pricing market for ”rights to scrape” and for creators to get paid.
Here’s my whitepaper for a solution. Absolutely no cryptocurrency involved.
#ai #scraping #copyright #technology #whitepaper
https://docs.google.com/document/d/18cz-ZX1copCYiC4C2ReY8GLJjuhG2IH0MEBGaoSJhP4/edit
Knowing that there are legitimate and conscientious scrapers out there, this is particularly maddening. AI is the ultimate techbro tech, an anti-Midas touch shittifying the world around us. And as we keep pointing out: no one needs this shit and nobody wants it.
Speaking for my own freewheeling mind of course...
#fediVerse #AI #dataMining #robotsTXT #fediAdmin
This looks to me much more like we should bury Trojan horses right in the bellies of the beasts. My server rules and profiles state that all data is CC BY-NC-SA.
If they use that data and train on it, they should definitely end up in serious legal and financial trouble.
@gimulnautti @khobochka The only thing that #RoyaltySchemes like that have created is rich #CollectingAgencies that act as #ValueRemoving #Rentseekers (e.g. #GEMA only pays out 9.0909% of all the royalties it collects, and in an intransparent manner!) levied on every "Reproduction Device" (i.e. printers, burners, copiers, scanners) and all "Blank (Recordable) Media" (i.e. USB drives, SD cards, recordable BD-RWs), AND more criminalization.
Anything else is just not gonna work...
https://www.youtube.com/watch?v=9XN57BhyZwk
https://infosec.space/@kkarhan/113725769698521647
@kkarhan @khobochka I don't believe that denying business or market incentives works either.
Your scheme has low risk but terrible chance of adoption. My scheme has moderate risk but at least a possibility of adoption.
Silicon Valley has itself worked to avoid getting regulated, by boosting this narrative that "all attempts to govern us will end in rich collecting schemes".
We fight amongst ourselves. While they use "freedom" to inflict their tyranny on our assets.
@gimulnautti @kkarhan @khobochka Agreed.. trying to outright block this kind of development is not just doomed to fail but will also cause unwanted side effects. Regulation is IMO the right way forward.
@khobochka This is terrible. I updated the crawler blocklist in the robots.txt of my website, and they seem to have multiplied their crawl attempts tenfold.
@kurio it's heartbreaking that people who run human-centric infra such as personal and federated servers, which is hard enough as it is, have to make an additional inhuman effort to fight off an army of corporate cloud-scale bullshit-generating lawnmowers 
@kurio @khobochka Because you had the audacity to add them to the file.
@drwho @khobochka They punished me for punishing them. 
@kurio @khobochka Firewall rules to cut 'em off. Nope, no server here, slugworts, go away.
@kurio @khobochka I mean, this is what bullies do: If you fight back they escalate beyond all reason specifically because somebody said "no" to them.