It's based on depictions of Anubis in pop media: a jackal that weighs your soul, and if it's heavier than a feather, you die. Figured it was a good vibe for a bot filter project.
The new rendition of Anubis is going to be aggressively Canadian. Be prepared.
It's a situation where it's difficult to tell for individual requests at request handling time, but easy to see when you look at the total request volume.
Sure, if you're going to deploy it on your company site, but I think if you're running a personal website and want to throttle LLM crawlers without falsely advertising that you're a furry, you could just go and modify this piece of MIT-licensed software.
Or, alternatively, just embrace the anime-porn content spectrum. I mean, just compare the platforms that are free of it with the ones that are chock full of it, and see which ones die and which ones grow.
Does the PoW make money via crypto mining? Or is it just to waste the caller's CPU cycles? If you could monetize the PoW then you could re-challenge at an interval tuned so that the caller pays for their usage.
It's to waste CPU cycles. I don't want to touch cryptocurrency with a 20 foot pole. I realize I'm leaving money on the table by doing this, but I don't want to alienate the kinds of communities I want to protect.
By doing PoW as a side effect of something you need to do anyway for other reasons, you actually make mining less profitable for other miners, which helps eliminate waste.
This is an aspect that a lot of PoW haters miss. While PoW is a waste, there are long-term economic incentives to minimize it, either by making it a side effect of something actually useful or by using energy that would go to waste anyway, so its overall effect gravitates toward neutral.
Unfortunately such second-order effects are hard to explain to most people.
I always felt like crypto is nothing but speculating on value with no other good uses, but here there is a kind of real motivation.
Say a hash challenge gets widely adopted, and scraping becomes more costly, maybe even requires GPUs. This is great, you can declare victory.
But what if after a while the scraping companies, with more resources than real users, are better able to solve the hash?
Crypto appeals here because you could make the scrapers cover the cost of serving the pages they request.
Ofc if you’re leery of crypto you could try to find something else for bots to do. xkcd 810 for example. Or folding at home or something. But something to make the bot traffic productive, because if it’s just a hardware capability check maybe the scrapers get better hardware than the real users. Or not, no clue idk
I don't think it's possible to develop a frontier model without mass scraping. The economics simply don't add up. You need at least 10 trillion tokens to make an 8 billion parameter model. 10 trillion tokens is something like 40 terabytes.
You simply can't get 40 terabytes of text without mass scraping.
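(Back-of-the-envelope, assuming a token averages roughly 4 bytes of text: 10 trillion tokens × ~4 bytes/token ≈ 4 × 10^13 bytes, i.e. about 40 terabytes.)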
AI expert here. It's probably for collecting training data and the crawlers are probably very unsupervised. I'd guess that they're literally the most simplistic crawler code you can imagine combined with parallelism across machines.
The good news is that it's easy to disrupt these crawlers with some easy hacks. Tactical reverse slowloris is probably gonna make a comeback.
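For illustration, a minimal Go sketch of what a "reverse slowloris" style tarpit could look like; isSuspectedBot is a hypothetical placeholder for whatever heuristic you trust, and the numbers are made up:

    package main

    import (
        "net/http"
        "time"
    )

    // tarpit drips a tiny response out one byte at a time so a suspected
    // crawler keeps a connection (and its patience) tied up for minutes.
    func tarpit(w http.ResponseWriter, r *http.Request) {
        flusher, ok := w.(http.Flusher)
        if !ok {
            http.Error(w, "streaming unsupported", http.StatusInternalServerError)
            return
        }
        w.Header().Set("Content-Type", "text/html")
        body := []byte("<html><body>loading...</body></html>")
        for _, b := range body {
            if _, err := w.Write([]byte{b}); err != nil {
                return // client gave up
            }
            flusher.Flush()
            time.Sleep(10 * time.Second) // stretch a tiny page over several minutes
        }
    }

    func main() {
        // isSuspectedBot stands in for whatever bot heuristic you already have.
        isSuspectedBot := func(r *http.Request) bool { return false }

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            if isSuspectedBot(r) {
                tarpit(w, r)
                return
            }
            w.Write([]byte("hello, human"))
        })
        http.ListenAndServe(":8080", nil)
    }

The trade-off is that you're also holding a socket and a goroutine open per trapped client, so this only wins if your connections are cheaper than the crawler's.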
If it's for training data, why are they straining FOSS so much? Are there thousands of actors repeatedly making training data all the time? I thought it was sort of a one-off thing with the big tech players.
Git forges are some of the worst case for this. The scrapers click on every link on every page. If you do this to a git forge, it gets very O(scary) very fast because you have to look at data that is not frequently looked at and will NOT be cached. Most of how git forges are fast is through caching.
The thing about AI scrapers is that they don't just do this once. They do this every day in case every file in a glibc commit from 15 years ago changed. It's absolutely maddening and I don't know why AI companies do this, but if this is not handled then the git forge falls over and nobody can use it.
Anubis is a solution that should not have to exist, but the problem it solves is catastrophically bad, so it needs to exist.
It's really surreal to see my project in the preview image like this. That's wild! If you want to try it: https://github.com/TecharoHQ/anubis. So far I've noticed that it seems to actually work. I just deployed it to xeiaso.net as a way to see how it fails in prod for my blog.
I really like this. I don't mind the Internet acting like the Wild Wild West, but I do mind that there's no accountability. This is a nice way to pass the economic burden to the crawlers for sites that still want to stay freely available. You want the data, spend money on your side to get it. Even though the downside is your site could be delisted from search engines, there's no reason why you cannot register your service in a global or p2p indexer.
One piece of feedback: could you add some explanation (for humans) of what we're supposed to do and what is happening when met by that page?
I know there is a loading animation widget thingy, but the first time I saw that page (some weeks ago at the Gnome issue tracker), it was proof-of-work'ing for like 20 seconds, and I wasn't sure what was going on, I initially thought I got blocked or that the captcha failed to load.
Of course, now I understand what it is, but I'm not sure it's 100% clear when you just see the "checking if you're a bot" page in isolation.
also if you're using JShelter, which blocks Worker by default, there is no indication that it's never going to work, and the spinner just goes on forever doing nothing
Maybe one of those (slightly misleading) progress bars that have a dynamic speed that gets slower and slower the closer it gets to the finish? Just to indicate that it's working towards something.
It'll be somewhat involved, but based on the difficulty vs. the client's hashing speed you could say something probabilistic like "90% of the time, this window will be gone within xyz seconds"?
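To sketch the math (purely illustrative, and assuming the difficulty is counted in leading hex zeroes): each nonce succeeds with probability p = 16^-difficulty, so the 90th percentile of the geometric distribution gives the estimate. Something like:

    package main

    import (
        "fmt"
        "math"
    )

    // secondsForConfidence estimates how long until the PoW finishes with the given
    // probability, from the difficulty (leading hex zeroes) and a measured client
    // hash rate (hashes per second).
    func secondsForConfidence(difficulty int, hashesPerSec, confidence float64) float64 {
        p := math.Pow(16, -float64(difficulty))              // chance a single nonce works
        attempts := math.Log(1-confidence) / math.Log1p(-p)  // geometric distribution quantile
        return attempts / hashesPerSec
    }

    func main() {
        // e.g. difficulty 5 and a client doing ~200k hashes/sec (made-up numbers)
        fmt.Printf("90%% of the time, done within %.1f seconds\n",
            secondsForConfidence(5, 200_000, 0.9))
    }

You'd measure the hash rate from the first second or so of work and update the estimate as you go.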
Anubis is only going to work as long as it doesn't get famous; if that happens, crawlers will start using GPUs / ASICs for the proof of work and it's game over.
Actually, that is not a bad idea. @xena maybe Anubis v2 could make the client participate in some sort of SETI@HOME project, creating the biggest distributed cluster ever created :-D
Integrate a way to calculate micro-amounts of the shitcoin of your choice and we might have another actually legitimately useful application of cryptocurrencies on our hands..!
Maybe I'm missing something, but doesn't this mean the work has to be done by the client AND the server every time a challenge is issued? I think ideally you'd want work that was easy for the server and difficult for the client. And what is to stop the server from being DDoS'd by clients that are challenged but neglect to perform the challenge?
Regardless, I think something like this is the way forward if one doesn't want to throw privacy entirely out the window.
We usually write it out in hex form, but that's literally what the bytes in RAM look like. In a proof of work validation system, you take some base value (the "challenge") and a rapidly incrementing number (the "nonce"), so the thing you end up hashing is this:
await sha256(`${challenge}${nonce}`);
The "difficulty" is how many leading zeroes the generated hash needs to have. When a client requests to pass the challenge, they include the nonce they used. The server then only has to do one sha256 operation: the one that confirms that the challenge (generated from request metadata) and the nonce (provided by the client) match the difficulty number of leading zeroes.
The other trick is that presenting the challenge page is super cheap. I wrote that page with templ (https://templ.guide) so it compiles to native Go. This makes it as optimized as Go is, modulo things like variable replacement. If this becomes a problem I plan to prerender things as much as possible. Rendering the challenge page from binary code or RAM is always always always going to be so much cheaper than your webapp ever will be.
I'm planning on adding things like changing out the hash in use, but right now sha256 is the best option because most CPUs in active deployment have instructions to accelerate sha256 hashing. This combined with webcrypto jumping to heavily optimized C++ and the JIT in JS being shockingly good means that this super naïve approach is probably the most efficient way to do things right now.
I'm shocked that this all works so well and I'm so glad to see it take off like it has.
I am sorry if this question is dumb, but how does proof of work deter bots/scrapers from accessing a website?
I imagine it costs more resources to access the protected website, but would this stop the bots? Wouldn't they be able to pass the challenge and scrape the data afterwards? Or do normal scrape bots usually time out after a small amount of time/resources is used?
> I think ideally you'd want work that was easy for the server and difficult for the client.
That's exactly how it works (easy for the server, hard for the client). Once the client has completed the Proof-of-Work challenge, the server doesn't need to complete the same challenge, it only needs to validate that the result checks out.
Similar to Proof-of-Work blockchains, where coming up with the block hashes is difficult, but validating them isn't nearly as compute-intensive.
This asymmetric computation requirement is probably the most fundamental property of Proof-of-Work, Wikipedia has more details if you're curious: https://en.wikipedia.org/wiki/Proof_of_work
Fun fact: it seems Proof-of-Work was used as a DoS preventing technique before it was used in Bitcoin/blockchains, so seems we've gone full circle :)
I think going full circle would be something like bitcoin being created on top of DoS prevention software and then eventually DoS prevention starting to use bitcoin. A tool being used for something, then something else, then the first something again is just... nothing? Happens all the time?
I'm commissioning an artist to make better assets. These are the placeholders that I used with the original rageware implementation. I never thought it would take off like this!
I love that I seem to stumble upon something by you randomly every so often. I'd just like to say that I enjoy your approach to explanations in blog form and will look further into Anubis!
I was inspired by https://en.wikipedia.org/wiki/Hashcash, which was proof of work for email to disincentivize spam. To my horror, it worked sufficiently for my git server so I released it as open source. It's now its own project and protects big sites like GNOME's GitLab.
That's cool! What if instead of sha256 you used one of those memory-hard functions like scrypt? Or is sha needed because it has a native impl in browsers?
Right now I'm using SHA-256 because this project was originally written as a vibe sesh rage against the machine. The second reason is that the combination of Chrome/Firefox/Safari's JIT and webcrypto being native C++ is probably faster than what I could write myself. Amusingly, supporting this means it works on very old/anemic PCs like PowerMac G5 (which doesn't support WebAssembly because it's big-endian).
I'm gonna do experiments with xeiaso.net as the main testing ground.
I'm curious if the PoW component is really necessary, AIUI untargeted crawlers are usually curl wrappers which don't run Javascript, so requiring even a trivial amount of JS would defeat them. Unless AI companies are so flush with cash that they can afford to just use headless Chrome for everything, efficiency be damned.
Sadly, in testing the proof of work is needed. The scrapers run JS because if you don't run JS the modern web is broken. Anubis is tactically designed to make them use modern versions of Firefox/Chrome at least.
They really do use headless chrome for everything. My testing has shown a lot of them are on Digital Ocean. I have a list of IP addresses in case someone from there is reading this and can have a come to jesus conversation with those AI companies.
What would those 5 lines of code look like? The basis of this solution is that it asks the client to solve a computationally-intensive problem whose solution, once provided, isn't computationally-intensive to check. How would those 5 lines of code change this?
Use judo techniques. Turn their own computing power against them with fake links to Markov-generated bullshit served at random, until their cache gets poisoned past the point of no return; the LLMs begin to either forget their own stuff or hallucinate once their input is basically fed from other LLMs (or themselves).
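As a rough illustration of the idea (my own sketch, not an existing project): an endpoint that serves endless word salad with links back into itself, so a crawler that follows every link keeps eating generated junk. A real deployment would use a Markov chain trained on real text so the output looks plausible.

    package main

    import (
        "fmt"
        "math/rand"
        "net/http"
    )

    var words = []string{"synergy", "quantum", "artisanal", "holistic", "paradigm", "ferret", "moist", "bespoke"}

    // babble writes n pseudo-random words; a Markov generator would go here instead.
    func babble(w http.ResponseWriter, n int) {
        for i := 0; i < n; i++ {
            fmt.Fprintf(w, "%s ", words[rand.Intn(len(words))])
        }
    }

    func main() {
        http.HandleFunc("/maze/", func(w http.ResponseWriter, r *http.Request) {
            w.Header().Set("Content-Type", "text/html")
            fmt.Fprint(w, "<html><body><p>")
            babble(w, 300)
            fmt.Fprint(w, "</p>")
            // A handful of links deeper into the maze keeps link-following crawlers busy.
            for i := 0; i < 5; i++ {
                fmt.Fprintf(w, "<a href=\"/maze/%d\">more</a> ", rand.Int())
            }
            fmt.Fprint(w, "</body></html>")
        })
        http.ListenAndServe(":8080", nil)
    }

You'd only route suspected bots into /maze/ and keep it out of anything a real user or legitimate search crawler should see.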
Interesting idea. Seems to me it might be possible to use with a Monero mining challenge instead, for those low real traffic applications where most of the requests are sure to be bots.
Sure, but this was more about being downvoted. Gimp is a great project, and they clearly took a lot of inspiration from a commercial product (Photoshop) that is now offering the exact tools that I was asking about. I don't think my question was out of line, at all.
It means that I would have successfully caused them to waste their time. It would be pretty fun to see that happen.