
Probably, but if they do that, then that means that I win because I forced them to make changes to their configuration and code.

It means that I would have successfully caused them to waste their time. It would be pretty fun to see that happen.


It breaks websites when you do that.

It's based on depictions of Anubis in pop media: a jackal who weighs your soul, and if it's heavier than a feather, you die. Figured it was a good vibe for a bot filter project.

The new rendition of Anubis is going to be aggressively Canadian. Be prepared.


> aggressively Canadian

Thoth as Canada goose might be worth the outright name change here.


It's a situation where it's difficult to tell for individual requests at request handling time, but easy to see when you look at the total request volume.

Hi! I do this! See https://github.com/TecharoHQ/anubis for more info!

I hope lots of websites adopt this, mainly because I want to see more happy jackal girls while browsing.

My monetization strategy is unironically to offer a de-anime'd version under the name Techaro BotStopper or something.

It's a clever (and hilarious) strategy that will probably sell at least a few licenses. As an anime hater I'd be motivated by this.

Just change the pictures in cmd/anubis/static/img/ to whatever you prefer, I think.

Or, alternatively, you know, pay the author for the work they've done

Sure, if you're going to deploy it on your company site, but I think if you're running a personal website and want to throttle LLM crawlers without falsely advertising that you're a furry, you could just go and modify this piece of MIT-licensed software.

Or you could pay for BotStopper and have RPM/DEB packages too.

Or, alternatively, just embrace the anime-porn content spectrum. I mean, just compare platforms that are free of it with ones that are chock full of it, and see which ones die and which grow.

Does the PoW make money via crypto mining? Or is it just to waste the caller's CPU cycles? If you could monetize the PoW then you could re-challenge at an interval tuned so that the caller pays for their usage.

It's to waste CPU cycles. I don't want to touch cryptocurrency with a 20-foot pole. I realize I'm leaving money on the table by doing this, but I don't want to alienate the kinds of communities I want to protect.

By doing PoW as a side effect of something you need to do anyway for other reasons, you actually make mining less profitable for other miners, which helps eliminate waste.

This is an aspect that a lot of PoW haters miss. While PoW is a waste, there are long-term economic incentives to minimize it, either by making it a side effect of something actually useful or by using energy that would go to waste anyway, making its overall effect gravitate toward neutral.

Unfortunately, such second-order effects are hard to explain to most people.


I always felt like crypto is nothing but speculating on value with no other good uses, but there is a kind of motivation here.

Say a hash challenge gets widely adopted, and scraping becomes more costly, maybe even requires GPUs. This is great, you can declare victory.

But what if after a while the scraping companies, with more resources than real users, are better able to solve the hash?

Crypto appeals here because you could make the scrapers cover the cost of serving their page.

Ofc if you’re leery of crypto you could try to find something else for bots to do. xkcd 810 for example. Or folding at home or something. But something to make the bot traffic productive, because if it’s just a hardware capability check maybe the scrapers get better hardware than the real users. Or not, no clue idk


thank you for your contribution to society!

I don't think it's possible to develop a frontier model without mass scraping. The economics simply don't add up. You need at least 10 trillion tokens to make an 8 billion parameter model. 10 trillion tokens is something like 40 terabytes (at roughly 4 bytes of text per token).

You simply can't get 40 terabytes of text without mass scraping.


> You need at least 10 trillion tokens to make an 8 billion parameter model

Are you sure it is not just very inefficient?


I'm not an AI expert, but it seems to me that the common consensus is that current LLMs are quite inefficient and that there's room for improvement.

AI expert here. It's probably for collecting training data and the crawlers are probably very unsupervised. I'd guess that they're literally the most simplistic crawler code you can imagine combined with parallelism across machines.

The good news is that it's easy to disrupt these crawlers with some simple hacks. Tactical reverse slowloris is probably gonna make a comeback.
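
To sketch the kind of thing I mean (a hypothetical tarpit handler, not something that ships with Anubis): the server dribbles the response out a byte at a time, so a misbehaving crawler ties up its own connection for minutes.

  // Hypothetical "reverse slowloris" tarpit: dribble the response out one
  // byte at a time so an aggressive crawler holds its own connection open.
  package main

  import (
  	"net/http"
  	"time"
  )

  func tarpit(w http.ResponseWriter, r *http.Request) {
  	flusher, ok := w.(http.Flusher)
  	if !ok {
  		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
  		return
  	}
  	w.Header().Set("Content-Type", "text/html")
  	for i := 0; i < 600; i++ { // ~10 minutes at one byte per second
  		if _, err := w.Write([]byte(" ")); err != nil {
  			return // the client gave up; stop wasting our own cycles
  		}
  		flusher.Flush()
  		time.Sleep(time.Second)
  	}
  }

  func main() {
  	http.HandleFunc("/trap", tarpit)
  	http.ListenAndServe(":8080", nil)
  }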


If it's for training data, why are they straining FOSS so much? Are there thousands of actors repeatedly making training data all the time? I thought it was a sort of one-off thing w/ the big tech players.

Git forges are some of the worst case for this. The scrapers click on every link on every page. If you do this to a git forge, it gets very O(scary) very fast because you have to look at data that is not frequently looked at and will NOT be cached. Most of how git forges are fast is through caching.

The thing about AI scrapers is that they don't just do this once. They do this every day in case every file in a glibc commit from 15 years ago changed. It's absolutely maddening and I don't know why AI companies do this, but if this is not handled then the git forge falls over and nobody can use it.

Anubis is a solution that should not have to exist, but the problem it solves is catastrophically bad, so it needs to exist.


Zipbombing rogue AIs will be the new hotness.

Unironically I tried that. Limited success.

It's really surreal to see my project in the preview image like this. That's wild! If you want to try it: https://github.com/TecharoHQ/anubis. So far I've noticed that it seems to actually work. I just deployed it to xeiaso.net as a way to see how it fails in prod for my blog.

I really like this. I don't mind the Internet acting like the Wild Wild West, but I do mind that there's no accountability. This is a nice way to pass the economic burden to the crawlers for sites that still want to stay freely available. You want the data? Spend money on your side to get it. Even though the downside is that your site could be delisted from search engines, there's no reason why you cannot register your service in a global or p2p indexer.

"why you cannot register your service in a global or p2p indexer"

Network effects, anyone? So yes, we should again work on a different way of indexing the web than via Google, but that's easier said than done, I think.


Nice work :)

One piece of feedback: Could you add some explanation (for humans) of what we're supposed to do and what is happening when met by that page?

I know there is a loading animation widget thingy, but the first time I saw that page (some weeks ago at the GNOME issue tracker), it was proof-of-work'ing for like 20 seconds, and I wasn't sure what was going on; I initially thought I got blocked or that the captcha failed to load.

Of course, now I understand what it is, but I'm not sure it's 100% clear when you just see the "checking if you're a bot" page in isolation.


> One piece of feedback: Could you add some explanation (for humans) of what we're supposed to do and what is happening when met by that page?

Will do! https://github.com/TecharoHQ/anubis/issues/25


Also, if you're using JShelter, which blocks Workers by default, there is no indication that it's never going to work, and the spinner just goes on forever doing nothing.

Noted! I filed a bug: https://github.com/TecharoHQ/anubis/issues/38

All of this is placeholder wording, layouts, CSS, and more. It'll be fixed in time. This is teething pain that I will get through.


Maybe a progress bar?

There's no way to really make a progress bar make sense; it's a luck-based mechanic.

So, just like the Windows copy dialog. Progress bar it is.

Maybe one of those (slightly misleading) progress bars that have a dynamic speed that gets slower and slower the closer it gets to the finish? Just to indicate that it's working towards something.

More, easier proofs of work, and the law of large numbers will do the rest.


That's multiplying the work the server has to do by a large number so it can show a nicer progress bar.

Seems very counter to the purpose.


It'll be somewhat involved, but based on the difficulty vs. the client's hashing speed, you could say something probabilistic like "90% of the time, this window will be gone in xyz seconds from now"?

Yeah, I have to get the data for that though! I'm gonna add that to the list.
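
For a rough sense of the math in the meantime (every number below is an assumption for illustration; the real hash rate would have to be measured per client):

  // Rough math for "90% of the time this window will be gone in X seconds":
  // with a difficulty of d leading zero hex digits, each hash succeeds with
  // probability p = 16^-d, so P(still running after k hashes) ~ e^(-p*k)
  // and the 90th-percentile hash count is ln(10)/p.
  package main

  import (
  	"fmt"
  	"math"
  )

  func main() {
  	difficulty := 4.0           // assumed: leading zero hex digits required
  	hashesPerSecond := 200000.0 // assumed: client hash rate, measure in practice
  	p := math.Pow(16, -difficulty)
  	k90 := math.Log(10) / p // hashes needed to finish with 90% probability
  	fmt.Printf("90%% of clients should be done within %.1f seconds\n", k90/hashesPerSecond)
  }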

Anubis is only going to work as long as it doesn't get famous; if that happens, crawlers will start using GPUs/ASICs for the proof of work and it's game over.

The entire reason bots are so aggressive is that they are cheap to run.

If a GPU were required per scrape, then >90% simply couldn't afford it at scale.


Author of Anubis here. If that happens, I win.

If that happens, count on me to use Anubis to factor large numbers or whatever science needs as a background task.

Actually, that is not a bad idea. @xena maybe Anubis v2 could make the client participate in some sort of SETI@HOME project, creating the biggest distributed cluster ever created :-D

Oh come now, clearly Anubis should make the clients mine bitcoin as proof of work, with a split for the website and the author.

Oh dear, somebody is going to implement this in about an hour, aren't they....


Just in case you didn't know, cryptominers in JavaScript are already a thing. Firefox even blocks them.

A service that allows you to expose and host your data in a private manner, getting a cut from whatever tokens your endpoints have generated.

Loving it, great work as always.

Also

> https://news.ycombinator.com/item?id=43422781

Integrate a way to calculate micro-amounts of the shitcoin of your choice and we might have another actually legitimately useful application of cryptocurrencies on our hands..!


Maybe I'm missing something, but doesn't this mean the work has to be done by the client AND the server every time a challenge is issued? I think ideally you'd want work that was easy for the server and difficult for the client. And what is to stop the server from being DDoS'd by clients that are challenged but neglect to perform the challenge?

Regardless, I think something like this is the way forward if one doesn't want to throw privacy entirely out the window.

client


The magic of proof of work is that it's something that's really hard to do but easy to validate. Anubis' proof of work works like this:

A sha256 hash is a bunch of bytes like this:

  394d1cc82924c2368d4e34fa450c6b30d5d02f8ae4bb6310e2296593008ff89f
We usually write it out in hex form, but that's literally what the bytes in RAM look like. In a proof of work validation system, you take some base value (the "challenge") and a rapidly incrementing number (the "nonce"), so the thing you end up hashing is this:

  await sha256(`${challenge}${nonce}`);
The "difficulty" is how many leading zeroes the generated hash needs to have. When a client requests to pass the challenge, they include the nonce they used. The server then only has to do one sha256 operation: the one that confirms that the challenge (generated from request metadata) and the nonce (provided by the client) match the difficulty number of leading zeroes.

The other trick is that presenting the challenge page is super cheap. I wrote that page with templ (https://templ.guide) so it compiles to native Go. This makes it as optimized as Go is, modulo things like variable replacement. If this becomes a problem I plan to prerender things as much as possible. Rendering the challenge page from binary code or RAM is always always always going to be so much cheaper than your webapp ever will be.

I'm planning on adding things like changing out the hash in use, but right now sha256 is the best option because most CPUs in active deployment have instructions to accelerate sha256 hashing. This combined with webcrypto jumping to heavily optimized C++ and the JIT in JS being shockingly good means that this super naïve approach is probably the most efficient way to do things right now.

I'm shocked that this all works so well and I'm so glad to see it take off like it has.


I am sorry if this question is dumb, but how does proof of work deter bots/scrapers from accessing a website?

I imagine it costs more resources to access the protected website, but would this stop the bots? Wouldn't they be able to pass the challenge and scrape the data after? Or do normal scrape bots usually time out after a small amount of time/resources is used?


Put simply, most bots just aren't designed to solve such challenges.

> I think ideally you'd want work that was easy for the server and difficult for the client.

That's exactly how it works (easy for the server, hard for the client). Once the client has completed the Proof-of-Work challenge, the server doesn't need to complete the same challenge; it only needs to validate that the result checks out.

Similar to Proof-of-Work blockchains, where coming up with the block hashes is difficult but validating them isn't nearly as compute-intensive.

This asymmetric computation requirement is probably the most fundamental property of Proof-of-Work, Wikipedia has more details if you're curious: https://en.wikipedia.org/wiki/Proof_of_work
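
To make the asymmetry concrete, here's a sketch of the expensive half (shown in Go for brevity; in Anubis the equivalent runs in the browser via JavaScript and WebCrypto):

  // The client brute-forces nonces until the hash meets the difficulty;
  // the server later checks the winning nonce with a single hash.
  package main

  import (
  	"crypto/sha256"
  	"encoding/hex"
  	"fmt"
  	"strconv"
  	"strings"
  )

  func solve(challenge string, difficulty int) int {
  	prefix := strings.Repeat("0", difficulty)
  	for nonce := 0; ; nonce++ {
  		sum := sha256.Sum256([]byte(challenge + strconv.Itoa(nonce)))
  		if strings.HasPrefix(hex.EncodeToString(sum[:]), prefix) {
  			return nonce
  		}
  	}
  }

  func main() {
  	fmt.Println(solve("example-challenge", 4)) // many hashes to find, one to check
  }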

Fun fact: it seems Proof-of-Work was used as a DoS-prevention technique before it was used in Bitcoin/blockchains, so it seems we've gone full circle :)


I think going full circle would be something like Bitcoin being created on top of DoS prevention software and then eventually DoS prevention starting to use Bitcoin. A tool being used for something, then something else, then the first something again is just... nothing? Happens all the time?

Could you add an option for non-JS users? Maybe a Linux command line whose output we can paste into a form.

The AI anime girl has 6 fingers, btw; combating AI bots with AI girls.

Edit: I will probably send a pull request to fix it.


I'm commissioning an artist to make better assets. These are the placeholders that I used with the original rageware implementation. I never thought it would take off like this!

I love that I seem to stumble upon something by you randomly every so often. I'd just like to say that I enjoy your approach to explanations in blog form and will look further into Anubis!

That's what I've been doing! It works shockingly well. https://github.com/TecharoHQ/anubis

Unless I am missing something, the result of that generated work has no monetary value though.

I was inspired by https://en.wikipedia.org/wiki/Hashcash, which was proof of work for email to disincentivize spam. To my horror, it worked sufficiently well for my git server, so I released it as open source. It's now its own project and protects big sites like GNOME's GitLab.

That's cool! What if instead of sha256 you used one of those memory-hard functions like scrypt? Or is sha needed because it has a native impl in browsers?

Right now I'm using SHA-256 because this project was originally written as a vibe sesh rage against the machine. The second reason is that the combination of Chrome/Firefox/Safari's JIT and webcrypto being native C++ is probably faster than what I could write myself. Amusingly, supporting this means it works on very old/anemic PCs like PowerMac G5 (which doesn't support WebAssembly because it's big-endian).

I'm gonna do experiments with xeiaso.net as the main testing ground.


The monetary value is not having a misbehaving AI bot download 73TB or whatever of your data.

I'm curious if the PoW component is really necessary. AIUI untargeted crawlers are usually curl wrappers which don't run JavaScript, so requiring even a trivial amount of JS would defeat them. Unless AI companies are so flush with cash that they can afford to just use headless Chrome for everything, efficiency be damned.

Sadly, in testing the proof of work is needed. The scrapers run JS because if you don't run JS the modern web is broken. Anubis is tactically designed to make them use modern versions of Firefox/Chrome at least.

They really do use headless Chrome for everything. My testing has shown a lot of them are on Digital Ocean. I have a list of IP addresses in case someone from there is reading this and can have a come-to-Jesus conversation with those AI companies.


these companies have more compute than everyone else in the world put together

a proof of work function will end up selecting FOR them!


It'll still keep your site from getting hammered

until some drone working for the parasites (google/facebook/openai) sees this post and writes 5 lines of code to defeat it

and now you have an experience where the bots have an easier time accessing your content than legitimate visitors


What would those 5 lines of code look like? The basis of this solution is that it asks the client to solve a computationally intensive problem whose solution, once provided, isn't computationally intensive to check. How would those 5 lines of code change this?

nice try, Google employee

Lol, such a childish excuse to not answer.

Use judo techniques. Use their own computing power against them with fake links to random Markov-generated bullshit, until their cache gets poisoned past the point of no return; the LLMs begin to either forget their own stuff or hallucinate once their input is basically fed from other LLMs (or themselves).
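
As a toy sketch of the Markov idea (illustration only; actually poisoning a crawler's corpus is a lot more involved than this):

  // Build a tiny word-level Markov chain from seed text and emit endless
  // plausible-looking filler for crawlers to choke on.
  package main

  import (
  	"fmt"
  	"math/rand"
  	"strings"
  )

  func main() {
  	seed := "the crawler reads the page and the page links back to the crawler"
  	words := strings.Fields(seed)

  	// next[w] lists every word that follows w in the seed text.
  	next := map[string][]string{}
  	for i := 0; i < len(words)-1; i++ {
  		next[words[i]] = append(next[words[i]], words[i+1])
  	}

  	w := words[0]
  	for i := 0; i < 50; i++ {
  		fmt.Print(w, " ")
  		choices := next[w]
  		if len(choices) == 0 {
  			w = words[rand.Intn(len(words))] // dead end: restart anywhere
  			continue
  		}
  		w = choices[rand.Intn(len(choices))]
  	}
  	fmt.Println()
  }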

Interesting idea. Seems to me it might be possible to use this with a Monero mining challenge instead, for those low-real-traffic applications where most of the requests are sure to be bots.

It has a plugin API. Should be easy to implement should you want it.

I want it out of the box, no installation hassles.

You can get it out of the box, you'll just need to implement it first.

I mean, if you want to compete with commercial products then the least you can do is listen to your audience.

You should provide that feedback to the GIMP project: https://gitlab.gnome.org/GNOME/gimp/-/issues

GIMP was never about competing with a commercial product. It's about building a usable product that the developers like. If others want to use it, great.

Ok, but it's kind of strange if that means I can't even say what I would like to see in that project.

There is nothing stopping you from reporting that feedback to the GIMP team.

Sure, but this was more about being downvoted. Gimp is a great project, and they clearly took a lot of inspiration from a commercial product (Photoshop) that is now offering the exact tools that I was asking about. I don't think my question was out of line, at all.

And a pony!
