Cheapskate's Guide
For the past six-and-a-half years, I have been actively running a personal web server from a residential Internet connection with a mere 0.75 Mbit/s upload bandwidth. My home server currently hosts over a dozen websites, including this one, which on rare occasions receives and successfully handles tens of thousands of web page requests a day. It also hosts the Blue Dwarf social media site with more than five hundred user accounts. As a result, I am very conscious of wasted upload bandwidth.
You may ask what kind of insanity has gripped me that I am not relying on a commercial web-hosting service for this. Well, I could list several reasons for preferring to use a server in my physical possession: the opportunity to tinker with hardware, an increased sense of self-sufficiency, all data being local, more privacy for visitors to my websites, easier backups, not relying on the whims of big tech, lower cost, my belief that a residential Internet connection is more difficult for a US government entity to shut down than one provided by a commercial hosting service, and a better opportunity to learn about system administration and all things "back-end". But, the bottom line is that I am doing this just because I want to, and that is a good enough reason.
From almost the beginning of this adventure, one of my major concerns has been finding ways of blocking unwanted traffic from web-scraping robots in order to preserve as much of my inadequate bandwidth as possible for the people who visit my websites. As a direct result of this effort, I have learned a tremendous amount about how the Internet works and who is scraping websites. For example, no one needed to tell me when AI bots began scraping the Internet disguised as the browsers of regular people. I saw it firsthand.
In addition to blocking hundreds of user agents, I am also blocking over 220 million of the total 4.3 billion IPv4 addresses on the Internet, or about 5%. Blocking IP addresses is not an ideal method of blocking web-scraping robots, but given my extreme dislike of captchas, it is a tool I use. I am working to provide legitimate visitors who have been blocked through no fault of their own with better methods of gaining access to my websites, and I think I have been making progress. But, what I want to talk about first is where most of the robots are coming from and what the reality of the situation is for hosters of "small Internet" sites like mine. Then, I will talk a little about what I am doing about it. If some readers are unfamiliar with the term "small Internet", other phrases for it are the Indie Web, the personal Web, and the old Internet. It is basically the space consisting of all personal websites on the Internet. Some people define the small Internet differently, but that is how I use the phrase in this article.
The situation in which those of us who self-host small Internet websites currently find ourselves seems rather bleak to many of those who have been paying attention. Multiple people have posted comments on this website and emailed me to tell me they hosted their own websites years ago but gave up because of the increasing numbers of web-scraping robots and rude users they were encountering. Now, the robot situation is even worse. Hordes of web-scraping robots are scouring the Internet for AI-training content as well as for the usual data gathered for other purposes, and most couldn't care less about the preferences we express in our robots.txt files. They scrape whatever they want, and they often do so in very stupid ways. Many scrape the same web pages several times a day. Sometimes, they seem to become confused and scrape the same pages thousands of times before we finally block them. Those who run the robots appear not to be concerned at all with the enormous amount of resources they are wasting. Corporate managers' standard solution to every problem seems to be to throw more computers and other types of resources at it until they have beaten it into submission. Perhaps this is merely a reflection of their general approach to managing their employees, but this monumentally wasteful approach is exactly what DeepSeek recently showed to be so wrong-headed. Unfortunately, short of being made a laughingstock, as DeepSeek has just done to every AI company in the United States, nothing seems to faze the people who run and manage these criminally wasteful companies.
This raises an important question. Big tech does everything it can to make the small Internet invisible to casual Internet users by pushing our websites very far down in its search engine rankings, labeling as "blog spam" almost every post on its social media platforms with links to our blog articles, and creating websites like ipqualityscore.com whose sole reason for existence seems to be declaring our sites to be too dangerous to visit. Yet, behind the scenes, they send their robots to scrape our websites. This seems a bit like a politician who is publicly tough on crime sneaking into houses of ill repute at night when no one is looking. Should we accept big tech's efforts to place a barrier between the commercial Web and the small Internet that is porous for them and impervious for us? Should we allow them to use and abuse our websites this way, or should we block them? Personally, I have no moral objection to blocking any organization engaged in any type of commerce from visiting my websites. Of course, that does not include those that provide services like search engines and VPNs that are useful to Internet users.
The reality of the situation for people like me who have very limited resources and don't trust or can't afford big-tech-provided solutions is that we have no good option other than finding innovative ways of blocking as many web-crawling robots as we can while inconveniencing the human readers of our blogs and users of our services as little as possible. The question then becomes not whether to block but what to block. We don't have to block 100% of the robots. We just have to block enough of them to claw back enough of our bandwidth to adequately serve our readers and users.
I have created a clear robot blocking policy for the readers of the Cheapskate's Guide and posted a link to it on the home page. Basically, it states that anyone scraping the website with a robot may expect to be blocked, and they may be blocked permanently. I make allowances for RSS feed readers that are well behaved, but those that are not may expect to be temporarily or permanently blocked. Why such a draconian policy, you may ask? Because I am very serious about preventing wasted bandwidth. Average Internet users have been conditioned over decades never to feel the cost of their wasteful use of the Internet, so many no longer understand that, whether the money comes out of their own pockets or not, someone is paying. When an Internet user sets up his RSS feed reader to download an RSS feed file every two minutes from a personal website that may only post new content every couple of weeks, that is about 720 downloads a day, or roughly ten thousand downloads for every new article, when a single one would have been enough to notify him that new content is present. Many simply do not understand or perhaps do not care that they are being wasteful, so my best option is to block them until they do.
Over the years, I have noticed that most robots are coming to my websites from commercial web-hosting companies like Digital Ocean, Hetzner, Microsoft, AWS, 3xK Tech, LeaseWeb, Datacamp, and Linode. Human beings generally come from residential IP addresses owned by organizations like Comcast, AT&T, and other telecommunications companies. When AI appeared on the scene in a big way last year and increased the traffic to my website dramatically, I decided that I had had enough. I began blocking the IP addresses of every commercial web-hosting company that appeared in my log files. Not only that, but I used the "whois" database to look up the block of IP addresses corresponding to every commercial hosting company's IP address that appeared in my log files and blocked that entire block as well. One person who commented on my approach called this "the nuclear option", and I think that name is appropriate. I would prefer not to use "nuclear weapons" against commercial web-hosting companies, but their policy of tolerating web-scraping robots on their networks has left me no choice but all-out nuclear war. This is a war I intend to win, and I am largely winning it. The best evidence I have of that is that the number of new robots I see every day has dropped to a manageable level. These days, I only have to add a handful of new IP address blocks to my block list every day. Maybe one day, enough other website owners will be doing the same thing that the web-hosting companies will decide to change their policies. But, if they don't, they don't.
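For readers who want to try the same thing, the mechanics are simple, so here is a rough sketch. The address ranges below are documentation-only examples and blocklist.conf is a placeholder name, not my actual block list. A whois lookup on an offending address usually reports the owning network's entire range, and that range goes into a file of Nginx deny directives:

# Sketch: turning a whois lookup into Nginx deny rules.
# First find the range that owns the offending address, e.g.:
#   whois 203.0.113.45 | grep -i -E 'CIDR|inetnum|route'
# Then add the whole range to a file of deny directives,
# for example /etc/nginx/blocklist.conf:
deny 203.0.113.0/24;    # example range only
deny 198.51.100.0/24;   # example range only
# And include that file from the server block in nginx.conf:
#   include /etc/nginx/blocklist.conf;

After editing the file, running "sudo nginx -t" followed by "sudo nginx -s reload" puts the new rules into effect without taking the server down.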
Blocking over 220 million IP addresses of course catches some non-combatants who are innocent users of VPNs, Tor exit nodes, and other proxy services that are also hosted by web-hosting companies. It also catches the servers of some Fediverse instances. Recently, I have begun going through my daily log files looking for user agents that identify Mastodon, Lemmy, Akkoma, and other Fediverse applications so that I can add their IP addresses to my whitelist. This means they will be able to reach my website even if all the other IP addresses of the web-hosting companies they are using are blocked.
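One way to automate that log search is sketched below in PHP. This is not the exact procedure I follow; the log path is an example, the output still needs a human eye before it is pasted into the whitelist, and the script assumes the default Nginx log format in which the client IP address is the first field on each line.

<?php
// Sketch: collect the IP addresses of visitors whose user agents
// identify Fediverse software and print them as Nginx "allow" lines
// to be reviewed and added to the whitelist by hand.
$lines = file("/var/log/nginx/access.log",
              FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$ips = array();
foreach($lines as $line)
{
    // In the default log format the client IP is the first field.
    if(preg_match("/Mastodon|Lemmy|Akkoma/i", $line))
    {$ips[strtok($line, " ")] = true;}
}
foreach(array_keys($ips) as $ip)
{echo "allow " . $ip . ";  # Fediverse instance" . PHP_EOL;}
?>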
Many people are surprised that I am able to block such a large number of IP addresses using only the built-in blocking capability of Nginx on a small Raspberry Pi 3B server. All I can say is that apparently Nginx is far more efficient than they know, because I am doing just that. However, Nginx could use some improvements. For example, it provides no built-in way of whitelisting user agents. Being able to whitelist Googlebot, Bingbot, Mastodon, Lemmy, and various RSS readers would be very helpful and would ensure that more innocent visitors to my website would make it through my Nginx blockade without my having to identify them individually. Although the Internet is silent about how to create a whitelist of user agents for Nginx, with the aid of some knowledgeable users on Blue Dwarf, I have some ideas to experiment with.
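One of those ideas, given here purely as a sketch and not as the configuration I actually run, is to move the block list out of bare deny directives and into Nginx's geo module, classify the user agent with a map, and then let a second map decide whether the combination deserves a 403. The file name and agent list below are placeholders.

# Sketch only: user-agent whitelisting with geo + map (in the http block).
geo $blocked_ip {
    default 0;
    include /etc/nginx/blocklist_geo.conf;   # lines like "203.0.113.0/24 1;"
}
map $http_user_agent $trusted_agent {
    default 0;
    "~*(Googlebot|bingbot|Mastodon|Lemmy)" 1;
}
map "$blocked_ip:$trusted_agent" $deny_request {
    default 0;
    "1:0" 1;   # blocked network and not a whitelisted agent
}
# In the server block:
#   if ($deny_request) { return 403; }

The obvious drawback is that this whitelist hinges on a user agent string, which any robot can fake, so it could only ever supplement, never replace, the IP address whitelist.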
At this point, I am sure many readers are screaming, "Why aren't you using block lists from Spamhaus or those included with OPNSense or other third-party router software? Or, why aren't you using Cloudflare?" The answer is that I avoid centralized solutions whenever possible. Part of the reason is that I just don't trust them, and part of it is that I don't want to become dependent on them. Many of the benefits of going to all the trouble of hosting a website from a home server are lost if one just ends up relying on (and perhaps paying for) a boatload of centralized services. I am serious enough about not being roped into some corporate ecosystem that I do everything I can to avoid it, including writing as much of the software that runs my websites as I can. I wrote all the HTML and PHP code that runs Blue Dwarf, and the only dependencies in my PHP code are PHP itself, the Bcrypt functions in OpenSSL, and Linux. If I could write my own operating system and web-server software, I would.
Many "tech bros" who are reading this are probably laughing and jeering. Let them. They don't understand because they have never been in my position and have never made an effort to find an inexpensive local solution to the robot problem. Many of them are probably part of the reason for the robot problem and therefore hardly unbiased. They assume that if they haven't heard of a solution to the problem, it doesn't exist. I have news for them, small website owners don't have to rely on Cloudflare's centralized captcha solution that annoys website visitors and social media users to no end and makes all of them subject to possibly passing all of their data to a big tech company that may one day turn evil. We can invent decentralized solutions that work. I have found one that isn't perfect yet, but as far as I am concerned, it is already an improvement over the centralized options available, and I am still improving it.
I am sure the tech bros will be shouting, "But a robot can be programmed to pass any test you can invent!" Yes, a robot can be programmed by a human being to pass the test I use to block robots (which appears at the bottom of this page below the comment form). But, how many people do you think will write that code into their robots just to pass a unique test on one small website out of six hundred million? If you are not smart enough to answer that question for yourself, tech bros, I will give you the answer. You can count them on the fingers of one hand. I have been using the same robot test for over five years on cheapskatesguide, and in that time only two robots have passed it. I have also been using this test for nearly three years on Blue Dwarf, and in that time only one robot has passed the test. During the same time, I have also had feedback forms on some of my other websites that have no robot tests, and many of them receive multiple advertisements a day from robots. So unlike the tech bros who have never tried this, I don't have to guess whether my simple robot test works. I know it works.
What this means is that small website owners do not need to rely on the same captcha solution that big tech has invented to apply to every website on the Internet. Nor do we have to spend millions of dollars to come up with another solution. All we have to do is come up with a simple, unique test like mine that requires a little human thought to answer, and we will have a virtually impenetrable robot blocker without having to pass all the traffic from our websites through Cloudflare, and without having to pay a penny for it. However, I only use my robot test to prevent robots from adding comments to cheapskatesguide articles and creating user accounts on Blue Dwarf. I don't inconvenience every visitor to my websites with this. For that, I need a robot blocking solution that only makes itself known when it is actually needed.
So far, I have tried two approaches for testing visitors who are coming from IP addresses I block. Remember that I am only blocking 5% of IPv4 addresses on the Internet, so 95% of visitors will not be tested at all. For the remaining 5%, I began a couple of years ago by presenting a custom 403 error message with my email address. Those who wanted to read my articles enough to send me their IP address and ask to be unblocked were accommodated. A problem with this approach was that many readers use VPNs and other proxies that change their IP addresses virtually every time they are used. For that reason, and because I believe in protecting every Internet user's privacy as much as possible, I wanted a way of immediately unblocking visitors to my website without them having to reveal personal information like names and email addresses.
I recently spent a few weeks on a new idea for solving this problem. With some help from two knowledgeable users on Blue Dwarf, I came up with a workable approach two weeks ago. So far, it looks like it works well enough. To summarize this method, when a blocked visitor reaches my custom 403 error page, he is asked whether he would like to be unblocked by having his IP address added to the website's whitelist. If he follows that hypertext link, he is sent to the robot test page. If he answers the robot test question correctly, his IP address is automatically added to the whitelist. He doesn't need to enter it or even know what it is. If he fails the test, he is told to click on the back button in his browser and try again. After he has passed the robot test, Nginx is commanded to reload its configuration file (PHP command: shell_exec("sudo nginx -s reload");), which causes it to immediately accept the new whitelist entry, and he is granted access. He is then allowed to visit cheapskatesguide as often as he likes for as long as he continues to use the same IP address. If he switches IP addresses in the future, he has about a one in twenty chance of needing to pass the robot test again each time. My hope is that visitors who use proxies will only have to pass the test a few times a year. As the whitelist grows, I suppose that frequency may decrease. Of course, it will reach a non-zero equilibrium point that depends on the churn in the IP addresses being used by commercial web-hosting companies. In a few years, I may have a better idea of where that equilibrium point is.
For the benefit of readers who may want to experiment with my blocking approach, here is my PHP script, add2whitelist.php, which checks whether a blocked visitor to my website has passed the robot test and, if so, adds his IP address to my whitelist and unblocks him:
<?php
// These cache-control headers should work for all browsers, because
// Cache-Control is part of the HTTP/1.1 standard.
header('Cache-Control: no-cache, no-store, must-revalidate');
session_cache_limiter('nocache'); // PHP sessions send no-cache headers too
header('Pragma: no-cache'); // HTTP/1.0 fallback
header('Expires: 0');
// You must add this to your Nginx config file:
//
//
// # Custom 403 Error Handling
// error_page 403 /403-error-handler_test.html;
// location = /403-error-handler_test.html {
// allow all;
// #internal;
// }
//
// # Works with Custom 403 Error Handling.
// # get_access.html contains the HTML form that displays
// # the robot test to the website visitor and calls this
// # script.
// location = /get_access.html {
// allow all;
// #internal;
// }
//
//
// # Allow access to add2whitelist.php to blocked users!
// # Use your path for <<local path>>!!!
// location = <<local path>>/add2whitelist.php {
// allow all;
// satisfy any;
// include snippets/fastcgi-php.conf;
// fastcgi_pass unix:/run/php/php8.2-fpm.sock;
// }
//
// # Use your path for <<full path>>!!!
// include <<full path>>/whitelist.conf;
//
//
// You must also create an empty file (with a blank line) and
// make it readable and writeable by www-data:
// # Use your path for <<local path>>!!!
// <<local path>>/whitelist.conf
//
//
// You must also have the custom 403 error page named in the error_page
// directive above and get_access.html in the root directory of the
// website.
if($_SERVER["REQUEST_METHOD"] == "POST")
{
$bot_test = "";
$bot_test = validate(5,$_POST["bot_test"]);
//Test whether the robot test question was answered correctly. The
//expected answer is today's day of the month plus eight.
$day_of_month = (int)date("j");
$correct_answer = $day_of_month + 8;
//Quotes around "$correct_answer" turn it into a string, so the strict
//comparison with the POSTed value works.
if($bot_test === "$correct_answer")
{
$ipaddress = $_SERVER["REMOTE_ADDR"];
$filename = "../php/whitelist.conf";
if(!file_exists($filename))
{echo "<br>Whitelist file does not exist."; exit();}
$file = fopen($filename, "a+")
or die("Could not open whitelist file.");
$date = date("M dS Y");
// Append the visitor's IP address (from REMOTE_ADDR) to the whitelist.
fwrite($file,
"allow " . $ipaddress . "; #User added" . " " . $date . PHP_EOL);
fclose($file);
// The next line only works if you add to the
// /etc/sudoers.d/010_pi-nopasswd file:
//
// www-data ALL=(ALL) NOPASSWD: /sbin/nginx -s reload
// www-data ALL=(ALL) NOPASSWD: /sbin/nginx -t
//
// Also note that www-data does not need to be and should not
// be in the sudo group!!!
//
// Check the syntax of the Nginx config file, and if correct,
// reload it.
$text = shell_exec("sudo nginx -t 2>&1");
if(preg_match("/syntax is ok/", $text))
{
shell_exec("sudo nginx -s reload");
header("location:" . "http://cheapskatesguide.org"); exit();
}
else
{echo "<br>Whitelist file has wrong syntax.";}
}
else
{
echo "<br>Wrong answer. Hit the back button in your browser and ";
echo "try again.";
exit();
}
}
else
{
echo "<br>Go to <a href='http://cheapskatesguide.org'>Cheapskate's Guide's main page</a>.";
}
// Validate inputs:
function validate($num_chars, $var)
{
// This function removes any scripts that a hacker may have inserted
// into the robot test HTML form. I prefer to keep it private for
// obvious reasons. Use any method of user input sanitization
// you prefer.
}
?>
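For completeness, the script above expects a POST field named bot_test coming from get_access.html. I am not reproducing my actual page here, but a minimal sketch of what such a form could look like follows. The question wording is only an example, the form action uses the same <<local path>> placeholder as the Nginx configuration comments above, and the maxlength of 5 matches the validate(5, ...) call in the script.

<!-- Minimal sketch of a get_access.html robot test form; not the
     exact page used on cheapskatesguide.org. -->
<!DOCTYPE html>
<html>
<head><title>Request Access</title></head>
<body>
<p>To have your IP address added to the whitelist, answer this
question: what is today's day of the month plus eight?</p>
<!-- Use your path for <<local path>>!!! -->
<form action="<<local path>>/add2whitelist.php" method="post">
<input type="text" name="bot_test" maxlength="5">
<input type="submit" value="Submit">
</form>
</body>
</html>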
Multiple differences exist between a centralized corporate-generated captcha and my solution. First, with my method, no one's data is being sent to a corporation to do who-knows-what with. Next, Internet users who come to my website from residential IP addresses should never be interrogated by this code. My hope is that those who do see my robot test will see it only a few times a year, if that, but that is something we will learn with experience. Last, it is a solution I created to serve a need, not to generate income for shareholders. I have no shareholders. The purpose of Cheapskate's Guide is not to generate income. It is to educate readers.
Although I don't expect the tech bros to receive much enlightenment from this article, I hope those few readers who want to run their own websites without resorting to centralized solutions will understand the implications of what they have read here. Small websites are fully capable of blocking web-scraping robots without relying on or paying for centralized solutions. All it takes is some ingenuity, tenacity, experience with Nginx, Apache, or one of the other web servers, and a minimal amount of programming knowledge. Furthermore, regardless of corporate and governmental efforts to manipulate and silence us on the big social media websites, we can still create our own spaces on the Internet where we can speak freely, and those spaces can accommodate tens or hundreds of thousands of people each. With very efficient software, they may even be able to accommodate millions each. In almost every case, other than needing a basic residential Internet connection, we can do all of this without asking big tech for any help at all. Many instances on the Fediverse that are run by individuals are shining examples of this. My hope is that more of us will begin to realize that we have much more freedom on the Internet than big tech wants us to believe. If Facebook, Reddit, Twitter, and others insist on treating us like Internet cattle, we can abandon their sites and not only create our own, but also create much better online homes for those who are not yet technically sophisticated enough to create their own.
We have had many interesting discussions on Blue Dwarf about creating our own online spaces. If you are interested in self-hosting, personal websites, online privacy, or blogging in general, you are always welcome to join us. Blue Dwarf is a small social media platform that I run personally. It is a free-speech platform that encourages a wide variety of topics and bars none, but we generally tend to lean to the technical side. We have a good community with people of all levels of technical experience, and we welcome tech newbies along with everyone else.
If you have found this article worthwhile, please share it on your favorite social media. You will find sharing links at the top of the page.