Cheapskate's Guide
For the past six-and-a-half years, I have been actively running a personal web server from a residential Internet connection with a mere 0.75 Mbit/s upload bandwidth. My home server currently hosts over a dozen websites, including this one, which on rare occasions receives and successfully handles tens of thousands of web page requests a day. It also hosts the Blue Dwarf social media site with more than five hundred user accounts. As a result, I am very conscious of wasted upload bandwidth.
You may ask what kind of insanity has gripped me that I am not relying on a commercial web-hosting service for this. Well, I could list several reasons for preferring to use a server in my physical possession: the opportunity to tinker with hardware, an increased sense of self-sufficiency, all data being local, more privacy for visitors to my websites, easier backups, not relying on the whims of big tech, lower cost, my belief that a residential Internet connection is more difficult for a US government entity to shut down than one provided by a commercial hosting service, and a better opportunity to learn about system administration and all things "back-end". But, the bottom line is that I am doing this just because I want to, and that is a good enough reason.
From almost the beginning of this adventure, one of my major concerns has been finding ways of blocking unwanted traffic from web-scraping robots in order to preserve as much of my inadequate bandwidth as possible for the people who visit my websites. As a direct result of this effort, I have learned a tremendous amount about how the Internet works and who is scraping websites. For example, no one needed to tell me when AI bots began scraping the Internet disguised as the browsers of regular people. I saw it firsthand.
In addition to blocking hundreds of user agents, I am also blocking over 220 million of the total 4.3 billion IPv4 addresses on the Internet, or about 5%. Blocking IP addresses is not an ideal method of blocking web-scraping robots, but given my extreme dislike of captchas, it is a tool I use. I am working to provide legitimate visitors who have been blocked through no fault of their own with better methods of gaining access to my websites, and I think I have been making progress. But, what I want to talk about first is where most of the robots are coming from and what the reality of the situation is for hosters of "small Internet" sites like mine. Then, I will talk a little about what I am doing about it. If some readers are unfamiliar with the term "small Internet", other phrases for it are the Indie Web, the personal Web, and the old Internet. It is basically the space consisting of all personal websites on the Internet. Some people define the small Internet differently, but that is how I use the phrase in this article.
The situation in which those of us who self-host small Internet websites currently find ourselves seems rather bleak to many of those who have been paying attention. Multiple people have posted comments on this website and emailed me to tell me they hosted their own websites years ago but gave up because of the increasing numbers of web-scraping robots and rude users they were encountering. Now, the robot situation is even worse. Hordes of web-scraping robots are scouring the Internet for AI-training content as well as for the usual data gathered for other purposes, and most couldn't care less about the preferences we express in our robots.txt files. They scrape whatever they want, and they often do so in very stupid ways. Many scrape the same web pages several times a day. Sometimes, they seem to become confused and scrape the same pages thousands of times before we finally block them. Those who run the robots appear not to be concerned at all with the enormous amount of resources they are wasting. Corporate managers' standard solution to every problem seems to be to throw more computers and other types of resources at it until they have beaten it into submission. Perhaps this is merely a reflection of their general approach to managing their employees, but this monumentally wasteful approach is exactly what DeepSeek recently showed to be so wrong-headed. Unfortunately, short of being made a laughingstock, as DeepSeek has just done to every AI company in the United States, nothing seems to faze the people who run and manage these criminally wasteful companies.
This raises an important question. Big tech does everything it can to make the small Internet invisible to casual Internet users by pushing our websites very far down in its search engine rankings, labeling as "blog spam" almost every post on its social media platforms with links to our blog articles, and creating websites like ipqualityscore.com whose sole reason for existence seems to be declaring our sites to be too dangerous to visit. Yet, behind the scenes, they send their robots to scrape our websites. This seems a bit like a politician who is publicly tough on crime sneaking into houses of ill repute at night when no one is looking. Should we accept big tech's efforts to place a barrier between the commercial Web and the small Internet that is porous for them and impervious for us? Should we allow them to use and abuse our websites this way, or should we block them? Personally, I have no moral objection to blocking any organization engaged in any type of commerce from visiting my websites. Of course, that does not include those that provide services like search engines and VPNs that are useful to Internet users.
The reality of the situation for people like me who have very limited resources and don't trust or can't afford big-tech-provided solutions is that we have no good option other than finding innovative ways of blocking as many web-crawling robots as we can while inconveniencing the human readers of our blogs and users of our services as little as possible. The question then becomes not whether to block but what to block. We don't have to block 100% of the robots. We just have to block enough of them to claw back enough of our bandwidth to adequately serve our readers and users.
I have created a clear robot blocking policy for the readers of the Cheapskate's Guide and posted a link to it on the home page. Basically, it states that anyone scraping the website with a robot may expect to be blocked, and they may be blocked permanently. I make allowances for RSS feed readers that are well behaved, but those that are not may expect to be temporarily or permanently blocked. Why such a draconian policy, you may ask? Because I am very serious about preventing wasted bandwidth. Average Internet users have been conditioned over decades never to feel the cost of their wasteful use of the Internet, so many no longer understand that, whether the money comes out of their own pockets or not, someone is paying. When an Internet user sets up his RSS feed reader to download an RSS feed file every two minutes from a personal website that may only post new content every couple of weeks, that is about 720 downloads a day, or roughly ten thousand downloads for every new article, when a single one would have been enough to notify him that new content is present. Many simply do not understand or perhaps do not care that they are being wasteful, so my best option is to block them until they do.
Over the years, I have noticed that most robots are coming to my websites from commercial web-hosting companies like Digital Ocean, Hetzner, Microsoft, AWS, 3xK Tech, LeaseWeb, Datacamp, and Linode. Human beings generally come from residential IP addresses owned by organizations like Comcast, AT&T, and other telecommunications companies. When AI appeared on the scene in a big way last year and increased the traffic to my website dramatically, I decided that I had had enough. I began blocking the IP addresses of every commercial web-hosting company that appeared in my log files. Not only that, but I used the "whois" database to look up the block of IP addresses corresponding to every commercial hosting company's IP address that appeared in my log files and blocked that entire block as well. One person who commented on my approach called this "the nuclear option", and I think that name is appropriate. I would prefer not to use "nuclear weapons" against commercial web-hosting companies, but their policy of tolerating web-scraping robots on their networks has left me no choice but all-out nuclear war. This is a war I intend to win, and I am largely winning it. The best evidence I have of that is that the number of new robots I see every day has dropped to a manageable level. These days, I only have to add a handful of new IP address blocks to my block list every day. Maybe one day, enough other website owners will be doing the same thing that the web-hosting companies will decide to change their policies. But, if they don't, they don't.
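For readers who want to try the same thing, the mechanics are simple, so here is a rough sketch. The address ranges below are documentation-only examples and blocklist.conf is a placeholder name, not my actual block list. A whois lookup on an offending address usually reports the owning network's entire range, and that range goes into a file of Nginx deny directives:

# Sketch: turning a whois lookup into Nginx deny rules.
# First find the range that owns the offending address, e.g.:
#   whois 203.0.113.45 | grep -i -E 'CIDR|inetnum|route'
# Then add the whole range to a file of deny directives,
# for example /etc/nginx/blocklist.conf:
deny 203.0.113.0/24;    # example range only
deny 198.51.100.0/24;   # example range only
# And include that file from the server block in nginx.conf:
#   include /etc/nginx/blocklist.conf;

After editing the file, running "sudo nginx -t" followed by "sudo nginx -s reload" puts the new rules into effect without taking the server down.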
Blocking over 220 million IP addresses of course catches some non-combatants who are innocent users of VPNs, Tor exit nodes, and other proxy services that are also hosted by web-hosting companies. It also catches the servers of some Fediverse instances. Recently, I have begun going through my daily log files looking for user agents that identify Mastodon, Lemmy, Akkoma, and other Fediverse applications so that I can add their IP addresses to my whitelist. This means they will be able to reach my website even if all the other IP addresses of the web-hosting companies they are using are blocked.
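One way to automate that log search is sketched below in PHP. This is not the exact procedure I follow; the log path is an example, the output still needs a human eye before it is pasted into the whitelist, and the script assumes the default Nginx log format in which the client IP address is the first field on each line.

<?php
// Sketch: collect the IP addresses of visitors whose user agents
// identify Fediverse software and print them as Nginx "allow" lines
// to be reviewed and added to the whitelist by hand.
$lines = file("/var/log/nginx/access.log",
              FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$ips = array();
foreach($lines as $line)
{
    // In the default log format the client IP is the first field.
    if(preg_match("/Mastodon|Lemmy|Akkoma/i", $line))
    {$ips[strtok($line, " ")] = true;}
}
foreach(array_keys($ips) as $ip)
{echo "allow " . $ip . ";  # Fediverse instance" . PHP_EOL;}
?>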
Many people are surprised that I am able to block such a large number of IP addresses using only the built-in blocking capability of Nginx on a small Raspberry Pi 3B server. All I can say is that apparently Nginx is far more efficient than they know, because I am doing just that. However, Nginx could use some improvements. For example, it provides no built-in way of whitelisting user agents. Being able to whitelist Googlebot, Bingbot, Mastodon, Lemmy, and various RSS readers would be very helpful and would ensure that more innocent visitors to my website would make it through my Nginx blockade without my having to identify them individually. Although the Internet is silent about how to create a whitelist of user agents for Nginx, with the aid of some knowledgeable users on Blue Dwarf, I have some ideas to experiment with.
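One of those ideas, given here purely as a sketch and not as the configuration I actually run, is to move the block list out of bare deny directives and into Nginx's geo module, classify the user agent with a map, and then let a second map decide whether the combination deserves a 403. The file name and agent list below are placeholders.

# Sketch only: user-agent whitelisting with geo + map (in the http block).
geo $blocked_ip {
    default 0;
    include /etc/nginx/blocklist_geo.conf;   # lines like "203.0.113.0/24 1;"
}
map $http_user_agent $trusted_agent {
    default 0;
    "~*(Googlebot|bingbot|Mastodon|Lemmy)" 1;
}
map "$blocked_ip:$trusted_agent" $deny_request {
    default 0;
    "1:0" 1;   # blocked network and not a whitelisted agent
}
# In the server block:
#   if ($deny_request) { return 403; }

The obvious drawback is that this whitelist hinges on a user agent string, which any robot can fake, so it could only ever supplement, never replace, the IP address whitelist.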
At this point, I am sure many readers are screaming, "Why aren't you using block lists from Spamhaus or those included with OPNSense or other third-party router software? Or, why aren't you using Cloudflare?" The answer is that I avoid centralized solutions whenever possible. Part of the reason is that I just don't trust them, and part of it is that I don't want to become dependent on them. Many of the benefits of going to all the trouble of hosting a website from a home server are lost if one just ends up relying on (and perhaps paying for) a boatload of centralized services. I am serious enough about not being roped into some corporate ecosystem that I do everything I can to avoid it, including writing as much of the software that runs my websites as I can. I wrote all the HTML and PHP code that runs Blue Dwarf, and the only dependencies in my PHP code are PHP itself, the Bcrypt functions in OpenSSL, and Linux. If I could write my own operating system and web-server software, I would.
Many "tech bros" who are reading this are probably laughing and jeering. Let them. They don't understand because they have never been in my position and have never made an effort to find an inexpensive local solution to the robot problem. Many of them are probably part of the reason for the robot problem and therefore hardly unbiased. They assume that if they haven't heard of a solution to the problem, it doesn't exist. I have news for them, small website owners don't have to rely on Cloudflare's centralized captcha solution that annoys website visitors and social media users to no end and makes all of them subject to possibly passing all of their data to a big tech company that may one day turn evil. We can invent decentralized solutions that work. I have found one that isn't perfect yet, but as far as I am concerned, it is already an improvement over the centralized options available, and I am still improving it.
I am sure the tech bros will be shouting, "But a robot can be programmed to pass any test you can invent!" Yes, a robot can be programmed by a human being to pass the test I use to block robots (which appears at the bottom of this page below the comment form). But, how many people do you think will write that code into their robots just to pass a unique test on one small website out of six hundred million? If you are not smart enough to answer that question for yourself, tech bros, I will give you the answer. You can count them on the fingers of one hand. I have been using the same robot test for over five years on cheapskatesguide, and in that time only two robots have passed it. I have also been using this test for nearly three years on Blue Dwarf, and in that time only one robot has passed the test. During the same time, I have also had feedback forms on some of my other websites that have no robot tests, and many of them receive multiple advertisements a day from robots. So unlike the tech bros who have never tried this, I don't have to guess whether my simple robot test works. I know it works.
What this means is that small website owners do not need to rely on the same captcha solution that big tech has invented to apply to every website on the Internet. Nor do we have to spend millions of dollars to come up with another solution. All we have to do is come up with a simple, unique test like mine that requires a little human thought to answer, and we will have a virtually impenetrable robot blocker without having to pass all the traffic from our websites through Cloudflare, and without having to pay a penny for it. However, I only use my robot test to prevent robots from adding comments to cheapskatesguide articles and creating user accounts on Blue Dwarf. I don't inconvenience every visitor to my websites with this. For that, I need a robot blocking solution that only makes itself known when it is actually needed.
So far, I have tried two approaches for testing visitors who are coming from IP addresses I block. Remember that I am only blocking 5% of IPv4 addresses on the Internet, so 95% of visitors will not be tested at all. For the remaining 5%, I began a couple of years ago by presenting a custom 403 error message with my email address. Those who wanted to read my articles enough to send me their IP address and ask to be unblocked were accommodated. A problem with this approach was that many readers use VPNs and other proxies that change their IP addresses virtually every time they are used. For that reason, and because I believe in protecting every Internet user's privacy as much as possible, I wanted a way of immediately unblocking visitors to my website without them having to reveal personal information like names and email addresses.
I recently spent a few weeks on a new idea for solving this problem. With some help from two knowledgeable users on Blue Dwarf, I came up with a workable approach two weeks ago. So far, it looks like it works well enough. To summarize this method, when a blocked visitor reaches my custom 403 error page, he is asked whether he would like to be unblocked by having his IP address added to the website's whitelist. If he follows that hypertext link, he is sent to the robot test page. If he answers the robot test question correctly, his IP address is automatically added to the whitelist. He doesn't need to enter it or even know what it is. If he fails the test, he is told to click on the back button in his browser and try again. After he has passed the robot test, Nginx is commanded to reload its configuration file (PHP command: shell_exec("sudo nginx -s reload");), which causes it to immediately accept the new whitelist entry, and he is granted access. He is then allowed to visit cheapskatesguide as often as he likes for as long as he continues to use the same IP address. If he switches IP addresses in the future, he has about a one in twenty chance of needing to pass the robot test again each time. My hope is that visitors who use proxies will only have to pass the test a few times a year. As the whitelist grows, I suppose that frequency may decrease. Of course, it will reach a non-zero equilibrium point that depends on the churn in the IP addresses being used by commercial web-hosting companies. In a few years, I may have a better idea of where that equilibrium point is.
For the benefit of readers who may want to experiment with my blocking approach, here is my PHP script, add2whitelist.php, which checks whether a blocked visitor to my website has passed the robot test and, if so, adds his IP address to my whitelist and unblocks him:
<?php
// These cache-control headers should work for all browsers, because
// Cache-Control is part of the HTTP/1.1 standard.
header('Cache-Control: no-cache, no-store, must-revalidate');
session_cache_limiter('nocache'); // PHP sessions send no-cache headers too
header('Pragma: no-cache'); // HTTP/1.0 fallback
header('Expires: 0');
// You must add this to your Nginx config file:
//
//
// # Custom 403 Error Handling
// error_page 403 /403-error-handler_test.html;
// location = /403-error-handler_test.html {
// allow all;
// #internal;
// }
//
// # Works with Custom 403 Error Handling.
// # get_access.html contains the HTML form that displays
// # the robot test to the website visitor and calls this
// # script.
// location = /get_access.html {
// allow all;
// #internal;
// }
//
//
// # Allow access to add2whitelist.php to blocked users!
// # Use your path for <<local path>>!!!
// location = <<local path>>/add2whitelist.php {
// allow all;
// satisfy any;
// include snippets/fastcgi-php.conf;
// fastcgi_pass unix:/run/php/php8.2-fpm.sock;
// }
//
// # Use your path for <<full path>>!!!
// include <<full path>>/whitelist.conf;
//
//
// You must also create an empty file (with a blank line) and
// make it readable and writeable by www-data:
// # Use your path for <<local path>>!!!
// <<local path>>/whitelist.conf
//
//
// You must also have the custom 403 error page named in the error_page
// directive above and get_access.html in the root directory of the
// website.
if($_SERVER["REQUEST_METHOD"] == "POST")
{
$bot_test = "";
$bot_test = validate(5,$_POST["bot_test"]);
//Test whether the robot test question was answered correctly. The
//expected answer is today's day of the month plus eight.
$day_of_month = (int)date("j");
$correct_answer = $day_of_month + 8;
//Quotes around "$correct_answer" turn it into a string, so the strict
//comparison with the POSTed value works.
if($bot_test === "$correct_answer")
{
$ipaddress = $_SERVER["REMOTE_ADDR"];
$filename = "../php/whitelist.conf";
if(!file_exists($filename))
{echo "<br>Whitelist file does not exist."; exit();}
$file = fopen($filename, "a+")
or die("Could not open whitelist file.");
$date = date("M dS Y");
// Append the visitor's IP address (from REMOTE_ADDR) to the whitelist.
fwrite($file,
"allow " . $ipaddress . "; #User added" . " " . $date . PHP_EOL);
fclose($file);
// The next line only works if you add to the
// /etc/sudoers.d/010_pi-nopasswd file:
//
// www-data ALL=(ALL) NOPASSWD: /sbin/nginx -s reload
// www-data ALL=(ALL) NOPASSWD: /sbin/nginx -t
//
// Also note that www-data does not need to be and should not
// be in the sudo group!!!
//
// Check the syntax of the Nginx config file, and if correct,
// reload it.
$text = shell_exec("sudo nginx -t 2>&1");
if(preg_match("/syntax is ok/", $text))
{
shell_exec("sudo nginx -s reload");
header("location:" . "http://cheapskatesguide.org"); exit();
}
else
{echo "<br>Whitelist file has wrong syntax.";}
}
else
{
echo "<br>Wrong answer. Hit the back button in your browser and ";
echo "try again.";
exit();
}
}
else
{
echo "<br>Go to <a href='http://cheapskatesguide.org'>Cheapskate's Guide's main page</a>.";
}
// Validate inputs:
function validate($num_chars, $var)
{
// This function removes any scripts that a hacker may have inserted
// into the robot test HTML form. I prefer to keep it private for
// obvious reasons. Use any method of user input sanitization
// you prefer.
}
?>
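For completeness, the script above expects a POST field named bot_test coming from get_access.html. I am not reproducing my actual page here, but a minimal sketch of what such a form could look like follows. The question wording is only an example, the form action uses the same <<local path>> placeholder as the Nginx configuration comments above, and the maxlength of 5 matches the validate(5, ...) call in the script.

<!-- Minimal sketch of a get_access.html robot test form; not the
     exact page used on cheapskatesguide.org. -->
<!DOCTYPE html>
<html>
<head><title>Request Access</title></head>
<body>
<p>To have your IP address added to the whitelist, answer this
question: what is today's day of the month plus eight?</p>
<!-- Use your path for <<local path>>!!! -->
<form action="<<local path>>/add2whitelist.php" method="post">
<input type="text" name="bot_test" maxlength="5">
<input type="submit" value="Submit">
</form>
</body>
</html>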
Multiple differences exist between a centralized corporate-generated captcha and my solution. First, with my method, no one's data is being sent to a corporation to do who-knows-what with. Next, Internet users who come to my website from residential IP addresses should never be interrogated by this code. My hope is that those who do see my robot test will see it only a few times a year, if that, but that is something we will learn with experience. Last, it is a solution I created to serve a need, not to generate income for shareholders. I have no shareholders. The purpose of Cheapskate's Guide is not to generate income. It is to educate readers.
Although I don't expect the tech bros to receive much enlightenment from this article, I hope those few readers who want to run their own websites without resorting to centralized solutions will understand the implications of what they have read here. Small websites are fully capable of blocking web-scraping robots without relying on or paying for centralized solutions. All it takes is some ingenuity, tenacity, experience with Nginx, Apache, or one of the other web servers, and a minimal amount of programming knowledge. Furthermore, regardless of corporate and governmental efforts to manipulate and silence us on the big social media websites, we can still create our own spaces on the Internet where we can speak freely, and those spaces can accommodate tens or hundreds of thousands of people each. With very efficient software, they may even be able to accommodate millions each. In almost every case, other than needing a basic residential Internet connection, we can do all of this without asking big tech for any help at all. Many instances on the Fediverse that are run by individuals are shining examples of this. My hope is that more of us will begin to realize that we have much more freedom on the Internet than big tech wants us to believe. If Facebook, Reddit, Twitter, and others insist on treating us like Internet cattle, we can abandon their sites and not only create our own, but also create much better online homes for those who are not yet technically sophisticated enough to create their own.
We have had many interesting discussions on Blue Dwarf about creating our own online spaces. If you are interested in self-hosting, personal websites, online privacy, or blogging in general, you are always welcome to join us. Blue Dwarf is a small social media platform that I run personally. It is a free-speech platform that encourages a wide variety of topics and bars none, but we generally tend to lean to the technical side. We have a good community with people of all levels of technical experience, and we welcome tech newbies along with everyone else.
If you have found this article worthwhile, please share it on your favorite social media. You will find sharing links at the top of the page.