  1. This post will hopefully make the situation a bit clearer for everyone. Most of this info has been posted in one place or another already, but I will attempt to consolidate the important bits as I recall them. I have effectively zero visibility into the operation of cybertip.ca or Project Arachnid (their crawler), so this is based on what I can tell from my end. I have not used this blog in many years, but it's the easiest way I can publish this info now.

    To start with, I do not believe they are acting maliciously, and they do not seem to be intentionally using the site to search for images. They are just following links. Dangerous links which spread CSAM (Child Sexual Abuse Material), links which they should be smart enough not to follow, but ultimately, still, just following links. From what I understand, these links primarily come from image boards and similar sites, which helpfully add them next to all posted images. This is great for users, as the links allow for quick and easy source lookups of interesting images as you come across them.


    The current situation started in the afternoon on the 31st of March when our host received the first CSAM notification and promptly sent it over to us for review. At the time, I was traveling, and not online as often as usual so the report went unnoticed. Around the same time, the host of iqdb.org - another anime reverse image search engine - also started receiving similar notices. Likewise, they forwarded them to the site's operator, but unfortunately the emails wound up in spam and were not immediately noticed. More reports continued coming in over the course of the afternoon and night. I finally noticed them in my email when I came back online later that night.


    The view of my inbox was seriously distressing, full of notifications from cybertip and our colocation host. I was more than a bit freaked out, wondering what could possibly be going so wrong for me to have an inbox full of CSAM alerts. I was also very concerned that our host would suddenly pull the plug. It had been several hours after all, and the wording of the notices is highly alarming. I needed to act fast. Looking at the reports, I was somewhat relieved to see they were reports of /userdata images. Those images are temporary files associated with, and created for the searches performed in response to the links the crawler accessed. They're also all long gone by now, having been automatically deleted only minutes after creation.

    I was far from sure that our colocation host would recognize that distinction on their own though! The clock was ticking, so I quickly purged all query image caches and the like, just to be sure, and responded to the many tickets as fast as possible. Each got mostly the same explanation, and luckily SauceNAO's host seems to have understood the situation. I did not receive a response on the tickets, but the site stayed online. Simultaneously, I attempted to contact cybertip directly in response to the notices, explaining what their bot was doing wrong and how it was directly spreading the material. No response.

    After the initial tickets were dealt with, I sent a heads up to the group of anime site operators I interact with frequently, including the operator of iqdb. It was getting fairly late though, so most had already gone AFK for the night. The notices kept coming, so I took the emergency action of disabling the search query image, since that image was the subject of every report.

    The reports immediately stopped, though cybertip continued to search for bad images, causing them to be uploaded to our servers. Once the image they were uploading was no longer being displayed on the page, there was no longer anything to report...

    The next morning, after some pretty terrible sleep, I awoke to the news that iqdb.org was down. Taken down by their host, in response to the abuse notifications sent by cybertip. Abuse notifications they should actually have been sending to themselves.

    Luckily, SauceNAO was still online. If I had not noticed the night before, we would probably also have been taken down, with potentially damaging effects to our servers, data, and reputation. Later that day, iqdb was brought back online when its operator was able to respond to the abuse reports, but it could have been so much worse.

    Several days later, once I was back home, I started to see many users wondering about the search query image being missing. A few even asked me directly about it, so it was obviously starting to be a problem.

    The search query image makes it clear that the image was acquired successfully, is properly formatted, aligned, etc. Clicking on the search query image also allows editing the image to remove borders or search for just a portion of an image to improve result accuracy. It's a very important feature, and everyone was missing it badly. Reluctantly, I re-enabled the search query images, hoping for the best...

    It took a few hours, but the notices started flowing again. More reports for the images being searched for, the same images being created at the direction of the crawler which then reports them. In frustration, having heard nothing from cybertip, I attempted to contact them again. Shortly after, I posted a pointed message to Twitter, publicly calling them out on their crawler's bad behavior.

    By the next afternoon, the tweet was getting a lot of attention. I don't know if it was solely in response to the attention from that tweet, but they finally responded to my initial email. Around the same time, they replied on Twitter with a complete denial.


    Via email, I attempted to explain what was wrong and suggested several options for fixing it, but they seem to think their crawler's behavior is completely okay. Consequences be damned, no apparent care for how the modern internet operates.
    One good thing did come out of that email communication though: they agreed to notify us directly in the future rather than through our host. This dramatically reduces the chance our host will suddenly decide to drop us as a customer, or take our servers offline.

    Shortly after, I replied to their reply on Twitter.


    Mostly silence since, and their crawler has continued trying to search for what they call CSAM on our site. In response, we disabled searches from AWS, on which their badly behaved crawler is hosted.
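A block like this can be sketched roughly as follows. This is only an illustration and not the code running on the site: the function names are made up, and the two ranges are reserved documentation addresses standing in for whatever ranges a provider actually publishes.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    throw new Error("not an IPv4 address: " + ip);
  }
  return ((parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]) >>> 0;
}

// True if `ip` falls inside `cidr` (for example "192.0.2.0/24").
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  // A /0 mask is handled separately because JS shift counts wrap at 32.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Placeholder ranges (reserved for documentation), NOT real provider ranges.
const BLOCKED_RANGES = ["192.0.2.0/24", "198.51.100.0/24"];

// Decide whether a search request from `ip` should be refused.
function isBlocked(ip) {
  return BLOCKED_RANGES.some((range) => inCidr(ip, range));
}
```

Blocking by range is blunt: it also turns away any legitimate visitor coming from the same hosting provider, which is part of why this is a workaround and not a fix.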

    While blocking them from searching for abuse material on SauceNAO improves the situation for us, it does not change the fact that their bot is actually spreading the material they claim to be trying to remove from the internet. In my view, it's even worse now since they know what is happening and have promised no action to address the problem.

    There are many other services, including big names like Google, Bing, and Yandex, which allow uploading or acting on an image using just an image link embedded in a URL. Each and every one of these is in effect being attacked by the Project Arachnid bot with illegal requests directing their servers to access and in some cases host illegal images. The giants may have the resources to shrug this off, but smaller players like us are being severely impacted by Project Arachnid's misuse of our services.

    I am still attempting to work with them. Hopefully something positive will come of all this.

  2. Update:
    The block has been lifted, and does not seem to have been intentionally placed. It's hard to say what happened, as we have been indexing their site without issue since 2009, but their willingness to work with us to get the problem resolved was awesome. We really appreciate it.

    Our thanks go out to everyone who helped us bring this issue to their attention so quickly, and to the wonderful people at pixiv. May their site continue to thrive~


    Well, I hate to say it, but pixiv seems to have blocked the update server... I could just change IPs, the block is very specific, but that's not the point. We can only hope they will come to their senses and allow such a useful service to continue indexing their site...


    It seems incredibly short sighted for them to block a service which makes their site significantly more accessible to the community, especially so soon after they expanded internationally... I've sent them a few messages in different ways, so we'll see if they feel like replying.


    If you find the pixiv index useful, re-blog this, re-tweet the relevant messages on the SauceNAO twitter feed, e-mail them, and/or do whatever you can to GET THEIR ATTENTION.


    Spread the word. If they can't hear us, nothing will change.

  3. As announced on the twitter account, pixiv index updates were fully automated a few days ago. Assuming the updater performs reliably and pixiv doesn't make any more sweeping changes (wishful thinking), this automation will save quite a bit of time -- and provide a much more useful and relevant index... (two birds with one stone? ;P)

    The current setup deploys a new set roughly every four to five hours, and seems to be working relatively well so far. Since there were no serious issues with the deployment, the frequency may be tweaked somewhat, but updating too often could put unnecessary strain on the various resources involved. We'll just have to see how it goes~


    If anyone has any _reasonable_ ideas on how the various other indexes could be automated (or 'outsourced' ;P), I'd love to hear them. Properly maintaining the currently deployed indexes would take an enormous amount of time, so invariably more important things come up and push even much needed updates aside...

  4. The Image Search Options Firefox Extension has now been 'ported' to Chrome! I use ported very loosely, as this was effectively a complete rewrite, and the two share absolutely no code... Somewhat annoying... >_<

    This extension is more in line with v1.x of the Firefox extension. It does not yet have anywhere near the same level of extensibility as v2.x, and due to the huge limitations placed on Chrome extensions when compared to the complete access provided by Firefox, it may never get there. I'm fairly new to Chrome extensions, so we'll just see how it goes.

    Image Search Options for Chrome

  5. Figured it was about time to bring this index current again, overall not too much new stuff in the first half of the year... There were some great ones, but hopefully frequency will pick up again over the summer.

    I also tweaked the indexing method, so the content that was already in there should be more searchable than ever. The new method will be applied to the anime index next, and an update to that should be coming soon.

  6. The pixiv index has been updated, and now includes up to ID 18530324. There were a few other updates earlier this week, but I never got around to posting about them...


    In the future, the relatively minor and repetitive updates like these will probably only be posted on the SauceNAO twitter feed. Major news and important updates will still be discussed here.

  7. Index now includes all IDs up to and including 17502758~

  8. Finally releasing the first cut of the Anime index! It's nowhere near comprehensive at this point, I'm just releasing it as is because I really need to get a feel for how it performs in the real world. It'll undoubtedly have some issues in its first few months, just as the H-Anime index did, but with a bit of tweaking it should be a great addition to the overall mix.

    The Anime index as it stands covers a pretty random assortment of around a thousand series. It's still very much a work in progress at this point, and will be rapidly expanded over the coming months. (Popular recent series are probably next) Hope you like it~

  9. Since everyone loves it so much, here's another update. ;)
    New Max ID is 16990318~ I'm starting to run out of space to store the raw unprocessed images. This latest update added another 100GB to the total size... Gonna have to do something about that sometime soon >_>;

    Next updates will probably be the H-Anime and H-Mags indexes, they're starting to show their age...

  10. Since I had everything still set up after yesterday's update with 5 day old data, I decided to just bring the index current. New max ID is 16472339~

    Might be a while before I update this index again, but it depends on how fast new images keep stacking up... At the current rate we'll be looking at ID 20m in a month or two >_<;

  11. Highest ID now covered is 16325536.

    The image acquisition was actually between the 1st and 2nd of Feb, but I'm just finally getting around to rolling it out... There's a fair bit that is done to the raw images before they are incorporated into the index; the processing just took a bit longer than normal for various reasons~

  12. This index covers any images that had been previously covered by the original pixiv index, but were deleted between 2008 and 2011. At the moment it is searched at the same time as the 5_pixiv index, and is included with the 999 search.

    It's worth noting that this index is really just for basic coverage... There are certain cases where it could come in handy to know which member, the title, or what ID a deleted image was assigned, but beyond that it won't be much help. The links to where the images would have been on pixiv will generally be dead, and in many cases the members will be gone as well.

  13. The Pixiv index is once again up to date, or so it was a few hours ago when 15953992 was the max... Undoubtedly the index is already missing the most important and epic images ever, but what can ya do~ Next update in a few weeks, that's not too terribly long to wait. ;P

    Oddly enough, the index gen completed before the display images finished uploading to the server, so some of them may be missing for a little while. I decided to deploy the new index before the upload finished since it will quickly resolve itself, and the search results will be accurate even with a few of the display images missing.

  14. This index was in very sad shape. It was started a couple years ago, but quickly abandoned for whatever reason and never really got off the ground... The HCG index could and should be a very important part of the overall index mix, so I've taken some much needed time to expand it from a few dozen games to a few thousand. The index still has a long way to go before it's able to identify anything serious, but it should have _some_ utility now.


    The games indexed are fairly random at this point, some big, some fairly obscure. The next update will probably focus on including the more popular ones.

  15. Seems like the only thing I ever post about is index updates... Well, here's another~

    For this update of the H-Anime Index I changed how the index is generated from the source material. In the wild, the change could have the same good results I saw in testing, or it could completely ruin the index's ability to provide decent matches... It's a delicate balancing act, so we'll just have to see over the next few days. ;P

    In the worst case scenario I'll just roll back to the previous one, and re-add the new episodes with the old method.

  16. Just trying to determine whether it would be reasonable to automate the updates at some point...

    Highest indexed ID is now 15368602~

  17. Pixiv has been updated again~
    The highest ID indexed is now 15278931, but this time the index has been rebuilt from scratch... The fresh regen has the unfortunate side effect of dropping any images which have been deleted from Pixiv over the past year or so. Any such images are no longer included in the 5_Pixiv index, but I will be setting up an extended Pixiv index at some point in the near future which will cover all lost images. (I like to be as complete as possible :P)

  18. The H-Magazine index has finally been updated! The last update was in early 2009, so there was quite a bit missing at this point... With the new index the number of magazines covered has been greatly increased, and every issue available was included. The organization of the index was also changed considerably, so it should be much easier to update in the future.

    Next up: Pixiv! It'll take a few more days, but it's getting to be time for another update...

  19. The DDB Project (3-4), H-Anime (1), and Pixiv(5) Indexes have been updated~

    Pixiv, as always, just needed another update; an update to the tune of 1.5 million new images... They are seriously on fiiiiiire!
    The DDB Project indexes had not been updated in almost a year, and were getting quite old, so they really needed some attention. MugiMugi was kind enough to provide me with the updated data necessary for regenerating them.
    The H-Anime index was just updated in early September, but unfortunately there was a problem with some of the series info fields, and it needed to be regenerated... While I was at it, I brought it up to date again, adding everything that was released in September~

    Actually, the DDB Project and H-Anime index updates are old news. Those updates happened about a week ago, but I never got around to updating this site. You would know this if you kept tabs on the indexing status page. ;P

  20. This version improves on the POST functionality introduced in 1.8.0, fixing many of the limitations, and making it easier to use.

    The problem causing the odd image truncation for large images turned out to be Firefox's automatic image resizing, or any other resizing of the image. An unfortunate limitation of the JavaScript image height and width attributes is that they report the displayed size, not the real image dimensions... To get around this, I have to load the image into a new image object which is not resized, and pull the height and width attributes from that. (As always, any better suggestions are welcome.)
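A minimal sketch of that workaround, assuming a browser environment. The function name and the `ImageCtor` parameter are made up for illustration and are not the extension's actual code. It reads the standard `naturalWidth` and `naturalHeight` properties, which report the intrinsic size of an image regardless of how it is displayed.

```javascript
// Resolve with the unscaled pixel size of the image at `url`.
// `ImageCtor` defaults to the browser's Image constructor; it is a parameter
// only so the function can be exercised outside a browser with a stand-in.
function getRealDimensions(url, ImageCtor = globalThis.Image) {
  return new Promise((resolve, reject) => {
    const img = new ImageCtor(); // never attached to the page, so never resized
    img.onload = () => resolve({ width: img.naturalWidth, height: img.naturalHeight });
    img.onerror = () => reject(new Error("could not load " + url));
    img.src = url; // set last so both handlers are already attached
  });
}
```

Where `naturalWidth` and `naturalHeight` are available on the element that was clicked, reading them directly may avoid the second image object altogether. I have not verified that against the Firefox versions the extension targeted.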

    Background image search was pretty easy to add. It really should have been in the first iteration, but I dropped it in favor of an earlier release.

    Finally, and probably most importantly, the POST image submission will no longer show the search page while loading the results. Instead it will show "Loading. Please Wait...". The way the POST submission works is still exactly the same, but while the results are loading I overwrite the base search page with the status message so it is clearer what is actually going on.

    So all in all it's become even more complicated. The current flow is something like this...

    Getting Results:
    Selected Image->ImageURL->new ImageObject from URL->new Canvas from ImageObject->PNG DataURI from Canvas->Binary Image data from DataURI via AtoB->Binary Stream from Binary Image data->POST Request using Binary Stream.

    Displaying Results:
    Load Search Page in new Tab/etc...->Wait for search page load to finish->Overwrite its innerHTML with loading message->Wait for POST request to finish->Overwrite its innerHTML with POST request results.

    isn't it just beautiful~ >_>;
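The "PNG DataURI -> Binary Image data via AtoB" step in that flow can be sketched like this. It is an illustration under assumptions, not the extension's code: the function name is made up, and it returns a `Uint8Array` where the extension builds a binary stream.

```javascript
// Decode a base64 data URI (such as the string returned by
// canvas.toDataURL("image/png")) into its MIME type and raw bytes.
function dataUriToBytes(dataUri) {
  const match = /^data:([^;,]+);base64,(.*)$/.exec(dataUri);
  if (!match) {
    throw new Error("expected a base64 data URI");
  }
  const binary = atob(match[2]); // binary string: one character per byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return { mimeType: match[1], bytes };
}
```

The resulting bytes are what end up in the body of the POST request. How they are wrapped for sending depends on the APIs available to the extension.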

  21. At long last the H-Anime Index has been given some real attention, and has been brought up to date. Loads of new content can now be found, and as such it should be much more useful. The previous update was in mid December so it was really in need of a good refresh.

    I hope it won't go quite so long without an update next time, but who knows~ It can be quite tough to find certain things... ;P

  22. 412269 new images added to the Pixiv index~ Last ID is 12757456

    New stuff (not pixiv updates) soonish, been too busy with other things.

  23. 433413 new images added to the Pixiv index~ Last ID is 12439935

    There's so damn much activity on Pixiv these days. XD


    Pixiv user image index is also coming along nicely at the expense of the H-Anime index update. Alas, I am but one person...

  24. https://addons.mozilla.org/en-US/firefox/addon/93451/

    It's been out for a while now, and is pretty easy to use, but I'll quickly cover some of its more interesting features...

    Custom GET Options:
    First off is the 'Options String' setting. This can be used to pass customized GET variables to the various sites. If you wanted to set the variable "blah" to "foo" you would type into this field "&blah=foo". Whatever is placed there will be inserted directly into the URL.

    A practical example for SauceNAO would be "&accept=1" which will automatically bypass that pesky warning screen. This is particularly useful if you do not wish to accept SauceNAO's delicious cookie.
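To make the "inserted directly into the URL" part concrete, here is a minimal sketch. The function name, the `url` query parameter, and the example host are assumptions for illustration; each search site has its own URL layout.

```javascript
// Build a search URL for an image and append the user's raw options string.
// The options string is added as-is, so it must already start with "&".
function buildSearchUrl(searchBase, imageUrl, optionsString = "") {
  return searchBase + "?url=" + encodeURIComponent(imageUrl) + optionsString;
}

// Hypothetical usage with a placeholder host:
const example = buildSearchUrl(
  "https://example.invalid/search.php",
  "http://a.example/b c.png",
  "&accept=1"
);
```

Because the string is not escaped or validated, anything typed into the field goes straight into the request, which is what makes it flexible.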


    POST Image Search:
    Another incredibly useful feature is POST image submission. This allows you to directly search for images that are behind login screens, on your local network, or are for whatever other reason not directly accessible to the search site.
    When this feature is enabled, and the user right clicks on a NON-BACKGROUND image, clicking the context menu option will initiate a special query sequence. (Doing so on a background image will perform a regular search.)
    Now, the way this feature works is somewhat peculiar, and if you know of a better way PLEASE let me know!

    First it will load the selected image into an invisible canvas element. That canvas element is then converted into a base64 png data URI which is then atob'd, placed into a stream, and POSTed to the selected site. Are you clawing your eyes out yet?

    The extension then opens a new tab/window/etc and has it load the selected site's search URL without an image so that relative links work properly, and then uses JavaScript to replace the page's content with the data returned by the POST submission.

    Not surprisingly, this method has a few significant problems... First off, when the new tab is opened, it will temporarily display a page without results, or an error message until the POST finishes. Secondly, the page source of the POST submission results can not be viewed in the 'View Page Source' window. And lastly, somewhere in the conversion to canvas element, to data URI, to binary data, to binary stream, large images get improperly truncated. This effectively crops the large images to their upper left corner. I'll be looking into this one more, it's probably something stupid, but for now only reasonably sized images work perfectly~


    Other cool but fairly self explanatory features include:
    • Being able to turn off certain sites! - You don't actually like IQDB or TinEye do ya? Or don't tell me, you hate SauceNAO... ;P
    • Renaming any of the Context Menu options. - The defaults are sooo lame.
    • Turning off the menu options for background images. - They're usually not what you want to search for anyway.
    • Changing where the results open! - Why settle for a new foreground tab? You can have a new window, a background tab, etc~

  25. 297539 new images added to the pixiv index~

    The pixiv redesign had me raging for a little bit, my scripts were working so perfectly for the old site... ;_; Overall it doesn't behave too terribly differently though.


    Next Updates:
    H-Anime - Couple Weeks

    I'm also considering adding a new index to cover pixiv profile, and page background images. Just need to work out how best to keep it up to date while maintaining history...

  26. Don't really know why I'm starting this blog, but oh well~

    SauceNAO is a reverse image search engine focusing on Anime and Manga related images.

    The SauceNAO site uses a modified version of the Image Query Database server code which can be found at the heart of the IQDB reverse image search service.
    The IQDB server works by taking an image supplied by the user, separating its color channels, and running a wavelet transform on each channel.
    The output of the wavelet transform is weighted using a little bit of magic, and the transformed image is compared to the already processed images in the index to determine the closest match.
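For a feel of what the wavelet step does, here is a minimal sketch using the Haar wavelet, the simplest one. Choosing Haar is an assumption on my part for illustration, the function names are made up, and the weighting "magic" and the index comparison are not shown.

```javascript
// Full 1D Haar decomposition using plain averages and differences.
// `values.length` must be a power of two.
function haar1d(values) {
  const out = values.slice();
  for (let len = out.length; len > 1; len /= 2) {
    const half = len / 2;
    const tmp = out.slice(0, len);
    for (let i = 0; i < half; i++) {
      out[i] = (tmp[2 * i] + tmp[2 * i + 1]) / 2;        // coarse average
      out[half + i] = (tmp[2 * i] - tmp[2 * i + 1]) / 2; // detail
    }
  }
  return out;
}

// 2D decomposition of one color channel (an array of rows):
// transform every row, then every column of the result.
function haar2d(channel) {
  const rows = channel.map(haar1d);
  const width = rows[0].length;
  for (let x = 0; x < width; x++) {
    const column = haar1d(rows.map((row) => row[x]));
    for (let y = 0; y < rows.length; y++) {
      rows[y][x] = column[y];
    }
  }
  return rows;
}
```

After the transform, the top-left value holds the channel's overall average and the remaining values describe detail at progressively finer scales, which is what makes a compact, comparable signature possible.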

    It takes some serious power, and a helluva lot of storage space to sort and pre-process a few hundred million image files, but having a large part of the work done already is what keeps the searches as fast as possible.
    As an example, the pixiv index at a mere 10 million images takes a good 24 hours to generate from scratch, and pixiv is growing by leaps and bounds. (~200k new images this week alone)

    But before pre-processing can even begin, the content must first be gathered. Deciding what to index, and then finding a way to get it is where the real challenge lies.
    Guessing what people _might_ search for, or only indexing big name titles really doesn't cut it. Any random person can source the popular stuff...
    The goal of SauceNAO is to gather both the popular things, and the most ridiculously obscure things that neither I nor seemingly anyone else has heard of.

    In that respect, only two of the indexes really pass muster: the H-Anime index and the pixiv Index. Unfortunately, the former hasn't been updated in nearly 8 months, and is horribly out of date...
    I'll get to that soon. >_>;

    Next Updates:
    pixiv - next couple days
    H-Anime - few weeks
    E-Anime (first release) - month or two

    The E-Anime index will contain Ecchi Anime. Just a few hundred series to start with, but it'll keep expanding until everything is covered, or I run out of room... ;P
