It'll never match "a real search engine"

Posted Nov 29, 2011 20:16 UTC (Tue) by khim (subscriber, #9252)
In reply to: Web Search By The People, For The People: YaCy 1.0 by b7j0c
Parent article: Web Search By The People, For The People: YaCy 1.0

This design will never work. The best way to find something on lwn.net is to use Google with a "site:lwn.net" restriction. Why? Because even when you restrict the search to a single site, "a real search engine" still uses meta-information from the whole World Wide Web.

YaCy just does not have enough information to ever match "a real search engine". The best result it can ever hope to get is a mostly unsorted list of links. Relevance sorting is simply impossible - by design (protection of users' privacy).

This is in addition to the technical problems: those are also enormous, but they can be solved, at least in theory. The refusal on principle to use available information (in the name of privacy) cripples the project from the start, and that cannot ever be changed (without changing the stated goals, obviously).

It'll never match "a real search engine"

Posted Nov 29, 2011 21:33 UTC (Tue) by bjartur (guest, #67801)

No, but the privacy benefit is largely moot as soon as you send your query, which you have to do unless all you want is a full-text search over your local cache. That's not really where YaCy excels, although it does support a number of complicated formats grep can't handle.
What YaCy provides is a DHT protocol for distributed keyword search. It has the potential to solve the biggest problem with replacing existing web indexes: the lack of Internet Archive-class bandwidth and storage. In fact, YaCy seems to do it quite well. It is a great step up from the Common Name Resolution Protocol and HTML-form assisted HTTP querying.
There is still much left to improve, with ranking being one. But client-side ranking provides benefits already: customization beyond what Google allows you to do in its belief that it knows your interests better than you do (and in what language you prefer your content, solely based on your location).
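To make the DHT idea concrete, here is a minimal sketch in Python. It is not YaCy's actual protocol or code: the class `ToyKeywordDHT`, the peer names and the example URLs are all made up for illustration, and the "network" is just a dictionary in one process. It only shows the general principle that each keyword hashes to one peer, which holds the list of pages containing it.

```python
import hashlib
from bisect import bisect_left
from collections import defaultdict


def ring_position(key: str) -> int:
    """Map a string to a position on the hash ring (SHA-1 is an arbitrary choice)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class ToyKeywordDHT:
    """Hypothetical in-process stand-in for a network of index peers."""

    def __init__(self, peer_names):
        # Sort peers by ring position so lookups can use binary search.
        self.ring = sorted((ring_position(name), name) for name in peer_names)
        self.positions = [pos for pos, _ in self.ring]
        # Each peer stores only the posting lists of the keywords it owns.
        self.storage = {name: defaultdict(set) for name in peer_names}

    def peer_for(self, keyword: str) -> str:
        """Return the first peer at or after the keyword's ring position."""
        i = bisect_left(self.positions, ring_position(keyword.lower()))
        return self.ring[i % len(self.ring)][1]

    def publish(self, url: str, text: str) -> None:
        """Send every keyword of a crawled page to the peer that owns it."""
        for word in set(text.lower().split()):
            self.storage[self.peer_for(word)][word].add(url)

    def search(self, query: str) -> set:
        """Fetch one posting list per keyword and intersect them."""
        postings = [
            self.storage[self.peer_for(word)].get(word, set())
            for word in query.lower().split()
        ]
        return set.intersection(*postings) if postings else set()


dht = ToyKeywordDHT(["peer-a", "peer-b", "peer-c"])
dht.publish("http://example.org/1", "distributed search engine")
dht.publish("http://example.org/2", "distributed hash table")
print(dht.search("distributed search"))
```

Note that the result of `search` is an unordered set: the sketch spreads storage and lookups across peers, but it says nothing about which hit should come first.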

And this is the only part that matters...

Posted Nov 29, 2011 21:51 UTC (Tue) by khim (subscriber, #9252)

There is still much left to improve, with ranking being one.

This is the only part that matters. Sure, to create a search engine you really need beefy hardware, and it costs a pretty penny, but... that cost is only large for an individual. You need a few million dollars to build a datacenter comparable to Google's - there are a lot of individuals and organizations that can afford that. But then you need to rank the results - and this is where the task becomes hard.

But client-side ranking provides benefits already: customization beyond what Google allows you to do in its belief that it knows your interests better than you do (and in what language you prefer your content, solely based on your location).

Sure, but this is icing on the cake. Show me how and when you'll get the cake - then we'll have a meaningful discussion. Google sorts documents once and then uses that ordering millions of times (this is a simplification, of course: nowadays it alters the existing ranking "on the fly" and does not rebuild the index each week like it did years ago, but these are minor details) - this makes the whole thing affordable. How you plan to achieve that with client-side ranking is the question.
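A minimal Python sketch of the "sort once, use millions of times" point follows. It is emphatically not Google's algorithm: the scoring is a simplified PageRank-style iteration used as a stand-in, and the function names, the tiny link graph and the one-word index are all invented for illustration. The expensive step (`link_scores`) needs the whole link graph but runs once; each query (`search`) is then just a lookup and a sort by the stored score.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score pages from the link graph alone (simplified PageRank-style iteration)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes an equal share of its score to each page it links to.
                share = damping * score[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page without outgoing links spreads its score evenly.
                share = damping * score[page] / n
                for p in pages:
                    new[p] += share
        score = new
    return score


def search(index, scores, word):
    """Per-query work: look up matching pages and sort by the stored score."""
    return sorted(index.get(word, ()), key=lambda url: scores[url], reverse=True)


# Hypothetical four-page web: everybody links to "hub", and "c" also links to "a".
links = {"a": ["hub"], "b": ["hub"], "c": ["hub", "a"], "hub": []}
index = {"search": ["a", "b", "hub"]}

scores = link_scores(links)  # done once, offline, over the whole graph
print(search(index, scores, "search"))  # prints ['hub', 'a', 'b']
```

A client that only sees its own slice of the web would have to either fetch such scores from somebody it trusts or recompute them from incomplete data, which is exactly the cost question above.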

Indexing: Done; Crawling: WIP; Ranking: TD

Posted Dec 3, 2011 15:49 UTC (Sat) by bjartur (guest, #67801)

Ranking seems relatively cheap when all indexers are trusted. Not so in a trust-nothing peer system. YaCy has already pushed the state of the art of P2P further than I expected to see in the coming years, through a use of DHT that seems obvious in retrospect. It's still just a sort-of working proof of concept with great potential for evolutionary enhancement and reworking. In YaCy, indexes can be provided either by website-run YaCy servers or by a distributed network of crawlers. The former works already for the few sites that run YaCy, and with the publicity YaCy has now, distributed crawling might get close to usable soon.
Yes, ranking is hard - but so was distributed indexing. It became easier. For now ranking has to be done by a trusted site. But perhaps the most important product of the YaCy project may turn out to be standardization on a common protocol, allowing searchers to more easily aggregate search results from multiple rankers, and rankers to aggregate indexes from even more crawlers.

If Google refuses to rank Yahoo Mail above GMail, then boo-hoo. If Google omits a site from their index, then shit. It happens so rarely that their results are overall far superior to those of any other engine. Microsoft has crawled an astounding number of pages, but doesn't yet have all those esoteric pages that their authors have long forgotten about and that are only linked to by that other esoteric page, perhaps not served as HTML. It doesn't even matter whether Bing's ranking algorithm is better than Google's, as long as Google's is just good enough to let their maturity keep and attract users. And Google didn't exactly stop at the original PageRank.

But YaCy allows decoupling of indexing and ranking. Even if ranking practically cannot be done in a fully distributed fashion, YaCy allows ranking to be outsourced far more easily than the current mess of custom HTML soup results. Note that a standardized format for search results (uri-list, RSS, Atom or a semantic HTML dialect) would achieve the same, but such a standard will not be adhered to until a few major players have shown support for it and classified it as the New Deal(tm).
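To illustrate why a plain, standardized result format makes aggregation easy, here is a small Python sketch. The assumptions are loud ones: the two "rankers" are hard-coded strings rather than real services, the function names are invented, and the merging rule is reciprocal rank fusion, a generic technique picked for the example and not anything YaCy or any search engine is claimed to use. The only real format involved is text/uri-list (one URI per line, lines starting with '#' are comments).

```python
from collections import defaultdict


def parse_uri_list(text):
    """Parse a text/uri-list body: one URI per line, '#' lines are comments."""
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]


def fuse_rankings(rankings, k=60):
    """Merge ranked URI lists with reciprocal rank fusion (RRF)."""
    fused = defaultdict(float)
    for ranked_uris in rankings:
        for rank, uri in enumerate(ranked_uris, start=1):
            # Each ranker contributes more for URIs it places near the top.
            fused[uri] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stable output.
    return sorted(fused, key=lambda uri: (-fused[uri], uri))


# Hypothetical responses from two independent rankers.
ranker_a = parse_uri_list("# ranker A\nhttp://example.org/x\nhttp://example.org/y\n")
ranker_b = parse_uri_list("http://example.org/y\nhttp://example.org/z\n")

for uri in fuse_rankings([ranker_a, ranker_b]):
    print(uri)
```

With ordered lists as the interchange format, the client needs no knowledge of how each ranker computed its order; scraping differently structured HTML result pages offers no such shortcut.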

Disclaimer: I do not run a YaCy crawler yet, for lack of bandwidth. Google is the search engine used by this stable build of Opera, but Bing is used by the customized experimental build, for it can't cope with the insanity that is the latest Yahoo-esque revision of Google's search result page.