Message from discussion "Accessing archive.org"
Brian Widdas  
Newsgroups: demon.service
From: Brian Widdas <br...@widdas.net>
Date: Wed, 14 Jan 2009 23:47:57 GMT
Local: Wed, Jan 14 2009 6:47 pm
Subject: Re: Accessing archive.org
On 2009-01-10, Alan Poulter <l...@poulter.demon.co.uk> wrote:

> Is anyone else having problems accessing archive.org? If you are please
> can you email me rather than reply here.

Hi.

The problems have now been fixed. The explanation is not short, so bear
with me:

Firstly, yes, something on web.archive.org was blocked by the IWF. Don't
ask what, I don't know.

Our filter uses a proxy to inspect suspect URLs. Where a URL is not on the
IWF list (i.e., the server hosts some child abuse content, but only a
single URL is blocked), we have to proxy the connection on to the original
server the request was intended for.
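
A minimal sketch of that routing decision, under stated assumptions: the
IWF list is not public, so the URL below is a made-up placeholder, and the
idea that traffic to non-suspect servers bypasses the proxy entirely is
inferred from the description above rather than confirmed.

```python
from urllib.parse import urlsplit

# Hypothetical data: a stand-in for the (non-public) IWF URL list.
BLOCKED_URLS = {"http://example.org/bad/page.html"}
# Servers hosting at least one listed URL are treated as "suspect".
SUSPECT_HOSTS = {urlsplit(u).hostname for u in BLOCKED_URLS}


def route(url: str) -> str:
    """Return 'direct', 'block' or 'proxy-forward' for a requested URL."""
    host = urlsplit(url).hostname
    if host not in SUSPECT_HOSTS:
        return "direct"         # assumed: never goes near the filtering proxy
    if url in BLOCKED_URLS:
        return "block"          # the single listed URL is refused
    return "proxy-forward"      # suspect server, unlisted URL: passed on
```

Only the last case matters for this thread: the request is legitimate, but
it reaches the origin server by way of the proxy.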

Here's where it gets interesting. The proxy sends various bits of
information with the request. One of these is the name of the proxy itself.
Not surprisingly, this is 'iwfwebfilter.thus.net'.
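
To make that concrete, here is a rough sketch of a proxy adding its own name
to a forwarded request. The header name X-Forwarded-Server is an assumption
(it is what Apache's mod_proxy uses for the proxy's hostname); nothing here
confirms which header the filter actually set.

```python
PROXY_NAME = "iwfwebfilter.thus.net"


def forward_headers(client_headers: dict[str, str]) -> dict[str, str]:
    """Copy the client's headers and add the proxy's own name."""
    headers = dict(client_headers)  # leave the caller's dict untouched
    # Assumption: header name chosen for illustration only.
    headers["X-Forwarded-Server"] = PROXY_NAME
    return headers
```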

It seems that archive.org use caches at their end to speed up access to
pages. When a page is requested, if it's not in the cache, it is built from
the archive and made available to the requestor. As part of this build
process, the server takes a hostname from the cache, along with the date
portion of the URL, etc, to create the 'base URL' of the page.

To explain: say you want to archive www.demon.net. In order to make the
page available on

    http://web.archive.org/web/20070107191318/http://www.demon.net/

you need to strip out all the references to http://www.demon.net/ in the
page (in links, images, CSS, javascript, etc) and replace them with the URL
above. Since a page may not change much, it's better to do it at request
time, so that a single copy of a page can span multiple archived instances.
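
A simplified sketch of that request-time rewrite, not the Wayback Machine's
actual code. The function names are made up for illustration, and a plain
string replace stands in for what would really need HTML, CSS and
JavaScript parsing.

```python
ARCHIVE_HOST = "web.archive.org"


def archive_prefix(host: str, timestamp: str) -> str:
    """Build the part of the archive URL that precedes the original URL."""
    return f"http://{host}/web/{timestamp}/"


def rewrite_links(html: str, original_base: str, timestamp: str,
                  host: str = ARCHIVE_HOST) -> str:
    """Point every reference to original_base at the archived copy."""
    # Naive textual replace; the replacement text is not rescanned.
    return html.replace(original_base,
                        archive_prefix(host, timestamp) + original_base)
```

Because the timestamp is only applied when the page is served, one stored
copy can be rewritten for any capture date it covers.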

Unfortunately, the archive.org software would take the server name we
supplied and use it in place of 'web.archive.org', which is why you'd get a
URL like

    http://iwfwebfilter.thus.net/web/20070107191318/http://www.demon.net/

That server doesn't have any content there, so you'd get a 404.
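
The shape of the bug, as a hedged sketch: the header name is an assumption,
and the "fixed" version is just one obvious way to close the hole, not
necessarily what the Internet Archive did.

```python
CANONICAL_HOST = "web.archive.org"


def base_url_buggy(headers: dict[str, str], timestamp: str,
                   original: str) -> str:
    """Trust whichever server name arrived with the request."""
    # Assumption: X-Forwarded-Server is a guess at the header involved.
    host = headers.get("X-Forwarded-Server", CANONICAL_HOST)
    return f"http://{host}/web/{timestamp}/{original}"


def base_url_fixed(headers: dict[str, str], timestamp: str,
                   original: str) -> str:
    """Ignore request-supplied names and always use the real hostname."""
    return f"http://{CANONICAL_HOST}/web/{timestamp}/{original}"
```

Called with the proxy's header present, the buggy version produces exactly
the iwfwebfilter.thus.net URL shown above; the fixed version does not.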

However, this only happened on a cache miss. That is, if the page was
already in the cache, and it had the correct URLs, it would work just fine.
So some people would see that everything appeared to be as it should.

Equally unfortunately, a page with the iwfwebfilter.thus.net URLs could be
cached and then served up to non-Demon customers, which explains our
friends in Romania, and other reports of people who'd not been anywhere
near the Demon caches seeing 'iwfwebfilter.thus.net' where they'd been
expecting 'web.archive.org'.

Shortly before 10pm this evening (albeit a more civilised time where
they're based), the Internet Archive fixed the bug and cleared the caches,
so the problem won't return. Nor can the same technique of mis-supplying a
hostname be used for mischief.

To summarise:
  * There was a bug in the Wayback Machine software, which we tickled
  * Demon didn't perform any content manipulation
  * Demon didn't unilaterally filter or block web.archive.org
  * The Internet Archive have now fixed the bug

As Richard is fond of saying, I'm writing to inform.
Brian
--
☺

