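Trying out ConoHa's ArchiveBox application

Last updated / posted: June 16, 2019

This article is the Day 23 entry of the ConoHa Advent Calendar 2019.

The neat and cute "ConoHa" has released a template called ArchiveBox, so I gave it a try. That is what this post is about.

Summary first

First, copy and paste the following: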

apt-get update;
yes | apt-get -y upgrade;
apt-get -y dist-upgrade;
cd /opt/archivebox/;
git checkout master;
git pull;
apt-get remove -y youtube-dl;
wget https://yt-dl.org/latest/youtube-dl -O /usr/local/bin/youtube-dl;
chmod a+x /usr/local/bin/youtube-dl;
hash -r;

Once that finishes, you can archive a page like this:

echo "URL of the web page to archive" | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive

The results are saved under /opt/archivebox/output/ and can be viewed through the web UI.

What is ArchiveBox??

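[Release] [VPS] "ArchiveBox" template image now available | ConoHa VPS

On Wednesday, April 24, 2019, ConoHa began offering the "ArchiveBox" application template image.

With the template, you can start using "ArchiveBox" right away. It lets you easily save and archive the content of URLs you specify in formats such as HTML, PDF, and images.

See here for more about ArchiveBox.

In a word, it is apparently
"The open-source self-hosted web archive."
(A do-it-yourself storage service for your embarrassing past, something like a web gyotaku.)

Creating the server

Log in to the ConoHa dashboard and quickly spin up a server.

Given what it does, it will probably need plenty of storage, so I chose the 1GB plan, which comes with a lot of storage (a 50GB SSD for under 1,000 yen. Cheap!!)

The image type is, of course, "ArchiveBox". Set the root password and the rest however you like.

Logging in and setting up

Log in to the assigned IP address.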

Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-47-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Sun Dec 22 xx:xx:xx JST 2019

  System load:  1.5                Processes:           120
  Usage of /:   11.5% of 49.09GB   Users logged in:     0
  Memory usage: 25%                IP address for eth0: 255.255.255.255
  Swap usage:   0%

 * Overheard at KubeCon: "microk8s.status just blew my mind".

     https://microk8s.io/docs/commands#microk8s.status

241 packages can be updated.
112 updates are security updates.


*** System restart required ***
================================================
Welcome to ArchiveBox image!

Server address : http://255.255.255.255/

ArchiveBox directory : /opt/archivebox/

ArchiveBox Web Username : abox_user
ArchiveBox Web Password : XXXXXXXXXX

Enjoy Minecraft!

To delete this message: rm -f /etc/motd
================================================

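The handy-looking System information block catches the eye, but for now let's look at the ArchiveBox part.

I see. There is a URL, so there must be a web UI. Let's try it right away.

Enter the Web Username and Web Password.

...Huh?

Let me check the documentation.

Viewing archived web pages
Archived web pages can be viewed from a web browser. Until you have run at least one archive, the archive viewing page is not generated and you will get a 404 error, so run an archive by following the earlier section "Archiving web pages with ArchiveBox".

How to use the ArchiveBox application image | ConoHa VPS Support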
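
The 404 described in the quote can be checked from the shell before opening the browser. A minimal sketch, assuming the web UI serves the index.html that the archive command writes into the output directory (the logs later in this post point to output/index.html); `index_ready` is a hypothetical helper name, not part of ArchiveBox.

```shell
#!/bin/sh
# Hypothetical helper (not part of ArchiveBox): succeed only if the archive
# index page has already been generated in the given output directory.
index_ready() {
    [ -f "${1%/}/index.html" ]
}

# Assumed usage on the ConoHa image:
#   index_ready /opt/archivebox/output && echo "index exists" || echo "run an archive first"
```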

I see.

So this is the archive command.

$ cd /opt/archivebox/ && \
sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False \
 /opt/archivebox/archive "URL of the web page to archive"

Before actually running it, first do the customary update:

$ apt-get update
$ apt-get -y upgrade
$ apt-get -y dist-upgrade

Then, at last, archive the documentation page.

$ cd /opt/archivebox/ && sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive "https://support.conoha.jp/v/archivebox/"
[*] [2019-12-22 15:53:48] Downloading https://support.conoha.jp/v/archivebox/
    > output/sources/support.conoha.jp-1576997628.txt
[*] [2019-12-22 15:53:49] Parsing new links from output/sources/support.conoha.jp-1576997628.txt...
[X] No links found :(

...Something is off.

Maybe the page is the problem? This time, try Google.

$ cd /opt/archivebox/ && sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive "https://www.google.com/?hl=ja"
[*] [2019-12-22 15:54:09] Downloading https://www.google.com/?hl=ja
    > output/sources/www.google.com-1576997649.txt
[*] [2019-12-22 15:54:09] Parsing new links from output/sources/www.google.com-1576997649.txt...
    > Adding 14 new links to index (parsed import as Plain Text)
[*] [2019-12-22 15:54:09] Updating main index files...
    > output/index.json
    > output/index.html
[?] [2019-12-22 15:54:09] Updating content for 14 pages in archive...
[+] [2019-12-22 15:54:10] "https://www.youtube.com/?gl=JP&tab=w1"
    https://www.youtube.com/?gl=JP&tab=w1
    > output/archive/1576997649 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:10] "https://www.google.com/setprefdomain?prefdom=JP&prev=https://www.google.co.jp/&sig=K_JeAmkhcsGNpZGRumn5RDR2zO--w%3D"
    https://www.google.com/setprefdomain?prefdom=JP&prev=https://www.google.co.jp/&sig=K_JeAmkhcsGNpZGRumn5RDR2zO--w%3D
    > output/archive/1576997649.0 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:10] "https://www.google.com/logos/doodles/2019/winter-2019-northern-hemisphere-5325275381366784-2x.jpg"
    https://www.google.com/logos/doodles/2019/winter-2019-northern-hemisphere-5325275381366784-2x.jpg
    > output/archive/1576997649.1 (new)
      > favicon
      > title
      > wget
      > pdf
      > screenshot
      > dom
      > git
      > media
      √ index.json
      √ index.html
[+] [2019-12-22 15:54:12] "https://www.google.co.jp/intl/ja/about/products?tab=wh"
    https://www.google.co.jp/intl/ja/about/products?tab=wh
    > output/archive/1576997649.2 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://www.google.co.jp/imghp?hl=ja&tab=wi"
    https://www.google.co.jp/imghp?hl=ja&tab=wi
    > output/archive/1576997649.3 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://play.google.com/?hl=ja&tab=w8"
    https://play.google.com/?hl=ja&tab=w8
    > output/archive/1576997649.4 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://news.google.co.jp/nwshp?hl=ja&tab=wn"
    https://news.google.co.jp/nwshp?hl=ja&tab=wn
    > output/archive/1576997649.5 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://maps.google.co.jp/maps?hl=ja&tab=wl"
    https://maps.google.co.jp/maps?hl=ja&tab=wl
    > output/archive/1576997649.6 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://mail.google.com/mail/?tab=wm"
    https://mail.google.com/mail/?tab=wm
    > output/archive/1576997649.7 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://drive.google.com/?tab=wo"
    https://drive.google.com/?tab=wo
    > output/archive/1576997649.8 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "https://accounts.google.com/ServiceLogin?hl=ja&passive=true&continue=https://www.google.com/%3Fhl%3Dja"
    https://accounts.google.com/ServiceLogin?hl=ja&passive=true&continue=https://www.google.com/%3Fhl%3Dja
    > output/archive/1576997649.9 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "http://www.google.co.jp/intl/ja/services/"
    http://www.google.co.jp/intl/ja/services/
    > output/archive/1576997649.10 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "http://www.google.co.jp/history/optout?hl=ja"
    http://www.google.co.jp/history/optout?hl=ja
    > output/archive/1576997649.11 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[+] [2019-12-22 15:54:12] "http://schema.org/WebPage"
    http://schema.org/WebPage
    > output/archive/1576997649.12 (new)
      > favicon
    ! Failed to archive link: KeyError: 'domain'
[√] [2019-12-22 15:54:12] Update of 14 pages complete (2.93 sec)
    - 1 entries skipped
    - 7 entries updated
    - 0 errors
    To view your archive, open: output/index.html
[*] [2019-12-22 15:54:12] Updating main index files...
    > output/index.json
    > output/index.html

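That's long! Back to the browser...!!

Not what I expected.
I thought it would be one URL = one row, but entries like play.google.com are there too. Why?

Maybe the version is old.
It looks like a Git checkout, so let's update it.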

$ cd /opt/archivebox/
$ git checkout master
$ git pull

Retry (this time with Yahoo)

$ cd /opt/archivebox/ && sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive "https://www.yahoo.co.jp/"
[*] [2019-12-22 15:56:01] Downloading https://www.yahoo.co.jp/
    > output/sources/www.yahoo.co.jp-1576997761.txt
[*] [2019-12-22 15:56:01] Parsing new links from output/sources/www.yahoo.co.jp-1576997761.txt...
    > Adding 65 new links to index (parsed import as Plain Text)
[*] [2019-12-22 15:56:01] Saving main index files...
    √ output/index.json
    √ output/index.html
[?] [2019-12-22 15:56:01] Updating content for 79 pages in archive...

[+] [2019-12-22 15:56:01] "https://www.yahoo.co.jp/"
    https://www.yahoo.co.jp/
    > output/archive/1576997761
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:56:13] "https://www.yahoo-help.jp/app/answers/detail/p/533/a_id/43883"
    https://www.yahoo-help.jp/app/answers/detail/p/533/a_id/43883
    > output/archive/1576997761.0
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:56:24] "https://www.yahoo-help.jp/"
    https://www.yahoo-help.jp/
    > output/archive/1576997761.1
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:56:41] "https://weather.yahoo.co.jp/weather/"
    https://weather.yahoo.co.jp/weather/
    > output/archive/1576997761.2
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:56:53] "https://tv.yahoo.co.jp/"
    https://tv.yahoo.co.jp/
    > output/archive/1576997761.3
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:57:09] "https://trilltrill.jp/"
    https://trilltrill.jp/
    > output/archive/1576997761.4
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:57:25] "https://travel.yahoo.co.jp/?sc_e=ytsl"
    https://travel.yahoo.co.jp/?sc_e=ytsl
    > output/archive/1576997761.5
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:57:46] "https://travel.yahoo.co.jp/?sc_e=ytmh"
    https://travel.yahoo.co.jp/?sc_e=ytmh
    > output/archive/1576997761.6
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:58:04] "https://transit.yahoo.co.jp/"
    https://transit.yahoo.co.jp/
    > output/archive/1576997761.7
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:58:13] "https://sports.yahoo.co.jp/"
    https://sports.yahoo.co.jp/
    > output/archive/1576997761.8
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:58:25] "https://shopping.yahoo.co.jp/?sc_e=ytc"
    https://shopping.yahoo.co.jp/?sc_e=ytc
    > output/archive/1576997761.9
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:58:49] "https://shopping.yahoo.co.jp/"
    https://shopping.yahoo.co.jp/
    > output/archive/1576997761.10
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:59:12] "https://services.yahoo.co.jp/?mode=pc"
    https://services.yahoo.co.jp/?mode=pc
    > output/archive/1576997761.11
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:59:16] "https://search.yahoo.co.jp/search"
    https://search.yahoo.co.jp/search
    > output/archive/1576997761.12
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:59:20] "https://retty.me/?utm_y_pc_top"
    https://retty.me/?utm_y_pc_top
    > output/archive/1576997761.13
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:59:39] "https://realestate.yahoo.co.jp/"
    https://realestate.yahoo.co.jp/
    > output/archive/1576997761.14
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 15:59:50] "https://rdsig.yahoo.co.jp/travel_kanko/yjtop_cont/RV=1/RU=aHR0cHM6Ly93d3cuaWt5dS5jb20vaWtDby5hc2h4P2Nvc2lkPWlrMDEwMDAyJnN1cmw9JTJG"
    https://rdsig.yahoo.co.jp/travel_kanko/yjtop_cont/RV=1/RU=aHR0cHM6Ly93d3cuaWt5dS5jb20vaWtDby5hc2h4P2Nvc2lkPWlrMDEwMDAyJnN1cmw9JTJG
    > output/archive/1576997761.15
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:00:05] "https://rdsig.yahoo.co.jp/reservation/yjtop_cont/RV=1/RU=aHR0cHM6Ly9yZXN0YXVyYW50LmlreXUuY29tL3JzQ29zaXRlLmFzcD9Db3NObz0xMDAwMDE3NSZDb3NVcmw9"
    https://rdsig.yahoo.co.jp/reservation/yjtop_cont/RV=1/RU=aHR0cHM6Ly9yZXN0YXVyYW50LmlreXUuY29tL3JzQ29zaXRlLmFzcD9Db3NObz0xMDAwMDE3NSZDb3NVcmw9
    > output/archive/1576997761.16
      > title
        Failed: Unable to detect page title
        Run to see full output:
            cd /opt/archivebox/output/archive/1576997761.16;
            curl https://rdsig.yahoo.co.jp/reservation/yjtop_cont/RV=1/RU=aHR0cHM6Ly9yZXN0YXVyYW50LmlreXUuY29tL3JzQ29zaXRlLmFzcD9Db3NObz0xMDAwMDE3NSZDb3NVcmw9 | grep <title>
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:00:26] "https://rdsig.yahoo.co.jp/partner/from_ytop/pc/list1/RV=1/RU=aHR0cHM6Ly9wYXJ0bmVyLnlhaG9vLmNvLmpwLw--"
    https://rdsig.yahoo.co.jp/partner/from_ytop/pc/list1/RV=1/RU=aHR0cHM6Ly9wYXJ0bmVyLnlhaG9vLmNvLmpwLw--
    > output/archive/1576997761.17
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:00:40] "https://rdsig.yahoo.co.jp/auction/promo/yearend2019/pc/ytop/txt/RV=1/RU=aHR0cHM6Ly9hdWN0aW9ucy55YWhvby5jby5qcC90b3BpYy9wcm9tby95ZWFyZW5kMjAxOS8_Y3BpZD1wcl95ZWFyZW5kMjAxOSZtZW51PXRvcHBhZ2UmdGFyPXRvcCZjcj10b3A-"
    https://rdsig.yahoo.co.jp/auction/promo/yearend2019/pc/ytop/txt/RV=1/RU=aHR0cHM6Ly9hdWN0aW9ucy55YWhvby5jby5qcC90b3BpYy9wcm9tby95ZWFyZW5kMjAxOS8_Y3BpZD1wcl95ZWFyZW5kMjAxOSZtZW51PXRvcHBhZ2UmdGFyPXRvcCZjcj10b3A-
    > output/archive/1576997761.18
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:00:51] "https://privacy.yahoo.co.jp/"
    https://privacy.yahoo.co.jp/
    > output/archive/1576997761.19
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:00:57] "https://premium.yahoo.co.jp/"
    https://premium.yahoo.co.jp/
    > output/archive/1576997761.20
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:01:26] "https://points.yahoo.co.jp/"
    https://points.yahoo.co.jp/
    > output/archive/1576997761.21
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:01:36] "https://news.yahoo.co.jp/topics/top-picks?date=20191222&mc=f&mp=f"
    https://news.yahoo.co.jp/topics/top-picks?date=20191222&mc=f&mp=f
    > output/archive/1576997761.22
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:01:46] "https://news.yahoo.co.jp/pickup/6346062"
    https://news.yahoo.co.jp/pickup/6346062
    > output/archive/1576997761.23
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:01:58] "https://news.yahoo.co.jp/pickup/6346059"
    https://news.yahoo.co.jp/pickup/6346059
    > output/archive/1576997761.24
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:02:11] "https://news.yahoo.co.jp/pickup/6346056"
    https://news.yahoo.co.jp/pickup/6346056
    > output/archive/1576997761.25
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:02:29] "https://news.yahoo.co.jp/pickup/6346053"
    https://news.yahoo.co.jp/pickup/6346053
    > output/archive/1576997761.26
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media

[+] [2019-12-22 16:02:44] "https://news.yahoo.co.jp/pickup/6346052"
    https://news.yahoo.co.jp/pickup/6346052
    > output/archive/1576997761.27
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      ??                                                                                          2.4% (1/60sec)^C


[X] [2019-12-22 16:02:52] Downloading paused on link 1576997761.27 (29/79)
    To view your archive, open: output/index.html
    Continue where you left off by running:
        archive 1576997761.27

Even after 5 minutes it has not finished...

Time to look at the official docs...

How does it work?

echo 'http://example.com' | ./archive

GitHub - pirate/ArchiveBox: The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

A pipe???

$ echo "https://www.yahoo.co.jp/" | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False  /opt/archivebox/archive
[*] [2019-12-22 16:10:13] Parsing new links from output/sources/stdin-1576998613.txt...
    > Adding 1 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:10:13] Saving main index files...
    √ output/index.json
    √ output/index.html
[?] [2019-12-22 16:10:13] Updating content for 1 pages in archive...

[+] [2019-12-22 16:10:13] "https://www.yahoo.co.jp/"
    https://www.yahoo.co.jp/
    > output/archive/1576998613
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media
[√] [2019-12-22 16:10:26] Update of 1 pages complete (12.93 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors
    To view your archive, open: output/index.html
[*] [2019-12-22 16:10:26] Saving main index files...
    √ output/index.json
    √ output/index.html

Refresh!!

ヽ(=´▽`=)ノ

Click the link!

ヽ(=´▽`=)ノヽ(=´▽`=)ノヽ(=´▽`=)ノヽ(=´▽`=)ノヽ(=´▽`=)ノ

What I learned from reading the official documentation

Usage · pirate/ArchiveBox Wiki · GitHub

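  • Running ./archive performs the archiving.
    • Input piped on stdin is treated as the URL itself → only that page is archived.
    • A URL passed as a parameter is treated as a URL list → that page is downloaded and every link found in it is archived.

Apparently. It can also apparently take its URL list from:

  • External URLs such as RSS or XML feeds
  • Chrome or Firefox browser history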
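
Since the logs above report that piped input is "parsed import as Plain Text", several pages could presumably be archived in one run by piping a file of URLs instead of a single echo. A minimal sketch under that assumption: `urls.txt` and the helper `clean_url_list` are hypothetical names of my own, not part of ArchiveBox, and I have not confirmed how the tool handles every kind of input line.

```shell
#!/bin/sh
# Hypothetical helper (not part of ArchiveBox): print only the lines of a file
# that start with http:// or https://, dropping duplicates while keeping the
# original order. Blank lines and '#' comments are dropped as a side effect.
clean_url_list() {
    grep -E '^https?://' "$1" | awk '!seen[$0]++'
}

# Assumed usage, with the same invocation as in this post:
#   clean_url_list urls.txt | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False /opt/archivebox/archive
```

Filtering first keeps stray text out of the import, which matters because the parameter form showed how easily unwanted links end up in the index.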

One more thing that caught my eye was the part that says "Audio & Video: media/ all audio/video files + playlists, including subtitles & metadata with youtube-dl".

Could it be...

Mikumo Conoha (美雲このは, CV: Sumire Uesaka)

$ echo "https://www.youtube.com/watch?v=3F7cYxVFgKo"  | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False  /opt/archivebox/archive
[*] [2019-12-22 16:11:20] Parsing new links from output/sources/stdin-1576998680.txt...
    > Adding 1 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:11:20] Saving main index files...
    √ output/index.json
    √ output/index.html
[?] [2019-12-22 16:11:20] Updating content for 2 pages in archive...

[+] [2019-12-22 16:11:20] "https://www.youtube.com/watch?v=3F7cYxVFgKo"
    https://www.youtube.com/watch?v=3F7cYxVFgKo
    > output/archive/1576998680
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media
        Failed: Failed to download media
            Got youtube-dl response code: 1.
            ERROR: 3F7cYxVFgKo: "token" parameter not in video info for unknown reason; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
        Run to see full output:
            cd /opt/archivebox/output/archive/1576998680;
            youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs --extract-audio --keep-video --ignore-errors --geo-bypass --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://www.youtube.com/watch?v=3F7cYxVFgKo

[*] [2019-12-22 16:11:34] "Yahoo! JAPAN"
    https://www.yahoo.co.jp/
    √ output/archive/1576998613
[√] [2019-12-22 16:11:34] Update of 2 pages complete (14.62 sec)
    - 1 links skipped
    - 0 links updated
    - 1 links had errors
    To view your archive, open: output/index.html
[*] [2019-12-22 16:11:34] Saving main index files...
    √ output/index.json
    √ output/index.html

Does youtube-dl need an upgrade too...?

$ sudo youtube-dl -U
Usage: youtube-dl [OPTIONS] URL [URL...]

youtube-dl: error: youtube-dl's self-update mechanism is disabled on Debian.
Please update youtube-dl using apt(8).
See https://packages.debian.org/sid/youtube-dl for the latest packaged version.

Disabled...?

If you have installed youtube-dl using a package manager like apt-get or yum, use the standard system update mechanism to update. Note that distribution packages are often outdated. As a rule of thumb, youtube-dl releases at least once a month, and often weekly or even daily. Simply go to https://yt-dl.org to find out the current version. Unfortunately, there is nothing we youtube-dl developers can do if your distribution serves a really outdated version. You can (and should) complain to your distribution in their bugtracker or support forum.

GitHub - ytdl-org/youtube-dl: Command-line program to download videos from YouTube.com and other video sites

I see. So the version from the distribution's package repository is slow to get updates.

Remove it, then download the binary directly.

apt-get remove -y youtube-dl
wget https://yt-dl.org/latest/youtube-dl -O /usr/local/bin/youtube-dl
chmod a+x /usr/local/bin/youtube-dl
hash -r
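
After swapping the packaged binary for the downloaded one, it may be worth confirming which file the shell actually resolves. A small sketch; `which_bin` is a hypothetical helper name, and the path in the comment is only what the steps above should produce.

```shell
#!/bin/sh
# Hypothetical helper: print the path the shell resolves for a command,
# or report on stderr that it cannot be found.
which_bin() {
    command -v "$1" || { echo "$1: not found in PATH" >&2; return 1; }
}

# Assumed usage after the steps above:
#   which_bin youtube-dl     # should print /usr/local/bin/youtube-dl
#   youtube-dl --version     # should print a newer version than the apt package had
```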

The update is done, so retry.

$ echo "https://www.youtube.com/watch?v=3F7cYxVFgKo"  | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False  /opt/archivebox/archive
[*] [2019-12-22 16:14:20] Parsing new links from output/sources/stdin-1576998860.txt...
    > Adding 0 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:14:20] Saving main index files...
    √ output/index.json
    √ output/index.html
[?] [2019-12-22 16:14:20] Updating content for 2 pages in archive...

[*] [2019-12-22 16:14:20] "美雲このは(CV:上坂すみれ)「空色Drops」 - YouTube"
    https://www.youtube.com/watch?v=3F7cYxVFgKo
    √ output/archive/1576998680

[*] [2019-12-22 16:14:20] "Yahoo! JAPAN"
    https://www.yahoo.co.jp/
    √ output/archive/1576998613
[√] [2019-12-22 16:14:20] Update of 2 pages complete (0.02 sec)
    - 2 links skipped
    - 0 links updated
    - 0 links had errors
    To view your archive, open: output/index.html
[*] [2019-12-22 16:14:20] Saving main index files...
    √ output/index.json
    √ output/index.html
 - 2 links skipped

Hmm...

$ rm -rf /opt/archivebox/output/*
$ echo "https://www.youtube.com/watch?v=3F7cYxVFgKo"  | sudo -u archivebox SUBMIT_ARCHIVE_DOT_ORG=False  /opt/archivebox/archive
[*] [2019-12-22 16:15:36] Parsing new links from output/sources/stdin-1576998936.txt...
    > Adding 1 new links to index (parsed import as Plain Text)
[*] [2019-12-22 16:15:36] Saving main index files...
    √ output/index.json
    √ output/index.html
[?] [2019-12-22 16:15:36] Updating content for 1 pages in archive...

[+] [2019-12-22 16:15:36] "https://www.youtube.com/watch?v=3F7cYxVFgKo"
    https://www.youtube.com/watch?v=3F7cYxVFgKo
    > output/archive/1576998936
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > media
[√] [2019-12-22 16:15:52] Update of 1 pages complete (16.15 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors
    To view your archive, open: output/index.html
[*] [2019-12-22 16:15:52] Saving main index files...
    √ output/index.json
    √ output/index.html

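Refresh!!

It loads forever... So it does not rewrite links and such (well, of course not).

Click the Media link in the upper right.
The text is garbled, but it looks like the archive worked.

That's all.

Afterword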

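Getting it to work took quite some effort, but the archive quality looks good.
Still, since this is ConoHa, I wanted to use ConoHa's object storage, so I tried various things, but it did not work.
(Mounting it with goofys leads to errors because of constraints around permissions, temporary files, and so on. Creating a temporary directory and running mv as soon as archiving finishes might work?)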
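
The "archive locally, then mv" idea could be sketched roughly as below. This is untested against goofys or ConoHa object storage; `publish_output` and both example paths are hypothetical, and it only covers the first move. Once the destination already contains a directory such as archive/, a plain mv will fail and some merge step would be needed.

```shell
#!/bin/sh
# Hypothetical helper (not part of ArchiveBox): move everything from a local
# staging directory into a destination directory, e.g. a goofys mount point.
# Stops at the first entry that cannot be moved (for example when a directory
# of the same name already exists and is not empty).
publish_output() {
    staging=$1
    dest=$2
    mkdir -p "$dest" || return 1
    for entry in "$staging"/*; do
        [ -e "$entry" ] || continue   # staging directory is empty
        mv "$entry" "$dest"/ || return 1
    done
}

# Assumed usage once an archive run has finished (paths are examples only):
#   publish_output /opt/archivebox/output /mnt/objectstorage/archivebox
```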
