[Until 2020/03/31!] Turn the books Impress is offering for free into PDFs and read them!

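Overview

To help people make the most of the extra time at home caused by the coronavirus measures, Impress is offering 44 books, including titles from the 「できる」 (Dekiru) series, for free until 2020/03/31!

Every one of them looks interesting and I want to read them all, but even going all out I can't finish them in the two weeks that are left, so I'm going to save them and read them at my own pace.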

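Warning

Keep the downloaded images and the PDFs you create strictly for private use. Selling or giving them to other people is a crime.
I wrote this article after checking the Copyright Act and the impress terms of use and concluding that there is no problem.
Follow it at your own risk.

If I explained things too clearly, people without the necessary literacy could misuse this, so I've left out environment setup and package installation and split the explanation up step by step.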

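STEP 1. Get the list of free books and their links!

First, let's collect the freely available books and their links into an object. Go to the special page for the free offer (https://book.impress.co.jp/items/tameshiyomi), open the Console in the developer tools, and run the following code.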

const title = [...document.querySelectorAll('h4 a')].map(i => i.innerHTML);
const num = [...document.querySelectorAll('.module-book-list-item-img div a')].map(i => i.href.split('/')[3]);

if (title.length !== num.length) {
  throw new Error('The number of titles does not match the number of links.');
}

const title_and_num = {};

for (const i in title) {
  title_and_num[num[i]] = title[i];
}

document.body.innerHTML = '<pre>' + JSON.stringify(title_and_num, null, '\t') + '</pre>';

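The page is then rewritten to show the object.

Each numeric key is the last path segment of that book's page, so you can build the URL like this:

url = 'https://impress.tameshiyo.me/' + '9784295003850'

Copy the output and save it in a file named impress_title_num.json.

A file can't be created if the title contains /, so replace every / with -!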
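
If you'd rather not fix the titles by hand, here is a minimal sketch of the same cleanup in Python. The helper names `sanitize_titles` and `viewer_url` and the sample title are made up for illustration; only the base URL and the key 9784295003850 come from the article.

```python
import json

BASE_URL = 'https://impress.tameshiyo.me/'


def sanitize_titles(title_and_num):
    """Return a copy of the mapping with '/' in every title replaced by '-'."""
    return {num: title.replace('/', '-') for num, title in title_and_num.items()}


def viewer_url(num):
    """Build the viewer URL for one book from its numeric key."""
    return BASE_URL + num


# Hypothetical sample entry; the real mapping comes from impress_title_num.json.
sample = {'9784295003850': 'Sample Title A/B'}
cleaned = sanitize_titles(sample)

print(json.dumps(cleaned, ensure_ascii=False, indent='\t'))
print(viewer_url('9784295003850'))
```

To apply it to the real file, load impress_title_num.json with json.load, pass the result through sanitize_titles, and write it back with json.dump.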

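STEP 2. Get the image URLs!

We'll use Selenium from Python to collect the URLs of all the images and write them to a JSON file. To keep the load on the server down, the code that collects the URLs waits at least 5 seconds per request.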

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import chromedriver_binary
from time import sleep
import json

url_num = input('url_num: ')
url = 'https://impress.tameshiyo.me/' + url_num
urls_image = []

options = Options()
options.binary_location = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary'
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
driver.get(url + '?page=1')
sleep(5)

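# Total number of spreads shown by the viewer.
# (find_element_by_* is the Selenium 3 API; newer Selenium 4 releases removed it
# in favor of find_element(By.ID, ...).)
count_of_view = int(driver.find_element_by_id('page_indicator').text.split(' ')[2])
# The real page count is either count_of_view * 2 - 1 or count_of_view * 2 - 2.
count_of_page = count_of_view * 2

for i in range(count_of_page):
  page = i + 1

  # On even pages (the left-hand page), load the next spread (?page=3, 5, 7, ...).
  if page % 2 == 0:
    # The page number in the URL is one higher than the real page number
    # (a nonexistent page before the cover counts as page 1).
    driver.get(url + '?page=' + str(page+1))
    # Wait 5 seconds per request to keep the load on the server down.
    sleep(5)

  # The URL is in the first element on even pages and in the second element on odd pages.
  number_of_image_url_place = page % 2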

  try:
    img_url = ''
    try_counter = 0
    # Poll until src is filled in (get_attribute returns None while the attribute is missing).
    while not img_url:
      try_counter += 1
      sleep(0.5)
      img_url = driver.find_elements_by_class_name('page_img')[number_of_image_url_place].get_attribute('src')
      # After 10 tries with no src, and only on the last two pages, assume the page does not exist.
      if try_counter > 10 and page >= count_of_view * 2 - 2:
        break
    # Warn when the placeholder image URL was picked up instead of a real page.
    if img_url == 'https://impress.tameshiyo.me/img/bookfilter.png':
      print('[warning] img_url is bookfilter in page ' + str(page))
    elif not img_url:
      break
    else:
      urls_image.append(img_url)
    print('finished page ' + str(page))
  except Exception:
    print('can\'t get url on page ' + str(page))

driver.close()
driver.quit()

title_and_num = json.load(open('impress_title_num.json', 'r', encoding='utf-8'))

with open('impress_urllist_' + title_and_num[url_num] + '.json', 'w', encoding='utf-8') as f:
  f.write(json.dumps(urls_image, indent=4))
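
The page bookkeeping in the loop is easy to get wrong, so here is a small sketch that isolates it. `locate_page` is a hypothetical helper, not part of the script; it just restates the rule as I read it from the code above: odd pages sit in the second `page_img` element of the spread at `?page=<page>`, and even pages sit in the first element of the spread at `?page=<page + 1>`.

```python
def locate_page(page):
    """Map a 1-based real page number to (url_page, image_index).

    url_page is the value used for ?page= in the viewer URL and
    image_index is the position inside the list of 'page_img' elements.
    """
    # Even pages start a new spread, whose URL page number is one higher.
    url_page = page + 1 if page % 2 == 0 else page
    # Even pages are the first element (0), odd pages the second (1).
    image_index = page % 2
    return url_page, image_index


for page in range(1, 7):
    print(page, locate_page(page))
```

Pages 2 and 3 map to the same spread (`?page=3`), which is why the script only calls driver.get on even pages.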

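Check

The images don't seem to have any CORS restrictions configured, so let's use an HTML file like the one below to check that the URLs were collected correctly!

Note: Do not put this HTML file on a server that the general public can access. Doing so may constitute copyright infringement.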

<!DOCTYPE html>
<html>
  <body>
    <input type="text" placeholder="Enter the number">
    <button type="button">決定</button>
    <script>
      fetch('impress_title_num.json')
        .then(a => a.json())
        .then(titleAndNum => {
          document.querySelector('button').addEventListener('click', () => {
            const title = titleAndNum[document.querySelector('input').value];
            console.log(title);
            const fileName = `impress_urllist_${title}.json`;
            fetch(fileName)
              .then(b => b.json())
              .then(json => {
                for (const i in json) {
                  const img = document.createElement('img');
                  img.src = json[i];
                  document.body.appendChild(img);
                }
              });
          });
        });
    </script>
  </body>
</html>

On a Mac, the easiest way to check is to start a local server with PHP.

$ php -S localhost:8080

[Image: the pages as displayed by the HTML check page]
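
Besides eyeballing the images, a quick script can flag obvious problems in the URL list. This is only a sketch: `check_url_list` is a hypothetical helper, the example.com URLs are made up, and the rule that every entry should start with https:// is my assumption. The bookfilter.png placeholder URL is the one the STEP 2 script warns about.

```python
BOOKFILTER_URL = 'https://impress.tameshiyo.me/img/bookfilter.png'


def check_url_list(urls):
    """Return a list of (position, problem) tuples for suspicious entries."""
    problems = []
    seen = set()
    for index, url in enumerate(urls, start=1):
        # Assumption: every real page image is served over https.
        if not isinstance(url, str) or not url.startswith('https://'):
            problems.append((index, 'not an https URL'))
            continue
        if url == BOOKFILTER_URL:
            problems.append((index, 'placeholder image'))
        elif url in seen:
            problems.append((index, 'duplicate'))
        seen.add(url)
    return problems


# Made-up sample data, not real image URLs.
sample_urls = [
    'https://example.com/p1.jpg',
    BOOKFILTER_URL,
    'https://example.com/p1.jpg',
    None,
]
for position, problem in check_url_list(sample_urls):
    print(position, problem)
```

An empty result doesn't prove the list is complete, so it complements the visual check rather than replacing it.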

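STEP 3. Save the images!

Saving images might make you wonder whether it's OK copyright-wise, but the images you normally view in a browser are also saved locally for a while, so the only difference is where they are stored.
Here too, the code waits 3 seconds after saving each image.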

import urllib.request, urllib.error
import json, os
from time import sleep

title_and_num = json.load(open('impress_title_num.json', 'r', encoding='utf-8'))
num = input('input num : ')
title = title_and_num[num]
url_list = json.load(open('impress_urllist_' + title + '.json', 'r', encoding='utf-8'))
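# exist_ok=True lets the script be rerun without failing on an existing directory.
os.makedirs('impress_' + title, exist_ok=True)

for i, url in enumerate(url_list):
  urllib.request.urlretrieve(url, './impress_{0}/impress_{1}_{2}.jpg'.format(title, num, ('00' + str(i+1))[-3:]))
  print('[done] impress_{0}_{1}.jpg'.format(num, ('00' + str(i+1))[-3:]))
  # Wait 3 seconds between images to keep the load on the server down.
  sleep(3)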
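
Here is a sketch of the same "save, then wait" loop, written so it can be tried without touching the network. `image_path` and `download_all` are hypothetical helpers, and `fetch` and `pause` are parameters I added for illustration; the 3-second interval and the file name pattern follow the article.

```python
from time import sleep


def image_path(title, num, index):
    """Build the output path with a zero-padded, 3-digit page index."""
    return './impress_{0}/impress_{1}_{2:03d}.jpg'.format(title, num, index)


def download_all(url_list, title, num, fetch, delay=3, pause=sleep):
    """Call fetch(url, path) for each URL, pausing between downloads."""
    saved = []
    for index, url in enumerate(url_list, start=1):
        path = image_path(title, num, index)
        fetch(url, path)
        saved.append(path)
        # No need to wait after the last image.
        if index < len(url_list):
            pause(delay)
    return saved


# Dry run with stand-ins, so nothing is downloaded and nothing sleeps.
fetched = []
waits = []
paths = download_all(['u1', 'u2'], 'Book', '123',
                     fetch=lambda url, path: fetched.append((url, path)),
                     pause=waits.append)
print(paths)
```

For a real run you would pass `urllib.request.urlretrieve` as `fetch` and leave `pause` at its default. One small difference from the original: `{2:03d}` keeps all digits for index 1000 and above, while `('00' + str(i+1))[-3:]` would cut them down to the last three.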

STEP 4. Convert to PDF!

Now we stitch the downloaded images together into a PDF.

import img2pdf
from pathlib import Path
import json

num = input('input num : ')
title_and_num = json.load(open('impress_title_num.json', 'r', encoding='utf-8'))
title = title_and_num[num]
path_import = Path('impress_' + title)
path_output = Path('impress_' + title + '/' + title + '.pdf')

lists = list(path_import.glob('**/*'))
with open(path_output, 'wb') as f:
  f.write(img2pdf.convert([str(i) for i in sorted(lists) if i.match('*.jpg')]))
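
The script relies on `sorted` to put the pages in order, and that only works because the file names are zero-padded. A tiny sketch of the difference, using made-up file names that follow the pattern from STEP 3:

```python
from pathlib import Path

# Same page numbers, with and without zero padding.
padded = [Path('impress_123_{:03d}.jpg'.format(n)) for n in (10, 2, 1)]
unpadded = [Path('impress_123_{}.jpg'.format(n)) for n in (10, 2, 1)]

padded_sorted = [p.name for p in sorted(padded)]
unpadded_sorted = [p.name for p in sorted(unpadded)]

print(padded_sorted)    # page order: 001, 002, 010
print(unpadded_sorted)  # string order: 1, 10, 2
```

So if you change the naming in STEP 3, keep the padding (or sort by the numeric part), otherwise the PDF pages come out shuffled.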

It's the same old familiar PDF viewer! Now I can stay holed up at home without watching the clock!

Thank you, Impress!