Scrapy Notes

Posted on January 15, 2017

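This post records what I learned about Scrapy (https://scrapy.org/), the Python web crawler I used for the article "Webページのスクレイピングと分析・可視化" (scraping, analyzing, and visualizing web pages: http://qiita.com/shiumachi/items/82a13f29933539c69e32).
These are only notes, and their accuracy is not guaranteed. Always consult the official documentation.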

Sample code

Sample code 1

import scrapy

class QiitaCalendarSpider(scrapy.Spider):
    name = "qiita_calendar"
    allowed_domains = ["qiita.com"]
    start_urls = ["http://qiita.com/advent-calendar/2016/calendars"]

    custom_settings = {
        "DOWNLOAD_DELAY": 1,
    }

    def parse(self, response):
        # Each table row is one advent calendar.
        for row in response.css('table.table.adventCalendarList tbody tr'):
            calendar_title = row.css('td.adventCalendarList_calendarTitle a::text').extract_first()
            calendar_url = row.css('td.adventCalendarList_calendarTitle a::attr(href)').extract_first()
            calendar_attendees = row.css(
                'td.adventCalendarList_progress span.adventCalendarList_recruitmentCount::text').extract_first()

            yield {
                'calendar_title': calendar_title,
                'calendar_url': response.urljoin(calendar_url),
                'calendar_attendees': calendar_attendees
            }

        # Follow the pagination links once per page, not once per row.
        for page in response.css('li.hidden-xs a'):
            next_page = page.css('::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

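Sample code 2

    # A method of the spider class. QiitaCalendarLoader is the author's own
    # helper (not shown here, not part of Scrapy).
    def start_requests(self):
        qiita_calendars = QiitaCalendarLoader()
        for url in qiita_calendars.urls():
            yield scrapy.Request(url=url, callback=self.parse)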

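Coding

Set DOWNLOAD_DELAY first

DOWNLOAD_DELAY sets, in seconds, how long Scrapy waits between consecutive requests to the same site. The default is 0, so a spider that leaves it unset can easily end up mounting what amounts to a DoS attack on the target site. Always set it before anything else.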
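
To make the effect of the delay concrete, here is a minimal sketch in plain Python. It is not Scrapy's own code. It assumes Scrapy's default RANDOMIZE_DOWNLOAD_DELAY = True, under which the actual wait is documented as a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The helper names randomized_delay and estimated_crawl_seconds are made up for illustration.

```python
import random


def randomized_delay(download_delay, rng=random):
    """Pick one wait time the way RANDOMIZE_DOWNLOAD_DELAY is documented:
    a random value between 0.5x and 1.5x of DOWNLOAD_DELAY."""
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)


def estimated_crawl_seconds(page_count, download_delay):
    """Rough estimate for one site: one average delay between consecutive pages."""
    return max(page_count - 1, 0) * download_delay


delays = [randomized_delay(1) for _ in range(5)]
print(delays)                           # five values, each within [0.5, 1.5]
print(estimated_crawl_seconds(100, 1))  # 99
```

With DOWNLOAD_DELAY = 1, a 100-page crawl of a single site takes on the order of a minute and a half, which is a small price for not hammering the server.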

allowed_domains

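Not required, but recommended. The spider only crawls domains in this list (subdomains included); requests to any other domain are dropped.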
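
The rule can be pictured with a small sketch. This is a simplified illustration in plain Python, not Scrapy's actual offsite filtering code, and is_allowed is a hypothetical helper name. It assumes the entries in the list are lowercase bare domain names.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["qiita.com"]
print(is_allowed("http://qiita.com/advent-calendar/2016/calendars", allowed))  # True
print(is_allowed("http://blog.qiita.com/", allowed))                            # True
print(is_allowed("http://example.com/", allowed))                               # False
print(is_allowed("http://notqiita.com/", allowed))                              # False
```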

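Handling responses

Of the response object passed to parse(self, response), the three members used most are text, css(), and urljoin().
response.text holds the HTML as a string, so it can be passed to BeautifulSoup as-is.
response.css() takes a CSS selector and returns a list of Selector instances for the matching parts of the HTML. Call extract() to get the data they contain as strings. On the list itself, extract_first() extracts only the first element, and returns None when nothing matched.
response.urljoin(path) combines path with the URL of the page being crawled and returns an absolute URL.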

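start_urls and start_requests()

Besides listing them in start_urls, the initial URLs to crawl can also be yielded from start_requests() (sample code 2).
In this example, a JSON file containing the calendar URLs collected from the calendar list page is loaded (QiitaCalendarLoader), and each of those URLs is then crawled.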
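
QiitaCalendarLoader itself is not shown in these notes, so the following is only a guess at its shape, not the original class. It assumes the items from sample code 1 were saved as a JSON array, for example with scrapy crawl qiita_calendar -o calendars.json. The file name, the deduplication, and the sample records are all assumptions.

```python
import json
import os
import tempfile


class QiitaCalendarLoader:
    """Hypothetical loader: reads scraped items from a JSON file and exposes their URLs."""

    def __init__(self, path="calendars.json"):
        with open(path, encoding="utf-8") as f:
            self.items = json.load(f)  # expected: a list of item dicts

    def urls(self):
        """Yield each non-empty calendar_url once, in file order."""
        seen = set()
        for item in self.items:
            url = item.get("calendar_url")
            if url and url not in seen:
                seen.add(url)
                yield url


# Made-up sample data in the shape yielded by sample code 1.
sample = [
    {"calendar_title": "A", "calendar_url": "http://qiita.com/advent-calendar/2016/a"},
    {"calendar_title": "B", "calendar_url": "http://qiita.com/advent-calendar/2016/b"},
    {"calendar_title": "A again", "calendar_url": "http://qiita.com/advent-calendar/2016/a"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "calendars.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    loaded_urls = list(QiitaCalendarLoader(path).urls())

print(loaded_urls)
```

Keeping the file handling inside the loader means start_requests() only has to iterate over urls() and yield one scrapy.Request per URL.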

Commands

Creating a project

Run scrapy startproject <project name>.

Reference: https://doc.scrapy.org/en/1.3/topics/commands.html#startproject

Generating a spider

scrapy genspider <spider name> <target domain> generates one, but it does very little, so there is no real need to use it.

Reference: https://doc.scrapy.org/en/1.3/topics/commands.html#genspider

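Crawling

scrapy crawl <spider name>

Use the crawl command when the spider belongs to a project. (A standalone spider file outside a project is run with scrapy runspider <file> instead.)

Reference: https://doc.scrapy.org/en/1.3/topics/commands.html#crawl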

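Interactive shell

scrapy shell <URL>

Starts an interactive shell (IPython, when it is installed). The URL can be omitted.
When a URL is given, the response is already stored in the response variable.
Help is printed when the shell starts; the two functions most worth remembering are:

shelp()

Shows the help.

fetch(url)

Fetches another URL. The result is stored in response.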

settings.py

Creating a project generates a configuration file named settings.py. Settings shared by all spiders, such as DOWNLOAD_DELAY, are better placed here.
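
As a sketch, a project-wide settings.py might contain something like the excerpt below. The setting names are real Scrapy settings, but the values and the project name are only illustrative assumptions, not the configuration used for the original crawl.

```python
# settings.py (excerpt): values here apply to every spider in the project.
BOT_NAME = "qiita_crawler"  # hypothetical project name

DOWNLOAD_DELAY = 1                  # seconds between requests to the same site
ROBOTSTXT_OBEY = True               # respect the site's robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # at most one request in flight per domain
```

A spider can still override any of these for itself through custom_settings, as sample code 1 does for DOWNLOAD_DELAY.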

Author: shiumachi
