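[Python] Horse Racing Prediction with Machine Learning [Data Collection]

Reader's question

Predicting horse races by hand only goes so far, so I would like to use machine learning.

How do I collect the data needed for training?

This article answers that question.

The previous article covered environment setup. This one explains how to collect the data.

If you have not read the previous article yet, set up your environment by following it before continuing.

Implementation for this article

The implementation for this article is published on GitHub.

Some parts are omitted here, so follow the link below if you want the full picture.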

https://github.com/yuruto-free/machine-learning-keiba/tree/v0.2.1

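Notes

Fetching data from the web puts load on the remote server, so take care with how often and how fast you access it.

This article explains how to scrape, but it does not endorse scraping; use the techniques at your own risk.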

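Overview of data collection

We collect the following data and store each kind in its own directory.

Target	Description	Directory	File name
Race results	Horse information and finishing order for each race day	data/html/race	{race_id}.bin
Horse past performance	Past results for each horse	data/html/horse	{horse_id}.bin
Pedigree	Pedigree information for each horse	data/html/ped	{horse_id}.bin

Overview of the data to collect

Here, race_id and horse_id are identifiers uniquely assigned to each race and each horse.

They work much like a student ID number at a school.

Because these IDs let us link records across datasets, we use them from here on to identify data.

After collection, the data looks like the following.

[Figure: the data after collection]

The sections below show an example of each kind of data.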

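Race results

For race results, we collect data like the following.

[Figure: example race results page]

It breaks down as follows.

[Figure: breakdown of the race results page]

After collecting the pages, we parse each saved page and extract the individual datasets.

The extracted results are saved under the following file names.

Target	Path
Race information	data/raw/race_info.pkl
Race results	data/raw/results.pkl
Payouts	data/raw/payback.pkl

Where race result data is saved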

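Horse past performance

For each horse's past performance, we collect data like the following.

[Figure: example past performance table]

This data is saved under the following file name.

Target	Path
Horse past performance	data/raw/horse_results.pkl

Where horse past performance data is saved

Pedigree

For pedigree, we collect data like the following.

[Figure: example pedigree table]

This data is saved under the following file name.

Target	Path
Pedigree	data/raw/ped_results.pkl

Where pedigree data is saved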

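How the data is collected

Data is collected in the following steps.

1. Get the race days within the specified period.
2. Fetch the pages for every race held on each race day (up to 12 races per venue). Each race is assigned the race_id described above, and the race result page is saved under that ID.
3. Parse each saved page to extract the information needed for analysis.
4. Do the same for each horse's past performance and pedigree.
5. Save the parsed results to the designated paths.

Preparing for data collection

The rest of this article walks through the Python program that collects the data.

With the data collection program added, the directory structure looks like this.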


.
|-- Dockerfile
|-- docker-compose.yml
|-- entrypoint.sh
`-- workspace
    `-- keiba
        |-- data
        |   |-- html
        |   |   |-- horse
        |   |   |-- ped
        |   |   `-- race
        |   |-- master
        |   `-- raw
        |-- models
        |-- modules
        |   |-- __init__.py # [added] module loading
        |   |-- Constants.py # [added] constants
        |   `-- Collection.py # [added] scraping
        `-- scrape.ipynb # [added] main processing

The URLs to fetch from and the paths to save to are fixed, so we define them as constants.

Define the constants as follows and save them in modules/Constants.py.

from dataclasses import dataclass
import os

# Module-level variables (local directories)
_BASE_DIR = os.path.abspath('./')
_DATA_DIR = os.path.join(_BASE_DIR, 'data')
_RAW_DIR = os.path.join(_DATA_DIR, 'raw')
_HTML_DIR = os.path.join(_DATA_DIR, 'html')
_MASTER_DIR = os.path.join(_DATA_DIR, 'master')

@dataclass(frozen=True)
class LocalPaths:
    # Race results
    RAW_RESULTS_PATH = os.path.join(_RAW_DIR, 'results.pkl')
    RAW_RACEINFO_PATH = os.path.join(_RAW_DIR, 'race_info.pkl')
    RAW_PAYBACK_PATH = os.path.join(_RAW_DIR, 'payback.pkl')
    # Horse past performance
    RAW_HORSERESULTS_PATH = os.path.join(_RAW_DIR, 'horse_results.pkl')
    # Pedigree
    RAW_PEDS_PATH = os.path.join(_RAW_DIR, 'ped_results.pkl')

@dataclass(frozen=True)
class SystemPaths:
    HTML_RACE_DIR = os.path.join(_HTML_DIR, 'race')
    HTML_HORSE_DIR = os.path.join(_HTML_DIR, 'horse')
    HTML_PED_DIR = os.path.join(_HTML_DIR, 'ped')
    HORSE_RESULTS_PATH = os.path.join(_MASTER_DIR, 'horse_results_updated_at.pkl')

# Module-level variables (remote URLs)
_DB_DOMAIN = 'https://db.netkeiba.com'
_TOP_URL = 'https://race.netkeiba.com/top'

@dataclass(frozen=True)
class UrlPaths:
    RACE_URL = f'{_DB_DOMAIN}/race'
    HORSE_URL = f'{_DB_DOMAIN}/horse'
    PED_URL = f'{_DB_DOMAIN}/horse/ped'
    CALENDAR_URL = f'{_TOP_URL}/calendar.html'
    RACE_LIST_URL = f'{_TOP_URL}/race_list_sub.html'
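
The download code later in this article opens files directly inside these directories, so they have to exist before the first run. As a minimal sketch (the helper name ensure_dirs and the temporary base directory are assumptions made for this example, not part of the original repository), they could be created like this:

```python
import os
import tempfile


def ensure_dirs(base_dir):
    """Create the data directories under base_dir if missing (hypothetical helper)."""
    subdirs = [
        os.path.join('data', 'html', 'race'),
        os.path.join('data', 'html', 'horse'),
        os.path.join('data', 'html', 'ped'),
        os.path.join('data', 'master'),
        os.path.join('data', 'raw'),
    ]
    created = []
    for subdir in subdirs:
        path = os.path.join(base_dir, subdir)
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# A temporary directory stands in for the project root in this example
demo_root = tempfile.mkdtemp()
demo_dirs = ensure_dirs(demo_root)
print(len(demo_dirs))
```

In the real project the base directory would be the keiba workspace rather than a temporary directory.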

In addition, so that only what is needed gets loaded when importing the package from JupyterLab, create modules/__init__.py with the following contents.

from .Constants import LocalPaths
from .Collection import Collection

__all__ = [
    'LocalPaths',
    'Collection',
]

This keeps imports short, as shown below, and also limits which classes are exposed to users of the package.

from modules import LocalPaths, Collection
# Without the re-exports in modules/__init__.py, each class has to be imported from its own submodule:
# from modules.Constants import LocalPaths
# from modules.Collection import Collection
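
To see what __all__ controls: it is a list of names, given as strings, that a star-import exposes. The sketch below builds a throwaway module in memory (the name demo_pkg and its attributes are made up for this example) so the effect can be observed without touching the real modules package.

```python
import sys
import types

# Build a throwaway in-memory module (hypothetical name: demo_pkg)
demo_pkg = types.ModuleType('demo_pkg')
demo_pkg.LocalPaths = type('LocalPaths', (), {})
demo_pkg.SystemPaths = type('SystemPaths', (), {})
# __all__ must contain strings, not the objects themselves
demo_pkg.__all__ = ['LocalPaths']
sys.modules['demo_pkg'] = demo_pkg

# Run a star-import in a fresh namespace and see which names arrive
namespace = {}
exec('from demo_pkg import *', namespace)
exported = sorted(name for name in namespace if not name.startswith('__'))
print(exported)
```

Only LocalPaths arrives in the namespace. Names left out of __all__ are hidden from star-imports only; they can still be imported explicitly.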

Collecting the data

Now let's implement the program that actually collects the data.

scrape.ipynb

First, here is scrape.ipynb, which expresses the overall flow of processing.

%load_ext autoreload
%autoreload 2

import pandas as pd
from dataclasses import dataclass
from modules import LocalPaths, Collection
# Maximum number of columns and rows to display
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 30)

# ==============
# Scraping
# ==============
# Execution parameters
@dataclass(frozen=True)
class _RaceParams:
    EXECUTION = True
    FROM = '2022-01'
    TO = '2023-01'

@dataclass(frozen=True)
class ExecParams:
    # Race-day processing
    RACE_INFO = _RaceParams()
    # Horse past performance processing
    HORSE_EXECUTION = True
    # Pedigree processing
    PED_EXECUTION = True

# Create the collector instance
collection = Collection()

if ExecParams.RACE_INFO.EXECUTION:
    # Get the race days
    event_dates = collection.get_event_date(from_=ExecParams.RACE_INFO.FROM, to_=ExecParams.RACE_INFO.TO)
    # Get the race IDs
    race_ids = collection.get_race_ids(event_dates)
    # Download the race result HTML
    html_filepaths = collection.scrape_html_race(race_ids)

    if html_filepaths:
        # Get the race results table
        race_results = collection.get_rawdata_results(html_filepaths)
        # Get the race information table
        all_race_info = collection.get_rawdata_raceinfo(html_filepaths)
        # Get the payout table
        paybacks = collection.get_rawdata_payback(html_filepaths)
        # Update the saved tables
        collection.update_rawdata(LocalPaths.RAW_RESULTS_PATH, race_results)
        collection.update_rawdata(LocalPaths.RAW_RACEINFO_PATH, all_race_info)
        collection.update_rawdata(LocalPaths.RAW_PAYBACK_PATH, paybacks)
if ExecParams.HORSE_EXECUTION:
    race_results = collection.load_rawdata(LocalPaths.RAW_RESULTS_PATH)
    horse_ids = race_results['horse_id'].unique()
    html_file_horses = collection.scrape_html_horse_with_master(horse_ids)

    if html_file_horses:
        # Get the horse past performance table
        horse_results = collection.get_rawdata_horse(html_file_horses)
        collection.update_rawdata(LocalPaths.RAW_HORSERESULTS_PATH, horse_results)
if ExecParams.PED_EXECUTION:
    race_results = collection.load_rawdata(LocalPaths.RAW_RESULTS_PATH)
    horse_ids = race_results['horse_id'].unique()
    html_file_peds = collection.scrape_html_ped(horse_ids)

    if html_file_peds:
        # Get the pedigree information
        ped_results = collection.get_rawdata_ped(html_file_peds)
        collection.update_rawdata(LocalPaths.RAW_PEDS_PATH, ped_results)

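The user configures the following five settings.

Class	Item	Description
_RaceParams	EXECUTION	Whether to collect race information (True: collect, False: skip)
_RaceParams	FROM	First month to collect race information for (format: yyyy-mm)
_RaceParams	TO	Last month to collect race information for (format: yyyy-mm)
ExecParams	HORSE_EXECUTION	Whether to collect horse past performance (True: collect, False: skip)
ExecParams	PED_EXECUTION	Whether to collect pedigree information (True: collect, False: skip)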

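Next, we look at modules/Collection.py, the main program that actually collects the data.

modules/Collection.py

modules/Collection.py is organized into the following parts.

Part	Description
Initialization	Define the interval to wait between web page requests.
Getting the list of race days	List the days on which races are held between the start month and the end month.
Getting the list of race IDs	Get the race IDs from each race date.
Fetching and saving web pages	Fetch the page at the target URL and save it in the designated directory.
Parsing the saved web pages	Parse each page and extract the required information.
Saving the parsed results	Save the parsed results to the designated path.

Structure of modules/Collection.py

The following sections describe each part of the implementation.

Initialization

Including the imports of the related modules, the initialization looks like this.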

import numpy as np
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta
from monthdelta import monthmod
import time
import requests
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
import re
import os
import sys
from .Constants import UrlPaths, SystemPaths

class Collection:
    """
    Collection : data collection class

    Attributes
    ----------
    __class_name : str
        Class name
    __wait_time : float
        Base wait time between requests, in seconds
    """
    def __init__(self, wait_time=1.2):
        """
        Initialization
        """
        # Variables used by the methods of this class
        self.__class_name = self.__class__.__name__
        self.__wait_time = wait_time
        np.random.seed(int(time.time()))

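Web pages are fetched with the requests library, and parsing uses pandas' read_html together with BeautifulSoup from bs4.

Fetching data back to back puts load on the server, so a wait interval is inserted between requests. With the default wait_time of 1.2 plus a random value from np.random.rand(), which lies in [0, 1), the interval is between 1.2 and 2.2 seconds.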
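
As a small illustration of that interval, the sketch below reproduces the same "base wait plus random extra" calculation. It swaps in the standard library's random module for np.random so it runs without NumPy; the function name next_wait is an assumption made for this example.

```python
import random


def next_wait(base_wait=1.2, rng=random):
    """Return the base wait plus a random extra in [0, 1) seconds."""
    return base_wait + rng.random()


# Draw samples with a seeded generator so the run is repeatable
sample_rng = random.Random(0)
samples = [next_wait(1.2, sample_rng) for _ in range(1000)]
print(min(samples), max(samples))
```

Passing the result to time.sleep() gives the same pacing as the class does. The random part keeps requests from arriving at a perfectly regular rhythm.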

Getting the list of race days

We get the list of race days by checking, month by month from the start month to the end month, which days have races.

    def get_event_date(self, from_, to_):
        """
        Get the list of race days
        Parameters
        ----------
        from_ : str
            First month to collect (format: yyyy-mm)
        to_ : str
            Last month to collect (format: yyyy-mm)
        Returns
        -------
        race_dates : list
            List of race days (yyyymmdd strings)
        """
        start_date = datetime.strptime(from_, '%Y-%m')
        end_date = datetime.strptime(to_, '%Y-%m')
        # Reject an invalid search range
        if end_date < start_date:
            raise Exception(f'Error({self.__class_name}::get_event_date): Invalid argument')

        # Number of months between the two dates
        diff, _ = monthmod(start_date, end_date)
        race_dates = []
        for idx in tqdm(range(diff.months + 1)):
            # Compute the month to fetch
            target = start_date + relativedelta(months=idx)
            year = target.year
            month = str(target.month).zfill(2)
            # Build the URL and fetch the calendar
            url = f'{UrlPaths.CALENDAR_URL}?year={year}&month={month}'
            df = pd.read_html(url)[0]
            # Get the list of calendar cells
            days = [str(val) for val in sum(df.values.tolist(), [])]
            # Keep only the days that have races and build yyyymmdd strings
            events = [f'{year}{month}{val.split()[0].zfill(2)}' for val in days if len(val.split()) > 1]
            race_dates += events
            time.sleep(self.__wait_time + np.random.rand())

        return race_dates

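Here, monthmod is a function that computes the difference between the two given dates. It returns that difference as a whole number of months plus a leftover timedelta, and only the month count is used.

Since we check one month at a time, stepping forward one month at a time from the start month gives the desired behavior.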
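
The month stepping can also be tried in isolation. The following is a minimal sketch that uses only the standard library instead of monthmod and relativedelta; the function name iter_months is an assumption made for this example, not something from the repository.

```python
from datetime import datetime


def iter_months(from_, to_):
    """Yield (year, zero-padded month) pairs from from_ to to_, inclusive (format: yyyy-mm)."""
    start = datetime.strptime(from_, '%Y-%m')
    end = datetime.strptime(to_, '%Y-%m')
    if end < start:
        raise ValueError('to_ must not be earlier than from_')
    # Count months since year 0 so the difference is a plain subtraction
    first = start.year * 12 + (start.month - 1)
    last = end.year * 12 + (end.month - 1)
    for index in range(first, last + 1):
        year, month = divmod(index, 12)
        yield year, str(month + 1).zfill(2)


months = list(iter_months('2022-01', '2023-01'))
print(len(months), months[0], months[-1])
```

With the FROM and TO values used in scrape.ipynb ('2022-01' and '2023-01'), this covers 13 months because both ends are included.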

The target URL (CALENDAR_URL: https://race.netkeiba.com/top/calendar.html) accepts a query string, so appending the computed year and month as shown below gives access to the page we want.

url = f'{UrlPaths.CALENDAR_URL}?year={year}&month={month}'
# For January 2022, this becomes:
# https://race.netkeiba.com/top/calendar.html?year=2022&month=01

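For example, specifying January 2022 returns a page like the following.

[Figure: race calendar page for January 2022]

The calendar is built with an HTML table tag, so pandas' read_html can read it.

A cell for a race day contains the venue names in addition to the day number, so any cell whose text splits into two or more whitespace-separated parts is treated as a race day and picked up as a candidate date.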

# Get the list of cells (the table is 2-D, so sum() flattens it into a 1-D list)
days = [str(val) for val in sum(df.values.tolist(), [])]
# Keep only the days that have races and build yyyymmdd strings
events = [f'{year}{month}{val.split()[0].zfill(2)}' for val in days if len(val.split()) > 1]
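
To make the filtering rule concrete, here is a small sketch that applies the same split-based test to hand-written cell texts. The sample cells and the function name extract_race_days are assumptions made for this example; real cell contents depend on the page.

```python
def extract_race_days(year, month, cells):
    """Return yyyymmdd strings for cells that hold a day number plus venue names."""
    events = []
    for cell in cells:
        parts = str(cell).split()
        # A plain day ('3') or an empty cell ('nan') has at most one part
        if len(parts) > 1:
            events.append(f'{year}{month}{parts[0].zfill(2)}')
    return events


# Made-up cell texts, shaped like a flattened calendar table
sample_cells = ['nan', '3', '4', '5 中山 中京', '6', '8 中山 中京', '9 中山 中京 小倉']
race_days = extract_race_days(2022, '01', sample_cells)
print(race_days)
```

Only the three cells that carry venue names survive, and the day number is zero-padded to two digits.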

Getting the list of race IDs

The race IDs can be found on the race-day detail page that is reachable from the calendar above.

    def get_race_ids(self, race_dates):
        """
        Get the list of race IDs

        Parameters
        ----------
        race_dates : list
            List of race days

        Returns
        -------
        race_ids : list
            List of race IDs
        """
        race_ids = []
        
        for date in tqdm(race_dates):
            url = f'{UrlPaths.RACE_LIST_URL}?kaisai_date={date}'
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')
            # Get the list of races for the day
            targets = soup.find_all('dd', attrs={'class': 'RaceList_Data'})
            for target in targets:
                items = target.find_all('span', attrs={'class': 'MyRace_List_Item'})
                out = [re.sub(r'\D', '', val.get('id')) for val in items]
                race_ids += out
            time.sleep(self.__wait_time + np.random.rand())
                
        return race_ids

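However, that page is updated asynchronously via Ajax, so fetching the page itself from Python does not yield the race IDs.

A little investigation showed that the content comes from https://race.netkeiba.com/top/race_list_sub.html, so from Python we access this URL instead.

Looking at the page source, elements with the class MyRace_List_Item appear inside elements with the class RaceList_Data.

The race ID is embedded in the id attribute of each MyRace_List_Item element, so that is the value to extract.

With that known, BeautifulSoup's find_all method plus a regular expression is enough to get the race IDs.

The relevant code is as follows.

# Get the list of races for the day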
targets = soup.find_all('dd', attrs={'class': 'RaceList_Data'})
for target in targets:
    items = target.find_all('span', attrs={'class': 'MyRace_List_Item'})
    out = [re.sub(r'\D', '', val.get('id')) for val in items]
    race_ids += out
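
The regular expression step can be checked on its own. In the sketch below, the id values are made up and merely assumed to embed the race ID among other characters; the helper name extract_race_id and its handling of a missing attribute are also assumptions made for this example.

```python
import re


def extract_race_id(id_value):
    """Strip every non-digit character; return None when the attribute is missing."""
    if id_value is None:
        return None
    return re.sub(r'\D', '', id_value)


# Made-up id attribute values; None stands for an element without an id
sample_ids = ['myrace_202201010101', 'myrace_202201010102', None]
race_ids = [extract_race_id(value) for value in sample_ids]
print(race_ids)
```

The pattern \D matches any non-digit, so whatever prefix the id carries is removed and only the digits remain. The None check matters because re.sub raises a TypeError when it is given None.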

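Fetching and saving web pages

In our case, fetching a web page differs only in the URL and the save location while the actual processing is nearly identical, so I defined an internal method for it.

The implementation is as follows.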

    def __scrape_html(self, ids, base_url, output_dir, isSkip=True):
        """
        Fetch data from the web

        Parameters
        ----------
        ids : list
            IDs to fetch
        base_url : str
            Base URL to fetch from
            Used as: f'{base_url}/{target_id}'
        output_dir : str
            Output directory
            Used as: os.path.join(output_dir, f'{target_id}.bin')
        isSkip : boolean
            What to do when the file already exists
            True:  skip it
            False: fetch it from the web again

        Returns
        -------
        html_filepaths : list
            Paths of the HTML files saved in this call
        """
        html_filepaths = []

        for target_id in tqdm(ids):
            html_filename = os.path.join(output_dir, f'{target_id}.bin')
            # Skip when the file already exists and skipping is enabled
            if os.path.exists(html_filename) and isSkip:
                continue
            else:
                # =======================
                # Fetch data from the web
                # =======================
                time.sleep(self.__wait_time + np.random.rand())
                # Request the page
                url = f'{base_url}/{target_id}'
                response = requests.get(url)
                # Save the raw bytes to the HTML file
                with open(html_filename, 'wb') as fout:
                    fout.write(response.content)
                html_filepaths += [html_filename]

        return html_filepaths

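This project runs in Docker, and character encodings differ between the Windows and Linux environments.

To avoid having to account for such environment differences, the data is handled in binary form.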
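
The benefit of binary mode can be shown with a short round trip. This is a sketch under assumptions: the sample markup is made up, and EUC-JP is used only as an example of a non-UTF-8 encoding, not as a statement about what the site serves.

```python
import os
import tempfile

# Sample bytes standing in for a downloaded page (EUC-JP is an assumed encoding)
original_bytes = '<html><body>中山 芝1600m</body></html>'.encode('euc_jp')

path = os.path.join(tempfile.mkdtemp(), 'sample.bin')
# 'wb' stores the bytes exactly as received, with no newline or encoding translation
with open(path, 'wb') as fout:
    fout.write(original_bytes)
# 'rb' returns the same bytes on any platform
with open(path, 'rb') as fin:
    restored_bytes = fin.read()

# Decoding is a separate, explicit step applied only when text is needed
text = restored_bytes.decode('euc_jp')
print(restored_bytes == original_bytes, text)
```

Because nothing is decoded at save time, the platform's default encoding never comes into play; the parser decides how to interpret the bytes later.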

As mentioned above, with this method the caller only has to change the arguments.

Showing every caller would be redundant, so only the race results case is shown below.

    def scrape_html_race(self, race_ids, isSkip=True):
        """
        Fetch the race result HTML files

        Parameters
        ----------
        race_ids : list
            List of race IDs

        Returns
        -------
        html_filepaths : list
            Paths of the race result HTML files
        """
        html_filepaths = self.__scrape_html(
            race_ids, UrlPaths.RACE_URL, SystemPaths.HTML_RACE_DIR, isSkip=isSkip
        )

        return html_filepaths

The other methods (scrape_html_ped, scrape_html_horse_with_master) work the same way.

See the GitHub link at the beginning of the article for details.
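
The skip-or-fetch flow shared by these methods can be tried without any network access. The sketch below is not the repository's code: download_missing and fake_fetch are hypothetical names, and the fetch step is passed in as a function so that a stand-in can replace requests.get.

```python
import os
import tempfile


def download_missing(ids, fetch, output_dir, is_skip=True):
    """Save fetch(target_id) as <target_id>.bin and return only the paths written in this call."""
    written = []
    for target_id in ids:
        filename = os.path.join(output_dir, f'{target_id}.bin')
        # Skip files that already exist unless a re-download is requested
        if is_skip and os.path.exists(filename):
            continue
        with open(filename, 'wb') as fout:
            fout.write(fetch(target_id))
        written.append(filename)
    return written


def fake_fetch(target_id):
    """Stand-in for requests.get(url).content; performs no network access."""
    return f'<html>{target_id}</html>'.encode('utf-8')


out_dir = tempfile.mkdtemp()
first_run = download_missing(['a1', 'b2'], fake_fetch, out_dir)
second_run = download_missing(['a1', 'b2', 'c3'], fake_fetch, out_dir)
print(len(first_run), len(second_run))
```

The second call writes only the one new file. As in the original method, files that were skipped are not included in the returned list, so a rerun parses only the pages fetched in that run.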

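Parsing the saved web pages

The last topic of this article is parsing the saved pages.

The code also prepares for the preprocessing in later articles, so I hope it is a useful reference.

First, define a method that reads a saved HTML file, as shown below.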

    def __read_html_file(self, html_filename):
        """
        Read an HTML file

        Parameters
        ----------
        html_filename : str
            HTML file name

        Returns
        -------
        html : bytes
            Contents of the HTML file (raw bytes)
        target_id : str
            ID taken from the file name
        """
        if os.path.exists(html_filename):
            # Read the data
            with open(html_filename, 'rb') as fin:
                html = fin.read()
            target_id = os.path.splitext(os.path.basename(html_filename))[0]
        else:
            html, target_id = None, None

        return html, target_id

As noted at the start, each file is named after its ID (race_id or horse_id), so the data is managed based on that ID.
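
For reference, the ID recovery is just a path operation. A minimal sketch follows; the function name id_from_path is an assumption made for this example, and the file names are the ones shown in this article's directory listing.

```python
import os


def id_from_path(html_filename):
    """Return the file name without directory or extension (the race_id or horse_id)."""
    return os.path.splitext(os.path.basename(html_filename))[0]


race_id = id_from_path(os.path.join('data', 'html', 'race', '202201010101.bin'))
horse_id = id_from_path(os.path.join('data', 'html', 'horse', '2000100030.bin'))
print(race_id, horse_id)
```

The ID stays a string throughout, which preserves any leading zeros.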

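Getting race results

Next, we implement the processing that extracts the race results.

Here too, the page source shows that the race results are defined as a table tag, so pandas' read_html can read them.

However, horse_id and jockey_id (the jockey ID) appear only in the page source, inside link URLs, so BeautifulSoup is used alongside it.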
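
The ID extraction from link URLs boils down to filtering hrefs by prefix and taking the first run of digits. The sketch below shows that step alone; the href values are made up for illustration, and ids_from_hrefs is a hypothetical helper rather than part of the repository.

```python
import re


def ids_from_hrefs(hrefs, prefix):
    """Return the first run of digits from every href that starts with prefix."""
    pattern = re.compile(f'^{prefix}')
    return [re.findall(r'\d+', href)[0] for href in hrefs if pattern.search(href)]


# Made-up href values shaped like links on a race result page
sample_hrefs = [
    '/horse/2019105219/',
    '/jockey/result/recent/01088/',
    '/horse/2019104462/',
    '/trainer/01075/',
]
horse_ids = ids_from_hrefs(sample_hrefs, '/horse')
jockey_ids = ids_from_hrefs(sample_hrefs, '/jockey')
print(horse_ids, jockey_ids)
```

Because the lists are built in document order, they line up row by row with the table that read_html returns, which is what allows them to be assigned as new columns.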

The implementation based on the above is as follows.

    def get_rawdata_results(self, html_filepaths):
        """
        Get the race results table

        Parameters
        ----------
        html_filepaths : list
            Paths of the race result HTML files

        Returns
        -------
        race_results : pandas.DataFrame
            Table of all race results
        """
        horse_pattern = re.compile('^/horse')
        jockey_pattern = re.compile('^/jockey')
        races = {}

        for html_filename in tqdm(html_filepaths):
            html, race_id = self.__read_html_file(html_filename)
            # Skip invalid file paths
            if html is None:
                continue

            try:
                df = pd.read_html(html)[0]
                soup = BeautifulSoup(html, 'html.parser')
                # Table that holds the horse and jockey links
                summary_table = soup.find('table', attrs={'class': 'race_table_01'})
                # Get the horse IDs
                atags = summary_table.find_all('a', attrs={'href': horse_pattern})
                horse_ids = [re.findall(r'\d+', atag['href'])[0] for atag in atags]
                # Get the jockey IDs
                atags = summary_table.find_all('a', attrs={'href': jockey_pattern})
                jockey_ids = [re.findall(r'\d+', atag['href'])[0] for atag in atags]
                df['horse_id'] = horse_ids
                df['jockey_id'] = jockey_ids
                df['race_id'] = race_id
                races[race_id] = df.set_index('race_id')
            # Skip pages that raise IndexError or AttributeError
            except (IndexError, AttributeError):
                continue
            # Handle other errors such as a dropped connection
            except Exception as e:
                _, _, tb = sys.exc_info()
                print(f'Error({self.__class_name}::get_rawdata_results, {tb.tb_lineno})[{race_id}] {e}')
                break
            # Handle errors raised by Jupyter
            except:
                break
        # Combine the results
        race_results = pd.concat([val for val in races.values()])

        return race_results

The resulting data has the following form.

[Figure: sample of the race results table]

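Getting race information

The race information is not wrapped in a table tag, so the relevant part has to be extracted with BeautifulSoup.

With the later preprocessing in mind, the individual values are joined with "-" and saved as a single string.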
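
The joining step can be previewed on a plain string. In this sketch the header text is made up to resemble a race header (course, weather, going, start time); the real text depends on the page.

```python
import re

# Made-up text shaped like a race header
sample_text = '芝右1600m / 天候 : 晴 / 芝 : 良 / 発走 : 15:35'
# \w+ matches runs of letters, digits and underscores, including Japanese characters
tokens = re.findall(r'\w+', sample_text)
joined = '-'.join(tokens)
print(joined)
```

Separators such as "/" and ":" are dropped, which also means a time like 15:35 ends up as two tokens, 15 and 35. Later preprocessing has to take that into account. Writing the pattern as a raw string (r'\w+') keeps the backslash from being treated as a string escape.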

The implementation based on the above is as follows.

    def get_rawdata_raceinfo(self, html_filepaths):
        """
        Get the race information table

        Parameters
        ----------
        html_filepaths : list
            Paths of the race result HTML files

        Returns
        -------
        all_race_info : pandas.DataFrame
            Table of all race information
        """
        info = {}

        for html_filename in tqdm(html_filepaths):
            html, race_id = self.__read_html_file(html_filename)
            # Skip invalid file paths
            if html is None:
                continue

            try:
                soup = BeautifulSoup(html, 'html.parser')
                # Get the weather, race type, course length, going, date, and so on
                texts = re.findall(r'\w+', ''.join([
                    item.text
                    for item in soup.find('div', attrs={'class': 'data_intro'}).find_all('p')[:2]
                ]))
                # Build the DataFrame
                data = {
                    'texts':   ['-'.join(texts)],
                    'race_id': [race_id],
                }
                df = pd.DataFrame(data)
                info[race_id] = df.set_index('race_id')
            # Skip pages that raise AttributeError
            except AttributeError:
                continue
            # Handle other errors such as a dropped connection
            except Exception as e:
                _, _, tb = sys.exc_info()
                print(f'Error({self.__class_name}::get_rawdata_raceinfo, {tb.tb_lineno})[{race_id}] {e}')
                break
            # Handle errors raised by Jupyter
            except:
                break

        # Combine the results
        all_race_info = pd.concat([val for val in info.values()])

        return all_race_info

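The resulting data has the following form.

[Figure: sample of the race information table]

Getting payouts

Like the race results, the payouts are defined as table tags, so pandas' read_html can read them.

One thing to note: the payouts are split across two tables, so the two pandas DataFrames have to be concatenated before saving.

The implementation based on the above is as follows.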

    def get_rawdata_payback(self, html_filepaths):
        """
        Get the payout table

        Parameters
        ----------
        html_filepaths : list
            Paths of the race result HTML files

        Returns
        -------
        paybacks : pandas.DataFrame
            Table of all payouts
        """
        payouts = {}

        for html_filename in tqdm(html_filepaths):
            html, race_id = self.__read_html_file(html_filename)
            # Skip invalid file paths
            if html is None:
                continue

            try:
                dfs = pd.read_html(html)
                # The payouts are split across the second and third tables
                df = pd.concat([dfs[1], dfs[2]])
                df['race_id'] = race_id
                payouts[race_id] = df.set_index('race_id')
            # Skip pages that raise IndexError or AttributeError
            except (IndexError, AttributeError):
                continue
            # Handle other errors such as a dropped connection
            except Exception as e:
                _, _, tb = sys.exc_info()
                print(f'Error({self.__class_name}::get_rawdata_payback, {tb.tb_lineno})[{race_id}] {e}')
                break
            # Handle errors raised by Jupyter
            except:
                break

        # Combine the results
        paybacks = pd.concat([val for val in payouts.values()])

        return paybacks

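The resulting data looks like the following.

[Figure: sample of the payout table]

Getting horse past performance

Like the race results and payouts, a horse's past performance is defined as a table tag, so pandas' read_html can read it.

However, the position of the past performance table depends on whether the page has an awards (受賞歴) table: it is dfs[4] when the awards table is present and dfs[3] otherwise.

The implementation based on the above is as follows.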

    def get_rawdata_horse(self, html_filepaths):
        """
        Get the past performance table

        Parameters
        ----------
        html_filepaths : list
            Paths of the horse past performance HTML files

        Returns
        -------
        horse_results : pandas.DataFrame
            Table of past performance for all horses
        """
        horses = {}

        for html_filename in tqdm(html_filepaths):
            html, horse_id = self.__read_html_file(html_filename)
            # Skip invalid file paths
            if html is None:
                continue

            try:
                dfs = pd.read_html(html)
                # An awards table, when present, shifts the results table by one
                df = dfs[4] if dfs[3].columns[0] == '受賞歴' else dfs[3]
                df['horse_id'] = horse_id
                horses[horse_id] = df.set_index('horse_id')
            # Skip pages that raise IndexError or AttributeError
            except (IndexError, AttributeError):
                continue
            # Handle other errors such as a dropped connection
            except Exception as e:
                _, _, tb = sys.exc_info()
                print(f'Error({self.__class_name}::get_rawdata_horse, {tb.tb_lineno})[{horse_id}] {e}')
                break
            # Handle errors raised by Jupyter
            except:
                break

        # Combine the results
        horse_results = pd.concat([val for val in horses.values()])

        return horse_results

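The resulting data looks like the following.

[Figure: sample of the horse past performance table]

Getting pedigree information

For pedigree, with later processing in mind, each ancestor is stored as a tuple of (horse_id, generation), which is somewhat unconventional.

The generation can be determined from the rowspan of the parent element (the td tag) of the link that carries the ancestor's horse_id.

The relationship between rowspan and generation is as follows.

rowspan value	Generation
16	1st generation (parents)
8	2nd generation (grandparents)
4	3rd generation (great-grandparents)
2	4th generation (great-great-grandparents)
1	5th generation (fifth-generation ancestors)

Relationship between rowspan and generation

Keeping this relationship makes it possible to control how many generations back to use when working with the pedigree data.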
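
As a sketch of how the stored tuples allow that control, the example below maps rowspan values to generations and then keeps only the ancestors up to a chosen generation. The ancestor IDs are made up, and limit_generations is a hypothetical helper, not part of the repository.

```python
# rowspan value (as it appears in the HTML attribute) -> generation
ROWSPAN_TO_GENERATION = {'16': 1, '8': 2, '4': 3, '2': 4, '1': 5}


def limit_generations(ancestors, max_generation):
    """Keep only the (horse_id, generation) tuples up to max_generation."""
    return [(hid, gen) for hid, gen in ancestors if gen <= max_generation]


# Made-up (ancestor ID, rowspan) pairs in the order they might appear in the table
sample_cells = [
    ('sire01', '16'), ('gsire01', '8'), ('ggsire01', '4'),
    ('dam01', '16'), ('gdam01', '8'),
]
ancestors = [(hid, ROWSPAN_TO_GENERATION[rowspan]) for hid, rowspan in sample_cells]
up_to_two = limit_generations(ancestors, 2)
print(up_to_two)
```

The rowspan keys are strings because HTML attribute values are read as text.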

The implementation based on the above is as follows.

    def get_rawdata_ped(self, html_filepaths):
        """
        Get the pedigree table

        Parameters
        ----------
        html_filepaths : list
            Paths of the pedigree HTML files

        Returns
        -------
        ped_results : pandas.DataFrame
            Table of pedigree information for all horses
        """
        peds = {}
        sex_pattern = re.compile(r'b_ml|b_fml')
        horse_id_pattern = re.compile(r'^/horse/[0-9a-z]+/$')
        relatives = {
            '16': 1, # parents (1st generation)
            '8':  2, # grandparents (2nd generation)
            '4':  3, # great-grandparents (3rd generation)
            '2':  4, # great-great-grandparents (4th generation)
            '1':  5, # fifth-generation ancestors (5th generation)
        }

        for html_filename in tqdm(html_filepaths):
            html, horse_id = self.__read_html_file(html_filename)
            # Skip invalid file paths
            if html is None:
                continue

            try:
                soup = BeautifulSoup(html, 'html.parser')
                ped_table = soup.find('table', attrs={'class': 'blood_table'})
                tds = ped_table.find_all('td', attrs={'class': sex_pattern})
                atags = [td.find('a', attrs={'href': horse_id_pattern}) for td in tds]
                # Build (ancestor horse_id, generation) tuples, skipping cells without a horse link
                ped_horse_ids = [
                    (re.sub(r'^/horse/', '', atag['href'])[:-1], relatives[atag.parent.get('rowspan', '1')])
                    for atag in atags if atag is not None
                ]
                peds[horse_id] = pd.DataFrame({f'{horse_id}': ped_horse_ids})
            # Skip pages that raise IndexError or AttributeError
            except (IndexError, AttributeError):
                continue
            # Handle other errors such as a dropped connection
            except Exception as e:
                _, _, tb = sys.exc_info()
                print(f'Error({self.__class_name}::get_rawdata_ped, {tb.tb_lineno})[{horse_id}] {e}')
                break
            # Handle errors raised by Jupyter
            except:
                break

        # Combine the results
        ped_results = pd.concat([val for val in peds.values()], axis=1).T.add_prefix('peds_').rename_axis('horse_id')

        return ped_results

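The resulting data looks like the following.

[Figure: sample of the pedigree table]

The columns are named peds_0, peds_1, and so on, one per ancestor found in the pedigree table.

Saving the parsed results

Finally, we implement the processing that saves the parsed results.

Since new results should be appended to past results, when older data exists, the rows of the old data whose index also appears in the new data are dropped before saving.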
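
The "drop overlapping old rows, then append" rule can be illustrated without pandas. The sketch below is a plain-Python analogue using (key, row) pairs in place of an indexed DataFrame; merge_by_key and the sample rows are assumptions made for this example.

```python
def merge_by_key(old_rows, new_rows):
    """Drop old rows whose key appears in the new rows, then append the new rows."""
    new_keys = {key for key, _ in new_rows}
    kept = [(key, row) for key, row in old_rows if key not in new_keys]
    return kept + list(new_rows)


# Made-up (race_id, row) pairs; one race has several rows, as in the results table
old_rows = [
    ('202201010101', 'horse A (old)'),
    ('202201010101', 'horse B (old)'),
    ('202201010102', 'horse C'),
]
new_rows = [
    ('202201010101', 'horse A (new)'),
    ('202201010101', 'horse B (new)'),
]
merged = merge_by_key(old_rows, new_rows)
print(merged)
```

All old rows for a re-collected race are replaced as a group, while races that appear only in the old data are kept untouched.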

The implementation based on the above is as follows.

    def update_rawdata(self, filepath, new_df):
        """
        Update a saved table

        Parameters
        ----------
        filepath : str
            Path where the table is saved
        new_df : pd.DataFrame
            Table data to save
        """
        # When the file already exists
        if os.path.exists(filepath):
            old_df = pd.read_pickle(filepath)
            # Drop old rows whose index also appears in the new data
            filtered_old = old_df[~old_df.index.isin(new_df.index)]
            df = pd.concat([filtered_old, new_df])
        else:
            df = new_df.copy()
        # Save the updated table
        df.to_pickle(filepath)

Fetching horse past performance and pedigree requires the horse_id values, so we also define a method that loads a saved table.

The implementation is as follows.

    def load_rawdata(self, filepath):
        """
        Load a saved table

        Parameters
        ----------
        filepath : str
            Path to load the table from

        Returns
        -------
        df : pd.DataFrame
            The loaded table
        """
        # When the file does not exist
        if not os.path.exists(filepath):
            raise Exception(f'Error({self.__class_name}::load_rawdata): Does not exist {filepath}')
        df = pd.read_pickle(filepath)

        return df

Example of the result

After running the program from this article, the directory structure looks like this example.


.
|-- Dockerfile
|-- docker-compose.yml
|-- entrypoint.sh
`-- workspace
    `-- keiba
        |-- data
        |   |-- html
        |   |   |-- horse
        |   |   |   |-- 2000100030.bin # a collected web page
        |   |   |   |-- 2000100231.bin
        |   |   |   |-- 2000100785.bin
        |   |   |   `-- ...
        |   |   |-- ped
        |   |   |   |-- 2000100030.bin
        |   |   |   |-- 2000100231.bin
        |   |   |   |-- 2000100785.bin
        |   |   |   `-- ...
        |   |   `-- race
        |   |       |-- 202201010101.bin
        |   |       |-- 202201010102.bin
        |   |       |-- 202201010103.bin
        |   |       `-- ...
        |   |-- master
        |   |   `-- horse_results_updated_at.pkl # when each horse's past performance was fetched (omitted in this article)
        |   `-- raw
        |       |-- results.pkl # race results
        |       |-- race_info.pkl # race information
        |       |-- payback.pkl # payouts
        |       |-- horse_results.pkl # horse past performance
        |       `-- ped_results.pkl # pedigree
        |-- models
        |-- modules
        |   |-- __init__.py
        |   |-- Constants.py
        |   `-- Collection.py
        `-- scrape.ipynb

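The following articles use this data for preprocessing.

Summary

This article explained how to collect the data needed for horse racing prediction with machine learning.

Machine learning needs a very large amount of data, so a foundation for collecting it efficiently is essential.

I believe we now have that data collection foundation in place.

The next articles will cover preprocessing of this data.

For those who want to learn efficiently

Some readers may have found the programming in this article hard to follow.

In that case you may get stuck when an error occurs, so I recommend studying in an environment where experienced people can support you.

See the following article for details.

[Comparison] Top 6 Recommended Programming Schools [For Beginners]

An article on preprocessing is now available: [Python] Horse Racing Prediction with Machine Learning [Preprocessing and Feature Generation]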
