Web Scraping for Individual Investors (4): Fetching and Charting the TSE "Short-Selling Ratio" with Python [Final Part]

Last time I left the back data for the past 12 months unhandled. I have now revised the script to pull it in, so here is the update.

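At the top right of the short-selling statistics top page (https://www.jpx.co.jp/markets/statistics-equities/short-selling/index.html) you can select the year and month of the back data. Selecting one and displaying it gives addresses like the following. Only the number immediately before ".html" changes, so they should be easy to pull in.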

https://www.jpx.co.jp/markets/statistics-equities/short-selling/00-archives-01.html
https://www.jpx.co.jp/markets/statistics-equities/short-selling/00-archives-02.html
...
https://www.jpx.co.jp/markets/statistics-equities/short-selling/00-archives-12.html
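
Since the only varying part is a two-digit, zero-padded number, the same list of addresses can be generated with a format specifier. This is a minimal sketch rather than the script used in this post; the names `BASE` and `build_urls` are my own for illustration.

```python
BASE = "https://www.jpx.co.jp/markets/statistics-equities/short-selling/"


def build_urls(months_back=12):
    """Return the top page followed by archive pages 01..months_back."""
    urls = [BASE + "index.html"]
    # {i:02d} pads to two digits, so 1 becomes "01" and 12 stays "12".
    urls += [f"{BASE}00-archives-{i:02d}.html" for i in range(1, months_back + 1)]
    return urls


urls = build_urls()
```

This avoids the separate branch for one-digit and two-digit numbers.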

Store these addresses in the list urls, then access them one at a time as url and extract the locations of the PDF files into pdf_list.

import requests
from bs4 import BeautifulSoup

urls = []
url = 'https://www.jpx.co.jp/markets/statistics-equities/short-selling/index.html'
urls.append(url)

# Accumulate the back-data URLs in urls
for i in range(12):
    if i <= 8:
        urls.append("https://www.jpx.co.jp/markets/statistics-equities/short-selling/00-archives-0" + str(i+1) + ".html")
    else:
        urls.append("https://www.jpx.co.jp/markets/statistics-equities/short-selling/00-archives-" + str(i+1) + ".html")

# Take the URLs out of urls one at a time and extract the PDF locations on each page into pdf_list
pdf_list = []
for url in urls:
    res = requests.get(url)
    # On the TSE site res.encoding comes back as 'ISO-8859-1' and res.text is garbled,
    # so add the following line. With it, res.encoding becomes 'utf-8'.
    res.encoding = res.apparent_encoding
    soup = BeautifulSoup(res.text, 'html.parser')
    s = soup.find('div', {'class': 'component-normal-table'})
    a_tags = s.find_all('a')

    for a_tag in a_tags:
        href = a_tag.get('href')
        # Skip anchors without a usable href; keep links whose fifth character
        # from the end is 'm' (for a PDF link, a file name ending in "m.pdf")
        if href and len(href) >= 5 and href[-5] == 'm':
            pdf_list.append(href)
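
The index check on the href is easy to misread, so here is a sketch of the same filtering step written with `str.endswith`, plus `urljoin` to build absolute URLs. It assumes the wanted files are the ones whose names end in "m.pdf", which is what the index check implies for PDF links. The function name `select_pdf_links` and the sample paths are hypothetical and only for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.jpx.co.jp"


def select_pdf_links(hrefs, suffix="m.pdf"):
    """Keep hrefs ending with `suffix` and convert them to absolute URLs."""
    selected = []
    for href in hrefs:
        # a_tag.get('href') returns None for anchors without href, so skip those.
        if href and href.endswith(suffix):
            selected.append(urljoin(BASE_URL, href))
    return selected


# Made-up hrefs for illustration only; the real paths on the site will differ.
sample = ["/example/180717-m.pdf", "/example/180717-s.pdf", None, "/example/notes.html"]
print(select_pdf_links(sample))
```

Unlike the index check, this version also requires the ".pdf" extension, so it is slightly stricter.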

Then, based on the extracted pdf_list, download the PDF files and save them under the temp folder.

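import os
import urllib.request

base_url = 'https://www.jpx.co.jp'

# Download the PDF files into the temp folder (created if it does not exist yet)
os.makedirs('temp', exist_ok=True)
for i, x in enumerate(pdf_list):
    url = base_url + x
    urllib.request.urlretrieve(url, 'temp/shortselling' + str(i) + '.pdf')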
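
With a dozen pages' worth of PDFs, a single failed download would stop the loop above. The following is a rough sketch of a more forgiving version, not the script used in this post. The function name `download_pdfs`, the injectable `fetch` parameter, and the one-second pause are my own assumptions.

```python
import os
import time
import urllib.request


def download_pdfs(pdf_urls, dest_dir="temp", fetch=urllib.request.urlretrieve, pause=1.0):
    """Download each URL into dest_dir and return the paths that were saved."""
    os.makedirs(dest_dir, exist_ok=True)  # make sure the target folder exists
    saved = []
    for i, url in enumerate(pdf_urls):
        path = os.path.join(dest_dir, f"shortselling{i}.pdf")
        try:
            fetch(url, path)
        except OSError as exc:
            # urllib's URLError and HTTPError are subclasses of OSError.
            print(f"skipped {url}: {exc}")
            continue
        saved.append(path)
        time.sleep(pause)  # short pause between requests (the value is arbitrary)
    return saved
```

Calling `download_pdfs([base_url + x for x in pdf_list])` would mirror the loop above. The `fetch` parameter exists only so the function can be tried out without network access.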

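The rest is the same as last time, with two additions. I wrap the processing in try/except so that the script does not stop when an error occurs, and, to make the trend easier to see, I draw the average over the period as a dotted line. Running the script gives the result below. It came out looking like a proper chart.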

[Figure: TSE short-selling ratio over the past 12 months, with the period average shown as a dotted line]
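
The two additions described above (skipping files that fail, and drawing the average as a dotted line) might look roughly like this. It is a sketch under assumptions, not the uploaded script: `parse_one` is a hypothetical stand-in for whatever function extracts a value from one PDF, and `parse_all` and `plot_with_average` are names I made up for illustration.

```python
from statistics import mean


def parse_all(paths, parse_one):
    """Apply parse_one to each path, skipping any file that raises."""
    rows = []
    for path in paths:
        try:
            rows.append(parse_one(path))
        except Exception as exc:  # broad on purpose: one bad PDF should not stop the run
            print(f"skipped {path}: {exc}")
    return rows


def plot_with_average(dates, ratios, out_path="shortselling.png"):
    """Plot the ratio series plus a dotted horizontal line at its mean."""
    # Imported here so parse_all can be used without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    avg = mean(ratios)
    fig, ax = plt.subplots()
    ax.plot(dates, ratios, label="short-selling ratio")
    ax.axhline(avg, linestyle=":", label=f"average ({avg:.1f})")
    ax.legend()
    fig.savefig(out_path)
    plt.close(fig)
    return avg
```

Catching `Exception` broadly is a deliberate trade-off here: it keeps the batch running, at the cost of hiding the cause unless the printed message is checked.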

I have uploaded this version of the script as well, at the link below. I hope it is useful as a reference.

[Tokyo Stock Exchange "Short-Selling Ratio" Trend (Past 12 Months)](https://gist.g

akatak.hatenadiary.jp