A script for collecting training images for AutoML Vision

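Hello everyone. I'm @best_not_best.

Cloud AutoML Vision is one of the Google Cloud Platform (GCP) products, and it lets you build an image recognition model without writing any code.

Training images can be uploaded in the following ways.
(cf. https://cloud.google.com/vision/automl/docs/create#upload_your_images )

Upload from the Web UI (multiple files can be uploaded at once as a zip file)
Load from Google Cloud Storage (GCS)

This time I use the latter: a script that collects images with an image search API and then uploads the files to GCS.
The model I build will be one that recognizes dog images!

In addition to AutoML Vision, the script uses GCS and Stackdriver Logging, so enable the relevant APIs in GCP as needed.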

Environment

Machine / OS

MacBook Pro (13-inch, 2017, Four Thunderbolt 3 Ports)
macOS 10.12.6
pyenv: 1.2.8
pyenv-virtualenv: 1.1.3

Python

Build a Python 3.7.1 environment with pyenv and pyenv-virtualenv.

$ mkdir hogehoge
$ cd hogehoge

$ pyenv local 3.7.1
$ pyenv virtualenv 3.7.1 hogehoge
$ pyenv local hogehoge

$ pip install -U pip

Install the following libraries.

requirements.txt

cachetools==3.0.0
certifi==2018.11.29
chardet==3.0.4
dill==0.2.8.2
docopt==0.6.2
flake8==3.6.0
flake8-docstrings==1.1.0
flake8-polyfill==1.0.1
future==0.16.0
gapic-google-cloud-logging-v2==0.91.3
google-api-core==1.7.0
google-api-python-client==1.7.7
google-auth==1.6.2
google-auth-httplib2==0.0.3
google-cloud-core==0.29.1
google-cloud-logging==1.9.1
google-cloud-storage==1.13.2
google-gax==0.15.16
google-resumable-media==0.3.2
googleapis-common-protos==1.5.5
grpcio==1.17.1
httplib2==0.12.0
idna==2.8
mccabe==0.6.1
numpy==1.15.4
oauth2client==3.0.0
pandas==0.23.4
ply==3.8
proto-google-cloud-logging-v2==0.91.3
protobuf==3.6.1
pyasn1==0.4.4
pyasn1-modules==0.2.2
pycodestyle==2.4.0
pydocstyle==3.0.0
pyflakes==2.0.0
python-dateutil==2.7.5
pytz==2018.7
PyYAML==3.13
requests==2.21.0
rsa==4.0
six==1.12.0
snowballstemmer==1.2.1
uritemplate==3.0.0
urllib3==1.24.1

Directory layout

scripts
 └─ scripts.py
model
 └─ get_images.py
config
 └─ conf.yml
outputs

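Processing overview

Collect images with the image search API
Save the images locally
Upload the image files to GCS
Create a CSV file containing the image URIs and labels
Upload the CSV file to GCS

I would have liked to use the Google Custom Search API for the image search, but because of its request limits I use the Bing Image Search API instead.
For creating a Microsoft account and the related setup, I referred to the following article.

「Bingの画像検索APIを使って画像を大量に収集する」 - Qiita (Collecting a large number of images with the Bing image search API)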

File descriptions

conf.yml

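This file holds the API key and the search words.
The search words are the Japanese breeds picked from みんなの犬図鑑 (a dog breed encyclopedia site). (I wanted to cover every breed, but ran out of time...)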

conf.yml

api_key: 'xxxxx'
gcs_bucket_name: 'yyyyy'
images_per_requests: 50
request_count: 10
dir_name: 'dog'
search_words:
  -
    label: 'akita'
    word: '秋田犬'
  -
    label: 'kai'
    word: '甲斐犬'
  -
    label: 'kishu'
    word: '紀州犬'
  -
    label: 'shikoku'
    word: '四国犬'
  -
    label: 'shiba'
    word: '柴犬'
  -
    label: 'tosa'
    word: '土佐犬'
  -
    label: 'spitz'
    word: '日本スピッツ'
  -
    label: 'japanese-terrier'
    word: '日本テリア'
  -
    label: 'hokkaido'
    word: '北海道犬'

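api_key: API key for the Bing image search API
gcs_bucket_name: name of the GCS bucket to upload to (the bucket name must be <GCP project ID>-vcm)
images_per_requests: number of images requested per request
request_count: number of requests per search word
dir_name: name of the directory used both locally and on GCS
search_words.label: label attached to the images
search_words.word: image search word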
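
As a rough illustration of how images_per_requests and request_count work together, the sketch below computes the offsets that would be requested for one search word and the resulting upper bound on collected URLs. The helper names (request_offsets, max_urls_per_word, expected_bucket_name) and the project ID 'my-project' are hypothetical and not part of the script; the dict simply mirrors the values in conf.yml above.

```python
"""Sketch: what the paging settings in conf.yml amount to (hypothetical helpers)."""


def request_offsets(images_per_requests: int, request_count: int) -> list:
    """Return the offset sent with each search request."""
    return list(range(0, images_per_requests * request_count, images_per_requests))


def max_urls_per_word(images_per_requests: int, request_count: int) -> int:
    """Upper bound on URLs per search word (duplicates are dropped afterwards)."""
    return images_per_requests * request_count


def expected_bucket_name(project_id: str) -> str:
    """Bucket naming rule described above: <GCP project ID>-vcm."""
    return project_id + '-vcm'


# values copied from the conf.yml shown above
conf = {'images_per_requests': 50, 'request_count': 10}

offsets = request_offsets(conf['images_per_requests'], conf['request_count'])
print(len(offsets), offsets[0], offsets[-1])  # 10 0 450
print(max_urls_per_word(conf['images_per_requests'], conf['request_count']))  # 500
print(expected_bucket_name('my-project'))  # 'my-project' is a placeholder project ID
```

With the settings above, each search word results in ten requests and at most 500 URLs before de-duplication and the extension check.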

get_images.py

This module fetches images via the Bing image search API and uploads the files to GCS.

get_images.py

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

"""Get images from Bing Search."""

from google.cloud import storage
import hashlib
import os
import pandas as pd
import requests
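# import the submodules explicitly; a bare "import urllib" does not guarantee them
import urllib.parse
import urllib.request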

class CommonError(Exception):
    """common error class."""

    pass

class GetImagesFromBingSearch(object):
    """Get images from Bing Search."""

    def __init__(
        self,
        api_key: str,
        gcs_bucket_name: str,
        gcs_dir_name: str,
    ):
        """init."""
        self.headers = {
            'Content-Type': 'multipart/form-data',
            'Ocp-Apim-Subscription-Key': api_key,
        }
        gcs_client = storage.Client()
        self.gcs_bucket_name = gcs_bucket_name
        self.gcs_bucket = gcs_client.get_bucket(gcs_bucket_name)
        self.csv_data = []
        self.gcs_dir_name = gcs_dir_name

    def get_image_url(
        self,
        search_word: str,
        images_per_requests: int = 50,
        request_count: int = 20,
    ) -> list:
        """get image url."""
        image_url_list = []

        # offset
        for offset in range(0, (images_per_requests * request_count), images_per_requests):
            # query parameters
            params = urllib.parse.urlencode(
                {
                    'q': search_word,
                    'mkt': 'ja-JP',
                    'count': images_per_requests,
                    'offset': offset,
                }
            )

            try:
                # get
                response = requests.get(
                    'https://api.cognitive.microsoft.com/bing/v7.0/images/search',
                    headers=self.headers,
                    params=params
                )
                response.raise_for_status()
                search_results = response.json()
            except Exception as e:
                raise CommonError(e)

            # append url
            for values in search_results['value']:
                image_url_list.append(values['contentUrl'])

        return list(set(image_url_list))

    def get_image(
        self,
        image_url: str,
        output_dir_path: str,
        blob_dir: str,
    ) -> bool:
        """get image."""
        opener = urllib.request.build_opener()
        urllib.request.install_opener(opener)

        # check extension
        parsed_url_path = urllib.parse.urlparse(image_url).path.split(':')[0]
        extension = os.path.splitext(parsed_url_path)[-1].lower()
        if extension not in ('.jpg', '.jpeg', '.gif', '.png', '.bmp'):
            msg = 'extension error. url:"%s".' % (image_url)
            raise CommonError(msg)

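        try:
            # get
            response = requests.get(image_url, allow_redirects=True, timeout=5)
            # treat HTTP errors (404 etc.) as failures instead of saving the error page
            response.raise_for_status()
        except Exception as e:
            raise CommonError(e)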

        # check content
        if len(response.content) == 0:
            msg = 'content error. url:"%s".' % (image_url)
            raise CommonError(msg)

        # save
        hashed_url = hashlib.sha256(image_url.encode('utf-8')).hexdigest()
        output_file_path = os.path.join(
            output_dir_path,
            hashed_url + extension,
        )
        with open(output_file_path, 'wb') as fp:
            fp.write(response.content)

        # upload to GCS
        blob_name = 'automl_vision/' \
            + self.gcs_dir_name \
            + '/' \
            + blob_dir \
            + '/' \
            + hashed_url \
            + extension
        blob = self.gcs_bucket.blob(blob_name)
        blob.upload_from_filename(output_file_path)

        # add data
        self.csv_data.append(
            {
                'gcs_uri': 'gs://' + self.gcs_bucket_name + '/' + blob_name,
                'label': blob_dir,
            }
        )

        return True

    def make_csv(
        self,
        output_file_path: str,
    ) -> bool:
        """make CSV file."""
        # fix the column order here; DataFrame.ix is deprecated in pandas
        df = pd.DataFrame(self.csv_data, columns=['gcs_uri', 'label'])
        df.to_csv(output_file_path, index=False, header=False)

        # upload to GCS
        blob_name = 'automl_vision/' \
            + self.gcs_dir_name \
            + '/' \
            + os.path.basename(output_file_path)
        blob = self.gcs_bucket.blob(blob_name)
        blob.upload_from_filename(output_file_path)

        return True

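get_image_url: fetches image URLs from the Bing image search API.
get_image: downloads the image from its URL, saves it locally, and uploads the file to GCS.
- I packed too much into this method, so it is hard to unit-test...
- Only files with specific extensions are saved. Files whose size is 0 are not saved either.
make_csv: creates the "CSV file containing the image URIs and labels" mentioned earlier.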
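
On the unit-testing remark: one possible way to make get_image easier to test is to pull the pure naming logic out into small functions. The following is only a sketch under that assumption. It reuses the naming scheme from get_images.py (SHA-256 of the URL plus the extension, stored under automl_vision/<dir_name>/<label>/), but the function names hashed_file_name, build_blob_name and build_gcs_uri, as well as the example URL, are hypothetical.

```python
import hashlib


def hashed_file_name(image_url: str, extension: str) -> str:
    """File name derived from the URL, so the same URL always maps to the same file."""
    return hashlib.sha256(image_url.encode('utf-8')).hexdigest() + extension


def build_blob_name(gcs_dir_name: str, label: str, file_name: str) -> str:
    """Object name inside the bucket: automl_vision/<dir_name>/<label>/<file_name>."""
    return '/'.join(['automl_vision', gcs_dir_name, label, file_name])


def build_gcs_uri(bucket_name: str, blob_name: str) -> str:
    """gs:// URI that ends up in the CSV file."""
    return 'gs://' + bucket_name + '/' + blob_name


# example values only; 'yyyyy' is the placeholder bucket name from conf.yml
file_name = hashed_file_name('https://example.com/dog.jpg', '.jpg')
blob_name = build_blob_name('dog', 'shiba', file_name)
print(build_gcs_uri('yyyyy', blob_name))
```

Because these functions touch neither the network nor GCS, they can be checked with plain assertions, leaving only the download and upload calls to be mocked.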

scripts.py

This is the entry-point script. Logs are sent to Stackdriver Logging.

scripts.py

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

"""Get images from Bing Search.

Usage:
    scripts.py
        --conf_file_path=<conf_file_path>
        --output_dir_path=<output_dir_path>
    scripts.py -h | --help
Options:
    -h --help show this screen and exit.
"""

from docopt import docopt
import google.cloud.logging
import logging
import os
import shutil
import sys
import yaml

try:
    import get_images
except ImportError:
    sys.path.append(os.path.abspath(os.path.dirname(__file__)) + '/../model')
    import get_images

if __name__ == '__main__':
    # logging config
    logging.basicConfig(format='%(asctime)s %(levelname)s: %(message)s')

    # logging
    logging_client = google.cloud.logging.Client()
    logging_client.setup_logging()
    logging.info('%s start.' % (__file__))

    # get parameters
    args = docopt(__doc__)
    conf_file_path = args['--conf_file_path']
    output_dir_path = args['--output_dir_path']

    # config
    with open(conf_file_path) as f:
        # safe_load is enough for a plain config and avoids arbitrary object construction
        conf_data = yaml.safe_load(f)
    api_key = conf_data['api_key']
    search_words = conf_data['search_words']
    gcs_bucket_name = conf_data['gcs_bucket_name']
    images_per_requests = conf_data['images_per_requests']
    request_count = conf_data['request_count']
    dir_name = conf_data['dir_name']

    # create model
    gifbs = get_images.GetImagesFromBingSearch(
        api_key,
        gcs_bucket_name,
        dir_name,
    )

    for search_word in search_words:
        label = search_word['label']
        word = search_word['word']

        # make dir
        output_tmp_dir_path = os.path.join(
            os.path.abspath(output_dir_path),
            dir_name,
            label,
        )
        if os.path.isdir(output_tmp_dir_path):
            # remove dir
            shutil.rmtree(output_tmp_dir_path)
        # makedirs also creates <output_dir_path>/<dir_name> if it does not exist yet
        os.makedirs(output_tmp_dir_path)

        try:
            # get image url
            image_url_list = gifbs.get_image_url(
                word,
                images_per_requests=images_per_requests,
                request_count=request_count,
            )
        except get_images.CommonError as e:
            logging.warning(e)
            # skip this word; image_url_list would otherwise be undefined or stale
            continue

        for image_url in image_url_list:
            # get image, save and upload GCS
            try:
                gifbs.get_image(
                    image_url,
                    output_tmp_dir_path,
                    label,
                )
            except get_images.CommonError as e:
                logging.warning(e)
                continue

    # make CSV file
    output_file_path = os.path.join(
        os.path.abspath(output_dir_path),
        dir_name,
        'data.csv',
    )
    gifbs.make_csv(output_file_path)

    logging.info('%s end.' % (__file__))
    sys.exit(0)

How to run

conf_file_path and output_dir_path are the command-line arguments. Specify <path to the config file> and <path to the directory where images are saved>, respectively.

Example

$ python scripts/scripts.py  \
  --conf_file_path=./config/conf.yml \
  --output_dir_path=./outputs/

Results

Quite a few WARNING logs are printed, but the program emits them deliberately for URLs it skips, and the images we need are still uploaded to GCS.
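
To see where such a warning comes from, the sketch below repeats the extension check from get_image as standalone functions and applies it to the URL that appears in the log output. The names ALLOWED_EXTENSIONS, extract_extension and is_accepted are made up for this illustration, and the example.com URL is a placeholder.

```python
import os
import urllib.parse

ALLOWED_EXTENSIONS = ('.jpg', '.jpeg', '.gif', '.png', '.bmp')


def extract_extension(image_url: str) -> str:
    """Lower-cased extension of the URL path, ignoring a ':suffix' such as ':large'."""
    path = urllib.parse.urlparse(image_url).path.split(':')[0]
    return os.path.splitext(path)[-1].lower()


def is_accepted(image_url: str) -> bool:
    """True if the URL would pass the extension check."""
    return extract_extension(image_url) in ALLOWED_EXTENSIONS


logged_url = 'https://item-shopping.c.yimg.jp/i/j/usual_irish3'
print(repr(extract_extension(logged_url)))  # '' (the path has no extension)
print(is_accepted(logged_url))  # False, hence the "extension error" warning
print(is_accepted('https://example.com/photos/shiba.JPG:large'))  # True
```

A URL whose path has no recognizable image extension is rejected before any download happens, which is why it shows up as a warning rather than as a saved file.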

Output

2018-12-22 23:47:05,421 INFO: scripts/scripts.py start.
scripts/scripts.py start.
2018-12-22 23:47:21,630 WARNING: extension error. url:"https://item-shopping.c.yimg.jp/i/j/usual_irish3".
extension error. url:"https://item-shopping.c.yimg.jp/i/j/usual_irish3".
(omitted)
2018-12-22 23:51:12,407 INFO: scripts/scripts.py end.
scripts/scripts.py end.
Program shutting down, attempting to send 1 queued log entries to Stackdriver Logging...
Waiting up to 5 seconds.
Sent all pending logs.

The image files are uploaded per label under gs://<gcs_bucket_name>/automl_vision/<dir_name>/<search_words.label>/, and the "CSV file containing the image URIs and labels" is uploaded to gs://<gcs_bucket_name>/automl_vision/<dir_name>/data.csv.
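
To make the layout of data.csv concrete, here is a minimal sketch of the two-column, header-less format that make_csv writes (one gcs_uri,label pair per line). It uses the standard library's csv module rather than pandas, and the bucket name, file names and the rows_to_csv helper are made-up placeholders, not real output.

```python
import csv
import io


def rows_to_csv(rows: list) -> str:
    """Serialize rows as 'gcs_uri,label' lines without a header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    for row in rows:
        writer.writerow([row['gcs_uri'], row['label']])
    return buf.getvalue()


# placeholder rows in the same shape as the entries collected in csv_data
rows = [
    {'gcs_uri': 'gs://yyyyy/automl_vision/dog/akita/0001.jpg', 'label': 'akita'},
    {'gcs_uri': 'gs://yyyyy/automl_vision/dog/shiba/0002.png', 'label': 'shiba'},
]

csv_text = rows_to_csv(rows)
print(csv_text, end='')
```

Each line pairs one uploaded image with the label taken from search_words.label, which is the form used when importing the dataset from GCS.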