ML-Pipeline-of-Paper-Category-Classification

Trains and deploys a classifier that predicts a paper's primary category from its title, using paper data retrieved via the arXiv API.

Requirements

Poetry
gcloud CLI
docker compose

Setup

GCP Authentication

$ gcloud auth login
$ gcloud components install pubsub-emulator

Install Dependencies

$ make install

Environment Variables

$ vi .env

Fill in the following values and export them as environment variables.

GCP_PROJECT_ID=your project id
TOPIC_ID=your topic id
AR_REPOSITORY_NAME=artifact registry repository name
LOCATION=asia-northeast1
DATA_BUCKET=gs://xxx
SOURCE_CSV_URI=gs://xxx/data.csv
CONFIG_FILE_URI=gs://xxx/config.json
ROOT_BUCKET=gs://yyy
JOB_NAME=cloud run job name
SCHEDULER_NAME=cloud scheduler name
DATASET_NAME=dataset name
TABLE_NAME=table name
BQ_FUNC_NAME=name of the cloud function that updates bigquery
PIPELINE_NAME=vertex ai pipelines name
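
One way to export everything in `.env` at once is sketched below. It assumes the file contains plain `KEY=value` lines, with no spaces around `=` and with any value that contains spaces wrapped in quotes. The helper name `load_env` is hypothetical and not part of this repository.

```shell
# Hypothetical helper (not part of this repository): export every
# KEY=value assignment from an env file into the current shell.
load_env() {
  env_file="${1:-.env}"
  # "." searches PATH for names without a slash, so force a relative path.
  case "$env_file" in
    */*) ;;
    *) env_file="./$env_file" ;;
  esac
  if [ ! -f "$env_file" ]; then
    echo "env file not found: $env_file" >&2
    return 1
  fi
  set -a          # auto-export every variable assigned from here on
  . "$env_file"   # run the assignments in the current shell
  set +a          # stop auto-exporting
}

# Usage (from the repository root, after editing .env):
#   load_env .env && echo "$GCP_PROJECT_ID"
```

Because the file is sourced by the shell, it should only contain assignments you trust.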

Boot MLflow Server

$ make mlflow

Build & Push Docker Image

$ gcloud auth configure-docker asia-northeast1-docker.pkg.dev
$ gcloud artifacts repositories create $AR_REPOSITORY_NAME --location=$LOCATION --repository-format=docker
$ docker compose build
$ docker compose push
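
Images pushed to Artifact Registry are addressed as `LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG`. The sketch below assembles such a name from the variables in `.env`, which can help when checking that an image name matches the repository created above. The helper name `image_uri` and the example image name `trainer` are hypothetical; the image names actually used by this repository's compose file are not shown here.

```shell
# Hypothetical helper: print the Artifact Registry name for an image.
# Relies on LOCATION, GCP_PROJECT_ID and AR_REPOSITORY_NAME being set.
image_uri() {
  image_name="$1"
  image_tag="${2:-latest}"   # default to "latest" when no tag is given
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' \
    "$LOCATION" "$GCP_PROJECT_ID" "$AR_REPOSITORY_NAME" \
    "$image_name" "$image_tag"
}

# Usage ("trainer" is a placeholder image name):
#   image_uri trainer v1
```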

Deploy Cloud Functions to Use BigQuery

Deploy a function that automatically updates BigQuery whenever the dataset is updated.

$ make deploy_bq_func

Cloud Run Job to Scrape Paper Data

Deploy

$ make deploy_job

Exec

Running the command below fetches only the papers added since the previous run.

$ make exec_job

Create Scheduler

To run the Cloud Run Job on a schedule, run the command below (the default is once a month).

$ make create_scheduler
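
To check which schedule is actually in effect, the scheduler job can be inspected with the gcloud CLI. This is a small sketch, assuming `SCHEDULER_NAME` and `LOCATION` are exported from `.env` and that the scheduler was created in `$LOCATION`; the helper name `show_schedule` is hypothetical. Cloud Scheduler uses unix-cron syntax, so a monthly schedule would look something like `0 0 1 * *` (00:00 on the first day of each month); the exact default set by the Makefile is not stated here.

```shell
# Hypothetical helper: print the cron schedule of the scheduler job.
# Assumes SCHEDULER_NAME and LOCATION are set in the environment.
show_schedule() {
  gcloud scheduler jobs describe "$SCHEDULER_NAME" \
    --location="$LOCATION" \
    --format='value(schedule)'
}

# Usage:
#   show_schedule
```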

Exec Pipeline

$ make pipeline
