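An Airflow-based pipeline that automatically crawls public rental housing announcements from SH (Seoul Housing & Communities Corporation) and LH (Korea Land and Housing Corporation), extracts information from the attached PDFs, and stores it in a PostgreSQL database.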
- Key features
- System architecture
- Tech stack
- Supported corporations and application types
- Quick start (Docker)
- Environment configuration
- Airflow DAG guide
- Database schema
- Querying the data
- Local development environment
- Project structure
- Troubleshooting
- SH: static web crawling based on BeautifulSoup
- LH: dynamic page crawling based on Selenium
- Direct download of PDF files
- Automatic conversion of Excel files to PDF (using LibreOffice)
- Automatic merging of multiple files
- Uses the Upstage Information Extract API
- Structured data extraction driven by a JSON schema
- Automatic parsing of supply projects, schedules, eligibility requirements, and more
- Scheduling based on Apache Airflow
- Full collection and incremental collection modes
- Automatic filtering of duplicate data
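The incremental mode and duplicate filtering listed above can be pictured with a short sketch. It is an illustration only: the `Announcement` record, the `(source, notice_id)` key, and the `filter_new` helper are hypothetical names that do not come from the project's code, and in the real pipeline the set of known keys would presumably be loaded from the database rather than built in memory.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Announcement:
    """Hypothetical crawled record; the field names are assumptions."""
    source: str      # "SH" or "LH"
    notice_id: str   # identifier of the announcement on the source site
    title: str


def filter_new(
    crawled: Iterable[Announcement],
    known_keys: Set[Tuple[str, str]],
) -> List[Announcement]:
    """Return only announcements whose (source, notice_id) is not yet known.

    Duplicates inside the crawled batch itself are dropped as well.
    """
    seen = set(known_keys)  # copy so the caller's set is not mutated
    fresh = []
    for item in crawled:
        key = (item.source, item.notice_id)
        if key in seen:
            continue  # already stored, or already seen in this batch
        seen.add(key)
        fresh.append(item)
    return fresh


if __name__ == "__main__":
    known = {("SH", "1001")}
    batch = [
        Announcement("SH", "1001", "already stored"),
        Announcement("SH", "1002", "new notice"),
        Announcement("SH", "1002", "new notice (listed twice)"),
        Announcement("LH", "1001", "same id, different source"),
    ]
    for item in filter_new(batch, known):
        print(item.source, item.notice_id, item.title)
```

Under this sketch, a full collection run would pass an empty `known_keys` set, while an incremental run would pass the keys that are already stored.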
┌─────────────────────────────────────────────────────────────────┐
│                          Data sources                           │
├─────────────────────────────┬───────────────────────────────────┤
│  SH website                 │  LH website                       │
│  (static HTML pages)        │  (JavaScript dynamic pages)       │
└──────────────┬──────────────┴───────────────┬───────────────────┘
               │                              │
               ▼                              ▼
┌──────────────────────────┐  ┌──────────────────────────────────┐
│  sh_parsing_with_db.py   │  │  lh_parsing_with_db.py           │
│  ─────────────────────   │  │  ──────────────────────────      │
│  - BeautifulSoup         │  │  - Selenium + ChromeDriver       │
│  - requests              │  │  - Excel download & convert      │
│  - Direct PDF download   │  │  - LibreOffice (soffice)         │
└──────────────┬───────────┘  │  - PDF merge (pypdf)             │
               │              └───────────────┬──────────────────┘
               │                              │
               └───────────────┬──────────────┘
                               ▼
               ┌───────────────────────────────┐
               │         PostgreSQL DB         │
               │  ───────────────────────────  │
               │  - announcements (notices)    │
               │  - pdf_files (attachments)    │
               │  - program_info (programs)    │
               │  - supply_projects (supply)   │
               └───────────────┬───────────────┘
                               │
                               ▼
               ┌───────────────────────────────┐
               │  information_extract_with_db  │
               │  ───────────────────────────  │
               │  - Upstage API calls          │
               │  - PDF → structured data      │
               │  - Save to normalized tables  │
               └───────────────┬───────────────┘
                               │
                               ▼
               ┌───────────────────────────────┐
               │        Apache Airflow         │
               │  ───────────────────────────  │
               │  - Manages 4 DAGs             │
               │  - Scheduled automatic runs   │
               │  - Monitoring & logging       │
               └───────────────────────────────┘
| Category | Technology |
|---|---|
| Workflow | Apache Airflow 2.8.0 |
| Language | Python 3.11 |
| Database | PostgreSQL (remote) |
| ORM | SQLAlchemy |
| Crawling (SH) | BeautifulSoup4, requests |
| Crawling (LH) | Selenium, Chromium, ChromeDriver |
| File conversion | LibreOffice (soffice) |
| PDF processing | pypdf |
| AI/ML | Upstage Information Extract API |
| Containers | Docker, Docker Compose |
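SH:
- National/public rental housing
- Urban-type residential housing
- Purchased rental housing
- Long-term Ansim housing
- Long-term jeonse housing
- Long-term jeonse housing 2 (Mirinae-jip)
- Jeonse rental
- Youth Ansim housing
- Happy Housing

LH:
- National rental
- Public rental
- Permanent rental
- Happy Housing
- Purchased rental
- Purchased rental housing
- Integrated public rental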
- Docker 20.10 or later
- Docker Compose 2.0 or later
- A remote PostgreSQL database
# Create the .env file
cp .env.example .env

# Edit the .env file
nano .env

# Build the images and start the containers
docker-compose up -d

# Check the logs
docker-compose logs -f

# Check the logs of a specific service only
docker-compose logs -f airflow-scheduler

Open http://localhost:8081 in a browser.
- Username: admin (or the value set in .env)
- Password: admin (or the value set in .env)

To run from the web UI:
- Find the DAG you want (for example, sh_housing_pipeline_full)
- Turn on its toggle button to activate it
- Click the "Trigger DAG" button

To run from the CLI:
# SH full collection
docker-compose exec airflow-scheduler airflow dags trigger sh_housing_pipeline_full

# LH full collection
docker-compose exec airflow-scheduler airflow dags trigger lh_housing_pipeline_full

# SH incremental collection
docker-compose exec airflow-scheduler airflow dags trigger sh_housing_pipeline_incremental

# LH incremental collection
docker-compose exec airflow-scheduler airflow dags trigger lh_housing_pipeline_incremental

# Stop the containers
docker-compose down

# Restart the containers
docker-compose restart

# Restart a specific service only
docker-compose restart airflow-scheduler

# Application database settings (required)
DB_HOST=your_remote_db_host
DB_PORT=5432
DB_NAME=housing_db
DB_USER=your_db_user
DB_PASSWORD=your_db_password
# Airflow metadata database settings (required)
AIRFLOW_DB_HOST=your_remote_db_host
AIRFLOW_DB_PORT=5432
AIRFLOW_DB_NAME=airflow
AIRFLOW_DB_USER=your_db_user
AIRFLOW_DB_PASSWORD=your_db_password
# Upstage API key (required)
UPSTAGE_API_KEY=your_upstage_api_key_here
# Docker settings
AIRFLOW_UID=50000
_AIRFLOW_WWW_USER_USERNAME=admin
_AIRFLOW_WWW_USER_PASSWORD=admin
# File storage settings
ATTACH_BASE_DIR=attachments
SAVE_JSON_BACKUP=false
# LibreOffice settings (for LH crawling)
LIBREOFFICE_ENABLED=true
EXCEL_LAYOUT_SAFE_MODE=true

| DAG name | Target | Run mode | Schedule | Purpose |
|---|---|---|---|---|
| sh_housing_pipeline_full | SH | Manual | - | Initial full data collection |
| sh_housing_pipeline_incremental | SH | Automatic | Daily at 09:00 | Daily collection of new announcements |
| lh_housing_pipeline_full | LH | Manual | - | Initial full data collection |
| lh_housing_pipeline_incremental | LH | Automatic | Daily at 09:30 | Daily collection of new announcements |
| Item | SH | LH |
|---|---|---|
| Crawling method | BeautifulSoup (static) | Selenium (dynamic) |
| File formats | PDF | PDF + Excel |
| Excel conversion | Not applicable | Converted with LibreOffice |
| Run time | Fast | Slow (drives a browser) |
| Retry count | 2 | 1 |
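The LibreOffice conversion step for LH Excel attachments can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `build_convert_command` and `convert_to_pdf` are hypothetical helper names, the file paths are placeholders, and settings such as `EXCEL_LAYOUT_SAFE_MODE` are not modeled. The `--headless`, `--convert-to`, and `--outdir` options are standard `soffice` command-line options.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(xlsx_path: Path, out_dir: Path, binary: str = "soffice") -> List[str]:
    """Build the LibreOffice command that converts one spreadsheet to PDF."""
    return [
        binary,
        "--headless",             # run without a GUI
        "--convert-to", "pdf",    # target format
        "--outdir", str(out_dir),
        str(xlsx_path),
    ]


def convert_to_pdf(xlsx_path: Path, out_dir: Path, timeout: int = 120) -> Path:
    """Convert a spreadsheet to PDF and return the expected output path.

    Raises RuntimeError if `soffice` is not on PATH,
    subprocess.CalledProcessError if the conversion exits with an error,
    and subprocess.TimeoutExpired if it takes longer than `timeout` seconds.
    """
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found; install LibreOffice first")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(xlsx_path, out_dir), check=True, timeout=timeout)
    # LibreOffice writes <input stem>.pdf into the output directory.
    return out_dir / (xlsx_path.stem + ".pdf")


if __name__ == "__main__":
    # Only prints the command; nothing is converted here.
    cmd = build_convert_command(Path("notice.xlsx"), Path("attachments/LH"))
    print(" ".join(cmd))
```

Keeping the command construction separate from the `subprocess` call makes the command easy to inspect or log before a slow browser-plus-conversion run.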
init_database (DB initialization)
↓
crawl_*_data (crawling & PDF download)
↓
extract_*_info (information extraction & saving to the DB)
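The ordering above, together with the per-source retry counts, can be illustrated with a plain-Python stand-in. This sketch deliberately does not use Airflow's API: `run_in_order` is a hypothetical helper, the task names `crawl_sh_data` and `extract_sh_info` are assumed expansions of the `*` wildcard, and the "skipped" label is a simplification of how a scheduler treats tasks downstream of a failure.

```python
from typing import Callable, Dict, List, Tuple


def run_in_order(
    tasks: List[Tuple[str, Callable[[], None]]],
    retries: int,
) -> Dict[str, str]:
    """Run tasks one after another; stop at the first task that keeps failing.

    Each task gets 1 initial attempt plus `retries` extra attempts.
    Returns a mapping of task name to "success", "failed", or "skipped".
    """
    status = {name: "skipped" for name, _ in tasks}
    for name, func in tasks:
        for attempt in range(1 + retries):
            try:
                func()
                status[name] = "success"
                break
            except Exception as exc:
                print(f"{name}: attempt {attempt + 1} failed ({exc})")
        else:
            # Every attempt raised: mark the task failed and stop the chain.
            status[name] = "failed"
            break  # downstream tasks stay "skipped"
    return status


if __name__ == "__main__":
    calls = {"crawl": 0}

    def init_database() -> None:
        pass  # placeholder for DB initialization

    def crawl_sh_data() -> None:
        calls["crawl"] += 1
        if calls["crawl"] == 1:
            raise RuntimeError("temporary network error")

    def extract_sh_info() -> None:
        pass  # placeholder for information extraction

    result = run_in_order(
        [
            ("init_database", init_database),
            ("crawl_sh_data", crawl_sh_data),
            ("extract_sh_info", extract_sh_info),
        ],
        retries=2,  # the SH retry count from the comparison table
    )
    print(result)
```

The point of the sketch is only that extraction never runs on a failed crawl, and that a transient crawl error is absorbed by the retry budget.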
In the DAG files under the dags/ folder, edit schedule_interval:
# Every day at 9:00 AM
schedule_interval='0 9 * * *'
# Every Monday at 9:00 AM
schedule_interval='0 9 * * 1'
# Every Monday, Wednesday, and Friday at 9:00 AM
schedule_interval='0 9 * * 1,3,5'

Using Docker is recommended. Running locally requires the following additional installations.
- Python 3.8 or later
- PostgreSQL 12 or later
- Chromium + ChromeDriver (for LH crawling)
- LibreOffice (for LH Excel conversion)
# Python dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Chromium & ChromeDriver
brew install chromium
brew install chromedriver

# LibreOffice
brew install --cask libreoffice

# Initialize the database
python database.py

# SH crawling
python sh_parsing_with_db.py

# LH crawling
python lh_parsing_with_db.py

# Information extraction
python information_extract_with_db.py

.
├── dags/                               # Airflow DAG files
│   ├── sh_housing_pipeline_full.py
│   ├── sh_housing_pipeline_incremental.py
│   ├── lh_housing_pipeline_full.py
│   └── lh_housing_pipeline_incremental.py
├── scripts/                            # Utility scripts
│   ├── check_db.py                     # Check DB status
│   ├── check_airflow_db.py             # Check the Airflow DB
│   ├── resets_db.py                    # Reset the DB
│   ├── migrate_pdf_to_minio.py         # MinIO migration
│   └── test_db_connection.py           # DB connection test
├── attachments/                        # Downloaded files
│   ├── SH/                             # SH (one folder per application type)
│   └── LH/                             # LH (one folder per application type)
├── schema/                             # JSON schema
│   └── schema.json
├── logs/                               # Airflow run logs
├── plugins/                            # Custom Airflow plugins
├── config.py                           # Environment configuration
├── database.py                         # DB connection management
├── models.py                           # SQLAlchemy ORM models
├── sh_parsing_with_db.py               # SH crawling module
├── lh_parsing_with_db.py               # LH crawling module
├── information_extract_with_db.py      # Information extraction module
├── Dockerfile                          # Docker image definition
├── docker-compose.yml                  # Container orchestration
├── requirements.txt                    # Python dependencies
└── README.md