
Commit 1f8a504

architecture docs for scraper integration
1 parent e523bb0 commit 1f8a504


7 files changed: +1562 -0 lines changed

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
# PropertyWebScraper Microservice Integration

## Overview

This document set describes the plan for integrating **PropertyWebScraper** (PWS) as an external extraction microservice into **PropertyWebBuilder** (PWB).

### Problem

PWB has a built-in scraping system using Ruby "Pasarela" classes, one per portal (Rightmove, Zoopla, Idealista, etc.). This approach has drawbacks:

1. **Adding a new portal requires Ruby code:** a new Pasarela class, tests, and deployment
2. **Portal HTML changes break extraction:** fixing requires a PWB code change and redeploy
3. **Limited portal coverage:** PWB supports ~10 portals vs PWS's 18+
4. **Duplicate effort:** PWS already solves extraction with a mature, JSON-config-driven approach

### Solution

Use PWS as a dedicated extraction microservice. PWB sends a property URL (+ optional pre-rendered HTML) to PWS, receives structured property data back, and imports it through the existing `PropertyImportFromScrapeService`.

```
PWB (Rails)                                              PWS (Astro/Rails)
──────────                                               ──────────────────
User enters URL ──→ ExternalScraperClient ──HTTP POST──→ /api/v1/extract
                              │                                 │
                              │                     HtmlExtractor + Mappings
                              │                                 │
                        receives JSON ←──────────────── { property data }
                              │
               PropertyImportFromScrapeService
                              │
               RealtyAsset + Listing + Photos
```

### Documents in This Set

| #  | Document              | Purpose                                     |
|----|-----------------------|---------------------------------------------|
| 00 | OVERVIEW (this file)  | High-level summary                          |
| 01 | ARCHITECTURE          | System architecture and data flow           |
| 02 | DATA_MAPPING          | Field-by-field mapping between PWS and PWB  |
| 03 | API_CONTRACT          | Exact API request/response specification    |
| 04 | PWS_CHANGES           | Recommended changes to PropertyWebScraper   |
| 05 | PWB_CHANGES           | Changes needed in PropertyWebBuilder        |
| 06 | IMPLEMENTATION_PHASES | Phased rollout plan                         |

### Decision Record

- **Integration style:** HTTP microservice (not gem dependency, not code port)
- **PWS deployment target:** Astro app (modern, actively maintained) with Rails engine as fallback
- **Auth:** API key via `X-Api-Key` header
- **Fallback:** PWB retains its Pasarela system as a fallback when PWS is unavailable
- **HTML source:** PWB fetches HTML via its existing connectors (HTTP/Playwright), sends to PWS for extraction only
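
To make the decisions above concrete, here is a minimal sketch of the request PWB might build for PWS. It only constructs the request and does not send it. The helper name `build_extract_request`, its keyword arguments, and the example URLs are assumptions for illustration; the `/api/v1/extract` path, the `{ url, html }` payload, and the `X-Api-Key` header come from this document, while the exact contract is left to 03 API_CONTRACT.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) the POST request that PWB
# would issue to PWS. Method and argument names are illustrative only.
def build_extract_request(base_url:, api_key:, property_url:, html: nil)
  uri = URI.join(base_url, "/api/v1/extract")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request["X-Api-Key"] = api_key      # auth decision from the record above
  payload = { url: property_url }
  payload[:html] = html if html       # HTML already fetched by PWB's connectors
  request.body = JSON.generate(payload)
  request
end

request = build_extract_request(
  base_url: "https://scraper.example.com",
  api_key: "your-api-key-here",
  property_url: "https://www.example.com/properties/123",
  html: "<html>...</html>"
)
```

Keeping request construction separate from sending makes the header and payload easy to check in a unit test without a running PWS instance.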
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
# 01 - Architecture

## Separation of Concerns

The integration splits responsibilities cleanly:

| Responsibility | Owner | Rationale |
|----------------|-------|-----------|
| HTML fetching (HTTP, Playwright) | PWB | Already has robust connectors with blocking detection, Playwright fallback, manual HTML entry |
| HTML extraction (parsing, field mapping) | PWS | Purpose-built extraction engine with JSON-configurable mappings for 18+ portals |
| Portal registry (which host → which parser) | PWS | Maintains the mapping between hostnames and scraper configurations |
| Data import (creating RealtyAsset, Listings, Photos) | PWB | Owns the data model, multi-tenancy, and business logic |
| Scrape history & deduplication | PWB | Owns `ScrapedProperty` records scoped to each tenant website |
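
The portal registry row can be illustrated with a small lookup from hostname to mapping name. This is a sketch in Ruby for consistency with the rest of this document set, not PWS's actual code; the constant `PORTAL_MAPPINGS`, the method `mapping_for`, and the hostnames and mapping names in the table are placeholders.

```ruby
require "uri"

# Hypothetical registry: hostnames and mapping names are placeholders,
# not PWS's actual configuration.
PORTAL_MAPPINGS = {
  "rightmove.co.uk" => "rightmove",
  "zoopla.co.uk"    => "zoopla",
  "idealista.com"   => "idealista"
}.freeze

# Returns the mapping name for a listing URL, or nil for an unsupported portal.
def mapping_for(url)
  host = URI.parse(url).host.to_s.downcase.sub(/\Awww\./, "")
  PORTAL_MAPPINGS[host]
rescue URI::InvalidURIError
  nil
end
```

Returning `nil` for an unknown host gives the caller a simple signal for the "unsupported portal" case.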

## System Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                    PropertyWebBuilder (PWB)                     │
│                                                                 │
│  ┌──────────────┐    ┌─────────────────────┐                    │
│  │ URL Import   │───→│ PropertyScraperSvc  │                    │
│  │ Controller   │    │ (orchestrator)      │                    │
│  └──────────────┘    └──────────┬──────────┘                    │
│                                 │                               │
│            ┌────────────────────┴────────────┐                  │
│            │                                 │                  │
│  ┌─────────▼─────────┐           ┌───────────▼───────────┐      │
│  │ ScraperConnector  │           │ ExternalScraperClient │ [NEW]│
│  │ (HTTP/Playwright) │           │ (calls PWS API)       │      │
│  └─────────┬─────────┘           └───────────┬───────────┘      │
│            │                                 │                  │
│            │ raw HTML                        │ raw HTML         │
│            │                                 │                  │
│  ┌─────────▼─────────┐                       │                  │
│  │ Pasarela          │                       │                  │
│  │ (local parser)    │           ┌───────────▼───────────┐      │
│  │ [FALLBACK]        │           │ PWS Microservice      │      │
│  └─────────┬─────────┘           │ (HTTP POST)           │      │
│            │                     └───────────┬───────────┘      │
│            │                                 │                  │
│            └────────────────┬────────────────┘                  │
│                             │                                   │
│                ┌────────────▼────────────┐                      │
│                │ extracted_data JSON     │                      │
│                └────────────┬────────────┘                      │
│                             │                                   │
│                ┌────────────▼────────────┐                      │
│                │ ScrapedProperty         │                      │
│                │ (stored to DB)          │                      │
│                └────────────┬────────────┘                      │
│                             │                                   │
│                ┌────────────▼────────────┐                      │
│                │ ImportFromScrapeService │                      │
│                │   → RealtyAsset         │                      │
│                │   → SaleListing/Rental  │                      │
│                │   → PropPhotos          │                      │
│                └─────────────────────────┘                      │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                    PropertyWebScraper (PWS)                     │
│                                                                 │
│  ┌─────────────────┐                                            │
│  │ /api/v1/extract │ ←── POST { url, html }                     │
│  │ (new endpoint)  │                                            │
│  └────────┬────────┘                                            │
│           │                                                     │
│  ┌────────▼─────────┐    ┌──────────────────┐                   │
│  │ PortalRegistry   │───→│ ScraperMapping   │                   │
│  │ (host → mapping) │    │ (JSON config)    │                   │
│  └──────────────────┘    └────────┬─────────┘                   │
│                                   │                             │
│  ┌────────────────────────────────▼────────────────────┐        │
│  │ HtmlExtractor                                       │        │
│  │ - defaultValues → set country, currency, etc.       │        │
│  │ - textFields    → title, description, address, etc. │        │
│  │ - intFields     → bedrooms, bathrooms, etc.         │        │
│  │ - floatFields   → price, lat, lng, area, etc.       │        │
│  │ - booleanFields → for_sale, for_rent, etc.          │        │
│  │ - images        → image_urls array                  │        │
│  │ - features      → features array                    │        │
│  │ - ScrapedContentSanitizer → clean & validate        │        │
│  └────────────────────────────────┬────────────────────┘        │
│                                   │                             │
│                      { extracted property JSON }                │
└─────────────────────────────────────────────────────────────────┘
```
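
The typed field groups listed under `HtmlExtractor` suggest a coercion step that casts raw scraped strings by group. The sketch below shows one way such a step could look. It is written in Ruby for consistency with this document set and is not PWS's implementation: the group names come from the diagram, while `COERCIONS`, `coerce_fields`, the input shape, and the parsing rules are assumptions.

```ruby
# Hypothetical coercion step: takes raw strings already pulled from the HTML
# and casts them according to the field group they belong to.
COERCIONS = {
  "textFields"    => ->(raw) { raw.strip },
  "intFields"     => ->(raw) { raw[/-?\d+/].to_i },
  "floatFields"   => ->(raw) { raw.delete(",")[/-?\d+(\.\d+)?/].to_f },
  "booleanFields" => ->(raw) { %w[true yes 1].include?(raw.strip.downcase) }
}.freeze

# Starts from the mapping's default values, then overlays each coerced field.
def coerce_fields(raw_by_group, defaults = {})
  raw_by_group.each_with_object(defaults.dup) do |(group, fields), result|
    coerce = COERCIONS.fetch(group)
    fields.each { |name, raw| result[name] = coerce.call(raw) }
  end
end

property = coerce_fields(
  {
    "textFields"    => { "title" => "  Two-bed flat  " },
    "intFields"     => { "bedrooms" => "2 bedrooms" },
    "floatFields"   => { "price" => "250,000" },
    "booleanFields" => { "for_sale" => "true" }
  },
  { "currency" => "GBP" }
)
```

Unparseable numeric input falls back to `0` or `0.0` in this sketch; a real sanitizer would more likely flag the field as missing.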

## Request Flow (Happy Path)

1. User enters a property URL in PWB's admin UI
2. `PropertyScraperService` creates/finds a `ScrapedProperty` record
3. PWB's HTTP or Playwright connector fetches the raw HTML
4. `ExternalScraperClient` sends `{ url, html }` to PWS's `/api/v1/extract` endpoint
5. PWS identifies the portal from the URL host, loads the JSON mapping, runs `HtmlExtractor`
6. PWS returns `{ success: true, data: { asset_data, listing_data, images } }`
7. PWB saves `extracted_data` and `extracted_images` to the `ScrapedProperty` record
8. User previews and confirms the import
9. `PropertyImportFromScrapeService` creates `RealtyAsset` + `SaleListing`/`RentalListing` + `PropPhoto` records
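
Steps 6 and 7 can be sketched as a small parsing function that turns the PWS response body into the two values PWB stores. The response envelope follows step 6; the `ExtractionResult` struct, the `parse_extract_response` name, and the way `asset_data` and `listing_data` are grouped under `extracted_data` are assumptions, since the field-level mapping is defined in 02 DATA_MAPPING.

```ruby
require "json"

# Hypothetical result container; attribute names mirror step 7 of the flow.
ExtractionResult = Struct.new(:extracted_data, :extracted_images, keyword_init: true)

# Returns an ExtractionResult, or nil when the body is not a successful response.
def parse_extract_response(body)
  parsed = JSON.parse(body)
  return nil unless parsed["success"] == true

  data = parsed.fetch("data", {})
  ExtractionResult.new(
    extracted_data: {
      "asset_data"   => data.fetch("asset_data", {}),
      "listing_data" => data.fetch("listing_data", {})
    },
    extracted_images: data.fetch("images", [])
  )
rescue JSON::ParserError
  nil
end
```

A `nil` return lets the caller treat malformed or unsuccessful responses the same way as a transport failure.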

## Fallback Strategy

```
ExternalScraperClient.call(url, html)
│
├── Success → use PWS extracted data
│
└── Failure (timeout, 5xx, connection refused, unsupported portal)
    │
    └── Fall back to local Pasarela
        │
        ├── Success → use local extracted data
        │
        └── Failure → show "manual HTML entry" form to user
```

A PWS failure triggers the fallback silently. The `ScrapedProperty` record tracks which extraction method was used via a new `extraction_source` field: `"external"`, `"local"`, or `"manual"`.
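
The fallback chain can be sketched as a loop over extractors that records which source produced the data. The method name `extract_with_fallback` and the convention that each extractor is a callable returning a hash (or `nil`, or raising) are assumptions; the three `extraction_source` values are the ones listed above.

```ruby
# Each extractor is a callable taking (url, html) and returning a data hash,
# or nil / raising on failure. Names here are illustrative only.
def extract_with_fallback(url, html, external:, local:)
  [["external", external], ["local", local]].each do |source, extractor|
    data = begin
      extractor.call(url, html)
    rescue StandardError
      nil # timeout, 5xx, connection refused, unsupported portal, ...
    end
    return { extraction_source: source, data: data } if data
  end
  # Both extractors failed: the caller shows the manual HTML entry form.
  { extraction_source: "manual", data: nil }
end
```

For example, passing an `external` lambda that raises and a `local` lambda that returns a hash yields `extraction_source: "local"`, matching the middle branch of the tree.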

## Configuration

PWB needs these settings (stored in ENV or `Rails.application.credentials`):

```bash
# Required
PWS_API_URL=https://scraper.example.com   # Base URL of PWS deployment
PWS_API_KEY=your-api-key-here             # Matches PWS's PROPERTY_SCRAPER_API_KEY

# Optional
PWS_TIMEOUT=15                            # HTTP timeout in seconds (default: 15)
PWS_ENABLED=true                          # Feature flag to enable/disable (default: true)
```
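
As an illustration, the settings above could be read on the PWB side with a small loader. This sketch covers the ENV case only. The `PwsConfig` struct and `load_pws_config` method are hypothetical names; the variable names and defaults are the ones listed above.

```ruby
# Hypothetical settings object; only the four variable names and their
# defaults come from the configuration list above.
PwsConfig = Struct.new(:api_url, :api_key, :timeout, :enabled, keyword_init: true)

# Accepts ENV or any Hash-like object responding to #fetch, which keeps it testable.
def load_pws_config(env = ENV)
  PwsConfig.new(
    api_url: env.fetch("PWS_API_URL"),                # required: raises KeyError if missing
    api_key: env.fetch("PWS_API_KEY"),                # required: raises KeyError if missing
    timeout: Integer(env.fetch("PWS_TIMEOUT", "15")), # seconds
    enabled: env.fetch("PWS_ENABLED", "true") == "true"
  )
end
```

Failing fast on a missing required variable surfaces misconfiguration at boot rather than on the first import attempt.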

## Deployment Considerations

- PWS can be deployed as a standalone Astro SSR app (recommended) or as a Rails engine
- No shared database: communication is purely via HTTP/JSON
- PWS is stateless for extraction; it doesn't need to persist anything for PWB's use case
- PWS can serve multiple PWB instances (multi-tenant safe since it just parses HTML)
- Rate limiting on the PWS side is optional since PWB controls fetch timing
