|
1 | | -# GinioCrawler: documentation |
| 1 | +# GinioCrawler |
2 | 2 |
|
3 | | -## What is it about? |
| 3 | +## DESCRIPTION |
4 | 4 |
|
5 | | -A small app that searches for companies by phrase (e.g. *“producenci granulatu Polska”*), downloads their pages, and extracts contact details (emails, phone numbers). It saves the results to **CSV** and **XLSX**. |
| 5 | +Small, pragmatic lead-gen helper: type a query → get company contacts → export to Excel/CSV. Built to help small businesses assemble contact lists without manual copy-paste. |
6 | 6 |
|
7 | | -## Requirements |
| 7 | +– Uses SerpAPI (Google results) |
8 | 8 |
|
9 | | -* Python 3.10+ (dev) / Windows 10+ (EXE) |
10 | | -* Search engine key: **SERPAPI\_KEY** [SERPAPI LINK](https://serpapi.com) |
11 | | -* Internet 😅 |
| 9 | +– Extracts emails and phone numbers from target pages |
12 | 10 |
|
13 | | -## Installation (dev) |
| 11 | +– Exports to .xlsx and .csv |
14 | 12 |
|
15 | | -```bash |
16 | | -python -m venv .venv |
17 | | -# Win PowerShell: |
18 | | -.venv\Scripts\Activate.ps1 |
19 | | -# macOS/Linux: |
20 | | -# source .venv/bin/activate |
| 13 | +– If SERPAPI\_KEY is missing, the app will prompt for it on first run |
21 | 14 |
|
22 | | -pip install -r requirements.txt |
23 | | -# if you use the GUI and key saving: |
24 | | -pip install python-dotenv pandas openpyxl |
25 | | -``` |
| 15 | +## DEMO |
| 16 | + |
| 17 | +// TODO: provide sample |
| 18 | + |
| 19 | +## FEATURES |
| 20 | + |
| 21 | +– Targeted search via SerpAPI (country/language aware) |
| 22 | + |
| 23 | +– Email and phone extraction from result pages |
26 | 24 |
|
27 | | -## SERPAPI key configuration |
| 25 | +– Clean Excel/CSV export with consistent columns |
28 | 26 |
|
29 | | -You have two options: |
| 27 | +– Simple GUI flow (and basic CLI) |
30 | 28 |
|
31 | | -1. **Environment variable** |
32 | | - Windows (PowerShell): |
| 29 | +– Safety knobs: polite delays and rate limits |
33 | 30 |
|
34 | | - ```powershell |
35 | | - setx SERPAPI_KEY "YOUR_KEY" |
36 | | - ``` |
| 31 | +## ARCHITECTURE (HIGH LEVEL) |
37 | 32 |
|
38 | | - Then restart the terminal/application. |
39 | | -2. **The GUI saves the key by itself** (if you have `ensure_api_key()`): |
40 | | - On the first run of **app\_gui.py** / the EXE, a window pops up → you paste the key → it is saved to |
41 | | - `%APPDATA%\GinioCrawler\.env`. |
| 33 | +**Query → SerpAPI (Google) → result URLs → fetch and parse → extract contacts → dedupe → export (xlsx/csv)** |
42 | 34 |
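The "extract contacts → dedupe" stages of the pipeline above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regular expressions, the `extract_contacts` and `dedupe` names, and the choice to dedupe on a normalized URL are all assumptions.

```python
import re

# Deliberately simple patterns (assumptions); real-world contact formats vary a lot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contacts(text: str) -> dict:
    """Return unique, normalized emails and phone numbers found in text."""
    emails = sorted({m.lower() for m in EMAIL_RE.findall(text)})
    # Keep only digits and a leading plus so "+48 123 456 789" and
    # "+48-123-456-789" collapse to the same value.
    phones = sorted({re.sub(r"[^\d+]", "", m) for m in PHONE_RE.findall(text)})
    return {"emails": emails, "phones": phones}


def dedupe(rows: list) -> list:
    """Drop rows whose URL was already seen (case and trailing slash ignored)."""
    seen = set()
    unique = []
    for row in rows:
        key = row["url"].rstrip("/").lower()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Lowercasing emails and stripping phone separators before comparing keeps the same contact from appearing twice just because it was typed differently on two pages.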
|
43 | | -## Running: console (CLI) |
| 35 | +## REQUIREMENTS |
| 36 | + |
| 37 | +– Python 3.9+ |
| 38 | + |
| 39 | +– SerpAPI account (free tier works): [serpapi.com](https://serpapi.com) |
| 40 | + |
| 41 | +*Note: no manual env setup required; the app will ask for the key if it’s missing.* |
| 42 | + |
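The "ask for the key if it is missing" behavior could look roughly like the sketch below. The function name `resolve_api_key` and its parameters are hypothetical; the app's real logic (for example, saving the key to a `.env` file) may differ.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_api_key(
    env: Optional[Mapping[str, str]] = None,
    prompt: Callable[[str], str] = input,
) -> str:
    """Return SERPAPI_KEY from the environment, or ask the user for it."""
    env = os.environ if env is None else env
    key = (env.get("SERPAPI_KEY") or "").strip()
    if key:
        return key
    # No key configured: fall back to an interactive prompt.
    key = prompt("Enter your SerpAPI key: ").strip()
    if not key:
        raise ValueError("SERPAPI_KEY is required")
    return key
```

Passing `env` and `prompt` as parameters keeps the function easy to test without touching the real environment or blocking on keyboard input.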
| 43 | +## **QUICKSTART — GUI** |
44 | 44 |
|
45 | 45 | ```bash |
| 46 | +# 1. Clone the repository |
| 47 | +git clone https://github.com/SculptTechProject/GinioCrawler.git |
| 48 | +# 2. Enter the project directory |
| 49 | +cd GinioCrawler |
| 50 | +# 3. Install dependencies |
| 51 | +pip install -r requirements.txt |
| 52 | +# 4. Run your entry script, for example: |
| 53 | +python app_gui.py |
| 54 | +# If SERPAPI_KEY is not set, the app will prompt for it and continue. |
| 55 | +``` |
| 56 | + |
| 57 | +## QUICKSTART — CLI |
| 58 | + |
| 59 | +```bash |
| 60 | +# 1. Clone the repository |
| 61 | +git clone https://github.com/SculptTechProject/GinioCrawler.git |
| 62 | +# 2. Enter the project directory |
| 63 | +cd GinioCrawler |
| 64 | +# 3. Install dependencies |
| 65 | +pip install -r requirements.txt |
| 66 | +# 4. Make sure SERPAPI_KEY is set, then run: |
46 | 67 | python main.py |
47 | | -# enter a phrase, e.g. "SoftwareHouse Warszawa" |
48 | 68 | ``` |
49 | 69 |
|
50 | | -Results are written to: |
| 70 | +## OUTPUT SCHEMA (TYPICAL COLUMNS) |
| 71 | + |
| 72 | +// TODO: provide sample |
| 73 | + |
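Until a real sample is provided, here is a hedged sketch of what a CSV export might look like. It assumes the columns listed in the previous README version shown in this diff (`url, title, emails, phones, contact_url`, with multiple values joined by a space); the `write_csv` helper is hypothetical.

```python
import csv
import io

# Column names taken from the previous README version; treat them as an assumption.
COLUMNS = ["url", "title", "emails", "phones", "contact_url"]


def write_csv(rows, stream) -> None:
    """Write rows to a text stream with a fixed column order."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        flat = dict(row)
        # Multiple emails/phones share one cell, separated by a space.
        flat["emails"] = " ".join(row.get("emails", []))
        flat["phones"] = " ".join(row.get("phones", []))
        writer.writerow(flat)


buffer = io.StringIO()
write_csv(
    [{"url": "https://a.pl", "title": "A", "emails": ["x@a.pl", "y@a.pl"],
      "phones": [], "contact_url": ""}],
    buffer,
)
print(buffer.getvalue())
```

When writing to disk, opening the file with `open(path, "w", newline="", encoding="utf-8-sig")` adds the byte order mark that helps Excel detect UTF-8.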
| 74 | +## GOOD CITIZEN (ETHICS AND LIMITS) |
| 75 | + |
| 76 | +– Respect websites’ robots.txt and Terms of Service |
| 77 | + |
| 78 | +– Keep reasonable rate limits; do not hammer the same domain |
| 79 | + |
| 80 | +– SerpAPI has quotas; heavy usage may require a paid plan |
| 81 | + |
| 82 | +– Use responsibly; this tool is for legitimate contact discovery (no spam) |
| 83 | + |
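The first two points (robots.txt and per-domain rate limits) can be sketched with the standard library alone. The `is_allowed` helper, the `DomainThrottle` class, and the `"GinioCrawler"` user-agent string are illustrative assumptions, not the project's actual implementation.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "GinioCrawler") -> bool:
    """Check a URL against the already-downloaded text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class DomainThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay: float = 1.0, clock=time.monotonic, sleep=time.sleep):
        self._min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, url: str) -> float:
        """Sleep if the host was hit too recently; return the pause in seconds."""
        host = urlparse(url).netloc.lower()
        now = self._clock()
        last = self._last.get(host)
        pause = 0.0 if last is None else max(0.0, self._min_delay - (now - last))
        if pause > 0:
            self._sleep(pause)
        self._last[host] = now + pause
        return pause
```

Calling `throttle.wait(url)` before each fetch spaces out requests per host while leaving requests to different hosts unaffected.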
| 84 | +## TROUBLESHOOTING |
51 | 85 |
|
52 | | -* `wyniki/csv/wyniki_YYYYMMDD_HHMMSS.csv` |
53 | | -* `wyniki/excel/wyniki_YYYYMMDD_HHMMSS.xlsx` |
| 86 | +– Empty results: make the query more specific; check SerpAPI quota; set proper country/lang |
54 | 87 |
|
55 | | -Columns: `url, title, emails, phones, contact_url`. |
56 | | -In `emails` and `phones`, values are separated by a **space**. |
| 88 | +– Slow or blocked: increase delays, lower concurrency, fetch fewer pages |
57 | 89 |
|
58 | | -## Running: GUI |
| 90 | +– Excel won’t open: try CSV, or ensure .xlsx is written with a supported library |
| 91 | + |
| 92 | +– Key prompt loops: verify your SerpAPI key and remaining credits |
| 93 | + |
| 94 | +## PACKAGING (DISTRIBUTABLES) |
| 95 | + |
| 96 | +**Windows (.exe):** |
59 | 97 |
|
60 | 98 | ```bash |
61 | | -python app_gui.py |
62 | | -``` |
| 99 | +pip install pyinstaller |
63 | 100 |
|
64 | | -* Enter a phrase. |
65 | | -* (Optional) click **Wybierz…** (Choose…) and pick an output folder (it will create `csv/` and `excel/` inside). |
66 | | -* Click **Start**. When finished, it opens the folder with the Excel file. |
| 101 | +pyinstaller --onefile --name GinioCrawler app.py |
67 | 102 |
|
68 | | -## Building the EXE (Windows) |
| 103 | +# Output: dist/GinioCrawler.exe |
| 104 | +``` |
| 105 | + |
| 106 | +**macOS (.app / .dmg):** |
69 | 107 |
|
70 | 108 | ```bash |
71 | 109 | pip install pyinstaller |
72 | | -pyinstaller --onefile --windowed --name "GinioCrawler" app_gui.py |
73 | | -# optional: --icon icon.ico |
| 110 | + |
| 111 | +pyinstaller --windowed --name GinioCrawler app.py |
| 112 | + |
| 113 | +hdiutil create -volname GinioCrawler -srcfolder dist/GinioCrawler.app -ov -format UDZO dist/GinioCrawler.dmg |
74 | 114 | ``` |
75 | 115 |
|
76 | | -You will find the file in `dist/GinioCrawler.exe`. Create a desktop shortcut. |
| 116 | +*Note: unsigned app; users can open via Right-click → Open. (Signing/notarization can be added later in CI.)* |
| 117 | + |
| 118 | +## TESTS (WHAT TO COVER + QUICK START) |
| 119 | + |
| 120 | +Install and run: |
| 121 | + |
| 122 | +`pip install pytest` |
| 123 | + |
| 124 | +`pytest -q` |
| 125 | + |
| 126 | +Recommended coverage: |
| 127 | + |
| 128 | +– search/SerpAPI: correct request, pagination, error handling, and rate-limit behavior |
| 129 | + |
| 130 | +– fetch: retries with backoff, timeouts, robots.txt respected |
| 131 | + |
| 132 | +– extract: email/phone patterns (various formats), duplicates handling, URL normalization |
| 133 | + |
| 134 | +– export: column order and names, files openable in Excel and CSV |
| 135 | + |
| 136 | +– CLI/UX: missing SERPAPI\_KEY triggers prompt; flag parsing; happy path without real network calls (mocked) |
| 137 | + |
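As an illustration of the "happy path without real network calls" item, a test can inject a mocked fetch function instead of hitting the network. The `crawl` function below is a hypothetical stand-in for the project's real pipeline, defined here only so the example is self-contained.

```python
import re
from unittest.mock import Mock

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def crawl(urls, fetch):
    """Hypothetical pipeline step: fetch each URL and collect the emails found."""
    return [{"url": url, "emails": EMAIL_RE.findall(fetch(url))} for url in urls]


def test_crawl_happy_path_without_network():
    # The mock replaces the real HTTP call, so the test is fast and offline.
    fetch = Mock(return_value="<p>Contact: office@example.com</p>")
    rows = crawl(["https://example.com"], fetch)
    fetch.assert_called_once_with("https://example.com")
    assert rows == [{"url": "https://example.com", "emails": ["office@example.com"]}]


def test_crawl_page_without_contacts():
    fetch = Mock(return_value="<p>No contact details here</p>")
    assert crawl(["https://example.com"], fetch) == [
        {"url": "https://example.com", "emails": []}
    ]
```

Saved in a file such as `test_crawl.py`, these functions are picked up by `pytest -q` without any extra configuration.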
| 138 | +## ROADMAP (SUGGESTED) |
| 139 | + |
| 140 | +– Saved queries and recent exports |
| 141 | + |
| 142 | +– De-duplication across sessions |
| 143 | + |
| 144 | +– Fallback engines and smarter retry strategy |
77 | 145 |
|
78 | | -## How it works (technical summary) |
| 146 | +– Better parsing and validation for contacts |
79 | 147 |
|
80 | | -* **SerpAPI** returns a list of URLs for the phrase. |
81 | | -* **httpx + BeautifulSoup** fetches each page and looks for emails/phone numbers and a **Kontakt** (contact) link (depth 1). |
82 | | -* Respects `robots.txt`. |
83 | | -* Output: **CSV (UTF-8-SIG)** + **XLSX** (auto-sized columns, headers, hyperlinks). |
84 | | -* Separator for multiple emails/phone numbers: **space**. |
| 148 | +– Dockerfile for one-command runs |
85 | 149 |
|
86 | | -## Common problems |
| 150 | +## LICENSE |
87 | 151 |
|
88 | | -* **"Missing SERPAPI\_KEY"** – set the environment variable or use the GUI, which saves the key to `.env`. |
89 | | -* **"ModuleNotFoundError: pandas/openpyxl"** – `pip install pandas openpyxl`. |
90 | | -* **Empty results** – the phrase is too general / sites block bots / there is no contact information on the website. |
91 | | -* **Excel merges phone numbers** – in XLSX the "phones" column is text; if it is not, switch the cell format to "Text". |
| 152 | +MIT 👀️ |
92 | 153 |
|
93 | | -## Good practices / ethics |
94 | 154 |
|
95 | | -* Respect **`robots.txt`** and site limits. |
96 | | -* Do not flood sites with parallel requests (you can add `httpx.Limits` and `asyncio.Semaphore`). |
97 | | -* Check each site's terms of service; use official search engine APIs. |
|