Fixes #9 #10

Open

myz21 wants to merge 11 commits into HappyHackingSpace:main from myz21:main

Conversation


myz21 commented Jan 17, 2026

Changes Made to src/main.py

Old System

  • Tried common paths first (/privacy, /privacy-policy)
  • Found something? Return immediately
  • Result: Short URLs preferred → returns landing pages

New System

  1. New functions:
     • _collect_link_candidates() - Collects all privacy links from the page
     • _score_candidate() - Scores each URL by priority and anchor text
     • _pick_best_verified_candidate() - Selects and verifies the best URL
  2. Regex priority ranking:
     • privacy-policy ← HIGHEST (detailed policy)
     • privacy/policy
     • privacy-policy-[a-z]+
     • ...
     • privacy ← LOWEST (hub page)
     • legal
     • terms
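The ranking above can be sketched as an ordered scan over compiled regexes, where a URL's score is the index of the first pattern that matches (lower wins). The pattern list below is illustrative and may not match the exact catalog in src/main.py:

```python
import re

# Illustrative priority catalog; the real list in src/main.py may differ.
_PRIVACY_REGEX_PATTERNS = [
    re.compile(r"privacy-policy"),          # highest: detailed policy page
    re.compile(r"privacy/policy"),
    re.compile(r"privacy-policy-[a-z]+"),
    re.compile(r"privacy"),                 # lowest privacy-ish: hub page
    re.compile(r"legal"),
    re.compile(r"terms"),
]

def priority_index(url: str) -> int:
    """Index of the first matching pattern; lower means higher priority."""
    for idx, pattern in enumerate(_PRIVACY_REGEX_PATTERNS):
        if pattern.search(url):
            return idx
    return 999  # sentinel: URL matches nothing, rank it last
```

Because the more specific patterns come first, a URL containing privacy-policy outranks one that only contains privacy, even though both match the generic pattern.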

3-Tier Discovery Strategy:

  1. Link-based (real links on page)
  2. ↓ (if not found) Sitemap-based (from sitemap.xml)
  3. ↓ (if not found) Common Paths (/privacy, /privacy-policy, etc.)
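Sketched as code, the staged fallback might look like the following; every name here is a stand-in for illustration, not the PR's actual signatures:

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical sketch of the 3-tier discovery flow; the real
# resolve_privacy_url() in src/main.py differs in detail.
def resolve_privacy_url_sketch(
    input_url: str,
    collect_links: Callable[[str], Iterable[str]],
    sitemap_urls: Callable[[str], Iterable[str]],
    common_paths: Callable[[str], Iterable[str]],
    verify: Callable[[str], bool],
) -> Tuple[Optional[str], str]:
    # Phase 1: real links found on the page
    for url in collect_links(input_url):
        if verify(url):
            return url, "link"
    # Phase 2: URLs listed in sitemap.xml
    for url in sitemap_urls(input_url):
        if verify(url):
            return url, "sitemap"
    # Phase 3: probe well-known paths like /privacy, /privacy-policy
    for url in common_paths(input_url):
        if verify(url):
            return url, "common-path"
    return None, "not-found"
```

Each phase only runs if the previous one produced no verified URL, which matches the "↓ (if not found)" arrows above.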

Result

Finds the most detailed policy from real site links and skips landing pages, though it takes longer to run.

  • Old: /privacy (landing page)
  • New: /legal/page/row/privacy-policy/tr (full policy)

Summary by CodeRabbit

  • New Features

    • Multi-phase privacy URL discovery with prioritized pattern matching, sitemap checks, and staged fallbacks
    • Improved candidate collection, scoring and verification to surface best policy links
    • Environment-configurable AI endpoint
  • Improvements

    • Prefer main content when extracting policy text; stronger HTTP and dynamic-page fallbacks
    • Parallelized chunk analysis, larger default chunks, higher token limits
    • Detailed per-phase timing, debug output and performance summary
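The parallelized chunk analysis mentioned above can be sketched with ThreadPoolExecutor; analyze_chunk here stands in for the real analyze_chunk_json() LLM call, and iterating the futures in submission order keeps results ordered:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Optional

# Hedged sketch of parallel chunk analysis; max_workers=5 mirrors the PR.
def analyze_chunks_parallel(
    chunks: list[str],
    analyze_chunk: Callable[[str], Optional[dict[str, Any]]],
    max_workers: int = 5,
) -> list[dict[str, Any]]:
    results: list[dict[str, Any]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(analyze_chunk, c) for c in chunks]
        for idx, future in enumerate(futures, 1):
            res = future.result()  # blocks until this chunk finishes, in order
            if res:
                res["index"] = idx  # remember the chunk's position
                results.append(res)
    return results
```

Chunks still execute concurrently; only the result collection is sequential, so ordering costs nothing extra beyond waiting for the slowest chunk.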


coderabbitai bot commented Jan 17, 2026

Warning

Rate limit exceeded

@myz21 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minute and 46 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3524e645-635f-4fa7-997b-15ec0b214597

📥 Commits

Reviewing files that changed from the base of the PR and between 12950e3 and 40a423f.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • .env.example
  • pyproject.toml
  • src/main.py

Walkthrough

Implements a multi‑phase privacy URL discovery and verification flow in src/main.py (link, sitemap, common paths) with regex‑prioritized matching, candidate scoring/verification, improved content extraction (prefer <main>), parallelized chunk analysis, OpenAI base URL support, and increased LLM token limits.

Changes

  • Policy Discovery & Core Flow (src/main.py): Rewrote resolve_privacy_url() into three phases (link-based, sitemap-based, common paths). Added prioritized regex catalog, candidate collection, scoring, verification, and _improve_candidate() to refine generic links.
  • Link/Sitemap Helpers (src/main.py): Added _collect_link_candidates(), _score_candidate(), _pick_best_verified_candidate(), find_best_policy_url(), and sitemap extraction/processing helpers (_get_sitemaps_from_robots, _fetch_sitemap_urls).
  • Content Fetching & Extraction (src/main.py): Introduced _http_get with richer headers/shorter timeout, main-first HTML extraction (<main> then <body>), Selenium fallback refinement, _extract_text_http, and updated fetch_policy_text() with new fallback order and content checks.
  • Chunking, Analysis & OpenAI (src/main.py): Changed split_text_into_chunks() signature/defaults (larger chunk_size) and updated analyze_chunk_json() return typing to `dict[str, Any]`.
  • Instrumentation & UX (src/main.py): Added timing instrumentation for discovery/fetch/analysis phases, per-phase and per-chunk timing prints, and a performance summary in final output.
  • Typing, Imports & Misc (src/main.py): Expanded type annotations, adjusted exported function signatures, added re and other imports for regex handling, refined timeouts/headers, and minor robustness tweaks.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Resolver as resolve_privacy_url
    participant Fetcher as Page Fetcher
    participant LinkDiscovery as Link Discovery
    participant Sitemap as Sitemap Discovery
    participant Prober as Common Paths Prober
    participant Verifier as Scoring & Verification
    participant OpenAI as Chunk Analysis

    Caller->>Resolver: resolve_privacy_url(input_url)
    activate Resolver

    Resolver->>Fetcher: fetch page & homepage links
    Fetcher-->>LinkDiscovery: provide link candidates
    LinkDiscovery->>Verifier: score & verify candidates
    alt Verified candidate found
        Verifier-->>Resolver: best_url
    else
        Resolver->>Sitemap: fetch robots.txt -> sitemap.xml
        Sitemap->>Verifier: score & verify sitemap candidates
        alt Sitemap candidate found
            Verifier-->>Resolver: best_url
        else
            Resolver->>Prober: probe known common privacy paths
            Prober->>Verifier: score & verify probes
            Verifier-->>Resolver: best_url or fallback
        end
    end

    Resolver->>Caller: return (final_url, source)
    deactivate Resolver

    Note over OpenAI, Resolver: After final URL fetch, split content -> parallel chunk analysis
    Resolver->>OpenAI: analyze chunks (ThreadPoolExecutor, OPENAI_BASE_URL, increased max_tokens)
    OpenAI-->>Resolver: chunk analysis results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

"I hopped through HTML, sitemap, and path,
Regex in paw, I chased the policy's tail,
Phases three and chunks that race,
Threads humming as tokens swell,
A happy rabbit found the privacy trail 🐇"

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Title check ❓ Inconclusive: The title 'Fixes #9' is vague and generic, providing no meaningful information about what changes were made or what problem was solved. Resolution: Replace with a specific title summarizing the main change, such as 'Refactor privacy URL discovery with link-first pipeline and sitemap fallback' or 'Improve privacy policy discovery accuracy with multi-phase strategy'.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage ✅ Passed: Docstring coverage is 95.24%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/main.py (1)

369-411: Remove the duplicated resolve_privacy_url; the earlier version is dead code.
The second definition overwrites the first, so edits to the earlier version (including _improve_candidate usage) never take effect and are easy to miss. Please delete the first definition and, if you still want hub-page deepening, apply _improve_candidate in the active flow.

♻️ Suggested integration in the active resolver
-        if best_url:
-            return best_url, input_url
+        if best_url:
+            return _improve_candidate(best_url), input_url
...
-            return cand, input_url
+            return _improve_candidate(cand), input_url
...
-            return cand, input_url
+            return _improve_candidate(cand), input_url

Also applies to: 487-539

🤖 Fix all issues with AI agents
In `@src/main.py`:
- Around line 414-446: The current _collect_link_candidates filters out URLs
using _is_privacy_like which doesn’t catch "legal"/"terms" patterns from the
regex priority list; update _collect_link_candidates to also accept URLs that
match the module’s priority regex list (use the same regexes used elsewhere
instead of only _is_privacy_like) so links like /legal or /terms are included,
preserve the deduplication map behavior (candidates: Dict[str, str]) and ensure
you still keep the best anchor text per URL when inserting/updating candidates,
and stop when limit is reached.
- Around line 555-571: The code currently defaults OPENAI_BASE_URL to
"https://openrouter.ai/api/v1" which routes users to OpenRouter unexpectedly;
change os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1") to
os.getenv("OPENAI_BASE_URL") so the OpenAI SDK uses its native endpoint by
default (keep the existing OpenAI(client) usage in main.py/resp creation), and
then add a clear entry for OPENAI_BASE_URL with the OpenRouter value to
.env.example and a short note in README instructing users to set
OPENAI_BASE_URL=https://openrouter.ai/api/v1 explicitly if they want OpenRouter.

Comment on lines 555 to 571

     """Analyze a text chunk with the LLM and return one JSON object."""
     api_key = os.getenv("OPENAI_API_KEY")
     base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")

     if not api_key:
         raise RuntimeError("OPENAI_API_KEY is not set. Configure your .env file.")
-    client = OpenAI(api_key=api_key)
+    #client = OpenAI(api_key=api_key)
+    client = OpenAI(api_key=api_key, base_url=base_url)
     resp = client.chat.completions.create(
         model=model,
         messages=[
             {"role": "system", "content": SYSTEM_SCORER},
             {"role": "user", "content": build_user_prompt(text_chunk)},
         ],
         temperature=0,
-        max_tokens=600,
+        max_tokens=2000, #changed from 300 to 2000
         response_format={"type": "json_object"},

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's verify the actual code at the specified lines in src/main.py
head -571 src/main.py | tail -20

Repository: HappyHackingSpace/privacy-policy-analyzer

Length of output: 902


🏁 Script executed:

# Let's also check the full context around those lines (550-580) to understand the function
sed -n '550,580p' src/main.py

Repository: HappyHackingSpace/privacy-policy-analyzer

Length of output: 1158


🏁 Script executed:

# Check if there's any documentation or .env files that explain the expected behavior
fd -t f -e md -e txt -e env | grep -E "(README|\.env|config|SETUP)" | head -20

Repository: HappyHackingSpace/privacy-policy-analyzer

Length of output: 94


🏁 Script executed:

# Search for OPENAI_BASE_URL usage throughout the codebase to understand intent
rg "OPENAI_BASE_URL|openrouter" -n

Repository: HappyHackingSpace/privacy-policy-analyzer

Length of output: 176


🏁 Script executed:

# Check README.md for setup instructions and configuration guidance
cat README.md

Repository: HappyHackingSpace/privacy-policy-analyzer

Length of output: 8864


🏁 Script executed:

# Look for .env.example or any environment setup documentation
fd -type f \( -name ".env*" -o -name "*example*" -o -name "*setup*" \) -not -path "*/.*/*"

Repository: HappyHackingSpace/privacy-policy-analyzer

Length of output: 255


🏁 Script executed:

# Search for any comments or documentation about OpenRouter choice
rg -i "openrouter|base.url|api.endpoint" -B 2 -A 2

Repository: HappyHackingSpace/privacy-policy-analyzer

Length of output: 1874


🏁 Script executed:

# Check the .env.example file content to see if OPENAI_BASE_URL is documented
cat .env.example

Repository: HappyHackingSpace/privacy-policy-analyzer

Length of output: 129


Document or reconsider the OpenRouter default endpoint.

The code defaults OPENAI_BASE_URL to https://openrouter.ai/api/v1 without any documentation in .env.example or README. Users following the documented setup (setting only OPENAI_API_KEY) will be routed to OpenRouter instead of OpenAI's native endpoint, which they won't discover until encountering unexpected behavior, costs, or rate limits.

Either:

  1. Add OPENAI_BASE_URL to .env.example and document the OpenRouter default in the README, or
  2. Default to the OpenAI SDK's native endpoint and require explicit OPENAI_BASE_URL=https://openrouter.ai/api/v1 in .env for OpenRouter.
🛠️ Option 2: Use SDK default unless explicitly configured

-    base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
+    base_url = os.getenv("OPENAI_BASE_URL")

     if not api_key:
         raise RuntimeError("OPENAI_API_KEY is not set. Configure your .env file.")
     #client = OpenAI(api_key=api_key)
-    client = OpenAI(api_key=api_key, base_url=base_url)
+    client = OpenAI(api_key=api_key) if not base_url else OpenAI(api_key=api_key, base_url=base_url)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

    """Analyze a text chunk with the LLM and return one JSON object."""
    api_key = os.getenv("OPENAI_API_KEY")
    base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set. Configure your .env file.")
    #client = OpenAI(api_key=api_key)
    client = OpenAI(api_key=api_key, base_url=base_url)
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_SCORER},
            {"role": "user", "content": build_user_prompt(text_chunk)},
        ],
        temperature=0,
        max_tokens=2000, #changed from 300 to 2000
        response_format={"type": "json_object"},

After:

    """Analyze a text chunk with the LLM and return one JSON object."""
    api_key = os.getenv("OPENAI_API_KEY")
    base_url = os.getenv("OPENAI_BASE_URL")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set. Configure your .env file.")
    #client = OpenAI(api_key=api_key)
    client = OpenAI(api_key=api_key) if not base_url else OpenAI(api_key=api_key, base_url=base_url)
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_SCORER},
            {"role": "user", "content": build_user_prompt(text_chunk)},
        ],
        temperature=0,
        max_tokens=2000, #changed from 300 to 2000
        response_format={"type": "json_object"},
🤖 Prompt for AI Agents
In `@src/main.py` around lines 555 - 571, The code currently defaults
OPENAI_BASE_URL to "https://openrouter.ai/api/v1" which routes users to
OpenRouter unexpectedly; change os.getenv("OPENAI_BASE_URL",
"https://openrouter.ai/api/v1") to os.getenv("OPENAI_BASE_URL") so the OpenAI
SDK uses its native endpoint by default (keep the existing OpenAI(client) usage
in main.py/resp creation), and then add a clear entry for OPENAI_BASE_URL with
the OpenRouter value to .env.example and a short note in README instructing
users to set OPENAI_BASE_URL=https://openrouter.ai/api/v1 explicitly if they
want OpenRouter.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/main.py (1)

369-411: Critical: Duplicate function definition - this entire block is dead code.

resolve_privacy_url is defined twice: here (lines 369-411) and again at lines 500-552. Python uses the last definition, so this first implementation is completely shadowed and never executes.

Additionally, this first version calls _improve_candidate() on discovered URLs (lines 384, 396, 402), while the active second version at line 500 does not call _improve_candidate() anywhere. This means the "upgrade generic pages to deeper policy links" feature described in the PR objectives is not functioning.

Recommended action: Remove this dead code block (lines 369-411) and integrate _improve_candidate calls into the active function at line 500.

🛠️ Example fix for active function (lines 527-528)
         best_url = _pick_best_verified_candidate(candidates_list, max_verify=5)
         if best_url:
-            return best_url, input_url
+            return _improve_candidate(best_url), input_url
🤖 Fix all issues with AI agents
In `@src/main.py`:
- Line 692: The print statement formatting in the progress message (print(...)
in src/main.py) is missing a space after the bracket; update the string in the
print call that starts with "[3/3]Analyzing" to include a space so it reads
"[3/3] Analyzing {len(chunks)} chunks in parallel..." to fix the typo.
- Around line 489-496: The loop currently returns a URL immediately when
score[0] <= 1 (high-priority candidate) without checking if the link is live,
which can return broken URLs; change the logic in the for-loop that iterates
over scored[:max_verify] so that even when score[0] <= 1 you call
_light_verify(url) and only return url if that verification succeeds (print the
same debug message on success); if verification fails, continue to the next
candidate (and only return after a successful _light_verify or after exhausting
candidates), using the existing variables scored, max_verify, _light_verify,
candidates and the same debug message.
- Around line 198-206: The Selenium path currently accepts main text >100 chars
while the HTTP path enforces a ≥400 char minimum; make these thresholds
consistent by introducing a shared constant (e.g., CONTENT_MIN_LENGTH = 400) and
using it in both the HTTP extraction checks and the Selenium extraction block
(the driver.find_element(By.TAG_NAME, "main") / body.get_attribute("innerText")
logic), and ensure you use the same comparison operator (>=) in both places so
identical pages behave consistently regardless of extraction method.
🧹 Nitpick comments (3)
src/main.py (3)

139-143: Redundant redirect status check.

With allow_redirects=True, the response status code reflects the final destination, not intermediate redirects. The explicit check for 3xx codes (lines 141-142) will rarely match unless the server returns a redirect without a Location header or hits a redirect limit.

This isn't a bug, but the logic could be simplified.

♻️ Simplified version
-        if 200 <= r.status_code < 300:
-            return True
-        if r.status_code in (301, 302, 303, 307, 308):
-            return True
-        return False
+        return 200 <= r.status_code < 400

354-364: Consider using logging instead of print for DEBUG statements.

Multiple print(f"DEBUG: ...") statements are scattered throughout (lines 354, 363, 494, 534, 543, 549, 648, 654, 692). These should use Python's logging module for proper log level control and configurability.


695-703: Move import to top of file and note that chunk order is not preserved.

  1. The ThreadPoolExecutor import at line 695 should be at the top of the file with other imports.
  2. Iterating over futures dict doesn't guarantee completion order matches submission order. If aggregate_chunk_results depends on chunk ordering, this could affect results. The current code assigns index (line 702) but the list order itself is completion-order.
♻️ Preserve chunk order using as_completed with sorting
+from concurrent.futures import ThreadPoolExecutor, as_completed
+
 # ... at top of file
 
-    from concurrent.futures import ThreadPoolExecutor
     with ThreadPoolExecutor(max_workers=5) as executor:
-        futures = {executor.submit(analyze_chunk_json, chunk, args.model): i for i, chunk in enumerate(chunks, 1)}
-        for future in futures:
-            idx = futures[future]
-            res = future.result()
+        futures = [executor.submit(analyze_chunk_json, chunk, args.model) for chunk in chunks]
+        for idx, future in enumerate(futures, 1):
+            res = future.result()  # blocks in order
             if res:
                 res["index"] = idx
                 results.append(res)

myz21 (Author) commented Jan 29, 2026

I’ve streamlined the privacy policy discovery and analysis system. Here’s a rundown of the optimizations and the current performance status:

Key Improvements

  • Smarter Discovery: Built a 0-9 ranking system that evaluates URLs and anchor text to nail down the right policy links faster.
  • Parallel Processing: Switched to ThreadPoolExecutor to handle text chunks simultaneously, cutting down the old sequential wait times.
  • Browser Tuning: Optimized Selenium with an eager load strategy and disabled images to shave seconds off page loads.
  • Lightweight Checks: Used HEAD requests for link validation to avoid downloading unnecessary data.
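The lightweight HEAD check can be sketched as below. The PR uses the requests library; this stand-in uses stdlib urllib, and treating any 2xx/3xx final status as "live" is an assumption rather than the exact _light_verify logic:

```python
from urllib import request, error

def light_verify(url: str, timeout: int = 5, urlopen=request.urlopen) -> bool:
    """Check that a link is live via a HEAD request, without downloading the body."""
    req = request.Request(url, method="HEAD")
    try:
        with urlopen(req, timeout=timeout) as resp:
            # with redirects followed, status reflects the final destination
            return 200 <= resp.status < 400
    except (error.URLError, TimeoutError, ValueError):
        return False
```

The urlopen parameter exists only so the check can be exercised without network access; in production the default request.urlopen is used.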

Performance Bottlenecks

Testing on sites like TikTok and Google highlighted a few hurdles:

  • LLM Latency: Analysis takes up to 170 seconds on free-tier models—this is currently our biggest slowdown.
  • Rate Limits (429): Large documents (>10 chunks) are hitting API quota limits.
  • Bot Detection: Anti-bot measures on sites like Google can push Selenium fetch times up to 15 seconds.

src/main.py Outdated

-def _http_get(url: str, timeout: int = 15) -> Optional[requests.Response]:
+def _http_get(url: str, timeout: int = 5) -> Optional[requests.Response]:
Contributor

Optional should not be used; it is unnecessary on Python 3.9+, `| None` should be used instead.

src/main.py Outdated

-def _fetch_text(url: str, timeout: int = 12) -> Optional[str]:
+def _fetch_text(url: str, timeout: int = 5) -> Optional[str]:
Contributor

Same here.

src/main.py Outdated

    driver.set_page_load_timeout(10)
    driver.get(url)
-   WebDriverWait(driver, 12).until(
+   WebDriverWait(driver, 5).until(
Contributor

What is the number 5? It should be explained and defined as a constant.

src/main.py Outdated

    try:
        content_element = driver.find_element(By.TAG_NAME, "main")
        text = content_element.get_attribute("innerText")
        if text and len(text.strip()) > 100:
Contributor

Why 100?

    for idx, pattern in enumerate(_PRIVACY_REGEX_PATTERNS):
        if pattern.search(url):
            return idx
    return 999
Contributor

What does this mean?

src/main.py Outdated

    # Collect links from input page and homepage
    candidates_set: Dict[str, str] = {}  # url -> anchor_text

    for page_url in [input_url, base]:
Contributor

You could do:

for page_url in (input_url, base):
    if not (resp := _http_get(page_url)):
        continue

    for url, text in _collect_link_candidates(resp.text, resp.url, limit=100):
        candidates_set.setdefault(url, text)

src/main.py Outdated

    # === PHASE 3: Common paths (last resort) ===
    path_heads: List[str] = []
    for p in _COMMON_PATHS:
Contributor

You could shorten it:

for candidate_url in (base + path for path in _COMMON_PATHS):
    if _head_ok(candidate_url) or _light_verify(candidate_url):
        if _light_verify(candidate_url):
            print(f" Found via common path: {candidate_url}")
            return candidate_url, input_url

src/main.py Outdated

@@ -351,17 +567,20 @@ def split_text_into_chunks(
 def analyze_chunk_json(text_chunk: str, model: str) -> Optional[Dict[str, Any]]:
Contributor

Typing issues.

def analyze_chunk_json(text_chunk: str, model: str) -> Optional[Dict[str, Any]]:
    """Analyze a text chunk with the LLM and return one JSON object."""
    api_key = os.getenv("OPENAI_API_KEY")
    base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
Contributor

This should be added to .env.template.

src/main.py Outdated

    print(json.dumps(out, ensure_ascii=False, indent=2))

    print("\n" + "="*40)
Contributor

The print statements here should be removed and click used instead.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (3)
src/main.py (3)

156-171: ⚠️ Potential issue | 🟡 Minor

Use one shared minimum-content threshold for HTTP and Selenium paths.

HTTP extraction uses >= 400 while Selenium accepts > 100 at Line 201, causing inconsistent acceptance for the same page content.

♻️ Suggested fix
+CONTENT_MIN_LENGTH = 400
...
-                return t if len(t) >= 400 else None
+                return t if len(t) >= CONTENT_MIN_LENGTH else None
...
-    return t if len(t) >= 400 else None
+    return t if len(t) >= CONTENT_MIN_LENGTH else None
...
-            if text and len(text.strip()) > 100:
+            if text and len(text.strip()) >= CONTENT_MIN_LENGTH:
                 return text

Also applies to: 197-203

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 156 - 171, Define a single shared minimum content
constant (e.g., MIN_CONTENT_CHARS = 400) in src/main.py and use it in both the
HTTP and Selenium extraction paths instead of hard-coded literals; replace the
HTTP check "return t if len(t) >= 400" and the Selenium path's "> 100" check
with "len(t) >= MIN_CONTENT_CHARS" so both branches (the code using _http_get
and the code that sets content_element/get_text into t) enforce the same
threshold and comparison semantics.

489-493: ⚠️ Potential issue | 🟠 Major

Do not bypass URL verification for top-ranked candidates.

At Lines 491-492, high-priority candidates are returned without _light_verify. This can emit broken URLs.

🔧 Suggested fix
-    for i, (url, text, score) in enumerate(scored[:max_verify]):
-        # if the score is already very good, skip verification
-        if score[0] <= 1:
-            return url
+    for _, (url, text, score) in enumerate(scored[:max_verify]):
         if _light_verify(url):
             print(f"DEBUG: Selected URL '{url}' from {len(candidates)} candidates (score: {score})")
             return url
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 489 - 493, The loop over scored candidates
currently returns a top-ranked URL when score[0] <= 1 before calling
_light_verify, which bypasses verification; change the control flow in the loop
(the for i, (url, text, score) in enumerate(scored[:max_verify]) block) so that
you do not return immediately for high-scoring items—instead run
_light_verify(url) for every candidate (including those with score[0] <= 1) and
only return the url when _light_verify succeeds; if verification fails, continue
to the next candidate and only return after a successful verification (or
exhaust candidates).

570-575: ⚠️ Potential issue | 🟠 Major

Defaulting to OpenRouter should be explicit or documented.

At Line 570, defaulting OPENAI_BASE_URL to OpenRouter reroutes users unexpectedly when they only set OPENAI_API_KEY.

🔧 Suggested fix (SDK default unless explicitly configured)
-    base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
+    base_url = os.getenv("OPENAI_BASE_URL")
@@
-    client = OpenAI(api_key=api_key, base_url=base_url)
+    client = OpenAI(api_key=api_key) if not base_url else OpenAI(api_key=api_key, base_url=base_url)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 570 - 575, The code defaults OPENAI_BASE_URL to
OpenRouter and always passes base_url into the OpenAI constructor (base_url
variable and client = OpenAI(api_key=api_key, base_url=base_url)), which
silently overrides the SDK default; change the logic so OPENAI_BASE_URL is
optional: read os.getenv("OPENAI_BASE_URL") without a default, and only pass
base_url into OpenAI(...) when that env var is present (or otherwise leave the
OpenAI(...) call as OpenAI(api_key=api_key)) so the SDK default base URL is
preserved unless explicitly configured; update any docstring or README to state
that OPENAI_BASE_URL must be set to override the SDK default.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/main.py`:
- Around line 576-585: The OpenAI per-chunk call (client.chat.completions.create
used to assign resp with build_user_prompt) has no error handling and futures
are consumed with an unguarded future.result(); wrap the API call inside a
try/except that catches network/HTTP/timeout/rate-limit exceptions and returns
None (or a sentinel) for failed chunks, and then guard future.result() where
futures are iterated so you skip/ignore None results or catch exceptions from
future.result() and continue processing remaining chunks; reference
client.chat.completions.create, build_user_prompt, resp, and future.result()
when applying the changes.
- Around line 369-411: Remove the obsolete duplicate implementation of
resolve_privacy_url (the one that uses find_best_policy_url and
_extract_text_http) so only the intended phase-based implementation remains (the
one that uses _collect_link_candidates and _pick_best_verified_candidate);
delete the first definition entirely to avoid shadowing and ensure imports/refs
still match the remaining resolve_privacy_url, then run tests/static checks to
verify no references break.

---

Duplicate comments:
In `@src/main.py`:
- Around line 156-171: Define a single shared minimum content constant (e.g.,
MIN_CONTENT_CHARS = 400) in src/main.py and use it in both the HTTP and Selenium
extraction paths instead of hard-coded literals; replace the HTTP check "return
t if len(t) >= 400" and the Selenium path's "> 100" check with "len(t) >=
MIN_CONTENT_CHARS" so both branches (the code using _http_get and the code that
sets content_element/get_text into t) enforce the same threshold and comparison
semantics.
- Around line 489-493: The loop over scored candidates currently returns a
top-ranked URL when score[0] <= 1 before calling _light_verify, which bypasses
verification; change the control flow in the loop (the for i, (url, text, score)
in enumerate(scored[:max_verify]) block) so that you do not return immediately
for high-scoring items—instead run _light_verify(url) for every candidate
(including those with score[0] <= 1) and only return the url when _light_verify
succeeds; if verification fails, continue to the next candidate and only return
after a successful verification (or exhaust candidates).
- Around line 570-575: The code defaults OPENAI_BASE_URL to OpenRouter and
always passes base_url into the OpenAI constructor (base_url variable and client
= OpenAI(api_key=api_key, base_url=base_url)), which silently overrides the SDK
default; change the logic so OPENAI_BASE_URL is optional: read
os.getenv("OPENAI_BASE_URL") without a default, and only pass base_url into
OpenAI(...) when that env var is present (or otherwise leave the OpenAI(...)
call as OpenAI(api_key=api_key)) so the SDK default base URL is preserved unless
explicitly configured; update any docstring or README to state that
OPENAI_BASE_URL must be set to override the SDK default.
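A minimal sketch of the optional-override pattern this comment describes, under the assumption that the kwargs are fed straight into the `OpenAI(...)` constructor (the `client_kwargs` helper name is illustrative, not from the PR):

```python
import os

def client_kwargs() -> dict:
    """Build OpenAI(...) kwargs; include base_url only when explicitly configured."""
    kwargs = {"api_key": os.environ.get("OPENAI_API_KEY", "")}
    base_url = os.environ.get("OPENAI_BASE_URL")  # no OpenRouter default here
    if base_url:
        kwargs["base_url"] = base_url  # explicit override only
    return kwargs

# client = OpenAI(**client_kwargs())  # SDK default base URL preserved otherwise
```

When `OPENAI_BASE_URL` is unset, `base_url` never reaches the constructor, so the SDK default survives.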

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cf48adf and 8f7396a.

📒 Files selected for processing (1)
  • src/main.py

Comment on lines 576 to 585
resp = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": SYSTEM_SCORER},
        {"role": "user", "content": build_user_prompt(text_chunk)},
    ],
    temperature=0,
-   max_tokens=600,
+   max_tokens=2000, #changed from 300 to 2000
    response_format={"type": "json_object"},
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

sed -n '567,590p' src/main.py
sed -n '692,705p' src/main.py

Repository: HappyHackingSpace/privacy-policy-analyzer

Length of output: 1651


Guard per-chunk API failures to avoid aborting the whole analysis.

The OpenAI call at lines 576-585 lacks exception handling for network errors (timeouts, 429 rate limits, etc.), and future.result() at line ~700 is unguarded. A single failed chunk currently aborts the entire analysis.

Wrap the API call in try-except to return None on failure, and guard future.result() to skip failed chunks gracefully.
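A minimal sketch of that guarded pattern; `analyze` stands in for the per-chunk OpenAI call and the function names are illustrative, not from the PR:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_chunk_safe(analyze, chunk, model):
    """Return the chunk result, or None if the API call raises (timeout, 429, ...)."""
    try:
        return analyze(chunk, model)
    except Exception as e:  # network/HTTP/rate-limit errors from the SDK
        print(f"Warning: chunk analysis failed: {e}")
        return None

def analyze_all(analyze, chunks, model, workers=5):
    """Run chunks in parallel; skip failures instead of aborting the whole run."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(analyze_chunk_safe, analyze, c, model): i
                   for i, c in enumerate(chunks, 1)}
        for future in futures:
            idx = futures[future]
            try:
                res = future.result()
            except Exception as e:  # defensive: worker raised outside the inner guard
                print(f"Warning: chunk {idx} failed: {e}")
                continue
            if res:
                res["index"] = idx
                results.append(res)
    return results
```

With both guards in place, one failed chunk costs only that chunk's result; the remaining chunks are still collected.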

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 576 - 585, The OpenAI per-chunk call
(client.chat.completions.create used to assign resp with build_user_prompt) has
no error handling and futures are consumed with an unguarded future.result();
wrap the API call inside a try/except that catches
network/HTTP/timeout/rate-limit exceptions and returns None (or a sentinel) for
failed chunks, and then guard future.result() where futures are iterated so you
skip/ignore None results or catch exceptions from future.result() and continue
processing remaining chunks; reference client.chat.completions.create,
build_user_prompt, resp, and future.result() when applying the changes.

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
src/main.py (1)

693-704: ⚠️ Potential issue | 🟡 Minor

Missing space in progress message and unguarded future.result().

  1. Typo (line 693): Missing space — "[3/3]Analyzing" should be "[3/3] Analyzing".

  2. Unguarded future.result() (line 701): If any chunk's API call fails (network error, rate limit), future.result() raises an exception, aborting the entire analysis. Wrap in try/except to skip failed chunks gracefully.

🛠️ Proposed fix
-    print(f"[3/3]Analyzing {len(chunks)} chunks in parallel...")
+    print(f"[3/3] Analyzing {len(chunks)} chunks in parallel...")
     
     # Parallel analysis of chunks
     from concurrent.futures import ThreadPoolExecutor
     with ThreadPoolExecutor(max_workers=5) as executor:
         futures = {executor.submit(analyze_chunk_json, chunk, args.model): i for i, chunk in enumerate(chunks, 1)}
         for future in futures:
             idx = futures[future]
-            res = future.result()
-            if res:
-                res["index"] = idx
-                results.append(res)
+            try:
+                res = future.result()
+                if res:
+                    res["index"] = idx
+                    results.append(res)
+            except Exception as e:
+                print(f"Warning: Chunk {idx} analysis failed: {e}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 693 - 704, The progress print message is missing a
space and the ThreadPoolExecutor loop calls future.result() unguarded which will
raise and abort processing if a chunk fails; update the string to "[3/3]
Analyzing {len(chunks)} chunks in parallel..." and wrap the call to
future.result() inside a try/except around the loop that iterates over futures
(created when submitting analyze_chunk_json) to catch exceptions, log or skip
failed chunk results, and continue appending successful results (ensuring you
still set res["index"] = idx when res is returned).
🧹 Nitpick comments (4)
src/main.py (4)

696-696: Move import to module level.

ThreadPoolExecutor is imported inside the function body. Per PEP 8, imports should be at the top of the file.

♻️ Proposed fix

At module level (near line 1):

from concurrent.futures import ThreadPoolExecutor

Then remove line 696.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` at line 696, Move the "from concurrent.futures import
ThreadPoolExecutor" import out of the function body and place it at the top of
the module (module-level imports), then delete the local import statement inside
the function so all references to ThreadPoolExecutor use the top-level import;
keep the existing usage of ThreadPoolExecutor unchanged.

501-553: Active resolve_privacy_url implementation — clean up debug prints.

This is the intended 3-phase implementation. After removing the shadowed definition at lines 370-412, clean up the DEBUG print statements (lines 535, 544, 550) or convert them to proper logging.
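A sketch of the suggested logging conversion; `resolve_from_sitemap` is an illustrative stand-in for the sitemap step inside `resolve_privacy_url`, not a function from the PR:

```python
import logging

logger = logging.getLogger(__name__)

def resolve_from_sitemap(candidate_urls):
    """Return the first privacy-looking sitemap URL, logging at DEBUG instead of print()."""
    for url in candidate_urls:
        if "privacy" in url:
            logger.debug("sitemap candidate found: %s", url)
            return url
    logger.debug("no sitemap candidate matched")
    return None
```

Callers then opt in to the diagnostics with `logging.basicConfig(level=logging.DEBUG)`, and stdout stays clean by default.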

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 501 - 553, In resolve_privacy_url, replace the
three DEBUG print statements with proper logging calls (e.g., use the module
logger) or remove them: the prints occur when a sitemap candidate is found
inside the loop over _get_sitemaps_from_robots/_fetch_sitemap_urls and in the
two places inside the common-path checks that currently call print(f"DEBUG:
..."). Update those prints to logger.debug(...) (or remove) so they don't leak
stdout; keep the rest of the 3-phase logic and return values unchanged and
ensure the logger is imported/initialized if added.

190-190: Move _SELENIUM_WAIT_TIMEOUT to module level.

The constant is defined inside the function body. Per the commit message "use a constant for Selenium wait timeout", this should be a module-level constant for better visibility and configurability.

♻️ Proposed refactor

Move to module level near other constants:

+_SELENIUM_WAIT_TIMEOUT = 5
+
 def fetch_content_with_selenium(url: str) -> str | None:
     """Return visible text using headless Chrome; robust for dynamic pages."""
     chromedriver_autoinstaller.install()
     # ... options setup ...
-    _SELENIUM_WAIT_TIMEOUT = 5
     try:
         driver.set_page_load_timeout(10)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` at line 190, Move the `_SELENIUM_WAIT_TIMEOUT` constant out of
the function body and declare it at module level alongside the other constants;
locate the occurrence of `_SELENIUM_WAIT_TIMEOUT` inside the function (the
function that currently defines it) and remove the in-function definition, then
add a top-level declaration `_SELENIUM_WAIT_TIMEOUT = 5` near the other module
constants so the constant is globally visible and configurable.

294-299: Extract magic number 999 to a named constant.

The value 999 is used as a "no match" sentinel but lacks explanation. Define a module-level constant for clarity.

♻️ Proposed fix
+_NO_MATCH_PRIORITY = 999  # Sentinel for URLs not matching any priority pattern

 def _get_url_priority(url: str) -> int:
     """Return the priority index of a URL based on regex patterns. Lower is better."""
     for idx, pattern in enumerate(_PRIVACY_REGEX_PATTERNS):
         if pattern.search(url):
             return idx
-    return 999
+    return _NO_MATCH_PRIORITY

Then update references at lines 332, 445-446, and 492.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.py` around lines 294 - 299, Introduce a descriptive module-level
constant (e.g., NO_MATCH_PRIORITY = 999) and replace the magic literal 999 in
_get_url_priority with that constant; also replace any other uses in this module
where 999 is used as the "no match" sentinel (places that return, compare to, or
rely on that sentinel) so all comparisons/returns use NO_MATCH_PRIORITY instead
of the literal. Ensure the constant is documented with a short comment and
update the return in _get_url_priority and any other functions or conditionals
that previously used 999 to reference this new symbol.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/main.py`:
- Around line 693-704: The progress print message is missing a space and the
ThreadPoolExecutor loop calls future.result() unguarded which will raise and
abort processing if a chunk fails; update the string to "[3/3] Analyzing
{len(chunks)} chunks in parallel..." and wrap the call to future.result() inside
a try/except around the loop that iterates over futures (created when submitting
analyze_chunk_json) to catch exceptions, log or skip failed chunk results, and
continue appending successful results (ensuring you still set res["index"] = idx
when res is returned).

---

Nitpick comments:
In `@src/main.py`:
- Line 696: Move the "from concurrent.futures import ThreadPoolExecutor" import
out of the function body and place it at the top of the module (module-level
imports), then delete the local import statement inside the function so all
references to ThreadPoolExecutor use the top-level import; keep the existing
usage of ThreadPoolExecutor unchanged.
- Around line 501-553: In resolve_privacy_url, replace the three DEBUG print
statements with proper logging calls (e.g., use the module logger) or remove
them: the prints occur when a sitemap candidate is found inside the loop over
_get_sitemaps_from_robots/_fetch_sitemap_urls and in the two places inside the
common-path checks that currently call print(f"DEBUG: ..."). Update those prints
to logger.debug(...) (or remove) so they don't leak stdout; keep the rest of the
3-phase logic and return values unchanged and ensure the logger is
imported/initialized if added.
- Line 190: Move the `_SELENIUM_WAIT_TIMEOUT` constant out of the function body
and declare it at module level alongside the other constants; locate the
occurrence of `_SELENIUM_WAIT_TIMEOUT` inside the function (the function that
currently defines it) and remove the in-function definition, then add a
top-level declaration `_SELENIUM_WAIT_TIMEOUT = 5` near the other module
constants so the constant is globally visible and configurable.
- Around line 294-299: Introduce a descriptive module-level constant (e.g.,
NO_MATCH_PRIORITY = 999) and replace the magic literal 999 in _get_url_priority
with that constant; also replace any other uses in this module where 999 is used
as the "no match" sentinel (places that return, compare to, or rely on that
sentinel) so all comparisons/returns use NO_MATCH_PRIORITY instead of the
literal. Ensure the constant is documented with a short comment and update the
return in _get_url_priority and any other functions or conditionals that
previously used 999 to reference this new symbol.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1b6eaba1-e680-475b-9c4f-a8023caaddb6

📥 Commits

Reviewing files that changed from the base of the PR and between 8f7396a and 12950e3.

📒 Files selected for processing (1)
  • src/main.py
