Walkthrough

Implements a multi-phase privacy URL discovery and verification flow in `src/main.py`.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant Resolver as resolve_privacy_url
    participant Fetcher as Page Fetcher
    participant LinkDiscovery as Link Discovery
    participant Sitemap as Sitemap Discovery
    participant Prober as Common Paths Prober
    participant Verifier as Scoring & Verification
    participant OpenAI as Chunk Analysis

    Caller->>Resolver: resolve_privacy_url(input_url)
    activate Resolver
    Resolver->>Fetcher: fetch page & homepage links
    Fetcher-->>LinkDiscovery: provide link candidates
    LinkDiscovery->>Verifier: score & verify candidates
    alt Verified candidate found
        Verifier-->>Resolver: best_url
    else
        Resolver->>Sitemap: fetch robots.txt -> sitemap.xml
        Sitemap->>Verifier: score & verify sitemap candidates
        alt Sitemap candidate found
            Verifier-->>Resolver: best_url
        else
            Resolver->>Prober: probe known common privacy paths
            Prober->>Verifier: score & verify probes
            Verifier-->>Resolver: best_url or fallback
        end
    end
    Resolver->>Caller: return (final_url, source)
    deactivate Resolver
    Note over OpenAI, Resolver: After final URL fetch, split content -> parallel chunk analysis
    Resolver->>OpenAI: analyze chunks (ThreadPoolExecutor, OPENAI_BASE_URL, increased max_tokens)
    OpenAI-->>Resolver: chunk analysis results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main.py (1)
369-411: Remove the duplicated `resolve_privacy_url`; the earlier version is dead code.

The second definition overwrites the first, so edits to the earlier version (including `_improve_candidate` usage) never take effect and are easy to miss. Please delete the first definition and, if you still want hub-page deepening, apply `_improve_candidate` in the active flow.

♻️ Suggested integration in the active resolver

```diff
-    if best_url:
-        return best_url, input_url
+    if best_url:
+        return _improve_candidate(best_url), input_url
 ...
-    return cand, input_url
+    return _improve_candidate(cand), input_url
 ...
-    return cand, input_url
+    return _improve_candidate(cand), input_url
```

Also applies to: 487-539
🤖 Fix all issues with AI agents
In `@src/main.py`:
- Around line 414-446: The current _collect_link_candidates filters out URLs
using _is_privacy_like which doesn’t catch "legal"/"terms" patterns from the
regex priority list; update _collect_link_candidates to also accept URLs that
match the module’s priority regex list (use the same regexes used elsewhere
instead of only _is_privacy_like) so links like /legal or /terms are included,
preserve the deduplication map behavior (candidates: Dict[str, str]) and ensure
you still keep the best anchor text per URL when inserting/updating candidates,
and stop when limit is reached.
- Around line 555-571: The code currently defaults OPENAI_BASE_URL to
"https://openrouter.ai/api/v1" which routes users to OpenRouter unexpectedly;
change os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1") to
os.getenv("OPENAI_BASE_URL") so the OpenAI SDK uses its native endpoint by
default (keep the existing OpenAI(client) usage in main.py/resp creation), and
then add a clear entry for OPENAI_BASE_URL with the OpenRouter value to
.env.example and a short note in README instructing users to set
OPENAI_BASE_URL=https://openrouter.ai/api/v1 explicitly if they want OpenRouter.
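The first agent prompt above (widening `_collect_link_candidates` so it also accepts priority-regex matches such as `/legal` or `/terms`) could look roughly like the following sketch. The regex list, the `_is_privacy_like` stand-in, and the function signature here are illustrative assumptions, not the module's actual definitions.

```python
import re
from typing import Dict, List, Tuple

# Illustrative priority patterns; the real module's list may differ.
_PRIVACY_REGEX_PATTERNS = [
    re.compile(r"privacy[-_]?policy", re.I),
    re.compile(r"/privacy\b", re.I),
    re.compile(r"/legal\b", re.I),
    re.compile(r"/terms\b", re.I),
]

def _is_privacy_like(url: str) -> bool:
    # Stand-in for the module's keyword check, which misses /legal and /terms.
    return "privacy" in url.lower()

def collect_link_candidates(links: List[Tuple[str, str]], limit: int = 100) -> Dict[str, str]:
    """Keep URLs that are privacy-like OR match a priority regex,
    deduplicating by URL and keeping the longest anchor text."""
    candidates: Dict[str, str] = {}
    for url, text in links:
        if len(candidates) >= limit:
            break  # stop once the candidate limit is reached
        if not (_is_privacy_like(url) or any(p.search(url) for p in _PRIVACY_REGEX_PATTERNS)):
            continue
        # Preserve the best (longest) anchor text per URL.
        if url not in candidates or len(text) > len(candidates[url]):
            candidates[url] = text
    return candidates
```

With this shape, `/legal` and `/terms` links survive the filter while the dedup-by-URL map behavior is preserved.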
| """Analyze a text chunk with the LLM and return one JSON object.""" | ||
| api_key = os.getenv("OPENAI_API_KEY") | ||
| base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1") | ||
|
|
||
| if not api_key: | ||
| raise RuntimeError("OPENAI_API_KEY is not set. Configure your .env file.") | ||
| client = OpenAI(api_key=api_key) | ||
| #client = OpenAI(api_key=api_key) | ||
| client = OpenAI(api_key=api_key, base_url=base_url) | ||
| resp = client.chat.completions.create( | ||
| model=model, | ||
| messages=[ | ||
| {"role": "system", "content": SYSTEM_SCORER}, | ||
| {"role": "user", "content": build_user_prompt(text_chunk)}, | ||
| ], | ||
| temperature=0, | ||
| max_tokens=600, | ||
| max_tokens=2000, #changed from 300 to 2000 | ||
| response_format={"type": "json_object"}, |
🧩 Analysis chain
🏁 Script executed:

```shell
# First, let's verify the actual code at the specified lines in src/main.py
head -571 src/main.py | tail -20
```

Repository: HappyHackingSpace/privacy-policy-analyzer
Length of output: 902

🏁 Script executed:

```shell
# Let's also check the full context around those lines (550-580) to understand the function
sed -n '550,580p' src/main.py
```

Repository: HappyHackingSpace/privacy-policy-analyzer
Length of output: 1158

🏁 Script executed:

```shell
# Check if there's any documentation or .env files that explain the expected behavior
fd -t f -e md -e txt -e env | grep -E "(README|\.env|config|SETUP)" | head -20
```

Repository: HappyHackingSpace/privacy-policy-analyzer
Length of output: 94

🏁 Script executed:

```shell
# Search for OPENAI_BASE_URL usage throughout the codebase to understand intent
rg "OPENAI_BASE_URL|openrouter" -n
```

Repository: HappyHackingSpace/privacy-policy-analyzer
Length of output: 176

🏁 Script executed:

```shell
# Check README.md for setup instructions and configuration guidance
cat README.md
```

Repository: HappyHackingSpace/privacy-policy-analyzer
Length of output: 8864

🏁 Script executed:

```shell
# Look for .env.example or any environment setup documentation
fd -type f \( -name ".env*" -o -name "*example*" -o -name "*setup*" \) -not -path "*/.*/*"
```

Repository: HappyHackingSpace/privacy-policy-analyzer
Length of output: 255

🏁 Script executed:

```shell
# Search for any comments or documentation about OpenRouter choice
rg -i "openrouter|base.url|api.endpoint" -B 2 -A 2
```

Repository: HappyHackingSpace/privacy-policy-analyzer
Length of output: 1874

🏁 Script executed:

```shell
# Check the .env.example file content to see if OPENAI_BASE_URL is documented
cat .env.example
```

Repository: HappyHackingSpace/privacy-policy-analyzer
Length of output: 129
Document or reconsider the OpenRouter default endpoint.
The code defaults OPENAI_BASE_URL to https://openrouter.ai/api/v1 without any documentation in .env.example or README. Users following the documented setup (setting only OPENAI_API_KEY) will be routed to OpenRouter instead of OpenAI's native endpoint, which they won't discover until encountering unexpected behavior, costs, or rate limits.
Either:
- Add `OPENAI_BASE_URL` to `.env.example` and document the OpenRouter default in the README, or
- Default to the OpenAI SDK's native endpoint and require explicit `OPENAI_BASE_URL=https://openrouter.ai/api/v1` in `.env` for OpenRouter.
🛠️ Option 2: Use SDK default unless explicitly configured

```diff
-    base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
+    base_url = os.getenv("OPENAI_BASE_URL")
     if not api_key:
         raise RuntimeError("OPENAI_API_KEY is not set. Configure your .env file.")
     #client = OpenAI(api_key=api_key)
-    client = OpenAI(api_key=api_key, base_url=base_url)
+    client = OpenAI(api_key=api_key) if not base_url else OpenAI(api_key=api_key, base_url=base_url)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| """Analyze a text chunk with the LLM and return one JSON object.""" | |
| api_key = os.getenv("OPENAI_API_KEY") | |
| base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1") | |
| if not api_key: | |
| raise RuntimeError("OPENAI_API_KEY is not set. Configure your .env file.") | |
| client = OpenAI(api_key=api_key) | |
| #client = OpenAI(api_key=api_key) | |
| client = OpenAI(api_key=api_key, base_url=base_url) | |
| resp = client.chat.completions.create( | |
| model=model, | |
| messages=[ | |
| {"role": "system", "content": SYSTEM_SCORER}, | |
| {"role": "user", "content": build_user_prompt(text_chunk)}, | |
| ], | |
| temperature=0, | |
| max_tokens=600, | |
| max_tokens=2000, #changed from 300 to 2000 | |
| response_format={"type": "json_object"}, | |
| """Analyze a text chunk with the LLM and return one JSON object.""" | |
| api_key = os.getenv("OPENAI_API_KEY") | |
| base_url = os.getenv("OPENAI_BASE_URL") | |
| if not api_key: | |
| raise RuntimeError("OPENAI_API_KEY is not set. Configure your .env file.") | |
| `#client` = OpenAI(api_key=api_key) | |
| client = OpenAI(api_key=api_key) if not base_url else OpenAI(api_key=api_key, base_url=base_url) | |
| resp = client.chat.completions.create( | |
| model=model, | |
| messages=[ | |
| {"role": "system", "content": SYSTEM_SCORER}, | |
| {"role": "user", "content": build_user_prompt(text_chunk)}, | |
| ], | |
| temperature=0, | |
| max_tokens=2000, `#changed` from 300 to 2000 | |
| response_format={"type": "json_object"}, |
🤖 Prompt for AI Agents
In `@src/main.py` around lines 555 - 571, The code currently defaults
OPENAI_BASE_URL to "https://openrouter.ai/api/v1" which routes users to
OpenRouter unexpectedly; change os.getenv("OPENAI_BASE_URL",
"https://openrouter.ai/api/v1") to os.getenv("OPENAI_BASE_URL") so the OpenAI
SDK uses its native endpoint by default (keep the existing OpenAI(client) usage
in main.py/resp creation), and then add a clear entry for OPENAI_BASE_URL with
the OpenRouter value to .env.example and a short note in README instructing
users to set OPENAI_BASE_URL=https://openrouter.ai/api/v1 explicitly if they
want OpenRouter.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main.py (1)
369-411: Critical: Duplicate function definition - this entire block is dead code.
`resolve_privacy_url` is defined twice: here (lines 369-411) and again at lines 500-552. Python uses the last definition, so this first implementation is completely shadowed and never executes.

Additionally, this first version calls `_improve_candidate()` on discovered URLs (lines 384, 396, 402), while the active second version at line 500 does not call `_improve_candidate()` anywhere. This means the "upgrade generic pages to deeper policy links" feature described in the PR objectives is not functioning.

Recommended action: Remove this dead code block (lines 369-411) and integrate `_improve_candidate` calls into the active function at line 500.

🛠️ Example fix for active function (lines 527-528)

```diff
     best_url = _pick_best_verified_candidate(candidates_list, max_verify=5)
     if best_url:
-        return best_url, input_url
+        return _improve_candidate(best_url), input_url
```
🤖 Fix all issues with AI agents
In `@src/main.py`:
- Line 692: The print statement formatting in the progress message (print(...)
in src/main.py) is missing a space after the bracket; update the string in the
print call that starts with "[3/3]Analyzing" to include a space so it reads
"[3/3] Analyzing {len(chunks)} chunks in parallel..." to fix the typo.
- Around line 489-496: The loop currently returns a URL immediately when
score[0] <= 1 (high-priority candidate) without checking if the link is live,
which can return broken URLs; change the logic in the for-loop that iterates
over scored[:max_verify] so that even when score[0] <= 1 you call
_light_verify(url) and only return url if that verification succeeds (print the
same debug message on success); if verification fails, continue to the next
candidate (and only return after a successful _light_verify or after exhausting
candidates), using the existing variables scored, max_verify, _light_verify,
candidates and the same debug message.
- Around line 198-206: The Selenium path currently accepts main text >100 chars
while the HTTP path enforces a ≥400 char minimum; make these thresholds
consistent by introducing a shared constant (e.g., CONTENT_MIN_LENGTH = 400) and
using it in both the HTTP extraction checks and the Selenium extraction block
(the driver.find_element(By.TAG_NAME, "main") / body.get_attribute("innerText")
logic), and ensure you use the same comparison operator (>=) in both places so
identical pages behave consistently regardless of extraction method.
🧹 Nitpick comments (3)
src/main.py (3)
139-143: Redundant redirect status check.

With `allow_redirects=True`, the response status code reflects the final destination, not intermediate redirects. The explicit check for 3xx codes (lines 141-142) will rarely match unless the server returns a redirect without a `Location` header or hits a redirect limit. This isn't a bug, but the logic could be simplified.

♻️ Simplified version

```diff
-    if 200 <= r.status_code < 300:
-        return True
-    if r.status_code in (301, 302, 303, 307, 308):
-        return True
-    return False
+    return 200 <= r.status_code < 400
```
354-364: Consider using `logging` instead of `print`.

Multiple `print(f"DEBUG: ...")` statements are scattered throughout (lines 354, 363, 494, 534, 543, 549, 648, 654, 692). These should use Python's `logging` module for proper log level control and configurability.
695-703: Move import to top of file and note that chunk order is not preserved.

- The `ThreadPoolExecutor` import at line 695 should be at the top of the file with other imports.
- Iterating over the `futures` dict doesn't guarantee completion order matches submission order. If `aggregate_chunk_results` depends on chunk ordering, this could affect results. The current code assigns `index` (line 702) but the list order itself is completion-order.

♻️ Preserve chunk order by iterating futures in submission order

```diff
+from concurrent.futures import ThreadPoolExecutor
+# ... at top of file
-    from concurrent.futures import ThreadPoolExecutor
     with ThreadPoolExecutor(max_workers=5) as executor:
-        futures = {executor.submit(analyze_chunk_json, chunk, args.model): i for i, chunk in enumerate(chunks, 1)}
-        for future in futures:
-            idx = futures[future]
-            res = future.result()
+        futures = [executor.submit(analyze_chunk_json, chunk, args.model) for chunk in chunks]
+        for idx, future in enumerate(futures, 1):
+            res = future.result()  # blocks in order
             if res:
                 res["index"] = idx
                 results.append(res)
```
I've streamlined the privacy policy discovery and analysis system. Here's a rundown of the optimizations and the current performance status.

Key Improvements

Performance Bottlenecks

Testing on sites like TikTok and Google highlighted a few hurdles:
|
src/main.py (Outdated)

```diff
-def _http_get(url: str, timeout: int = 15) -> Optional[requests.Response]:
+def _http_get(url: str, timeout: int = 5) -> Optional[requests.Response]:
```
`Optional` shouldn't be used; it isn't needed on Python 3.9+, `|` should be used instead.
src/main.py (Outdated)

```diff
-def _fetch_text(url: str, timeout: int = 12) -> Optional[str]:
+def _fetch_text(url: str, timeout: int = 5) -> Optional[str]:
```
src/main.py (Outdated)

```diff
 driver.set_page_load_timeout(10)
 driver.get(url)
-WebDriverWait(driver, 12).until(
+WebDriverWait(driver, 5).until(
```
What is the number 5? It should be explained; it should be a constant.
src/main.py (Outdated)

```python
try:
    content_element = driver.find_element(By.TAG_NAME, "main")
    text = content_element.get_attribute("innerText")
    if text and len(text.strip()) > 100:
```
```python
for idx, pattern in enumerate(_PRIVACY_REGEX_PATTERNS):
    if pattern.search(url):
        return idx
return 999
```
src/main.py (Outdated)

```python
# Collect links from input page and homepage
candidates_set: Dict[str, str] = {}  # url -> anchor_text

for page_url in [input_url, base]:
```
You could do:

```python
for page_url in (input_url, base):
    if not (resp := _http_get(page_url)):
        continue
    for url, text in _collect_link_candidates(resp.text, resp.url, limit=100):
        candidates_set.setdefault(url, text)
```
src/main.py (Outdated)

```python
# === PHASE 3: Common paths (last resort) ===
path_heads: List[str] = []
for p in _COMMON_PATHS:
```
You could shorten it:

```python
for candidate_url in (base + path for path in _COMMON_PATHS):
    if _head_ok(candidate_url) or _light_verify(candidate_url):
        if _light_verify(candidate_url):
            print(f"  Found via common path: {candidate_url}")
            return candidate_url, input_url
```
src/main.py (Outdated)

```diff
@@ -351,17 +567,20 @@ def split_text_into_chunks(
 def analyze_chunk_json(text_chunk: str, model: str) -> Optional[Dict[str, Any]]:
     """Analyze a text chunk with the LLM and return one JSON object."""
     api_key = os.getenv("OPENAI_API_KEY")
+    base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
```
src/main.py (Outdated)

```python
print(json.dumps(out, ensure_ascii=False, indent=2))

print("\n" + "="*40)
```
The prints here should be removed; `click` should be used instead.
Actionable comments posted: 2
♻️ Duplicate comments (3)
src/main.py (3)
156-171: ⚠️ Potential issue | 🟡 Minor

Use one shared minimum-content threshold for HTTP and Selenium paths.

HTTP extraction uses `>= 400` while Selenium accepts `> 100` at line 201, causing inconsistent acceptance for the same page content.

♻️ Suggested fix

```diff
+CONTENT_MIN_LENGTH = 400
 ...
-    return t if len(t) >= 400 else None
+    return t if len(t) >= CONTENT_MIN_LENGTH else None
 ...
-    return t if len(t) >= 400 else None
+    return t if len(t) >= CONTENT_MIN_LENGTH else None
 ...
-    if text and len(text.strip()) > 100:
+    if text and len(text.strip()) >= CONTENT_MIN_LENGTH:
         return text
```

Also applies to: 197-203
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 156 - 171, Define a single shared minimum content constant (e.g., MIN_CONTENT_CHARS = 400) in src/main.py and use it in both the HTTP and Selenium extraction paths instead of hard-coded literals; replace the HTTP check "return t if len(t) >= 400" and the Selenium path's "> 100" check with "len(t) >= MIN_CONTENT_CHARS" so both branches (the code using _http_get and the code that sets content_element/get_text into t) enforce the same threshold and comparison semantics.
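A minimal sketch of the shared-threshold idea described above, assuming a module-level constant and two illustrative acceptance helpers (the real code checks the length inline rather than through helpers):

```python
CONTENT_MIN_LENGTH = 400  # single floor shared by both extraction paths

def accept_http_text(text: str) -> bool:
    # HTTP path: same constant, same >= operator
    return len(text.strip()) >= CONTENT_MIN_LENGTH

def accept_selenium_text(text: str) -> bool:
    # Selenium path: previously accepted > 100; now identical to HTTP
    return len(text.strip()) >= CONTENT_MIN_LENGTH
```

With one constant, a page extracted by either path is accepted or rejected identically, so switching the fetcher cannot change the outcome.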
489-493: ⚠️ Potential issue | 🟠 Major

Do not bypass URL verification for top-ranked candidates.

At lines 491-492, high-priority candidates are returned without `_light_verify`. This can emit broken URLs.

🔧 Suggested fix

```diff
-    for i, (url, text, score) in enumerate(scored[:max_verify]):
-        # if the score is already very good, skip verification
-        if score[0] <= 1:
-            return url
+    for _, (url, text, score) in enumerate(scored[:max_verify]):
         if _light_verify(url):
             print(f"DEBUG: Selected URL '{url}' from {len(candidates)} candidates (score: {score})")
             return url
```
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 489 - 493, The loop over scored candidates currently returns a top-ranked URL when score[0] <= 1 before calling _light_verify, which bypasses verification; change the control flow in the loop (the for i, (url, text, score) in enumerate(scored[:max_verify]) block) so that you do not return immediately for high-scoring items—instead run _light_verify(url) for every candidate (including those with score[0] <= 1) and only return the url when _light_verify succeeds; if verification fails, continue to the next candidate and only return after a successful verification (or exhaust candidates).
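The control flow the prompt asks for can be sketched as follows; `_light_verify` is injected as a callable here so the shape is testable, and the tuple layout is an assumption based on the snippets above rather than the module's exact signature.

```python
from typing import Callable, List, Optional, Tuple

def pick_best_verified(
    scored: List[Tuple[str, str, Tuple[int, int]]],
    light_verify: Callable[[str], bool],
    max_verify: int = 5,
) -> Optional[str]:
    """Return the first candidate that passes verification.
    Even top-priority candidates (score[0] <= 1) are verified first."""
    for url, text, score in scored[:max_verify]:
        if light_verify(url):
            print(f"DEBUG: Selected URL '{url}' (score: {score})")
            return url
    return None  # exhausted candidates without finding a live URL
```

The key property: a dead top-ranked URL no longer short-circuits the loop; the resolver falls through to the next live candidate, or to the later discovery phases.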
570-575: ⚠️ Potential issue | 🟠 Major

Defaulting to OpenRouter should be explicit or documented.

At line 570, defaulting `OPENAI_BASE_URL` to OpenRouter reroutes users unexpectedly when they only set `OPENAI_API_KEY`.

🔧 Suggested fix (SDK default unless explicitly configured)

```diff
-    base_url = os.getenv("OPENAI_BASE_URL", "https://openrouter.ai/api/v1")
+    base_url = os.getenv("OPENAI_BASE_URL")
@@
-    client = OpenAI(api_key=api_key, base_url=base_url)
+    client = OpenAI(api_key=api_key) if not base_url else OpenAI(api_key=api_key, base_url=base_url)
```
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 570 - 575, The code defaults OPENAI_BASE_URL to OpenRouter and always passes base_url into the OpenAI constructor (base_url variable and client = OpenAI(api_key=api_key, base_url=base_url)), which silently overrides the SDK default; change the logic so OPENAI_BASE_URL is optional: read os.getenv("OPENAI_BASE_URL") without a default, and only pass base_url into OpenAI(...) when that env var is present (or otherwise leave the OpenAI(...) call as OpenAI(api_key=api_key)) so the SDK default base URL is preserved unless explicitly configured; update any docstring or README to state that OPENAI_BASE_URL must be set to override the SDK default.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/main.py`:
- Around line 576-585: The OpenAI per-chunk call (client.chat.completions.create
used to assign resp with build_user_prompt) has no error handling and futures
are consumed with an unguarded future.result(); wrap the API call inside a
try/except that catches network/HTTP/timeout/rate-limit exceptions and returns
None (or a sentinel) for failed chunks, and then guard future.result() where
futures are iterated so you skip/ignore None results or catch exceptions from
future.result() and continue processing remaining chunks; reference
client.chat.completions.create, build_user_prompt, resp, and future.result()
when applying the changes.
- Around line 369-411: Remove the obsolete duplicate implementation of
resolve_privacy_url (the one that uses find_best_policy_url and
_extract_text_http) so only the intended phase-based implementation remains (the
one that uses _collect_link_candidates and _pick_best_verified_candidate);
delete the first definition entirely to avoid shadowing and ensure imports/refs
still match the remaining resolve_privacy_url, then run tests/static checks to
verify no references break.
---
Duplicate comments:
In `@src/main.py`:
- Around line 156-171: Define a single shared minimum content constant (e.g.,
MIN_CONTENT_CHARS = 400) in src/main.py and use it in both the HTTP and Selenium
extraction paths instead of hard-coded literals; replace the HTTP check "return
t if len(t) >= 400" and the Selenium path's "> 100" check with "len(t) >=
MIN_CONTENT_CHARS" so both branches (the code using _http_get and the code that
sets content_element/get_text into t) enforce the same threshold and comparison
semantics.
- Around line 489-493: The loop over scored candidates currently returns a
top-ranked URL when score[0] <= 1 before calling _light_verify, which bypasses
verification; change the control flow in the loop (the for i, (url, text, score)
in enumerate(scored[:max_verify]) block) so that you do not return immediately
for high-scoring items—instead run _light_verify(url) for every candidate
(including those with score[0] <= 1) and only return the url when _light_verify
succeeds; if verification fails, continue to the next candidate and only return
after a successful verification (or exhaust candidates).
- Around line 570-575: The code defaults OPENAI_BASE_URL to OpenRouter and
always passes base_url into the OpenAI constructor (base_url variable and client
= OpenAI(api_key=api_key, base_url=base_url)), which silently overrides the SDK
default; change the logic so OPENAI_BASE_URL is optional: read
os.getenv("OPENAI_BASE_URL") without a default, and only pass base_url into
OpenAI(...) when that env var is present (or otherwise leave the OpenAI(...)
call as OpenAI(api_key=api_key)) so the SDK default base URL is preserved unless
explicitly configured; update any docstring or README to state that
OPENAI_BASE_URL must be set to override the SDK default.
```diff
 resp = client.chat.completions.create(
     model=model,
     messages=[
         {"role": "system", "content": SYSTEM_SCORER},
         {"role": "user", "content": build_user_prompt(text_chunk)},
     ],
     temperature=0,
-    max_tokens=600,
+    max_tokens=2000, #changed from 300 to 2000
     response_format={"type": "json_object"},
 )
```
🧩 Analysis chain
🏁 Script executed:

```shell
sed -n '567,590p' src/main.py
sed -n '692,705p' src/main.py
```

Repository: HappyHackingSpace/privacy-policy-analyzer
Length of output: 1651
Guard per-chunk API failures to avoid aborting the whole analysis.
The OpenAI call at lines 576-585 lacks exception handling for network errors (timeouts, 429 rate limits, etc.), and future.result() at line ~700 is unguarded. A single failed chunk currently aborts the entire analysis.
Wrap the API call in try-except to return None on failure, and guard future.result() to skip failed chunks gracefully.
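A sketch of that guarding pattern, with the OpenAI call abstracted behind an injected `analyze` callable so the structure is testable without the network; the helper names are illustrative, not the repository's.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Optional

def analyze_chunk_safe(analyze: Callable[[str, str], Optional[Dict[str, Any]]],
                       chunk: str, model: str) -> Optional[Dict[str, Any]]:
    """Run one chunk analysis; return None instead of raising on API failure."""
    try:
        return analyze(chunk, model)
    except Exception as e:  # timeouts, 429 rate limits, network errors, ...
        print(f"Warning: chunk analysis failed: {e}")
        return None

def analyze_all(analyze, chunks: List[str], model: str, max_workers: int = 5):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(analyze_chunk_safe, analyze, c, model): i
                   for i, c in enumerate(chunks, 1)}
        for future, idx in futures.items():
            res = future.result()  # safe: the worker already swallowed exceptions
            if res:
                res["index"] = idx
                results.append(res)
    return results
```

One failed chunk now costs a warning and a skipped index instead of aborting the whole run.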
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/main.py` around lines 576 - 585, The OpenAI per-chunk call
(client.chat.completions.create used to assign resp with build_user_prompt) has
no error handling and futures are consumed with an unguarded future.result();
wrap the API call inside a try/except that catches
network/HTTP/timeout/rate-limit exceptions and returns None (or a sentinel) for
failed chunks, and then guard future.result() where futures are iterated so you
skip/ignore None results or catch exceptions from future.result() and continue
processing remaining chunks; reference client.chat.completions.create,
build_user_prompt, resp, and future.result() when applying the changes.
♻️ Duplicate comments (1)
src/main.py (1)
693-704: ⚠️ Potential issue | 🟡 Minor

Missing space in progress message and unguarded `future.result()`.

Typo (line 693): Missing space: `"[3/3]Analyzing"` should be `"[3/3] Analyzing"`.

Unguarded `future.result()` (line 701): If any chunk's API call fails (network error, rate limit), `future.result()` raises an exception, aborting the entire analysis. Wrap in try/except to skip failed chunks gracefully.

🛠️ Proposed fix

```diff
-    print(f"[3/3]Analyzing {len(chunks)} chunks in parallel...")
+    print(f"[3/3] Analyzing {len(chunks)} chunks in parallel...")
     # Parallel analysis of chunks
     from concurrent.futures import ThreadPoolExecutor
     with ThreadPoolExecutor(max_workers=5) as executor:
         futures = {executor.submit(analyze_chunk_json, chunk, args.model): i for i, chunk in enumerate(chunks, 1)}
         for future in futures:
             idx = futures[future]
-            res = future.result()
-            if res:
-                res["index"] = idx
-                results.append(res)
+            try:
+                res = future.result()
+                if res:
+                    res["index"] = idx
+                    results.append(res)
+            except Exception as e:
+                print(f"Warning: Chunk {idx} analysis failed: {e}")
```
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 693 - 704, The progress print message is missing a space and the ThreadPoolExecutor loop calls future.result() unguarded which will raise and abort processing if a chunk fails; update the string to "[3/3] Analyzing {len(chunks)} chunks in parallel..." and wrap the call to future.result() inside a try/except around the loop that iterates over futures (created when submitting analyze_chunk_json) to catch exceptions, log or skip failed chunk results, and continue appending successful results (ensuring you still set res["index"] = idx when res is returned).
🧹 Nitpick comments (4)
src/main.py (4)
696: Move import to module level.

`ThreadPoolExecutor` is imported inside the function body. Per PEP 8, imports should be at the top of the file.

♻️ Proposed fix

At module level (near line 1):

```python
from concurrent.futures import ThreadPoolExecutor
```

Then remove line 696.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.py` at line 696, Move the "from concurrent.futures import ThreadPoolExecutor" import out of the function body and place it at the top of the module (module-level imports), then delete the local import statement inside the function so all references to ThreadPoolExecutor use the top-level import; keep the existing usage of ThreadPoolExecutor unchanged.
501-553: Active `resolve_privacy_url` implementation: clean up debug prints.

This is the intended 3-phase implementation. After removing the shadowed definition at lines 370-412, clean up the `DEBUG` print statements (lines 535, 544, 550) or convert them to proper logging.
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 501 - 553, In resolve_privacy_url, replace the three DEBUG print statements with proper logging calls (e.g., use the module logger) or remove them: the prints occur when a sitemap candidate is found inside the loop over _get_sitemaps_from_robots/_fetch_sitemap_urls and in the two places inside the common-path checks that currently call print(f"DEBUG: ..."). Update those prints to logger.debug(...) (or remove) so they don't leak stdout; keep the rest of the 3-phase logic and return values unchanged and ensure the logger is imported/initialized if added.
190: Move `_SELENIUM_WAIT_TIMEOUT` to module level.

The constant is defined inside the function body. Per the commit message "use a constant for Selenium wait timeout", this should be a module-level constant for better visibility and configurability.

♻️ Proposed refactor

Move to module level near other constants:

```diff
+_SELENIUM_WAIT_TIMEOUT = 5
+
 def fetch_content_with_selenium(url: str) -> str | None:
     """Return visible text using headless Chrome; robust for dynamic pages."""
     chromedriver_autoinstaller.install()
     # ... options setup ...
-    _SELENIUM_WAIT_TIMEOUT = 5
     try:
         driver.set_page_load_timeout(10)
```
Verify each finding against the current code and only fix it if needed. In `@src/main.py` at line 190, Move the `_SELENIUM_WAIT_TIMEOUT` constant out of the function body and declare it at module level alongside the other constants; locate the occurrence of `_SELENIUM_WAIT_TIMEOUT` inside the function (the function that currently defines it) and remove the in-function definition, then add a top-level declaration `_SELENIUM_WAIT_TIMEOUT = 5` near the other module constants so the constant is globally visible and configurable.
294-299: Extract magic number 999 to a named constant.

The value `999` is used as a "no match" sentinel but lacks explanation. Define a module-level constant for clarity.

♻️ Proposed fix

```diff
+_NO_MATCH_PRIORITY = 999  # Sentinel for URLs not matching any priority pattern
+
 def _get_url_priority(url: str) -> int:
     """Return the priority index of a URL based on regex patterns. Lower is better."""
     for idx, pattern in enumerate(_PRIVACY_REGEX_PATTERNS):
         if pattern.search(url):
             return idx
-    return 999
+    return _NO_MATCH_PRIORITY
```

Then update references at lines 332, 445-446, and 492.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.py` around lines 294 - 299, Introduce a descriptive module-level constant (e.g., NO_MATCH_PRIORITY = 999) and replace the magic literal 999 in _get_url_priority with that constant; also replace any other uses in this module where 999 is used as the "no match" sentinel (places that return, compare to, or rely on that sentinel) so all comparisons/returns use NO_MATCH_PRIORITY instead of the literal. Ensure the constant is documented with a short comment and update the return in _get_url_priority and any other functions or conditionals that previously used 999 to reference this new symbol.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/main.py`:
- Around line 693-704: The progress print message is missing a space and the
ThreadPoolExecutor loop calls future.result() unguarded which will raise and
abort processing if a chunk fails; update the string to "[3/3] Analyzing
{len(chunks)} chunks in parallel..." and wrap the call to future.result() inside
a try/except around the loop that iterates over futures (created when submitting
analyze_chunk_json) to catch exceptions, log or skip failed chunk results, and
continue appending successful results (ensuring you still set res["index"] = idx
when res is returned).
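A guarded version of that loop might look like the following sketch (function and parameter names are illustrative; `analyze_chunk_json` is injected as a callable rather than imported from `src/main.py`):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def analyze_chunks(chunks, analyze_chunk_json):
    """Run chunk analysis in parallel; a failed chunk is skipped, not fatal."""
    print(f"[3/3] Analyzing {len(chunks)} chunks in parallel...")
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Map each future back to its chunk index so ordering survives as_completed.
        futures = {pool.submit(analyze_chunk_json, c): i for i, c in enumerate(chunks)}
        for future in as_completed(futures):
            idx = futures[future]
            try:
                res = future.result()
            except Exception as exc:  # skip the failed chunk, keep processing the rest
                print(f"Chunk {idx} failed: {exc}")
                continue
            if res is not None:
                res["index"] = idx
                results.append(res)
    return sorted(results, key=lambda r: r["index"])
```

One chunk raising no longer aborts the whole batch; the exception is reported and the remaining futures are still drained.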
---
Nitpick comments:
In `@src/main.py`:
- Line 696: Move the "from concurrent.futures import ThreadPoolExecutor" import
out of the function body and place it at the top of the module (module-level
imports), then delete the local import statement inside the function so all
references to ThreadPoolExecutor use the top-level import; keep the existing
usage of ThreadPoolExecutor unchanged.
- Around line 501-553: In resolve_privacy_url, replace the three DEBUG print
statements with proper logging calls (e.g., use the module logger) or remove
them: the prints occur when a sitemap candidate is found inside the loop over
_get_sitemaps_from_robots/_fetch_sitemap_urls and in the two places inside the
common-path checks that currently call print(f"DEBUG: ..."). Update those prints
to logger.debug(...) (or remove) so they don't leak stdout; keep the rest of the
3-phase logic and return values unchanged and ensure the logger is
imported/initialized if added.
- Line 190: Move the `_SELENIUM_WAIT_TIMEOUT` constant out of the function body
and declare it at module level alongside the other constants; locate the
occurrence of `_SELENIUM_WAIT_TIMEOUT` inside the function (the function that
currently defines it) and remove the in-function definition, then add a
top-level declaration `_SELENIUM_WAIT_TIMEOUT = 5` near the other module
constants so the constant is globally visible and configurable.
- Around line 294-299: Introduce a descriptive module-level constant (e.g.,
NO_MATCH_PRIORITY = 999) and replace the magic literal 999 in _get_url_priority
with that constant; also replace any other uses in this module where 999 is used
as the "no match" sentinel (places that return, compare to, or rely on that
sentinel) so all comparisons/returns use NO_MATCH_PRIORITY instead of the
literal. Ensure the constant is documented with a short comment and update the
return in _get_url_priority and any other functions or conditionals that
previously used 999 to reference this new symbol.
Changes Made to `src/main.py`

Old System: resolved the policy URL by probing a fixed list of common paths (`/privacy`, `/privacy-policy`).

New System:

- `_collect_link_candidates()` - Collects all privacy links from the page
- `_score_candidate()` - Scores each URL by priority and anchor text
- `_pick_best_verified_candidate()` - Selects and verifies the best URL

3-Tier Discovery Strategy:

1. Link discovery from the fetched page and homepage
2. Sitemap discovery via robots.txt
3. Common-path probing as a fallback
Result
Finds the most detailed policy from real site links and skips landing pages, though it takes longer to run.

Example: instead of `/privacy` (landing page), it now resolves `/legal/page/row/privacy-policy/tr` (full policy).

Summary by CodeRabbit
New Features
Improvements