
REFACTORING AND IMPLEMENTING FIXES #50

Open

matholiveira91 wants to merge 5 commits into master from refat-updates

Conversation

@matholiveira91
Collaborator

improvements

matholiveira91 and others added 5 commits June 19, 2023 14:09
…rovements

Rewrites the three main Python modules to address performance bottlenecks
identified in the scraping, text processing and headless browser pipelines.
GoMutation is preserved unchanged pending evaluation.

modules/scraper.py
- Replace sequential requests with asyncio + aiohttp parallel pipeline
- Add configurable concurrency semaphore (default: 10 simultaneous requests)
- Switch HTML parser from html.parser to lxml (up to 10x faster parsing)
- Add SHA-256 disk cache per URL to skip redundant fetches on re-runs
- Add automatic retry with exponential backoff (3 attempts per URL)
- Use set() for immediate deduplication during word collection
- Expose synchronous scrape() entry point for backward compatibility
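The concurrency, retry, and caching plumbing listed above can be sketched with the stdlib alone; the real module would pass an `aiohttp.ClientSession`-based coroutine as `fetch`. The names here (`CACHE_DIR`, `cache_path`, `fetch_all`) are illustrative, not necessarily the PR's.

```python
import asyncio
import hashlib
from pathlib import Path

CACHE_DIR = Path(".scrape_cache")  # illustrative cache location

def cache_path(url: str) -> Path:
    # SHA-256 of the URL gives a stable, collision-resistant cache filename
    return CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()

async def fetch_with_retry(url, fetch, attempts=3):
    # automatic retry with exponential backoff (1s, 2s between attempts)
    for i in range(attempts):
        try:
            return await fetch(url)
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(2 ** i)

async def fetch_all(urls, fetch, concurrency=10):
    sem = asyncio.Semaphore(concurrency)  # cap on simultaneous requests

    async def bounded(url):
        path = cache_path(url)
        if path.exists():  # disk cache hit: skip the redundant fetch
            return path.read_text()
        async with sem:
            body = await fetch_with_retry(url, fetch)
        CACHE_DIR.mkdir(exist_ok=True)
        path.write_text(body)
        return body

    return await asyncio.gather(*(bounded(u) for u in urls))
```

A synchronous entry point, as the last bullet describes, would just wrap this in `asyncio.run(fetch_all(urls, fetch))`.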

modules/aggressive.py
- Replace per-URL browser instantiation with a single reusable Playwright instance
- Add async tab pool with configurable concurrency (default: 4 simultaneous tabs)
- Add JS-detection heuristic to delegate non-JS pages to the faster aiohttp path
- Retain geckodriver support as --use-gecko fallback for legacy environments
- geckodriver path also improved: single driver instance reused across all URLs
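The JS-detection heuristic mentioned above could look roughly like the sketch below: pages whose static HTML already carries body text and no client-side framework markers go to the faster aiohttp path. The marker list and the `needs_browser` name are assumptions for illustration, not the PR's actual code.

```python
import re

# Illustrative markers of client-side rendering; a real list would be tuned
JS_MARKERS = ("react", "vue", "angular", "__next_data__", "window.__initial_state__")

def needs_browser(html: str) -> bool:
    lowered = html.lower()
    if any(marker in lowered for marker in JS_MARKERS):
        return True  # framework detected: render in a browser tab
    # an almost-empty <body> usually means content is rendered client-side
    body = re.search(r"<body[^>]*>(.*?)</body>", lowered, re.S)
    text = re.sub(r"<[^>]+>", " ", body.group(1)) if body else ""
    return len(text.split()) < 5
```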

modules/wordlist.py
- Replace manual frequency dict with collections.Counter (C-level implementation)
- Switch file reading to line-by-line streaming to avoid full file loading in RAM
- Add Unicode normalization (NFKD) for correct handling of accented characters
- Deduplicate early with set(); Counter.most_common() replaces manual sort
- Fix bug #17: static text list not saved from interactive mode — add explicit
  flush + fsync to guarantee writes before process exit
- GoMutation invoked via stdin pipe instead of temp file, reducing disk I/O
- GoMutation binary preserved and unchanged
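The streaming Counter pipeline and the bug #17 fix described above can be sketched as follows; function names (`count_words`, `save_wordlist`) are illustrative and may not match the module's actual signatures.

```python
import os
import unicodedata
from collections import Counter

def normalize(word: str) -> str:
    # NFKD decomposition so accented characters compare consistently
    return unicodedata.normalize("NFKD", word).lower()

def count_words(lines, top=10):
    counter = Counter()
    for line in lines:  # line-by-line streaming: never loads the whole file
        counter.update(normalize(w) for w in line.split())
    return counter.most_common(top)  # C-level sort replaces the manual one

def save_wordlist(path, words):
    with open(path, "w") as f:
        f.write("\n".join(words))
        f.flush()
        os.fsync(f.fileno())  # bug #17 fix: force write-out before exit
```

Passing an open file object as `lines` streams it directly: `count_words(open(path))`.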

main
- Dispatch to correct Python module based on CLI flag (-w, -t, -b)
- Conditional GoMutation compilation preserved (go build only when binary absent)
- Interactive menu retained for no-argument invocations

requirements.txt
- Add aiohttp>=3.9.3 and playwright>=1.43.0 for async/headless improvements
- Bump urllib3 to >=1.26.19 to address open security PR #45 (Snyk CVE fix)
- Pin lxml>=5.1.0 and beautifulsoup4>=4.12.3

tests/test_improvements.py (new)
- Unit tests for normalize(), tokenize(), Counter pipeline, top_words()
- Streaming file reader test with 10k-line corpus
- save_wordlist() test asserting bug #17 regression does not reoccur
- extract_words_from_html() tests covering script stripping and deduplication
- Cache path determinism and collision-resistance tests

.github/workflows/ci.yml (new)
- Test matrix across Python 3.10, 3.11 and 3.12
- Bandit static security analysis on modules/
- pip-audit dependency vulnerability scan on each PR
- ShellCheck linting for main, functions.sh and load.sh

Expected performance gains:
- Standard mode (-w / -t): 5–20x faster on multi-URL targets
- HTML parsing:            up to 10x faster with lxml
- Aggressive mode (-a):    3–10x faster with browser tab pool
- Repeated runs:           near-instant via disk cache

Six cascading bugs were identified and fixed that caused an empty
wordlist in interactive mode and in aggressive mode.

### modules/search.py
- Fixes a malformed dork: the target URL was passed directly to
  getrails.search() (e.g. 'http://scanme.org') when a Google dork was
  expected (e.g. 'site:scanme.org'). Added a url_to_dork() function using
  urllib.parse.urlparse to extract the hostname and format it correctly.
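The url_to_dork() conversion can be sketched with urllib.parse, as the bullet describes; the scheme-defaulting detail is an assumption, since urlparse only populates `.netloc` when a scheme is present.

```python
from urllib.parse import urlparse

def url_to_dork(target: str) -> str:
    # urlparse needs a scheme to populate .netloc; default to http://
    if "://" not in target:
        target = "http://" + target
    host = urlparse(target).netloc
    return f"site:{host}"
```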

### getrails/google/search.py
- Replaces the googlesearch + get_random_user_agent backend with ddgs (DDGS),
  an actively maintained library. Google was blocking every request and
  silently returning an empty list, and DuckDuckGo was returning a CAPTCHA
  against an outdated HTML selector (a.result__url no longer exists).
- Dependency added: pip install ddgs

### getrails/__init__.py
- Widens the exception clause from HTTPError to Exception, ensuring that
  any failure in the primary backend triggers the fallback.
- Adds an empty-string filter on the return value: [r for r in result if r],
  preventing [''] from being propagated to Go as a valid URL.

### modules/read.py
- Replaces Selenium + geckodriver with Playwright in aggressive mode.
  Selenium had a 120s timeout hardcoded in its internal urllib3
  (RemoteConnection), which could not be overridden through the public API
  in the installed version.
- Adds a fallback to partial content when Playwright times out: tries
  page.content() before falling back to static_read(), preserving whatever
  the page has already rendered.
- Fixes _normalize(): the function stripped the original scheme and forced
  https:// on every URL. It now preserves http:// or https:// as supplied
  by the user, injecting http:// only when no scheme is present. This fixes
  test targets that have no HTTPS version (e.g. scanme.org).

### Root cause of [WARN] Skipping invalid URL: ''
- split(',') applied to an empty string in getrails/duckduckgo/search.py
  returned [''] instead of [], which was silently propagated all the way to
  the Go binary; the binary then emitted the warning and discarded the URL
  without processing any content, resulting in an empty wordlist.
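The root-cause behavior and the fix side by side: Python's str.split(',') on an empty string yields [''], not [], so the empty string must be filtered out before any URL reaches the Go binary.

```python
raw = ""                                    # backend returned no results
urls = raw.split(",")                       # yields [''] — one bogus empty "URL"
cleaned = [u for u in raw.split(",") if u]  # yields [] — nothing to propagate
```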

Fixes: #wordlist-vazia #modo-agressivo #timeout #dork #schema