
REFACTORING AND IMPLEMENTING FIXES #50

Open

matholiveira91 wants to merge 5 commits into master from refat-updates

Conversation

@matholiveira91
Collaborator

improvements

matholiveira91 and others added 5 commits June 19, 2023 14:09
…rovements

Rewrites the three main Python modules to address performance bottlenecks
identified in the scraping, text processing and headless browser pipelines.
GoMutation is preserved unchanged pending evaluation.

modules/scraper.py
- Replace sequential requests with asyncio + aiohttp parallel pipeline
- Add configurable concurrency semaphore (default: 10 simultaneous requests)
- Switch HTML parser from html.parser to lxml (up to 10x faster parsing)
- Add SHA-256 disk cache per URL to skip redundant fetches on re-runs
- Add automatic retry with exponential backoff (3 attempts per URL)
- Use set() for immediate deduplication during word collection
- Expose synchronous scrape() entry point for backward compatibility
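The concurrency, retry, and caching plumbing listed above can be sketched with the stdlib alone; the real module would pass an `aiohttp.ClientSession`-based coroutine as `fetch`. The names here (`CACHE_DIR`, `cache_path`, `fetch_all`) are illustrative, not necessarily the PR's.

```python
import asyncio
import hashlib
from pathlib import Path

CACHE_DIR = Path(".scrape_cache")  # illustrative cache location

def cache_path(url: str) -> Path:
    # SHA-256 of the URL gives a stable, collision-resistant cache filename
    return CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()

async def fetch_with_retry(url, fetch, attempts=3):
    # automatic retry with exponential backoff (1s, 2s between attempts)
    for i in range(attempts):
        try:
            return await fetch(url)
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(2 ** i)

async def fetch_all(urls, fetch, concurrency=10):
    sem = asyncio.Semaphore(concurrency)  # cap on simultaneous requests

    async def bounded(url):
        path = cache_path(url)
        if path.exists():  # disk cache hit: skip the redundant fetch
            return path.read_text()
        async with sem:
            body = await fetch_with_retry(url, fetch)
        CACHE_DIR.mkdir(exist_ok=True)
        path.write_text(body)
        return body

    return await asyncio.gather(*(bounded(u) for u in urls))
```

A synchronous entry point, as the last bullet describes, would just wrap this in `asyncio.run(fetch_all(urls, fetch))`.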

modules/aggressive.py
- Replace per-URL browser instantiation with a single reusable Playwright instance
- Add async tab pool with configurable concurrency (default: 4 simultaneous tabs)
- Add JS-detection heuristic to delegate non-JS pages to the faster aiohttp path
- Retain geckodriver support as --use-gecko fallback for legacy environments
- geckodriver path also improved: single driver instance reused across all URLs
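The JS-detection heuristic mentioned above could look roughly like the sketch below: pages whose static HTML already carries body text and no client-side framework markers go to the faster aiohttp path. The marker list and the `needs_browser` name are assumptions for illustration, not the PR's actual code.

```python
import re

# Illustrative markers of client-side rendering; a real list would be tuned
JS_MARKERS = ("react", "vue", "angular", "__next_data__", "window.__initial_state__")

def needs_browser(html: str) -> bool:
    lowered = html.lower()
    if any(marker in lowered for marker in JS_MARKERS):
        return True  # framework detected: render in a browser tab
    # an almost-empty <body> usually means content is rendered client-side
    body = re.search(r"<body[^>]*>(.*?)</body>", lowered, re.S)
    text = re.sub(r"<[^>]+>", " ", body.group(1)) if body else ""
    return len(text.split()) < 5
```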

modules/wordlist.py
- Replace manual frequency dict with collections.Counter (C-level implementation)
- Switch file reading to line-by-line streaming to avoid full file loading in RAM
- Add Unicode normalization (NFKD) for correct handling of accented characters
- Deduplicate early with set(); Counter.most_common() replaces manual sort
- Fix bug #17: static text list not saved from interactive mode — add explicit
  flush + fsync to guarantee writes before process exit
- GoMutation invoked via stdin pipe instead of temp file, reducing disk I/O
- GoMutation binary preserved and unchanged
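The streaming Counter pipeline and the bug #17 fix described above can be sketched as follows; function names (`count_words`, `save_wordlist`) are illustrative and may not match the module's actual signatures.

```python
import os
import unicodedata
from collections import Counter

def normalize(word: str) -> str:
    # NFKD decomposition so accented characters compare consistently
    return unicodedata.normalize("NFKD", word).lower()

def count_words(lines, top=10):
    counter = Counter()
    for line in lines:  # line-by-line streaming: never loads the whole file
        counter.update(normalize(w) for w in line.split())
    return counter.most_common(top)  # C-level sort replaces the manual one

def save_wordlist(path, words):
    with open(path, "w") as f:
        f.write("\n".join(words))
        f.flush()
        os.fsync(f.fileno())  # bug #17 fix: force write-out before exit
```

Passing an open file object as `lines` streams it directly: `count_words(open(path))`.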

main
- Dispatch to correct Python module based on CLI flag (-w, -t, -b)
- Conditional GoMutation compilation preserved (go build only when binary absent)
- Interactive menu retained for no-argument invocations

requirements.txt
- Add aiohttp>=3.9.3 and playwright>=1.43.0 for async/headless improvements
- Bump urllib3 to >=1.26.19 to address open security PR #45 (Snyk CVE fix)
- Pin lxml>=5.1.0 and beautifulsoup4>=4.12.3

tests/test_improvements.py (new)
- Unit tests for normalize(), tokenize(), Counter pipeline, top_words()
- Streaming file reader test with 10k-line corpus
- save_wordlist() test asserting bug #17 regression does not reoccur
- extract_words_from_html() tests covering script stripping and deduplication
- Cache path determinism and collision-resistance tests

.github/workflows/ci.yml (new)
- Test matrix across Python 3.10, 3.11 and 3.12
- Bandit static security analysis on modules/
- pip-audit dependency vulnerability scan on each PR
- ShellCheck linting for main, functions.sh and load.sh

Expected performance gains:
- Standard mode (-w / -t): 5–20x faster on multi-URL targets
- HTML parsing:            up to 10x faster with lxml
- Aggressive mode (-a):    3–10x faster with browser tab pool
- Repeated runs:           near-instant via disk cache

Six cascading bugs were identified and fixed that caused an empty
wordlist in interactive mode and in aggressive mode.

### modules/search.py
- Fixes a malformed dork: the target URL was passed directly to
  getrails.search() (e.g. 'http://scanme.org') when a Google dork was
  expected (e.g. 'site:scanme.org'). Added a url_to_dork() function using
  urllib.parse.urlparse to extract the hostname and format it correctly.
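The url_to_dork() conversion can be sketched with urllib.parse, as the bullet describes; the scheme-defaulting detail is an assumption, since urlparse only populates `.netloc` when a scheme is present.

```python
from urllib.parse import urlparse

def url_to_dork(target: str) -> str:
    # urlparse needs a scheme to populate .netloc; default to http://
    if "://" not in target:
        target = "http://" + target
    host = urlparse(target).netloc
    return f"site:{host}"
```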

### getrails/google/search.py
- Replaces the googlesearch + get_random_user_agent backend with ddgs (DDGS),
  an actively maintained library. Google was blocking every request and
  silently returning an empty list, and DuckDuckGo was returning a CAPTCHA
  against an outdated HTML selector (a.result__url no longer exists).
- Dependency added: pip install ddgs

### getrails/__init__.py
- Widens the exception clause from HTTPError to Exception, ensuring that
  any failure in the primary backend triggers the fallback.
- Adds an empty-string filter on the return value: [r for r in result if r],
  preventing [''] from being propagated to Go as a valid URL.

### modules/read.py
- Replaces Selenium + geckodriver with Playwright in aggressive mode.
  Selenium had a 120s timeout hardcoded in its internal urllib3
  (RemoteConnection), which could not be overridden through the public API
  in the installed version.
- Adds a fallback to partial content when Playwright times out: tries
  page.content() before falling back to static_read(), preserving whatever
  the page has already rendered.
- Fixes _normalize(): the function stripped the original scheme and forced
  https:// on every URL. It now preserves http:// or https:// as supplied
  by the user, injecting http:// only when no scheme is present. This fixes
  test targets that have no HTTPS version (e.g. scanme.org).

### Root cause of [WARN] Skipping invalid URL: ''
- split(',') applied to an empty string in getrails/duckduckgo/search.py
  returned [''] instead of [], which was silently propagated all the way to
  the Go binary; the binary then emitted the warning and discarded the URL
  without processing any content, resulting in an empty wordlist.
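The root-cause behavior and the fix side by side: Python's str.split(',') on an empty string yields [''], not [], so the empty string must be filtered out before any URL reaches the Go binary.

```python
raw = ""                                    # backend returned no results
urls = raw.split(",")                       # yields [''] — one bogus empty "URL"
cleaned = [u for u in raw.split(",") if u]  # yields [] — nothing to propagate
```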

Fixes: #wordlist-vazia #modo-agressivo #timeout #dork #schema