Improve prevalence metrics by recording all scanned sites in scan db #89

@ghostwords

We should record every site we visit in the scan db, including sites with no trackers. If a visit is redirected, we should record the final site domain that was actually scanned.

This will enable:

  • More meaningful tracker prevalence data
  • Greater scan visibility ("80% of visited sites contain tracking", top ten slowest sites to visit)
  • Listing of sites with no trackers
  • Improvements to scan site list quality

Note: some sites recorded as having no trackers will still have GA (Google Analytics) on them; PB (Privacy Badger) just didn't record tracking there, for whatever reason.

This continues work started in 5211f67 and 4e4d5f2.

New scan db table idea:

CREATE TABLE scan_sites (
    scan_id INTEGER NOT NULL,
    initial_site_id INTEGER NOT NULL,
    final_site_id INTEGER NOT NULL,
    status_id INTEGER NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL,
    UNIQUE(scan_id, initial_site_id),
    FOREIGN KEY(scan_id) REFERENCES scan(id),
    FOREIGN KEY(initial_site_id) REFERENCES site(id),
    FOREIGN KEY(final_site_id) REFERENCES site(id),
    FOREIGN KEY(status_id) REFERENCES site_status(id)
);

site_statuses = {
    "success": 1,
    "timeout": 2,
    "error": 3,
    "antibot": 4,
}
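
A minimal sketch of how a visit could be recorded against this table, using Python's built-in sqlite3 module and an in-memory database. The shapes of the `scan`, `site` and `site_status` tables (including the `fqdn` and `name` columns), the helper names `site_id` and `record_visit`, and the sample domains and timestamps are all assumptions made for illustration; the real scan db may differ.

```python
import sqlite3

SITE_STATUSES = {"success": 1, "timeout": 2, "error": 3, "antibot": 4}

# Assumed minimal shapes for the referenced tables; only scan_sites
# follows the proposal above.
SCHEMA = """
CREATE TABLE scan (id INTEGER PRIMARY KEY);
CREATE TABLE site (id INTEGER PRIMARY KEY, fqdn TEXT NOT NULL UNIQUE);
CREATE TABLE site_status (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE scan_sites (
    scan_id INTEGER NOT NULL,
    initial_site_id INTEGER NOT NULL,
    final_site_id INTEGER NOT NULL,
    status_id INTEGER NOT NULL,
    start_time TIMESTAMP NOT NULL,
    end_time TIMESTAMP NOT NULL,
    UNIQUE(scan_id, initial_site_id),
    FOREIGN KEY(scan_id) REFERENCES scan(id),
    FOREIGN KEY(initial_site_id) REFERENCES site(id),
    FOREIGN KEY(final_site_id) REFERENCES site(id),
    FOREIGN KEY(status_id) REFERENCES site_status(id)
);
"""


def site_id(db, fqdn):
    """Return the id for a domain, inserting the site row if it is new."""
    db.execute("INSERT OR IGNORE INTO site (fqdn) VALUES (?)", (fqdn,))
    return db.execute("SELECT id FROM site WHERE fqdn = ?", (fqdn,)).fetchone()[0]


def record_visit(db, scan_id, initial, final, status, start, end):
    """Record one visited site, whether or not any tracking was seen on it."""
    db.execute(
        "INSERT INTO scan_sites VALUES (?, ?, ?, ?, ?, ?)",
        (scan_id, site_id(db, initial), site_id(db, final),
         SITE_STATUSES[status], start, end),
    )


db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
db.executescript(SCHEMA)
db.executemany(
    "INSERT INTO site_status (id, name) VALUES (?, ?)",
    [(status_id, name) for name, status_id in SITE_STATUSES.items()],
)
db.execute("INSERT INTO scan (id) VALUES (1)")

# A visit that was redirected: initial and final domains differ.
record_visit(db, 1, "example.com", "www.example.com", "success",
             "2024-01-01 00:00:00", "2024-01-01 00:00:05")
```

Because of the UNIQUE(scan_id, initial_site_id) constraint, recording the same initial site twice within one scan raises sqlite3.IntegrityError, so each site on the scan list gets at most one row per scan.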

This will require updating some of the queries in sql/.
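
As a rough illustration of what an updated prevalence query might look like (not one of the actual queries in sql/), the sketch below computes the share of successfully scanned sites on which at least one tracker was recorded. The `tracking` table and its columns are hypothetical stand-ins for wherever the scan db stores observed tracking, `scan_sites` is trimmed to the columns the query needs, and the rows are made-up sample data.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- scan_sites trimmed to the columns this query needs
CREATE TABLE scan_sites (
    scan_id INTEGER NOT NULL,
    initial_site_id INTEGER NOT NULL,
    final_site_id INTEGER NOT NULL,
    status_id INTEGER NOT NULL
);
-- hypothetical table of tracking observed per scan and site
CREATE TABLE tracking (
    scan_id INTEGER NOT NULL,
    site_id INTEGER NOT NULL,
    tracker_id INTEGER NOT NULL
);
-- made-up sample data: sites 1, 2, 3, 5 succeeded (status 1), site 4 timed out
INSERT INTO scan_sites VALUES
    (1, 1, 1, 1), (1, 2, 2, 1), (1, 3, 3, 1), (1, 4, 4, 2), (1, 5, 5, 1);
-- trackers were recorded on sites 1, 2 and 3 only
INSERT INTO tracking VALUES (1, 1, 10), (1, 1, 11), (1, 2, 10), (1, 3, 12);
""")

# The denominator is every successfully scanned site, including sites with
# no recorded tracking; the LEFT JOIN keeps those rows in the result.
PREVALENCE_SQL = """
SELECT COUNT(DISTINCT t.site_id) * 1.0 / COUNT(DISTINCT ss.final_site_id)
FROM scan_sites ss
LEFT JOIN tracking t
    ON t.scan_id = ss.scan_id AND t.site_id = ss.final_site_id
WHERE ss.scan_id = ? AND ss.status_id = 1
"""

share = db.execute(PREVALENCE_SQL, (1,)).fetchone()[0]
print(f"{share:.0%} of successfully visited sites contain tracking")
```

With the sample rows, three of the four successfully scanned sites have recorded tracking, so the share is 0.75. The site that timed out is excluded from both numerator and denominator, which is one possible design choice; counting failed visits separately by status would support the scan visibility goals as well.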

Labels: enhancement (New feature or request)
