This repository was archived by the owner on Oct 28, 2022. It is now read-only.

Domain name normalization #87

@yuliya-ivaniukovich

At the moment, strings such as www.domain.com, domain.com, and subdomain.domain.com are treated as separate jobs with independent results. This can lead to data collisions, since we use the document URL as the key of the document table.
We should:

  • investigate whether www.domain.com and domain.com are treated as the same address in Heritrix
  • for subdomains we can keep independent Heritrix jobs, but we should link them to existing documents if those were already downloaded and checked within another job.
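
To make the two points concrete, here is a rough sketch in Python (not the project's own code). It assumes that www.domain.com and domain.com may be collapsed into one job key, which is exactly what the first point still has to confirm for Heritrix. The names `normalize_domain` and `link_document`, and the in-memory `documents` dict standing in for the document table, are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_domain(raw: str) -> str:
    """Return a canonical host for a user-supplied domain string.

    Assumption: a leading "www." is dropped, so www.domain.com and
    domain.com map to the same job. Subdomains are kept distinct.
    """
    value = raw.strip()
    # urlsplit only parses the host when a "//" authority marker is present.
    if "://" not in value:
        value = "//" + value
    # .hostname is already lowercased and excludes any port.
    host = urlsplit(value).hostname or ""
    host = host.rstrip(".")
    if host.startswith("www."):
        host = host[len("www."):]
    return host


def link_document(documents: dict, url: str, job: str) -> bool:
    """Link a job to the document stored under url.

    Returns True if the document was not known before (so it still has
    to be downloaded and checked), False if an earlier job already
    produced it and the existing entry is reused.
    """
    entry = documents.setdefault(url, {"jobs": []})
    is_new = not entry["jobs"]
    if job not in entry["jobs"]:
        entry["jobs"].append(job)
    return is_new
```

With this shape, a job for subdomain.domain.com that reaches a URL already stored by the domain.com job would only add itself to the existing entry instead of writing a second row under the same key.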
