Commit 699aa16

Merge pull request #2 from autogram-is/sets-and-filters
Sets and filters
2 parents: 60b8fc5 + 04ba753

File tree: 3 files changed, +34 −27 lines

README.md

Lines changed: 5 additions & 23 deletions

```diff
@@ -1,27 +1,9 @@
 # URL Tools
 
-Processing, normalizing, and de-duplicating large piles of URLs can be a pain,
-particularly if you're trying to distinguish "real" unique URLs from the many
-variations that can appear in the wild. URLs with anchor links, query params
-in different orders, social sharing and analytics campaign cruft, accidental
-references to staging servers… You get the idea.
+Processing, normalizing, and de-duplicating large piles of URLs can be a pain, particularly if you're trying to distinguish "real" unique URLs from the many variations that can appear in the wild. URLs with anchor links, query params in different orders, social sharing and analytics campaign cruft, accidental references to staging servers… You get the idea.
 
-URL Tools is a helper library whose sole purpose is making that process just a
-little less frustrating. It consists of four major pieces:
+URL Tools is a helper library whose sole purpose is making that process just a little less frustrating. It consists of four major pieces:
 
-- `ParsedUrl`, a wrapper for the standard WHATWG `URL` class that mixes in the
-domain and subdomain parsing from `tldjs`. Serializing a `ParsedUrl` object to
-JSON also produces a broken out collection of its individual properties, rather
-than spitting out the `href` property, as is the `URL` class's habit.
-- `ParsedUrlSet`, a collection class that automatically parses, normalizes, and
-de-duplicates sets of existing Urls. It's a bit janky, since ES6's `Set`
-implementation only supports value comparison. As such, you can put `ParsedUrl`
-objects *into* the set, but after normalization they're stored as simple strings.
-It also keeps track of the URLs that it rejects as unparseable.
-- A light set of helper functions for common filtering and normalizing operations,
-including sorting querystring parameters, stripping social sharing cruft,
-remapping 'ww1', 'ww2', etc. subdomains to a single canonical one, identifying
-web vs. non-web URLs, flagging urls on public hosting like S3, and more.
-# Todo
-- [ ] A richer set of filters
-- [ ] Chainable filtering and transforming for ParsedUrlSet
+- `ParsedUrl`, a wrapper for the standard WHATWG `URL` class that mixes in the domain and subdomain parsing from [`tldjs`](https://www.npmjs.com/package/tldjs). Serializing a `ParsedUrl` object to JSON also produces a broken out collection of its individual properties, rather than spitting out the `href` property, as is the `URL` class's habit.
+- `ParsedUrlSet`, a collection class that automatically parses, normalizes, and de-duplicates sets of existing Urls. It's a bit janky, since ES6's `Set` implementation only supports value comparison. As such, you can put `ParsedUrl` objects *into* the set, but after normalization they're stored as simple strings. It also keeps track of the URLs that it rejects as unparseable.
+- A light set of helper functions for common filtering and normalizing operations, including sorting querystring parameters, stripping social sharing cruft, remapping 'ww1', 'ww2', etc. subdomains to a single canonical one, identifying web vs. non-web URLs, flagging urls on public hosting like S3, and more. These can be combined or defined on the fly to serve as the normalizer function for a `ParsedUrlSet`.
```
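The normalize-then-dedupe idea the README describes can be sketched with nothing but the standard WHATWG `URL` class. This is an illustration, not the library's code: `normalizeUrl` and `TRACKING_PARAMS` are made-up names, and `ParsedUrl` additionally layers in `tldjs` domain parsing.

```typescript
// A minimal sketch of the README's approach: strip anchors and tracking
// cruft, sort query params into a stable order, then de-duplicate by
// storing normalized hrefs in a plain Set<string>.
// TRACKING_PARAMS and normalizeUrl are illustrative, not the library's API.
const TRACKING_PARAMS = ['utm_source', 'utm_medium', 'utm_campaign', 'fbclid'];

function normalizeUrl(input: string): string {
  const url = new URL(input);
  url.hash = '';                                   // drop anchor links
  for (const p of TRACKING_PARAMS) url.searchParams.delete(p);
  url.searchParams.sort();                         // stable query param order
  return url.href;
}

// Two "different" URLs that are really the same page:
const urls = [
  'https://example.com/page?b=2&a=1#section',
  'https://example.com/page?a=1&b=2&utm_source=newsletter',
];
const deduped = new Set(urls.map(normalizeUrl));   // collapses to one entry
```

Storing the *normalized string* rather than the parsed object is the same trade-off `ParsedUrlSet` makes, since `Set` only compares primitives by value.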

src/parsed-url-set.ts

Lines changed: 10 additions & 3 deletions

```diff
@@ -1,9 +1,10 @@
 import {ParsedUrl} from './parsed-url';
 import {UrlMutator, Mutators} from './mutations';
+import {UrlFilter} from './filters';
 
 export class ParsedUrlSet extends Set<string> {
   static DefaultNormalizer: UrlMutator = Mutators.DefaultNormalizer;
-  readonly rejected: Set<string> = new Set<string>();
+  readonly unparseable: Set<string> = new Set<string>();
 
   public constructor(
     values: Array<string | ParsedUrl> = [],
@@ -18,7 +19,7 @@ export class ParsedUrlSet extends Set<string> {
     if (parsed) {
       super.add(parsed.href);
     } else {
-      this.rejected.add(value as string);
+      this.unparseable.add(value as string);
     }
     return this;
   }
@@ -36,7 +37,7 @@ export class ParsedUrlSet extends Set<string> {
   }
 
   override clear(): void {
-    this.rejected.clear();
+    this.unparseable.clear();
     super.clear();
   }
 
@@ -63,6 +64,12 @@ export class ParsedUrlSet extends Set<string> {
     }
   }
 
+  filter(filterFunction: UrlFilter): ParsedUrlSet {
+    let urls: ParsedUrl[] = [...this.hydrate()];
+    urls = urls.filter(u => filterFunction(u));
+    return new ParsedUrlSet(urls);
+  }
+
   hydrate(): ParsedUrl[] {
     return [...this].map(u => new ParsedUrl(u)) as ParsedUrl[];
  }
```

tests/parsed-url-set.ts

Lines changed: 19 additions & 1 deletion

```diff
@@ -8,6 +8,24 @@ test('parsed url set normalizes duplicates', () => {
 
 test('parsed url ignores invalid URLs', () => {
   const set = new ParsedUrlSet(fixtures.UNPARSEABLE_URLS);
-  console.log([...set]);
   expect(set.size).toBe(0);
 });
+
+test('parsed url set is filterable', () => {
+  const set = new ParsedUrlSet([
+    fixtures.NORMALIZED_URL.normalized,
+    ...fixtures.NON_WEB_URLS,
+  ]);
+
+  expect(set.filter(u => u.domain === 'example.com').size).toBe(1);
+});
+
+test('parsed url set tracks unparseable rejections', () => {
+  const set = new ParsedUrlSet([
+    fixtures.NORMALIZED_URL.normalized,
+    ...fixtures.UNPARSEABLE_URLS,
+  ]);
+
+  expect(set.unparseable.size).toBe(fixtures.UNPARSEABLE_URLS.length);
+  expect(set.size).toBe(1);
+});
```
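The behavior the last test exercises — bad inputs land in a side `unparseable` set instead of being silently dropped — can be sketched with a try/catch around the `URL` constructor (standing in for `ParsedUrl`'s parsing). The sample inputs here are made up for illustration, not the repo's fixtures:

```typescript
// Sketch of unparseable-input tracking: anything the URL constructor
// throws on is recorded in a separate Set rather than discarded.
const parsed = new Set<string>();
const unparseable = new Set<string>();

for (const value of ['https://example.com/', 'not a url at all', '::nope::']) {
  try {
    parsed.add(new URL(value).href);   // throws TypeError on invalid input
  } catch {
    unparseable.add(value);
  }
}
```

Keeping the rejects around makes it easy to audit what a large, messy input list actually contained instead of just seeing a smaller set come out the other end.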
