Commit 712ca9c

Browsertrix normalization handling, take 3 (#912)
A couple more items following on from #909 and #910 that came up when I did a more exhaustive check of *all* our URLs in current Browsertrix, and of the source for the normalizer Browsertrix uses: ignore utm_* query parameters and collapse repeated slashes in paths when matching URLs. This should cover everything until Browsertrix releases an update that changes how it normalizes.
1 parent 47beb9f commit 712ca9c

1 file changed, +4 -1 lines changed

web_monitoring/utils.py

Lines changed: 4 additions & 1 deletion
@@ -197,6 +197,9 @@ def matchable_querystring(querystring: str) -> str:
     URLs are still matchable, even though they are not strictly correct.
     """
     parsed = parse_qsl(querystring, keep_blank_values=True)
+    # TODO: consider bringing in some more ignorable params from our custom
+    # SURT implementation in web-monitoring-db.
+    parsed = [(k, v) for k, v in parsed if not k.lower().startswith('utm_')]
     result = urlencode(sorted(parsed))
     if '=' not in querystring:
         result = re.sub(r'=', '', result)
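
To make the effect of this hunk concrete, here is a minimal standalone sketch of the lines visible above. The function name `strip_tracking_params` is hypothetical, and the sketch only reproduces what the hunk shows; the real `matchable_querystring` may do more before or after these lines.

```python
import re
from urllib.parse import parse_qsl, urlencode


def strip_tracking_params(querystring: str) -> str:
    """Hypothetical helper: sort query params and drop utm_* ones."""
    # Keep blank values so that "a=" and bare "a" are not silently dropped.
    parsed = parse_qsl(querystring, keep_blank_values=True)
    # The added line: ignore any parameter whose name starts with "utm_",
    # compared case-insensitively.
    parsed = [(k, v) for k, v in parsed if not k.lower().startswith('utm_')]
    result = urlencode(sorted(parsed))
    # As in the hunk: if the input had no "=" at all, strip the "=" that
    # urlencode adds after each bare key.
    if '=' not in querystring:
        result = re.sub(r'=', '', result)
    return result


print(strip_tracking_params('b=2&utm_source=news&a=1'))  # a=1&b=2
print(strip_tracking_params('UTM_Campaign=x&q=test'))    # q=test
```

Because the filter runs before sorting and encoding, two URLs that differ only in their utm_* parameters end up with the same querystring.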
@@ -231,7 +234,7 @@ def matchable_url(url: str) -> str:
     parsed = urlsplit(url)
     return parsed._replace(
         netloc=normalize_netloc(parsed),
-        path=(parsed.path or '/').rstrip('/'),
+        path=re.sub(r'//+', '/', (parsed.path or '/').rstrip('/')),
         query=matchable_querystring(parsed.query),
         fragment=''
     ).geturl()
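
The path change can be illustrated in isolation with a small sketch. The name `collapse_path` is hypothetical, and the netloc and query handling from `matchable_url` is deliberately left out, since `normalize_netloc` is not shown in this diff.

```python
import re
from urllib.parse import urlsplit


def collapse_path(url: str) -> str:
    """Hypothetical helper: apply only the path and fragment handling."""
    parsed = urlsplit(url)
    # Same expression as the changed line: strip trailing slashes first,
    # then collapse every run of two or more slashes into one.
    path = re.sub(r'//+', '/', (parsed.path or '/').rstrip('/'))
    return parsed._replace(path=path, fragment='').geturl()


print(collapse_path('https://example.gov//a///b/#top'))  # https://example.gov/a/b
print(collapse_path('https://example.gov/'))             # https://example.gov
```

Note the order of operations: trailing slashes are removed before the collapse, so a path made up entirely of slashes becomes an empty path rather than a single `/`.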
