
Commit d97898c

Fixed database extraction for MediaWiki update.
Allowed changing the language of the download. Page pruning temporarily disabled.
1 parent 17559df commit d97898c

7 files changed: +279 −64 lines

scripts/README.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# Description of the process

## Parsing of the tables

### links.txt
- `pl_from` -> Id of the "from" page of this link
- (`pl_namespace`) -> We keep the row only if it equals 0 (= namespace of the "from" page of this link)
- `pl_target_id` -> Target of this link (foreign key to `linktarget`)
### targets.txt
- `lt_id` -> Id of this link target (index)
- (`lt_ns`) -> We keep the row only if it equals 0 (= namespace of the targeted page)
- `lt_title` -> Title of the targeted page
### pages.txt
- `page_id` -> Id of the page
- (`page_namespace`) -> We keep the row only if it equals 0 (= namespace of this page)
- `page_title` -> Title of this page
- `page_is_redirect` -> Boolean: whether this page is a redirect
- We ignore the eight columns that follow
### redirects.txt
- `rd_from` -> Id of the page from which we are redirected
- (`rd_namespace`) -> We keep the row only if it equals 0 (= namespace of the page we are redirected to)
- `rd_title` -> Title of the page we are redirected to
- We ignore the two columns that follow
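All four trimmed files are gzip-compressed, tab-separated text. As a rough illustration of their shape (not part of the committed scripts; the file name and variable names are only examples), reading `pages.txt.gz` into a title-to-id dictionary could look like this:

```python
import gzip

# pages.txt.gz rows: page_id <TAB> page_title <TAB> is_redirect
page_id_by_title = {}
with gzip.open('pages.txt.gz', 'rb') as pages_file:
    for line in pages_file:
        page_id, page_title, is_redirect = line.rstrip(b'\n').split(b'\t')
        page_id_by_title[page_title] = int(page_id)

print('loaded', len(page_id_by_title), 'pages')
```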
## Joining the tables

### redirects.with_ids.txt (replace_titles_in_redirects_file.py)
Replaces, for each redirection, `rd_title` with the targeted `page_id` by matching on `page_title`.
The targeted `page_id` is then resolved through redirects recursively, until we reach a "final" page (see the sketch below).
- `rd_from` -> The id of the page we are redirected from
- `page_id` -> The id of the page we reach after following redirections recursively
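The recursive resolution can be pictured with a small sketch (illustrative only; the real logic lives in `replace_titles_in_redirects_file.py`, and the cap on chain length is an assumption):

```python
# redirects maps a redirecting page id to the id of the page it points to.
def resolve_redirect(page_id, redirects, max_depth=100):
    # Follow the chain until we land on a page that is not itself a redirect.
    depth = 0
    while page_id in redirects and depth < max_depth:
        page_id = redirects[page_id]
        depth += 1
    return page_id

redirects = {2: 3, 3: 7}  # toy chain: 2 -> 3 -> 7
assert resolve_redirect(2, redirects) == 7
```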
### targets.with_ids.txt (replace_titles_and_redirects_in_targets_file.py)
Replaces, for each linktarget, `lt_title` with the targeted `page_id` by matching on `page_title`.
We then compute the "final" page reached from that page by following redirections, using the file `redirects.with_ids.txt`.
- `lt_id` -> Id of this link target
- `page_id` -> The id of the page this link points to, after all redirections have been followed
### links.with_ids.txt (replace_titles_and_redirects_in_links_file.py)
Replaces, for each pagelink, `lt_id` with the targeted `page_id` by joining with `targets.with_ids.txt` (see the sketch below).
- `pl_from` -> Id of the "from" page, after all redirections have been followed
- `page_id` -> Id of the "to" page, after all redirections have been followed
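A simplified sketch of this join, assuming the tab-separated layouts described above (the committed script also consults `pages.txt.gz` and `redirects.with_ids.txt.gz`, which is omitted here):

```python
import gzip

# targets.with_ids.txt.gz rows: lt_id <TAB> page_id (already redirect-resolved)
target_page_by_lt_id = {}
with gzip.open('targets.with_ids.txt.gz', 'rb') as targets_file:
    for line in targets_file:
        lt_id, page_id = line.rstrip(b'\n').split(b'\t')
        target_page_by_lt_id[lt_id] = page_id

# links.txt.gz rows: pl_from <TAB> pl_target_id; emit "pl_from <TAB> page_id"
with gzip.open('links.txt.gz', 'rb') as links_file:
    for line in links_file:
        pl_from, pl_target_id = line.rstrip(b'\n').split(b'\t')
        page_id = target_page_by_lt_id.get(pl_target_id)
        if page_id is not None:
            print(pl_from.decode() + '\t' + page_id.decode())
```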
### pages.pruned.txt (prune_pages_file.py)
Prunes the pages file by removing pages which are marked as redirects but have no corresponding entry in the redirects file.
TEMPORARILY DISABLED, as it removed too many pages.
## Sorting, grouping, and counting the links

### links.sorted_by_XXX_id.txt
We then sort `links.with_ids.txt` by its first column (the "source" id) into
the file `links.sorted_by_source_id.txt`, and by its second column (the "target" id)
into the file `links.sorted_by_target_id.txt`.
### links.grouped_by_XXX_id.txt
We then use those two sorted files to *GROUP BY* the links by source and by target (see the sketch below).
The file `links.grouped_by_source_id.txt` looks like this:
- `pl_from` -> Id of the "from" page
- `targets` -> A `|`-separated string of the ids the "from" page targets

The file `links.grouped_by_target_id.txt` looks like this:
- `froms` -> A `|`-separated string of the ids of the pages targeting the "target" page
- `pl_target` -> Id of the "target" page
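Because the input is already sorted on the grouping key, the GROUP BY is a single streaming pass. A small Python sketch of the idea (the pipeline does this step inside `buildDatabase.sh`; the toy rows below are only an example):

```python
from itertools import groupby

# Toy sorted (source_id, target_id) pairs, as links.sorted_by_source_id.txt would provide.
sorted_links = [('1', '5'), ('1', '7'), ('2', '5')]

# Group consecutive rows sharing the same source id and join their targets with '|'.
for source_id, rows in groupby(sorted_links, key=lambda row: row[0]):
    targets = '|'.join(target_id for _, target_id in rows)
    print(source_id + '\t' + targets)
# Prints:
# 1	5|7
# 2	5
```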
### links.with_counts.txt (combine_grouped_links_files.py)
We merge the two `links.grouped_by_XXX_id.txt` files, creating a file with the following columns:
- `page_id` -> The id of the page
- `outgoing_links_count` -> The number of outgoing links from this page
- `incoming_links_count` -> The number of incoming links to this page
- `outgoing_links` -> A `|`-separated string of the ids of the pages this page links to
- `incoming_links` -> A `|`-separated string of the ids of the pages linking to this page
## Making the database
To make the database, we copy the contents of the following three files directly into the corresponding tables (see the sketch below):
- `links.with_counts.txt`
- `pages.pruned.txt`
- `redirects.with_ids.txt`
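The pipeline itself pipes the decompressed files into the sqlite3 CLI via the SQL scripts under `sql/` (see `buildDatabase.sh`). Purely as an illustration of this final step, a Python equivalent for the redirects table might look like the following; the table and column names here are assumptions, not taken from the SQL scripts:

```python
import gzip
import sqlite3

connection = sqlite3.connect('sdow.sqlite')
# Hypothetical schema; the real one is defined in sql/createRedirectsTable.sql.
connection.execute(
    'CREATE TABLE IF NOT EXISTS redirects (source_id INTEGER PRIMARY KEY, target_id INTEGER)')

# redirects.with_ids.txt.gz rows: source_id <TAB> target_id
with gzip.open('redirects.with_ids.txt.gz', 'rt') as redirects_file:
    rows = (line.rstrip('\n').split('\t') for line in redirects_file)
    connection.executemany('INSERT INTO redirects VALUES (?, ?)', rows)

connection.commit()
connection.close()
```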

scripts/buildDatabase.sh

Lines changed: 70 additions & 22 deletions
@@ -1,15 +1,16 @@
 #!/bin/bash
-
 set -euo pipefail

 # Force default language for output sorting to be bytewise. Necessary to ensure uniformity amongst
 # UNIX commands.
 export LC_ALL=C
+PYTHON=python3
+LANGWIKI=frwiki

 # By default, the latest Wikipedia dump will be downloaded. If a download date in the format
 # YYYYMMDD is provided as the first argument, it will be used instead.
 if [[ $# -eq 0 ]]; then
-DOWNLOAD_DATE=$(wget -q -O- https://dumps.wikimedia.your.org/enwiki/ | grep -Po '\d{8}' | sort | tail -n1)
+DOWNLOAD_DATE=$(wget -q -O- https://dumps.wikimedia.org/$LANGWIKI/ | grep -Po '\d{8}' | sort | tail -n1)
 else
 if [ ${#1} -ne 8 ]; then
 echo "[ERROR] Invalid download date provided: $1"
@@ -22,14 +23,16 @@ fi
 ROOT_DIR=`pwd`
 OUT_DIR="dump"

-DOWNLOAD_URL="https://dumps.wikimedia.your.org/enwiki/$DOWNLOAD_DATE"
-TORRENT_URL="https://tools.wmflabs.org/dump-torrents/enwiki/$DOWNLOAD_DATE"
+DELETE_PROGRESSIVELY=false
+DOWNLOAD_URL="https://dumps.wikimedia.org/$LANGWIKI/$DOWNLOAD_DATE"
+TORRENT_URL="https://tools.wmflabs.org/dump-torrents/$LANGWIKI/$DOWNLOAD_DATE"

-SHA1SUM_FILENAME="enwiki-$DOWNLOAD_DATE-sha1sums.txt"
-REDIRECTS_FILENAME="enwiki-$DOWNLOAD_DATE-redirect.sql.gz"
-PAGES_FILENAME="enwiki-$DOWNLOAD_DATE-page.sql.gz"
-LINKS_FILENAME="enwiki-$DOWNLOAD_DATE-pagelinks.sql.gz"
+SHA1SUM_FILENAME="$LANGWIKI-$DOWNLOAD_DATE-sha1sums.txt"

+REDIRECTS_FILENAME="$LANGWIKI-$DOWNLOAD_DATE-redirect.sql.gz"
+PAGES_FILENAME="$LANGWIKI-$DOWNLOAD_DATE-page.sql.gz"
+LINKS_FILENAME="$LANGWIKI-$DOWNLOAD_DATE-pagelinks.sql.gz"
+TARGETS_FILENAME="$LANGWIKI-$DOWNLOAD_DATE-linktarget.sql.gz"

 # Make the output directory if it doesn't already exist and move to it
 mkdir -p $OUT_DIR
@@ -77,6 +80,7 @@ download_file "sha1sums" $SHA1SUM_FILENAME
 download_file "redirects" $REDIRECTS_FILENAME
 download_file "pages" $PAGES_FILENAME
 download_file "links" $LINKS_FILENAME
+download_file "targets" $TARGETS_FILENAME

 ##########################
 # TRIM WIKIPEDIA DUMPS #
@@ -103,7 +107,7 @@ if [ ! -f redirects.txt.gz ]; then
 else
 echo "[WARN] Already trimmed redirects file"
 fi
-
+if $DELETE_PROGRESSIVELY; then rm $REDIRECTS_FILENAME; fi
 if [ ! -f pages.txt.gz ]; then
 echo
 echo "[INFO] Trimming pages file"
@@ -116,16 +120,16 @@ if [ ! -f pages.txt.gz ]; then
 # Splice out the page title and whether or not the page is a redirect
 # Zip into output file
 time pigz -dc $PAGES_FILENAME \
-| sed -n 's/^INSERT INTO `page` VALUES (//p' \
-| sed -e 's/),(/\'$'\n/g' \
-| egrep "^[0-9]+,0," \
-| sed -e $"s/,0,'/\t/" \
-| sed -e $"s/',[^,]*,\([01]\).*/\t\1/" \
+| sed -n 's/^INSERT INTO `page` VALUES //p' \
+| egrep -o "\([0-9]+,0,'([^']*(\\\\')?)+',[01]," \
+| sed -re $"s/^\(([0-9]+),0,'/\1\t/" \
+| sed -re $"s/',([01]),/\t\1/" \
 | pigz --fast > pages.txt.gz.tmp
 mv pages.txt.gz.tmp pages.txt.gz
 else
 echo "[WARN] Already trimmed pages file"
 fi
+if $DELETE_PROGRESSIVELY; then rm $PAGES_FILENAME; fi

 if [ ! -f links.txt.gz ]; then
 echo
@@ -141,14 +145,38 @@ if [ ! -f links.txt.gz ]; then
 time pigz -dc $LINKS_FILENAME \
 | sed -n 's/^INSERT INTO `pagelinks` VALUES (//p' \
 | sed -e 's/),(/\'$'\n/g' \
-| egrep "^[0-9]+,0,.*,0$" \
-| sed -e $"s/,0,'/\t/g" \
-| sed -e "s/',0//g" \
+| egrep "^[0-9]+,0,[0-9]+$" \
+| sed -e $"s/,0,/\t/g" \
 | pigz --fast > links.txt.gz.tmp
 mv links.txt.gz.tmp links.txt.gz
 else
 echo "[WARN] Already trimmed links file"
 fi
+if $DELETE_PROGRESSIVELY; then rm $LINKS_FILENAME; fi
+
+if [ ! -f targets.txt.gz ]; then
+echo
+echo "[INFO] Trimming targets file"
+
+# Unzip
+# Remove all lines that don't start with INSERT INTO...
+# Split into individual records
+# Only keep records in namespace 0
+# Replace namespace with a tab
+# Remove everything starting at the to page name's closing apostrophe
+# Zip into output file
+time pigz -dc $TARGETS_FILENAME \
+| sed -n 's/^INSERT INTO `linktarget` VALUES (//p' \
+| sed -e 's/),(/\'$'\n/g' \
+| egrep "^[0-9]+,0,.*$" \
+| sed -e $"s/,0,'/\t/g" \
+| sed -e "s/'$//g" \
+| pigz --fast > targets.txt.gz.tmp
+mv targets.txt.gz.tmp targets.txt.gz
+else
+echo "[WARN] Already trimmed targets file"
+fi
+if $DELETE_PROGRESSIVELY; then rm $TARGETS_FILENAME; fi


 ###########################################
@@ -157,32 +185,46 @@ fi
 if [ ! -f redirects.with_ids.txt.gz ]; then
 echo
 echo "[INFO] Replacing titles in redirects file"
-time python "$ROOT_DIR/replace_titles_in_redirects_file.py" pages.txt.gz redirects.txt.gz \
+time $PYTHON "$ROOT_DIR/replace_titles_in_redirects_file.py" pages.txt.gz redirects.txt.gz \
 | sort -S 100% -t $'\t' -k 1n,1n \
 | pigz --fast > redirects.with_ids.txt.gz.tmp
 mv redirects.with_ids.txt.gz.tmp redirects.with_ids.txt.gz
 else
 echo "[WARN] Already replaced titles in redirects file"
 fi
+if $DELETE_PROGRESSIVELY; then rm redirects.txt.gz; fi
+
+if [ ! -f targets.with_ids.txt.gz ]; then
+echo
+echo "[INFO] Replacing titles and redirects in targets file"
+time $PYTHON "$ROOT_DIR/replace_titles_and_redirects_in_targets_file.py" pages.txt.gz redirects.with_ids.txt.gz targets.txt.gz \
+| pigz --fast > targets.with_ids.txt.gz.tmp
+mv targets.with_ids.txt.gz.tmp targets.with_ids.txt.gz
+else
+echo "[WARN] Already replaced titles and redirects in targets file"
+fi
+if $DELETE_PROGRESSIVELY; then rm targets.txt.gz; fi

 if [ ! -f links.with_ids.txt.gz ]; then
 echo
 echo "[INFO] Replacing titles and redirects in links file"
-time python "$ROOT_DIR/replace_titles_and_redirects_in_links_file.py" pages.txt.gz redirects.with_ids.txt.gz links.txt.gz \
+time $PYTHON "$ROOT_DIR/replace_titles_and_redirects_in_links_file.py" pages.txt.gz redirects.with_ids.txt.gz targets.with_ids.txt.gz links.txt.gz \
 | pigz --fast > links.with_ids.txt.gz.tmp
 mv links.with_ids.txt.gz.tmp links.with_ids.txt.gz
 else
 echo "[WARN] Already replaced titles and redirects in links file"
 fi
+if $DELETE_PROGRESSIVELY; then rm links.txt.gz targets.with_ids.txt.gz; fi

 if [ ! -f pages.pruned.txt.gz ]; then
 echo
 echo "[INFO] Pruning pages which are marked as redirects but with no redirect"
-time python "$ROOT_DIR/prune_pages_file.py" pages.txt.gz redirects.with_ids.txt.gz \
+time $PYTHON "$ROOT_DIR/prune_pages_file.py" pages.txt.gz redirects.with_ids.txt.gz \
 | pigz --fast > pages.pruned.txt.gz
 else
 echo "[WARN] Already pruned pages which are marked as redirects but with no redirect"
 fi
+if $DELETE_PROGRESSIVELY; then rm pages.txt.gz; fi

 #####################
 # SORT LINKS FILE #
@@ -210,6 +252,7 @@ if [ ! -f links.sorted_by_target_id.txt.gz ]; then
 else
 echo "[WARN] Already sorted links file by target page ID"
 fi
+if $DELETE_PROGRESSIVELY; then rm links.with_ids.txt.gz; fi


 #############################
@@ -225,6 +268,7 @@ if [ ! -f links.grouped_by_source_id.txt.gz ]; then
 else
 echo "[WARN] Already grouped source links file by source page ID"
 fi
+if $DELETE_PROGRESSIVELY; then rm links.sorted_by_source_id.txt.gz; fi

 if [ ! -f links.grouped_by_target_id.txt.gz ]; then
 echo
@@ -235,6 +279,7 @@ if [ ! -f links.grouped_by_target_id.txt.gz ]; then
 else
 echo "[WARN] Already grouped target links file by target page ID"
 fi
+if $DELETE_PROGRESSIVELY; then rm links.sorted_by_target_id.txt.gz; fi


 ################################
@@ -243,12 +288,13 @@ fi
 if [ ! -f links.with_counts.txt.gz ]; then
 echo
 echo "[INFO] Combining grouped links files"
-time python "$ROOT_DIR/combine_grouped_links_files.py" links.grouped_by_source_id.txt.gz links.grouped_by_target_id.txt.gz \
+time $PYTHON "$ROOT_DIR/combine_grouped_links_files.py" links.grouped_by_source_id.txt.gz links.grouped_by_target_id.txt.gz \
 | pigz --fast > links.with_counts.txt.gz.tmp
 mv links.with_counts.txt.gz.tmp links.with_counts.txt.gz
 else
 echo "[WARN] Already combined grouped links files"
 fi
+if $DELETE_PROGRESSIVELY; then rm links.grouped_by_source_id.txt.gz links.grouped_by_target_id.txt.gz; fi


 ############################
@@ -258,14 +304,17 @@ if [ ! -f sdow.sqlite ]; then
 echo
 echo "[INFO] Creating redirects table"
 time pigz -dc redirects.with_ids.txt.gz | sqlite3 sdow.sqlite ".read $ROOT_DIR/../sql/createRedirectsTable.sql"
+if $DELETE_PROGRESSIVELY; then rm redirects.with_ids.txt.gz; fi

 echo
 echo "[INFO] Creating pages table"
 time pigz -dc pages.pruned.txt.gz | sqlite3 sdow.sqlite ".read $ROOT_DIR/../sql/createPagesTable.sql"
+if $DELETE_PROGRESSIVELY; then rm pages.pruned.txt.gz; fi

 echo
 echo "[INFO] Creating links table"
 time pigz -dc links.with_counts.txt.gz | sqlite3 sdow.sqlite ".read $ROOT_DIR/../sql/createLinksTable.sql"
+if $DELETE_PROGRESSIVELY; then rm links.with_counts.txt.gz; fi

 echo
 echo "[INFO] Compressing SQLite file"
@@ -274,6 +323,5 @@ else
 echo "[WARN] Already created SQLite database"
 fi

-
 echo
 echo "[INFO] All done!"

scripts/combine_grouped_links_files.py

Lines changed: 17 additions & 16 deletions
@@ -28,26 +28,27 @@
 # Create a dictionary of page IDs to their incoming and outgoing links.
 LINKS = defaultdict(lambda: defaultdict(str))
-for line in io.BufferedReader(gzip.open(OUTGOING_LINKS_FILE, 'r')):
-    [source_page_id, target_page_ids] = line.rstrip('\n').split('\t')
-    LINKS[source_page_id]['outgoing'] = target_page_ids
+# outgoing is [0], incoming is [1]
+for line in io.BufferedReader(gzip.open(OUTGOING_LINKS_FILE, 'rb')):
+    [source_page_id, target_page_ids] = line.rstrip(b'\n').split(b'\t')
+    LINKS[int(source_page_id)][0] = target_page_ids

-for line in io.BufferedReader(gzip.open(INCOMING_LINKS_FILE, 'r')):
-    [target_page_id, source_page_ids] = line.rstrip('\n').split('\t')
-    LINKS[target_page_id]['incoming'] = source_page_ids
+for line in io.BufferedReader(gzip.open(INCOMING_LINKS_FILE, 'rb')):
+    [target_page_id, source_page_ids] = line.rstrip(b'\n').split(b'\t')
+    LINKS[int(target_page_id)][1] = source_page_ids

 # For each page in the links dictionary, print out its incoming and outgoing links as well as their
 # counts.
-for page_id, links in LINKS.iteritems():
-    outgoing_links = links.get('outgoing', '')
-    outgoing_links_count = 0 if outgoing_links is '' else len(
-        outgoing_links.split('|'))
+for page_id, links in LINKS.items():
+    outgoing_links = links.get(0, b'')
+    outgoing_links_count = 0 if outgoing_links==b'' else len(
+        outgoing_links.split(b'|'))

-    incoming_links = links.get('incoming', '')
-    incoming_links_count = 0 if incoming_links is '' else len(
-        incoming_links.split('|'))
+    incoming_links = links.get(1, b'')
+    incoming_links_count = 0 if incoming_links==b'' else len(
+        incoming_links.split(b'|'))

-    columns = [page_id, str(outgoing_links_count), str(
-        incoming_links_count), outgoing_links, incoming_links]
+    columns = [str(page_id).encode(), str(outgoing_links_count).encode(), str(
+        incoming_links_count).encode(), outgoing_links, incoming_links]

-    print('\t'.join(columns))
+    print(b'\t'.join(columns).decode())

scripts/prune_pages_file.py

Lines changed: 6 additions & 6 deletions
@@ -28,14 +28,14 @@
 # Create a dictionary of redirects.
 REDIRECTS = {}
-for line in io.BufferedReader(gzip.open(REDIRECTS_FILE, 'r')):
-    [source_page_id, _] = line.rstrip('\n').split('\t')
+for line in io.BufferedReader(gzip.open(REDIRECTS_FILE, 'rb')):
+    [source_page_id, _] = line.rstrip(b'\n').split(b'\t')
     REDIRECTS[source_page_id] = True

 # Loop through the pages file, ignoring pages which are marked as redirects but which do not have a
 # corresponding redirect in the redirects dictionary, printing the remaining pages to stdout.
-for line in io.BufferedReader(gzip.open(PAGES_FILE, 'r')):
-    [page_id, page_title, is_redirect] = line.rstrip('\n').split('\t')
+for line in io.BufferedReader(gzip.open(PAGES_FILE, 'rb')):
+    [page_id, page_title, is_redirect] = line.rstrip(b'\n').split(b'\t')

-    if is_redirect == '0' or page_id in REDIRECTS:
-        print('\t'.join([page_id, page_title, is_redirect]))
+    if True or is_redirect == '0' or page_id in REDIRECTS:
+        print(b'\t'.join([page_id, page_title, is_redirect]).decode())
