Commit 66ff4d9

Merge pull request #188 from pabs3/fixes
Fix various issues
2 parents e6707a9 + 83b4f88 commit 66ff4d9

6 files changed: +42 -31 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ rdoc
 log
 websites
 .DS_Store
+.rake_tasks~

 ## BUNDLER
 *.gem

README.md

Lines changed: 5 additions & 5 deletions
@@ -42,7 +42,7 @@ It will download the last version of every file present on Wayback Machine to `.
 -x, --exclude EXCLUDE_FILTER Skip downloading of urls that match this filter
 (use // notation for the filter to be treated as a regex)
 -a, --all Expand downloading to error files (40x and 50x) and redirections (30x)
--c, --concurrency NUMBER Number of multiple files to dowload at a time
+-c, --concurrency NUMBER Number of multiple files to download at a time
 Default is one file at a time (ie. 20)
 -p, --maximum-snapshot NUMBER Maximum snapshot pages to consider (Default is 100)
 Count an average of 150,000 snapshots per page
@@ -62,7 +62,7 @@ Example:

 -s, --all-timestamps

-Optional. This option will download all timestamps/snapshots for a given website. It will uses the timepstamp of each snapshot as directory.
+Optional. This option will download all timestamps/snapshots for a given website. It will uses the timestamp of each snapshot as directory.

 Example:

@@ -78,7 +78,7 @@ Example:

 -f, --from TIMESTAMP

-Optional. You may want to supply a from timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., http://web.archive.org/web/20060716231334/http://example.com). You can also use years (2006), years + month (200607), etc. It can be used in combination of To Timestamp.
+Optional. You may want to supply a from timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., https://web.archive.org/web/20060716231334/http://example.com). You can also use years (2006), years + month (200607), etc. It can be used in combination of To Timestamp.
 Wayback Machine Downloader will then fetch only file versions on or after the timestamp specified.

 Example:
@@ -89,7 +89,7 @@ Example:

 -t, --to TIMESTAMP

-Optional. You may want to supply a to timestamp to lock your backup to a specifc version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., http://web.archive.org/web/20100916231334/http://example.com). You can also use years (2010), years + month (201009), etc. It can be used in combination of From Timestamp.
+Optional. You may want to supply a to timestamp to lock your backup to a specific version of the website. Timestamps can be found inside the urls of the regular Wayback Machine website (e.g., https://web.archive.org/web/20100916231334/http://example.com). You can also use years (2010), years + month (201009), etc. It can be used in combination of From Timestamp.
 Wayback Machine Downloader will then fetch only file versions on or before the timestamp specified.

 Example:
@@ -169,7 +169,7 @@ Example:

 -c, --concurrency NUMBER

-Optional. Specify the number of multiple files you want to download at the same time. Allows to speed up the download of a website significantly. Default is to download one file at a time.
+Optional. Specify the number of multiple files you want to download at the same time. Allows one to speed up the download of a website significantly. Default is to download one file at a time.

 Example:

bin/wayback_machine_downloader

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ option_parser = OptionParser.new do |opts|
     options[:all] = true
   end

-  opts.on("-c", "--concurrency NUMBER", Integer, "Number of multiple files to dowload at a time", "Default is one file at a time (ie. 20)") do |t|
+  opts.on("-c", "--concurrency NUMBER", Integer, "Number of multiple files to download at a time", "Default is one file at a time (ie. 20)") do |t|
     options[:threads_count] = t
   end

lib/wayback_machine_downloader.rb

Lines changed: 12 additions & 12 deletions
@@ -14,7 +14,7 @@ class WaybackMachineDownloader

   include ArchiveAPI

-  VERSION = "2.2.1"
+  VERSION = "2.3.0"

   attr_accessor :base_url, :exact_url, :directory, :all_timestamps,
     :from_timestamp, :to_timestamp, :only_filter, :exclude_filter,
@@ -84,7 +84,7 @@ def get_all_snapshots_to_consider
     # Note: Passing a page index parameter allow us to get more snapshots,
     # but from a less fresh index
     print "Getting snapshot pages"
-    snapshot_list_to_consider = ""
+    snapshot_list_to_consider = []
     snapshot_list_to_consider += get_raw_list_from_api(@base_url, nil)
     print "."
     unless @exact_url
@@ -95,17 +95,15 @@ def get_all_snapshots_to_consider
         print "."
       end
     end
-    puts " found #{snapshot_list_to_consider.lines.count} snaphots to consider."
+    puts " found #{snapshot_list_to_consider.length} snaphots to consider."
     puts
     snapshot_list_to_consider
   end

   def get_file_list_curated
     file_list_curated = Hash.new
-    get_all_snapshots_to_consider.each_line do |line|
-      next unless line.include?('/')
-      file_timestamp = line[0..13].to_i
-      file_url = line[15..-2]
+    get_all_snapshots_to_consider.each do |file_timestamp, file_url|
+      next unless file_url.include?('/')
       file_id = file_url.split('/')[3..-1].join('/')
       file_id = CGI::unescape file_id
       file_id = file_id.tidy_bytes unless file_id == ""
@@ -130,10 +128,8 @@ def get_file_list_curated

   def get_file_list_all_timestamps
     file_list_curated = Hash.new
-    get_all_snapshots_to_consider.each_line do |line|
-      next unless line.include?('/')
-      file_timestamp = line[0..13].to_i
-      file_url = line[15..-2]
+    get_all_snapshots_to_consider.each do |file_timestamp, file_url|
+      next unless file_url.include?('/')
       file_id = file_url.split('/')[3..-1].join('/')
       file_id_and_timestamp = [file_timestamp, file_id].join('/')
       file_id_and_timestamp = CGI::unescape file_id_and_timestamp
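
The two hunks above replace fixed-offset string slicing with block destructuring over `[timestamp, original]` pairs. The following is a minimal sketch of that pattern, not the project's code: `sample_body` is made-up data shaped like a JSON response with a header row, and `example_rows` is a hypothetical helper name.

```ruby
require 'json'

# Made-up sample shaped like a JSON listing with a header row; the URLs and
# timestamps are illustrative only.
sample_body = <<~BODY
  [["timestamp","original"],
   ["20060716231334","http://example.com/"],
   ["20100916231334","http://example.com/about.html"]]
BODY

# Hypothetical helper: parse the body, drop the header row if present,
# and fall back to an empty list when the body is not valid JSON.
def example_rows(body)
  rows = JSON.parse(body)
  rows.shift if rows.first == ["timestamp", "original"]
  rows
rescue JSON::ParserError
  []
end

# Each row is a two-element array, so the block parameters destructure it.
example_rows(sample_body).each do |file_timestamp, file_url|
  next unless file_url.include?('/')
  # Everything after scheme and host becomes the file id ("" for the root).
  file_id = file_url.split('/')[3..-1].join('/')
  puts "#{file_timestamp} #{file_id.inspect}"
end
```

In this sketch the timestamp stays a String as parsed from JSON, whereas the removed lines converted it with `to_i`; whether that difference matters depends on how later code compares timestamps.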
@@ -176,11 +172,15 @@ def get_file_list_by_timestamp

   def list_files
     # retrieval produces its own output
+    @orig_stdout = $stdout
+    $stdout = $stderr
     files = get_file_list_by_timestamp
+    $stdout = @orig_stdout
     puts "["
-    files.each do |file|
+    files[0...-1].each do |file|
       puts file.to_json + ","
     end
+    puts files[-1].to_json
     puts "]"
   end
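
The `list_files` change does two things: it sends progress output to stderr while the file list is built, and it stops emitting a trailing comma so that stdout is a valid JSON array. A minimal sketch of both ideas follows, under stated assumptions: `with_stdout_on_stderr` and `print_json_array` are hypothetical helpers, and the sample hash is made up. It also guards the empty-list case, since `files[-1]` is `nil` for an empty array and `nil.to_json` is `"null"`.

```ruby
require 'json'
require 'stringio'

# Hypothetical helper: run a block with $stdout pointed at $stderr and
# restore it afterwards, even if the block raises.
def with_stdout_on_stderr
  original_stdout = $stdout
  $stdout = $stderr
  yield
ensure
  $stdout = original_stdout
end

# Hypothetical helper: print one JSON value per line, separated by commas,
# so that the complete output parses as a single JSON array.
def print_json_array(files, out = $stdout)
  out.puts "["
  files.each_with_index do |file, index|
    separator = index == files.length - 1 ? "" : ","
    out.puts file.to_json + separator
  end
  out.puts "]"
end

files = with_stdout_on_stderr do
  puts "Getting snapshot pages..." # progress text ends up on stderr
  [{ "file_url" => "http://example.com/", "timestamp" => 20060716231334 }]
end
print_json_array(files)
```

Using `ensure` is a design choice of the sketch rather than of the commit: it keeps stdout usable if building the list fails part-way.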

lib/wayback_machine_downloader/archive_api.rb

Lines changed: 22 additions & 12 deletions
@@ -1,28 +1,38 @@
+require 'json'
+require 'uri'
+
 module ArchiveAPI

   def get_raw_list_from_api url, page_index
-    request_url = "https://web.archive.org/cdx/search/xd?url="
-    request_url += url
-    request_url += parameters_for_api page_index
+    request_url = URI("https://web.archive.org/cdx/search/xd")
+    params = [["output", "json"], ["url", url]]
+    params += parameters_for_api page_index
+    request_url.query = URI.encode_www_form(params)

-    URI.open(request_url).read
+    begin
+      json = JSON.parse(URI(request_url).open.read)
+      if (json[0] <=> ["timestamp","original"]) == 0
+        json.shift
+      end
+      json
+    rescue JSON::ParserError
+      []
+    end
   end

   def parameters_for_api page_index
-    parameters = "&fl=timestamp,original&collapse=digest&gzip=false"
-    if @all
-      parameters += ""
-    else
-      parameters += "&filter=statuscode:200"
+    parameters = [["fl", "timestamp,original"], ["collapse", "digest"], ["gzip", "false"]]
+    if !@all
+      parameters.push(["filter", "statuscode:200"])
     end
     if @from_timestamp and @from_timestamp != 0
-      parameters += "&from=" + @from_timestamp.to_s
+      parameters.push(["from", @from_timestamp.to_s])
     end
     if @to_timestamp and @to_timestamp != 0
-      parameters += "&to=" + @to_timestamp.to_s
+      parameters.push(["to", @to_timestamp.to_s])
     end
     if page_index
-      parameters += "&page=#{page_index}"
+      parameters.push(["page", page_index])
     end
     parameters
   end
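
The change above moves from concatenating `&key=value` strings to collecting key/value pairs and encoding them once with `URI.encode_www_form`. A minimal sketch of why that helps is shown here; `example_parameters` and `example_request_url` are hypothetical stand-ins that take keyword arguments instead of reading instance variables, and no network request is made.

```ruby
require 'uri'

# Hypothetical stand-in for parameters_for_api: same pairs as in the diff,
# but driven by keyword arguments (an assumption made for this sketch).
def example_parameters(all: false, from: nil, to: nil, page: nil)
  parameters = [["fl", "timestamp,original"], ["collapse", "digest"], ["gzip", "false"]]
  parameters.push(["filter", "statuscode:200"]) unless all
  parameters.push(["from", from.to_s]) if from && from != 0
  parameters.push(["to", to.to_s]) if to && to != 0
  parameters.push(["page", page]) if page
  parameters
end

# Build the request URI; encode_www_form percent-encodes every key and value,
# so characters such as "&", "?" or spaces in the target URL cannot leak
# into the surrounding query string.
def example_request_url(url, **options)
  request_url = URI("https://web.archive.org/cdx/search/xd")
  params = [["output", "json"], ["url", url]] + example_parameters(**options)
  request_url.query = URI.encode_www_form(params)
  request_url
end

puts example_request_url("http://example.com/a b?x=1&y=2", from: 2006, page: 1)
```

With plain concatenation, the `&y=2` in the target URL would have been read as a separate query parameter of the request; with encoded pairs it stays part of the `url` value.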

lib/wayback_machine_downloader/tidy_bytes.rb

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ def tidy_bytes(force = false)
 if is_unused || is_restricted
   bytes[i] = tidy_byte(byte)
 elsif is_cont
-  # Not expecting contination byte? Clean up. Otherwise, now expect one less.
+  # Not expecting continuation byte? Clean up. Otherwise, now expect one less.
   conts_expected == 0 ? bytes[i] = tidy_byte(byte) : conts_expected -= 1
 else
   if conts_expected > 0
