Commit cd43632

Check for updates based on modified date for scrapers without release
1 parent 6ee1693 commit cd43632

20 files changed, with 186 additions and 115 deletions

docs/Scraper-Reference.md

Lines changed: 1 addition & 1 deletion
@@ -187,7 +187,7 @@ More information about how filters work is available on the [Filter Reference](.
 
 ## Keeping scrapers up-to-date
 
-In order to keep scrapers up-to-date the `get_latest_version(options, &block)` method should be overridden by all scrapers that define the `self.release` attribute. This method should return the latest version of the documentation that is being scraped. The result of this method is periodically reported in a "Documentation versions report" issue which helps maintainers keep track of outdated documentations.
+In order to keep scrapers up-to-date the `get_latest_version(options, &block)` method should be overridden. If `self.release` is defined, this should return the latest version of the documentation. If `self.release` is not defined, it should return the Epoch time when the documentation was last modified. If the documentation will never change, simply return `1.0.0`. The result of this method is periodically reported in a "Documentation versions report" issue which helps maintainers keep track of outdated documentations.
 
 To make life easier, there are a few utility methods that you can use in `get_latest_version`:
 * `fetch(url, options, &block)`
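
Review note: the three return conventions described in the changed paragraph can be sketched as below. This is only an illustration, not code from the commit. The class names `ReleasedDoc`, `DatedDoc` and `StaticDoc`, the version string and the date are hypothetical, and a real scraper would normally obtain its value through the utility methods instead of hard-coding it.

```ruby
# Hypothetical stand-ins for scrapers; none of these classes exist in DevDocs.

class ReleasedDoc
  # Case 1: the scraper defines a release, so report a version string
  # in the same format as that release.
  def get_latest_version(options, &block)
    block.call '2.4.1' # assumed value, hard-coded for the sketch
  end
end

class DatedDoc
  # Case 2: no release is defined, so report the Epoch timestamp (seconds)
  # of the last modification of the upstream documentation.
  def get_latest_version(options, &block)
    block.call Time.utc(2019, 3, 10).to_i # assumed date
  end
end

class StaticDoc
  # Case 3: the documentation will never change.
  def get_latest_version(options, &block)
    block.call '1.0.0'
  end
end

[ReleasedDoc, DatedDoc, StaticDoc].each do |klass|
  klass.new.get_latest_version({}) do |version|
    puts "#{klass.name}: #{version}"
  end
end
```

The result is handed to the block instead of being returned, which matches the `block.call` style used by every `get_latest_version` in this commit.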

lib/docs/core/doc.rb

Lines changed: 103 additions & 1 deletion
@@ -152,7 +152,6 @@ def store_meta(store)
       end
     end
 
-
     def initialize
       raise NotImplementedError, "#{self.class} is an abstract class and cannot be instantiated." if self.class.abstract
     end
@@ -164,5 +163,108 @@ def build_page(id, &block)
     def build_pages(&block)
       raise NotImplementedError
     end
+
+    def get_scraper_version(opts, &block)
+      if self.class.method_defined?(:options) and !options[:release].nil?
+        block.call options[:release]
+      else
+        # If options[:release] does not exist, we return the Epoch timestamp of when the doc was last modified in DevDocs production
+        fetch_json('https://devdocs.io/docs.json', opts) do |json|
+          items = json.select {|item| item['name'] == self.class.name}
+          items = items.map {|item| item['mtime']}
+          block.call items.max
+        end
+      end
+    end
+
+    # Should return the latest version of this documentation
+    # If options[:release] is defined, it should be in the same format
+    # If options[:release] is not defined, it should return the Epoch timestamp of when the documentation was last updated
+    # If the docs will never change, simply return '1.0.0'
+    def get_latest_version(options, &block)
+      raise NotImplementedError
+    end
+
+    # Returns whether or not this scraper is outdated.
+    #
+    # The default implementation assumes the documentation uses a semver(-like) approach when it comes to versions.
+    # Patch updates are ignored because there are usually little to no documentation changes in bug-fix-only releases.
+    #
+    # Scrapers of documentations that do not use this versioning approach should override this method.
+    #
+    # Examples of the default implementation:
+    # 1 -> 2 = outdated
+    # 1.1 -> 1.2 = outdated
+    # 1.1.1 -> 1.1.2 = not outdated
+    def is_outdated(scraper_version, latest_version)
+      scraper_parts = scraper_version.to_s.split(/\./).map(&:to_i)
+      latest_parts = latest_version.to_s.split(/\./).map(&:to_i)
+
+      # Only check the first two parts, the third part is for patch updates
+      [0, 1].each do |i|
+        break if i >= scraper_parts.length or i >= latest_parts.length
+        return true if latest_parts[i] > scraper_parts[i]
+        return false if latest_parts[i] < scraper_parts[i]
+      end
+
+      false
+    end
+
+    private
+
+    #
+    # Utility methods for get_latest_version
+    #
+
+    def fetch(url, options, &block)
+      headers = {}
+
+      if options.key?(:github_token) and url.start_with?('https://api.github.com/')
+        headers['Authorization'] = "token #{options[:github_token]}"
+      end
+
+      options[:logger].debug("Fetching #{url}")
+
+      Request.run(url, { headers: headers }) do |response|
+        if response.success?
+          block.call response.body
+        else
+          options[:logger].error("Couldn't fetch #{url} (response code #{response.code})")
+          block.call nil
+        end
+      end
+    end
+
+    def fetch_doc(url, options, &block)
+      fetch(url, options) do |body|
+        block.call Nokogiri::HTML.parse(body, nil, 'UTF-8')
+      end
+    end
+
+    def fetch_json(url, options, &block)
+      fetch(url, options) do |body|
+        block.call JSON.parse(body)
+      end
+    end
+
+    def get_npm_version(package, options, &block)
+      fetch_json("https://registry.npmjs.com/#{package}", options) do |json|
+        block.call json['dist-tags']['latest']
+      end
+    end
+
+    def get_latest_github_release(owner, repo, options, &block)
+      fetch_json("https://api.github.com/repos/#{owner}/#{repo}/releases/latest", options, &block)
+    end
+
+    def get_github_tags(owner, repo, options, &block)
+      fetch_json("https://api.github.com/repos/#{owner}/#{repo}/tags", options, &block)
+    end
+
+    def get_github_file_contents(owner, repo, path, options, &block)
+      fetch_json("https://api.github.com/repos/#{owner}/#{repo}/contents/#{path}", options) do |json|
+        block.call(Base64.decode64(json['content']))
+      end
+    end
   end
 end
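
Review note: the `.to_s` calls added to `is_outdated` matter because a scraper without a release now reports an Integer Epoch timestamp. The comparison can be tried in isolation with the sketch below. The function name `outdated?` and the sample timestamps are hypothetical; the body follows the logic shown in the diff, with `or` written as `||` and a string separator in `split`.

```ruby
# Standalone copy of the comparison logic, for experimentation only.
# `outdated?` is a hypothetical name; the commit's method is `is_outdated`.
def outdated?(scraper_version, latest_version)
  # to_s lets Integer timestamps and String versions share one code path.
  scraper_parts = scraper_version.to_s.split('.').map(&:to_i)
  latest_parts = latest_version.to_s.split('.').map(&:to_i)

  # Compare major and minor only; the patch component is ignored.
  [0, 1].each do |i|
    break if i >= scraper_parts.length || i >= latest_parts.length
    return true if latest_parts[i] > scraper_parts[i]
    return false if latest_parts[i] < scraper_parts[i]
  end

  false
end

puts outdated?('1', '2')                      # true
puts outdated?('1.1', '1.2')                  # true
puts outdated?('1.1.1', '1.1.2')              # false
puts outdated?(1_552_176_000, 1_560_000_000)  # true
```

A timestamp has no dots, so it becomes a single-element list and only the first loop iteration runs: any newer modification time counts as outdated.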

lib/docs/core/scraper.rb

Lines changed: 0 additions & 85 deletions
@@ -132,35 +132,6 @@ def options
       end
     end
 
-    def get_latest_version(options, &block)
-      raise NotImplementedError
-    end
-
-    # Returns whether or not this scraper is outdated.
-    #
-    # The default implementation assumes the documentation uses a semver(-like) approach when it comes to versions.
-    # Patch updates are ignored because there are usually little to no documentation changes in bug-fix-only releases.
-    #
-    # Scrapers of documentations that do not use this versioning approach should override this method.
-    #
-    # Examples of the default implementation:
-    # 1 -> 2 = outdated
-    # 1.1 -> 1.2 = outdated
-    # 1.1.1 -> 1.1.2 = not outdated
-    def is_outdated(scraper_version, latest_version)
-      scraper_parts = scraper_version.split(/\./).map(&:to_i)
-      latest_parts = latest_version.split(/\./).map(&:to_i)
-
-      # Only check the first two parts, the third part is for patch updates
-      [0, 1].each do |i|
-        break if i >= scraper_parts.length or i >= latest_parts.length
-        return true if latest_parts[i] > scraper_parts[i]
-        return false if latest_parts[i] < scraper_parts[i]
-      end
-
-      false
-    end
-
     private
 
     def request_one(url)
@@ -231,62 +202,6 @@ def additional_options
       {}
     end
 
-    #
-    # Utility methods for get_latest_version
-    #
-
-    def fetch(url, options, &block)
-      headers = {}
-
-      if options.key?(:github_token) and url.start_with?('https://api.github.com/')
-        headers['Authorization'] = "token #{options[:github_token]}"
-      end
-
-      options[:logger].debug("Fetching #{url}")
-
-      Request.run(url, { headers: headers }) do |response|
-        if response.success?
-          block.call response.body
-        else
-          options[:logger].error("Couldn't fetch #{url} (response code #{response.code})")
-          block.call nil
-        end
-      end
-    end
-
-    def fetch_doc(url, options, &block)
-      fetch(url, options) do |body|
-        block.call Nokogiri::HTML.parse body, nil, 'UTF-8'
-      end
-    end
-
-    def fetch_json(url, options, &block)
-      fetch(url, options) do |body|
-        json = JSON.parse(body)
-        block.call json
-      end
-    end
-
-    def get_npm_version(package, options, &block)
-      fetch_json("https://registry.npmjs.com/#{package}", options) do |json|
-        block.call json['dist-tags']['latest']
-      end
-    end
-
-    def get_latest_github_release(owner, repo, options, &block)
-      fetch_json("https://api.github.com/repos/#{owner}/#{repo}/releases/latest", options, &block)
-    end
-
-    def get_github_tags(owner, repo, options, &block)
-      fetch_json("https://api.github.com/repos/#{owner}/#{repo}/tags", options, &block)
-    end
-
-    def get_github_file_contents(owner, repo, path, options, &block)
-      fetch_json("https://api.github.com/repos/#{owner}/#{repo}/contents/#{path}", options) do |json|
-        block.call(Base64.decode64(json['content']))
-      end
-    end
-
     module FixInternalUrlsBehavior
       def self.included(base)
         base.extend ClassMethods
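
Review note: in the utility methods shown in this diff, `fetch` calls its block with `nil` when the request fails, and `fetch_json` passes that value straight to `JSON.parse`, which raises a `TypeError` for `nil`. The commit does not address this. The sketch below shows one possible guard; `parse_json_or_nil` is a hypothetical helper, not part of DevDocs.

```ruby
require 'json'

# Hypothetical helper: parse a response body, tolerating a failed fetch.
# Returns nil when the body is nil or is not valid JSON.
def parse_json_or_nil(body)
  return nil if body.nil?

  JSON.parse(body)
rescue JSON::ParserError
  nil
end

# Sample payload shaped like the part of the npm registry response
# that get_npm_version reads; the version number is made up.
json = parse_json_or_nil('{"dist-tags": {"latest": "1.2.3"}}')
puts json['dist-tags']['latest']
puts parse_json_or_nil(nil).inspect
```

Callers would still need to decide what to report when the result is `nil`; that choice is left open here.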

lib/docs/scrapers/c.rb

Lines changed: 8 additions & 0 deletions
@@ -26,6 +26,14 @@ class C < FileScraper
       Licensed under the Creative Commons Attribution-ShareAlike Unported License v3.0.
     HTML
 
+    def get_latest_version(options, &block)
+      fetch_doc('https://en.cppreference.com/w/Cppreference:Archives', options) do |doc|
+        link = doc.at_css('a[title^="File:"]')
+        date = link.content.scan(/(\d+)\./)[0][0]
+        block.call DateTime.strptime(date, '%Y%m%d').to_time.to_i
+      end
+    end
+
     private
 
     def file_path_for(*)
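
Review note: the date handling in the new `get_latest_version` can be checked without a network request. The sketch below isolates the regex and the `strptime` call. The helper name `archive_date_to_epoch` and the sample link text `html_book_20190607.zip` are assumptions about what the archive page contains, and the sketch uses `String#[]` with a `nil` guard where the commit uses `scan(...)[0][0]`.

```ruby
require 'date'

# Hypothetical helper: turn archive link text into an Epoch timestamp.
# Expects a run of digits in YYYYMMDD form directly before a dot.
def archive_date_to_epoch(link_text)
  date = link_text[/(\d+)\./, 1] # first capture group, or nil if no match
  return nil if date.nil?

  # strptime without a zone yields midnight UTC for that date.
  DateTime.strptime(date, '%Y%m%d').to_time.to_i
end

epoch = archive_date_to_epoch('html_book_20190607.zip') # assumed sample text
puts epoch
puts Time.at(epoch).utc
```

Because the result is an Integer, it pairs with the `.to_s` conversion that this commit adds to `is_outdated`.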

lib/docs/scrapers/chef.rb

Lines changed: 2 additions & 3 deletions
@@ -49,9 +49,8 @@ class Chef < UrlScraper
     end
 
     def get_latest_version(options, &block)
-      fetch_doc('https://docs-archive.chef.io/', options) do |doc|
-        cell = doc.at_css('.main-archives > tr:nth-child(2) > td:nth-child(2)')
-        block.call cell.content.sub(/Chef Client /, '')
+      fetch_doc('https://downloads.chef.io/chef', options) do |doc|
+        block.call doc.at_css('h1.product-heading > span').content.strip
       end
     end
   end

lib/docs/scrapers/cpp.rb

Lines changed: 9 additions & 0 deletions
@@ -34,6 +34,15 @@ class Cpp < FileScraper
       Licensed under the Creative Commons Attribution-ShareAlike Unported License v3.0.
     HTML
 
+    # Same as get_latest_version in lib/docs/scrapers/c.rb
+    def get_latest_version(options, &block)
+      fetch_doc('https://en.cppreference.com/w/Cppreference:Archives', options) do |doc|
+        link = doc.at_css('a[title^="File:"]')
+        date = link.content.scan(/(\d+)\./)[0][0]
+        block.call DateTime.strptime(date, '%Y%m%d').to_time.to_i
+      end
+    end
+
     private
 
     def file_path_for(*)

lib/docs/scrapers/haskell.rb

Lines changed: 5 additions & 4 deletions
@@ -10,7 +10,7 @@ class Haskell < UrlScraper
 
     html_filters.push 'haskell/entries', 'haskell/clean_html'
 
-    options[:container] = ->(filter) { filter.subpath.start_with?('users_guide') ? '.body' : '#content' }
+    options[:container] = ->(filter) {filter.subpath.start_with?('users_guide') ? '.body' : '#content'}
 
     options[:only_patterns] = [/\Alibraries\//, /\Ausers_guide\//]
     options[:skip_patterns] = [
@@ -70,9 +70,10 @@ class Haskell < UrlScraper
     end
 
     def get_latest_version(options, &block)
-      fetch_doc('https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/', options) do |doc|
-        label = doc.at_css('.related > ul > li:last-child').content
-        block.call label.scan(/([0-9.]+)/)[0][0]
+      fetch_doc('https://downloads.haskell.org/~ghc/latest/docs/html/', options) do |doc|
+        links = doc.css('a').to_a
+        versions = links.map {|link| link['href'].scan(/ghc-([0-9.]+)/)}
+        block.call versions.find {|version| !version.empty?}[0][0]
       end
     end
   end
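
Review note: the new Haskell lookup takes the first link whose `href` contains `ghc-<version>`. As written it would raise `NoMethodError` if a link has no `href` attribute or if no link matches. The sketch below shows the same extraction on plain strings with both cases guarded. The helper name `first_ghc_version` and the sample paths are hypothetical, not taken from the GHC documentation index.

```ruby
# Hypothetical helper: find the first "ghc-<version>" in a list of hrefs.
# Returns nil when nothing matches; nil entries stand for missing hrefs.
def first_ghc_version(hrefs)
  hrefs.compact.each do |href|
    version = href[/ghc-([0-9.]+)/, 1] # first capture group, or nil
    return version if version
  end

  nil
end

# Assumed sample data for the sketch.
hrefs = ['index.html', nil, 'libraries/ghc-8.6.5/index.html']
puts first_ghc_version(hrefs)
puts first_ghc_version(['index.html']).inspect
```

The character class `[0-9.]` also accepts a trailing dot, so a hypothetical path such as `ghc-8.6.5.tar` would yield `8.6.5.` with this pattern.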

lib/docs/scrapers/http.rb

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,8 @@ class Http < Mdn
 
     html_filters.push 'http/clean_html', 'http/entries', 'title'
 
+    options[:mdn_tag] = 'HTTP'
+
     options[:root_title] = 'HTTP'
     options[:title] = ->(filter) { filter.current_url.host == 'tools.ietf.org' ? false : filter.default_title }
     options[:container] = ->(filter) { filter.current_url.host == 'tools.ietf.org' ? '.content' : nil }

lib/docs/scrapers/markdown.rb

Lines changed: 4 additions & 0 deletions
@@ -13,5 +13,9 @@ class Markdown < UrlScraper
       &copy; 2004 John Gruber<br>
       Licensed under the BSD License.
     HTML
+
+    def get_latest_version(options, &block)
+      block.call '1.0.0'
+    end
   end
 end

lib/docs/scrapers/mdn/css.rb

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@ class Css < Mdn
 
     html_filters.push 'css/clean_html', 'css/entries', 'title'
 
+    options[:mdn_tag] = 'CSS'
+
     options[:root_title] = 'CSS'
 
     options[:skip] = %w(/CSS3 /Media/Visual /paged_media /Media/TV /Media/Tactile)
