
Commit 094e5e2

feat: update another scraping variants exercise to be about JS
1 parent 6fb500b commit 094e5e2

2 files changed: +89 -50 lines

sources/academy/webscraping/scraping_basics_javascript2/11_scraping_variants.md

Lines changed: 88 additions & 49 deletions
@@ -348,69 +348,108 @@ Is this the end? Maybe! In the next lesson, we'll use a scraping framework to bu
 
 <Exercises />
 
-### Build a scraper for watching Python jobs
+### Build a scraper for watching npm packages
 
-You're able to build a scraper now, aren't you? Let's build another one! Python's official website has a [job board](https://www.python.org/jobs/). Scrape the job postings that match the following criteria:
+You can build a scraper now, can't you? Let's build another one! From the registry at [npmjs.com](https://www.npmjs.com/), scrape information about npm packages that match the following criteria:
 
-- Tagged as "Database"
-- Posted within the last 60 days
+- Have the keyword "llm" (as in _large language model_)
+- Updated within the last two years ("2 years ago" is okay; "3 years ago" is too old)
 
-For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprint.html#pprint.pp) to print a dictionary containing the following data:
+Print an array of the top 5 packages with the most dependents. Each package should be represented by an object containing the following data:
 
-- Job title
-- Company
-- URL to the job posting
-- Date of posting
+- Name
+- Description
+- URL to the package detail page
+- Number of dependents
+- Number of downloads
 
 Your output should look something like this:
 
-```py
-{'title': 'Senior Full Stack Developer',
- 'company': 'Baserow',
- 'url': 'https://www.python.org/jobs/7705/',
- 'posted_on': datetime.date(2024, 9, 16)}
-{'title': 'Senior Python Engineer',
- 'company': 'Active Prime',
- 'url': 'https://www.python.org/jobs/7699/',
- 'posted_on': datetime.date(2024, 9, 5)}
-...
+```js
+[
+  {
+    name: 'langchain',
+    url: 'https://www.npmjs.com/package/langchain',
+    description: 'Typescript bindings for langchain',
+    dependents: 735,
+    downloads: 3938
+  },
+  {
+    name: '@langchain/core',
+    url: 'https://www.npmjs.com/package/@langchain/core',
+    description: 'Core LangChain.js abstractions and schemas',
+    dependents: 730,
+    downloads: 5994
+  },
+  ...
+]
 ```
 
-You can find everything you need for working with dates and times in Python's [`datetime`](https://docs.python.org/3/library/datetime.html) module, including `date.today()`, `datetime.fromisoformat()`, `datetime.date()`, and `timedelta()`.
-
 <details>
 <summary>Solution</summary>
 
-After inspecting the job board, you'll notice that job postings tagged as "Database" have a dedicated URL. We'll use that as our starting point, which saves us from having to scrape and check the tags manually.
-
-```py
-from pprint import pp
-import httpx
-from bs4 import BeautifulSoup
-from urllib.parse import urljoin
-from datetime import datetime, date, timedelta
-
-today = date.today()
-jobs_url = "https://www.python.org/jobs/type/database/"
-response = httpx.get(jobs_url)
-response.raise_for_status()
-soup = BeautifulSoup(response.text, "html.parser")
-
-for job in soup.select(".list-recent-jobs li"):
-    link = job.select_one(".listing-company-name a")
-
-    time = job.select_one(".listing-posted time")
-    posted_at = datetime.fromisoformat(time["datetime"])
-    posted_on = posted_at.date()
-    posted_ago = today - posted_on
-
-    if posted_ago <= timedelta(days=60):
-        title = link.text.strip()
-        company = list(job.select_one(".listing-company-name").stripped_strings)[-1]
-        url = urljoin(jobs_url, link["href"])
-        pp({"title": title, "company": company, "url": url, "posted_on": posted_on})
+After inspecting the registry, you'll notice that packages with the keyword "llm" have a dedicated URL. Also, changing the sorting dropdown results in a page with its own URL. We'll use that as our starting point, which saves us from having to scrape the whole registry and then filter by keyword or sort by the number of dependents.
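
The query string in that starting URL does the heavy lifting: `q=keywords%3Allm` is `keywords:llm` with the colon percent-encoded, and `sortBy=dependent_count` mirrors the sorting dropdown. As a minimal sketch (leaving out the `page=0` parameter), the same address can be assembled with the standard `URL` API:

```js
// build the npm search URL for keyword "llm", sorted by number of dependents
const listingURL = new URL('https://www.npmjs.com/search');
listingURL.searchParams.set('q', 'keywords:llm');         // serialized as q=keywords%3Allm
listingURL.searchParams.set('sortBy', 'dependent_count'); // the sorting dropdown's value
console.log(listingURL.href);
// https://www.npmjs.com/search?q=keywords%3Allm&sortBy=dependent_count
```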
+
+```js
+import * as cheerio from 'cheerio';
+
+async function download(url) {
+  const response = await fetch(url);
+  if (response.ok) {
+    const html = await response.text();
+    return cheerio.load(html);
+  } else {
+    throw new Error(`HTTP ${response.status}`);
+  }
+}
+
+const listingURL = "https://www.npmjs.com/search?page=0&q=keywords%3Allm&sortBy=dependent_count";
+const $ = await download(listingURL);
+
+const $promises = $("section").map(async (i, element) => {
+  const $card = $(element);
+
+  const details = $card
+    .children()
+    .first()
+    .children()
+    .last()
+    .text()
+    .split("•");
+  const updatedText = details[2].trim();
+  const dependents = parseInt(details[3].replace("dependents", "").trim());
+
+  if (updatedText.includes("years ago")) {
+    const yearsAgo = parseInt(updatedText.replace("years ago", "").trim());
+    if (yearsAgo > 2) {
+      return null;
+    }
+  }
+
+  const $link = $card.find("a").first();
+  const name = $link.text().trim();
+  const url = new URL($link.attr("href"), listingURL).href;
+  const description = $card.find("p").text().trim();
+
+  const downloadsText = $card
+    .children()
+    .last()
+    .text()
+    .replace(",", "")
+    .trim();
+  const downloads = parseInt(downloadsText);
+
+  return { name, url, description, dependents, downloads };
+});
+
+const data = await Promise.all($promises.get());
+console.log(data.filter(item => item !== null).splice(0, 5));
 ```
 
+Since the HTML doesn't contain any descriptive classes, we must rely on its structure. We're using [`.children()`](https://cheerio.js.org/docs/api/classes/Cheerio#children) to carefully navigate the HTML element tree.
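
A minimal sketch of that `.children()` navigation, run against simplified markup invented for illustration (the real npm card nests more elements, which is why the solution reads `details[2]` and `details[3]`):

```js
import * as cheerio from 'cheerio';

// a simplified stand-in for one search result card
const $ = cheerio.load(`
  <section>
    <div>
      <h3>langchain</h3>
      <span>1.0.0 • published 2 years ago • 735 dependents</span>
    </div>
    <div>3,938</div>
  </section>
`);

const $card = $('section');
// first child <div>, then its last child <span>, then the bullet-separated text
const details = $card.children().first().children().last().text().split('•');
console.log(details.map(part => part.trim()));
// [ '1.0.0', 'published 2 years ago', '735 dependents' ]
```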
+
+For items older than 2 years, we return `null` instead of an item. Before printing the results, we use [`.filter()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/filter) to remove these empty values and [`.splice()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/splice) to trim the array down to just 5 items.
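
A minimal sketch of that clean-up step, with sample data invented for illustration:

```js
const data = [{ name: 'a' }, null, { name: 'b' }, null, { name: 'c' }];

// drop the nulls, then keep at most the first 5 items
const top = data.filter(item => item !== null).splice(0, 5);
console.log(top);
// [ { name: 'a' }, { name: 'b' }, { name: 'c' } ]
```

(`.slice(0, 5)` would do the same without mutating the filtered array.)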
+
 </details>
 
 ### Find the shortest CNN article which made it to the Sports homepage

sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md

Lines changed: 1 addition & 1 deletion
@@ -308,7 +308,7 @@ Is this the end? Maybe! In the next lesson, we'll use a scraping framework to bu
 
 ### Build a scraper for watching Python jobs
 
-You're able to build a scraper now, aren't you? Let's build another one! Python's official website has a [job board](https://www.python.org/jobs/). Scrape the job postings that match the following criteria:
+You can build a scraper now, can't you? Let's build another one! Python's official website has a [job board](https://www.python.org/jobs/). Scrape the job postings that match the following criteria:
 
 - Tagged as "Database"
 - Posted within the last 60 days
