"Gumo" (蜘蛛) is Japanese for "spider".
A web crawler (get it?) and scraper that extracts data from a family of nested, dynamic webpages, with added enhancements to assist in knowledge-mining applications. Written in Node.js.
- Crawl hyperlinks present on the pages of any domain and its subdomains.
- Scrape meta-tags and body text from every page.
- Store entire sitemap in a GraphDB (currently supports Neo4J).
- Store page content in ElasticSearch for easy full-text lookup.
- Node.js ≥ 24.0.0 (LTS). Pinned in `package.json` (`engines`) and `.nvmrc` for nvm users.
- Neo4j 4.0+ when using the graph (the constraint syntax requires it).
- Use Node 24+ (e.g. `nvm use` if you have nvm and the repo's `.nvmrc`).
- Install dependencies (uses `package-lock.json` for reproducible installs): `npm install`. Or in CI: `npm ci`.
From code:

```js
// 1: import the module
const gumo = require('gumo')

// 2: instantiate the crawler
let cron = new gumo()

// 3: call the configure method and pass the configuration options
cron.configure({
    'neo4j': {       // replace with your details or remove if not required
        'url'      : 'neo4j://localhost',
        'user'     : 'neo4j',
        'password' : 'gumo123'
    },
    'elastic': {     // replace with your details or remove if not required
        'url'   : 'http://localhost:9200',
        'index' : 'myIndex'
    },
    'crawler': {
        'url': 'https://www.example.com'
    }
});

// 4: start crawling
cron.insert()
```

Note: The config params passed to `cron.configure` above are the default values. See Configuration for all options.
When using Gumo as a dependency (e.g. require('gumo') with no config.json in your project), in-package defaults are used so the module loads; pass your Elasticsearch, Neo4j, and crawler settings via configure() before calling insert().
| Script | Description |
|---|---|
| `npm run dev` | Run the crawler (`node index.js`). |
| `npm run lint` | Run ESLint on the project (see `eslint.config.js`). |
| `npm test` | Run tests (placeholder until tests are added). |
CI runs on GitHub Actions (Node 24, lint + test) on push/PR to main/master.
The behavior of the crawler can be customized by passing a custom configuration object to the configure() method. The following attributes can be configured:
| Attribute ( * - Mandatory ) | Type | Accepted Values | Description | Default Value | Default Behavior |
|---|---|---|---|---|---|
| * crawler.url | string | | Base URL to start scanning from | "" (empty string) | Module is disabled |
| crawler.Cookie | string | | Cookie string to be sent with each request (useful for pages that require auth) | "" (empty string) | Cookies will not be attached to the requests |
| crawler.saveOutputAsHtml | string | "Yes"/"No" | Whether or not to store scraped content as HTML files in the output/html/ directory | "No" | Saving output as HTML files is disabled |
| crawler.saveOutputAsJson | string | "Yes"/"No" | Whether or not to store scraped content as JSON files in the output/json/ directory | "No" | Saving output as JSON files is disabled |
| crawler.maxRequestsPerSecond | int | range: 1 to 5000 | The maximum number of requests to be sent to the target in one second | 5000 | |
| crawler.maxConcurrentRequests | int | range: 1 to 5000 | The maximum number of concurrent connections to be created with the host at any given time | 5000 | |
| crawler.whiteList | Array(string) | | If populated, only these URLs will be traversed | [] (empty array) | All URLs with the same hostname as the "url" attribute will be traversed |
| crawler.blackList | Array(string) | | If populated, these URLs will be ignored | [] (empty array) | |
| crawler.depth | int | range: 1 to 999 | Depth up to which nested hyperlinks will be followed | 3 | |
| * elastic.url | string | | URI of the ElasticSearch instance to connect to | "http://localhost:9200" | |
| * elastic.index | string | | The name of the ElasticSearch index to store results in | "myIndex" | |
| * neo4j.url | string | | The URI of a running Neo4J instance (uses the Bolt driver to connect) | "neo4j://localhost" | |
| * neo4j.user | string | | Neo4J server username | "neo4j" | |
| * neo4j.password | string | | Neo4J server password | "gumo123" | |
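
As an illustration (the values below are examples, not recommendations), a fuller `configure()` call combining several of these crawler attributes might look like this:

```js
// Illustrative only: throttle the crawl, restrict it to one subtree,
// skip login pages, follow links two levels deep, and keep JSON copies
// of every scraped page under output/json/.
cron.configure({
    'crawler': {
        'url': 'https://www.example.com',
        'whiteList': ['https://www.example.com/docs'],
        'blackList': ['https://www.example.com/login'],
        'depth': 2,
        'maxRequestsPerSecond': 100,
        'maxConcurrentRequests': 10,
        'saveOutputAsJson': 'Yes'
    }
});
```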
Page content is stored with the URL and a hash. The index is set via the elastic.index config (or config.json). If the index does not exist, it is created. Gumo uses the official @elastic/elasticsearch client; each page is indexed with id = hash and document = the page object (no separate type field).
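
For reference, here is a minimal lookup sketch using the same official client. It is not part of Gumo; it assumes an 8.x `@elastic/elasticsearch` client and the `elastic.*` values from your config, and the `body` field searched here is illustrative and may differ from Gumo's actual document shape.

```js
// Minimal full-text lookup against the index Gumo writes to.
const { Client } = require('@elastic/elasticsearch')

const client = new Client({ node: 'http://localhost:9200' })   // your elastic.url

async function findPages(term) {
  const result = await client.search({
    index: 'myindex',                 // your elastic.index (Elasticsearch index names must be lowercase)
    query: { match: { body: term } }  // "body" is an assumed field name
  })
  return result.hits.hits.map(hit => hit._source)
}

findPages('example').then(pages => console.log(`${pages.length} pages matched`))
```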
The sitemap of all the traversed pages is stored in a convenient graph. The following structure of nodes and relationships is followed:
- Label: Page
- Properties:
| Property Name | Type | Description |
|---|---|---|
| pid | String | UID generated by the crawler which can be used to uniquely identify a page across ElasticSearch and GraphDB |
| link | String | URL of the current page |
| parent | String | URL of the page from which the current page was accessed (typically only used while creating relationships) |
| title | String | Page title as it appears in the page header |
- Relationships:
| Name | Direction | Condition |
|---|---|---|
| links_to | (a)-[r1:links_to]->(b) | b.link = a.parent |
| links_from | (b)-[r2:links_from]->(a) | b.link = a.parent |
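
As a sketch (not part of Gumo), the stored sitemap can be read back with the official `neo4j-driver`, assuming the `neo4j.*` values from your config and the node/relationship structure described above:

```js
// Read pages and their outgoing links from the graph Gumo builds.
const neo4j = require('neo4j-driver')

const driver = neo4j.driver('neo4j://localhost', neo4j.auth.basic('neo4j', 'gumo123'))

async function listLinks() {
  const session = driver.session()
  try {
    // Follow the links_to relationship between Page nodes.
    const result = await session.run(
      'MATCH (a:Page)-[:links_to]->(b:Page) RETURN a.link AS from, b.link AS to LIMIT 25'
    )
    result.records.forEach(r => console.log(r.get('from'), '->', r.get('to')))
  } finally {
    await session.close()
    await driver.close()
  }
}

listLinks()
```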
See CHANGELOG.md for version history and upgrading notes (e.g. Node 24, Elasticsearch client, Neo4j driver in v2.0.0).
- Make it executable from the CLI
- Allow config parameters to be passed when invoking Gumo
- CI (GitHub Actions, Node 24, lint + test)
- Write more tests
