
Commit d3ae165

herohemantedivm
authored and committed
docs(README): update supplement crawl instructions and content structure
- Added instructions for crawling pages tagged as 'supplement'
- Updated content structure to include multiple content types
- Clarified post-processing details for supplements
1 parent 565bf94 commit d3ae165

1 file changed: +11 -2 lines changed


README.md

Lines changed: 11 additions & 2 deletions
@@ -32,9 +32,15 @@ To crawl the International Hub for SCP Items and save to a custom location:
 scrapy crawl scp_int -o scp_international_items.json
 ```
 
+To crawl pages tagged as `supplement` and save to a custom location:
+
+```bash
+scrapy crawl scp_supplement -o scp_supplement.json
+```
+
 ## Raw Content Structure
 
-There are two types of content downloaded- SCP Items and SCP Tales.
+There are multiple types of content downloaded (Items, Tales, GOI formats, and Supplements).
 
 All content (both SCP Items and Tales) contain the following:
 
@@ -66,6 +72,7 @@ The crawler generates a series of json files containing an array of objects repr
 | scp_titles.json | Main | Title | scp |
 | scp_hubs.json | Main | Hub | scp |
 | scp_tales.json | Main | Tale | scp |
+| scp_supplement.json | Main | Supplement | scp |
 | scp_int.json | International | Item | scp_int |
 | scp_int_titles.json | International | Title | scp_int |
 | scp_int_tales.json | International | Tale | scp_int |
@@ -76,7 +83,9 @@ To regenerate all files run `make fresh`.
 
 ## Post Processed Data
 
-The postproc system takes the Titles, Hubs, Items, and Tales and uses them to generate a comprehensive set of objects. It combines and cross references data and expands on the data already there.
+The postproc system takes Titles, Hubs, Items, Tales, GOI, and Supplements and uses them to generate a comprehensive set of objects. It combines and cross references data and expands on the data already there.
+
+Supplements are written to `data/processed/supplement/` and include additional fields like `parent_scp` and `parent_tale`.
 
 
 ## Content Licensing
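
Reviewer note: the change describes crawler output as JSON files holding an array of objects, and says processed supplements carry `parent_scp` and `parent_tale` fields. The following is a minimal sketch of how a consumer might read such a file and group supplements by their parent. It is not taken from the repository: the helper names `load_items` and `group_by_parent` are hypothetical, the `title` key in the usage example is an assumption, and the exact layout of files under `data/processed/supplement/` is not shown in this diff, so the sketch only assumes a JSON array of objects.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_items(path):
    """Load a crawler output file, assumed to be a JSON array of objects."""
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array in {path}")
    return data


def group_by_parent(supplements):
    """Group supplement records by parent_scp, falling back to parent_tale.

    Records with neither field set are collected under the key None.
    The field names come from the README change; their value types
    (assumed here to be strings) are not confirmed by the diff.
    """
    groups = defaultdict(list)
    for record in supplements:
        parent = record.get("parent_scp") or record.get("parent_tale")
        groups[parent].append(record)
    return dict(groups)
```

A possible use, assuming the crawl command from the README has already produced `scp_supplement.json` in the working directory, would be `group_by_parent(load_items("scp_supplement.json"))`. Raw crawl output may not include the parent fields, since the README attributes them to the post-processed data; in that case every record lands under `None`.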
