
Separate profile parsing from scrapers/profiles.go #81

@SoggyRhino


Currently, scrapers/profiles.go scrapes and parses in one step, which does not match our design.

Here is what I am proposing:

  • Update scrapers/profiles.go to save .html files similar to scrapers/coursebook.go
    • add /professors to outDir
    • save profiles as {first}-{last}.html
  • Create a parser/profiles.go
    • Copy all of the parsing logic here, modified to use goquery instead of chromedp
  • Update flags in main.go
  • Bonus
    • Add resume support to scraper
    • Add a unit test for the parser
  • Side effects
    • parser.go uses utils.GetAllFilesWithExtension, which would cause an issue if the proposed /professors directory is added, so we might consider scraping coursebook into outDir/coursebook/... instead.
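
A rough sketch of the scraper-side pieces above (file naming and resume support), using only the standard library. Everything here is hypothetical: `profileFileName`, `profilePath`, `pending`, and `fileExists` are names made up for illustration, the sample professor names are placeholders, and lowercasing the names and joining multi-word names with `-` are my assumptions rather than anything the repo currently does.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// profileFileName builds "{first}-{last}.html". Lowercasing and joining
// multi-word names with "-" are assumptions, not rules from the repo.
func profileFileName(first, last string) string {
	clean := func(s string) string {
		return strings.Join(strings.Fields(strings.ToLower(s)), "-")
	}
	return clean(first) + "-" + clean(last) + ".html"
}

// profilePath places the file under outDir/professors.
func profilePath(outDir, first, last string) string {
	return filepath.Join(outDir, "professors", profileFileName(first, last))
}

// pending returns the paths that still need scraping. Skipping anything
// already saved is one simple way to get resume behaviour.
func pending(paths []string, exists func(string) bool) []string {
	var todo []string
	for _, p := range paths {
		if !exists(p) {
			todo = append(todo, p)
		}
	}
	return todo
}

// fileExists reports whether path is present on disk.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	// Placeholder names; the real list would come from the scraper.
	paths := []string{
		profilePath("data", "Jane", "Doe"),
		profilePath("data", "John", "Smith"),
	}
	for _, p := range pending(paths, fileExists) {
		fmt.Println("would scrape and save:", p)
	}
}
```

Passing the existence check in as a function keeps `pending` easy to unit test without touching the disk. This sketch does not sanitize characters such as `/` in names, which a real implementation would probably need to handle.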
Sample directory structure:

 outDir (i.e. data)
    ├───coursebook
    │   ├───24f
    │   │   └───cp_acct
    │   │           acct2301.001.24f.html
    │   │           acct2301.002.24f.html
    │   │           ...
    │   │    ...
    └───professors
            first-last.html
            ...
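
To illustrate why the coursebook/ subdirectory helps, here is a small sketch that collects .html files under a chosen root. `htmlFilesUnder` and `sampleOutDir` are hypothetical helpers written for this example; they are a stand-in and not the actual utils.GetAllFilesWithExtension, whose signature I am not assuming. The in-memory tree mirrors the sample layout above.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// htmlFilesUnder walks root inside fsys and returns every .html file path.
// It is a stand-in for illustration, not utils.GetAllFilesWithExtension.
func htmlFilesUnder(fsys fs.FS, root string) ([]string, error) {
	var files []string
	err := fs.WalkDir(fsys, root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() && path.Ext(p) == ".html" {
			files = append(files, p)
		}
		return nil
	})
	return files, err
}

// sampleOutDir mirrors the proposed layout in memory.
func sampleOutDir() fs.FS {
	return fstest.MapFS{
		"coursebook/24f/cp_acct/acct2301.001.24f.html": &fstest.MapFile{},
		"coursebook/24f/cp_acct/acct2301.002.24f.html": &fstest.MapFile{},
		"professors/first-last.html":                   &fstest.MapFile{},
	}
}

func main() {
	// Walking only "coursebook" leaves the professor profiles out.
	files, err := htmlFilesUnder(sampleOutDir(), "coursebook")
	if err != nil {
		panic(err)
	}
	for _, f := range files {
		fmt.Println(f)
	}
}
```

Walking from the top of outDir (root ".") would also return professors/first-last.html, which is the side effect described above; pointing the course parser at coursebook and the profile parser at professors keeps the two sets apart.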

I haven't worked with the profiles scraper very much, but there does not seem to be any technical reason why this should not be possible.

If this is added as a task, I don't mind working on it, but if someone else is interested, feel free to take it.

Labels: L2 (A task suitable for someone who is comfortable helping with implementing features.)