Roadmap

Timo Denk edited this page Apr 15, 2019 · 10 revisions

Data Crawler Path

  1. Write basic data crawler
    • Implement the data-module system
  2. Create Dataset Version 1
  3. Create Dataset Version 2
  4. Create Setup for Inference Website
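
The two dataset versions referenced above each fix a record layout ("domain, screenshot, rank" for v1; "domain, screenshots, link graph structure, rank" for v2). A minimal sketch of how such records could be modeled; the field names and types here are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RecordV1:
    """Dataset v1: one screenshot per domain plus its popularity rank."""
    domain: str
    screenshot: bytes  # raw image bytes of the landing page (assumed format)
    rank: int          # global popularity rank; lower means more popular

@dataclass
class RecordV2:
    """Dataset v2: several screenshots and the internal link graph."""
    domain: str
    screenshots: List[bytes]
    # edges as (source, target) indices into the screenshots list (assumed encoding)
    link_graph: List[Tuple[int, int]] = field(default_factory=list)
    rank: int = 0

r = RecordV1(domain="example.com", screenshot=b"\x89PNG", rank=12345)
```

Keeping v1 a strict subset of v2 (drop the extra screenshots and the graph) would let the prototype CNN and the later graph network share loading code.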

ML Path

  1. Background
    1. Familiarize with CNN-based screenshot processing (how it differs from natural-image processing)
    2. Validate capabilities of DeepMind's graph nets library
    3. Potential loss terms for relative/absolute ranking
    4. Do more research on papers that were published in the field
    5. Read about set-to-vec techniques
  2. Prototype. Development of a pagerank estimator: a single CNN that makes a prediction from a single screenshot. (Requirement: dataset v1 with "domain, screenshot, rank")
    1. Split into train, validation, and test sets
    2. Weighting of webpages (the dataset contains more low-ranked websites, e.g. ranked 100k–200k, than high-ranked ones, e.g. 10k–20k)
    3. Find an architecture that works reasonably well.
    4. The resulting architecture serves as a baseline.
  3. Graph Network. Development of a graph network that takes a domain graph as its input. (Requirement: dataset v2 with "domain, screenshots, link graph structure, rank")
    1. Delve deep into the graph nets library.
    2. Implement a graph network for testing purposes, working on a toy dataset.
    3. Implement a graph network for the actual problem at hand.
  4. UI for an inference website
  5. Paper. Fewer than 10 pages summarizing the ML aspects of the project
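
Steps 2.1 and 2.2 of the prototype (dataset splitting and webpage weighting) can be sketched as below. The split ratios, seed, and 100k rank-bucket size are illustrative assumptions, not the project's actual settings; the weighting is a simple inverse-frequency scheme per rank bucket so that over-represented low-ranked sites do not dominate training:

```python
import random
from collections import Counter

def split_dataset(domains, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Deterministically shuffle domains and split into train/val/test."""
    rng = random.Random(seed)
    shuffled = sorted(domains)  # sort first so the split is reproducible
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def bucket(rank, size=100_000):
    """Assign a rank to a coarse bucket, e.g. [0, 100k), [100k, 200k), ..."""
    return rank // size

def sample_weights(ranks, size=100_000):
    """Inverse-frequency weights so each rank bucket contributes equally."""
    counts = Counter(bucket(r, size) for r in ranks)
    return [1.0 / counts[bucket(r, size)] for r in ranks]
```

With ranks `[50, 150_000, 160_000]` the two sites in the 100k–200k bucket each get weight 0.5 while the lone top-100k site gets 1.0, evening out the buckets' total contribution.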

Late February 2019 Update

  • Week 8 (starting Feb 18th): Implement dataset v2 class and required data structures. Train on v1 with the loss from Burges et al. (2005): Learning to Rank using Gradient Descent.
  • Week 9 (starting Feb 25th): Implement a graph nets library for PyTorch and write unit tests to ensure it works properly. Inspiration for such toy tasks can be found in the original library's demo notebooks.
  • Week 10 (starting Mar 4th): Continue with the work from week 9.
  • Week 11 (starting Mar 11th): Implement the graph network that processes dataset v2 and train on it.
  • Week 12 (starting Mar 18th): Run more trainings and experiment with network modifications, preprocessing methods, etc.
  • Week 13 (starting Mar 25th): Training runs
  • Week 14 (starting Apr 1st): Filter visualization, interpretation
  • Week 15 (starting Apr 8th): Documentation backbone and first +20 pages
  • Week 16 (starting Apr 15th): Documentation +20 pages
  • Week 17 (starting Apr 22nd): Documentation
  • Week 18 (starting Apr 29th): Documentation
  • Week 19 (starting May 6th): Safety margin
  • Week 20 (starting May 13th): Safety margin
  • Submission May 20th
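
Week 8 trains with the pairwise loss from Burges et al. (2005), i.e. RankNet: the model assigns a score s to each page, and for a pair (i, j) the probability that i outranks j is modeled as sigmoid(s_i - s_j) and trained with cross-entropy against a target probability. A minimal sketch (function name and signature are illustrative, not the project's code):

```python
import math

def ranknet_loss(s_i, s_j, p_bar):
    """Pairwise RankNet loss: cross-entropy between the target probability
    p_bar (that item i should outrank item j) and sigmoid(s_i - s_j)."""
    diff = s_i - s_j
    # log(sigmoid(diff)) computed stably for both signs of diff
    if diff >= 0:
        log_p = -math.log1p(math.exp(-diff))
    else:
        log_p = diff - math.log1p(math.exp(diff))
    # log(1 - sigmoid(diff)) = log(sigmoid(diff)) - diff
    log_one_minus_p = log_p - diff
    return -(p_bar * log_p + (1.0 - p_bar) * log_one_minus_p)
```

For equal scores and an uninformative target (p_bar = 0.5) the loss is log 2; it shrinks as the score gap agrees with the target ordering and grows when it contradicts it.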

Training Runs

  1. #1 [normal]: Train with non-discrete ground truth matrix
  2. #3 [time-intense]: pre-train vs. fine-tune vs. end-to-end
  3. #5 [fast]: GN with averaging, GN with max pooling, GN with 1 core block, GN with 3 core blocks (w and w/o weight sharing)
  4. #4 [fast]: best of (2.) with different b values
  5. #3 [fast]: best of (2.) with different edge choices, namely fully-connected, bi-directional, default
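
Run #5 above compares aggregation choices inside the graph network's node update. A toy sketch of the two options, averaging vs. element-wise max over incoming edge messages; plain Python lists stand in for tensors here, purely for illustration:

```python
def aggregate_mean(messages):
    """Element-wise mean of a node's incoming message vectors."""
    n = len(messages)
    return [sum(col) / n for col in zip(*messages)]

def aggregate_max(messages):
    """Element-wise max of a node's incoming message vectors."""
    return [max(col) for col in zip(*messages)]

msgs = [[1.0, 4.0], [3.0, 2.0]]
aggregate_mean(msgs)  # [2.0, 3.0]
aggregate_max(msgs)   # [3.0, 4.0]
```

Mean aggregation keeps the node update insensitive to in-degree, while max highlights the single strongest incoming signal; both are permutation-invariant, which is what makes them valid choices for this comparison.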
