Roadmap

Timo Denk edited this page Apr 15, 2019 · 10 revisions

Data Crawler Path

  1. Write basic data crawler
    • Implement the data-module system
  2. Create Dataset Version 1
  3. Create Dataset Version 2
  4. Create Setup for Inference Website
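
The two dataset versions referenced above each fix a record layout ("domain, screenshot, rank" for v1; "domain, screenshots, link graph structure, rank" for v2). A minimal sketch of how such records could be modeled; the field names and types here are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RecordV1:
    """Dataset v1: one screenshot per domain plus its popularity rank."""
    domain: str
    screenshot: bytes  # raw image bytes of the landing page (assumed format)
    rank: int          # global popularity rank; lower means more popular

@dataclass
class RecordV2:
    """Dataset v2: several screenshots and the internal link graph."""
    domain: str
    screenshots: List[bytes]
    # edges as (source, target) indices into the screenshots list (assumed encoding)
    link_graph: List[Tuple[int, int]] = field(default_factory=list)
    rank: int = 0

r = RecordV1(domain="example.com", screenshot=b"\x89PNG", rank=12345)
```

Keeping v1 a strict subset of v2 (drop the extra screenshots and the graph) would let the prototype CNN and the later graph network share loading code.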

ML Path

  1. Background
    1. Familiarize with CNN-based screenshot processing (how it differs from natural-image processing)
    2. Validate capabilities of DeepMind's graph nets library
    3. Potential loss terms for relative/absolute ranking
    4. Do more research on papers that were published in the field
    5. Read about set-to-vec techniques
  2. Prototype. Development of a pagerank estimator: a single CNN that makes a prediction from a single screenshot. (Requirement: dataset v1 with "domain, screenshot, rank")
    1. Split into train, validation, and test sets
    2. Weighting of webpages (the dataset contains more low-ranked websites, e.g. ranked 100k–200k, than high-ranked ones, e.g. 10k–20k)
    3. Find an architecture that works reasonably well.
    4. The resulting architecture serves as a baseline.
  3. Graph Network. Development of a graph network that takes a domain graph as its input. (Requirement: dataset v2 with "domain, screenshots, link graph structure, rank")
    1. Delve deep into the graph nets library.
    2. Implement a graph network for testing purposes, working on a toy dataset.
    3. Implement a graph network for the actual problem at hand.
  4. UI for an inference website
  5. Paper. Fewer than 10 pages summarizing the ML aspects of the project
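
Steps 2.1 and 2.2 of the prototype (dataset splitting and webpage weighting) can be sketched as below. The split ratios, seed, and 100k rank-bucket size are illustrative assumptions, not the project's actual settings; the weighting is a simple inverse-frequency scheme per rank bucket so that over-represented low-ranked sites do not dominate training:

```python
import random
from collections import Counter

def split_dataset(domains, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Deterministically shuffle domains and split into train/val/test."""
    rng = random.Random(seed)
    shuffled = sorted(domains)  # sort first so the split is reproducible
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def bucket(rank, size=100_000):
    """Assign a rank to a coarse bucket, e.g. [0, 100k), [100k, 200k), ..."""
    return rank // size

def sample_weights(ranks, size=100_000):
    """Inverse-frequency weights so each rank bucket contributes equally."""
    counts = Counter(bucket(r, size) for r in ranks)
    return [1.0 / counts[bucket(r, size)] for r in ranks]
```

With ranks `[50, 150_000, 160_000]` the two sites in the 100k–200k bucket each get weight 0.5 while the lone top-100k site gets 1.0, evening out the buckets' total contribution.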

Late February 2019 Update

  • Week 8 (starting Feb 18th): Implement dataset v2 class and required data structures. Train on v1 with the loss from Burges et al. (2005): Learning to Rank using Gradient Descent.
  • Week 9 (starting Feb 25th): Implement a graph nets library for PyTorch and write unit tests to ensure it works properly. Inspiration for such toy tasks can be found in the original library's demo notebooks.
  • Week 10 (starting Mar 4th): Continue with the work from week 9.
  • Week 11 (starting Mar 11th): Implement the graph network that processes dataset v2 and train on it.
  • Week 12 (starting Mar 18th): Run more trainings and experiment with network modifications, preprocessing methods, etc.
  • Week 13 (starting Mar 25th): Training runs
  • Week 14 (starting Apr 1st): Filter visualization, interpretation
  • Week 15 (starting Apr 8th): Documentation backbone and first +20 pages
  • Week 16 (starting Apr 15th): Documentation +20 pages
  • Week 17 (starting Apr 22nd): Documentation
  • Week 18 (starting Apr 29th): Documentation
  • Week 19 (starting May 6th): Safety margin
  • Week 20 (starting May 13th): Safety margin
  • Submission May 20th
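
Week 8 trains with the pairwise loss from Burges et al. (2005), i.e. RankNet: the model assigns a score s to each page, and for a pair (i, j) the probability that i outranks j is modeled as sigmoid(s_i - s_j) and trained with cross-entropy against a target probability. A minimal sketch (function name and signature are illustrative, not the project's code):

```python
import math

def ranknet_loss(s_i, s_j, p_bar):
    """Pairwise RankNet loss: cross-entropy between the target probability
    p_bar (that item i should outrank item j) and sigmoid(s_i - s_j)."""
    diff = s_i - s_j
    # log(sigmoid(diff)) computed stably for both signs of diff
    if diff >= 0:
        log_p = -math.log1p(math.exp(-diff))
    else:
        log_p = diff - math.log1p(math.exp(diff))
    # log(1 - sigmoid(diff)) = log(sigmoid(diff)) - diff
    log_one_minus_p = log_p - diff
    return -(p_bar * log_p + (1.0 - p_bar) * log_one_minus_p)
```

For equal scores and an uninformative target (p_bar = 0.5) the loss is log 2; it shrinks as the score gap agrees with the target ordering and grows when it contradicts it.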

Training Runs

  1. #1 [normal]: Train with non-discrete ground truth matrix
  2. #3 [time-intense]: pre-train vs. fine-tune vs. end-to-end
  3. #5 [fast]: GN with averaging, GN with max pooling, GN with 1 core block, GN with 3 core blocks (w and w/o weight sharing)
  4. #4 [fast]: best of (2.) with different b values
  5. #3 [fast]: best of (2.) with different edge choices, namely fully-connected, bi-directional, default
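
Run #5 above compares aggregation choices inside the graph network's node update. A toy sketch of the two options, averaging vs. element-wise max over incoming edge messages; plain Python lists stand in for tensors here, purely for illustration:

```python
def aggregate_mean(messages):
    """Element-wise mean of a node's incoming message vectors."""
    n = len(messages)
    return [sum(col) / n for col in zip(*messages)]

def aggregate_max(messages):
    """Element-wise max of a node's incoming message vectors."""
    return [max(col) for col in zip(*messages)]

msgs = [[1.0, 4.0], [3.0, 2.0]]
aggregate_mean(msgs)  # [2.0, 3.0]
aggregate_max(msgs)   # [3.0, 4.0]
```

Mean aggregation keeps the node update insensitive to in-degree, while max highlights the single strongest incoming signal; both are permutation-invariant, which is what makes them valid choices for this comparison.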
