Hi everyone, I hope you're well!
Here's a quick guide to stuff we'll do:
I need every one of you to make an account here: https://cds.climate.copernicus.eu/ and retrieve your API key. Once you have it, create a .cdsapirc file in your home directory that looks like this:
url: https://cds.climate.copernicus.eu/api
key: <your-api-key>
That way you'll be able to query data from the Copernicus Climate Data Store (CDS), not to be confused with the Copernicus Climate Change Service (C3S). To see how to query data from the CDS, check the playground.ipynb notebook, where I'm trying a bunch of stuff.
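Here's a minimal sketch of what a CDS query looks like with the cdsapi package (the dataset name and variable names follow the CDS web form for "ERA5 hourly data on single levels"; the helper function, dates, and region below are just illustrative):

```python
# Sketch: requesting ERA5 10m wind components from the CDS.
# Assumes cdsapi is installed and ~/.cdsapirc is configured as above.

def build_request(year, month, days, area):
    """Build a CDS request dict. `area` is [North, West, South, East] in degrees."""
    return {
        "product_type": "reanalysis",
        "variable": ["10m_u_component_of_wind", "10m_v_component_of_wind"],
        "year": str(year),
        "month": f"{month:02d}",
        "day": [f"{d:02d}" for d in days],
        "time": [f"{h:02d}:00" for h in range(24)],
        "area": area,          # subsetting a region keeps downloads small
        "format": "netcdf",
    }

# Example: one week over (roughly) Western Europe
request = build_request(2020, 1, range(1, 8), area=[55, -10, 35, 5])

# Uncomment to actually download (needs a valid API key and network access):
# import cdsapi
# client = cdsapi.Client()
# client.retrieve("reanalysis-era5-single-levels", request, "era5_wind.nc")
```

Restricting "area" (and optionally "grid") in the request is also the easiest lever for the compute-cost question below.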
Questions:
- How do we keep the compute from getting super expensive? Should we focus on a smaller, specific geography?
Concerns:
- Make sure one method is novel: a non-standard training loss, transfer learning, regularization, or such
- Cite packages
- I'm scared we won't get any meaningful results... Could we still get a full grade? Wind speed prediction might be too chaotic...
TO DO:
- Check the literature, the current state of the art, etc.
- Pre-process data.
- Build 5 models
- Build common evaluation metrics on which to test each model, plus figures: e.g. a 5-subplot comparison of the losses, results, etc.
Maybe as step one, I can build an RNN from start to finish, including data pre- and post-processing, then reuse that pipeline for the other 4 models.
TO DO: Define the task formally: "We predict 10m wind speed using the past 24h of ERA5 at a 1-hour horizon." Maybe do autoregressive rollouts?
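That windowed formulation (24h of history in, 1 hour ahead out) can be sketched with numpy; the function name and toy series here are made up for illustration:

```python
import numpy as np

def make_windows(series, lookback=24, horizon=1):
    """Turn a 1-D hourly wind-speed series into supervised (X, y) pairs:
    X[i] = the `lookback` hours before time t, y[i] = value `horizon` hours ahead."""
    X, y = [], []
    for t in range(lookback, len(series) - horizon + 1):
        X.append(series[t - lookback:t])
        y.append(series[t + horizon - 1])
    return np.array(X), np.array(y)

speeds = np.arange(30, dtype=float)  # toy hourly series, 30 hours
X, y = make_windows(speeds)
# X.shape == (6, 24), y.shape == (6,)
```

For autoregressive rollouts, you'd feed each prediction back in as the newest element of the window and repeat.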
Data: TO DO:
- What we could do later is build a grid around the globe at a higher resolution; it doesn't have to be ALL the data points that ERA5 has.
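One cheap way to avoid pulling every ERA5 grid point is to subsample the lat/lon grid by strided slicing; a toy sketch (the array here is random, but the 721x1440 shape matches ERA5's 0.25-degree global grid):

```python
import numpy as np

# Toy full-resolution field: 721 x 1440 matches ERA5's 0.25-degree global grid.
field = np.random.rand(721, 1440)

# Keep every 4th point in each direction -> roughly a 1-degree grid,
# i.e. ~16x fewer points to store and train on.
coarse = field[::4, ::4]
# coarse.shape == (181, 360)
```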
Eval: TO DO:
- Figure out what evaluation metrics you want per model
- Create a standardized evaluation 'score card' for all models
- Then write some code to show the results side by side
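The three Eval items above can share one structure; here's a sketch of a possible scorecard (the metric choices RMSE/MAE and the function names are my suggestion, not decided yet):

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def scorecard(y_true, predictions):
    """predictions: {model_name: y_pred array}. Returns one row per model,
    one column per metric -- a side-by-side table for all 5 models."""
    rows = {name: {"RMSE": rmse(y_true, p), "MAE": mae(y_true, p)}
            for name, p in predictions.items()}
    return pd.DataFrame(rows).T

y = np.array([1.0, 2.0, 3.0])
card = scorecard(y, {"persistence": np.array([1.0, 1.0, 2.0]),
                     "linear": np.array([1.1, 2.1, 3.1])})
print(card)
```

Every model then only has to produce a y_pred array, and the table (and any subplot loop over its rows) comes for free.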
Models: TO DO:
- Figure out which 5 models we want (RNN? Ensemble?)
- Figure out what the baseline model is
Hence, the next steps are:
- Define the task formally in your notebook: “We predict next-hour 10m wind speed using past 24h ERA5 variables at location X.”
- Build the dataset (single DataFrame used by all models).
- Implement Baseline (Persistence) and evaluate it.
- Implement Linear Regression + Random Forest + XGBoost → plug into shared eval.
- Implement MLP.
- (If time) Implement LSTM or 1D CNN as the 5th model.
- Create the scorecard table + 1–2 plots.
- Mirror this into Overleaf sections (Models + Evaluation).
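The persistence baseline in the steps above is nearly a one-liner; a sketch, assuming an hourly numpy series of wind speeds:

```python
import numpy as np

def persistence_forecast(series, horizon=1):
    """Predict that wind speed `horizon` hours ahead equals the current value."""
    return series[:-horizon]           # aligned with series[horizon:]

speeds = np.array([5.0, 6.0, 5.5, 7.0, 6.5])  # toy hourly data
y_true = speeds[1:]                     # next-hour truth
y_pred = persistence_forecast(speeds)   # [5.0, 6.0, 5.5, 7.0]
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Any learned model that can't beat this number isn't adding value, which is exactly why it's worth evaluating first.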
Latest:
- Make a requirements.txt file listing the dependencies (dask, etc.)