Google Summer of Code 2025: Quartz Solar - New data source in ML model Discussion Thread #26
Replies: 12 comments · 41 replies
-
Hi @AUdaltsova @Sukh-P, Thank you for organising this year's event and putting this document together for the GSoC program. I am interested in your project and would like to know further details.
I look forward to hearing from you. Thank you for your time.
-
Hi @AUdaltsova and @Sukh-P, Thank you for sharing this fascinating project on improving solar energy forecasting models. Building on the discussion so far, I'd like to offer some additional thoughts and questions: Regarding auxiliary data sources:
On the technical approach:
I have experience implementing similar forecasting models with PyTorch, particularly focusing on:
I'd be happy to elaborate on any of these approaches or discuss other aspects of the project that might be helpful. The scaling aspect is particularly interesting - determining how generalizable these models can be across different climate zones. Looking forward to learning more about this project!
-
Hi @AUdaltsova, I am keen on working on this project and thought that the idea was really fascinating. I do have some questions before proceeding: How would we structure an ablation study to measure the independent contribution of dust levels and nearby site information to solar energy forecasting performance? What data preprocessing methods would be suitable for dealing with noisy or missing dust level data before it is used in the model? How do we adapt the current PyTorch-based PVNet model to include more features without substantially raising computational overhead? And what extent of prior experience in data analytics is necessary for effectively contributing to this project?
-
Hi @AUdaltsova and @Sukh-P, Thank you for sharing this exciting project! I've been going through the details, and I have a few questions to better understand the challenges and potential directions:
Since PVNet relies on a substantial amount of data for training, have you considered any techniques to address potential data limitations, such as data augmentation or transfer learning? Would these approaches be relevant in this context?
Given the goal of evaluating the impact of new data sources, what criteria are used to determine whether a new feature is valuable enough to justify its inclusion in the model? Is there a structured process for feature selection, or is it more experimental and iterative?
When evaluating the impact of additional data, do you typically analyze its effects across different time scales (e.g., short-term vs. long-term forecasting), or is the focus primarily on immediate predictive accuracy?
Given that the project involves assessing new data sources, how do you typically balance adding complexity to the model versus maintaining interpretability? Are there any techniques you prioritize to ensure that new inputs provide meaningful contributions without overfitting?
Have you considered domain adaptation techniques to improve model transferability across different geographic regions or weather conditions? Given that solar energy forecasting can be highly location-dependent, what approaches have been most effective in ensuring model generalization?
I'm really interested in this project and would love to contribute meaningfully. Looking forward to your insights!
-
Hi @AUdaltsova, Thank you so much for the thorough and insightful response! I really appreciate you taking the time to explain the project's approach and considerations in such detail. It's great to gain a deeper understanding of the thought process behind data selection and model evaluation. If there's anything I can do to assist or contribute, I'd be more than happy to help. Just let me know!
-
Hey @AUdaltsova and @Sukh-P, This project sounds really exciting! I've worked with ML and data processing in Python, using pandas, scikit-learn, and some PyTorch for my research in solid-state batteries, and I'd love to get more involved. I'm particularly interested in the ablation study aspect: understanding how different auxiliary features influence prediction precision. A few questions to get a better sense of the approach: Are there any existing insights on which auxiliary features (e.g., dust, neighboring site data) have shown promise in preliminary experiments?
-
Hello everyone, my name is Catherine, I am from China, and I am currently a second-year master's student in the Cognitive Science Lab. My research focuses on various deep learning models based on EEG signals, and most of my work is based on PyTorch. I am familiar with Python, Transformers, TCN, and LSTM (I have replicated these models for baseline algorithm comparisons). I have started to explore the PVNet repository to understand the existing framework and familiarize myself with its structure.
-
Hello @AUdaltsova, @Sukh-P, I hope you're well. I'm Anamta Khan, a Data Analytics and ML Engineer with experience in developing ML models using PyTorch and building cloud-based ETL pipelines. I've been exploring your Quartz Solar project and am excited about contributing to this open source initiative. I'm particularly enthusiastic about enhancing solar energy forecasts by integrating auxiliary data sources (e.g., dust levels, neighboring site data) to boost model precision. My background in ML and data analysis, along with familiarity with Python, positions me well for this challenge. I have a few quick questions:
• Could you elaborate on the current data ingestion process with xarray and any challenges you've encountered, particularly with handling auxiliary data sources like dust levels or neighboring site information?
• Could you offer any guidance on the best way to approach mentors or engage further on GitHub?
I'm eager to contribute and would appreciate any advice to ensure my efforts align with the project's goals. Thank you!
-
Hi @AUdaltsova and @Sukh-P, I'm Praneeth, an ML developer with experience in model evaluation, data pipelines and applied projects. So naturally, with its demand to experiment with data sources and perform an ablation study, I'm drawn to this idea. For context, some of my past work includes:
I'm excited that this project applies the kinds of algorithms I love designing to drive climate action. I've reviewed the idea and the codebase and have a few thoughts to discuss:
If there is any lack of clarity in these questions, I am happy to elaborate. Thank you for the FAQs and past discussions: these have let me understand the idea better so that I could focus more on the approach to building a solution in the above questions. Looking forward to your insights and to taking this project ahead!
-
Hi mentors! 👋 I've gone through the PVNet repo, read the ICLR paper, and explored past discussions on this project. I've also submitted my GSoC 2025 proposal focused on integrating auxiliary data sources like dust concentration or neighbouring site features into PVNet, followed by ablation studies to evaluate their impact. I'd appreciate any feedback on the proposal direction or ideas you think I should emphasize more. A few questions I had while exploring: For dust data, do you think MERRA2 AOD is a promising candidate for this project, or should I prioritize feature engineering from satellite sources like CAMS instead? Thanks so much for your time and support 🙏
-
Hi everyone, Thanks for all the helpful information shared so far — it's been really insightful to follow along! @AUdaltsova, your explanation of how new data sources are integrated using separate encoders in PVNet's late-fusion architecture makes things a lot clearer. It's great to know that we can focus on normalization and sensible preprocessing without needing to rework the core architecture too much. @praneethsuresh, your idea of exploring and comparing feature engineering techniques sounds promising. I think this could really help uncover hidden patterns in the data and provide a deeper understanding of how each input contributes to the model's performance. I'm also looking into some feature design possibilities and would be happy to collaborate or exchange findings as we go. Looking forward to experimenting more with the data and learning together! Best regards,
-
Google Summer of Code 2025 applications are now closed. We are currently reviewing all applications. Contributors will be announced on 8 May 2025. Thank you!
-
This space is for you to ask any questions you have about this project. We're here to provide clarifications and help you understand the project's goals, scope, and requirements. Feel free to ask about anything that interests you!
Please note that this discussion is for questions and clarifications, not for formal applications.
Project Description
Adding new data sources usually gives a boost to the predictive power of our models, and finding innovative ways of extracting information from them can be even more beneficial. We would like to explore ways to improve our solar energy forecast with an ablation study of how much data on features like dust or neighbouring sites can contribute to the precision of the model. The project can be scaled depending on time constraints.
Expected Outcome
A comparative analysis of the effects of auxiliary data sources on the forecast quality.
Other Key Information
PyTorch. Data analysis skills would be highly beneficial. All our data uses xarray.
FAQ
Data collection is part of the project, and both the dust and neighbouring-sites data are there as examples, not at all required experiments. Some feature engineering can be done with our existing data (e.g., you probably won't need to download anything to add information about neighbouring sites), but generally exploring options and looking for available (preferably open-source) datasets is part of the work. We will be able to provide the data PVNet already uses, but not the new data (though we might have similar download scripts if it's a provider we've worked with before).
Here's the list of providers we currently use:
Providers we've tried before and will not revisit/out of scope of this project:
Providers we've looked into before that could be revisited here:
ICON DWD (Europe) (Semi-started - loader implemented and some data collected)
ICON DWD (Global) (! Extra challenging, as it uses an icosahedral projection we are currently not set up for, and will require a lot of work to implement)
MERRA2 AOD (! Provided the NWPs we are using are not incorporating aerosol information into their irradiance calculation: we expect it to be less helpful if they do, because that would mean we're already getting this information another way; see the loading sketch below)
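To make the MERRA2 AOD option above more concrete, here is a minimal loading sketch with xarray. The filename pattern and the variable name (TOTEXTTAU, total aerosol extinction AOT at 550 nm from the hourly M2T1NXAER collection) are assumptions about the product you might download; verify them against the actual files from GES DISC.

```python
# Minimal sketch, assuming a MERRA-2 aerosol file from the M2T1NXAER collection.
# The filename and the variable name TOTEXTTAU are assumptions - check them
# against the file you actually download.
import xarray as xr

ds = xr.open_dataset("MERRA2_400.tavg1_2d_aer_Nx.20240101.nc4")

# Subset the aerosol optical depth field to a rough UK bounding box
aod = ds["TOTEXTTAU"].sel(lat=slice(49, 61), lon=slice(-12, 3))

# Interpolate the hourly field to a 30-minute cadence (adjust to whatever
# cadence your PV/NWP samples use)
aod_30min = aod.resample(time="30min").interpolate("linear")
print(aod_30min)
```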
What should I look for when selecting a dataset for PVNet?
The list above is full of NWP providers since we already know that adding new ones very likely will boost PVNet's performance, but don't feel like you can only propose NWPs! This project can also operate on collected data if you choose to focus on feature engineering instead (in fact, both examples given are actually feature engineering and will not require data collection - dust is extracted from satellite data, and neighbouring sites is PV (photovoltaic) data that we have to supply for truths anyway).
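As a toy illustration of the neighbouring-sites feature-engineering idea (not how PVNet actually ingests PV data, which lives in xarray and goes through ocf-data-sampler), here is a sketch that builds one candidate feature: the mean generation of each site's k nearest neighbours. The DataFrame layout and the function name are hypothetical.

```python
# Hedged sketch: mean generation of each site's k nearest neighbours as an
# extra feature. The pandas layout here is hypothetical, not PVNet's schema.
import numpy as np
import pandas as pd

def nearest_neighbour_feature(pv: pd.DataFrame, coords: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """pv: time x site_id generation; coords: indexed by site_id with 'lat'/'lon' columns."""
    out = {}
    for site in pv.columns:
        # Euclidean distance on lat/lon is crude but fine for a sketch
        d = np.hypot(coords["lat"] - coords.loc[site, "lat"],
                     coords["lon"] - coords.loc[site, "lon"])
        neighbours = d.drop(site).nsmallest(k).index
        # mean generation of the k nearest neighbours at every timestamp
        out[site] = pv[neighbours].mean(axis=1)
    return pd.DataFrame(out)

# Usage (hypothetical data): neighbour_mean = nearest_neighbour_feature(pv_df, coords_df, k=5)
```

The interesting ablation questions would then be how such a feature is presented to the model (raw, lagged, distance-weighted) and whether it helps beyond what the satellite and NWP inputs already capture.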
If you want to get a better idea about PVNet, how it operates, and what data could be useful for it, PVNet's README is a good place to start, and I'd recommend the ICLR paper on it, as it explains the data set-up quite well.
Important non-obvious consideration: we operate a live forecasting service, which means we will need the data to be provided live as well! That doesn't necessarily mean in real time (there will always be some delay between live data collection and when it gets to us), but some things are just not available at the rate we need them: e.g., our UK model doesn't get recent PV history, because the PVLive service we use for it only provides it the next day. So if you want to tinker with PV feature engineering, it will probably lean towards extracting coarse patterns that will still be relevant a day or more later (see the adjuster project), rather than an immediate exchange of information. Sometimes we can use different providers for training and live services, for example if the live provider's archive is not very accessible (e.g. CAMS) and we can replace it with a provider with a good archive but no live service (e.g. MERRA2).
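To illustrate the "coarse patterns that stay relevant a day later" point, here is a hedged sketch of PV-history features built only from data at least one day old, so a live service that receives PV data the next day could still compute them. The series layout and names are hypothetical.

```python
# Hedged sketch: PV-history features that only use data at least 24 h old.
# The pandas layout is hypothetical, not PVNet's actual data pipeline.
import pandas as pd

def day_old_pv_features(pv: pd.Series) -> pd.DataFrame:
    """pv: half-hourly generation for a single site, indexed by timestamp."""
    daily = pv.resample("1D").sum()
    return pd.DataFrame({
        # yesterday's total output, already known when forecasting today
        "yesterday_total": daily.shift(1).reindex(pv.index, method="ffill"),
        # trailing 7-day mean of daily output, a very coarse recent-weather signal
        "weekly_mean_daily_total": daily.shift(1).rolling(7).mean().reindex(pv.index, method="ffill"),
    })

# Usage (hypothetical series): features = day_old_pv_features(pv_series)
```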
Ideally, you would use PVNet v4+ (ocf-data-sampler compatible). This will make the results of your study immediately comparable to our previous work. We might resort to a small self-built model for a proof-of-concept approach but we would strongly prefer for PVNet to be used if at all possible.
PVNet is a large model and requires a lot of data to train! We might be able to give you access to our server, so you will not have to train on your local machine.
Generally speaking, architecture experiments are outside of the scope of this project - if this is something you want to do, we have other projects that might appeal to you more, or you can come up with your own! That being said, depending on the data source you might need to fiddle with PVNet's encoders to extract the most out of it, so some architecture work can be included, but PVNet's general framework should stay the same.
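For intuition only (this is not PVNet's encoder API), a new gridded input typically just needs a small encoder that maps it to a fixed-size embedding, which a late-fusion model can concatenate with its existing modality embeddings. A generic PyTorch sketch along those lines:

```python
# Illustrative only - not PVNet's actual code. A small encoder for an extra
# gridded input (e.g. an aerosol field) producing a fixed-size embedding.
import torch
import torch.nn as nn

class AuxGridEncoder(nn.Module):
    def __init__(self, in_channels: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),          # collapse the spatial extent to 4x4
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, embed_dim),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) -> (batch, embed_dim)
        return self.net(x)

# emb = AuxGridEncoder(in_channels=1)(torch.randn(8, 1, 24, 24))  # shape (8, 64)
```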
I'd expect we'll need to train PVNet with and without the new source(s) and compare performance. Our main metric is MAE, which is calculated on all forecast horizons. If you want to take a different approach to the ablation study, this can absolutely be discussed - more interpretability is always great, and we are open to any ideas you might have!
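A hedged sketch of how that with/without comparison could be tabulated, assuming the two runs' predictions have been exported with true values and forecast-horizon labels; the column names and file paths are placeholders, not PVNet's actual output schema.

```python
# Hedged sketch of the ablation comparison: per-horizon MAE for a baseline run
# and a run trained with the extra data source. Column names ("horizon_minutes",
# "y_true", "y_pred") and file paths are placeholders.
import pandas as pd

def mae_by_horizon(df: pd.DataFrame) -> pd.Series:
    """Mean absolute error grouped by forecast horizon."""
    return (df["y_pred"] - df["y_true"]).abs().groupby(df["horizon_minutes"]).mean()

baseline = pd.read_parquet("preds_baseline.parquet")              # hypothetical export
with_new_source = pd.read_parquet("preds_with_new_source.parquet")  # hypothetical export

comparison = pd.DataFrame({
    "baseline_mae": mae_by_horizon(baseline),
    "with_new_source_mae": mae_by_horizon(with_new_source),
})
comparison["improvement_pct"] = 100 * (1 - comparison["with_new_source_mae"] / comparison["baseline_mae"])
print(comparison)
```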
Since the project requires data collection and manipulation, you should be able to compile a dataset, check its value ranges, identify noise/outliers etc. Knowledge of anything more complex is not necessary, as this project should not deal with data augmentation and interpolation - we would prefer data sources that already provide enough data and coverage. We use xarray for our data, so having at least basic familiarity with it would be great.
There are currently no issues associated with this project specifically, but we always welcome contributions! Please read our OS guidelines carefully first.
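For reference, the kind of quick xarray sanity check meant here might look like the following; the path, engine, and variable name are placeholders.

```python
# Minimal xarray QC pass: value ranges, missing data, and crude outlier counts.
# "new_source.zarr" and "aod550" are placeholders for whatever you collect.
import numpy as np
import xarray as xr

ds = xr.open_dataset("new_source.zarr", engine="zarr")
var = ds["aod550"]

print("dims:", dict(var.sizes))
print("value range:", float(var.min()), "to", float(var.max()))
print("NaN fraction:", float(var.isnull().mean()))

# Flag values more than 5 standard deviations from the mean as candidate outliers
z = (var - var.mean()) / var.std()
print("candidate outliers:", int((np.abs(z) > 5).sum()))
```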