Google Summer of Code 2025: Quartz Solar - New data source in ML model Discussion Thread #26
Replies: 12 comments · 41 replies
-
Hi @AUdaltsova @Sukh-P, Thank you for organising this year's event and putting this document together for the GSoC program. I am interested in your project and would like to know further details.
I look forward to hearing from you. Thank you for your time.
-
Hi @AUdaltsova and @Sukh-P, Thank you for sharing this fascinating project on improving solar energy forecasting models. Building on the discussion so far, I'd like to offer some additional thoughts and questions: Regarding auxiliary data sources:
On the technical approach:
I have experience implementing similar forecasting models with PyTorch, particularly focusing on:
I'd be happy to elaborate on any of these approaches or discuss other aspects of the project that might be helpful. The scaling aspect is particularly interesting - determining how generalizable these models can be across different climate zones. Looking forward to learning more about this project!
-
Hi @AUdaltsova, I am keen on working on this project and thought that the idea was really fascinating. I do have some questions before proceeding: How would we structure an ablation study to measure the independent contribution of dust levels and nearby site information to solar energy forecasting performance? What data preprocessing methods would be suitable for dealing with noisy or missing dust level data before it is used in the model? How do we adapt the current PyTorch-based PVNet model to include more features without substantially raising computational overhead? And what extent of prior experience in data analytics is necessary for effectively contributing to this project?
-
Hi @AUdaltsova and @Sukh-P, Thank you for sharing this exciting project! I've been going through the details, and I have a few questions to better understand the challenges and potential directions:
Since PVNet relies on a substantial amount of data for training, have you considered any techniques to address potential data limitations, such as data augmentation or transfer learning? Would these approaches be relevant in this context?
Given the goal of evaluating the impact of new data sources, what criteria are used to determine whether a new feature is valuable enough to justify its inclusion in the model? Is there a structured process for feature selection, or is it more experimental and iterative?
When evaluating the impact of additional data, do you typically analyze its effects across different time scales (e.g., short-term vs. long-term forecasting), or is the focus primarily on immediate predictive accuracy?
Given that the project involves assessing new data sources, how do you typically balance adding complexity to the model versus maintaining interpretability? Are there any techniques you prioritize to ensure that new inputs provide meaningful contributions without overfitting?
Have you considered domain adaptation techniques to improve model transferability across different geographic regions or weather conditions? Given that solar energy forecasting can be highly location-dependent, what approaches have been most effective in ensuring model generalization?
I'm really interested in this project and would love to contribute meaningfully. Looking forward to your insights!
-
Hi @AUdaltsova, Thank you so much for the thorough and insightful response! I really appreciate you taking the time to explain the project's approach and considerations in such detail. It's great to gain a deeper understanding of the thought process behind data selection and model evaluation. If there's anything I can do to assist or contribute, I'd be more than happy to help. Just let me know!
-
Hey @AUdaltsova and @Sukh-P, This project sounds really exciting! I've worked with ML and data processing in Python, using pandas, scikit-learn, and some PyTorch for my research in solid-state batteries, and I'd love to get more involved. I'm particularly interested in the ablation study aspect: understanding how different auxiliary features influence prediction precision. A few questions to get a better sense of the approach: Are there any existing insights on which auxiliary features (e.g., dust, neighboring site data) have shown promise in preliminary experiments?
-
Hello everyone, my name is Catherine, I am from China, and I am currently a second-year master's student in the Cognitive Science Lab. My research focuses on various deep learning models based on EEG signals, and most of my work is based on PyTorch. I am familiar with Python, Transformers, TCN, and LSTM (I have replicated these models for baseline algorithm comparisons). I have started to explore the PVNet repository to understand the existing framework and familiarize myself with its structure.
-
Hello @AUdaltsova, @Sukh-P, I hope you're well. I'm Anamta Khan, a Data Analytics and ML Engineer with experience in developing ML models using PyTorch and building cloud-based ETL pipelines. I've been exploring your Quartz Solar project and am excited about contributing to this open source initiative. I'm particularly enthusiastic about enhancing solar energy forecasts by integrating auxiliary data sources (e.g., dust levels, neighboring site data) to boost model precision. My background in ML and data analysis, along with familiarity with Python, positions me well for this challenge. I have a few quick questions:
• Could you elaborate on the current data ingestion process with xarray and any challenges you've encountered, particularly with handling auxiliary data sources like dust levels or neighboring site information?
• Could you offer any guidance on the best way to approach mentors or engage further on GitHub?
I'm eager to contribute and would appreciate any advice to ensure my efforts align with the project's goals. Thank you!
-
Hi @AUdaltsova and @Sukh-P, I'm Praneeth, an ML developer with experience in model evaluation, data pipelines and applied projects. So naturally, with its demand to experiment with data sources and perform an ablation study, I'm drawn to this idea. For context, some of my past work includes:
I'm excited that this project applies the kinds of algorithms I love designing to drive climate action. I've reviewed the idea and the codebase and have a few thoughts to discuss:
If there is any lack of clarity in these questions, I am happy to elaborate. Thank you for the FAQs and past discussions: these have let me understand the idea better so that I could focus more on the approach to building a solution in the above questions. Looking forward to your insights and to taking this project ahead!
-
Hi mentors! 👋 I've gone through the PVNet repo, read the ICLR paper, and explored past discussions on this project. I've also submitted my GSoC 2025 proposal focused on integrating auxiliary data sources like dust concentration or neighbouring site features into PVNet, followed by ablation studies to evaluate their impact. I'd appreciate any feedback on the proposal direction or ideas you think I should emphasize more. A few questions I had while exploring: For dust data, do you think MERRA2 AOD is a promising candidate for this project, or should I prioritize feature engineering from satellite sources like CAMS instead? Thanks so much for your time and support 🙏
-
Hi everyone, Thanks for all the helpful information shared so far — it's been really insightful to follow along! @AUdaltsova, your explanation of how new data sources are integrated using separate encoders in PVNet's late-fusion architecture makes things a lot clearer. It's great to know that we can focus on normalization and sensible preprocessing without needing to rework the core architecture too much. @praneethsuresh, your idea of exploring and comparing feature engineering techniques sounds promising. I think this could really help uncover hidden patterns in the data and provide a deeper understanding of how each input contributes to the model's performance. I'm also looking into some feature design possibilities and would be happy to collaborate or exchange findings as we go. Looking forward to experimenting more with the data and learning together! Best regards,
-
Google Summer of Code 2025 applications are now closed. We are currently reviewing all applications. Contributors will be announced on 8 May 2025. Thank you!
-
This space is for you to ask any questions you have about this project. We're here to provide clarifications and help you understand the project's goals, scope, and requirements. Feel free to ask about anything that interests you!
Please note that this discussion is for questions and clarifications, not for formal applications.
Project Description
Adding new data sources usually gives a boost to the predictive power of our models, and finding innovative ways of extracting information from them can be even more beneficial. We would like to explore ways to improve our solar energy forecast with an ablation study of how much data on features like dust or neighbouring sites can contribute to the precision of the model. The project can be scaled depending on time constraints.
Expected Outcome
A comparative analysis of the effects of auxiliary data sources on the forecast quality.
Other Key Information
PyTorch. Data analysis skills would be highly beneficial. All our data uses xarray.
FAQ
Data collection is part of the project, and both the dust and neighbouring-sites data are there as examples, not at all required experiments. Some feature engineering can be done with our existing data (e.g., you probably won't need to download anything to add information about neighbouring sites), but generally exploring options and looking for available (preferably open-source) datasets is part of the work. We will be able to provide the data PVNet already uses, but not the new data (though we might have similar download scripts if it's a provider we've worked with before).
Here's the list of providers we currently use:
Providers we've tried before and will not revisit/out of scope of this project:
Providers we've looked into before that could be revisited here:
ICON DWD (Europe) (Semi-started - loader implemented and some data collected)
ICON DWD (Global) (! Extra challenging, as it uses an icosahedral projection we are currently not set up for, and will require a lot of work to implement)
MERRA2 AOD (! Provided the NWPs we are using are not incorporating aerosol information into their irradiance calculation: we expect it to be less helpful if they do, because that would mean we're already getting this information another way; see the loading sketch below)
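To make the MERRA2 AOD option above more concrete, here is a minimal loading sketch with xarray. The filename pattern and the variable name (TOTEXTTAU, total aerosol extinction AOT at 550 nm from the hourly M2T1NXAER collection) are assumptions about the product you might download; verify them against the actual files from GES DISC.

```python
# Minimal sketch, assuming a MERRA-2 aerosol file from the M2T1NXAER collection.
# The filename and the variable name TOTEXTTAU are assumptions - check them
# against the file you actually download.
import xarray as xr

ds = xr.open_dataset("MERRA2_400.tavg1_2d_aer_Nx.20240101.nc4")

# Subset the aerosol optical depth field to a rough UK bounding box
aod = ds["TOTEXTTAU"].sel(lat=slice(49, 61), lon=slice(-12, 3))

# Interpolate the hourly field to a 30-minute cadence (adjust to whatever
# cadence your PV/NWP samples use)
aod_30min = aod.resample(time="30min").interpolate("linear")
print(aod_30min)
```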
What should I look for when selecting a dataset for PVNet?
The list above is full of NWP providers since we already know that adding new ones very likely will boost PVNet's performance, but don't feel like you can only propose NWPs! This project can also operate on collected data if you choose to focus on feature engineering instead (in fact, both examples given are actually feature engineering and will not require data collection - dust is extracted from satellite data, and neighbouring sites is PV (photovoltaic) data that we have to supply for truths anyway).
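As a toy illustration of the neighbouring-sites feature-engineering idea (not how PVNet actually ingests PV data, which lives in xarray and goes through ocf-data-sampler), here is a sketch that builds one candidate feature: the mean generation of each site's k nearest neighbours. The DataFrame layout and the function name are hypothetical.

```python
# Hedged sketch: mean generation of each site's k nearest neighbours as an
# extra feature. The pandas layout here is hypothetical, not PVNet's schema.
import numpy as np
import pandas as pd

def nearest_neighbour_feature(pv: pd.DataFrame, coords: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """pv: time x site_id generation; coords: indexed by site_id with 'lat'/'lon' columns."""
    out = {}
    for site in pv.columns:
        # Euclidean distance on lat/lon is crude but fine for a sketch
        d = np.hypot(coords["lat"] - coords.loc[site, "lat"],
                     coords["lon"] - coords.loc[site, "lon"])
        neighbours = d.drop(site).nsmallest(k).index
        # mean generation of the k nearest neighbours at every timestamp
        out[site] = pv[neighbours].mean(axis=1)
    return pd.DataFrame(out)

# Usage (hypothetical data): neighbour_mean = nearest_neighbour_feature(pv_df, coords_df, k=5)
```

The interesting ablation questions would then be how such a feature is presented to the model (raw, lagged, distance-weighted) and whether it helps beyond what the satellite and NWP inputs already capture.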
If you want to get a better idea about PVNet, how it operates, and what data could be useful for it, PVNet's README is a good place to start, and I'd recommend the ICLR paper on it, as it explains the data set-up quite well.
Important non-obvious consideration: we operate a live forecasting service, which means we will need the data to be provided live as well! That doesn't necessarily mean in real time (there will always be some delay between live data collection and when it gets to us), but some things are just not available at the rate we need them: e.g., our UK model doesn't get recent PV history, because the PVLive service we use for it only provides it the next day. So if you want to tinker with PV feature engineering, it will probably lean towards extracting coarse patterns that will still be relevant a day or more later (see the adjuster project), rather than an immediate exchange of information. Sometimes we can use different providers for training and live services, for example if the live provider's archive is not very accessible (e.g. CAMS) and we can replace it with a provider with a good archive but no live service (e.g. MERRA2).
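To illustrate the "coarse patterns that stay relevant a day later" point, here is a hedged sketch of PV-history features built only from data at least one day old, so a live service that receives PV data the next day could still compute them. The series layout and names are hypothetical.

```python
# Hedged sketch: PV-history features that only use data at least 24 h old.
# The pandas layout is hypothetical, not PVNet's actual data pipeline.
import pandas as pd

def day_old_pv_features(pv: pd.Series) -> pd.DataFrame:
    """pv: half-hourly generation for a single site, indexed by timestamp."""
    daily = pv.resample("1D").sum()
    return pd.DataFrame({
        # yesterday's total output, already known when forecasting today
        "yesterday_total": daily.shift(1).reindex(pv.index, method="ffill"),
        # trailing 7-day mean of daily output, a very coarse recent-weather signal
        "weekly_mean_daily_total": daily.shift(1).rolling(7).mean().reindex(pv.index, method="ffill"),
    })

# Usage (hypothetical series): features = day_old_pv_features(pv_series)
```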
Ideally, you would use PVNet v4+ (ocf-data-sampler compatible). This will make the results of your study immediately comparable to our previous work. We might resort to a small self-built model for a proof-of-concept approach but we would strongly prefer for PVNet to be used if at all possible.
PVNet is a large model and requires a lot of data to train! We might be able to give you access to our server, so you will not have to train on your local machine.
Generally speaking, architecture experiments are outside of the scope of this project - if this is something you want to do, we have other projects that might appeal to you more, or you can come up with your own! That being said, depending on the data source you might need to fiddle with PVNet's encoders to extract the most out of it, so some architecture work can be included, but PVNet's general framework should stay the same.
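For intuition only (this is not PVNet's encoder API), a new gridded input typically just needs a small encoder that maps it to a fixed-size embedding, which a late-fusion model can concatenate with its existing modality embeddings. A generic PyTorch sketch along those lines:

```python
# Illustrative only - not PVNet's actual code. A small encoder for an extra
# gridded input (e.g. an aerosol field) producing a fixed-size embedding.
import torch
import torch.nn as nn

class AuxGridEncoder(nn.Module):
    def __init__(self, in_channels: int, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),          # collapse the spatial extent to 4x4
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, embed_dim),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) -> (batch, embed_dim)
        return self.net(x)

# emb = AuxGridEncoder(in_channels=1)(torch.randn(8, 1, 24, 24))  # shape (8, 64)
```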
I'd expect we'll need to train PVNet with and without the new source(s) and compare performance. Our main metric is MAE, which is calculated on all forecast horizons. If you want to take a different approach to the ablation study, this can absolutely be discussed - more interpretability is always great, and we are open to any ideas you might have!
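A hedged sketch of how that with/without comparison could be tabulated, assuming the two runs' predictions have been exported with true values and forecast-horizon labels; the column names and file paths are placeholders, not PVNet's actual output schema.

```python
# Hedged sketch of the ablation comparison: per-horizon MAE for a baseline run
# and a run trained with the extra data source. Column names ("horizon_minutes",
# "y_true", "y_pred") and file paths are placeholders.
import pandas as pd

def mae_by_horizon(df: pd.DataFrame) -> pd.Series:
    """Mean absolute error grouped by forecast horizon."""
    return (df["y_pred"] - df["y_true"]).abs().groupby(df["horizon_minutes"]).mean()

baseline = pd.read_parquet("preds_baseline.parquet")              # hypothetical export
with_new_source = pd.read_parquet("preds_with_new_source.parquet")  # hypothetical export

comparison = pd.DataFrame({
    "baseline_mae": mae_by_horizon(baseline),
    "with_new_source_mae": mae_by_horizon(with_new_source),
})
comparison["improvement_pct"] = 100 * (1 - comparison["with_new_source_mae"] / comparison["baseline_mae"])
print(comparison)
```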
Since the project requires data collection and manipulation, you should be able to compile a dataset, check its value ranges, identify noise/outliers etc. Knowledge of anything more complex is not necessary, as this project should not deal with data augmentation and interpolation - we would prefer data sources that already provide enough data and coverage. We use xarray for our data, so having at least basic familiarity with it would be great.
There are currently no issues associated with this project specifically, but we always welcome contributions! Please read our OS guidelines carefully first.
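For reference, the kind of quick xarray sanity check meant here might look like the following; the path, engine, and variable name are placeholders.

```python
# Minimal xarray QC pass: value ranges, missing data, and crude outlier counts.
# "new_source.zarr" and "aod550" are placeholders for whatever you collect.
import numpy as np
import xarray as xr

ds = xr.open_dataset("new_source.zarr", engine="zarr")
var = ds["aod550"]

print("dims:", dict(var.sizes))
print("value range:", float(var.min()), "to", float(var.max()))
print("NaN fraction:", float(var.isnull().mean()))

# Flag values more than 5 standard deviations from the mean as candidate outliers
z = (var - var.mean()) / var.std()
print("candidate outliers:", int((np.abs(z) > 5).sum()))
```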