Replies: 2 comments 5 replies
Going for simpler equations with a good fit goes a long way. I added a small parsimony penalty to my search to encourage this (around a tenth of the lowest expected loss of the most complex equation). I found that unless you have specific domain knowledge and justification for complex operators, they tend to overfit; sticking to simpler operators generalises better.
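The selection rule described above can be sketched as complexity-penalized model selection: among candidate equations, pick the one minimizing loss plus a parsimony term times complexity. The candidate list below is purely hypothetical illustration data, and `select_equation` is my own helper name, not part of any library (in PySR this idea corresponds to the `parsimony` keyword of `PySRRegressor`):

```python
# Hedged sketch: choose the candidate equation minimizing
#   loss + parsimony * complexity
# which mimics how a parsimony penalty trades fit against size.

def select_equation(candidates, parsimony):
    """candidates: list of (expression, loss, complexity) tuples."""
    return min(candidates, key=lambda c: c[1] + parsimony * c[2])

# Hypothetical search results: (expression, loss, complexity).
candidates = [
    ("x",                     0.50, 1),  # very simple, poor fit
    ("a*x + b",               0.10, 3),  # simple, good fit
    ("a*sin(x) + b*x**3 + c", 0.08, 9),  # complex, marginally better fit
]

# A parsimony on the order of a tenth of the loss scale favors
# the simpler equation over the marginally better complex one.
best = select_equation(candidates, parsimony=0.01)
print(best[0])  # -> a*x + b
```

With `parsimony=0` the complex equation would win on raw loss alone; the penalty is what pushes selection toward the simpler expression.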
Hi, thanks for the reply. I am now looking at whether we could consider the performance on a validation dataset during training rather than after it. For example, in each iteration/epoch, we could fit the model on the training dataset, then test performance on the validation dataset. If the validation loss decreases, continue to the next iteration/epoch; if not, stop training to avoid overfitting. Is there any way I could implement this idea? Thank you.
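The loop described above is standard early stopping. Here is a minimal sketch, where `train_step` and `val_loss` are hypothetical placeholders for "fit one more iteration" and "score on the validation set" (they are not library functions). If I recall correctly, PySR's `PySRRegressor` supports `warm_start=True`, so in practice you could call `fit` repeatedly in such a loop and score on held-out data between calls, but check the PySR docs before relying on that:

```python
# Hedged sketch of validation-based early stopping: after each
# training iteration, evaluate on a held-out validation set and
# stop once validation loss no longer improves.

def fit_with_early_stopping(train_step, val_loss, max_iters=100, patience=1):
    best = float("inf")
    bad = 0
    for i in range(max_iters):
        train_step()             # fit on the training data
        loss = val_loss()        # evaluate on the validation data
        if loss < best:
            best, bad = loss, 0  # improvement: keep training
        else:
            bad += 1
            if bad >= patience:  # no improvement: stop to avoid overfit
                break
    return best, i + 1

# Toy demo: validation loss improves, then worsens at iteration 4.
losses = iter([0.9, 0.7, 0.6, 0.65, 0.64])
best, iters = fit_with_early_stopping(lambda: None, lambda: next(losses))
print(best, iters)  # -> 0.6 4
```

A `patience` larger than 1 tolerates a few non-improving iterations before stopping, which is often more robust when validation loss is noisy.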
The equations generated by the symbolic regression model perform well on the in-sample dataset, with an R² of about 0.1, but they perform badly on the out-of-sample dataset. Is there any way I could improve the generalization of the model? Currently, I have 2 ideas:

If anyone has any good ideas for improving the model's generalization, please give me some hints. Thank you.
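One generic step that helps diagnose this kind of in-sample/out-of-sample gap is to hold out part of the data and select among candidate equations by out-of-sample error instead of in-sample fit. A minimal sketch, where the candidate models and toy data are hypothetical stand-ins for equations returned by a symbolic regression search:

```python
import random

# Hedged sketch: split the data, then pick the candidate model
# with the lowest error on the held-out portion rather than the
# one that fits the training portion best.

def train_test_split(data, test_frac=0.25, seed=0):
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Toy noise-free data generated by y = 2x, for illustration only.
data = [(x, 2 * x) for x in range(20)]
train, holdout = train_test_split(data)

# Hypothetical candidate equations from a symbolic search.
candidates = {
    "2*x":  lambda x: 2 * x,
    "x**2": lambda x: x ** 2,
}
best = min(candidates, key=lambda name: mse(candidates[name], holdout))
print(best)  # -> 2*x
```

Selecting on held-out error like this penalizes equations that merely memorize the training set, which is exactly the failure mode behind a good in-sample fit and a poor out-of-sample one.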