Conversation

@david-thrower david-thrower commented Apr 3, 2025

Added the optima from the Mar 30 hyperparameter optimization study and replaced the default optimizer with AdamW.

Key Changes (Phishing Detection NLP proof of concept):

  • Added a positional embedding between the tokenizer -> embedding stage and the Cerebros network, and expanded the embedding dimensionality to 23 (see the sketch after this list).
  • Set the other hyperparameters to the best ones we know thus far, per the hyperparameter optimization study done on GCP Vertex AI.
  • Added a baseline standard GPT2 backbone classifier run to compare against the Cerebros model.
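
For illustration, a minimal sketch of this wiring, assuming the keras_nlp PositionEmbedding layer (per the commits below); the vocab size is an illustrative stand-in, and the addition merge reflects one of the variants tried in this PR:

import keras
import keras_nlp

SEQ_LEN = 1024      # sequence length used in the Cerebros runs
VOCAB_SIZE = 50257  # illustrative stand-in; the actual tokenizer vocab may differ
EMBED_DIM = 23      # embedding dimensionality from the study optima

token_ids = keras.Input(shape=(SEQ_LEN,), dtype="int32")

# Token embedding, followed by a learned position embedding over the sequence.
token_emb = keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(token_ids)
pos_emb = keras_nlp.layers.PositionEmbedding(sequence_length=SEQ_LEN)(token_emb)

# Merge by addition (concatenation was also tried); this feeds the Cerebros network.
embedded = token_emb + pos_emb
embedding_model = keras.Model(token_ids, embedded)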

Key Changes (Global):

  • Made AdamW the default optimizer.
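
For reference, a minimal sketch of what this looks like in Keras; the learning_rate and weight_decay shown are the Keras defaults, not the tuned optima:

import keras

# AdamW: Adam with decoupled weight decay, now the default optimizer.
optimizer = keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.004)

# Illustrative model, just to show the optimizer and metric wiring.
model = keras.Sequential([keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer=optimizer,
              loss="binary_crossentropy",
              metrics=[keras.metrics.BinaryAccuracy()])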

Results (Phishing Detection NLP proof of concept):

  • The Cerebros model gets 0.942 val_binary_accuracy in one run (https://github.com/david-thrower/cerebros-core-algorithm-alpha/actions/runs/14231758300/job/39883656818) and 0.957 val_binary_accuracy in the other run (https://github.com/david-thrower/cerebros-core-algorithm-alpha/actions/runs/14229331076/job/39876301001), which approximately matches the 0.959 we got with the same parameters in GCP Vertex AI Hyperparameter Tuning.
  • GPT2 (gpt2_base_en) takes 43.50 minutes to fine-tune one (pre-trained) model for 3 epochs in an 8 CPU environment, reaching a maximum of 0.9428 val_binary_accuracy. We are limited to a sequence length of 96 in this environment with GPT2.
  • Cerebros takes about a minute and change per epoch with a sequence length of 1024 in the same 8 CPU environment, and on average about 25 minutes in total per model to train from a cold start for 15 epochs. We reach 0.942 val_binary_accuracy on the least favorable of the 3 runs we have done with these parameters and 0.96 val_binary_accuracy on the remaining 2, showing comparable accuracy as well.
  • Most likely, the timing for the Cerebros model can be reduced by around another 1/3, as one of the 2 CICD runs observed the best val_binary_accuracy at 6 epochs, the other at 10 epochs, and the run completed on GCP found it at 7 epochs.
  • Currently, this runs for 15 epochs (which takes 25 min). Ideally we would use an early stopping callback; however, a structural issue in Keras prevents us from using one within a multiprocessing context (a sketch of the callback we would want appears after the epoch analysis below).
  • I will need to see, from a larger sample, the mean number of epochs (and standard deviation) at which we got the best val_binary_accuracy, and decide from there what to set the epochs to. I think it is a safe bet we can reduce it to at least 11: with the sample of 3 trials, the mean number of epochs at which val_binary_accuracy maxes out, plus 2 standard deviations, comes out to 11.06, so it is a fairly safe bet that we will usually see the optimal result by the 11th or 12th epoch in future trials with these hyperparameter settings.
  • We may be able to cut epochs further, as the run where we observed the best val_binary_accuracy at 10 epochs may itself be an outlier.
import numpy as np

# Number of epochs at which each of the 3 trials reached its best val_binary_accuracy:
epochs = np.array([6, 7, 10])
print(epochs)
# [ 6  7 10]

# Mean number of epochs at which val_binary_accuracy maxes out
av = epochs.mean()
print(av)
# 7.666666666666667

# Standard deviation of the number of epochs at which val_binary_accuracy maxes out
sd = epochs.std()

# Mean + 2 standard deviations for the number of epochs at which val_binary_accuracy maxes out
print(av + 2 * sd)
# 11.066013009061857

# We should expect an optimal value at or below ~11 epochs around 97% of the time
# if this small sample aligns with the global distribution.

# Note that the run where the optimum was found at 10 epochs may be an outlier, and the
# distribution may be more favorable (only 1.4 st dev above the mean, but aberrant for a
# sample of 3, expected in only ~23% of samples of 3):

residuals = (epochs - av) / sd
print(residuals)
# [-0.98058068 -0.39223227  1.37281295]
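
For context on the early stopping limitation noted above, this is roughly the callback we would use if the Keras multiprocessing issue did not block it; the patience value is an illustrative choice, not a tuned setting:

import keras

# Stop once val_binary_accuracy has not improved for `patience` epochs and
# restore the weights from the best epoch seen.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_binary_accuracy",
    mode="max",
    patience=3,
    restore_best_weights=True,
)
# model.fit(..., epochs=15, callbacks=[early_stop])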

Conclusions:

  • The CICD tests associated with this update support the claim that the NLP Cerebros model scales at linear O(n) time with respect to sequence length, while attention-mechanism transformers scale at O(n^2) time with sequence length. This is evidenced by the performance differences between the Cerebros model and a comparably sized GPT model:
  • Cerebros took 1.6 minutes on average to complete each epoch on the same data set on which GPT2 required 14.5 minutes on average to complete each epoch. This is despite the advantage given to GPT2 in sequence length to make the experiment realistic to complete in this 8 CPU environment:
    • Cerebros processed 10.7 times the sequence length, specifically a sequence length of 1024, whereas the GPT model was limited to 96.
  • In summary, a comparably sized Cerebros model completed each epoch 9x as fast in the same 8 CPU environment, despite processing 10.7 times the sequence length. Further, given a linear change in completion time compared with other Cerebros trials at a sequence length of 750 (https://github.com/david-thrower/cerebros-core-algorithm-alpha/blob/154-benchmark-inference-times---cerebros-model-vs-original-gpt-2/phishing_email_detection_gpt2.py, https://github.com/david-thrower/cerebros-core-algorithm-alpha/actions/runs/14014742901/job/39238988138) and no degradation in accuracy performance, collectively this supports a claim of linear, O(n), timing with respect to sequence length. (A quick sanity check of these ratios appears below.)
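
As a quick sanity check on these ratios, the arithmetic below uses the per-epoch times and sequence lengths from the runs above; everything else is simple arithmetic:

# Figures from the CICD runs reported above.
gpt2_seq_len, gpt2_min_per_epoch = 96, 14.5
cerebros_seq_len, cerebros_min_per_epoch = 1024, 1.6

print(cerebros_seq_len / gpt2_seq_len)              # ~10.7x the sequence length
print(gpt2_min_per_epoch / cerebros_min_per_epoch)  # ~9.1x faster per epoch

# If attention cost scales as O(n**2), scaling GPT2 from 96 to 1024 tokens would
# multiply the per-sequence attention workload by roughly:
print((cerebros_seq_len / gpt2_seq_len) ** 2)       # ~113.8x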

Next Steps:

  • Make a second hyperparameter optimization study using multivariate TPE, which may find a better optimum (a sketch of such a study appears after this list).

  • Optimize the weight decay for AdamW

  • Explore a larger embedding output dimensionality search space in the follow-up hyperparameter optimization study. We may be able to afford to go up to 50, 100+. We are at 30%-40% memory pressure and are completing epochs in under 2 min, so this can probably be expanded considerably before we run into the trade-off between time, memory, and CPU requirements and the contribution to accuracy of higher-dimension embeddings.

  • Add the AdamW weight_decay, or the optimizer itself, to the Cerebros init args.
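
For illustration, a minimal sketch of what the follow-up study could look like using Optuna's multivariate TPE sampler; the objective, parameter names, and search ranges (train_cerebros_model, embedding_dim, weight_decay) are hypothetical stand-ins, not the actual study code:

import optuna

# Hypothetical stand-in for training a Cerebros model with the given
# hyperparameters and returning its best val_binary_accuracy. The synthetic
# score below exists only so this sketch runs end to end.
def train_cerebros_model(embedding_dim, weight_decay):
    return 0.9 - abs(embedding_dim - 64) / 1000.0 - abs(weight_decay - 1e-3)

def objective(trial):
    # A wider embedding dimensionality search space, per the notes above.
    embedding_dim = trial.suggest_int("embedding_dim", 16, 128)
    # Weight decay for AdamW, searched on a log scale.
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-1, log=True)
    return train_cerebros_model(embedding_dim, weight_decay)

# multivariate=True switches TPE from independent per-parameter models to a
# joint model over the hyperparameters, which may find a better optimum.
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(multivariate=True),
)
study.optimize(objective, n_trials=50)
print(study.best_params)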

Temporarily disable time-consuming workflows. Comment out the BERT-based text classification workflow, possibly permanently, as it is obsolete.
Add branch to workflow.
Added a baseline fine tuning of the full GPT2 to compare against Cerebros text classifier.
Amendments to Cerebros model.
Reduce seq length to accelerate job completion.
Up timeout to 300 min.
Correct history indexing error.
Temporary test to fast forward to cerebros model.
Comment out an artifact of the GPT test so this can lint and run.
Fix errors from trying to work too fast ...
Re-corrected the BinaryAccuracy metric to fix an AI-introduced error.
Correct metric to rank by (binary accuracy) ...
Uncomment out GPT test ...
Upped number of trials to 5.
Make seq len 750, fix typo.
Added branch to the workflow...
Added a positional embedding and a LayerNorm to the text embedding.
Missed position embedding in copy and paste ...
Synchronize embedding dim across embeddings.
Corrected import of PositionEmbedding.
Remove layernorm, concat instead of add.
Try addition to merge embeddings without LayerNorm
Restore optimal run with position embedding. Reduce max levels to fit the optimal run and reduce overhead. Test this to see if it works. If successful, add back the commented-out comparison and PR. Then open an issue to optimize the params around this new model. We may need to run this on Katib to optimize the hyperparameters, as the model is fundamentally different from the original and can probably be optimized considerably.
Hard set levels to the known optimum.
Corrected hard set on levels to correct optima.
Restore the best model yet.
Add back the CICD test for image CLS. Prepare for PR.
Comment out workflows that we don't need in dev. Delete permanently disused workflows.
Made AdamW the default optimizer. We need to parameterize this and add an optional hyperparameter for the weight_decay.
Test with default params with AdamW.
Combined best hyperparams from the hyperparameter optimization study with AdamW optimizer.
Add branch to workflow to make it start.
Add back all to be used workflows.
Added back the GPT baseline model for comparison.
Optimize NLP workflow for time's sake.
@david-thrower david-thrower linked an issue Apr 3, 2025 that may be closed by this pull request
@david-thrower david-thrower merged commit f683fb8 into main Apr 12, 2025
1 check passed

Development

Successfully merging this pull request may close these issues.

try-NLP-optima-from-2025-03-30-study-with-adamw
