Conversation

@david-thrower david-thrower commented Apr 3, 2025

Added the optima from the Mar 30 hyperparameter optimization study and replaced the default optimizer with AdamW.

Key Changes (Phishing Detection NLP proof of concept):

  • Added a positional embedding between the tokenizer -> embedding stage and the Cerebros network, and expanded the embedding dimensionality to 23 (see the sketch after this list).
  • Set the other hyperparameters to the best ones we know thus far, per the hyperparameter optimization study done on GCP Vertex AI.
  • Added a baseline standard GPT2 backbone classifier run to compare against the Cerebros model.
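
For illustration, a minimal sketch of this wiring, assuming the keras_nlp PositionEmbedding layer (per the commits below); the vocab size is an illustrative stand-in, and the addition merge reflects one of the variants tried in this PR:

import keras
import keras_nlp

SEQ_LEN = 1024      # sequence length used in the Cerebros runs
VOCAB_SIZE = 50257  # illustrative stand-in; the actual tokenizer vocab may differ
EMBED_DIM = 23      # embedding dimensionality from the study optima

token_ids = keras.Input(shape=(SEQ_LEN,), dtype="int32")

# Token embedding, followed by a learned position embedding over the sequence.
token_emb = keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(token_ids)
pos_emb = keras_nlp.layers.PositionEmbedding(sequence_length=SEQ_LEN)(token_emb)

# Merge by addition (concatenation was also tried); this feeds the Cerebros network.
embedded = token_emb + pos_emb
embedding_model = keras.Model(token_ids, embedded)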

Key Changes (Global):

  • Made AdamW the default optimizer.
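
For reference, a minimal sketch of what this looks like in Keras; the learning_rate and weight_decay shown are the Keras defaults, not the tuned optima:

import keras

# AdamW: Adam with decoupled weight decay, now the default optimizer.
optimizer = keras.optimizers.AdamW(learning_rate=0.001, weight_decay=0.004)

# Illustrative model, just to show the optimizer and metric wiring.
model = keras.Sequential([keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer=optimizer,
              loss="binary_crossentropy",
              metrics=[keras.metrics.BinaryAccuracy()])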

Results (Phishing Detection NLP proof of concept):

  • The Cerebros model gets 0.942 val_binary_accuracy in one run (https://github.com/david-thrower/cerebros-core-algorithm-alpha/actions/runs/14231758300/job/39883656818) and 0.957 val_binary_accuracy in the other run (https://github.com/david-thrower/cerebros-core-algorithm-alpha/actions/runs/14229331076/job/39876301001), which approximately matches the 0.959 we got with the same parameters in GCP Vertex AI Hyperparameter Tuning.
  • GPT2 (gpt2_base_en) takes 43.50 minutes to fine-tune one (pre-trained) model for 3 epochs in an 8 CPU environment, reaching a maximum of 0.9428 val_binary_accuracy. We are limited to a sequence length of 96 in this environment with GPT2.
  • Cerebros takes about a minute and change per epoch with a sequence length of 1024 in the same 8 CPU environment, and on average about 25 minutes in total per model to train from a cold start for 15 epochs. We reach 0.942 val_binary_accuracy on the least favorable of the 3 runs we have done with these parameters and 0.96 val_binary_accuracy on the remaining 2, showing comparable accuracy as well.
  • Most likely, the timing for the Cerebros model can be reduced by around another 1/3, as one of the 2 CICD runs observed the best val_binary_accuracy at 6 epochs, the other at 10 epochs, and the run completed on GCP found it at 7 epochs.
  • Currently, this runs for 15 epochs (which takes 25 min). Ideally we would use an early stopping callback; however, a structural issue in Keras prevents us from using one within a multiprocessing context (a sketch of the callback we would want appears after the epoch analysis below).
  • I will need to see, from a larger sample, the mean number of epochs (and standard deviation) at which we got the best val_binary_accuracy, and decide from there what to set the epochs to. I think it is a safe bet we can reduce it to at least 11: with the sample of 3 trials, the mean number of epochs at which val_binary_accuracy maxes out, plus 2 standard deviations, comes out to 11.06, so it is a fairly safe bet that we will usually see the optimal result by the 11th or 12th epoch in future trials with these hyperparameter settings.
  • We may be able to cut epochs further, as the run where we observed the best val_binary_accuracy at 10 epochs may itself be an outlier.
import numpy as np

# Number of epochs at which each of the 3 trials reached its best val_binary_accuracy:
epochs = np.array([6, 7, 10])
print(epochs)
# [ 6  7 10]

# Mean number of epochs at which val_binary_accuracy maxes out
av = epochs.mean()
print(av)
# 7.666666666666667

# Standard deviation of the number of epochs at which val_binary_accuracy maxes out
sd = epochs.std()

# Mean + 2 standard deviations for the number of epochs at which val_binary_accuracy maxes out
print(av + 2 * sd)
# 11.066013009061857

# We should expect an optimal value at or below ~11 epochs around 97% of the time
# if this small sample aligns with the global distribution.

# Note that the run where the optimum was found at 10 epochs may be an outlier, and the
# distribution may be more favorable (only 1.4 st dev above the mean, but aberrant for a
# sample of 3, expected in only ~23% of samples of 3):

residuals = (epochs - av) / sd
print(residuals)
# [-0.98058068 -0.39223227  1.37281295]
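
For context on the early stopping limitation noted above, this is roughly the callback we would use if the Keras multiprocessing issue did not block it; the patience value is an illustrative choice, not a tuned setting:

import keras

# Stop once val_binary_accuracy has not improved for `patience` epochs and
# restore the weights from the best epoch seen.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_binary_accuracy",
    mode="max",
    patience=3,
    restore_best_weights=True,
)
# model.fit(..., epochs=15, callbacks=[early_stop])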

Conclusions:

  • The CICD tests associated with this update support the claim that the NLP Cerebros model scales at linear O(n) time with respect to sequence length, while attention-mechanism transformers scale at O(n^2) time with sequence length. This is evidenced by the performance differences between the Cerebros model and a comparably sized GPT model:
  • Cerebros took 1.6 minutes on average to complete each epoch on the same data set on which GPT2 required 14.5 minutes on average to complete each epoch. This is despite the advantage given to GPT2 in sequence length to make the experiment realistic to complete in this 8 CPU environment:
    • Cerebros processed 10.7 times the sequence length, specifically a sequence length of 1024, whereas the GPT model was limited to 96.
  • In summary, a comparably sized Cerebros model completed each epoch 9x as fast in the same 8 CPU environment, despite processing 10.7 times the sequence length. Further, given a linear change in completion time compared with other Cerebros trials at a sequence length of 750 (https://github.com/david-thrower/cerebros-core-algorithm-alpha/blob/154-benchmark-inference-times---cerebros-model-vs-original-gpt-2/phishing_email_detection_gpt2.py, https://github.com/david-thrower/cerebros-core-algorithm-alpha/actions/runs/14014742901/job/39238988138) and no degradation in accuracy performance, collectively this supports a claim of linear, O(n), timing with respect to sequence length. (A quick sanity check of these ratios appears below.)
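
As a quick sanity check on these ratios, the arithmetic below uses the per-epoch times and sequence lengths from the runs above; everything else is simple arithmetic:

# Figures from the CICD runs reported above.
gpt2_seq_len, gpt2_min_per_epoch = 96, 14.5
cerebros_seq_len, cerebros_min_per_epoch = 1024, 1.6

print(cerebros_seq_len / gpt2_seq_len)              # ~10.7x the sequence length
print(gpt2_min_per_epoch / cerebros_min_per_epoch)  # ~9.1x faster per epoch

# If attention cost scales as O(n**2), scaling GPT2 from 96 to 1024 tokens would
# multiply the per-sequence attention workload by roughly:
print((cerebros_seq_len / gpt2_seq_len) ** 2)       # ~113.8x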

Next Steps:

  • Make a second hyperparameter optimization study using multivariate TPE, which may find a better optimum (a sketch of such a study appears after this list).

  • Optimize the weight decay for AdamW

  • Explore a larger embedding output dimensionality search space in the follow-up hyperparameter optimization study. We may be able to afford to go up to 50, 100+. We are at 30%-40% memory pressure and are completing epochs in under 2 min, so this can probably be expanded considerably before we run into the trade-off between time, memory, and CPU requirements and the contribution to accuracy of higher-dimension embeddings.

  • Add the AdamW weight_decay, or the optimizer itself, to the Cerebros init args.
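
For illustration, a minimal sketch of what the follow-up study could look like using Optuna's multivariate TPE sampler; the objective, parameter names, and search ranges (train_cerebros_model, embedding_dim, weight_decay) are hypothetical stand-ins, not the actual study code:

import optuna

# Hypothetical stand-in for training a Cerebros model with the given
# hyperparameters and returning its best val_binary_accuracy. The synthetic
# score below exists only so this sketch runs end to end.
def train_cerebros_model(embedding_dim, weight_decay):
    return 0.9 - abs(embedding_dim - 64) / 1000.0 - abs(weight_decay - 1e-3)

def objective(trial):
    # A wider embedding dimensionality search space, per the notes above.
    embedding_dim = trial.suggest_int("embedding_dim", 16, 128)
    # Weight decay for AdamW, searched on a log scale.
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-1, log=True)
    return train_cerebros_model(embedding_dim, weight_decay)

# multivariate=True switches TPE from independent per-parameter models to a
# joint model over the hyperparameters, which may find a better optimum.
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(multivariate=True),
)
study.optimize(objective, n_trials=50)
print(study.best_params)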

Temporarily disable time-consuming workflows. Comment out the BERT-based text classification workflow, possibly permanently, as it is obsolete.
Add branch to workflow.
Added a baseline fine tuning of the full GPT2 to compare against Cerebros text classifier.
Amendments to Cerebros model.
Reduce seq length to accelerate job completion.
Up timeout to 300 min.
Correct history indexing error.
Temporary test to fast forward to cerebros model.
Comment out an artifact of the GPT test so this can lint and run.
Fix errors from trying to work too fast ...
Re-corrected the BinaryAccuracy metric to fix an AI-introduced error.
Correct metric to rank by (binary accuracy) ...
Uncomment out GPT test ...
Upped number of trials to 5.
Make seq len 750, fix typo.
Added branch to the workflow...
Added a positional embedding and a LayerNorm to the text embedding.
Missed position embedding in copy and paste ...
Synchronize embedding dim across embeddings.
Corrected import of PositionEmbedding.
Remove layernorm, concat instead of add.
Try addition to merge embeddings without LayerNorm
Restore optimal run with position embedding. Reduce max levels to fit the optimal run and reduce overhead. Test this to see if it works. If successful, add back the commented-out comparison and PR. Then open an issue to optimize the params around this new model. We may need to run this on Katib to optimize the hyperparameters, as the model is fundamentally different from the original and can probably be optimized considerably.
Hard set levels to the known optimum.
Corrected hard set on levels to correct optima.
Restore the best model yet.
Add back the CICD test for image CLS. Prepare for PR.
Comment out workflows that we don't need in dev. Delete permanently disused workflows.
Made AdamW the default optimizer. We need to parameterize this and add an optional hyperparameter for the weight_decay.
Test with default params with AdamW.
Combined best hyperparams from the hyperparameter optimization study with AdamW optimizer.
Add branch to workflow to make it start.
Add back all to be used workflows.
Added back the GPT baseline model for comparison.
Optimize NLP workflow for time's sake.
@david-thrower david-thrower linked an issue Apr 3, 2025 that may be closed by this pull request
@david-thrower david-thrower merged commit f683fb8 into main Apr 12, 2025
1 check passed

Development

Successfully merging this pull request may close these issues.

try-NLP-optima-from-2025-03-30-study-with-adamw
