
Commit 8fdd2a6

Cydral, penguinpee, and davisking authored
Add a first example for the new Dlib layers to build a transform-type network (#3041)
* Add a first example of how to use the new Dlib layers to build a Transformer-type network.

* Replace deprecated pkgutil.find_loader() (#3043)

  That method was deprecated in Python 3.12 and will be removed in Python 3.14. Replace it with a direct call to `importlib.util.find_spec()`, which `pkgutil.find_loader()` was wrapping.

* Simplify the code a little

---------

Co-authored-by: Sandro <[email protected]>
Co-authored-by: Davis King <[email protected]>
1 parent d6706a5 commit 8fdd2a6

4 files changed: +1232 -0 lines changed


examples/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -146,6 +146,7 @@ add_example(dnn_metric_learning_on_images_ex)
 add_gui_example(dnn_dcgan_train_ex)
 add_gui_example(dnn_yolo_train_ex)
 add_gui_example(dnn_self_supervised_learning_ex)
+add_example(slm_basic_train_ex)
 add_gui_example(3d_point_cloud_ex)
 add_example(bayes_net_ex)
 add_example(bayes_net_from_disk_ex)

examples/slm_basic_train_ex.cpp

Lines changed: 352 additions & 0 deletions
@@ -0,0 +1,352 @@
/*
    This program demonstrates a minimal example of a Very Small Language Model (VSLM)
    using dlib's deep learning tools. It includes two modes:

    1) --train   : Train a small Transformer-based language model on a character-based
                   corpus extracted from "slm_data.h" (named shakespeare_text).

    2) --generate: Generate new text from a trained model, given an initial prompt
                   extracted from "slm_data.h" (named shakespeare_prompt).

    The "slm_defs.h" header is expected to provide a comprehensive Transformer
    definition with the following key elements:
      - A configurable transformer_config
      - The use of classification_head to output a single token
      - The network_type<true> or network_type<false> for training vs. inference
      - The typical dlib constructs (input<matrix<int>>, etc.)

    Character-level tokenization is used here: each character is directly mapped
    to an integer token. The model attempts to learn the sequence of characters in
    shakespeare_text. You can then ask the model to generate new text from a short
    prompt.

    This model is intentionally kept small (few neurons/parameters) to ensure
    simplicity and efficiency. As a result, it may not generalize well to unseen
    patterns or concepts. However, it effectively illustrates the principle of
    attention and the ability to perfectly memorize and reproduce sequences from
    the training data. This makes it a useful educational tool for understanding
    the mechanics of Transformer models, even if it lacks the capacity for
    sophisticated language understanding.
*/

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <cmath>
#include <numeric>   // for std::iota (used below)
#include <random>
#include <dlib/data_io.h>
#include <dlib/cmd_line_parser.h>
#include <dlib/misc_api.h>

// Include Transformer definitions
#include "slm_defs.h"

// This header "slm_data.h" is assumed to contain:
//   const std::string shakespeare_text;
//   const std::string shakespeare_prompt;
#include "slm_data.h"

// ----------------------------------------------------------------------------------------

// We treat each character as a token ID in [0..255].
const int MAX_TOKEN_ID = 255;
const int PAD_TOKEN = 256; // an extra "pad" token if needed

// For simplicity, the whole text is tokenized as one continuous stream of characters.
std::vector<int> char_based_tokenize(const std::string& text)
{
    std::vector<int> tokens;
    tokens.reserve(text.size());
    for (const unsigned char c : text) // unsigned char keeps bytes >= 128 non-negative
    {
        tokens.push_back(std::min<int>(c, MAX_TOKEN_ID));
    }
    return tokens;
}
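
// For example, char_based_tokenize("To be") yields the ASCII token
// sequence {84, 111, 32, 98, 101}.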

// Function to shuffle samples and labels in sync
void shuffle_samples_and_labels(std::vector<dlib::matrix<int, 0, 1>>& samples, std::vector<unsigned long>& labels) {
    std::vector<size_t> indices(samples.size());
    std::iota(indices.begin(), indices.end(), 0); // Fill with 0, 1, 2, ..., N-1
    std::shuffle(indices.begin(), indices.end(), std::default_random_engine{});

    // Create temporary vectors to hold shuffled data
    std::vector<dlib::matrix<int, 0, 1>> shuffled_samples(samples.size());
    std::vector<unsigned long> shuffled_labels(labels.size());

    // Apply the shuffle
    for (size_t i = 0; i < indices.size(); ++i)
    {
        shuffled_samples[i] = samples[indices[i]];
        shuffled_labels[i] = labels[indices[i]];
    }

    // Replace the original data with shuffled data
    samples = std::move(shuffled_samples);
    labels = std::move(shuffled_labels);
}
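
// Note that shuffling through a single index permutation keeps each (sample, label)
// pair aligned: e.g. if indices = {2, 0, 1}, the pair previously at position 2
// moves to position 0, the pair at position 0 moves to position 1, and so on.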

// ----------------------------------------------------------------------------------------

int main(int argc, char** argv)
{
    try
    {
        dlib::command_line_parser parser;
        parser.add_option("train", "Train a small transformer on the built-in Shakespeare text");
        parser.add_option("generate", "Generate text from a previously trained model (needs shakespeare_prompt)");
        parser.add_option("learning-rate", "Set the learning rate for training (default: 1e-4)", 1);
        parser.add_option("batch-size", "Set the mini-batch size for training (default: 64)", 1);
        parser.add_option("generation-length", "Set the length of generated text (default: 400)", 1);
        parser.add_option("alpha", "Set the initial learning rate for Adam optimizer (default: 0.004)", 1);
        parser.add_option("beta1", "Set the decay rate for the first moment estimate (default: 0.9)", 1);
        parser.add_option("beta2", "Set the decay rate for the second moment estimate (default: 0.999)", 1);
        parser.add_option("max-samples", "Set the maximum number of training samples (default: 50000)", 1);
        parser.add_option("shuffle", "Shuffle training sequences and labels before training (default: false)");
        parser.parse(argc, argv);

        if (parser.number_of_arguments() == 0 && !parser.option("train") && !parser.option("generate"))
        {
            parser.print_options();
            return 0;
        }

        // Default values
        const double learning_rate = get_option(parser, "learning-rate", 1e-4);
        const long batch_size = get_option(parser, "batch-size", 64);
        const int generation_length = get_option(parser, "generation-length", 400);
        const double alpha = get_option(parser, "alpha", 0.004);   // Initial learning rate for Adam
        const double beta1 = get_option(parser, "beta1", 0.9);     // Decay rate for the first moment estimate
        const double beta2 = get_option(parser, "beta2", 0.999);   // Decay rate for the second moment estimate
        const size_t max_samples = get_option(parser, "max-samples", 50000); // Default maximum number of training samples

        // We define a minimal config for demonstration
        const long vocab_size = 257;   // 0..255 for chars + 1 pad token
        const long num_layers = 3;
        const long num_heads = 4;
        const long embedding_dim = 64;
        const long max_seq_len = 80;   // a small sequence length for the example
        const bool use_squeezing = false;

        using my_transformer_cfg = transformer::transformer_config<
            vocab_size,
            num_layers,
            num_heads,
            embedding_dim,
            max_seq_len,
            use_squeezing,
            dlib::gelu,
            dlib::dropout_10
        >;

        // For GPU usage (if any), set gpus = {0} for a single GPU, etc.
        std::vector<int> gpus{ 0 };

        // The model file to store or load
        const std::string model_file = "shakespeare_lm_char_model.dat";

        // ----------------------------------------------------------------------------------------
        // Train mode
        // ----------------------------------------------------------------------------------------
        if (parser.option("train"))
        {
            std::cout << "=== TRAIN MODE ===\n";

            // 1) Prepare training data (simple approach)
            //    We will store characters from shakespeare_text into a vector
            //    and then produce training samples of length (max_seq_len+1),
            //    where the last token is the label to predict from the preceding max_seq_len.
            auto full_tokens = char_based_tokenize(shakespeare_text);
            if (full_tokens.empty())
            {
                std::cerr << "ERROR: The Shakespeare text is empty. Please provide a valid training text.\n";
                return 0;
            }

            // Calculate the maximum number of sequences
            size_t max_sequences = (full_tokens.size() > (size_t)max_seq_len + 1)
                ? (full_tokens.size() - ((size_t)max_seq_len + 1))
                : 0;

            // Display the size of the training text and the number of sequences
            std::cout << "Training text size: " << full_tokens.size() << " characters\n";
            std::cout << "Maximum number of sequences: " << max_sequences << "\n";

            // Check if the text is too short
            if (max_sequences == 0)
            {
                std::cerr << "ERROR: The Shakespeare text is too short for training. It must contain at least "
                          << (max_seq_len + 1) << " characters.\n";
                return 0;
            }

            std::vector<dlib::matrix<int, 0, 1>> samples;
            std::vector<unsigned long> labels;

            // Let's create a training set of about (N) samples from the text.
            // Each sample: [x0, x1, ..., x_(max_seq_len-1)] -> y
            // We'll store them in "samples" and "labels".
            const size_t N = (max_sequences < max_samples) ? max_sequences : max_samples;
            for (size_t start = 0; start < N; ++start)
            {
                dlib::matrix<int, 0, 1> seq(max_seq_len, 1);
                for (long t = 0; t < max_seq_len; ++t)
                    seq(t, 0) = full_tokens[start + t];
                samples.push_back(seq);
                labels.push_back(full_tokens[start + max_seq_len]);
            }
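
            // Illustration: with max_seq_len = 4 and the text "HELLO!", the first
            // window would be ['H','E','L','L'] with label 'O'; each subsequent
            // sample slides this window one character to the right.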

            // Shuffle samples and labels if the --shuffle option is enabled
            if (parser.option("shuffle"))
            {
                std::cout << "Shuffling training sequences and labels...\n";
                shuffle_samples_and_labels(samples, labels);
            }

            // 3) Construct the network in training mode
            using net_type = my_transformer_cfg::network_type<true>;
            net_type net;
            if (dlib::file_exists(model_file))
                dlib::deserialize(model_file) >> net;

            // 4) Create dnn_trainer
            dlib::dnn_trainer<net_type, dlib::adam> trainer(net, dlib::adam(alpha, beta1, beta2), gpus);
            trainer.set_learning_rate(learning_rate);
            trainer.set_min_learning_rate(1e-6);
            trainer.set_mini_batch_size(batch_size);
            trainer.set_iterations_without_progress_threshold(15000);
            trainer.set_max_num_epochs(400);
            trainer.be_verbose();
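
            // Note: dlib's dnn_trainer shrinks the learning rate whenever no progress
            // is seen for the threshold number of iterations (15000 here), and training
            // stops once the rate falls below the minimum (1e-6) or 400 epochs elapse.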

            // 5) Train
            trainer.train(samples, labels);

            // 6) Evaluate quickly on the training set
            auto predicted = net(samples);
            size_t correct = 0;
            for (size_t i = 0; i < labels.size(); ++i)
                if (predicted[i] == labels[i])
                    correct++;
            double accuracy = (double)correct / labels.size();
            std::cout << "Training accuracy (on this sample set): " << accuracy << "\n";

            // 7) Save the model
            net.clean();
            dlib::serialize(model_file) << net;
            std::cout << "Model saved to " << model_file << "\n";
        }

        // ----------------------------------------------------------------------------------------
        // Generate mode
        // ----------------------------------------------------------------------------------------
        if (parser.option("generate"))
        {
            std::cout << "=== GENERATE MODE ===\n";
            // 1) Load the trained model
            using net_infer = my_transformer_cfg::network_type<false>;
            net_infer net;
            if (dlib::file_exists(model_file))
            {
                dlib::deserialize(model_file) >> net;
                std::cout << "Loaded model from " << model_file << "\n";
            }
            else
            {
                std::cerr << "Error: model file not found. Please run --train first.\n";
                return 0;
            }
            std::cout << my_transformer_cfg::model_info::describe() << std::endl;
            std::cout << "Model parameters: " << count_parameters(net) << std::endl << std::endl;

            // 2) Get the prompt from the included slm_data.h
            std::string prompt_text = shakespeare_prompt;
            if (prompt_text.empty())
            {
                std::cerr << "No prompt found in slm_data.h.\n";
                return 0;
            }
            // If the prompt is longer than max_seq_len, keep only the first window
            if (prompt_text.size() > (size_t)max_seq_len)
                prompt_text.erase(prompt_text.begin() + max_seq_len, prompt_text.end());

            // Convert prompt to a token sequence
            const auto prompt_tokens = char_based_tokenize(prompt_text);

            // Put into a dlib matrix
            dlib::matrix<int, 0, 1> input_seq(max_seq_len, 1);
            // Fill with pad if the prompt is shorter than max_seq_len
            for (long i = 0; i < max_seq_len; ++i)
            {
                if ((size_t)i < prompt_tokens.size())
                    input_seq(i, 0) = prompt_tokens[i];
                else
                    input_seq(i, 0) = PAD_TOKEN;
            }

            std::cout << "\nInitial prompt:\n" << prompt_text << " (...)\n\n\nGenerated text:\n" << prompt_text;

            // 3) Generate new text
            //    We'll predict one character at a time, then shift the window
            for (int i = 0; i < generation_length; ++i)
            {
                const int next_char = net(input_seq); // single inference

                // Print the generated character
                std::cout << static_cast<char>(std::min(next_char, MAX_TOKEN_ID)) << std::flush;

                // Shift the window left by 1 (j avoids shadowing the outer loop variable)
                for (long j = 0; j < max_seq_len - 1; ++j)
                    input_seq(j, 0) = input_seq(j + 1, 0);
                input_seq(max_seq_len - 1, 0) = std::min(next_char, MAX_TOKEN_ID);
            }
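
            // The input window acts as a FIFO over the last max_seq_len characters:
            // after predicting token 81 from tokens [1..80], the window becomes
            // tokens [2..81], so each new character is generated autoregressively.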

            std::cout << "\n\n(end of generation)\n";
        }

        return 0;
    }
    catch (std::exception& e)
    {
        std::cerr << "Exception thrown: " << e.what() << std::endl;
        return 1;
    }
}

/*
 * This program demonstrates the training of a language model on about 15k sequences.
 * The training process produces a data file of approximately 32MB on disk.
 *
 * - Transformer model configuration:
 *   + vocabulary size: 257
 *   + layers: 3
 *   + attention heads: 4
 *   + embedding dimension: 64
 *   + max sequence length: 80
 * - Number of parameters: 8,247,496
 *
 * The training can be done using the following command line:
 * > ./slm_basic_train_ex --train --shuffle
 *
 * After this phase, the model achieves perfect prediction accuracy (i.e. acc = 1).
 * The generation option produces text that is very close to the original training data,
 * as illustrated by the example below:
 * > Generated text:
 * > QUEEN ELIZABETH:
 * > But thou didst kill my children.
 * >
 * > KING RICHARD III:
 * > But in your daughter's womb I bury them:
 * > Where in that nest of spicery they shall breed
 * > Selves of themselves, to your recomforture.
 * >
 * > QUEEN ELIZABETH:
 * > Shall I go win my daughter to thy will?
 * >
 * > KING RICHARD III:
 * > And be a happy mother by the deed.
 * >
 * > QUEEN ELIZABETH:
 * > I go. Write to me very shortly.
 * > And you shall understand from me her mind.
 */
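
/*
 * To sample from the trained model afterwards, generation mode can be run with
 * the options defined above, for instance:
 * > ./slm_basic_train_ex --generate
 * or, for a longer continuation:
 * > ./slm_basic_train_ex --generate --generation-length 1000
 */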
