Building A Transformer AI From Scratch - Part 4 Models

Published on 24 January 2026 at 13:42

“Where is the courage your leadership requires?” -Starscream 



Introduction

 

Still searching for a transformer quote that isn't about the Transformers films. I have even tried adding "AI" so the search becomes "Transformer AI Quote". Google still gives me these gems.

 

I am trying to build an end-to-end implementation of the Transformer AI that underpins all the chat bots. I am writing it in C++, using vibe coding to make up for the gaps in my knowledge, then testing said code like the devil was chasing me, because of the bugs.

 

The code I wrote is in the appendix. If you're trying to get your head around AI, it's there for you to read.



Implementing the Model

 

What I did this week was take what I already had, a tested transformer block, and implement it as layers inside a Transformer Model.

 

The transformer model handles the inputs from the text, cycles through the transformer blocks and manages learning. The sense you get when building it is that it is not really about language or generation specifically, but about learning sequence-to-sequence modelling, and you can reasonably freeze-frame that into clear steps.

 

I would split this into four areas:

 

Abstraction: The transformer model becomes a single object that you can make requests of and it does the "stuff". You can call generate with the encoding of the text you want to send and it should generate the next token. Repeatedly add the new token and cycle through the process and you get out a sentence, which you return to the user.

Tokenisation/Embeddings: This takes a word and its place in the sentence and turns them into a numerical representation.

Token output: You call generate and it generates the next token; you keep generating until you generate an end token. When learning, the end token is added at the end of each response.

Learning: The model manages the whole learning system and the scary maths called cross entropy, which really just measures how far the predicted probability distribution for the next item in the sequence is from the correct answer.

 

Learning:

 

The learning process is implemented in the backward and step functions. The backward function calculates the difference between the output and the target. Within this implementation the loss is produced in the badly named sofmax (it should be softmax) rather than in a traditional cross-entropy function, but it does actually compute the cross-entropy loss despite the name.

 

Despite the odd naming it does the softmax and the cross entropy in one go, which is convenient because their combined derivative for the backward pass is very simple.

 

What this does is exponentiate each logit (scaled by temperature), then divide each by the sum of all the exponentiated values, which makes sure the probabilities sum to 1.
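Stripped of the Matrix plumbing, a minimal sketch of that temperature-scaled softmax (not the exact function from my code, which also accumulates the loss row by row) looks something like this:

#include <algorithm>
#include <cmath>
#include <vector>

//minimal temperature-scaled softmax over one row of logits
//subtracting the max logit first keeps exp() from overflowing
std::vector<float> softmax_with_temperature(const std::vector<float>& logits, float temperature)
{
    std::vector<float> probs(logits.size());
    float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (size_t j = 0; j < logits.size(); ++j)
    {
        probs[j] = std::exp((logits[j] - max_logit) / temperature);
        sum += probs[j];
    }
    for (float& p : probs) p /= sum;//normalise so all the probabilities sum to 1
    return probs;
}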

 

The error is stored as softmax_out(i,j) minus an indicator (1 for the correct token, 0 otherwise), which puts a negative number against the correct answer and a positive number against everything else; when the update is subtracted, the correct answer is pushed up and the rest pushed down. That error is then back-propagated across all the layers and accumulated until step is called, at which point the whole update is applied at once with gradient clipping.

The gradient clipping kicks in if the accumulated error is too large and scales down the update to make sure the AI never takes too big a step.

This should all add up to allowing stable and smooth learning (fingers crossed).
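To make that concrete, here is a minimal sketch of the two pieces (simplified from the compute_loss and step functions in the appendix): the output gradient for each position is just the softmax probabilities with 1 subtracted from the correct token, and the clip scale shrinks the whole update if the accumulated (squared) gradient norm gets too big.

#include <vector>

//cross-entropy gradient at the output for one position:
//d[j] = p[j] - 1 for the correct token, d[j] = p[j] for every other token
std::vector<float> cross_entropy_grad(const std::vector<float>& probs, int target)
{
    std::vector<float> d = probs;
    d[target] -= 1.0f;
    return d;
}

//clip scale used when the accumulated squared gradient norm is too large,
//so the whole update is shrunk rather than taking one huge step
float clip_scale(double global_norm, float clip_val)
{
    return (global_norm > clip_val) ? clip_val / static_cast<float>(global_norm) : 1.0f;
}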

 

This error is cascaded through the layers of transformer blocks sequentially, in the reverse of the order they ran, so as the AI learns each layer makes small updates in proportion to its real contribution to the final output, and thereby they all learn.

 

The error is calculated in the backward function.

 

void Model::backward(const Matrix& d_out)

{

//update output representations

global_norm = 0.0;

Matrix x = final_ln_.out_;

d_token_embeddings_ = d_logits_.transpose() * x;

//compute gradients for the input to the output projection

Matrix dx = d_logits_ * token_embeddings_;

dx = final_ln_.backward(dx);

//cycle through layers updating each

for (int i = num_layers_ - 1; i >= 0; --i)

{

dx = layers_[i].backward(dx);

}

//update position embeddings

for (size_t i = 0; i < dx.rows(); ++i)

{

for (int j = 0; j < dx.cols(); ++j)

{

d_position_embeddings_(i, j) += dx(i, j);//simple gradient update

//update gradient as can then do step in a single point

global_norm += static_cast<double>(d_position_embeddings_(i, j))* static_cast<double>(d_position_embeddings_(i, j));

}

}

//update token representations 

for (size_t i = 0; i < last_input_ids_.size(); ++i)

{

int token_id = last_input_ids_[i];

for (int j = 0; j < d_model_; ++j)

{

d_token_embeddings_(token_id, j) += dx(i, j);//simple gradient update

//update gradient as can then do step in a single point

global_norm += static_cast<double>(d_token_embeddings_(token_id, j)) * static_cast<double>(d_token_embeddings_(token_id, j));//use token_id here: the gradient was accumulated on the token's row, not row i

}

}

}

 

And applied in the step function.

 

void Model::step(float lr, float clip_val)

{

//output projection is tied to the token embeddings, so it is updated with them below

//gather the gradient norms from the transformer layers first so the clip scale covers everything

for (auto& layer : layers_)

{

global_norm += layer.gradient_norm();

}

global_norm += final_ln_.gradient_norm();

float clip_scale = 1.0f;

if (global_norm > clip_val)

{

clip_scale = clip_val / static_cast<float>(global_norm);

}

//update token embeddings (and tied output projection)

for (size_t i = 0; i < token_embeddings_.rows(); ++i)

{

for (int j = 0; j < token_embeddings_.cols(); ++j)

{

token_embeddings_(i, j) -= lr * clip_scale * d_token_embeddings_(i, j);

}

}

//update position embeddings

for (size_t i = 0; i < d_position_embeddings_.rows(); ++i)

{

for (int j = 0; j < d_position_embeddings_.cols(); ++j)

{

position_embeddings_(i, j) -= lr * clip_scale * d_position_embeddings_(i, j);

}

}

for (auto& layer : layers_)

{

layer.step(lr * clip_scale);

}

final_ln_.step(lr * clip_scale);

}

 

Token Output:

 

The learning that takes place above trains the softmax head at the end of the model. Its output can be thought of as a list of probabilities, where each index in the list corresponds to an index in the list of tokens.

 

A token represents a word or anything else you might want to represent symbolically. You could have it so that the token !^&* means call a human operator, or *&^ might be passed back to the machine and cause it to run another pre-programmed piece of code. Usually though these things are chat bots, so the tokens list the words in the vocabulary the AI is allowed to use.

 

Two ways exist in this implementation to choose the next word: one is to just select the most likely answer (greedy selection); the other is to sample from the temperature-scaled distribution, which makes the model a bit more random.

 

if (sample)

{

//sample from distribution

std::random_device rd;

std::mt19937 gen(rd());

std::uniform_real_distribution<float> dist(0.0f, 1.0f);

float r = dist(gen);

float cumulative = 0.0f;

for (int j = 0; j < last_logits.cols(); ++j)

{

cumulative += last_logits(0, j);

if (r <= cumulative)

{

next_token = j;

break;

}

}

}

else

{

//greedy selection

float max_logit = last_logits(0, 0);

for (int j = 1; j < last_logits.cols(); ++j)

{

if (last_logits(0, j) > max_logit)

{

max_logit = last_logits(0, j);

next_token = j;

}

}

}

Temperature is a bit odd, it was not what I expected. The temperature part itself just divides the logits before the softmax, so a low temperature sharpens the distribution towards the biggest logit and a high one flattens it; the cumulative loop afterwards is simply sampling from that distribution with a single random draw, which ends up as a very cheap random selection. I have to admit it feels different than expected, I thought it would involve lots of pseudo-random number generation, but when I saw it I realised this is much more elegant, although I am still uncertain whether I have properly understood the implementation.
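For anyone else puzzling over the cumulative trick, it is a standard way of sampling from a discrete distribution (sometimes called inverse-CDF or roulette-wheel sampling): draw one uniform random number r between 0 and 1, then walk along the probabilities adding them up and stop at the first token whose running total passes r. A token with probability 0.6 owns 60% of that 0-to-1 line, so it gets picked roughly 60% of the time. A toy standalone sketch (made-up probabilities, not from the model):

#include <cstdio>
#include <random>
#include <vector>

int main()
{
    std::vector<float> probs = { 0.6f, 0.3f, 0.1f };//toy softmax output for a 3-token vocabulary
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);

    std::vector<int> counts(probs.size(), 0);
    for (int n = 0; n < 10000; ++n)
    {
        float r = dist(rng);
        float cumulative = 0.0f;
        int pick = static_cast<int>(probs.size()) - 1;//fallback in case of rounding
        for (size_t j = 0; j < probs.size(); ++j)
        {
            cumulative += probs[j];
            if (r <= cumulative) { pick = static_cast<int>(j); break; }
        }
        counts[pick]++;
    }
    //counts come out roughly proportional to the probabilities: about 6000, 3000, 1000
    for (size_t j = 0; j < counts.size(); ++j)
        std::printf("token %zu picked %d times\n", j, counts[j]);
    return 0;
}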

 

I did not think I would want to look into this area of AI, but I think this is an area for a future blog post, really playing with different implementations of temperature.

 

Note: I spotted that the sampling code was creating a new random device rather than using the one stored in the class, so I went back and fixed that.

 

Tokenisation/Embeddings:

 

A role that the transformer model needs to play is managing the user's input: taking the text and turning it into numbers.

 

Something I had dimly understood but never really properly grasped is that there is one set of numbers that represents the word and another set of numbers that represents its position in the input.

 

In an AI with embedding size 3, word 1 might have the embedding 0.3, 0.4, 0.6. The first position in the sentence might have a value of -0.8, 0.4, 0.6 while the second position might have 0.7, 0.5, -0.3. These values are added together, so word 1 in position 1 would sum to -0.5, 0.8, 1.2 but in the second position to 1.0, 0.9, 0.3, even though it is the same word.
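As a minimal sketch of that toy example (the numbers are made up, but the addition is exactly what the embed_tokens function below performs):

#include <cstdio>

int main()
{
    float word1[3]     = {  0.3f, 0.4f,  0.6f };//token embedding for "word 1"
    float position1[3] = { -0.8f, 0.4f,  0.6f };//position embedding for the first slot
    float position2[3] = {  0.7f, 0.5f, -0.3f };//position embedding for the second slot

    for (int j = 0; j < 3; ++j)
        std::printf("%.1f ", word1[j] + position1[j]);//-0.5 0.8 1.2 : word 1 in position 1
    std::printf("\n");
    for (int j = 0; j < 3; ++j)
        std::printf("%.1f ", word1[j] + position2[j]);//1.0 0.9 0.3 : same word, different position
    std::printf("\n");
    return 0;
}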

 

This combination creates a sort of code that blends the two factors in the input, like a "hash" of the values. What's more, something interesting happens while the AI learns: the updates push semantically similar words closer together. If two words are similar, their number representations naturally undergo a form of word-to-vector clustering and end up close to each other, and a similar thing happens with the position embeddings.

 

This process therefore naturally builds a vocabulary for the AI to use in which similar words flow through the network in "similar ways" and are more likely to activate "similar neurones", simply because their representations are mathematically similar.

 

Matrix Model::embed_tokens(const std::vector<int>& tokens)

{

Matrix embeds(tokens.size(), d_model_);

for (size_t i = 0; i < tokens.size(); ++i)

{

if (tokens[i] == pad_token)//if zero treat it as padding and do not add position embeddings

{

return embeds;//just finish after hitting a pad

}

non_pad_count_++;//count non-pad positions once per token

for (int j = 0; j < d_model_; ++j)

{

embeds(i, j) = token_embeddings_(tokens[i], j) + position_embeddings_(i, j);//add position and token embeddings at the same time to speed up the process

}

}

return embeds;

}

 

Abstraction:

 

The final point is that encapsulating all the code like this lets you interact with the system through distinct commands like generate. This takes in input_ids converted from the user's text, and the AI then keeps adding its own words to the list until it selects an end token to say that it is stopping.

 

std::vector<int> Model::generate(const std::vector<int>& input_ids, int max_length, float tempreture, bool sample)

{

std::vector<int> output_ids = { 2 };//starts with start of sentence

for (int i = 0; i < max_length; ++i)

{

//prepare input (input_ids + output_ids)

std::vector<int> current_input = input_ids;

current_input.insert(current_input.end(), output_ids.begin(), output_ids.end());

//forward pass

Matrix x = embed_tokens(current_input);



for (int j = 0; j < num_layers_; ++j)

{

x = layers_[j].forward(x);

}

x = final_ln_.forward(x);

Matrix logits = x * token_embeddings_.transpose();//use the tied token embeddings as the output projection, matching forward and step

//get logits for last token

Matrix last_logits(1, logits.cols());

for (int j = 0; j < logits.cols(); ++j)

{

last_logits(0, j) = logits(logits.rows() - 1, j)/tempreture;

}

//softmax redone as only 1 row so better to write again and not recall softmax in matrix

float max_logit = last_logits(0, 0);

for (int j = 1; j < last_logits.cols(); ++j)

{

if (last_logits(0, j) > max_logit)

{

max_logit = last_logits(0, j);

}

}

float sum = 0;

for (int j = 0; j < last_logits.cols(); ++j)

{

last_logits(0, j) = std::exp(last_logits(0, j) - max_logit);

sum += last_logits(0, j);

}

for (int j = 0; j < last_logits.cols(); ++j)

{

last_logits(0, j) /= sum;

}

//sample next token

int next_token = static_cast<int>(last_logits.cols()) - 1;//default to the last token in case rounding leaves the cumulative sum just short of r

if (sample)

{

//sample from distribution

std::uniform_real_distribution<float> dist(0.0f, 1.0f);

float r = dist(rng_);

float cumulative = 0.0f;

for (int j = 0; j < last_logits.cols(); ++j)

{

cumulative += last_logits(0, j);

if (r <= cumulative)

{

next_token = j;

break;

}

}

}

else

{

//greedy selection

float max_logit = last_logits(0, 0);

for (int j = 1; j < last_logits.cols(); ++j)

{

if (last_logits(0, j) > max_logit)

{

max_logit = last_logits(0, j);

next_token = j;

}

}

}

//add to output

output_ids.push_back(next_token);

if (next_token == 3)

{

//stops if we generate a EOS token

break;

}

}

return output_ids;

}
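Putting the abstraction together, this is roughly how I picture the model being driven from the outside; a hedged sketch using the encode, generate and decode methods from the header in the appendix, with placeholder hyperparameters and a tiny made-up corpus.

#include <iostream>
#include <string>
#include <vector>
#include "Model.h"

int main()
{
    //placeholder hyperparameters: vocab size, d_model, heads, d_ff, layers, max sequence length
    Model model(1000, 64, 4, 256, 2, 64);

    //build a vocabulary from a tiny made-up corpus (min_freq of 1 so every word gets a token)
    std::vector<std::string> corpus = { "hello how are you", "i am fine thank you" };
    model.build_vocab(corpus, 1);

    //encode the user's text, generate until the model emits an end token, then decode back to text
    std::vector<int> input_ids = model.encode("hello how are you");
    std::vector<int> output_ids = model.generate(input_ids, 20, 1.0f, true);
    std::cout << model.decode(output_ids) << std::endl;
    return 0;
}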

 

Conclusion

 

I am both curious and wary of vibe coding. I am very much of the opinion that the operational dangers and opportunities it represents have been meme'd to death but barely been considered as a benefit or disbenefit to the organisation.

 

What I have noted is that the code it produces from a single one-shot prompt of "please produce this function or class" seems on the whole to be good code and a massive time saving. That said, most of the mistakes I have seen came when I prompted and re-prompted it with changes, fixes and upgrades. My hypothesis is that the transformer is great at recreating training data in response to queries, but lacking the real "je ne sais quoi" of whatever consciousness is in us, it cannot handle well the multiple overlapping requests and competing design choices that a real developer juggles; nor does it have a clean ability to test and learn the way a developer writes code and then runs a series of unit tests to improve it.

 

In my case I used this coding challenge in exactly the way AI agents work, prompt-engineering two different AI systems to get one to correct the other. The issue is that they kept finding faults in whatever code the other produced for a lengthy time, and while about half the improvements were instant wins, they were often low-level ones; the bigger wins that the AI diagnosed in the code were then often implemented poorly.

 

I believe it performed poorly because in the AI's weights there is a "template" transformer implementation that exists almost as a platonic ideal, but when asked for "a transformer, plus add in a clip scale, but also I think you have put too much of what should be in the step function into backwards", it produces messy code without much thought for speed.

 

This meant that, as a learning experience, I just implemented most of its masking suggestions before going "I'm fairly sure you're doing this really badly".

 

Therefore the more agents you add, the more chance they have of over-complicating things. I think the issue is that you are recursively calling one to criticise the other's work, and in the real world this behaviour does not lead to better work; constant criticism leads to broken people. What works is iterative testing alongside gathering data and doing research. The issue is that AI excels at the research part, but only because it has been trained on the total sum of human knowledge; at the other parts of the work, gathering data and running tests, it is not necessarily very good.

 

For this reason I am still somewhat on the fence about AI agents. I feel a genuine claim to an AI agent would only be possible if the AI thinks in space and time and is self-aware enough of its own thinking to criticise it and to seek out data that adds to its own knowledge. I just do not feel that two AIs talking to each other could necessarily self-terminate the conversation or plan and implement substantive tests (safely).

 

The vast number of templates it has in its training data means that as a starting point it can generate top-quality code as Lego blocks that a developer can then put together.

 

I feel this weirdly entrenches my view that AI is a world-changing technology but not a replacement for the human in the seat, and that it does not remove the PICNIC (Problem In Chair, Not In Computer) issue that exists in IT. It might oddly deepen that issue, with people increasingly capable of doing things they lack sufficient understanding to fix. I think the skill of developers will shift towards project planning, knowing each step in the process and putting very clear breaks in the implementation for rigorous testing, which creates a paradox: if the AI produces code that is poorly understood, how do you implement the right tests?

 

My experience with vibe coding therefore is that it is a great tool to learn with, and where testing is applied you might even get something into production. There is a problem, though: the amount of testing required probably makes it faster to build with people who explicitly know what they are doing, and the risks it creates are essentially uncontrollable if that testing is not done.

 

I think this also ignores that you can code this way at all: you can move into building things you have limited knowledge about, with a build-and-test cycle, that previously you had little chance of attempting.

 

In my case I already had basic linear algebra training as a data analyst, I have built plenty of code projects in my life, some of which were AI, so I had robust coding skills, and I have a smattering of project management skills. In many ways building a transformer from scratch is just the next step in that learning process, and I think it would be interesting to know how many of those skills could be dispensed with before a project that uses vibe coding becomes too risky or time-consuming to be worth it.

 

My next step is testing the model (extensively and obsessively) and hoping it does learn and reduce its error. If that happens then, ex nihilo, you can build some of the most complex coding systems there are, provided you are willing to recognise what you do not know and keep learning. The question that comes to mind: I think this is fine (some people might disagree), but what happens if the project was to build a nuclear bomb or antimatter?




Appendix

 

Model.h

 

#pragma once

#include <vector>

#include <string>

#include <unordered_map>

#include <random>

#include <fstream>

#include <sstream>

#include <stdexcept>

#include <algorithm>

#include <filesystem>

#include <memory>

#include "Matrix.h"

#include "TransformerBlock.h"

#include "Unit_tests.h"

#include "Layer_norm.h"

class Model

{

public:

struct Config { //really only exists to call and get config on a production network

int max_vocab_size;

int d_model;

int num_heads;

int d_ff;

int num_layers;

int max_seq_len;

};

Model(int vocab_size, int d_model, int num_heads, int d_ff, int num_layers, int max_seq_len);

//load/save model

void save(const std::string& filename);

void load(const std::string& filename);

//Tokenisation

void build_vocab(const std::vector<std::string>& texts, int min_freq = 2,int max_vocab=1000000);

int token_to_id(const std::string& token) const;

std::string id_to_token(int id) const;

std::vector<int> encode(const std::string& text) const;

std::string decode(const std::vector<int>& tokens) const;

//Forward/backward passes

float forward(const std::vector<int>& input_ids, const std::vector<int>& target_ids, float tempreture = 1.0f);

void backward(const Matrix& d_out);

void step(float lr, float clip_value);

void zero_grad();

//generation

std::vector<int> generate(const std::vector<int>& input_ids, int max_length, float tempreture = 1.0f, bool sample = true);

//Training Utilities

void record();//need to think about this one if unit test goes inside model or not//records data 

const Config get_config();

private:

//model architecture

int vocab_size_;

int actual_vocab_size;

int d_model_;

int num_layers_;

int max_seq_len_;

int non_pad_count_;

double global_norm;

const int pad_token = 0;//pads empty spaces

const int unk_token = 1;//token for unknowns

const int sos_token = 2;//start of sentence

const int eos_token = 3;//end of sentence

 

//Tokenisation

std::unordered_map<std::string, int> vocab_;//haven't implemented yet, but thinking of making these entirely int maps and only storing the strings in sorted_tokens_

std::vector<std::string> sorted_tokens_;

std::unordered_map<int, std::string> reverse_vocab_;

//Model components

Matrix token_embeddings_;

Matrix d_token_embeddings_;

Matrix position_embeddings_;

Matrix d_position_embeddings_;

std::vector<TransformerBlock> layers_;

Layer_norm final_ln_;

Matrix output_projection_;

 

//Saved for backward pass

 

std::vector<int> last_input_ids_;

std::vector<int> last_target_ids_;

std::vector<Matrix> layer_outputs_;

std::vector<Matrix> layer_gradients_;

Matrix logits_;

Matrix d_logits_;



//helper functions

Matrix embed_tokens(const std::vector<int>& tokens);

float compute_loss(const Matrix& input,const std::vector<int>& target,Matrix& d_out);

float sofmax(Matrix& logits, const std::vector<int>& target_ids,float tempreture = 1.0f);

void check_sequence_length(int length) const;

 

Unit_tests tracker;

//RNG for generation

std::mt19937 rng_;

struct GenerationState

{

std::vector<int> input_id;

std::vector<int> output_ids;

Matrix x;

Matrix logits;

std::vector<std::pair<Matrix, Matrix>> layer_kv_caches;//key and value pairs

};

};



Model.cpp

 

#include "Model.h"

#include <iostream>

#include <cmath>

#include <numeric>

#include <random>

#include <cassert>

 

//#include <nlohmann/json.hpp>



Model::Model(int vocab_size, int d_model, int num_heads, int d_ff, int num_layers, int max_seq_len):

vocab_size_(vocab_size),d_model_(d_model),num_layers_(num_layers),max_seq_len_(max_seq_len),

token_embeddings_(vocab_size,d_model),position_embeddings_(max_seq_len,d_model),final_ln_(d_model),

output_projection_(d_model,vocab_size),rng_(std::random_device{}())

{

if (max_seq_len < 3)

{

//need at least 3 tokens 

throw std::invalid_argument("max_seq_len must be at least 3 tokens to accept the start and end tokens");

}

//initialise token embeddings

float scale = std::sqrt(2.0f / d_model);

token_embeddings_.fill_random(-scale, scale);

//initialise position embeddings

position_embeddings_.fill_random(-scale, scale);

//create transformer layers

for (int i = 0; i < num_layers; ++i)

{

layers_.emplace_back(d_model, num_heads, d_ff, vocab_size);

}

}

void Model::check_sequence_length(int length) const

{

if (length > max_seq_len_)

{

throw std::out_of_range("Sequence length exceeds maximum length");

}

}

void Model::build_vocab(const std::vector<std::string>& texts, int min_freq,int max_vocab)

{

std::unordered_map<std::string, int> freq_map;

//count token frequencies

//for every string in texts,split it into words and counts how many times each word appears across all texts

for (const auto& text : texts)

{

std::istringstream iss(text);

std::string token;

while (iss >> token)//splits on whitespace, tabs and newlines

{

freq_map[token]++;

}

}

//add special tokens straight into the vocabulary so they keep their fixed ids and are not treated as word frequencies

vocab_["<PAD>"] = pad_token;//pads out data

vocab_["<UNK>"] = unk_token;//unknown token

vocab_["<SOS>"] = sos_token;//start of sentence

vocab_["<EOS>"] = eos_token;//end of sentence

reverse_vocab_[pad_token] = "<PAD>";//pads out data

reverse_vocab_[unk_token] = "<UNK>";//unknown response

reverse_vocab_[sos_token] = "<SOS>";//start of sentence

reverse_vocab_[eos_token] = "<EOS>";//end of sentence

//add frequent tokens

int id = 4;

std::vector<std::pair<std::string, int>> ordered(freq_map.begin(), freq_map.end());

std::sort(ordered.begin(), ordered.end(), [](const std::pair<std::string, int>& a, const std::pair<std::string, int>& b) {return a.second > b.second; });

for (size_t i = 0; i < ordered.size(); ++i)

{

if (ordered[i].second >= min_freq && id <= max_vocab)//ordering ensures we add the most useful tokens in order of frequency, and min_freq lets us ignore, say, words that appeared only once

{

vocab_[ordered[i].first] = id;

reverse_vocab_[id] = ordered[i].first;

id++;

}

 

}

actual_vocab_size = id;//ended up using unordered maps for their hashing speed, but sorted by frequency first anyway

//also an element here of me trying multiple methods and ending up with a bit of a mix; it maximises speed when run but is a bit unnecessary

if (actual_vocab_size > vocab_size_)//the built vocabulary has to fit inside the embedding table

{

throw std::runtime_error("Vocabulary exceeds size constraints");

}

}

int Model::token_to_id(const std::string& token) const

{

auto it = vocab_.find(token);

if (it != vocab_.end())

{

return it->second;

}

return unk_token;//gives back an unknown token

}

std::string Model::id_to_token(int id) const

{

auto it = reverse_vocab_.find(id);

if (it != reverse_vocab_.end())

{

return it->second;

}

return "<UNK>";//gives back a uknown response

}

std::vector<int> Model::encode(const std::string& text) const

{

std::vector<int> tokens;

std::istringstream iss(text);

std::string token;

int helper;

tokens.push_back(sos_token);//SOS start of sentence

while (iss >> token)

{

helper = token_to_id(token);

tokens.push_back(helper);

}

tokens.push_back(eos_token);//EOS end of sentence

//pad to max seq length

while (tokens.size() < static_cast<size_t>(max_seq_len_))

{

tokens.push_back(pad_token);

}

//truncate if too long

if (tokens.size() > static_cast<size_t>(max_seq_len_))

{

tokens.resize(max_seq_len_);

}

return tokens;//the function previously fell off the end without returning the encoded ids

}

Matrix Model::embed_tokens(const std::vector<int>& tokens)

{

Matrix embeds(tokens.size(), d_model_);

for (size_t i = 0; i < tokens.size(); ++i)

{

if (tokens[i] == pad_token)//if zero treat it as padding and do not add position embeddings

{

return embeds;//just finish after hitting a pad

}

non_pad_count_++;//count non-pad positions once per token

for (int j = 0; j < d_model_; ++j)

{

embeds(i, j) = token_embeddings_(tokens[i], j) + position_embeddings_(i, j);//add position and token embeddings at the same time to speed up the process

}

}

return embeds;

}

std::string Model::decode(const std::vector<int>& tokens) const

{

std::string text;

for (int token : tokens)

{

if (token == 0) continue;//if its a pad token ignores

if (token == 3) break;//on EOS token break

text += id_to_token(token) + " ";

}

return text;

}

float Model::sofmax(Matrix& logits, const std::vector<int>& target_ids,float tempreture)

{

tempreture = std::max(tempreture, 1e-3f);//stops explosive inputs

float scale= 1.0f / non_pad_count_;

float loss = 0.0f;

for (size_t i = 0; i < logits.rows(); ++i)//same as sequence length

{

float max_logit = logits(i, 0);

//find max for numerical stability

for (size_t j = 1; j < logits.cols(); ++j)

{

if (logits(i, j) > max_logit) { max_logit = logits(i, j); }//keep the running maximum

}

//apply tempreture and compute softmax

double sum = 0.0;

for (size_t j = 0; j < logits.cols(); ++j)

{

logits(i, j) = std::exp((logits(i, j) - max_logit) / tempreture);

sum += logits(i, j);

}

//Normalise

for (size_t j = 0; j < logits.cols(); ++j)

{

logits(i, j) /= sum;

}

if (target_ids[i] != pad_token)

{

loss -= std::log(logits(i, target_ids[i]) + 1e-10f);

}

}

return loss;

}

float Model::forward(const std::vector<int>& input_ids, const std::vector<int>& target_ids, float tempreture)//default for tempreture lives in the header declaration

{

check_sequence_length(input_ids.size());

check_sequence_length(target_ids.size());

last_input_ids_ = input_ids;

last_target_ids_ = target_ids;

 

//token and position embeddings

Matrix x = embed_tokens(input_ids);

 

//forward through transformer layers

for (int i = 0; i < num_layers_; ++i)

{

x = layers_[i].forward(x);//cycle along layers outputs 

}

//final layer norm

x = final_ln_.forward(x);

//output projection

logits_ = x * token_embeddings_.transpose();//project with the tied token embeddings (d_token_embeddings_ holds gradients, not weights)

//compute loss and get gradients

d_logits_.zero();

Matrix softmax_out = logits_;

//create mask for pad tokens

//would put a mask here but zeroes all the inputs 




float loss = sofmax(softmax_out, target_ids, tempreture);//turns softmax_out into probabilities in place and returns the loss

//store the cross-entropy gradient (softmax minus the one-hot target) for the backward pass

for (size_t i = 0; i < softmax_out.rows(); ++i)

{

if (target_ids[i] == pad_token)

{

for (size_t j = 0; j < softmax_out.cols(); ++j) softmax_out(i, j) = 0.0f;//mask out pad positions

}

else

{

softmax_out(i, target_ids[i]) -= 1.0f;

}

}

d_logits_ = softmax_out;

return loss;

}

 

float Model::compute_loss(const Matrix& input, const std::vector<int>& target, Matrix& d_out)

{

assert(input.rows() == target.size());

assert(d_out.rows() == input.rows());

assert(d_out.cols() == input.cols());

Matrix logits = input;

//compute softmax for loss calculation

Matrix softmax_out = logits;

softmax_out.softmax();

//compute loss

float loss = 0.0f;

for (size_t i = 0; i < target.size(); ++i)

{

int target_idx = target[i];

assert(target_idx >= 0 && target_idx < logits.cols());

 

//loss is -log(p_target)

loss -= std::log(softmax_out(i, target_idx) + 1e-10f);

}

loss /= target.size();//average loss

//compute gradients using the original logits and not softmax

d_out.zero();

for (size_t i = 0; i < target.size(); ++i)

{

int target_idx = target[i];

//gradient for this position

for (size_t j = 0; j < logits.cols(); ++j)

{

float indicator = (j == target_idx) ? 1.0f : 0.0f;

d_out(i, j) = softmax_out(i, j) - indicator;

}

}

//average gradients

for(size_t i = 0; i < d_out.rows(); ++i)

{

for (size_t j = 0; j < d_out.cols(); ++j)

{

d_out(i, j) /= target.size();//dout passed in by reference is now filled with the data even though not explicitly returned

}

}

return loss;

}

void Model::backward(const Matrix& d_out)

{

//update output representations

global_norm = 0.0;

Matrix x = final_ln_.out_;

d_token_embeddings_ = d_logits_.transpose() * x;

//compute gradients for the input to the output projection

Matrix dx = d_logits_ * token_embeddings_;

dx = final_ln_.backward(dx);

//cycle through layers updating each

for (int i = num_layers_ - 1; i >= 0; --i)

{

dx = layers_[i].backward(dx);

}

//update position embeddings

for (size_t i = 0; i < dx.rows(); ++i)

{

for (int j = 0; j < dx.cols(); ++j)

{

d_position_embeddings_(i, j) += dx(i, j);//simple gradient update

//update gradient as can then do step in a single point

global_norm += static_cast<double>(d_position_embeddings_(i, j))* static_cast<double>(d_position_embeddings_(i, j));

}

}

//update token representations 

for (size_t i = 0; i < last_input_ids_.size(); ++i)

{

int token_id = last_input_ids_[i];

for (int j = 0; j < d_model_; ++j)

{

d_token_embeddings_(token_id, j) += dx(i, j);//simple gradient update

//update gradient as can then do step in a single point

global_norm += static_cast<double>(d_token_embeddings_(token_id, j)) * static_cast<double>(d_token_embeddings_(token_id, j));//use token_id here: the gradient was accumulated on the token's row, not row i

}

}

}

void Model::step(float lr, float clip_val)

{

//output projection is tied to the token embeddings, so it is updated with them below

//gather the gradient norms from the transformer layers first so the clip scale covers everything

for (auto& layer : layers_)

{

global_norm += layer.gradient_norm();

}

global_norm += final_ln_.gradient_norm();

float clip_scale = 1.0f;

if (global_norm > clip_val)

{

clip_scale = clip_val / static_cast<float>(global_norm);

}

//update token embeddings (and tied output projection)

for (size_t i = 0; i < token_embeddings_.rows(); ++i)

{

for (int j = 0; j < token_embeddings_.cols(); ++j)

{

token_embeddings_(i, j) -= lr * clip_scale * d_token_embeddings_(i, j);

}

}

//update position embeddings

for (size_t i = 0; i < d_position_embeddings_.rows(); ++i)

{

for (int j = 0; j < d_position_embeddings_.cols(); ++j)

{

position_embeddings_(i, j) -= lr * clip_scale * d_position_embeddings_(i, j);

}

}

for (auto& layer : layers_)

{

layer.step(lr * clip_scale);

}

final_ln_.step(lr * clip_scale);

}

void Model::zero_grad()

{

for (auto& layer : layers_)

{

layer.zero_grad();

}

}

void Model::record()

{

tracker.record("token_embeddings", d_token_embeddings_);

tracker.record("position_embeddings", d_position_embeddings_);

for (int i = 0; i < layers_.size(); ++i)

{

tracker.track_block_gradients(layers_[i], "layer " + std::to_string(i) + " ");

};

}

std::vector<int> Model::generate(const std::vector<int>& input_ids, int max_length, float tempreture, bool sample)

{

std::vector<int> output_ids = { 2 };//starts with start of sentence

for (int i = 0; i < max_length; ++i)

{

//prepare input (input_ids + output_ids)

std::vector<int> current_input = input_ids;

current_input.insert(current_input.end(), output_ids.begin(), output_ids.end());

//forward pass

Matrix x = embed_tokens(current_input);



for (int j = 0; j < num_layers_; ++j)

{

x = layers_[j].forward(x);

}

x = final_ln_.forward(x);

Matrix logits = x * token_embeddings_.transpose();//use the tied token embeddings as the output projection, matching forward and step

//get logits for last token

Matrix last_logits(1, logits.cols());

for (int j = 0; j < logits.cols(); ++j)

{

last_logits(0, j) = logits(logits.rows() - 1, j)/tempreture;

}

//softmax redone as only 1 row so better to write again and not recall softmax in matrix

float max_logit = last_logits(0, 0);

for (int j = 1; j < last_logits.cols(); ++j)

{

if (last_logits(0, j) > max_logit)

{

max_logit = last_logits(0, j);

}

}

float sum = 0;

for (int j = 0; j < last_logits.cols(); ++j)

{

last_logits(0, j) = std::exp(last_logits(0, j) - max_logit);

sum += last_logits(0, j);

}

for (int j = 0; j < last_logits.cols(); ++j)

{

last_logits(0, j) /= sum;

}

//sample next token

int next_token = static_cast<int>(last_logits.cols()) - 1;//default to the last token in case rounding leaves the cumulative sum just short of r

if (sample)

{

//sample from distribution

std::uniform_real_distribution<float> dist(0.0f, 1.0f);

float r = dist(rng_);//use the RNG stored on the class rather than creating a new random device every call

float cumulative = 0.0f;

for (int j = 0; j < last_logits.cols(); ++j)

{

cumulative += last_logits(0, j);

if (r <= cumulative)

{

next_token = j;

break;

}

}

}

else

{

//greedy selection

float max_logit = last_logits(0, 0);

for (int j = 1; j < last_logits.cols(); ++j)

{

if (last_logits(0, j) > max_logit)

{

max_logit = last_logits(0, j);

next_token = j;

}

}

}

//add to output

output_ids.push_back(next_token);

if (next_token == 3)

{

//stops if we generate a EOS token

break;

}

}

return output_ids;

}

void Model::save(const std::string& filename)

{

 

std::ofstream out_file(filename, std::ios::binary);

 

if (!out_file)

{

std::cerr << "Failed to open file: " << filename << std::endl;

return;

}

const uint32_t version = 1;

out_file.write(reinterpret_cast<const char*> (&version), sizeof(version));//set version for the saving process

//save model architecture

uint32_t vocab_size_local = vocab_size_;

out_file.write(reinterpret_cast<const char*>(&vocab_size_local), sizeof(vocab_size_local));

uint32_t d_model_local = d_model_;

out_file.write(reinterpret_cast<const char*>(&d_model_local), sizeof(d_model_local));

uint32_t num_layers_local = num_layers_;

out_file.write(reinterpret_cast<const char*>(&num_layers_local), sizeof(num_layers_local));

uint32_t max_seq_len_local = max_seq_len_;

out_file.write(reinterpret_cast<const char*>(&max_seq_len_local), sizeof(max_seq_len_local));

//save vocabulary size

uint32_t vocab_size = vocab_.size();

out_file.write(reinterpret_cast<const char*>(&vocab_size), sizeof(vocab_size));

 

for (const auto& pair : vocab_)

{

uint32_t token_len = pair.first.length();

out_file.write(reinterpret_cast<const char*>(&token_len), sizeof(token_len));

out_file.write(pair.first.c_str(), token_len);

out_file.write(reinterpret_cast<const char*> (&pair.second), sizeof(pair.second));

 

}

//save parameters uses custom matrix save function

token_embeddings_.save(out_file);

position_embeddings_.save(out_file);

output_projection_.save(out_file);

 

//save layers

 

for (auto& layer : layers_)

{

layer.save(out_file);

}

 

}

void Model::load(const std::string& filename)

{

std::ifstream in_file(filename, std::ios::binary);

if (!in_file)

{

std::cerr << "Failed to open file: " << filename << std::endl;

return;

}



uint32_t version, vocab_size_local, d_model_local, num_layers_local, max_seq_len_local;

in_file.read(reinterpret_cast<char*>(&version), sizeof(version));

if (version != 1)

{

throw std::runtime_error("mismatched versions, encountered unsupported version");

}

 

in_file.read(reinterpret_cast<char*>(&vocab_size_local), sizeof(vocab_size_local));

vocab_size_ = static_cast<size_t>(vocab_size_local);

in_file.read(reinterpret_cast<char*>(&d_model_local), sizeof(d_model_local));

d_model_ = static_cast<size_t>(d_model_local);

in_file.read(reinterpret_cast<char*>(&num_layers_local), sizeof(num_layers_local));

num_layers_ = static_cast<size_t>(num_layers_local);

in_file.read(reinterpret_cast<char*>(&max_seq_len_local), sizeof(max_seq_len_local));

max_seq_len_ = static_cast<size_t>(max_seq_len_local);

 

//load vocabulary size

uint32_t vocab_size;

in_file.read(reinterpret_cast<char*>(&vocab_size), sizeof(uint32_t));

vocab_size_ = vocab_size;

//load vocabulary (token->id)

vocab_.clear();

reverse_vocab_.clear();

for (size_t i = 0; i < vocab_size_; ++i)

{

uint32_t token_len;

in_file.read(reinterpret_cast<char*>(&token_len), sizeof(token_len));

 

std::string token(token_len, '\0');

in_file.read(&token[0], token_len);

 

int id = 0;//ids are written as int in save(), so read back the same width

in_file.read(reinterpret_cast<char*>(&id), sizeof(id));

 

vocab_[token] = id;

reverse_vocab_[id] = token;

}

//load parameters uses custom matrix load function

token_embeddings_.load(in_file);

position_embeddings_.load(in_file);

output_projection_.load(in_file);

 

//load layers

 

for (auto& layer : layers_)

{

layer.load(in_file);

}

}



void Matrix::load(std::ifstream& in_file)

{

 

uint32_t version, row_local, col_local;

in_file.read(reinterpret_cast<char*>(&version), sizeof(version));

if (version != 1)

{

throw std::runtime_error("mismatched versions, encountered unsupported version");

}

in_file.read(reinterpret_cast<char*>(&row_local), sizeof(row_local));

rows_ = static_cast<size_t>(row_local);

 

in_file.read(reinterpret_cast<char*>(&col_local), sizeof(col_local));

cols_ = static_cast<size_t>(col_local);

if (rows_ > max_size / cols_)

{

throw std::runtime_error("Matrix size overflow");

}

data_.resize(rows_ * cols_);

in_file.read(reinterpret_cast<char*>(data_.data()), data_.size() * sizeof(float));

if (!in_file.good())

{

throw std::runtime_error("Failed to read matrix data");

}

}
