“Why are we fighting to save the humans? They’re a primitive and violent race.” - Ironhide
I cannot find a quote about the transformer AI architecture without first finding a quote from a Transformer, so you're getting an opening quote from a Transformer who, I believe, is also an AI.
I got the backward pass built; the code is at the bottom, in the appendix.
I had to spend some time cycling through things, AI advice and writing a lot of asserts and tests to finally get this working. The issue for me was making mistakes with the matrix multiplication.
The linear algebra can fail when adding two matrices whose dimensions do not match, and when multiplying two matrices where the number of rows of the second does not match the number of columns of the first. So I decided to work out the dimensions at runtime and included a limited amount of print-to-terminal debug information, then forgot to write down what every matrix dimension ought to be when I wrote it.
This creates an interesting kind of bug: when the dimensions are wrong they cascade and change the sizing of other matrices, so the place where the error is raised is not where the error starts.
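For what it's worth, here is a minimal sketch of the kind of dimension check that would have caught this early, using the Matrix class from the appendix (the check_dims helper is hypothetical and not part of the actual code):

// Hypothetical helper, not in the appendix: fail loudly with the expected
// and actual shapes instead of letting a bad size cascade downstream.
#include <cassert>
#include <cstdio>
#include "Matrix.h"

static void check_dims(const Matrix& m, size_t rows, size_t cols, const char* name)
{
    if (m.rows() != rows || m.cols() != cols)
    {
        std::printf("%s is (%zu,%zu), expected (%zu,%zu)\n", name, m.rows(), m.cols(), rows, cols);
        assert(false);
    }
}
// e.g. inside a forward pass: check_dims(Q_, seq_len, d_model, "Q_");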
Consequently I spent about double the time I spent writing it cross referencing between AIs and trying to vibe code my way out of it, applying every fix they suggested, only to discover I really should have noted how different the feed forward dimensions are from the main transformer block before I started.
This was a mistake on my part, and it shows the difficulty of learning a new technique and how vibe coding (which I used extensively) is both a speed-up and a liability: it is great where there are gaps in your knowledge, but make sure you have enough of the basics to grind through the problem. The lesson I learned was that systematic commenting of the code was the quickest solution (I rewrote it multiple times until I did this).
Still, I did it: a full forward and backward implementation of the transformer architecture. The next step would be to look at the encoder and the wider architecture. I also have a stack of papers, read or referenced, listed at the end, which at some point I need to cross reference and sit down and read properly, rather than the glances and AI summaries I used in the beginning. So first I need to start understanding all the aspects of what I coded. This is just a dump of the first design that worked(ish); testing and understanding come later.
Aim for this design
I have not properly started testing this, so it is still a toy model used to show the principles inside a transformer AI for the backward pass, i.e. how it learns. It does at least need to run…
A brief explanation
I wanted to start with the really simple explanation I wish I had been given ages ago, and to say why I wanted to build this: to show code to someone who, like past me, only ever got directed to formulas in books.
I feel that formulas are quintessentially the best way to record information so it can be passed down between generations, but if you're a coder rather than a mathematician it can be hard to make out exactly what is meant. At best it's 50/50 whether I can turn a maths formula into code, and I have regularly read a book that touches on data, AI or code and struggled with the formulas.
Therefore, if even one person reads this, goes to the code appendix at the end and goes "oh, I get it now", that will be worth it.
What was already built: the forward pass
The transformer AI works in two phases: the forward pass and the backward pass. The forward pass calculates each layer of the output from the bottom up. The whole design is envisioned as one block of linear algebra (admittedly split into multiple "heads", or parts looking at the same inputs, but still).
During the forward pass, and one of the reasons why I split up the architecture, each part needs to store its various intermediary calculations to do the backward pass; i.e. when learning you need to know the input (what you did) and the amount by which it was wrong. Splitting the parts up means each part, including the multiple heads, stores its own inputs.
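To make that concrete, every object in the code follows the same cache-then-reuse pattern. Here is a minimal sketch of it using the Matrix class from the appendix (the Linear class here is illustrative, not part of the actual code):

// Simplified sketch of the cache-then-reuse pattern, not the exact appendix code.
#include "Matrix.h"

class Linear
{
public:
    Linear(int d_in, int d_out) : W_(d_in, d_out), dW_(d_in, d_out) {}
    Matrix forward(const Matrix& x)
    {
        x_ = x;                          // remember the input (what you did)
        return x * W_;                   // produce the output
    }
    Matrix backward(const Matrix& d_out)
    {
        dW_ = x_.transpose() * d_out;    // how wrong, with respect to the weights
        return d_out * W_.transpose();   // how wrong, passed back to the previous layer
    }
private:
    Matrix W_, dW_, x_;
};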
So I split the previous design up into: Multiheaded Attention, Feed forward, Transformer block and Layer norm.
Back Propagation Explained Badly
The AI learns by "gradient descent", the mathematical process that uses the "gradients" to update the weights. The gradients themselves are computed by back propagation using the chain rule.
New weight = old weight - learning rate * gradient.
The maths can feel intense but the intuition is beautiful. With an output it is relatively easy to know the direction of travel (is your guess bigger or smaller than the true value?). If the guess is too small, a positive number gets propagated back to increase the weights; if the guess is too high, a negative number gets propagated back.
It is not quite that simple, of course: gradient descent handles the learning, i.e. how to update the weights, while back propagation computes the values it needs.
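In code the update is just an element-wise loop. This is a small sketch of the rule using the Matrix class from the appendix (the sgd_step name is mine; each step() function in the actual code does the same thing in place):

// New weight = old weight - learning rate * gradient, applied element by element.
#include "Matrix.h"

void sgd_step(Matrix& W, const Matrix& dW, float lr)
{
    for (size_t i = 0; i < W.rows(); ++i)
    {
        for (size_t j = 0; j < W.cols(); ++j)
        {
            W(i, j) -= lr * dW(i, j);
        }
    }
}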
In the architecture I built, this is managed by each of the different objects (Multiheaded Attention, Feed forward, Transformer block and Layer norm) having a forward function that takes the input, multiplies it by the weights, applies an activation function and passes out an output.
A nice mental picture is that learning is like being on a mountain: you know whether you need to go up or down, but back propagation tries to calculate whether you should run, walk or throw yourself off the cliff to get down to the place where the AI works better. It's part calculus, part metaphorical mountaineering, and it uses a derivative to estimate that speed.
Yay derivatives...
Oh yes activation functions
An activation function takes the output of the linear maths and makes it non-linear.
The reason we use multi-headed attention is to have multiple different matrices take the same input and apply different maths to it. This also complicates the backward pass, because of the derivatives…
In practice, when building an AI you use a set of tried and tested activation functions that you know aid the design you're attempting to build.
One part of gradient descent is multiplying the value, as it is passed back, by the derivative of the activation function. This makes it easier to write individual forward and backward functions for each of the parts.
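Concretely, the feed forward layer in my code uses ReLU, so its backward pass multiplies the incoming gradient by the ReLU derivative (1 where the activation was positive, 0 elsewhere). These are the two relevant lines from the appendix code:

// Forward (from feedforward::forward): ReLU clamps negatives to zero.
h_(i, j) = std::max(0.0f, h_(i, j));
// Backward (from feedforward::backward): the gradient only flows where the activation was positive.
d_h(i, j) *= (h_(i, j) > 0) ? 1.0f : 0.0f;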
Back to my code…
In the next section I put descriptions of the objects in use and the names of the matrices, etc.
Code Architecture
-Multiheaded Attention: the core idea that a set of heads using different projections can perform better by "attending" to different aspects of the input. It projects the input into query, key and value spaces, computes attention scores, applies them to the values, then combines the attention heads and projects back to the model dimension (see the sketch just after this list).
-Feed forward: processes the output of the attention sub-layer. It applies two linear transformations with a ReLU activation in between (ReLU means rectified linear unit and adds non-linearity by changing anything less than zero to zero). It first expands the dimension to d_ff, then contracts back to d_model.
-Layer norm: normalises the activations to have zero mean and unit variance, then applies learnable scale and shift parameters.
-Transformer block: combines attention and feed forward with residual connections and applies layer normalisation after each sub-layer.
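To tie the attention item above together, here is a simplified sketch of what one head does, assuming the Matrix class from the appendix (the real code uses d_model x d_model weights split across heads, plus a causal mask):

// Simplified single head of attention, illustrative only.
#include <cmath>
#include "Matrix.h"

Matrix attention_head(const Matrix& x, const Matrix& Wq, const Matrix& Wk, const Matrix& Wv)
{
    Matrix Q = x * Wq;                  // (seq_len, d_head)
    Matrix K = x * Wk;                  // (seq_len, d_head)
    Matrix V = x * Wv;                  // (seq_len, d_head)
    Matrix scores = Q * K.transpose();  // (seq_len, seq_len) attention logits
    scores = scores * (1.0f / std::sqrt(static_cast<float>(Q.cols())));
    scores.softmax();                   // each row becomes a probability distribution
    return scores * V;                  // weighted sum of values, (seq_len, d_head)
}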
All of this is in some sense represented by different matrices throughout the code. The names of them all are below.
Matrices used in Attention
Wq_ - Query weight matrix / Projects input to query space for attention - (d_model,d_model)
Wk_ - Key weight matrix / Projects input to key space for attention - (d_model,d_model)
Wv_ - Value weight matrix / Projects input to value space for attention - (d_model,d_model)
Wo_ - Output weight matrix / Projects the combined attention heads back to model dimension - (d_model,d_model)
dWq_ - Query weight gradient / Gradient of query weight matrix - (d_model,d_model)
dWk_ - Key weight gradient / Gradient of key weight matrix - (d_model,d_model)
dWv_ - Value weight gradient / Gradient of value weight matrix - (d_model,d_model)
dWo_ - Output weight gradient / Gradient of output weight matrix - (d_model,d_model)
x_ - Input matrix / Stores input for backward pass - (seq_len,d_model) - assigned on calculation
Q_ - Query matrix / Query projections before splitting into heads - (seq_len,d_model) - assigned on calculation
K_ - Key matrix / Key projections before splitting into heads - (seq_len,d_model) - assigned on calculation
V_ - Value matrix / Value projections before splitting into heads - (seq_len,d_model) - assigned on calculation
attn_out_ - Attention output / Final attention output - (seq_len,d_model) - assigned on calculation
Ah_combined_ - Combined heads / Combined attention head outputs - (seq_len,d_model) - assigned on calculation
Qh_ - Query heads / Vector of query head matrices - each (seq_len,d_head) - assigned on calculation
Kh_ - Key heads / Vector of key head matrices - each (seq_len,d_head) - assigned on calculation
Vh_ - Value heads / Vector of value head matrices - each (seq_len,d_head) - assigned on calculation
scores_ - Attention scores / Vector of softmaxed attention score matrices - each (seq_len,seq_len) - assigned on calculation
Ah_ - Attention head outputs / Vector of attention head output matrices - each (seq_len,d_head) - assigned on calculation
logits_ - Attention logits / Vector of pre-softmax attention score matrices - each (seq_len,seq_len) - assigned on calculation
softmax_out_ - Softmaxed attention scores / Vector of softmaxed attention score matrices - each (seq_len,seq_len) - assigned on calculation
Matrices used in feed forward
W1_ - First Weight Matrix / First Linear transformation - (d_model,d_ff)(8,16)
W2_ - Second weight matrix / Second linear transformation - (d_ff,d_model)(16,8)
dW1_ - First weight gradient / Gradient of first weight matrix - (d_model,d_ff)(8,16)
dW2_ - Second weight gradient / Gradient of second weight matrix - (d_ff,d_model)(16,8)
x_ - input matrix / stores input for backward pass - (seq_len,d_model)(4,8)
h_ - hidden matrix / intermediate hidden representation - (seq_len,d_ff) (4,16)
b1_ - First Bias / bias for first linear transformation - (1,d_ff) (1,16)
b2_ -Second bias / bias for second linear transformation - (1,d_model) (1,8)
db1_ -First bias gradient / Gradient of first bias - (1,d_ff) (1,16)
db2_ -Second bias gradient / Gradient of second bias - (1,d_model) (1,8)
Matrices used in layer normalisation
gamma_ - gamma / learnable scale parameter - (1,d_model)
beta_ - beta / learnable shift parameter - (1,d_model)
dgamma_ - gamma gradient / gradient for gamma - (1,d_model)
dbeta_ - beta gradient / gradient for beta - (1,d_model)
x_hat_ - Normalised input / normalised input for backwards pass - (seq_len,d_model) - not initialised created first time calculation is done and assigned to variable
Matrices used in transformer block
x_ - Input Matrix / stores input for backward pass - (seq_len,d_model)
y_ - Intermediate output / output after attention + residual - (seq_len,d_model)
z_ - Final output / final output after FFN + residual - (seq_len,d_model)
Values used in the code for sizes and what I set them to
d_model: model dimension - think size - 8
d_ff: feed forward dimension - 16
d_head: head dimension (d_model / num_heads) - 4
seq_length - 4
num_heads - Number of attention heads -2
Final Thoughts
It's been on my bucket list for a while to build a whole transformer AI. Next step: take a break by doing something else, then find a way to apply it somewhere and start using it, which I think is where you would actually learn something…
But box ticked, I can build an AI….
Glossary
Logits: the raw outputs from a layer before the softmax function is applied; in this code they are the pre-softmax attention scores inside multi-headed attention.
K: the key, represents what each token offers for matching.
V: the value, the information that gets passed forward.
Q: the query, what is being looked for in the process.
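A tiny made-up example of the logits/softmax difference: a row of logits (2.0, 1.0, 0.1) becomes roughly (0.66, 0.24, 0.10) after softmax, i.e. a probability distribution that sums to 1. Using the Matrix class from the appendix:

// Made-up numbers: raw logits become a probability row after softmax.
Matrix logits(1, 3);
logits(0, 0) = 2.0f; logits(0, 1) = 1.0f; logits(0, 2) = 0.1f;
logits.softmax(); // roughly 0.66, 0.24, 0.10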
Code Appendix
Attention.h
#pragma once
#include "Matrix.h"
#include <vector>
class MultiHeadAttention
{
public:
MultiHeadAttention(int d_model, int num_heads);
Matrix forward(const Matrix& x);
Matrix backward(const Matrix& d_out);
void step(float lr);
void zero_grad();
Matrix softmax_backward(const Matrix& d_out, const Matrix& softmax_out);
private:
int d_model_, num_heads_, d_head_;
Matrix Wq_, Wk_, Wv_, Wo_;
Matrix dWq_, dWk_, dWv_, dWo_;
Matrix x_, Q_, K_, V_, attn_out_,Ah_combined_;
std::vector<Matrix> Qh_, Kh_, Vh_, scores_, Ah_,logits_,softmax_out_;
float scale_;
};
Attention.cpp
#include "Attention.h"
#include <algorithm>
static std::vector<Matrix> split(const Matrix& x, int num_heads)
{
int seq_len = x.rows(), d_model = x.cols(), d_head = d_model / num_heads;
std::vector<Matrix> heads(num_heads, Matrix(seq_len, d_head));
for (int h = 0; h < num_heads; ++h)
{
for (int i = 0; i < seq_len; ++i)
{
for (int j = 0; j < d_head; ++j)
{
heads[h](i, j) = x(i, h * d_head + j);
}
}
}
return heads;
}
static Matrix combine(const std::vector<Matrix>& heads)
{
int num_heads = heads.size(), seq_len = heads[0].rows(), d_head = heads[0].cols();
Matrix combined(seq_len, num_heads * d_head);
for (int h = 0; h < num_heads; ++h)
{
for (int i = 0; i < seq_len; ++i)
{
for (int j = 0; j < d_head; ++j)
{
combined(i, h * d_head + j) = heads[h](i, j);
}
}
}
return combined;
}
MultiHeadAttention::MultiHeadAttention(int d_model, int num_heads):
d_model_(d_model), num_heads_(num_heads),d_head_(d_model/num_heads),
Wq_(d_model,d_model),Wk_(d_model, d_model),Wv_(d_model,d_model),Wo_(d_model, d_model),
dWq_(d_model, d_model),dWk_(d_model, d_model),dWv_(d_model, d_model),dWo_(d_model, d_model),
scale_(1.0f/std::sqrt(d_head_))
{
//Xavier initialisation
float scale = std::sqrt(2.0f / static_cast<float>(d_model + d_model));
Wq_.fill_random(-scale, scale);
Wk_.fill_random(-scale, scale);
Wv_.fill_random(-scale, scale);
Wo_.fill_random(-scale, scale);
//Wq_ - Query weight matrix / Projects input to query space for attention - (d_model,d_model) -CHECKED
//Wk_ - Key weight matrix / Projects input to key space for attention - (d_model,d_model) -CHECKED
//Wv_ - Value weight matrix / Projects input to value space for attention - (d_model,d_model) -CHECKED
//Wo_ - Output weight matrix / Projects the combined attention heads back to model dimension - (d_model,d_model) -CHECKED
//dWq_ - Query Weight Gradient / Gradient of query weight matrix - (d_model,d_model) -CHECKED
//dWk_ - Gradient of key weight matrix / Gradient of key weight matrix - (d_model,d_model) -CHECKED
//dWv_ -Value weight gradient / Gradient of weight value matrix - (d_model,d_model) -CHECKED
//dWo_ -Output weight gradient / Gradient of output weight matrix - (d_model,d_model) -CHECKED
//x_ - input matrix / stores input for backward pass - (seq_len,d_model) -assigned on calculation
//Q_ - Query Matrix / Query projections before splitting - (seq_len,d_model) -assigned on calculation
//K_ - Key Matrix / Key projections before splitting - (seq_len,d_model) -assigned on calculation
//V_ - Value Matrix / Value projections before splitting - (seq_len,d_model) -assigned on calculation
//attn_out_ - Attention output / Final attention output - (seq_len,d_model) -assigned on calculation
//Ah_combined_ - Combined attention heads - (seq_len,d_model) -assigned on calculation
//Qh_ - Query heads / Vector of query head matrices - Each (seq_len,d_head) -assigned on calculation
//Kh_ - Key heads / Vector of key head matrices - Each (seq_len,d_head) -assigned on calculation
//Vh_ - Value heads / Vector of value head matrices - Each (seq_len,d_head) -assigned on calculation
//scores_ - Attention scores / Vector of attention score matrices - Each (seq_len,seq_len); -assigned on calculation
//Ah_ - Vector of attention head outputs / vector of attention head outputs - Each (seq_len,d_head) -assigned on calculation
//logits_ - Vector of attention logits / Vector of attention logits - Each (seq_len,seq_len) -assigned on calculation
//softmax_out_ - Vector of softmax attention scores / Vector of softmaxed attention scores - Each (seq_len,seq_len) -assigned on calculation
//d_model: model dimension - think size - 8
//d_ff: feed forward dimension - 16
//d_head head dimension
//seq_length - 4
//num_heads - Number of attention heads -2
}
Matrix MultiHeadAttention::forward(const Matrix& x)
{
x_ = x;
Q_ = x * Wq_;
K_ = x * Wk_;
V_ = x * Wv_;
Qh_=split(Q_, num_heads_);
Kh_ = split(K_, num_heads_);
Vh_ = split(V_, num_heads_);
Matrix s;
scores_.clear();
Ah_.clear();
logits_.clear();
softmax_out_.clear();
for (int h = 0; h < num_heads_; ++h)
{
s = Qh_[h] * Kh_[h].transpose();
//apply causal mask
for (size_t i = 0; i < s.rows(); ++i)
{
for (size_t j = i + 1; j < s.cols(); ++j)
{
s(i, j) = -1e9f;
}
}
for (size_t i = 0; i < s.rows(); ++i)
{
for (size_t j = 0; j < s.cols(); ++j)
{
s(i, j) *= scale_;
}
}
logits_.push_back(s);
Matrix softmax = s;
softmax.softmax();
softmax_out_.push_back(softmax);
scores_.push_back(softmax);
Ah_.push_back(softmax* Vh_[h]);
}
Ah_combined_ = combine(Ah_);
attn_out_ = Ah_combined_ * Wo_;
return attn_out_;
}
Matrix MultiHeadAttention::backward(const Matrix& d_out)
{
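//walk the forward pass in reverse: undo the output projection, then the heads and softmax, then the Q, K, V projections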
Matrix d_attn_out = d_out * Wo_.transpose();
std::vector<Matrix> dAh = split(d_attn_out, num_heads_);
std::vector<Matrix> dQh(num_heads_, Matrix(Qh_[0].rows(), Qh_[0].cols()));
std::vector<Matrix> dKh(num_heads_, Matrix(Kh_[0].rows(), Kh_[0].cols()));
std::vector<Matrix> dVh(num_heads_, Matrix(Vh_[0].rows(), Vh_[0].cols()));
float sum;
//precompute all softmax backward passes
std::vector<Matrix> d_softmax(num_heads_);
for (int h = 0; h < num_heads_; ++h)
{
d_softmax[h] = softmax_backward(dAh[h] * Vh_[h].transpose(), softmax_out_[h]);
}
//compute gradients for Q, K, V
for (int h = 0; h < num_heads_; ++h)
{
dQh[h] = d_softmax[h].scale_multiplication(Kh_[h], scale_);
dKh[h]=d_softmax[h].transpose().scale_multiplication(Qh_[h], scale_);
dVh[h] = scores_[h].transpose() * dAh[h];
}
//combine gradients
Matrix dQ = combine(dQh);
Matrix dK = combine(dKh);
Matrix dV = combine(dVh);
//gradient for Wq (cache friendly accumulation)
float scale = 1.0f / static_cast<float>(x_.rows());
for (size_t k = 0; k < x_.cols(); ++k)
{
for(size_t j=0;j<dWq_.cols();++j)
{
sum = 0.0f;
for (size_t i = 0; i < x_.rows(); ++i)
{
sum += x_(i, k) * dQ(i, j);
}
dWq_(k, j)+=sum* scale;
}
}
//gradient for Wk
for (size_t k = 0; k < x_.cols(); ++k)
{
for (size_t j = 0; j < dWk_.cols(); ++j)
{
sum = 0.0f;
for (size_t i = 0; i < x_.rows(); ++i)
{
sum += x_(i, k) * dK(i, j);
}
dWk_(k, j) += sum * scale;
}
}
//gradient for Wv
for (size_t k = 0; k < x_.cols(); ++k)
{
for (size_t j = 0; j < dWv_.cols(); ++j)
{
sum = 0.0f;
for (size_t i = 0; i < x_.rows(); ++i)
{
sum += x_(i, k) * dV(i, j);
}
dWv_(k, j) += sum * scale;
}
}
dWo_ = dWo_ + Ah_combined_.transpose().scale_multiplication(d_out, scale);
Matrix d_x = dQ * Wq_.transpose() + dK * Wk_.transpose() + dV * Wv_.transpose();
return d_x;
}
void MultiHeadAttention::step(float lr)
{
for (size_t i = 0; i < Wq_.rows(); ++i)
{
for (size_t j = 0; j < Wq_.cols(); ++j)
{
Wq_(i, j) -= lr * dWq_(i, j);
Wk_(i,j) -= lr * dWk_(i, j);
Wv_(i, j) -= lr * dWv_(i, j);
Wo_(i, j) -= lr * dWo_(i, j);
}
}
}
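//softmax backward, row by row: dL/dlogit_j = sum_k dL/dsoftmax_k * softmax_k * ((j == k) - softmax_j)
//i.e. the incoming gradient pushed back through the softmax Jacobian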
Matrix MultiHeadAttention::softmax_backward(const Matrix& d_out, const Matrix& softmax_out)
{
Matrix dx(softmax_out.rows(), softmax_out.cols());
float sum;
for (size_t i = 0; i < softmax_out.rows(); ++i)
{
for (size_t j = 0; j < softmax_out.cols(); ++j)
{
sum = 0.0f;
for (size_t k = 0; k < softmax_out.cols(); ++k)
{
sum += d_out(i, k) * softmax_out(i, k) * ((j == k) - softmax_out(i, j));
}
dx(i, j) = sum;
}
}
return dx;
}
void MultiHeadAttention::zero_grad()
{
dWq_.zero();
dWk_.zero();
dWv_.zero();
dWo_.zero();
}
feedforward.h
#pragma once
#include "Matrix.h"
class feedforward
{
public:
feedforward(int d_model, int d_ff);
Matrix forward(const Matrix& x);
Matrix backward(const Matrix& d_out);
void step(float lr);
void zero_grad();
private:
Matrix W1_, W2_, dW1_, dW2_;
Matrix x_, h_;
Matrix b1_, b2_;//bias terms
Matrix db1_, db2_;
};
feedforward.cpp
#include "feedforward.h"
#include <algorithm>
feedforward::feedforward(int d_model, int d_ff)
: W1_(d_model, d_ff), W2_(d_ff, d_model), dW1_(d_model, d_ff),dW2_(d_ff, d_model),
b1_(1, d_ff),b2_(1, d_model),db1_(1, d_ff),db2_(1, d_model)
{
float scale1 = std::sqrt(2.0f / (d_model + d_ff));
W1_.fill_random(-scale1, scale1);
W2_.fill_random(-scale1, scale1);
b1_.fill_random(0.0, 0.0);
b2_.fill_random(0.0, 0.0);
//W1_ - First Weight Matrix / FIRST LINEAR TRANSFORMATION - (d_model,d_ff)(8,16)
//W2_ - Second weight matrix / SECOND LINEAR TRANSFORMATION -(d_ff,d_model)(16,8)
//dW1_ - First weight gradient / GRADIENT OF FIRST WEIGHT MATRIX - (d_model,d_ff)(8,16)
//dW2_ - Second weight gradient / GRADIENT OF SECOND WEIGHT MATRIX - (d_ff,d_model)(16,8)
//x_ - input matrix / stores input for backward pass - (seq_len,d_model)(4,8)
//h_ - hidden matrix / intermediate hidden representation - (seq_len,d_ff) (4,16)
//b1_ - First Bias / bias for first linear transformation - (1,d_ff) (1,16)
//b2_ -Second bias / bias for second linear transformation - (1,d_model) (1,8)
//db1_ -First bias gradient / Gradient of first bias - (1,d_ff) (1,16)
//db2_ -Second bias gradient / Gradient of second bias - (1,d_model) (1,8)
//d_model: model dimension - think size - 8
//d_ff: feed forward dimension - 16
//d_head: head dimension - 4
//seq_length - 4
//num_heads - Number of attention heads -2
}
void feedforward::zero_grad()
{
dW1_.zero();
dW2_.zero();
db1_.zero();
db2_.zero();
}
Matrix feedforward::forward(const Matrix& x)
{
x_ = x;
h_ = x * W1_;
h_.broadcast_add(b1_);
for (size_t i = 0; i < h_.rows(); ++i)
{
for (size_t j = 0; j < h_.cols(); ++j)
{
h_(i, j) = std::max(0.0f, h_(i, j));
}
}
Matrix out = h_ *W2_;
out.broadcast_add(b2_);
return out;
}
Matrix feedforward::backward(const Matrix& d_out)
{
float scale= 1.0f / d_out.rows();
float sum;
Matrix d_h = d_out * W2_.transpose();
for (size_t i = 0; i < d_h.rows(); ++i)
{
for (size_t j = 0; j < d_h.cols(); ++j)
{
d_h(i, j) *= (h_(i, j) > 0) ? 1.0f : 0.0f;
}
}
dW2_ = dW2_ + h_.transpose().scale_multiplication(d_out, scale);
for (int j = 0; j < db2_.cols(); ++j)
{
sum = 0.0f;
for (int i = 0; i < d_out.rows(); ++i)
{
sum += d_out(i,j);
}
db2_(0, j) += sum * scale;
}
dW1_= dW1_ + x_.transpose().scale_multiplication(d_h, scale);
for (int j = 0; j < db1_.cols(); ++j)
{
sum = 0.0f;
for (int i = 0; i < d_h.rows(); ++i)
{
sum += d_h(i, j);
}
db1_(0, j) += sum * scale;
}
Matrix d_x = d_h * W1_.transpose();
return d_x;
}
void feedforward::step(float lr)
{
for (size_t i = 0; i < W1_.rows(); ++i)
{
for (size_t j = 0; j < W1_.cols(); ++j)
{
W1_(i, j) -= lr * dW1_(i, j);
}
}
for (size_t i = 0; i < W2_.rows(); ++i)
{
for (size_t j = 0; j < W2_.cols(); ++j)
{
W2_(i, j) -= lr * dW2_(i, j);
}
}
for (size_t j = 0; j < b1_.cols(); ++j)
{
b1_(0, j) -= lr * db1_(0, j);
}
for (size_t j = 0; j < b2_.cols(); ++j)
{
b2_(0, j) -= lr * db2_(0, j);
}
}
Layer norm.h
#pragma once
#include "matrix.h"
class Layer_norm
{
public:
Layer_norm(int dim);
Matrix forward(const Matrix& x);
Matrix backward(const Matrix& d_out);
void step(float lr);
void zero_grad();
private:
int dim_;
Matrix gamma_, beta_;
Matrix dgamma_, dbeta_;
std::vector<float> mean_, variance_;
Matrix x_hat_;
};
Layer norm.cpp
#include "Layer_norm.h"
#include <cmath>
Layer_norm::Layer_norm(int dim) :dim_(dim), gamma_(1, dim), beta_(1, dim), dgamma_(1, dim), dbeta_(1, dim)
{
gamma_.fill_random(1.0, 1.0);
beta_.fill_random(0.0, 0.0);
//gamma_ - gamma / learnable scale parameter - (1,d_model)
//beta_ - beta / learnable shift parameter - (1,d_model)
//dgamma_ - gamma gradient / gradient for gamma - (1,d_model)
//dbeta_ - beta gradient / gradient for beta - (1,d_model)
//x_hat_ - Normalised input / normalised input for backwards pass - (seq_len,d_model) - not initialised, created first time calculation is done and assigned to variable
//d_model: model dimension - think size - 8
//d_ff: feed forward dimension - 16
//d_head: head dimension - 4
//seq_length - 4
//num_heads - Number of attention heads -2
}
Matrix Layer_norm::forward(const Matrix& x)
{
int rows = x.rows();
int cols = x.cols();
assert(cols == dim_);
mean_.resize(rows);
variance_.resize(rows);
x_hat_ = Matrix(rows, cols);
float m;
float v;
float diff;
for (int i = 0; i < rows; ++i)
{
m = 0.0f;
for (int j = 0; j < cols; ++j)
{
m += x(i, j);
}
m /= cols;
mean_[i]=m;
v = 0.0f;
for (int j = 0; j < cols; j++)
{
diff = (x(i, j) - m);
v += diff * diff;
}
v /= cols;
variance_[i] = v;
float inv_std = 1.0f / std::sqrt(v + 1e-6f);
for (int j = 0; j < cols; ++j)
{
x_hat_(i, j) = (x(i, j) - m) * inv_std;
}
}
Matrix out(rows, cols);
for (int i = 0; i < rows; ++i)
{
for (int j = 0; j < cols; ++j)
{
out(i, j) = gamma_(0, j) * x_hat_(i, j) + beta_(0, j);
}
}
return out;
}
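//layer norm backward, per row: dx = inv_std * (dxhat - mean(dxhat) - x_hat * mean(dxhat * x_hat))
//where dxhat = d_out * gamma; dgamma and dbeta simply accumulate over the rows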
Matrix Layer_norm::backward(const Matrix& d_out)
{
int rows = d_out.rows();
int cols = d_out.cols();
float sum1;
float sum2;
float dxhat;
float inv_std;
Matrix dx(rows, cols);
for (int i = 0; i < rows; ++i)
{
inv_std = 1.0f / std::sqrt(variance_[i] + 1e-6f);
sum1 = 0.0f;
sum2 = 0.0f;
for (int j = 0; j < cols; ++j)
{
dxhat = d_out(i, j) * gamma_(0, j);
sum1 += dxhat;
sum2 += dxhat * x_hat_(i, j);
}
for (int j = 0; j < cols; ++j)
{
dxhat = d_out(i, j) * gamma_(0, j);
dx(i, j) = inv_std * (dxhat - sum1 / cols - x_hat_(i, j) * sum2 / cols);
}
for (int j = 0; j < cols; ++j)
{
dgamma_(0, j) += d_out(i, j) * x_hat_(i, j);
dbeta_(0, j) += d_out(i, j);
}
}
return dx;
}
void Layer_norm::step(float lr)
{
for (int j = 0; j < dim_; ++j)
{
gamma_(0, j) -= lr * dgamma_(0, j);
beta_(0, j) -= lr * dbeta_(0, j);
}
}
void Layer_norm::zero_grad()
{
dgamma_.zero();
dbeta_.zero();
}
Matrix.h
#pragma once
#include <vector>
#include <cstddef>
#include <cmath>
#include <iostream>
#include <algorithm>
#include <cassert>
#include <random>
#include <stdexcept>
class Matrix
{
public:
Matrix();//default constructor
Matrix(size_t rows, size_t cols);
Matrix(const Matrix& other);//copy constructor
float& operator()(size_t i, size_t j);
const float& operator()(size_t i, size_t j) const;
size_t rows() const;
size_t cols() const;
Matrix& operator=(const Matrix& other);
Matrix operator*(const Matrix& other) const;
Matrix operator*(float other);
Matrix operator+(const Matrix& other) const;
Matrix operator-(const Matrix& other) const;
void print() const;
void softmax();
void layer_norm();
void fill_random(float min, float max);
Matrix transpose() const;
Matrix scale_multiplication(const Matrix& other,float weight) const;
void zero();
void broadcast_add(const Matrix& bias);
private:
size_t rows_, cols_;
std::vector<float> data_;
static constexpr float eps_ = 1e-6f;
};
Matrix.cpp
#include "Matrix.h"
//default constructor
Matrix::Matrix() :rows_(0), cols_(0) {}
//constructor with dimensions
Matrix::Matrix(size_t rows, size_t cols) : rows_(rows), cols_(cols), data_(rows* cols, 0.0f){}
//copy constructor
Matrix::Matrix(const Matrix& other) :rows_(other.rows_), cols_(other.cols_), data_(other.data_) {}
//Assignment constructor
Matrix& Matrix::operator=(const Matrix& other)
{
if (this != &other)
{
rows_ = other.rows_;
cols_ = other.cols_;
data_ = other.data_;
}
return *this;
}
float& Matrix::operator()(size_t i, size_t j)
{
#ifndef NDEBUG
assert(i < rows_ && j < cols_);
#endif
return data_[i * cols_ + j];
}
const float& Matrix::operator()(size_t i, size_t j) const
{
#ifndef NDEBUG
assert(i < rows_ && j < cols_);
#endif
return data_[i * cols_ + j];
}
size_t Matrix::rows() const
{ return rows_; }
size_t Matrix::cols() const
{ return cols_; }
Matrix Matrix::operator*(const Matrix& other) const
{
assert(cols_ == other.rows_);
Matrix result(rows_, other.cols_);
float sum;
for (size_t i = 0; i < rows_; ++i)
{
for (size_t j = 0; j < other.cols_; ++j)
{
sum = 0.0;
for (size_t k = 0; k < cols_; ++k)
{
sum += (*this)(i, k) * other(k, j);
}
result(i, j) = sum;
}
}
return result;
}
Matrix Matrix::scale_multiplication(const Matrix& other, float weight) const
{
assert(cols_ == other.rows_);
Matrix result(rows_, other.cols_);
float sum;
for (size_t i = 0; i < rows_; ++i)
{
for (size_t j = 0; j < other.cols_; ++j)
{
sum = 0.0;
for (size_t k = 0; k < cols_; ++k)
{
sum += (*this)(i, k) * other(k, j);
}
result(i, j) = sum* weight;
}
}
return result;
}
Matrix Matrix::operator*(float other)
{
Matrix result(rows_, cols_);
for (size_t i = 0; i < data_.size(); ++i)
{
result.data_[i] = data_[i] * other;
}
return result;
}
Matrix Matrix::operator+(const Matrix& other) const
{
assert(rows_ == other.rows_ && cols_ == other.cols_);
Matrix result(rows_, cols_);
for (size_t i = 0; i < data_.size(); ++i)
{
result.data_[i] = data_[i] + other.data_[i];
}
return result;
}
Matrix Matrix::operator-(const Matrix& other) const
{
assert(rows_ == other.rows_ && cols_==other.cols_);
Matrix result(rows_, cols_);
for (size_t i = 0; i < data_.size(); ++i)
{
result.data_[i] = data_[i] - other.data_[i];
}
return result;
}
void Matrix::print() const
{
for (size_t i = 0; i < rows_; ++i)
{
for (size_t j = 0; j < cols_; ++j)
{
std::cout << (*this)(i, j) << " ";
}
std::cout << std::endl;
}
}
void Matrix::softmax()
{
float max_val;
int helper;
int helper2=0;
float sum;
for (size_t i = 0; i < rows_; ++i)
{
helper = helper2;
max_val = data_[helper];
for (size_t j = 1; j < cols_; ++j)
{
helper += 1;
if (data_[helper] > max_val)
{
max_val = data_[helper];
}
}
sum = 0.0f;
helper = helper2;
for (size_t j = 0; j < cols_; ++j)
{
data_[helper] = std::exp(data_[helper] - max_val);
sum += data_[helper];
helper++;
}
helper = helper2;
for (size_t j = 0; j < cols_; ++j)
{
data_[helper] /= sum;
helper++;
}
helper2 = helper;
}
}
void Matrix::layer_norm()
{
float mean;
float variance;
int helper;
int helper2=0;
float dif;
for (size_t i = 0; i < rows_; ++i)
{
mean = 0.0f;
helper = helper2;
for (size_t j = 0; j < cols_; ++j)
{
mean += data_[helper];
helper++;
}
mean /= cols_;
variance = 0.0f;
helper = helper2;
for (size_t j = 0; j < cols_; ++j)
{
dif = (data_[helper] - mean);
variance += dif * dif;
helper++;
}
variance /= cols_;
dif = 1.0f / std::sqrt(variance + 1e-6f);
helper = helper2;
for (size_t j = 0; j < cols_; ++j)
{
data_[helper] = (data_[helper] - mean) * dif;
helper++;
}
helper2 = helper;
}
}
void Matrix::fill_random(float min, float max)
{
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_real_distribution<float> dist(min, max);
for (auto& x : data_)x = dist(gen);
}
Matrix Matrix::transpose() const
{
Matrix result(cols_, rows_);
for (size_t i = 0; i < rows_; ++i)
{
for (size_t j = 0; j < cols_; ++j)
{
result(j, i) = (*this)(i, j);
}
}
return result;
}
void Matrix::zero()
{
std::fill(data_.begin(), data_.end(), 0.0f);
}
void Matrix::broadcast_add(const Matrix& bias)
{
assert(cols_ == bias.cols() && bias.rows() == 1);
for (size_t i = 0; i < rows_; ++i)
{
for (size_t j = 0; j < cols_; ++j)
{
(*this)(i, j) += bias(0, j);
}
}
}
Transformer Object.h
#pragma once
#include "attention.h"
#include "feedforward.h"
#include "layer_norm.h"
class TransformerBlock
{
public:
TransformerBlock(int d_model, int num_heads, int d_ff);
Matrix forward(const Matrix& x);
Matrix backward(const Matrix& d_out);
void step(float lr);
void zero_grad();
private:
MultiHeadAttention mha_;
feedforward ff_;
Layer_norm ln1_, ln2_;
Matrix x_, y_, z_;
};
Transformer object.cpp
#include "transformer_object.h"
TransformerBlock::TransformerBlock(int d_model, int num_heads, int d_ff) : mha_(d_model,num_heads),ff_(d_model, d_ff),
ln1_(d_model),ln2_(d_model)
{}
void TransformerBlock::zero_grad()
{
mha_.zero_grad();
ff_.zero_grad();
ln1_.zero_grad();
ln2_.zero_grad();
//x_ - Input Matrix / stores input for backward pass - (seq_len,d_model)
//y_ - Intermediate output / output after attention + residual - (seq_len,d_model)
//z_ - Final output / final output after FFN + residual - (seq_len,d_model)
}
Matrix TransformerBlock::forward(const Matrix& x)
{
x_= x;
y_ = ln1_.forward(mha_.forward(x) + x);
z_ = ln2_.forward(ff_.forward(y_) + y_);
return z_;
}
Matrix TransformerBlock::backward(const Matrix& d_out)
{
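//residual connections mean the gradient from each sub layer and the gradient from the identity path get added back together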
Matrix d_z = ln2_.backward(d_out);
Matrix d_ff_out = d_z;
Matrix d_y = ff_.backward(d_ff_out);
d_y = d_y + d_z;
Matrix d_mha_out = ln1_.backward(d_y);
Matrix d_x = mha_.backward(d_mha_out);
d_x = d_x + d_mha_out;
return d_x;
}
void TransformerBlock::step(float lr)
{
mha_.step(lr);
ff_.step(lr);
ln1_.step(lr);
ln2_.step(lr);
}
Main.cpp
#include <iostream>
#include "transformer_object.h"
//MULTI HEADED ATTENTION
//project input into query,key,value spaces
//computes attention scores and applies the to value
//combines attention heads and projects back to model dimension
//FEEDFORWARD
//applies two linear transformations with ReLU activation in between
//First expands dimension to d_ff, then contracts back to d_model
//LAYERNORM
//Normalises activation to have zero mean and unit variance
//applies learnable scale and shift parameters
//TRANSFORMER BLOCK
//combines attention and feed forward with residual connections
//applies layer normalisation after each sub layer
int main()
{
size_t seq_len = 4;
size_t d_model = 8;
int num_heads = 2;
int d_ff = 16;
float learning_rate = 0.001f;
int epochs = 100;
//d_model: model dimension - think size
//d_ff: feed forward dimension
//d_head head dimension
//seq_length - 4
//num_heads - Number of attention heads
TransformerBlock block(d_model, num_heads, d_ff);
Matrix input(seq_len, d_model);
Matrix target(seq_len, d_model);
input.fill_random(-0.5f, 0.5f);
target.fill_random(-0.1f, 0.1f);
float loss = 0.0f;
for (int epoch = 0; epoch < epochs; ++epoch)
{
Matrix output = block.forward(input);
//mean squared error against the random target; grad holds dLoss/dOutput
Matrix grad = output - target;
loss = 0.0f;
for (size_t i = 0; i < grad.rows(); ++i)
{
for (size_t j = 0; j < grad.cols(); ++j)
{
loss += grad(i, j) * grad(i, j);
}
}
loss /= static_cast<float>(seq_len * d_model);
grad = grad * (2.0f / static_cast<float>(seq_len * d_model));
//backward pass
block.zero_grad();
Matrix d_input = block.backward(grad);
block.step(learning_rate);
if (epoch % 10 == 0)
{
std::cout << "Epoch " << epoch << ", Loss: " << loss << std::endl;
}
}
return 0;
}
Academic Appendix:
Attention Is All You Need (Vaswani et al., 2017)
Layer Normalization (Ba et al., 2016)
Deep Residual Learning for Image Recognition (He et al., 2016)
Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2015)
Self-Attention with Relative Position Representations (Shaw et al., 2018)
Adam: A Method for Stochastic Optimization (Kingma and Ba, 2015) - not in code
Fixup Initialization: Residual Learning Without Normalization (Zhang et al., 2019) - explains the Xavier initialisation used in the code
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019)
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Dai et al., 2019)
Reformer: The Efficient Transformer (Kitaev et al., 2020)
The Matrix Calculus You Need for Deep Learning (Petersen and Voight, 2012)