Writing an LLM from scratch, part 22 – training our LLM

−

js8

It's based on a book https://www.manning.com/books/build-a-large-language-model-f..., is it a good book?

−

checker659

I have done a little bit of DL stuff (with keras) before this. I'm currently in the attention chapter. The book gives you the code, but I feel like there is very little in the way of building intuition. Thankfully, there are tons of videos online to help with that.

I think it is a great guide. An extended tutorial if you will (at least until this point in my reading). Also having the code right in front of you helps a lot. For example, I was under the impression that embedding vectors were static like in word2vec. Turns out, they are learnable parameters too. I wouldn't have been able to tell for sure if I didn't have the code right in front of me.

−

dvt

The book gives you the code, but I feel like there is very little in the way of building intuition.

There isn't really much intuition to begin with, and I don't really think building intuition will be useful, anyway. Even when looking at something as barebones as perceptrons, it's hard to really see "why" they work. Heck, even implementing a Markov chain from scratch (which can be done in an afternoon with no prior knowledge) can feel magical when it starts outputting semi-legible sentences.

It's like trying to build intuition when it comes to technical results like the Banach-Tarski paradox or Löb's theorem. Imo, understanding the math (which in the case of LLMs is actually quite simple) is orders of magnitude more valuable than "building intuition," whatever that might mean.

−

CamperBob2

Even when looking at something as barebones as perceptrons, it's hard to really see "why" they work.

Check out the Karpathy "Zero to Hero" videos, and try to follow along by building an MLP implementation in your own language of choice. He does a good job of building intuition because he doesn't skip much of anything.

−

checker659

Even when looking at something as barebones as perceptrons

I was thinking something like "it is trying to approximate a non-linear function" (which is what it is in the case of MLPs).

−

mettamage

Here's part 1^[1]. Since his archive goes by date, it makes it a bit easier to guestimate which part is made in which month.

[1] https://www.gilesthomas.com/2024/12/llm-from-scratch-1

−

3abiton

It's interesting 22 parts in under a year, seems like a fun up to date project. Karpathy did something very similar with nanochat (following nanogpt).

−

ziyunli

seems like you can filter by tag https://www.gilesthomas.com/llm-from-scratch

−

pppoe

Love this. This reminds me of LFS (Linux From Scratch) https://www.linuxfromscratch.org

Feeling nostalgic about the days building LFS in college.

Learning by building wouldn't help you remember all the details but many things would make more sense after going through the process step by step. And it's fun.

−

mrasong

The cost comparison between local RTX 3090 and cloud A100 clusters is useful, but I wonder if the author accounted for hidden overhead—like data transfer time for large datasets or the time spent debugging CUDA compatibility issues on local hardware.