So I Took an AI Course to Understand ChatGPT
- Anne Werkmeister
- Jul 17
- 4 min read

Let me start with a professional confession: I co-run a business with someone who holds a PhD in Data Science. So when he starts talking about vector spaces and embeddings, I nod like I totally get it…
In fact, I’ve been using ChatGPT every day: writing blog posts, planning features, troubleshooting, drafting Jira tickets, you name it. But I had almost zero idea how it actually worked. 100% black box.
Lately, we’ve seen companies leaning hard on GPT and other LLMs, calling them “mastermind” or “creative spirit” tech.
Knowing what I know now, I don’t think that’s the smartest strategy. Understanding how these models work reveals their limits, and why treating them like oracles can backfire.
So I joined Prof. Ali Hirsa’s NLP course at Columbia to unpack the engine behind what I use every day.
This post is a no-pretension summary: just enough to make you sound smart at your next pitch, client meeting, or dinner party.
And yeah, I’ll help you look as sharp as I feel now (haha).
What Even Is a Language Model?
A language model predicts the next word. That’s it. Like:
“The cat sat on the…” → obviously “mat”, not “carburettor”.
Older models (n‑grams) only look 2–3 words back. Modern ones, like GPT, can reference sentences or whole paragraphs for context, which is where things get interesting.
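To make “predict the next word” concrete, here’s a toy sketch in Python (my own illustration, not from the course): a bigram model that simply counts which word most often follows each word in a tiny made-up corpus, then predicts the most frequent follower.

```python
from collections import Counter, defaultdict

# Toy corpus and a minimal bigram "language model": for each word,
# count which word follows it most often, then predict that word.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

follow_counts = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the most frequent follower of `word` in the toy corpus."""
    return follow_counts[word].most_common(1)[0][0]

print(predict_next("sat"))  # -> 'on'
print(predict_next("the"))  # -> 'cat' (here several words tie; a bigger corpus would break the tie)
```

Real LLMs do something far fancier than counting pairs, but the spirit is the same: learn from text, then guess the most likely continuation.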
Step 1: From Words to Numbers
You can’t feed words straight into a model; they’re not numbers. Early tricks used one-hot vectors: huge binary arrays with a single 1. The problem? Zero semantic meaning: cat and kitten end up as unrelated as cat and astrophysics in that format.
Then came word embeddings: dense, continuous vectors, like maps, where semantically related words are close together. Models learn them by analyzing co-occurrence in massive text corpora.
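Here’s a quick way to see the difference (a toy sketch; the “embedding” numbers are made up for illustration, not real Word2Vec output): one-hot vectors give every word pair a similarity of zero, while dense vectors let related words land close together.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot: each word gets its own axis, so every pair looks equally unrelated.
vocab = ["cat", "kitten", "astrophysics"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["cat"], one_hot["kitten"]))        # 0.0
print(cosine(one_hot["cat"], one_hot["astrophysics"]))  # 0.0

# Dense embeddings (invented 4-d vectors for illustration): related words
# point in similar directions, so their similarity is much higher.
emb = {
    "cat":          np.array([0.8, 0.1, 0.6, 0.1]),
    "kitten":       np.array([0.7, 0.2, 0.7, 0.1]),
    "astrophysics": np.array([0.0, 0.9, 0.1, 0.8]),
}
print(cosine(emb["cat"], emb["kitten"]))        # ~0.99 (close)
print(cosine(emb["cat"], emb["astrophysics"]))  # ~0.19 (far apart)
```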
Step 2: The Neural Network Journey
The Columbia class touched on Transformers, but let’s go deeper.
Perceptron (1943–60s) – McCulloch & Pitts invented the concept of artificial neurons. Rosenblatt later built the perceptron, a single-layer net that could learn simple patterns.
Feedforward MLPs (1960s–’80s) – Multilayer perceptrons could handle non-linear data once backpropagation made them trainable. Early promise, but they quickly hit limitations without deeper architectures.
RNNs & LSTMs (1990s) – RNNs introduced recurrence for sequences, but their gradients vanished over long contexts. LSTMs fixed that with memory gates around 1997.
Seq2Seq + attention (2014–16) – Encoder-decoder LSTM nets tackled translation. Bahdanau’s attention let decoders access the full input sequence, not just a single bottleneck vector.
Transformers (2017) – “Attention Is All You Need” ripped out recurrence entirely. Instead, multi-head self-attention processes tokens in parallel, a massive boost in context visibility and training speed (there’s a small sketch of self-attention after this timeline).
BERT and GPT (2018+) – BERT used transformer encoders (masked-word training). GPT employed decoder-only models for next-word prediction: GPT‑2 made waves in 2019, GPT‑3 followed in 2020, and GPT‑4 and others today push scale further.
In short: neural nets evolved from basic statistical models to memory-powered RNNs to attention-led Transformers. That’s the core tech behind ChatGPT.
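If you’re curious what “self-attention” actually computes, here’s a stripped-down sketch (my own illustration: a single head, no learned Q/K/V projections or positional encodings, which real Transformers do have). Each token’s vector gets replaced by a weighted mix of every token’s vector, with the weights coming from a softmax over pairwise similarities.

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product self-attention (no learned weights,
    purely for illustration): every token mixes in information from every
    other token, weighted by how similar their vectors are."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)               # how much each token "attends" to each other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ X                           # weighted mix of all token vectors

# Four "tokens", each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = self_attention(tokens)
print(out.shape)  # (4, 8): every output token has "seen" the whole sequence at once
```

That “every token sees every other token at once” property is exactly what lets GPT keep a whole conversation in view instead of squeezing it through a bottleneck.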
Step 3: Training the Beast
These models don’t come cheap. According to Forbes:
GPT‑3 (2020) cost $2–4 million
GPT‑4 (2023) ballooned to $41–78 million, with total spend over $100 million
Google Gemini clocks in at $30–191 million just for the compute
That’s not pocket change, it’s deep pockets.
Step 4: Grammar Parsing & Dependencies
Columbia also covered grammar tools:
POS tagging
Dependency parsing (subject → verb → object structures)
These aren’t optional: they help structure language so the output comes out coherent and human-like. Useful for summarization, translation, and making the machine not sound like a drunk parrot.
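If you want to poke at these yourself, spaCy makes it easy. A minimal sketch (assuming spaCy and its small English model are installed):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sat on the mat.")

for token in doc:
    # part-of-speech tag, dependency label, and the word this token attaches to
    print(f"{token.text:<5} {token.pos_:<6} {token.dep_:<6} head={token.head.text}")
```

You should see “cat” tagged as a NOUN and marked as the subject (nsubj) of “sat”, which is exactly the subject → verb → object structure mentioned above.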
What I Really Learned
LLMs aren't magical sentient beings → they’re pattern predictors.
ChatGPT doesn’t “think” → it guesses word by word, but it’s damn good.
Word representation and attention mechanics are the critical breakthroughs.
I can now actually engage when my partner talks embeddings at dinner, and even question business decisions that hinge on “creative spirit AI.”
TL;DR: Dinner-Party One-Liner
“ChatGPT is just a massive autocomplete system, with transformer attention letting it see the whole conversation at once.”
Drop that line confidently. You’ll sound like you know what you’re talking about ;)
LLMs Can’t Replace Strategy
Reliance on GPT or other LLMs as “secret weapons” is risky, especially if you don’t grasp their mechanics. Knowing their speed bumps and blind spots means you can use them smarter, not blindly.
So yeah. I came for knowledge, stayed for sanity, and now I sound brilliant at work and dinner.
References
Columbia University AI Course – Prof. Ali Hirsa. Course slides: Overview and Evolution of Language Models. Topics: word embeddings, Skip-gram, CBOW, POS tagging, dependency parsing, arc-standard model, beam search.
Cabello, Adria (2023). “The Evolution of Language Models: A Journey Through Time.” https://medium.com/@adria.cabello/the-evolution-of-language-models-a-journey-through-time-3179f72ae7eb – Covers the history of neural networks from perceptrons to transformers.
Buchholz, Katharina (2024). “The Extreme Cost Of Training AI Models.” Forbes. https://www.forbes.com/sites/katharinabuchholz/2024/08/23/the-extreme-cost-of-training-ai-models/ – Includes estimates for GPT-3, GPT-4, and Google Gemini training costs.
Vaswani et al. (2017). “Attention Is All You Need.” NeurIPS. https://arxiv.org/abs/1706.03762 – Introduced the Transformer architecture, now the foundation of GPT models.
Mikolov et al. (2013). “Efficient Estimation of Word Representations in Vector Space.” Google Research. https://arxiv.org/abs/1301.3781 – Original paper behind Word2Vec (Skip-gram and CBOW).
OpenAI (2020–2023). Technical documentation and model overviews from GPT-2 through GPT-4. https://platform.openai.com/docs