Reference

Glossary

Plain-language definitions of the terms used across the curriculum. Each definition links to the article that introduces the concept, so you can drop in for the term and stay for the full explanation.

Organized by topic, not alphabetically. If you want alphabetical, use Cmd/Ctrl+F.

LLM fundamentals

LLM (Large Language Model). A neural network trained to predict the next token, scaled to the point where it generalizes to most natural language tasks. From your code's point of view, a stateless function: text in, text out. Introduced in article 01.

Token. The unit an LLM reads and writes. Roughly 4 characters of English text per token. Pricing and context windows are denominated in tokens. article 01.

Context window. The maximum number of tokens (input + output combined) the model can attend to in a single call. Modern models range from 8k to over 1M. article 01.

Tokenizer. The deterministic function that splits text into tokens. Different providers use different tokenizers. article 01.

Temperature. A sampling parameter from 0 to 2 that controls randomness. Higher temperatures pick less likely next tokens more often. Temperature 0 is the closest you get to deterministic, but it is not truly deterministic. article 01.

Sampling. The process of choosing the next token from the model's probability distribution. Temperature, top-p, and top-k are sampling parameters. article 01.

Hallucination. Confidently stated output that is factually wrong. LLMs hallucinate because they generate plausible text, not verified facts. The main defenses are retrieval (give the model the right context) and prompt design (let the model say "I do not know"). article 01.

TTFT (Time To First Token). The latency from sending the request to receiving the first token of the response. The user-perceived latency in streaming UIs. article 11.

TPS (Tokens Per Second). The rate at which a model generates output tokens after the first one. Determines the speed of streaming UIs. article 11.

Prompting

Prompt. The text input you send to the model. The combination of system message, user messages, and any retrieved context. article 03.

System prompt. The high-priority instruction that sets the model's role and constraints. Stays constant across user turns in a conversation. article 03.

Structured outputs. Forcing the model to return JSON that conforms to a schema you defined. Removes the entire class of "the LLM returned malformed JSON" bugs. article 03.

Few-shot prompting. Including 2-5 example input/output pairs in the prompt to steer the model toward the format and tone you want. article 03.

Zero-shot prompting. Asking the model to do a task without examples. Works when the task is in the model's training distribution. article 03.

Prompt injection. An attack where untrusted input contains instructions that the model follows instead of your system prompt. The fix is delimiters, content quarantining, and never letting user content reach tool-calling decisions. article 01, article 03.

RAG (Retrieval-Augmented Generation)

RAG. A pattern where you retrieve relevant documents from a knowledge store and inject them into the prompt before the model answers. The most common way to ground an LLM in your own data. article 04.

Embedding. A fixed-length vector of floats (commonly 1,536 or 3,072 dimensions) that represents the meaning of a piece of text. Texts with similar meanings have similar embeddings. article 07.

Vector database. A database optimized for similarity search over embeddings. Postgres with pgvector counts. Specialized vector DBs (Pinecone, Weaviate, Qdrant) trade simplicity for scale. article 04.

Cosine similarity. A measure of vector similarity. Two embeddings with cosine similarity near 1 are semantically close; near 0, unrelated. The default scoring function for vector search. article 07.

Chunk / chunking. Splitting a document into smaller passages before embedding, because embeddings of long passages lose precision. The chunk size, overlap, and split strategy materially affect retrieval quality. article 05.

BM25. The classical keyword-based ranking function (Best Match 25). Built into Postgres, Elasticsearch, and every traditional search engine. Strong at exact-term matches that vector search misses. article 06.

Hybrid search. Running vector search and BM25 in parallel, then fusing the ranks. Beats either alone in most production benchmarks. article 06.

RRF (Reciprocal Rank Fusion). The standard fusion algorithm for hybrid search. Each candidate gets a score of 1 / (60 + rank) from each retriever, summed across retrievers. Robust, parameter-free. article 06.

Reranking. A second-pass scoring step that re-orders the top-N retrieved candidates using a more expensive cross-encoder model. Improves recall at the cost of latency. article 06.

Evals

Eval (evaluation). An automated test for AI quality. Typically a labeled dataset plus a scoring function. The closest analog in traditional software is integration tests. article 08.

Eval dataset. The set of input/expected-output pairs the eval runs against. Built from real user queries, edge cases, and bugs found in production. Grows over time. article 08.

LLM-as-judge. Using an LLM (usually a stronger one) to score the output of another LLM. Useful for free-form quality evaluation. Has known biases. article 09.

Position bias. A failure mode of LLM-as-judge where the judge favors whichever response is shown first (or last). Mitigated by randomizing position. article 09.

Verbosity bias. A failure mode of LLM-as-judge where the judge favors longer answers regardless of quality. Mitigated by explicit length-neutral rubrics. article 09.

Self-preference bias. A failure mode of LLM-as-judge where a judge consistently rates outputs from the same model family higher. Mitigated by using a different judge model. article 09.

Drift. A gradual decline in production quality, usually because the input distribution shifted (new user behaviors, new content) or the model's output changed (silent provider updates). Caught by re-running evals on a fresh sample. article 11.

Production AI

Prompt cache. A provider-side cache that reuses the model's intermediate computations across requests with shared prompt prefixes. Cuts input token cost by 50-90% for repeated system prompts. article 10.

Semantic cache. An application-side cache keyed by embedding similarity rather than exact string match, so "near duplicate" queries can reuse cached answers. Useful when traffic has paraphrases. article 10.

Embedding cache. An application-side cache of embeddings keyed by the source text's hash. Eliminates re-embedding the same text on re-ingest. article 10.

Observability. The practice of emitting telemetry (traces, metrics, logs) so you can debug production behavior. For AI, this includes per-request token counts, model versions, retrieval depth, and cost. article 11.

p50 / p95 / p99 latency. The latency at which 50%, 95%, or 99% of requests complete. p95 is the standard "are most users having a good time?" metric for production AI services. article 11.

Model routing. Sending easy requests to a small cheap model and hard requests to a large expensive model. Cuts cost without dropping quality. article 12.

Fine-tuning. Continuing to train a base model on your own data to specialize it. Rarely worth it for backend engineers in 2026; prompt engineering and RAG cover most use cases at a fraction of the operational cost.

Agents and protocols

Tool use / Function calling. A pattern where the model decides which of your functions to call and with what arguments. Lets the LLM take actions, not just produce text. (Article coming.)

Agent. A system where the LLM is in a control loop: it observes, decides, calls a tool, observes the result, and decides again, until it reaches a stop condition. (Article coming.)

MCP (Model Context Protocol). An open protocol for connecting LLMs to external data sources and tools through standardized servers. Anthropic-originated, now broadly adopted. MCP article 01.

Things this site does not cover yet

These terms come up in the field but are not in the current curriculum. They are on the roadmap.

Multi-modal: models that handle text + image (+ audio). GPT-4o and Claude can both do this.
RLHF (Reinforcement Learning from Human Feedback): how base LLMs become helpful chat models.
Distillation: training a small model to mimic a big one. A cost lever for high-volume tasks.
Speculative decoding: a serving-side optimization that lets a small model draft tokens for a big model to verify, cutting per-token latency.

Missing a term? Spot an inaccurate definition? Let me know and it gets fixed within a day.