2. Post-Training: Supervised Fine-Tuning


So far, we’ve looked at base models, which are just pre-trained text generators. But to make an actual assistant, you need post-training.

Base models hallucinate a lot → They generate text, but it’s not always useful.
Post-training fixes this by fine-tuning the model to respond better.
The good news? Post-training is way cheaper than pre-training (e.g., hours instead of months).

2.1. Conversations Data

Once the base model is trained on internet data, the next step is post-training. This is where we replace the internet dataset with human/assistant conversations to make the model more conversational and useful.

Pre-training takes months, but post-training is much faster. It can take as little as a few hours.
The training algorithm stays the same. We’re just fine-tuning the existing parameters on a new dataset.

To teach a model how to handle back-and-forth conversations, we use chat templates. These define the structure of a conversation and let the model know which part is user input and which part is an assistant response. You can read more about them here.

Example template:

<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|>
<|im_start|>user<|im_sep|>What is 4 + 4?<|im_end|>
<|im_start|>assistant<|im_sep|>4 + 4 = 8<|im_end|>

<|im_start|>, <|im_sep|>, and <|im_end|> are special tokens that help structure conversations.
The model didn’t see these tokens during pre-training. They’re introduced during post-training.
OpenAI has discussed fine-tuning LLMs for conversation in the InstructGPT paper.
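
As a rough sketch of what a chat template does, the Python below flattens a list of messages into the format shown above. The function name `render_chat` and the `add_generation_prompt` flag are illustrative assumptions, not any particular library's API. A real tokenizer would also map each special token to a single token ID instead of treating it as plain text.

```python
from typing import Dict, List

# Special tokens copied from the example template above.
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = False) -> str:
    """Flatten {"role", "content"} messages into one template string."""
    lines = [f"{IM_START}{m['role']}{IM_SEP}{m['content']}{IM_END}" for m in messages]
    if add_generation_prompt:
        # Leave an assistant turn open so the model writes the reply.
        lines.append(f"{IM_START}assistant{IM_SEP}")
    return "\n".join(lines)


conversation = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is 4 + 4?"},
    {"role": "assistant", "content": "4 + 4 = 8"},
]
rendered = render_chat(conversation)
print(rendered)
```

During training the whole conversation is rendered, as here. At inference time you would render only the turns so far and leave the assistant turn open, which is what the hypothetical `add_generation_prompt=True` does in this sketch.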

To visualize this, go to tiktokenizer.

One such post-training dataset is OASST1. Early post-training datasets were hand-curated by humans. Newer datasets like UltraChat consist of synthetic conversations generated by LLMs, allowing models to improve without as much human input. You can visualize these mostly synthetic datasets here.

2.2. Hallucinations, Tool Use, and Memory

One major issue with LLMs is hallucination, where the model confidently generates incorrect or made-up information.

Why does this happen?

The conversations used in post-training almost always show the assistant giving a confident answer, so the model learns that it must always give one too.
Even if the question doesn’t make sense, the model tries to generate a response instead of saying, “I don’t know.”

How Meta Deals with Hallucinations

Meta’s research on factuality (from their Llama 3 paper) describes a way to improve this:

Extract a snippet of training data.
Generate a factual question about it using Llama 3.
Have Llama 3 generate an answer.
Score the response against the original data.
If the answer is incorrect, add a training example in which the model declines to answer that question.

Essentially, this process teaches models to recognize their own knowledge limits.
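
The steps above can be sketched in Python roughly as follows. This is a toy outline, not Meta's actual pipeline: `ask`, `answer`, and `is_correct` are hypothetical callables standing in for model calls, the refusal wording is made up, and sampling several answers per question is an assumption of this sketch.

```python
from typing import Callable, Dict, List


def build_refusal_examples(
    snippets: List[str],
    ask: Callable[[str], str],               # hypothetical: snippet -> factual question
    answer: Callable[[str], str],            # hypothetical: question -> model's answer
    is_correct: Callable[[str, str], bool],  # hypothetical: (answer, snippet) -> correct?
    samples: int = 3,                        # assumption: sample a few answers per question
    refusal: str = "I'm sorry, I don't know.",
) -> List[Dict[str, str]]:
    """Collect questions the model gets wrong and pair each with a refusal."""
    examples = []
    for snippet in snippets:
        question = ask(snippet)
        answers = [answer(question) for _ in range(samples)]
        # If no sampled answer matches the snippet, treat the fact as unknown.
        if not any(is_correct(a, snippet) for a in answers):
            examples.append({"user": question, "assistant": refusal})
    return examples


# Toy stand-ins so the sketch runs without a real model.
snippets = ["The capital of France is Paris.", "The capital of Australia is Canberra."]
toy_knowledge = {"What is the capital of France?": "Paris"}


def toy_ask(snippet: str) -> str:
    subject = snippet.split(" is ")[0]  # e.g. "The capital of France"
    return "What is " + subject[0].lower() + subject[1:] + "?"


def toy_answer(question: str) -> str:
    return toy_knowledge.get(question, "Sydney")  # wrong guess when it doesn't know


def toy_is_correct(ans: str, snippet: str) -> bool:
    return ans in snippet


examples = build_refusal_examples(snippets, toy_ask, toy_answer, toy_is_correct)
print(examples)
```

The resulting examples would then be mixed into the post-training data, so the model sees refusals exactly where its own answers were unreliable.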

2.3. Using Tools to Reduce Hallucinations

One way to reduce hallucinations is to train models to use tools when they don’t know the answer. This approach follows the pattern:

<|im_start|>user<|im_sep|>Who is Orson Kovacs?<|im_end|>
<|im_start|>assistant<|im_sep|><SEARCH_START>Who is Orson Kovacs?<SEARCH_END><|im_end|>

[...search results...]

<|im_start|>assistant<|im_sep|>Orson Kovacs is ....<|im_end|>

With repeated training, models learn that if they don’t know something, they should look it up instead of making things up.
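
On the inference side, something has to notice the search tokens, run the query, and paste the results back into the context. Here is a minimal sketch of that loop in Python. `generate` and `search` are hypothetical callables (a real system would call a model and a search engine), and the `[search results]` marker is an assumption of this sketch.

```python
import re
from typing import Callable

# Matches the query between the search tokens used in the example above.
SEARCH_RE = re.compile(r"<SEARCH_START>(.*?)<SEARCH_END>", re.DOTALL)


def run_with_search(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical: context -> model output
    search: Callable[[str], str],    # hypothetical: query -> search results text
    max_calls: int = 3,
) -> str:
    """Generate, and whenever the output requests a search, run it and retry."""
    context = prompt
    output = generate(context)
    for _ in range(max_calls):
        match = SEARCH_RE.search(output)
        if match is None:
            break  # plain answer, no tool call
        results = search(match.group(1).strip())
        # Put the results into the context so the next pass can read them.
        context += f"{output}\n[search results]\n{results}\n"
        output = generate(context)
    return output


# Toy stand-ins so the sketch runs without a model or a search engine.
def toy_generate(context: str) -> str:
    if "[search results]" in context:
        found = context.split("[search results]\n")[-1].strip()
        return "According to the search results: " + found
    return "<SEARCH_START>Who is Orson Kovacs?<SEARCH_END>"


def toy_search(query: str) -> str:
    return f"(placeholder results for: {query})"


final = run_with_search("Who is Orson Kovacs?", toy_generate, toy_search)
print(final)
```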

“Vague Recollection” vs. “Working Memory”

Model parameters store vague recollections (like remembering something from a month ago).
Context tokens function as working memory, giving models access to fresh information.

This is why retrieval-augmented generation (RAG) works so well: if the model has direct access to relevant documents, it doesn’t need to guess.
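
To make the "working memory" idea concrete, here is a toy retrieval step in Python: pick the most relevant document and paste it into the prompt. The word-overlap scoring and the prompt wording are assumptions made for illustration. Real RAG systems typically use embedding-based search instead.

```python
from typing import List


def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Rank documents by how many query words they share (toy scoring)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, documents: List[str], k: int = 1) -> str:
    """Place the retrieved text in the context, ahead of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"


docs = [
    "Pre-training runs over a large internet text dataset.",
    "Post-training fine-tunes the base model on conversations.",
]
prompt = build_prompt("What does post-training fine-tune?", docs)
print(prompt)
```

The model never has to recall the fact from its parameters here. The relevant sentence is sitting in the context window when it answers.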

2.4. Knowledge of Self

If you prompt an untuned base model about who it is, it will likely hallucinate. For example, a non-OpenAI model might still say it was created by OpenAI simply because most internet data links AI models to OpenAI.

How to Fix This

Hardcode self-identity into training data → Example: Olmo-2 dataset.
Use system messages → At the start of every conversation, include a reminder of its identity.

By default, LLMs have no real knowledge of themselves. Without specific training, they default to generic AI responses.
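
The system-message approach is easy to sketch. The Python below makes sure every conversation starts with an identity reminder. The name "ExampleBot", the lab name, and the helper `with_identity` are all made up for illustration.

```python
from typing import Dict, List

# Hypothetical identity string; a real deployment would use its own.
IDENTITY = "You are ExampleBot, an assistant built by Example Lab."


def with_identity(messages: List[Dict[str, str]], identity: str = IDENTITY) -> List[Dict[str, str]]:
    """Return a copy of the conversation that starts with the identity reminder."""
    if messages and messages[0]["role"] == "system":
        # Keep the existing system message, but put the identity first.
        merged = {"role": "system", "content": f"{identity} {messages[0]['content']}"}
        return [merged] + messages[1:]
    return [{"role": "system", "content": identity}] + messages


chat = [{"role": "user", "content": "Who are you?"}]
tagged = with_identity(chat)
print(tagged)
```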

2.5. Models Need Tokens to Think

LLMs don’t reason like humans. They generate one token at a time, with a fixed amount of computation per token, so any real reasoning has to be spread across many tokens.

Example: Bad vs. Good Model Output

Bad Model Output:

Human: Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost of all the fruit is $13. What is the cost of apples?
Assistant: The answer is $3.

The model jumps to the answer without breaking it down.

Good Model Output:

Human: Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost of all the fruit is $13. What is the cost of apples?
Assistant: The total cost of the oranges is $4. 13 - 4 = 9, so the cost of the 3 apples is $9. 9/3 = 3, so each apple costs $3.

Here, the model works through the reasoning step-by-step.

Why This Matters

If a model jumps straight to an answer, it might just be guessing.
If it walks through the solution step-by-step, it’s more reliable.
Each token is produced by a single pass through a fixed number of layers, so the computation available for any one token is bounded. Spreading the work across many tokens, with each intermediate result written into the context, lets the model solve problems that would be too much for a single token.

For math and logic tasks, it’s often more reliable to ask the model to use an external tool, such as a code interpreter, rather than relying on its own mental arithmetic.
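
For instance, instead of doing the fruit problem in its head, a model with a code tool could write and run something like the Python below. The function name and parameters are illustrative, not part of any tool API.

```python
def price_per_item(total: float, other_count: int, other_price: float, item_count: int) -> float:
    """Subtract the known items from the total, then split the rest evenly."""
    other_total = other_count * other_price  # 2 oranges * $2 = $4
    remaining = total - other_total          # $13 - $4 = $9
    return remaining / item_count            # $9 / 3 apples = $3


apple_price = price_per_item(total=13, other_count=2, other_price=2, item_count=3)
print(apple_price)  # 3.0
```

The interpreter does the arithmetic exactly, so the model only has to set the problem up correctly.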