Two years ago, prompt engineering meant stuffing a wall of few-shot examples into your context window and hoping for the best. In 2026, the game has changed completely. Models are bigger, smarter, and follow instructions directly. The techniques that worked on GPT-3.5 are often counterproductive on Claude Opus 4.6.
Here's what I've learned shipping production prompts across three frontier models over the past year.
What Still Works
1. Clear Role and Task Definition
The single most effective technique remains the simplest: tell the model exactly what role it is playing and what outcome you want. Not "you are a helpful assistant" but something concrete:
You are a senior real estate data analyst for the Dubai market.
Your task is to produce a JSON summary of the three top-matching
listings for the user's query, ranked by relevance.
This alone often beats elaborate prompt tricks.
2. Structured Output (JSON Mode, Tool Calling)
Free-form text outputs are fragile. JSON schema enforcement is not. Every major model now supports native structured outputs: Anthropic's tool use, OpenAI's JSON mode and function calling, Google's structured output mode. Use them.
In my Cortivex pipelines, every agent call goes through a typed schema. Parse errors dropped from around 4 percent to under 0.1 percent when I switched from prompt formatting tricks to native structured output.
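As a minimal sketch of that contract using only the standard library: the field names (`listing_id`, `rank`, `relevance`) are illustrative, echoing the listings example above, and in production the schema would be enforced by the provider's native structured-output API rather than checked by hand after the fact.

```python
import json

# Illustrative contract for the listings summary. In a real pipeline this
# JSON Schema would be handed to the provider's structured-output / tool-use
# API; here it only drives a local sanity check.
LISTING_SCHEMA = {
    "type": "object",
    "required": ["listing_id", "rank", "relevance"],
}


def parse_listings(raw: str):
    """Validate the model's output against the contract.

    A schema violation becomes an explicit, countable parse error
    instead of a silent downstream bug.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    for item in data:
        if not isinstance(item, dict):
            return None
        if not all(key in item for key in LISTING_SCHEMA["required"]):
            return None
    return data
```

The point of returning None on any violation is that rejections get counted and logged, which is how you notice a parse-error rate in the first place.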
3. XML Tags for Sectioning
Anthropic's models in particular love XML tags. Wrapping instructions, examples, and context in distinct tags helps the model separate concerns:
<instructions>
Classify the user's intent.
</instructions>
<context>
Previous conversation history goes here.
</context>
<query>
{user_question}
</query>
This is more effective than markdown headers in my tests, especially for Claude.
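A tiny helper keeps the tag assembly in one place; the section names mirror the example above, while the function names are my own.

```python
def xml_section(tag: str, body: str) -> str:
    """Wrap one prompt section in an XML tag so the model can separate concerns."""
    return f"<{tag}>\n{body}\n</{tag}>"


def build_prompt(instructions: str, context: str, query: str) -> str:
    # Assembly order matters: stable instructions first, user query last.
    return "\n".join([
        xml_section("instructions", instructions),
        xml_section("context", context),
        xml_section("query", query),
    ])
```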
What Has Stopped Working
1. Massive Few-Shot Example Dumps
Five years ago, adding ten examples was a cheat code. Today, with reasoning models, it often hurts. The model pattern-matches too strongly to the examples and loses the plot on edge cases.
Two or three diverse examples are still helpful for calibration. Ten are not.
2. "You are GPT-4, the smartest AI"
Flattery prompts were always dubious. In 2026, they actively backfire on Claude and GPT-5. The models are tuned to recognize and ignore performative praise.
3. "Let's think step by step"
Chain-of-thought prompting as a phrase is obsolete. Modern reasoning models do this natively, and explicitly telling them to do so often makes the output more verbose without improving quality. Use the model's reasoning mode flags instead.
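As a sketch, the flag-based alternative looks like this. The model snapshot names are illustrative, and the exact field names can differ across SDK versions, so treat these dicts as shape, not gospel.

```python
# Request-parameter sketches, not full API calls.

# Anthropic: extended thinking is a request parameter, not a prompt phrase.
anthropic_params = {
    "model": "claude-opus-4-6",  # illustrative snapshot name
    "max_tokens": 2048,
    "thinking": {"type": "enabled", "budget_tokens": 8000},
}

# OpenAI: reasoning effort is likewise a flag on the request.
openai_params = {
    "model": "gpt-5",  # illustrative snapshot name
    "reasoning_effort": "medium",
}
```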
The New Playbook
1. Prompts Are Code, Not Spells
Treat every prompt like a function. Version it, test it, monitor it. I version prompts in MLflow with the exact model snapshot they were tuned against, and re-run the golden set every time either changes.
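A minimal stand-in for that versioning discipline, assuming nothing beyond the standard library; the registry itself (MLflow in my case) is swapped for a content hash here, but the invariant is the same: a prompt change or a model change both produce a new version id.

```python
import hashlib


def prompt_version(prompt_text: str, model_snapshot: str) -> str:
    """Derive a stable version id from the prompt and the model it was tuned on.

    A hypothetical stand-in for a real registry: identical inputs always
    map to the same id, and editing either input changes it.
    """
    digest = hashlib.sha256(f"{model_snapshot}\n{prompt_text}".encode()).hexdigest()
    return f"{model_snapshot}-{digest[:12]}"
```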
2. Caching Is Your Best Friend
Prompt caching (Anthropic's cache_control, OpenAI's cached prefixes) can reduce cost by 80 percent for long system prompts. Structure your prompts with the stable parts first so they can be cached:
[STABLE SYSTEM PROMPT] <- cached
[STABLE INSTRUCTIONS] <- cached
[DYNAMIC CONTEXT] <- not cached
[USER QUERY] <- not cached
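A sketch of what that layout looks like as an Anthropic-style request body; the snapshot name is illustrative, and the exact cache_control semantics are worth checking against the current SDK docs.

```python
# The stable prefix carries a cache_control marker; dynamic context and
# the user query come after it, so the cached prefix never changes.
request = {
    "model": "claude-opus-4-6",  # illustrative snapshot name
    "system": [
        {
            "type": "text",
            "text": "You are a senior real estate data analyst...",  # stable, cached
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [
        # Dynamic, not cached.
        {"role": "user", "content": "Dynamic context plus the user query go here."},
    ],
}
```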
3. Evaluation Is Non-Negotiable
You can't improve what you don't measure. Every prompt change in my production stack runs against a golden set of at least 200 examples. If accuracy drops, the change is reverted. Simple as that.
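A minimal version of that gate, with a stand-in predict function; the threshold check is the whole idea, and everything else (the real golden set, the real model call) plugs in around it.

```python
def golden_set_accuracy(predict, golden_set) -> float:
    """Run a candidate prompt's predict() over (input, expected) pairs."""
    correct = sum(1 for example, expected in golden_set if predict(example) == expected)
    return correct / len(golden_set)


def gate_prompt_change(predict, golden_set, baseline: float) -> bool:
    # The gate: any change that drops accuracy below baseline is rejected.
    return golden_set_accuracy(predict, golden_set) >= baseline
```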
4. Tool Calling Over Prompt Engineering
Instead of trying to prompt the model to extract structured data, give it a tool and let it call it. The model does the extraction naturally, the response is typed, and you skip a whole class of parsing bugs.
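A sketch of such a tool definition in the Anthropic style; the name and fields are illustrative, echoing the listings example from the top of the post.

```python
# Anthropic-style tool definition: the input_schema is plain JSON Schema,
# and the model's tool call arrives already typed against it.
extract_listing_tool = {
    "name": "record_listing",
    "description": "Record one extracted listing from the source text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "listing_id": {"type": "string"},
            "rank": {"type": "integer"},
            "relevance": {"type": "number"},
        },
        "required": ["listing_id", "rank", "relevance"],
    },
}
```

Because the provider enforces the schema at call time, the "extract and then parse" step disappears entirely.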
Model-Specific Quirks
Claude Opus 4.6: Loves XML tags. Responds very well to "think through the problem before answering." Best in class for long documents.
GPT-5: Strongest with JSON mode and function calling. Shortest, most concise outputs. Best for cost-sensitive workloads.
Gemini 2.5 Pro: Phenomenal context length and multimodal handling. Best for video and document understanding. Slightly more verbose than the others.
The Mindset
Stop thinking of prompts as magic incantations. Start thinking of them as typed interfaces to an unreliable function. You'd never ship a typed function without tests. Don't ship a prompt without them either.
Prompt engineering in 2026 looks more like software engineering and less like poetry. That is a good thing.
