LLMOps in Production: MLflow, Langfuse, RAGAS

"Ship it and hope" works for a hackathon prototype. It does not work when your LLM powers a property search touching 1.5k requests per minute, with real money and real legal liability on the line.

Here's the LLMOps stack I run in production, and why each piece earns its keep.

The Four Pillars of LLMOps

  1. Observability: traces, spans, tokens, costs
  2. Evaluation: automated quality gates on every change
  3. Versioning: prompts, models, and datasets treated like code
  4. Feedback loops: production signals flow back into evaluation

Observability: Langfuse + LangSmith

Every LLM call in my stack is traced. Langfuse captures the prompt, completion, latency, token counts, cost, user ID, and session. LangSmith adds the LangChain/LangGraph graph view so I can see which node in a chain is bleeding tokens.

Why both? Langfuse is open-source and I self-host it for PII reasons. LangSmith is sharper for debugging agent loops. They complement each other.
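To make "every call is traced" concrete, here is a minimal stdlib-only sketch of the fields one trace carries. This is not the Langfuse SDK; the field names mirror what Langfuse records per call, and the per-1k-token prices are purely illustrative:

```python
import uuid
from dataclasses import dataclass, field

# Illustrative per-1k-token prices; real pricing varies by model and date
PRICE_PER_1K = {"input": 0.005, "output": 0.015}

@dataclass
class LLMTrace:
    """One traced LLM call: what an observability backend stores per request."""
    prompt: str
    completion: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    user_id: str
    session_id: str
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens / 1000 * PRICE_PER_1K["input"]
                + self.output_tokens / 1000 * PRICE_PER_1K["output"])

trace = LLMTrace(
    prompt="3-bed flats under 300k",
    completion="Here are 12 matching listings...",
    latency_ms=840.0,
    input_tokens=2000,
    output_tokens=300,
    user_id="u_123",
    session_id="s_456",
)
print(round(trace.cost_usd, 4))  # 2000/1000*0.005 + 300/1000*0.015 = 0.0145
```

Once cost and tokens live on every trace, per-user and per-endpoint rollups are just aggregations over these records.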

Evaluation: RAGAS + Golden Sets

RAGAS gives you the four metrics that matter for RAG systems:

  • Context Precision: did we retrieve the right chunks?
  • Context Recall: did we miss relevant info?
  • Faithfulness: does the answer stay grounded in the context?
  • Answer Relevancy: does the answer actually address the question?

I maintain a golden set of 200 hand-labeled examples. Every pull request runs the full eval. If faithfulness drops below 0.9, the build fails. No exceptions.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# golden_set: the 200 hand-labeled examples (question, contexts, answer)
result = evaluate(
    dataset=golden_set,
    metrics=[faithfulness, answer_relevancy, context_precision],
)

# Quality gate: any faithfulness score below 0.9 fails the build
assert result['faithfulness'] > 0.9, "Faithfulness regression!"

Versioning: MLflow for Prompts + Models

Prompts are code. I version every prompt in MLflow alongside the model version, the eval results, and the dataset hash. When something breaks in production, I can roll back to a known-good combination in one command.
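A hedged sketch of the idea in plain Python (not the MLflow API): hash the eval dataset and pin it to the prompt and model identifiers, so a change to any of the three produces a new, comparable record. The function names are hypothetical:

```python
import hashlib
import json

def dataset_hash(examples: list) -> str:
    """Stable content hash of the eval dataset, so a changed golden set is visible."""
    canonical = json.dumps(examples, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

def version_record(prompt: str, model: str, examples: list, evals: dict) -> dict:
    """The known-good combination you roll back to: prompt + model + data + scores."""
    return {
        "prompt_sha": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "model": model,
        "dataset_sha": dataset_hash(examples),
        "evals": evals,
    }

golden = [{"q": "example question", "a": "example answer"}]
rec = version_record(
    prompt="You are a property search assistant.",
    model="model-v1",  # placeholder identifier
    examples=golden,
    evals={"faithfulness": 0.93},
)
```

In MLflow these fields would live as run params and artifacts; the point is that rollback targets a single immutable record, not four loosely coupled versions.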

This sounds boring. It isn't. The first time you need to roll back a prompt at 2am because a model update broke your few-shot examples, you'll understand why it matters.

Feedback Loops: Production → Eval Set

The best eval set is your production traffic. I sample 0.5% of production requests, anonymize them, and feed them into a human review queue. Anything flagged as wrong becomes a new golden set entry.
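A minimal sketch of that sampling step (hypothetical helper names, stdlib only): deterministic hashing picks the 0.5%, and user IDs are one-way hashed before anything reaches the review queue:

```python
import hashlib

SAMPLE_RATE = 0.005  # 0.5% of production requests

def should_sample(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: hash the request ID into 10k buckets, keep the low ones."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000

def anonymize(record: dict) -> dict:
    """Replace the raw user ID with a one-way hash before human review."""
    out = dict(record)
    out["user_id"] = hashlib.sha256(record["user_id"].encode()).hexdigest()[:16]
    return out

requests = ({"request_id": f"req-{i}", "user_id": f"u{i}"} for i in range(100_000))
queue = [anonymize(r) for r in requests if should_sample(r["request_id"])]
# Roughly 0.5% of 100k requests (~500) land in the review queue
```

Deterministic hashing beats random sampling here: the same request is either always in or always out, so reruns and backfills stay consistent.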

This is how you avoid the "demos beautifully, fails in production" trap. Your eval set becomes a living reflection of what users actually ask.

Cost Tracking

Token costs compound fast. At 1.5k req/min with a 2k-token context, you're burning roughly $3k/day on GPT-5 alone. I track per-endpoint, per-user, and per-feature cost in Langfuse, and page myself when daily cost drifts 20% above baseline.
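The drift alert itself is a few lines. A hedged sketch (hypothetical function, illustrative numbers) of the 20%-over-baseline check:

```python
def cost_alert(today_usd: float, baseline_usd: float, threshold: float = 0.20) -> bool:
    """Page when daily spend drifts more than `threshold` above the rolling baseline."""
    if baseline_usd <= 0:
        return today_usd > 0  # no baseline yet: any spend is worth a look
    return (today_usd - baseline_usd) / baseline_usd > threshold

# Baseline $3,000/day: $3,700 is ~23% over and should page
print(cost_alert(3700.0, 3000.0))  # True
print(cost_alert(3100.0, 3000.0))  # False (~3% drift, within tolerance)
```

A rolling 7-day median makes a more robust baseline than yesterday's number, since weekday/weekend traffic swings would otherwise trip the alert.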

Cost is a product metric, not just a finance metric.

Putting It Together: CI/CD for LLMs

# On every PR:
1. Run unit tests + linters
2. Run RAGAS eval on golden set
3. Compare to baseline (main branch)
4. Fail if faithfulness/precision regresses
5. Human review required for prompt changes

# On merge to main:
1. Deploy to staging
2. Run 10% canary for 1 hour
3. Compare Langfuse metrics
4. Promote to prod or auto-rollback
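The promote-or-rollback step above can be sketched as a comparison of canary metrics against baseline. The metric names and 10% tolerances here are assumptions for illustration, not the exact production thresholds:

```python
def canary_decision(baseline: dict, canary: dict,
                    max_latency_regression: float = 0.10,
                    max_cost_regression: float = 0.10) -> str:
    """Promote only if canary latency and per-request cost stay within tolerance."""
    checks = (
        ("p95_latency_ms", max_latency_regression),
        ("cost_per_req_usd", max_cost_regression),
    )
    for metric, tolerance in checks:
        if canary[metric] > baseline[metric] * (1 + tolerance):
            return "rollback"
    return "promote"

baseline = {"p95_latency_ms": 900.0, "cost_per_req_usd": 0.014}
print(canary_decision(baseline, {"p95_latency_ms": 930.0, "cost_per_req_usd": 0.013}))   # promote
print(canary_decision(baseline, {"p95_latency_ms": 1200.0, "cost_per_req_usd": 0.014}))  # rollback
```

In practice the same gate would also include quality signals (e.g. sampled faithfulness scores from the canary traffic), not just latency and cost.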

The Mindset Shift

Treat LLMs like any other production service. They need observability, tests, versioning, and rollback. The tools are here; use them.

LLMOps isn't optional anymore. If you're not doing it, you're shipping bugs to users and hoping they don't notice.