Most teams running AI agents in production have no idea what's happening inside them. Langfuse — open-source, self-hostable, MIT-licensed — changes that. Here's how to set it up in 10 minutes and what you'll see when you do.
A team running a multi-step research agent woke up one morning to find they'd spent $800 on LLM calls overnight. The agent had hit an error on step 3 of 12, couldn't recover gracefully, and retried the entire workflow — again and again — until a scheduled cron killed it six hours later. Every retry burned the same 50,000 tokens.
They had logs. They had no traces.
The logs showed the error message. They had no way to see which prompt triggered it, how many tokens step 3 consumed vs step 7, or which model call was the bottleneck. Debugging it took two engineers most of a day.
This is the standard state of agent development in 2026: teams shipping agents to production with nothing but print() statements and a rising sense of dread.
The problem has a name — agent observability — and it's the gap between "the agent returned an answer" and "I understand exactly what the agent did to get there."
What Observability Actually Means for Agents
For traditional software, logging and APM tools are well understood. You instrument your HTTP handlers, you add distributed tracing, you watch latency percentiles. The tooling is mature.
Agents don't fit that model. An agent isn't a request/response chain — it's a reasoning loop. A single user query might trigger 12 LLM calls, 3 tool invocations, a retrieval step, and a synthesis pass, each with its own inputs, outputs, token counts, and cost. The meaningful unit isn't the HTTP request — it's the trace.
What you actually need to know when an agent misbehaves:
• Which step failed, and what inputs it received
• How many tokens each step consumed (and therefore what it cost)
• How long each step took — and whether that's regressing
• What the model actually received as context, not what you think you sent
• Whether the problem is consistent or a one-off
Without tooling that captures all of this, debugging a production agent is archaeology. You're reconstructing what happened from partial evidence.
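To make this concrete, here is a minimal sketch in plain Python (no Langfuse involved, and not its actual schema) of the data a trace tree carries: nested steps, each with its own token count, cost, and latency, plus the aggregation that answers the cost question.

```python
# Illustrative only: a hand-rolled trace tree, not Langfuse's real data model.
trace = {
    "name": "research-agent",
    "children": [
        {"name": "plan",     "tokens": 1200,  "cost_usd": 0.011, "ms": 900,  "children": []},
        {"name": "retrieve", "tokens": 300,   "cost_usd": 0.002, "ms": 350,  "children": []},
        {"name": "execute",  "tokens": 48000, "cost_usd": 0.41,  "ms": 4100, "children": []},
    ],
}

def totals(node):
    """Sum tokens and cost over a node and all of its descendants."""
    tokens = node.get("tokens", 0)
    cost = node.get("cost_usd", 0.0)
    for child in node.get("children", []):
        t, c = totals(child)
        tokens += t
        cost += c
    return tokens, cost

tokens, cost = totals(trace)
print(tokens, round(cost, 3))  # 49500 0.423 -- the "execute" step dominates both
```

With logs alone you only see the leaf messages; the tree structure is what lets you attribute tokens and dollars to a specific step.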

Langfuse: Open-Source, 21,000 Stars, Now Backed by ClickHouse
Langfuse started in 2023 as a Y Combinator company with a straightforward thesis: the LLM ecosystem needed an open-source observability layer that any team could self-host without a vendor contract. Three years later it has 21,000+ GitHub stars, 26 million SDK installs per month, and in January 2026, it was acquired by ClickHouse as part of ClickHouse's $400M Series D.
The ClickHouse acquisition is significant. Langfuse was already fast, but ClickHouse is the database that powers the metrics infrastructure at Cloudflare, Uber, and eBay. The traces and observations that Langfuse captures now sit on one of the fastest open-source OLAP engines available. Query performance on millions of traces went from seconds to milliseconds.
And it's MIT-licensed. The full self-hosted version — no feature flags, no seat limits, no "enterprise tier" to unlock the useful parts — is free. You run it on your own infrastructure. Your traces stay on your machines.
The thing Langfuse shows you isn't just logs. It's a full trace tree: every LLM call represented as a span, nested inside the agent run that triggered it, with the exact prompt sent and response received at each step. You can see, for a single user query, exactly which step cost $0.08, which step took 4 seconds, and which step returned a response that caused the agent to go sideways.
What You See in the Dashboard
The trace view is the core of Langfuse. A trace maps to a single agent run — one conversation turn, one document processing job, one background task. Inside it are observations: spans (any timed block of code), generations (LLM calls), and events (discrete moments you want to tag).
For each generation, Langfuse captures:
• Input and output (the full prompt + completion)
• Model name and parameters
• Token counts (prompt, completion, total)
• Cost (calculated automatically for Anthropic and OpenAI models)
• Latency
The cost tracking is automatic for Claude and GPT models — Langfuse knows the per-token prices and calculates them server-side. You don't wire anything up. You just start seeing dollar amounts next to each call.
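Under the hood this is just token counts multiplied by per-token prices. A rough sketch of the arithmetic, using a placeholder model name and made-up prices rather than any provider's actual rates:

```python
# Hypothetical price table: USD per million tokens as (input, output).
# These numbers are illustrative, not real pricing.
PRICES = {"example-model": (3.00, 15.00)}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single generation, given usage and a per-mtok price table."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

print(call_cost("example-model", 50_000, 2_000))  # 0.18
```

Langfuse maintains this table server-side for known models, which is why dollar amounts appear without any configuration on your end.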
The dashboard aggregates this across all your traces. You can see total spend by day, by user, by session. You can filter to the top-10 most expensive traces and open each one to see exactly where the money went. The "$800 overnight" incident becomes visible in the first minute after you set this up — you'd see the trace count spike and the per-trace cost on a single agent session.
Try It Yourself
Option 1: Self-Host with Docker (Recommended)
Langfuse runs as six containers: a web UI, an async worker, ClickHouse for trace storage, Postgres for metadata, Redis for queuing, and MinIO for blob storage. The docker-compose.yml is maintained in the GitHub repo.
```bash
# Clone the repo and bring up the stack
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d
```
Before running, generate fresh secrets for the values marked in the compose file:
```bash
# Generate your secrets
SALT=$(openssl rand -hex 16)
ENCRYPTION_KEY=$(openssl rand -hex 32)
NEXTAUTH_SECRET=$(openssl rand -hex 32)
```
Then update the corresponding values in docker-compose.yml (they're marked with # CHANGEME). The defaults work for local development but you should change them before exposing the service.
The dashboard is at http://localhost:3000. Create an account, create a project, and copy your public/secret key pair.
One important note: all containers must run with TZ: UTC. Non-UTC system time causes empty query results in the dashboard — this is documented but catches teams by surprise.
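The failure mode is easy to reproduce in plain Python: if events are stored with UTC timestamps but the query window is computed from a clock that treats local time as UTC, "recent" events fall outside the window. A contrived sketch of the mismatch:

```python
from datetime import datetime, timedelta, timezone

# Events stored with UTC timestamps, as the backend expects.
now_utc = datetime(2026, 1, 15, 1, 0, tzinfo=timezone.utc)
events = [now_utc - timedelta(minutes=m) for m in (5, 20, 45)]

# A host running at UTC+9 computes "the last hour" from its local wall
# clock but labels the result as UTC, silently shifting the window.
local_naive = datetime(2026, 1, 15, 10, 0)  # 10:00 local == 01:00 UTC
window_start = (local_naive - timedelta(hours=1)).replace(tzinfo=timezone.utc)

visible = [e for e in events if e >= window_start]
print(len(visible))  # 0 -- every recent event misses the "last hour" window
```

Pinning every container to UTC removes the offset, so stored timestamps and query windows agree.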
Option 2: Langfuse Cloud
If you don't want to manage the infra, Langfuse Cloud has a free tier at 50,000 observation units per month. Create an account at cloud.langfuse.com (EU) or us.cloud.langfuse.com (US), create a project, get your keys. Skip straight to the instrumentation step below.
Instrumenting Your Agent (Python)
Install the dependencies:
```bash
pip install langfuse anthropic opentelemetry-instrumentation-anthropic
```
Set your credentials (or put them in .env):
```bash
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="http://localhost:3000"  # or https://cloud.langfuse.com
```
Initialize Langfuse and instrument the Anthropic SDK:
```python
from anthropic import Anthropic
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor
from langfuse import get_client

# Patches all Anthropic API calls to emit OTEL spans automatically
AnthropicInstrumentor().instrument()

langfuse = get_client()
client = Anthropic()
```
Now wrap your agent steps:
```python
def run_research_agent(query: str, user_id: str) -> str:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="research-agent",
        user_id=user_id,
        metadata={"query_length": len(query)},
    ) as trace:
        # Step 1: Plan
        with langfuse.start_as_current_observation(as_type="span", name="plan"):
            plan_response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=512,
                messages=[{"role": "user", "content": f"Create a research plan for: {query}"}],
            )
        # Step 2: Execute
        with langfuse.start_as_current_observation(as_type="span", name="execute"):
            result_response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=2048,
                messages=[
                    {"role": "user", "content": f"Execute this plan: {plan_response.content[0].text}"}
                ],
            )
        return result_response.content[0].text

# In short-lived scripts, flush before exit
langfuse.flush()
```
Every client.messages.create() call is automatically captured as a generation with full inputs, outputs, token counts, and cost. The span nesting is preserved — you see research-agent → plan → [LLM call] and research-agent → execute → [LLM call] as a tree in the dashboard.
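The flush() call matters because, like most OTEL exporters, the SDK batches events and sends them from a background thread; a process that exits first can drop the tail of the batch. A toy sketch of that pattern (not Langfuse's actual internals):

```python
import queue
import threading

class BatchingExporter:
    """Toy exporter: events queue up and ship asynchronously in the background."""

    def __init__(self):
        self._queue = queue.Queue()
        self.shipped = []
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            event = self._queue.get()
            self.shipped.append(event)  # stand-in for an HTTP export
            self._queue.task_done()

    def record(self, event):
        self._queue.put(event)  # returns immediately; export happens later

    def flush(self):
        self._queue.join()  # block until the queue drains

exporter = BatchingExporter()
for i in range(100):
    exporter.record({"span": i})
exporter.flush()  # without this, process exit could drop queued events
print(len(exporter.shipped))  # 100
```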
What to Check First
After your first few traces arrive, look at three things:
Cost by trace: Go to Traces → sort by cost descending. The outliers show up immediately. Any trace costing 10x the average warrants a look.
Latency by step: Open a trace, look at the span durations. A 12-second span in a step that should take 1 second usually means either a massive context window or a model that's struggling with the prompt.
Failure patterns: Filter traces by status `ERROR`. Open a few. See what the model received as input on the failing step — more often than not, the bug is in what you sent, not in the model's response.
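The "10x the average" check is trivial to apply to exported trace data. A sketch with hypothetical per-trace costs, e.g. from a CSV export of the dashboard:

```python
# Hypothetical records: many cheap traces plus one runaway retry loop.
traces = [{"id": f"t{i}", "cost_usd": 0.04} for i in range(100)]
traces.append({"id": "t-retry", "cost_usd": 8.00})

avg = sum(t["cost_usd"] for t in traces) / len(traces)
outliers = [t["id"] for t in traces if t["cost_usd"] > 10 * avg]
print(outliers)  # ['t-retry']
```

Note the check only works against a large enough sample: a single huge trace among a handful of others drags the average up with it, so with very few traces a median baseline is more robust.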
Langfuse vs LangSmith
The honest comparison: if your entire stack is LangChain or LangGraph, LangSmith is the path of least resistance. The native integration is deep, the hosted service is polished, and you won't have to run any infrastructure.
For everything else, Langfuse is the better choice on almost every dimension:
Self-hosting is free and first-class. LangSmith self-hosting requires an Enterprise contract. Langfuse's self-hosted version is identical to the cloud version, MIT-licensed, no seats or feature limits.
Framework-agnostic. Langfuse uses OpenTelemetry — the CNCF standard for distributed tracing. If your stack changes (and it will), your instrumentation doesn't need to. Any library with an OTEL instrumentor works: Anthropic, OpenAI, AWS Bedrock, Vertex AI.
Data stays on your infrastructure. This matters more than teams initially think. Trace data contains the full text of every prompt you send and every response you receive. That includes user queries, internal documents, code snippets, and anything else you pass to the model. The only way to be certain that data doesn't leave your environment is to self-host.
ClickHouse backing. As of January 2026, Langfuse's trace storage runs on ClickHouse. For teams ingesting millions of traces, the query performance difference is material.
The one place LangSmith still wins: if you're deep in the LangGraph ecosystem and want tight native integration, the setup friction for Langfuse's OTEL-based approach is higher. But that gap closes with every SDK release.
The Bottom Line
The "$800 overnight" incident that opened this post isn't a horror story — it's a fairly typical Tuesday for a team that ships agents without observability. The bug was fixable in an afternoon. The problem was that without trace visibility, they didn't know where to look.
Langfuse gives you the trace tree that makes the answer obvious. Not just what the agent said, but what it was thinking when it said it — the full input, the full context, the exact model, the exact cost, at every step.
The setup takes under 10 minutes. The docker-compose stack runs on any machine with Docker. The Python SDK is 5 lines. And the first time you open a trace and see exactly where a $0.40 agent call happened and what it said, you'll wonder how you were running agents without this.
21,000 developers apparently had the same thought.
If you're managing agent costs at the infrastructure level, the agent loop budget guardrails post covers the three-level cost defense system (per-request limits, task budgets, and dashboard spending caps) that works alongside observability to prevent runaway spend.