1 minute to read

AI in production: tracing and evaluations with the OpenAI-compatible API

AI in production: tracing and evaluations with the OpenAI-compatible API

Most developers have built something with an LLM by now. That's the premise Martin, Developer Relations at Mittwald, opened with at our most recent Shopware community event. Judging by the room’s reactions, he isn't wrong.

Getting a prompt in and a response is the easy part. Knowing whether what comes back is actually quality, and having something to debug when it isn't, is where most integrations stop short. Martin's session was about closing that gap through the two things most integrations are missing: tracing and evaluations.

Why the prototype stage isn't enough

Most LLM integrations start the same way. You call an OpenAI-compatible API endpoint, you get a response, you put it in your application. It works and then you ship it.

The problem is that AI-generated output is inherently non-deterministic. The same prompt can return meaningfully different results on every call. That means you can't test your way to confidence and then move on. Things go wrong in production in ways that simply don't show up during development.

Martin gave a concrete example of the risk: a consumer AI assistant, asked whether gasoline could be used to cook spaghetti, responded that it could and recommended adding garlic in it first. Objectively dangerous, and a reminder that LLMs can fail in ways that are impossible to fully anticipate in advance.

For Shopware developers, the practical version of that risk is less dramatic but just as real: an AI-generated product description with wrong specifications shows on a live storefront, a support assistant gives a confident but wrong answer about a return policy, a search feature degrades after a model update. Without monitoring, none of these are visible until a user complains.

The two key features: tracing and evaluations

Tracing means getting visibility into what your LLM interactions are actually doing at runtime such as what prompts went into the model, what came back, how long each call took, and how many tokens it consumed. Every interaction gets a unique trace ID that can be surfaced in response to headers or error messages, so that when a user reports a problem, there is something factual to check.

Evaluations go a step further. Tracing tells you what happened. Evaluations tell you whether what happened was any good. That means scoring model output against criteria that matter for your specific application, so they are always context dependent. For a product description generator, quality means factual accuracy and brand-consistent tone. For a support assistant, it means relevance and policy compliance.

What tool to use: LangFuse

Martin's tool of choice for both is LangFuse, an open-source, MIT-licensed LLM observability platform built by a Berlin-based team.

It can be self-hosted via Docker or used via its hosted cloud option, and it has SDKs across languages including Node.js and PHP.

Under the hood, LangFuse uses OpenTelemetry, the widely adopted cloud-native observability standard, which means it auto-detects the environment variables injected into a server process and doesn't require significant instrumentation work.

How little code it actually takes

The integration requires a single change: wrapping the existing OpenAI API client in a LangFuse observer at initialization. Everything else in the application continues as before.

From that point, LangFuse captures a full timeline of every LLM call, each one with a traceable span with its input, output, latency, and token count. Individual spans can then be enriched with additional metadata.

One configuration note: immediate telemetry export is convenient in development but not the right setup for production at any meaningful volume. Instead, batch processing is more efficient and is the recommended approach.

What you can see once traces are flowing

Beyond individual trace inspection, LangFuse gives you aggregated views: token usage over time, cost tracking by query type, and response time distributions. Martin points to response time in particular as something that tends to get underestimated. “If you spend all of the time waiting for LLM responses, you might come to a point where user experience suffers to a degree that you're just losing user interaction.” The platform also supports dashboards configured around specific queries, useful for understanding not only how many tokens an application is burning in aggregate, but which types of interactions are the most expensive.

The two ways to score quality

The first is human annotations. A reviewer scores trace manually through an internal queue, or end-user feedback signals are piped back via the LangFuse API. A thumbs up / thumbs down at the end of an interaction for example. Martin mentions this as underrated: user feedback, collected and routed into an observability platform, is a quality signal that most applications are not actively using.

The second is LLM-as-a-judge. A second AI model evaluates the output of the primary model against a defined rubric, running automatically on every interaction. LangFuse calls these evaluators, and they can be configured to run continuously across all LLM interactions the application produces. Martin demonstrated this with a custom evaluator that checks whether the difficulty level of a generated presentation actually matches what a user selected.

Where this applies in Shopware

The OpenAI-compatible API pattern means none of this is tied to a specific model provider. The same observability layer works regardless of what model sits behind the endpoint, and switching providers doesn't require changes to the tracing setup.

For Shopware developers, the highest-value starting points are AI-generated product content, where a quality regression can affect a live storefront before anyone notices, and customer support assistants, where having an audit trail of what the model said matters. AI-powered search is worth adding to the picture too, since LLM latency compounds with search latency in ways that aren't obvious until you're looking at a timeline.

LangFuse is freely self-hostable via Docker, and a hosted cloud option is available for teams who'd rather skip the infrastructure. If you're building AI-assisted features on Shopware and want to compare notes on production observability, the Shopware Discord is where that conversation is already happening.

Copied to clipboard