
OpenTelemetry and Observability

Ch 18 · Advanced Systems 55 min
Spans · trace_id · Jaeger · meshflow.verdict
Hands-on: MESHFLOW_MOCK=1 python3 hands_on/18_opentelemetry.py

Lesson 18: OpenTelemetry And Production Observability

Lesson Goal

By the end of this lesson, you should be able to:

  • Explain the difference between logs, metrics, and distributed traces.
  • Describe what OpenTelemetry spans are and how they relate to MeshFlow runs.
  • Enable console telemetry for local development.
  • Configure an OTLP exporter to send spans to Jaeger, Grafana, or Datadog.
  • Read trace_id from a run result and look up the trace in your backend.
  • Identify what MeshFlow attributes appear on each span type.

Estimated time: 45 to 60 minutes.

1. Logs, Metrics, And Traces

Production observability is built on three pillars:

Logs answer: what happened? They are timestamped event records, often unstructured text. Easy to write, hard to correlate across services.

Metrics answer: how much? They are numeric measurements over time — request rate, error rate, latency percentiles, cost per hour. Good for alerting and dashboards.

Distributed traces answer: how did this specific request travel through the system? A trace is a tree of spans, one per operation, each with start time, end time, and attributes. Traces are the most useful tool for understanding the behavior of a specific AI run.

MeshFlow generates distributed traces. Every run is a root span. Every agent execution, gate evaluation, and tool call is a child span.
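The run/agent/tool hierarchy described above can be pictured with a small stand-in model. This is a minimal sketch using only the standard library. The Span class here is a toy, not MeshFlow's or OpenTelemetry's real class, and the meshflow.tool span name is an assumption made for illustration.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Toy span record used only to illustrate the tree shape."""
    name: str
    trace_id: str
    parent: Optional["Span"] = field(default=None, repr=False, compare=False)
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    children: List["Span"] = field(default_factory=list, repr=False, compare=False)

    def child(self, name: str) -> "Span":
        """Create a child span that inherits this span's trace_id."""
        span = Span(name=name, trace_id=self.trace_id, parent=self)
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> List[str]:
    """Return the span tree as indented lines, root first."""
    lines = ["  " * depth + span.name]
    for c in span.children:
        lines.extend(render(c, depth + 1))
    return lines


# One run (root span) with an agent, a tool call under that agent, and a gate.
root = Span(name="meshflow.run", trace_id=secrets.token_hex(16))
agent = root.child("meshflow.agent")
tool = agent.child("meshflow.tool")   # hypothetical span name
gate = root.child("meshflow.gate")

print("\n".join(render(root)))
```

Every span in the tree carries the root's trace_id, which is what lets a backend reassemble the tree from spans that arrive independently.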

2. OpenTelemetry

OpenTelemetry (OTEL) is the open standard for generating, collecting, and exporting telemetry data. It works with any backend: Jaeger, Grafana Tempo, Datadog, Honeycomb, New Relic, and others.

Key concepts:

  • Span: a single timed operation with a name, start/end times, status, and key-value attributes.

  • Trace: a tree of spans representing one complete operation.
  • trace_id: the unique identifier shared by all spans in one trace.
  • Exporter: sends spans to a backend (OTLP over HTTP or gRPC).
  • Collector: an intermediate service that receives spans and routes them.
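To make the identifiers concrete: OpenTelemetry defines a trace_id as 16 bytes and a span_id as 8 bytes, usually shown as 32 and 16 lowercase hex characters, and an all-zero ID is invalid. The helpers below are an illustrative sketch with hypothetical names. They are not part of MeshFlow or the OpenTelemetry SDK.

```python
import re
import secrets

TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")
SPAN_ID_RE = re.compile(r"[0-9a-f]{16}")


def new_trace_id() -> str:
    """Generate a random trace_id: 16 bytes rendered as 32 hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Generate a random span_id: 8 bytes rendered as 16 hex characters."""
    return secrets.token_hex(8)


def is_valid_trace_id(value: str) -> bool:
    """Accept 32 lowercase hex characters that are not all zeros."""
    return bool(TRACE_ID_RE.fullmatch(value)) and value != "0" * 32


def is_valid_span_id(value: str) -> bool:
    """Accept 16 lowercase hex characters that are not all zeros."""
    return bool(SPAN_ID_RE.fullmatch(value)) and value != "0" * 16


print(is_valid_trace_id(new_trace_id()))
print(is_valid_trace_id("4a8f2b1c9d3e7f2a"))  # too short to be a full trace_id
```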

3. Console Telemetry For Local Development

The simplest setup requires no backend:

policy = Policy(
    telemetry_console=True,   # print spans to stdout
)
result = await mesh.run(task)

Console output for each span looks like:

[SPAN] meshflow.run
  trace_id    : 4a8f2b1c...
  run_id      : run_abc123
  duration_ms : 1243
  cost_usd    : 0.0024
  total_tokens: 312
  status      : OK

[SPAN] meshflow.agent
  trace_id    : 4a8f2b1c...  ← same trace
  agent_id    : researcher-agent
  role        : researcher
  tokens      : 148
  cost_usd    : 0.0012
  uncertainty : 0.18
  verdict     : allowed

Console telemetry is ideal for development and debugging. For production, use an OTLP exporter.
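If you want to inspect console telemetry programmatically, a small parser is enough. The sketch below assumes the output looks exactly like the blocks shown above ([SPAN] header followed by "key : value" lines); the real format may differ, and parse_spans is a hypothetical helper, not a MeshFlow function.

```python
from typing import Dict, List

# Sample text in the format shown in this lesson (values are illustrative).
SAMPLE = """\
[SPAN] meshflow.run
  trace_id    : 4bf92f3577b34da6a3ce929d0e0e4736
  run_id      : run_abc123
  duration_ms : 1243
  status      : OK

[SPAN] meshflow.agent
  trace_id    : 4bf92f3577b34da6a3ce929d0e0e4736
  agent_id    : researcher-agent
  verdict     : allowed
"""


def parse_spans(text: str) -> List[Dict[str, str]]:
    """Turn console span blocks into one dict per span (all values as strings)."""
    spans: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[SPAN]"):
            # A header line starts a new span; the rest of the line is its name.
            spans.append({"name": line[len("[SPAN]"):].strip()})
        elif ":" in line and spans:
            # Attribute line: split on the first colon only.
            key, _, value = line.partition(":")
            spans[-1][key.strip()] = value.strip()
    return spans


spans = parse_spans(SAMPLE)
trace_ids = {s["trace_id"] for s in spans}
print(len(spans), "spans,", len(trace_ids), "distinct trace_id")
```

Piping the script's stdout into a parser like this makes it easy to confirm that every span in one run shares a single trace_id.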

4. OTLP Exporter

policy = Policy(
    telemetry_otlp_endpoint="http://localhost:4318",
    telemetry_otlp_protocol="http/protobuf",  # or "grpc" (OTLP/gRPC conventionally uses port 4317)
)

Or via environment variable:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
MESHFLOW_MOCK=1 python3 hands_on/18_opentelemetry.py

The OTLP endpoint can point to:

  • Jaeger: docker run -d -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one
  • Grafana Tempo: docker run -d -p 3200:3200 -p 4318:4318 grafana/tempo (Tempo normally also needs a configuration file that enables its OTLP receiver)
  • OTEL Collector: a collector that fans out to multiple backends
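When debugging an exporter, it helps to know which URL spans are actually posted to. Under the OpenTelemetry OTLP/HTTP conventions, OTEL_EXPORTER_OTLP_ENDPOINT is a base URL to which /v1/traces is appended, while OTEL_EXPORTER_OTLP_TRACES_ENDPOINT is used as given. The sketch below illustrates that rule. Whether MeshFlow resolves its endpoint the same way is an assumption, and otlp_traces_url is a hypothetical helper.

```python
import os
from typing import Mapping

DEFAULT_HTTP_ENDPOINT = "http://localhost:4318"  # conventional OTLP/HTTP port


def otlp_traces_url(env: Mapping[str, str] = os.environ) -> str:
    """Resolve the OTLP/HTTP traces URL from OpenTelemetry environment variables."""
    # The signal-specific variable wins and is used exactly as written.
    specific = env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
    if specific:
        return specific
    # The generic variable is a base URL; the traces path is appended.
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_HTTP_ENDPOINT)
    return base.rstrip("/") + "/v1/traces"


print(otlp_traces_url({}))
print(otlp_traces_url({"OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4318/"}))
```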

5. MeshFlow Span Attributes

Every MeshFlow span carries attributes that identify the run and the step:

Attribute              Span type     Description
meshflow.run_id        all           Unique run identifier
meshflow.trace_id      all           Links to the OTEL trace
meshflow.agent_id      agent         Agent identifier
meshflow.role          agent         AgentRole value
meshflow.cost_usd      agent, run    Cost in USD
meshflow.tokens        agent, run    Token count
meshflow.uncertainty   agent         Confidence score (0-1)
meshflow.verdict       agent         allowed or blocked
meshflow.carbon_g      agent, run    Carbon footprint
meshflow.gate_result   gate          passed or blocked
meshflow.compliance    run           Active compliance mode
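The attribute table can be turned into a simple completeness check, for example in a test that inspects exported spans. This is a sketch only: the EXPECTED mapping is transcribed from the table above, and missing_attributes is a hypothetical helper rather than anything MeshFlow provides.

```python
from typing import Dict, List, Mapping

COMMON = ["meshflow.run_id", "meshflow.trace_id"]

# Attributes expected per span type, transcribed from the table.
EXPECTED: Dict[str, List[str]] = {
    "run": COMMON + [
        "meshflow.cost_usd", "meshflow.tokens",
        "meshflow.carbon_g", "meshflow.compliance",
    ],
    "agent": COMMON + [
        "meshflow.agent_id", "meshflow.role", "meshflow.cost_usd",
        "meshflow.tokens", "meshflow.uncertainty", "meshflow.verdict",
        "meshflow.carbon_g",
    ],
    "gate": COMMON + ["meshflow.gate_result"],
}


def missing_attributes(span_type: str, attributes: Mapping[str, object]) -> List[str]:
    """List the expected attribute keys that are absent from a span."""
    return [key for key in EXPECTED[span_type] if key not in attributes]


gate_span = {
    "meshflow.run_id": "run_abc123",
    "meshflow.trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
}
print(missing_attributes("gate", gate_span))
```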

6. Linking trace_id To Your Backend

After a run:

result = await mesh.run(task)
print(result.trace_id)   # e.g. "4a8f2b1c9d3e7f2a..."

Use this trace_id in your backend to look up all spans for the run:

In Jaeger: http://localhost:16686/trace/<trace_id> (paste the full trace_id value, not a truncated prefix)

In Grafana: search by meshflow.run_id or meshflow.trace_id in the trace explorer.
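Building the Jaeger link from a run result can be automated. The helper below is a minimal sketch: jaeger_trace_url is a hypothetical name, the default base URL is the local Jaeger UI used in this lesson, and the 32-character check assumes result.trace_id is a full hexadecimal OTEL trace ID.

```python
def jaeger_trace_url(trace_id: str, base_url: str = "http://localhost:16686") -> str:
    """Return the Jaeger UI URL for one trace."""
    trace_id = trace_id.strip().lower()
    # A full OTEL trace_id is 32 hex characters; reject truncated values early.
    if len(trace_id) != 32 or any(c not in "0123456789abcdef" for c in trace_id):
        raise ValueError(f"not a full 32-character hex trace_id: {trace_id!r}")
    return f"{base_url.rstrip('/')}/trace/{trace_id}"


print(jaeger_trace_url("4bf92f3577b34da6a3ce929d0e0e4736"))
```

In a script you would pass result.trace_id instead of the literal value.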

7. Filtering By Blocked Spans

One of the most useful queries in production: find all runs where a guardian blocked an agent. In Jaeger or Grafana, filter by:

meshflow.verdict = blocked

This shows you every span where the safety guardian rejected an agent output — invaluable for monitoring what kinds of content your pipeline is blocking and whether you need to tune the guardian.
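The same filter can be applied offline to spans you have already collected, for instance spans parsed from console output or exported as JSON. The sketch below uses made-up sample data (the "writer" role and the span dictionaries are illustrative only) and a hypothetical blocked_spans helper.

```python
from collections import Counter
from typing import Dict, List

# Illustrative spans; attribute keys follow the table in section 5.
spans: List[Dict[str, str]] = [
    {"name": "meshflow.agent", "meshflow.role": "researcher", "meshflow.verdict": "allowed"},
    {"name": "meshflow.agent", "meshflow.role": "writer", "meshflow.verdict": "blocked"},
    {"name": "meshflow.agent", "meshflow.role": "writer", "meshflow.verdict": "blocked"},
    {"name": "meshflow.gate", "meshflow.gate_result": "passed"},
]


def blocked_spans(spans: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only spans whose meshflow.verdict attribute equals 'blocked'."""
    return [s for s in spans if s.get("meshflow.verdict") == "blocked"]


blocked = blocked_spans(spans)
# Count blocks per role to see which part of the pipeline is rejected most often.
by_role = Counter(s.get("meshflow.role", "unknown") for s in blocked)
print(len(blocked), dict(by_role))
```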

8. Cost And Latency Analysis

With spans exported to a backend, you can aggregate:

  • p95 latency per agent role: which role has the highest 95th-percentile latency, that is, the duration that 95% of its executions finish within?
  • Total cost per pipeline run: are costs growing over time?
  • Token usage by workflow version: did a prompt change increase token usage?
  • Carbon per pipeline variant: which configuration is most efficient?

These aggregations require a backend with query capability (Grafana, Datadog, Honeycomb). Console telemetry alone cannot answer them.
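To see what such a backend computes, here is a small offline version of two of these aggregations. It is a sketch over made-up span records: the field names mirror the console output, the nearest-rank percentile is one of several common definitions, and backends may interpolate differently.

```python
from collections import defaultdict
from typing import Dict, List

# Illustrative agent spans from two runs.
spans = [
    {"run_id": "run_a", "role": "researcher", "duration_ms": 100, "cost_usd": 0.001},
    {"run_id": "run_a", "role": "writer", "duration_ms": 50, "cost_usd": 0.002},
    {"run_id": "run_b", "role": "researcher", "duration_ms": 400, "cost_usd": 0.004},
    {"run_id": "run_b", "role": "writer", "duration_ms": 60, "cost_usd": 0.001},
    {"run_id": "run_b", "role": "researcher", "duration_ms": 200, "cost_usd": 0.003},
]


def percentile(values: List[int], p: int) -> int:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


durations: Dict[str, List[int]] = defaultdict(list)
cost_per_run: Dict[str, float] = defaultdict(float)
for s in spans:
    durations[s["role"]].append(s["duration_ms"])
    cost_per_run[s["run_id"]] += s["cost_usd"]

p95_by_role = {role: percentile(vals, 95) for role, vals in durations.items()}
print(p95_by_role)
print({run: round(cost, 4) for run, cost in cost_per_run.items()})
```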

9. Hands-On Lab

MESHFLOW_MOCK=1 python3 hands_on/18_opentelemetry.py

Observe the console span output. For each span, note:

  • The trace_id (same across all spans in one run)
  • The agent_id and role
  • The verdict (allowed vs. blocked)
  • The cost and token count

For the full Jaeger experience:

docker run -d -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one
MESHFLOW_MOCK=1 python3 hands_on/18_opentelemetry.py --jaeger
open http://localhost:16686

10. Summary

OpenTelemetry transforms MeshFlow runs into queryable distributed traces. Each run produces a root span; each agent, gate, and tool produces a child span with attributes for cost, tokens, role, verdict, and carbon. Enable console output with telemetry_console=True for local development. Export to a backend with telemetry_otlp_endpoint for production monitoring. Use result.trace_id to look up the full trace. Filter by meshflow.verdict=blocked to monitor safety events.


Exercises


Exercise 1: Run with Console Telemetry and Find All Span Types

Goal: Identify every distinct span type that MeshFlow emits by reading the console telemetry output.

Instructions:

  1. Run the hands-on script with console telemetry enabled:
   MESHFLOW_MOCK=1 python3 hands_on/18_opentelemetry.py

The script configures telemetry_console=True, which prints span data to stdout in a human-readable format.

  2. Read the complete output. Every block that begins with [SPAN] (or a similar marker) is one span. Copy the output into your notes.
  3. Go through every span and record its name field (this is the span type). Common span types you should find include:

     - meshflow.run: the root span covering the entire pipeline execution
     - meshflow.node: one per node execution
     - meshflow.gate: one per gate evaluation (HITL, content policy, etc.)
     - meshflow.ledger_write: one per ledger record written
     - meshflow.policy_check: one per policy evaluation

  4. For each span type, record:

     - The span name
     - Whether it has a parent span (and which span is the parent)
     - The key attributes attached to it (e.g., run_id, node_id, cost_usd, verdict)
     - The approximate duration in milliseconds

  5. How many total spans were emitted for a single pipeline run? Does this number match the number of nodes in the pipeline, or are there additional spans for ledger writes, policy checks, and gate evaluations?

Expected output: A complete inventory of span types with their names, parent relationships, key attributes, and durations — covering every span emitted during the run.


Exercise 2: Identify the trace_id and Confirm It Is the Same Across All Spans

Goal: Verify that all spans in a single run share one trace_id, confirming that they belong to the same logical trace.

Instructions:

  1. Run the script and capture the full telemetry output.
  2. Find the trace_id field in the root meshflow.run span. It should be a 32-character hexadecimal string, for example: 4bf92f3577b34da6a3ce929d0e0e4736.
  3. Now search every other span in the output for its trace_id field. Use a simple text search:
   MESHFLOW_MOCK=1 python3 hands_on/18_opentelemetry.py 2>&1 | grep trace_id
  4. Verify that every span has the same trace_id. Record the trace_id and the count of spans that share it.
  5. Now look at the span_id field. Every span should have a unique span_id. The root span's span_id should appear as the parent_span_id of all direct child spans.
  6. Draw a tree in your notes showing the parent-child span relationships. Use indentation to show nesting depth. The root span should be at the top; direct children one indent level below; grandchildren two levels below.
  7. If you run the script twice, do the two runs share the same trace_id? Why or why not?

Expected output: Confirmation that all N spans share one trace_id, a mapping of span_id to parent_span_id, and a hand-drawn span tree showing the nesting structure.


Exercise 3: Find the Blocked Span in the Output

Goal: Locate the span with verdict=blocked and understand what it represents.

Instructions:

  1. The hands-on script includes a pipeline stage that triggers a policy block, for example an agent that produces content flagged by a content classifier, or a node whose output exceeds a cost threshold. Run the script:
   MESHFLOW_MOCK=1 python3 hands_on/18_opentelemetry.py
  2. Search the output for the word "blocked":
   MESHFLOW_MOCK=1 python3 hands_on/18_opentelemetry.py 2>&1 | grep -i blocked
  3. Find the complete span that contains verdict=blocked. Record all of its attributes: span_id, parent_span_id, trace_id, node_id, agent_id, role, verdict, cost_usd, tokens, uncertainty, carbon_g, gate_result, and any others present.
  4. Answer the following questions about the blocked span:

     - Which node generated the blocked span? What was this node's role in the pipeline?
     - What attribute triggered the block? Was it the content, the cost, the uncertainty level, or something else?
     - Did the pipeline halt after the block, or did it continue on an alternate path?
     - Is there a subsequent span showing what happened after the block (e.g., a fallback node, a HITL escalation, or a rejection terminal)?

  5. In a production monitoring system (Jaeger, Grafana), you would set up an alert that fires whenever a span with verdict=blocked appears. Write the search you would use in the Jaeger UI, or the equivalent query in your backend's query language, to find all blocked spans from the last hour:
   # Jaeger search (conceptual)
   service: meshflow  tag: meshflow.verdict=blocked  lookback: 1h

Expected output: The complete attribute list of the blocked span, answers to all four questions about the block context, and a written alert query for a monitoring system.


Exercise 4: Design a Trace Query to Find All Runs Over $0.01

Goal: Write and explain a trace query that finds expensive runs using span attribute filtering.

Instructions:

  1. The meshflow.run root span has a cost_usd attribute recording the total cost of the entire pipeline run. You want to find all runs that cost more than $0.01.
  2. In Jaeger's UI (or the query language of your chosen backend), write the query:
   # Conceptual: Jaeger tag search cannot express ">"
   service: meshflow
   operation: meshflow.run
   tag: meshflow.cost_usd > 0.01
   lookback: 24h

Note: Jaeger's tag filter matches exact key=value pairs only. For range queries such as "greater than 0.01", use a backend that supports them (Grafana Tempo with TraceQL, or Honeycomb with its query builder).

  3. Write the equivalent query in three different backends:

     - Jaeger search UI: tag filter approach
     - Grafana Tempo TraceQL:

     { name = "meshflow.run" && span.meshflow.cost_usd > 0.01 }

     - Honeycomb (filter clause; the column name assumes the attribute is exported as meshflow.cost_usd):

     {"column": "meshflow.cost_usd", "op": ">", "value": 0.01}

  4. For each query, explain:

     - At what span level is cost_usd available? (root run span only, or also on individual node spans?)
     - Would the query return individual node spans or the root run span?
     - How would you drill down from the root span to find which node was the most expensive?

  5. Design an alert rule (in Prometheus alert syntax or plain English) that fires if more than 3 runs in the last 10 minutes exceeded $0.01:
   # Assumes a hypothetical gauge meshflow_run_cost_usd derived from spans, with one series per run
   alert: HighCostRunRate
   expr: count(max_over_time(meshflow_run_cost_usd[10m]) > 0.01) > 3
   for: 0m
   labels:
     severity: warning
   annotations:
     summary: "More than 3 runs exceeded $0.01 in the last 10 minutes"

Expected output: Written queries in three backends, explanations of span-level attribute availability, a drill-down strategy for finding the most expensive node, and a complete alert rule definition.
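The alert condition itself is simple enough to prototype before committing to a backend's rule syntax. The sketch below evaluates "more than 3 runs over $0.01 in the last 10 minutes" against a list of run records. The record fields (ended_at, cost_usd) and both function names are hypothetical, and the data is made up.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def expensive_runs(
    runs: List[Dict],
    now: datetime,
    window: timedelta = timedelta(minutes=10),
    threshold_usd: float = 0.01,
) -> List[Dict]:
    """Runs that ended inside the window and cost more than the threshold."""
    cutoff = now - window
    return [r for r in runs if r["ended_at"] >= cutoff and r["cost_usd"] > threshold_usd]


def should_alert(runs: List[Dict], now: datetime, max_runs: int = 3) -> bool:
    """Fire when more than max_runs expensive runs fall inside the window."""
    return len(expensive_runs(runs, now)) > max_runs


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    {"run_id": "r1", "ended_at": now - timedelta(minutes=1), "cost_usd": 0.020},
    {"run_id": "r2", "ended_at": now - timedelta(minutes=3), "cost_usd": 0.015},
    {"run_id": "r3", "ended_at": now - timedelta(minutes=5), "cost_usd": 0.011},
    {"run_id": "r4", "ended_at": now - timedelta(minutes=9), "cost_usd": 0.030},
    {"run_id": "r5", "ended_at": now - timedelta(minutes=30), "cost_usd": 0.050},  # outside window
    {"run_id": "r6", "ended_at": now - timedelta(minutes=2), "cost_usd": 0.002},   # under threshold
]

print([r["run_id"] for r in expensive_runs(runs, now)])
print(should_alert(runs, now))
```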


Exercise 5: Set Up Jaeger with Docker and Search for a Specific Run

Goal: Export real OTEL spans to a local Jaeger instance and use the Jaeger UI to find a specific run by trace_id.

Instructions:

  1. Start Jaeger using Docker (the all-in-one image includes the collector, query service, and UI):
   docker run -d --name jaeger \
     -p 16686:16686 \
     -p 4317:4317 \
     -p 4318:4318 \
     jaegertracing/all-in-one:latest

     - Port 16686: Jaeger UI
     - Port 4317: OTLP gRPC receiver
     - Port 4318: OTLP HTTP receiver

  2. Configure the hands-on script to send spans to Jaeger's OTLP HTTP endpoint, as in section 4:
   policy = Policy(
       telemetry_otlp_endpoint="http://localhost:4318",
   )

Or set the environment variable:

   export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
  3. Run the pipeline:
   MESHFLOW_MOCK=1 python3 hands_on/18_opentelemetry.py
  4. Note the trace_id printed in the console output (or the run_id from the pipeline output; they may differ, so use the trace_id from the OTEL data).
  5. Open the Jaeger UI in your browser: http://localhost:16686
  6. In the Jaeger UI:

     - Select service: meshflow (or whatever service name the script uses)
     - Click "Find Traces"
     - Click on the most recent trace to open the span waterfall view

  7. In the span waterfall, answer the following:

     - How many spans are shown for this trace?
     - Which span is the root (longest bar)?
     - Which spans ran in parallel (overlapping bars)?
     - Find the span with verdict=blocked (if present). What color does Jaeger use to highlight error spans?

  8. Use "Search by Tag" in Jaeger to find a run by a specific attribute:

     - Search for meshflow.run_id=<the run_id from the output>
     - Confirm that exactly one trace appears

  9. Stop and remove the Jaeger container when finished:
   docker stop jaeger && docker rm jaeger

Expected output: A description of the Jaeger UI span waterfall view for your run, answers to the four waterfall questions, confirmation that the run_id search returns exactly one trace, and a note about the parallel spans' visual representation.