Debugging, Traces, and Observability

Ch 07 · Agentic Workflows · 45 min
Trace reading · Gate forensics · Root cause analysis
Hands-on: MESHFLOW_MOCK=1 python3 hands_on/06_guardian_safety.py

Lesson Goal

This lesson teaches how to debug AI workflows. Agentic systems fail differently from ordinary code because failures can hide inside context, memory, tool calls, or model output. Traces make those failures visible.

By the end, you should know what to inspect when a workflow gives a bad answer, stops unexpectedly, or silently uses the wrong information.

1. Why AI Debugging Is Different

Traditional code often fails with a clear exception:

TypeError: expected string, got dict

AI workflows can fail more quietly:

  • The model used stale context.
  • A retrieved memory was irrelevant.
  • A tool returned incomplete data.
  • The model ignored an instruction.
  • The workflow skipped a required review.
  • A gate blocked for the right reason, but the stop was mistaken for a crash.
  • A final answer sounded good but was unsupported.

You need observability for both code behavior and reasoning inputs.

2. What A Trace Is

A trace is the execution record of a workflow run.

At minimum, a trace should show:

  • Run ID.
  • Node ID.
  • Node type.
  • Start and end time.
  • Inputs.
  • Dependencies.
  • Tool calls.
  • Produced artifacts.
  • Gate decisions.
  • Errors.
  • Final status.

In a governed AI system, the trace is not optional. It is how you explain what happened.
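The field list above can be sketched as a small data structure. This is a minimal illustration, not the schema used by the course runner: the class names `NodeRecord` and `RunTrace` and every field name are assumptions chosen to mirror the bullets.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class NodeRecord:
    """One node execution inside a run (hypothetical schema)."""
    node_id: str
    node_type: str
    started_at: float
    ended_at: float
    inputs: dict[str, Any] = field(default_factory=dict)
    dependencies: list[str] = field(default_factory=list)
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)
    gate_decision: Optional[dict[str, Any]] = None
    error: Optional[str] = None


@dataclass
class RunTrace:
    """The execution record for one workflow run (hypothetical schema)."""
    run_id: str
    nodes: list[NodeRecord] = field(default_factory=list)
    final_status: str = "unknown"

    def to_json(self) -> str:
        # asdict() converts nested dataclasses into plain dicts and lists.
        return json.dumps(asdict(self), indent=2)


trace = RunTrace(run_id="run-001")
trace.nodes.append(
    NodeRecord("draft", "agent", started_at=0.0, ended_at=1.2, artifacts=["draft_text"])
)
trace.nodes.append(
    NodeRecord(
        "review_gate", "gate", started_at=1.2, ended_at=1.3,
        dependencies=["draft"],
        gate_decision={"passed": False, "reason": "approval missing"},
    )
)
trace.final_status = "blocked"
print(trace.to_json())
```

Keeping the gate decision and its reason on the node record means a blocked run can be explained from the trace alone.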

3. Debugging Questions

When something goes wrong, ask these in order:

  1. Did the expected node run?
  2. Did it receive the right input?
  3. Did it use the right context?
  4. Did it retrieve the right memory?
  5. Did it call the right tool?
  6. Did the tool return valid data?
  7. Did the node produce the required artifact?
  8. Did a gate block the workflow?
  9. Was the block expected?
  10. Did the final answer depend on approved artifacts?

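This order prevents guessing: it checks execution first, then inputs, then outputs, and only then governance and the final answer.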
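One way to enforce the ordering is to encode the questions as a list of checks and stop at the first one that fails. The sketch below covers only a subset of the ten questions, and the node field names (`ran`, `input_ok`, `tool_output_valid`, `artifacts`, `gate_blocked`) are hypothetical.

```python
from typing import Any, Callable, Optional

# Each entry pairs a question with a predicate that returns True when the
# check passes. Field names are hypothetical, not a real trace schema.
CHECKS: list[tuple[str, Callable[[dict[str, Any]], bool]]] = [
    ("Did the expected node run?", lambda n: n.get("ran", False)),
    ("Did it receive the right input?", lambda n: n.get("input_ok", False)),
    ("Did the tool return valid data?", lambda n: n.get("tool_output_valid", False)),
    ("Did the node produce the required artifact?", lambda n: bool(n.get("artifacts"))),
    ("Did a gate block the workflow?", lambda n: not n.get("gate_blocked", False)),
]


def first_failed_check(node: dict[str, Any]) -> Optional[str]:
    """Return the first question whose check fails, or None if all pass."""
    for question, passes in CHECKS:
        if not passes(node):
            return question
    return None


node = {"ran": True, "input_ok": True, "tool_output_valid": False, "artifacts": []}
print(first_failed_check(node))
```

Here the missing artifact is a consequence of the invalid tool output, and the ordered checks surface the earlier cause first.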

4. Common Failure Types

| Failure          | Symptom                   | What To Inspect                  |
|------------------|---------------------------|----------------------------------|
| Missing artifact | Later node cannot run     | Node contracts and dependencies  |
| Wrong context    | Answer uses wrong facts   | Rendered prompt or node input    |
| Stale memory     | Old decision appears      | Memory retrieval and timestamps  |
| Tool failure     | Empty or malformed output | Tool input, output, error        |
| Gate block       | Workflow stops early      | Gate condition and approval record |
| Hallucination    | Unsupported claim         | Evidence artifacts and citations |
| Looping          | Repeated steps            | Stop condition and retry count   |
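As an illustration of the looping failure, the sketch below counts how often each node appears in a run and flags any that exceed a limit. It assumes the trace can be reduced to an ordered list of executed node IDs; the function name and the `max_runs` default are hypothetical.

```python
from collections import Counter


def find_looping_nodes(executed_node_ids: list[str], max_runs: int = 3) -> dict[str, int]:
    """Return nodes that executed more than max_runs times, with their counts."""
    counts = Counter(executed_node_ids)
    return {node_id: n for node_id, n in counts.items() if n > max_runs}


executed = ["plan", "search", "draft", "search", "draft", "search", "draft", "search"]
looping = find_looping_nodes(executed, max_runs=3)
print(looping)
```

A non-empty result is a prompt to inspect the stop condition and retry count for that node.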

5. Hands-On: Find The Blocked Gate

Run:

python3 -m src.mini_meshflow run examples/03_agent_with_gate.json

Then answer:

  • Which node blocked?
  • What reason did it give?
  • Which nodes ran before the block?
  • Which node did not run after the block?
  • What value would allow execution to continue?

This is the simplest debugging exercise in the course.

6. Debugging Context

If a model gives the wrong answer, inspect the exact context it received.

Look for:

  • Missing instructions.
  • Contradictory facts.
  • Irrelevant retrieved documents.
  • Tool results that were not included.
  • Artifacts with confusing names.
  • Too much old conversation history.

Beginner rule: do not debug the model first. Debug the assembled context first.
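A first pass over the assembled context can be automated. The sketch below assumes you can capture the exact prompt text sent to the model; it checks that required snippets are present and that the context stays within a size budget. The function name, the snippet list, and the character budget are illustrative assumptions.

```python
def audit_context(context: str, required_snippets: list[str], max_chars: int = 8000) -> list[str]:
    """Return human-readable findings about the assembled context."""
    findings = []
    for snippet in required_snippets:
        # A missing instruction or tool result is a common silent failure.
        if snippet not in context:
            findings.append(f"missing: {snippet!r}")
    if len(context) > max_chars:
        findings.append(f"context is {len(context)} chars, over the {max_chars} budget")
    return findings


context = "You are a reviewer.\nTool result: order_status=shipped"
findings = audit_context(context, ["Cite your sources", "order_status"])
print(findings)
```

Substring checks are crude, but they catch the common case where an instruction or tool result never reached the model.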

7. Debugging Memory

Memory problems usually come from retrieval quality.

Ask:

  • Was the memory stored correctly?
  • Was it indexed correctly?
  • Was the retrieval query specific enough?
  • Was the retrieved memory fresh?
  • Was it relevant to the current task?
  • Should it have been filtered out?

Bad memory can be worse than no memory because it can make the model confidently use old or irrelevant facts.
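The freshness and relevance questions can be turned into a filter applied after retrieval. This sketch assumes each memory is a dict with a `stored_at` timestamp and a retrieval `score`; both field names and the thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def filter_memories(
    memories: list[dict[str, Any]],
    now: datetime,
    max_age: timedelta,
    min_score: float,
) -> list[dict[str, Any]]:
    """Keep only memories that are both fresh enough and relevant enough."""
    kept = []
    for memory in memories:
        fresh = now - memory["stored_at"] <= max_age
        relevant = memory["score"] >= min_score
        if fresh and relevant:
            kept.append(memory)
    return kept


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
memories = [
    {"text": "Customer prefers email", "stored_at": now - timedelta(days=2), "score": 0.9},
    {"text": "Old shipping policy", "stored_at": now - timedelta(days=400), "score": 0.95},
    {"text": "Unrelated lunch order", "stored_at": now - timedelta(days=1), "score": 0.2},
]
kept = filter_memories(memories, now, max_age=timedelta(days=90), min_score=0.5)
print([m["text"] for m in kept])
```

The old shipping policy has the highest score but is dropped for age, which is the stale-memory case described above.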

8. Debugging Tools

For every tool call, log:

  • Tool name.
  • Tool version, if available.
  • Arguments.
  • Validation result.
  • Execution status.
  • Output.
  • Error message.
  • Duration.
  • Cost, if relevant.

Tool output should be treated as data, not truth. Validate it before relying on it.
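A thin wrapper around each tool call can capture most of the fields listed above. The sketch below is one possible shape, not the course runner's implementation; `call_tool_logged`, `lookup_order`, and the record keys are hypothetical names.

```python
import time
from typing import Any, Callable


def call_tool_logged(
    name: str,
    tool: Callable[..., Any],
    args: dict[str, Any],
    validate: Callable[[Any], bool],
    log: list[dict[str, Any]],
) -> dict[str, Any]:
    """Run a tool, record what happened, and append the record to log."""
    record: dict[str, Any] = {
        "tool": name, "arguments": args, "status": "ok",
        "output": None, "valid": None, "error": None,
    }
    start = time.perf_counter()
    try:
        record["output"] = tool(**args)
        # Validation is recorded separately from execution status.
        record["valid"] = validate(record["output"])
    except Exception as exc:
        record["status"] = "error"
        record["error"] = f"{type(exc).__name__}: {exc}"
    record["duration_s"] = time.perf_counter() - start
    log.append(record)
    return record


def lookup_order(order_id: str) -> dict[str, str]:
    # Hypothetical tool: returns an empty dict for unknown orders.
    orders = {"A1": {"status": "shipped"}}
    return orders.get(order_id, {})


call_log: list[dict[str, Any]] = []
entry = call_tool_logged(
    "lookup_order", lookup_order, {"order_id": "Z9"},
    validate=lambda out: "status" in out, log=call_log,
)
print(entry["status"], entry["valid"])
```

The call succeeds but validation fails, which is exactly the case that a status-only log would hide.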

9. Debugging Gates

When a gate blocks, ask:

  • Is the gate condition correct?
  • Is the approval value missing?
  • Did the quality score fail?
  • Is the threshold too strict?
  • Is the workflow missing a route for failure?
  • Should the workflow stop or revise?

A blocked gate is not automatically bad. It may be the safest outcome.
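A gate is easier to debug when it records why it decided as it did. The sketch below assumes a gate that combines an approval value with a quality score; the function name, the 0.8 threshold, and the reason strings are illustrative assumptions.

```python
from typing import Any, Optional


def evaluate_gate(approved: Optional[bool], quality_score: float, threshold: float = 0.8) -> dict[str, Any]:
    """Return a gate decision together with the reason for it."""
    if approved is None:
        return {"passed": False, "reason": "approval value missing"}
    if not approved:
        return {"passed": False, "reason": "approval denied"}
    if quality_score < threshold:
        return {
            "passed": False,
            "reason": f"quality score {quality_score} below threshold {threshold}",
        }
    return {"passed": True, "reason": "approved and quality score met threshold"}


print(evaluate_gate(None, 0.9))
print(evaluate_gate(True, 0.7))
print(evaluate_gate(True, 0.9))
```

Distinguishing a missing approval from a denied one answers the second question in the list directly from the trace.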

10. Observability For Production

Production systems usually need more than console output:

  • Structured logs.
  • Metrics.
  • Distributed traces.
  • Model input/output sampling.
  • Tool call dashboards.
  • Cost tracking.
  • Latency tracking.
  • Error rates by node.
  • Human review queues.
  • Audit exports.

The goal is not to collect everything forever. The goal is to collect enough to debug, improve, and prove what happened.
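Structured logs are what make metrics such as error rate by node cheap to compute. The sketch below writes one JSON object per event and derives the rate from those lines. The event fields (`node_id`, `status`) are assumptions, and an in-memory list stands in for a real log sink.

```python
import json
from collections import defaultdict


def log_event(sink: list[str], **fields) -> None:
    """Append one structured event as a JSON line."""
    sink.append(json.dumps(fields, sort_keys=True))


def error_rate_by_node(lines: list[str]) -> dict[str, float]:
    """Compute the fraction of events with status 'error' for each node."""
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for line in lines:
        event = json.loads(line)
        totals[event["node_id"]] += 1
        if event["status"] == "error":
            errors[event["node_id"]] += 1
    return {node_id: errors[node_id] / totals[node_id] for node_id in totals}


sink: list[str] = []
log_event(sink, run_id="r1", node_id="draft", status="ok")
log_event(sink, run_id="r2", node_id="draft", status="error")
log_event(sink, run_id="r1", node_id="review", status="ok")
rates = error_rate_by_node(sink)
print(rates)
```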

11. Evaluation vs Observability

Observability explains a single run.

Evaluation measures quality across many runs.

Examples of evaluation:

  • Accuracy score.
  • Citation coverage.
  • Policy pass rate.
  • Tool success rate.
  • Human approval rate.
  • Average cost per run.
  • Average time to completion.

Good teams use both. Traces explain incidents. Evaluations show trends.
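To make the contrast concrete, the sketch below aggregates a few of the listed metrics across several runs. The per-run fields (`policy_passed`, `human_approved`, `cost`) and the sample values are invented for illustration.

```python
from statistics import mean
from typing import Any


def summarize_runs(runs: list[dict[str, Any]]) -> dict[str, float]:
    """Aggregate per-run results into evaluation metrics."""
    return {
        "policy_pass_rate": mean(1.0 if r["policy_passed"] else 0.0 for r in runs),
        "human_approval_rate": mean(1.0 if r["human_approved"] else 0.0 for r in runs),
        "avg_cost": mean(r["cost"] for r in runs),
    }


# Illustrative data only; these are not results from a real workflow.
runs = [
    {"policy_passed": True, "human_approved": True, "cost": 0.5},
    {"policy_passed": True, "human_approved": False, "cost": 0.25},
    {"policy_passed": False, "human_approved": True, "cost": 0.25},
    {"policy_passed": True, "human_approved": False, "cost": 1.0},
]
summary = summarize_runs(runs)
print(summary)
```

A single trace would tell you why the third run failed policy; the summary tells you whether that is a one-off or a pattern.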

12. Debugging Practice

Choose one example workflow and intentionally break it:

  1. Remove a dependency.
  2. Rename an artifact.
  3. Set a gate to false.
  4. Use an unknown tool name.

Run the compile or run command and inspect the output:

python3 -m src.mini_meshflow compile examples/02_tools_and_memory.json
python3 -m src.mini_meshflow run examples/03_agent_with_gate.json

After each failure, write:

  • What failed?
  • Where did the trace point?
  • What would you change?
  • What test would catch this next time?
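For the last question, a small static check over the workflow definition can catch a removed dependency or an unknown tool name before anything runs. This sketch assumes a workflow shaped as a list of nodes with `id`, `depends_on`, and `tool` keys. That shape is a guess for illustration and may not match the example JSON files.

```python
from typing import Any


def validate_workflow(workflow: dict[str, Any], known_tools: set[str]) -> list[str]:
    """Return a list of problems found in a workflow definition."""
    problems = []
    node_ids = {node["id"] for node in workflow["nodes"]}
    for node in workflow["nodes"]:
        for dep in node.get("depends_on", []):
            if dep not in node_ids:
                problems.append(f"{node['id']}: unknown dependency {dep!r}")
        tool = node.get("tool")
        if tool is not None and tool not in known_tools:
            problems.append(f"{node['id']}: unknown tool {tool!r}")
    return problems


# Hypothetical workflow with two deliberate breaks.
workflow = {
    "nodes": [
        {"id": "a"},
        {"id": "b", "depends_on": ["a", "missing"], "tool": "web_serach"},
    ]
}
problems = validate_workflow(workflow, known_tools={"web_search"})
print(problems)
```

Wrapping a check like this in a unit test turns each deliberate break into a regression test.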

13. Common Beginner Mistakes

Mistake 1: Blaming the LLM immediately.

Correction: First inspect context, tools, memory, and artifacts.

Mistake 2: Logging only final answers.

Correction: Log node-level inputs, outputs, and decisions.

Mistake 3: Treating blocked gates as crashes.

Correction: Read the gate reason.

Mistake 4: Hiding tool calls.

Correction: Make tool input and output visible in traces.

Mistake 5: Evaluating by vibes.

Correction: Create rubrics and measure repeat runs.

14. Summary

Debugging AI workflows means reconstructing what the system saw, did, produced, and decided. Traces make that possible. Observability turns agentic systems from black boxes into inspectable engineering systems.


Exercises

Exercise 1: Find The Blocked Gate

Run the gated workflow with the gate set to false:

python3 -m src.mini_meshflow run examples/03_agent_with_gate.json

Answer each question:

  • Which node blocked?
  • What reason was recorded in the trace?
  • Which node did not run because of the block?
  • What would you do first to investigate if this block was unexpected?

Exercise 2: Trace Checklist

Create a checklist of fields you would want in a production AI workflow trace. Group your fields into two sections: run-level fields (one per run) and node-level fields (one per node execution).

Use this starter:

Run level:
  [ ] run_id
  [ ] ...

Per-node:
  [ ] node_id
  [ ] ...

Aim for at least six fields in each section. Afterwards, compare your checklist against answers.md.

Exercise 3: Read An Audit File

The repository contains audit run files in the root directory:

audit_run_480c.json
audit_run_5e9b.json
audit_run_97da.json
audit_run_e2d6.json

Open one of these files and answer:

  • How many steps were recorded?
  • What was the final status?
  • Find one step that produced an artifact. What was the artifact name?
  • Is there a cost field? If yes, what was the total cost?
  • Could you reproduce this run from the information in the file?

This exercise builds the habit of reading traces as a primary debugging tool, not as a secondary log you check only when something breaks.
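If you are unsure where to start in an unfamiliar audit file, a schema-agnostic summary of its top-level structure can help. The sketch below makes no assumption about the fields inside the audit files; it only reports keys, types, and sizes. The sample dict is invented and is not the real audit format.

```python
import json
from typing import Any


def describe_top_level(data: Any) -> dict[str, str]:
    """Summarize the top-level keys of parsed JSON without assuming a schema."""
    if isinstance(data, list):
        return {"<root>": f"list of {len(data)} items"}
    if not isinstance(data, dict):
        return {"<root>": repr(data)}
    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} of {len(value)} items"
        else:
            summary[key] = repr(value)
    return summary


# Invented sample; the real audit files may look different.
sample = json.loads('{"run_id": "x", "steps": [1, 2, 3], "status": "done"}')
print(describe_top_level(sample))

# To inspect a real file, load it first:
# with open("audit_run_480c.json") as f:
#     print(describe_top_level(json.load(f)))
```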

Exercise 4: Instrument A Workflow Step

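Open ../../examples/02_tools_and_memory.json and write a note for one node describing exactly what you would log for that step. Standard JSON has no comment syntax, so keep the note in a separate file or a scratch copy rather than in the file the runner loads. Include: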

  • What inputs you would capture
  • What output you would capture
  • What failure mode would produce a confusing trace if not logged
  • One piece of context a human reviewer would need to understand the trace

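You do not need to modify the runner. Just annotate the node and explain your reasoning. Because standard JSON does not allow comments, keep the annotations in a separate notes file or a scratch copy so the original file still parses.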

Exercise 5: Design A Debug Session

You run a workflow and the final artifact is empty. Plan your debugging session in five steps using only the trace:

Step 1: Look at ___
Step 2: If ___ then investigate ___
Step 3: Trace back to ___
Step 4: Check ___
Step 5: Fix ___ and re-run

This exercise is about building a systematic debugging habit. The fastest debuggers do not guess — they follow the data lineage backwards from the bad output to its root cause.
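Following the lineage backwards can be sketched as a walk over dependencies: starting from the node with the bad output, keep moving to an upstream node whose output is also empty until none remains. The node structure (`depends_on`, `output`) and the function name are hypothetical.

```python
from typing import Any


def find_root_cause(nodes: dict[str, dict[str, Any]], bad_node_id: str) -> str:
    """Walk upstream from a bad node to the earliest node with empty output."""
    current = bad_node_id
    visited: set[str] = set()
    while current not in visited:
        visited.add(current)
        empty_parents = [
            dep for dep in nodes[current]["depends_on"] if not nodes[dep]["output"]
        ]
        if not empty_parents:
            # All inputs were fine, so the problem starts at this node.
            return current
        current = empty_parents[0]
    raise ValueError("dependency cycle detected while tracing lineage")


nodes = {
    "fetch": {"depends_on": [], "output": ""},
    "summarize": {"depends_on": ["fetch"], "output": ""},
    "report": {"depends_on": ["summarize"], "output": ""},
}
print(find_root_cause(nodes, "report"))
```

In this example the empty report traces back to `fetch`, so re-running only the report node would not fix anything.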