Debugging, Traces, and Observability
Lesson 07: Debugging, Traces, And Observability
Lesson Goal
This lesson teaches how to debug AI workflows. Agentic systems fail differently from ordinary code because failures can hide inside context, memory, tool calls, or model output. Traces make those failures visible.
By the end, you should know what to inspect when a workflow gives a bad answer, stops unexpectedly, or silently uses the wrong information.
1. Why AI Debugging Is Different
Traditional code often fails with a clear exception:
TypeError: expected string, got dict
AI workflows can fail more quietly:
- The model used stale context.
- A retrieved memory was irrelevant.
- A tool returned incomplete data.
- The model ignored an instruction.
- The workflow skipped a required review.
- A gate blocked for the wrong reason, or blocked correctly but looked like a crash.
- A final answer sounded good but was unsupported.
You need observability for both code behavior and reasoning inputs.
2. What A Trace Is
A trace is the execution record of a workflow run.
At minimum, a trace should show:
- Run ID.
- Node ID.
- Node type.
- Start and end time.
- Inputs.
- Dependencies.
- Tool calls.
- Produced artifacts.
- Gate decisions.
- Errors.
- Final status.
In a governed AI system, the trace is not optional. It is how you explain what happened.
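The field list above can be sketched as a pair of record types. This is a minimal illustration, not the schema used by mini_meshflow: the class names `RunTrace` and `NodeRecord`, the field names, and the sample values are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class NodeRecord:
    """One node execution inside a run (hypothetical schema)."""
    node_id: str
    node_type: str
    started_at: float
    ended_at: float
    inputs: dict[str, Any] = field(default_factory=dict)
    dependencies: list[str] = field(default_factory=list)
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    artifacts: list[str] = field(default_factory=list)
    gate_decision: Optional[str] = None  # e.g. "passed" or "blocked"
    error: Optional[str] = None


@dataclass
class RunTrace:
    """The run-level record that owns the node records."""
    run_id: str
    nodes: list[NodeRecord] = field(default_factory=list)
    final_status: str = "unknown"

    def duration(self) -> float:
        """Span from the earliest node start to the latest node end."""
        if not self.nodes:
            return 0.0
        return max(n.ended_at for n in self.nodes) - min(n.started_at for n in self.nodes)


# Made-up run: one agent node, then a gate that blocks.
trace = RunTrace(run_id="run-001")
trace.nodes.append(
    NodeRecord("draft", "agent", started_at=0.0, ended_at=1.5, artifacts=["draft.md"])
)
trace.nodes.append(
    NodeRecord("review_gate", "gate", started_at=1.5, ended_at=1.6,
               dependencies=["draft"], gate_decision="blocked")
)
trace.final_status = "blocked"
print(trace.final_status, [n.node_id for n in trace.nodes])
```

Keeping run-level and node-level fields in separate types makes it harder to forget one of them when a new node type is added.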
3. Debugging Questions
When something goes wrong, ask these in order:
- Did the expected node run?
- Did it receive the right input?
- Did it use the right context?
- Did it retrieve the right memory?
- Did it call the right tool?
- Did the tool return valid data?
- Did the node produce the required artifact?
- Did a gate block the workflow?
- Was the block expected?
- Did the final answer depend on approved artifacts?
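This order follows the run from execution through inputs, tools, and gates to the final answer, so each question rules out an upstream cause before you look downstream. That prevents guessing.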
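One way to make the ordered questions mechanical is to run them as checks against a node record and stop at the first failure. The sketch below assumes a hypothetical node record stored as a plain dict; the keys (`status`, `inputs`, `tool_calls`, `artifacts`, `gate_decision`) are illustrative and cover only some of the questions.

```python
# Each check is (label, predicate). Order matters: earlier checks are upstream.
CHECKS = [
    ("node ran", lambda n: n.get("status") != "skipped"),
    ("input present", lambda n: bool(n.get("inputs"))),
    ("tool calls succeeded", lambda n: all(c.get("ok") for c in n.get("tool_calls", []))),
    ("artifact produced", lambda n: bool(n.get("artifacts"))),
    ("gate did not block", lambda n: n.get("gate_decision") != "blocked"),
]


def first_failure(node):
    """Return the label of the first failing check, or None if all pass."""
    for label, predicate in CHECKS:
        if not predicate(node):
            return label
    return None


# Made-up record: the tool failed, so no artifact was produced either.
node = {
    "status": "completed",
    "inputs": {"question": "What changed in the last release?"},
    "tool_calls": [{"name": "search", "ok": False}],
    "artifacts": [],
}
print(first_failure(node))
```

The record fails two checks, but the function reports the upstream one, which is usually the one worth fixing first.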
4. Common Failure Types
| Failure | Symptom | What To Inspect |
|---|---|---|
| Missing artifact | Later node cannot run | Node contracts and dependencies |
| Wrong context | Answer uses wrong facts | Rendered prompt or node input |
| Stale memory | Old decision appears | Memory retrieval and timestamps |
| Tool failure | Empty or malformed output | Tool input, output, error |
| Gate block | Workflow stops early | Gate condition and approval record |
| Hallucination | Unsupported claim | Evidence artifacts and citations |
| Looping | Repeated steps | Stop condition and retry count |
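As an illustration of the last row, looping can often be spotted by counting how many times each node appears in the execution order. This is a small sketch; the function name `repeated_nodes` and the `max_runs` limit are assumptions for the example, not part of any runner.

```python
from collections import Counter


def repeated_nodes(executed_node_ids, max_runs=3):
    """Return nodes that executed more than max_runs times, with their counts."""
    counts = Counter(executed_node_ids)
    return {node_id: count for node_id, count in counts.items() if count > max_runs}


# Made-up execution order taken from a trace.
order = ["plan", "search", "search", "search", "search", "answer"]
print(repeated_nodes(order))
```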
5. Hands-On: Find The Blocked Gate
Run:
python3 -m src.mini_meshflow run examples/03_agent_with_gate.json
Then answer:
- Which node blocked?
- What reason did it give?
- Which nodes ran before the block?
- Which node did not run after the block?
- What value would allow execution to continue?
This is the simplest debugging exercise in the course.
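If the trace is available as an ordered list of node records, the first four questions reduce to splitting that list at the blocked node. The sketch below uses invented records; the actual output of mini_meshflow may be shaped differently, and the node names and the `reason` text here are hypothetical.

```python
def split_at_block(records):
    """Return (ran_before, blocked_record, after_block) for ordered node records."""
    for index, record in enumerate(records):
        if record["status"] == "blocked":
            return records[:index], record, records[index + 1:]
    return records, None, []


# Made-up trace records, in execution order.
records = [
    {"node_id": "plan", "status": "completed"},
    {"node_id": "approval_gate", "status": "blocked", "reason": "approved is false"},
    {"node_id": "publish", "status": "not_run"},
]

before, blocked, after = split_at_block(records)
print("ran before:", [r["node_id"] for r in before])
print("blocked:", blocked["node_id"], "because", blocked["reason"])
print("did not run:", [r["node_id"] for r in after])
```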
6. Debugging Context
If a model gives the wrong answer, inspect the exact context it received.
Look for:
- Missing instructions.
- Contradictory facts.
- Irrelevant retrieved documents.
- Tool results that were not included.
- Artifacts with confusing names.
- Too much old conversation history.
Beginner rule: do not debug the model first. Debug the assembled context first.
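Two of the checks above, missing instructions and too much history, are easy to automate once the assembled context is available as named sections. The following is a rough sketch under that assumption; `audit_context`, the section names, and the sample text are invented for illustration.

```python
def audit_context(sections, required_phrases):
    """Report required phrases that are absent and each section's share of the text."""
    full_text = "\n".join(sections.values()).lower()
    missing = [p for p in required_phrases if p.lower() not in full_text]
    total = sum(len(text) for text in sections.values()) or 1  # avoid dividing by zero
    share = {name: round(len(text) / total, 2) for name, text in sections.items()}
    return {"missing_instructions": missing, "share_by_section": share}


# Made-up context: a short system prompt drowned out by conversation history.
sections = {
    "system": "Cite sources.",
    "history": "a" * 87,
}
report = audit_context(sections, ["cite sources", "respond in JSON"])
print(report)
```

A phrase match is a crude test, but it catches the common case where an instruction was dropped during prompt assembly.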
7. Debugging Memory
Memory problems usually come from retrieval quality.
Ask:
- Was the memory stored correctly?
- Was it indexed correctly?
- Was the retrieval query specific enough?
- Was the retrieved memory fresh?
- Was it relevant to the current task?
- Should it have been filtered out?
Bad memory can be worse than no memory because it can make the model confidently use old or irrelevant facts.
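The freshness and relevance questions can be turned into a filter that also records why each memory was dropped, so the decision shows up in the trace. This sketch assumes each memory carries a `stored_at` timestamp and a retrieval `score`; those field names and the thresholds are assumptions for the example.

```python
from datetime import datetime, timedelta, timezone


def filter_memories(memories, now, max_age_days=30, min_score=0.5):
    """Split memories into kept items and (id, reason) pairs for dropped items."""
    cutoff = now - timedelta(days=max_age_days)
    kept, dropped = [], []
    for memory in memories:
        if memory["stored_at"] < cutoff:
            dropped.append((memory["id"], "stale"))
        elif memory["score"] < min_score:
            dropped.append((memory["id"], "low relevance"))
        else:
            kept.append(memory)
    return kept, dropped


# Fixed "now" so the example is reproducible.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
memories = [
    {"id": "m1", "stored_at": datetime(2024, 5, 20, tzinfo=timezone.utc), "score": 0.9},
    {"id": "m2", "stored_at": datetime(2024, 1, 1, tzinfo=timezone.utc), "score": 0.95},
    {"id": "m3", "stored_at": datetime(2024, 5, 30, tzinfo=timezone.utc), "score": 0.2},
]
kept, dropped = filter_memories(memories, now)
print([m["id"] for m in kept], dropped)
```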
8. Debugging Tools
For every tool call, log:
- Tool name.
- Tool version, if available.
- Arguments.
- Validation result.
- Execution status.
- Output.
- Error message.
- Duration.
- Cost, if relevant.
Tool output should be treated as data, not truth. Validate it before relying on it.
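A thin wrapper around each tool call is one way to capture most of the fields listed above and to validate output before it is used. The sketch below is illustrative only: `call_tool`, the record keys, and the two sample tools are hypothetical, and version and cost are left out for brevity.

```python
import time


def call_tool(name, fn, args, validate, log):
    """Run a tool, record what happened, and append the record to log."""
    record = {"tool": name, "arguments": args, "status": "ok",
              "output": None, "valid": None, "error": None}
    start = time.perf_counter()
    try:
        record["output"] = fn(**args)
        # Output is data, not truth: check it before anything relies on it.
        record["valid"] = bool(validate(record["output"]))
    except Exception as exc:
        record["status"] = "error"
        record["error"] = f"{type(exc).__name__}: {exc}"
    record["duration_s"] = time.perf_counter() - start
    log.append(record)
    return record


def fake_weather(city):
    """Stand-in tool that succeeds but returns incomplete data."""
    return {"city": city, "temp_c": None}


def broken_tool(**kwargs):
    """Stand-in tool that always fails."""
    raise TimeoutError("upstream timed out")


log = []
has_temp = lambda out: out.get("temp_c") is not None
incomplete = call_tool("weather", fake_weather, {"city": "Oslo"}, has_temp, log)
failed = call_tool("weather", broken_tool, {"city": "Oslo"}, has_temp, log)
print(incomplete["status"], incomplete["valid"], failed["error"])
```

The first call shows why validation is logged separately from execution status: the tool ran without error but returned data that should not be trusted.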
9. Debugging Gates
When a gate blocks, ask:
- Is the gate condition correct?
- Is the approval value missing?
- Did the quality score fail?
- Is the threshold too strict?
- Is the workflow missing a route for failure?
- Should the workflow stop or revise?
A blocked gate is not automatically bad. It may be the safest outcome.
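A gate is much easier to debug when it returns a reason along with its decision. The sketch below assumes a gate that depends on an approval flag and a quality score; the function name, the 0.8 threshold, and the reason strings are assumptions made for this example.

```python
def evaluate_gate(approved, quality_score, threshold=0.8):
    """Return a decision plus a human-readable reason for the trace."""
    if approved is None:
        return {"decision": "blocked", "reason": "approval value missing"}
    if not approved:
        return {"decision": "blocked", "reason": "approval denied"}
    if quality_score < threshold:
        return {
            "decision": "blocked",
            "reason": f"quality score {quality_score} below threshold {threshold}",
        }
    return {"decision": "passed", "reason": "approved and score meets threshold"}


# Three made-up cases: missing approval, low score, and a clean pass.
for approved, score in [(None, 0.9), (True, 0.7), (True, 0.9)]:
    print(approved, score, evaluate_gate(approved, score))
```

Distinguishing "missing" from "denied" matters: the first usually points to a wiring problem, the second to a deliberate human decision.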
10. Observability For Production
Production systems usually need more than console output:
- Structured logs.
- Metrics.
- Distributed traces.
- Model input/output sampling.
- Tool call dashboards.
- Cost tracking.
- Latency tracking.
- Error rates by node.
- Human review queues.
- Audit exports.
The goal is not to collect everything forever. The goal is to collect enough to debug, improve, and prove what happened.
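Structured logs are the usual first step beyond console output. A minimal sketch using only the Python standard library follows; the `JsonFormatter` class, the `fields` convention, and the field names are assumptions for this example, and a real deployment would write to stdout or a log shipper rather than an in-memory buffer.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Extra structured fields are passed via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()  # in-memory sink so the example is self-contained
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("workflow_trace_example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output out of the root logger

logger.info(
    "node finished",
    extra={"fields": {"run_id": "run-001", "node_id": "draft", "duration_ms": 1500}},
)

line = json.loads(stream.getvalue())
print(line)
```

Because every line is a JSON object with a `run_id` and `node_id`, the same logs can later feed per-node error rates and latency tracking.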
11. Evaluation vs Observability
Observability explains a single run.
Evaluation measures quality across many runs.
Examples of evaluation:
- Accuracy score.
- Citation coverage.
- Policy pass rate.
- Tool success rate.
- Human approval rate.
- Average cost per run.
- Average time to completion.
Good teams use both. Traces explain incidents. Evaluations show trends.
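To make the distinction concrete, evaluation metrics are aggregates computed over many run records. The sketch below uses four invented runs; the field names and numbers are made up for illustration and do not come from any real workflow.

```python
from statistics import mean


def summarize(runs):
    """Aggregate per-run records into trend metrics."""
    total_calls = sum(r["tool_calls"] for r in runs)
    total_failures = sum(r["tool_failures"] for r in runs)
    return {
        "policy_pass_rate": mean(1 if r["policy_passed"] else 0 for r in runs),
        "tool_success_rate": (total_calls - total_failures) / total_calls if total_calls else None,
        "avg_cost_usd": mean(r["cost_usd"] for r in runs),
        "avg_seconds": mean(r["seconds"] for r in runs),
    }


# Made-up run records, one dict per run.
runs = [
    {"run_id": "a", "policy_passed": True, "tool_calls": 4, "tool_failures": 0, "cost_usd": 0.02, "seconds": 10},
    {"run_id": "b", "policy_passed": True, "tool_calls": 2, "tool_failures": 1, "cost_usd": 0.01, "seconds": 6},
    {"run_id": "c", "policy_passed": True, "tool_calls": 3, "tool_failures": 0, "cost_usd": 0.03, "seconds": 12},
    {"run_id": "d", "policy_passed": False, "tool_calls": 1, "tool_failures": 1, "cost_usd": 0.02, "seconds": 4},
]
summary = summarize(runs)
print(summary)
```

A single trace would explain why run "d" failed its policy check; the summary shows how often that happens.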
12. Debugging Practice
Choose one example workflow and intentionally break it:
- Remove a dependency.
- Rename an artifact.
- Set a gate to false.
- Use an unknown tool name.
After each change, run the compile or run command and inspect the output:
python3 -m src.mini_meshflow compile examples/02_tools_and_memory.json
python3 -m src.mini_meshflow run examples/03_agent_with_gate.json
After each failure, write:
- What failed?
- Where did the trace point?
- What would you change?
- What test would catch this next time?
13. Common Beginner Mistakes
Mistake 1: Blaming the LLM immediately.
Correction: First inspect context, tools, memory, and artifacts.
Mistake 2: Logging only final answers.
Correction: Log node-level inputs, outputs, and decisions.
Mistake 3: Treating blocked gates as crashes.
Correction: Read the gate reason.
Mistake 4: Hiding tool calls.
Correction: Make tool input and output visible in traces.
Mistake 5: Evaluating by vibes.
Correction: Create rubrics and measure repeat runs.
14. Summary
Debugging AI workflows means reconstructing what the system saw, did, produced, and decided. Traces make that possible. Observability turns agentic systems from black boxes into inspectable engineering systems.
Exercises
Exercise 1: Find The Blocked Gate
Run the gated workflow with the gate set to false:
python3 -m src.mini_meshflow run examples/03_agent_with_gate.json
Answer each question:
- Which node blocked?
- What reason was recorded in the trace?
- Which node did not run because of the block?
- What would you do first to investigate if this block was unexpected?
Exercise 2: Trace Checklist
Create a checklist of fields you would want in a production AI workflow trace. Group your fields into two sections: run-level fields (one per run) and node-level fields (one per node execution).
Use this starter:
Run level:
[ ] run_id
[ ] ...
Node level:
[ ] node_id
[ ] ...
Aim for at least six fields in each section. Compare against answers.md after.
Exercise 3: Read An Audit File
The repository contains audit run files in the root directory:
audit_run_480c.json
audit_run_5e9b.json
audit_run_97da.json
audit_run_e2d6.json
Open one of these files and answer:
- How many steps were recorded?
- What was the final status?
- Find one step that produced an artifact. What was the artifact name?
- Is there a cost field? If yes, what was the total cost?
- Could you reproduce this run from the information in the file?
This exercise builds the habit of reading traces as a primary debugging tool, not as a secondary log you check only when something breaks.
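Before answering the questions, it can help to print an outline of the file without assuming anything about its schema. The helper below is a generic sketch: `outline` is a hypothetical name, and the demo runs on a made-up file because the structure of the real audit_run_*.json files is not described here and may differ.

```python
import json
import tempfile
from pathlib import Path


def outline(path):
    """Summarize the top-level keys of a JSON file without assuming a schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        return {"<root>": type(data).__name__}
    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} items"
        else:
            summary[key] = repr(value)
    return summary


# Demo on an invented file; point outline() at a real audit file to explore it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_audit.json"
    sample.write_text(
        json.dumps({"status": "completed", "steps": [{"id": 1}, {"id": 2}]}),
        encoding="utf-8",
    )
    result = outline(sample)

print(result)
```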
Exercise 4: Instrument A Workflow Step
Open ../../examples/02_tools_and_memory.json and add a comment or note to one node describing exactly what you would log for that step. Include:
- What inputs you would capture
- What output you would capture
- What failure mode would produce a confusing trace if not logged
- One piece of context a human reviewer would need to understand the trace
You do not need to modify the runner. Because standard JSON has no comment syntax, keep your annotations in a separate notes file or a scratch copy of the JSON, and explain your reasoning.
Exercise 5: Design A Debug Session
You run a workflow and the final artifact is empty. Plan your debugging session in five steps using only the trace:
Step 1: Look at ___
Step 2: If ___ then investigate ___
Step 3: Trace back to ___
Step 4: Check ___
Step 5: Fix ___ and re-run
This exercise is about building a systematic debugging habit. The fastest debuggers do not guess; they follow the data lineage backward from the bad output to its root cause.
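Following lineage backward can be sketched as a walk from the empty artifact to the node that produced it, then to that node's inputs, until a node is found whose inputs were fine but whose output was not. The maps and names below (`produced_by`, `consumes`, `contents`, and the node and artifact names) are hypothetical and stand in for whatever the trace records.

```python
def find_root_cause(artifact, produced_by, consumes, contents):
    """Walk upstream from a bad artifact; return (suspect_node, nodes_visited)."""
    path = []
    seen = set()
    current = artifact
    while current in produced_by and current not in seen:
        seen.add(current)  # guard against cycles in the lineage
        node = produced_by[current]
        path.append(node)
        empty_inputs = [a for a in consumes.get(node, []) if not contents.get(a)]
        if not empty_inputs:
            # Inputs were present, so this node is where the data went missing.
            return node, path
        current = empty_inputs[0]
    # Ran out of producers: the empty value came from outside the workflow.
    return None, path


# Made-up lineage: query -> search -> summarize -> write_report.
produced_by = {"final_report": "write_report", "summary": "summarize", "search_results": "search"}
consumes = {"write_report": ["summary"], "summarize": ["search_results"], "search": ["query"]}
contents = {"final_report": "", "summary": "", "search_results": [], "query": "gate failures"}

suspect, visited = find_root_cause("final_report", produced_by, consumes, contents)
print(suspect, visited)
```

In this invented run the report and summary are empty only because the search returned nothing, so the search node is the place to start.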