Traces and Spans in OpenTelemetry
A simple mental model for traces and spans in distributed systems.
The basic idea
When a request travels through multiple services, it leaves a trace. A trace is made of spans.
Trace: GET /checkout
├─ Span: API Gateway
├─ Span: Auth service
├─ Span: Cart service
├─ Span: Payment service
└─ Span: Database query
Each span is one unit of work with a start, an end, and context.
Why it matters
Logs tell you what happened. Metrics tell you how often. Traces tell you where the request spent time and how work moved across services.
In practice, this is what makes bottlenecks and hidden dependencies visible.
Span structure
Typical fields include:
name, start_time, end_time, attributes, events, parent_span_id, trace_id, span_id
Example:
{
  "name": "HTTP GET /users/:id",
  "duration_ms": 120,
  "trace_id": "3f7c1a...",
  "span_id": "a91bd2..."
}
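As a rough sketch, the fields listed above can be modeled as a small record. The names Span, new_trace_id, new_span_id, and duration_ms, along with the sample timestamps and attribute, are illustrative assumptions and not the OpenTelemetry SDK's own types. The ID sizes follow OpenTelemetry's convention of a 16-byte trace ID and an 8-byte span ID.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record; not the OpenTelemetry SDK class."""
    name: str
    trace_id: str                          # shared by every span in one trace
    span_id: str                           # unique per span
    parent_span_id: Optional[str] = None   # None marks the root span
    start_time: float = 0.0                # seconds
    end_time: float = 0.0                  # seconds
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)

    def duration_ms(self) -> float:
        # Duration is derived from start and end, not stored separately.
        return (self.end_time - self.start_time) * 1000.0


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)   # 8 random bytes -> 16 hex characters


# Two spans in the same trace: the child points at the root via parent_span_id.
trace_id = new_trace_id()
root = Span("GET /checkout", trace_id, new_span_id(),
            start_time=100.0, end_time=100.5)
child = Span("Payment service", trace_id, new_span_id(),
             parent_span_id=root.span_id,
             start_time=100.1, end_time=100.22,
             attributes={"payment.provider": "example"})
```

The parent_span_id link is what lets a backend rebuild the tree shown earlier from a flat list of spans.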
Context propagation
A trace works only if every service forwards context:
trace_id, span_id, parent_span_id
Without propagation, a distributed request becomes fragmented telemetry.
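One common carrier for this context is the W3C Trace Context traceparent HTTP header, which has the form version-trace_id-parent_id-flags. In that format the parent_id field carries the caller's span_id, which the receiving service uses as the parent_span_id of its own span. The sketch below is a simplified illustration with hypothetical helper names (inject, extract) and sample IDs; it handles only version 00 and ignores the tracestate header.

```python
import re

# Version "00", 32-hex trace id, 16-hex parent id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the caller's context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Return (trace_id, parent_span_id), or None if the header is missing or malformed.

    Simplification: assumes a lowercase header key; real HTTP header names
    are case-insensitive.
    """
    match = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, _flags = match.groups()
    # All-zero ids are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return trace_id, parent_span_id


# Caller side: sample ids chosen for illustration.
caller_trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
caller_span_id = "00f067aa0ba902b7"
outgoing = {}
inject(outgoing, caller_trace_id, caller_span_id)

# Callee side: the caller's span_id arrives as the parent for the new span.
context = extract(outgoing)
```

If extract returns None, the callee has no parent to attach to and would start a new trace, which is the fragmentation described above.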
Automatic vs manual instrumentation
Automatic instrumentation gives fast coverage for HTTP clients, DB calls, and frameworks, but it misses business intent.
The most valuable spans are often manual:
validate-order, reserve-inventory, apply-fraud-check
These spans connect system behavior to domain behavior.
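In OpenTelemetry's Python API, a manual span is typically opened with tracer.start_as_current_span(...). To stay dependency-free, the sketch below uses a stand-in span context manager built on the standard library instead. The names span, FINISHED, and checkout are hypothetical, and the business steps are placeholders. It is meant only to show how manual business spans nest under a parent span.

```python
import contextvars
import secrets
import time
from contextlib import contextmanager

# Tracks which span is "current" so new spans can find their parent.
_current_span_id = contextvars.ContextVar("current_span_id", default=None)
FINISHED = []  # finished spans, in completion order


@contextmanager
def span(name, **attributes):
    """Stand-in for a tracer: records one span around the with-block."""
    span_id = secrets.token_hex(8)
    parent_id = _current_span_id.get()
    token = _current_span_id.set(span_id)
    start = time.perf_counter()
    try:
        yield attributes  # callers may add attributes while the span is open
    finally:
        _current_span_id.reset(token)  # restore the parent as current
        FINISHED.append({
            "name": name,
            "span_id": span_id,
            "parent_span_id": parent_id,
            "duration_ms": (time.perf_counter() - start) * 1000.0,
            "attributes": attributes,
        })


def checkout(order):
    # One parent span for the user flow, one child span per business step.
    with span("checkout", order_id=order["id"]):
        with span("validate-order") as attrs:
            attrs["item_count"] = len(order["items"])
        with span("reserve-inventory"):
            pass  # placeholder for the real inventory call
        with span("apply-fraud-check") as attrs:
            attrs["fraud.flagged"] = False  # placeholder result


checkout({"id": "order-1", "items": ["book", "pen"]})
```

Naming spans after domain steps rather than transport calls is what makes the resulting trace readable as a description of the checkout flow.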
For the next iteration, I want to add one manual business span per critical user flow and compare trace readability before and after.