Note

Traces and Spans in OpenTelemetry

A simple mental model for traces and spans in distributed systems.

The basic idea

When a request travels through multiple services, it leaves a trace: the record of that one request's path through the system. A trace is made of spans.

Trace: GET /checkout
├─ Span: API Gateway
├─ Span: Auth service
├─ Span: Cart service
├─ Span: Payment service
└─ Span: Database query

Each span is one unit of work with a start, an end, and context: the IDs that tie it to its trace and to its parent span.
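To make the mental model concrete, here is a minimal sketch in plain Python (no OpenTelemetry SDK) that rebuilds a trace tree from a flat list of spans by following parent IDs. The Span class, the span IDs, and the timings are all made up for illustration, and the nesting shown is just one plausible shape for the checkout request.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """Illustrative span: just enough fields to rebuild the tree."""
    name: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    start_ms: int
    end_ms: int


def render_tree(spans: List[Span]) -> List[str]:
    """Return indented lines, each child under its parent, ordered by start time."""
    children: Dict[Optional[str], List[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_span_id, []).append(s)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in sorted(children.get(parent_id, []), key=lambda x: x.start_ms):
            lines.append(f"{'  ' * depth}{s.name} ({s.end_ms - s.start_ms} ms)")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical spans for one request; IDs and timings are invented.
spans = [
    Span("API Gateway", "a1", None, 0, 120),
    Span("Auth service", "b2", "a1", 5, 25),
    Span("Payment service", "c3", "a1", 30, 110),
    Span("Database query", "d4", "c3", 40, 90),
]
tree_lines = render_tree(spans)
print("\n".join(tree_lines))
```

The point of the sketch is that spans arrive as a flat list; the tree only exists because each span names its parent.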

Why it matters

Logs tell you what happened. Metrics tell you how often. Traces tell you where the request spent time and how work moved across services.

In practice, this is what makes bottlenecks and hidden dependencies visible.

Span structure

Typical fields include:

  1. name
  2. start_time
  3. end_time
  4. attributes
  5. events
  6. parent_span_id
  7. trace_id
  8. span_id

Example (simplified; duration_ms is shorthand for end_time minus start_time):

{
  "name": "HTTP GET /users/:id",
  "duration_ms": 120,
  "trace_id": "3f7c1a...",
  "span_id": "a91bd2..."
}
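The field list above can be sketched as a small Python record. This is not the OpenTelemetry SDK's span type, only an illustration: the SpanRecord name, the placeholder IDs, the nanosecond timestamps, and the attribute shown are assumptions. Duration is derived from the two timestamps rather than stored.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SpanRecord:
    """Illustrative span record mirroring the typical fields listed above."""
    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span of a trace
    start_time: int                # assumed unit: nanoseconds since the Unix epoch
    end_time: int
    attributes: Dict[str, Any] = field(default_factory=dict)
    events: List[Dict[str, Any]] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        # Derived value: not stored, always consistent with the timestamps.
        return (self.end_time - self.start_time) / 1_000_000


start = 1_700_000_000_000_000_000  # arbitrary example timestamp
span = SpanRecord(
    name="HTTP GET /users/:id",
    trace_id="trace-placeholder",  # placeholder, not a real 128-bit ID
    span_id="span-placeholder",    # placeholder, not a real 64-bit ID
    parent_span_id=None,
    start_time=start,
    end_time=start + 120_000_000,  # 120 ms later
    attributes={"http.route": "/users/:id"},
)
print(span.name, span.duration_ms)
```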

Context propagation

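A trace works only if every service forwards the trace context on each outgoing call. In the W3C Trace Context format, that context carries:

  1. trace_id
  2. span_id of the calling span
  3. trace flags (for example, the sampling decision)

The receiving service records the caller's span_id as the parent_span_id of the span it starts; parent_span_id is not sent as a separate field.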

Without propagation, each service starts a fresh trace, and one distributed request shows up as disconnected fragments of telemetry.
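As a rough sketch of what propagation looks like on the wire, the code below writes and reads a W3C Trace Context traceparent header (version-trace_id-parent_id-flags) using only the standard library. The inject and extract helpers are hypothetical names, not the OpenTelemetry propagator API, and the parser handles version 00 only. The two IDs are the example values from the W3C specification.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# Version 00 layout: 00-<32 hex trace_id>-<16 hex parent span_id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: Dict[str, str], trace_id: str, span_id: str,
           sampled: bool = True) -> None:
    """Caller side: put the current trace context on the outgoing request."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Callee side: return (trace_id, parent_span_id), or None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id = match.group(1), match.group(2)
    # All-zero IDs are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return trace_id, parent_span_id


# Caller: forwards its own span_id along with the trace_id.
outgoing: Dict[str, str] = {}
inject(outgoing, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")

# Callee: the caller's span_id becomes the new span's parent_span_id.
context = extract(outgoing)
if context is not None:
    trace_id, parent_span_id = context
    child = {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),  # fresh 64-bit ID as 16 hex chars
        "parent_span_id": parent_span_id,
    }
    print(child)
```

If extract returns None, the callee has nothing to attach to and would have to start a new trace, which is exactly the fragmentation described above.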

Automatic vs manual instrumentation

Automatic instrumentation gives fast coverage for HTTP clients, DB calls, and frameworks, but it cannot see business intent: it knows a query ran, not why.

The most valuable spans are often manual:

  1. validate-order
  2. reserve-inventory
  3. apply-fraud-check

These spans connect system behavior to domain behavior.

For the next iteration, I want to add one manual business span per critical user flow and compare trace readability before and after.