Distributed tracing has rapidly become an essential technique for any team operating complex microservices architectures. As a lead data engineer and infrastructure geek, I've spent many late nights digging through log files and screaming into the void trying to unravel issues hidden somewhere deep in our service mesh.
In this post, I want to take a deeper, more technical look at the top open source and commercial tracing tools available today. My goal is to provide hard-won insights from real-world experience tracing some gnarly systems. I've personally used most of these tools in anger, so I can provide tips that go beyond just a feature checklist.
The Critical Importance of Distributed Tracing
Let's step back and understand why distributed tracing has become so invaluable:
- Modern web and mobile apps often involve dozens of independent microservices. Trying to debug these complex flows is a nightmare without tracing.
- With deployments measured in seconds, issues vanish as fast as they appear. Distributed tracing provides hard evidence for transient problems.
- Teams operate more independently, so nobody has visibility into the entire system. Tracing tools map all dependencies and interactions.
According to Accenture research, 66% of organizations reported frequent issues tracing transactions across their architecture. 89% confirmed tracing difficulties delay resolutions and impact customers.
This complexity translates directly into revenue loss and frustrated users. 41% reported losing >$100k from tracing issues, and 44% said lack of tracing visibility contributed to SLA violations.
Meanwhile, CNCF found that 86% of organizations are now utilizing microservices in production. The same percentage are also running containers. As adoption of these technologies increases, so does complexity, and the need for mature tracing capabilities.
Clearly, distributed tracing is no longer optional for operating microservices successfully at scale. The tools featured here represent the most robust solutions available today.
OpenTelemetry Emerges as a Standard
Recently, the OpenTelemetry project emerged as a vendor-neutral open source standard for instrumenting, generating, and collecting telemetry data like traces, metrics, and logs.
OpenTelemetry provides SDKs for major languages and frameworks to add trace context propagation in a standardized way. This trace context can be extracted and processed by any compatible back end telemetry pipeline.
For developers, this means you can instrument your services once using the OpenTelemetry SDK, then feed trace data to any OpenTelemetry-compatible tracing backend, such as Jaeger or Zipkin, without changing your instrumentation code.
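To make trace context propagation a bit more concrete, here is a minimal sketch of the W3C Trace Context `traceparent` header, a propagation format that OpenTelemetry SDKs support. The function names are my own, chosen for illustration, and are not part of any SDK; in practice the SDK's propagators handle this for you.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace with a random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming):
    """Continue the caller's trace with a fresh span ID for an outgoing call."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        # Nothing usable came in, so this service starts a new trace.
        return new_traceparent()
    trace_id, _parent_id, sampled = parsed
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

A service would call `child_traceparent` on the header it received and attach the result to each downstream request, so every hop shares one trace ID while keeping its own span ID.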
Initially, OpenTracing and OpenCensus served similar purposes but were not fully compatible. In 2019, the projects merged to form OpenTelemetry in order to standardize. This consolidation helped unify the instrumentation space significantly.
Now, many vendors are rapidly aligning tools and offerings with OpenTelemetry to facilitate interoperability. Supporting this standard lowers switching costs and removes vendor lock-in.
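The lower switching cost comes from keeping application code unaware of the backend. The sketch below shows that separation in plain Python. Every name in it (`Span`, `SpanExporter`, `Tracer`, `handle_checkout`) is hypothetical and deliberately simplified; it is not the OpenTelemetry API, only an illustration of the pattern.

```python
import time
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Span:
    name: str
    start: float
    end: float = 0.0
    attributes: dict = field(default_factory=dict)


class SpanExporter(Protocol):
    """Anything with an export(span) method can act as a backend."""

    def export(self, span: Span) -> None: ...


class InMemoryExporter:
    """Keeps spans in a list, handy for tests."""

    def __init__(self):
        self.spans = []

    def export(self, span):
        self.spans.append(span)


class PrintExporter:
    """Writes a one-line summary per span to stdout."""

    def export(self, span):
        print(f"{span.name}: {(span.end - span.start) * 1000:.2f} ms")


class Tracer:
    def __init__(self, exporter: SpanExporter):
        self._exporter = exporter

    def record(self, name, func, *args, **kwargs):
        """Time one call and hand the finished span to the exporter."""
        span = Span(name=name, start=time.perf_counter())
        try:
            return func(*args, **kwargs)
        finally:
            span.end = time.perf_counter()
            self._exporter.export(span)


def handle_checkout(tracer, cart):
    # Application code only sees the tracer, never the backend behind it.
    return tracer.record("sum-cart", sum, cart)
```

Swapping `Tracer(PrintExporter())` for `Tracer(InMemoryExporter())` changes where spans go without touching `handle_checkout`, which is the property a shared standard is meant to preserve across real vendors.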
Jaeger: Mature, Flexible, and Open Source
Jaeger, originally created by Uber Engineering, is the most mature and capable open source distributed tracing system available today. As a CNCF graduated project, it has a vibrant community supporting its development and adoption.
Jaeger implements the OpenTracing standard and integrates seamlessly with OpenTelemetry. It supports instrumenting applications in Go, Java, Python, and Node.js.
A key advantage of Jaeger is the architectural flexibility. The collector, query, and UI components can be independently deployed and scaled. This allows customizing storage and analysis pipelines.
For example, trace data can be stored in Cassandra or Elasticsearch, with Kafka available as a buffer in front of storage. Cassandra provides massive scalability for high-volume traces, storing 40,000 traces/sec in Uber's production setup.
Visually, Jaeger's UI strikes a good balance between intuitiveness and technical depth.
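To get a feel for what a rate like 40,000 traces/sec means for storage, here is a back-of-the-envelope calculation. The spans-per-trace and bytes-per-span values are purely my assumptions for illustration, not figures from Uber or from Jaeger; plug in your own measurements.

```python
def daily_trace_volume(traces_per_sec, spans_per_trace, bytes_per_span):
    """Return (spans per day, raw bytes per day) for a steady ingest rate."""
    seconds_per_day = 24 * 60 * 60
    spans_per_day = traces_per_sec * spans_per_trace * seconds_per_day
    return spans_per_day, spans_per_day * bytes_per_span


# Assumed values: 10 spans per trace and 500 bytes per stored span.
spans, raw_bytes = daily_trace_volume(40_000, spans_per_trace=10, bytes_per_span=500)
print(f"{spans:,} spans/day, about {raw_bytes / 1e12:.1f} TB/day before replication")
```

Even with modest assumptions the daily volume lands in the terabyte range, which is why the choice of storage backend and retention period matters so much at this scale.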
For advanced analysis, Jaeger traces can be exported to data platforms like Apache Spark. In my experience, Jaeger can handle even extremely demanding tracing volumes at companies like Uber. The fully managed Jaeger on OpenShift deployment provides a quick way to get started.
Zipkin: Optimal for Simplicity and UX Polish
Zipkin, originally built at Twitter, focuses intensely on usability and visualization. It implements both the OpenTracing and OpenCensus APIs, allowing flexibility on instrumentation.
The UI immediately stands out as clean and user friendly. The trace search, filtering, and tag editors make Zipkin easy for troubleshooting complex issues. The UI also surfaces useful aggregations and summaries exposing trends.
Integrations like Zipkin-lens extend it to support flame graphs.
A key strength of Zipkin is that it's lightweight and easy to operate. The single Java JAR includes everything needed. While less flexible than Jaeger, the simplicity makes Zipkin approachable. It's a great place to start before tackling a more sophisticated pipeline.
The standard Docker image also makes it trivial to integrate with Kubernetes. Overall, if usability is critical for your team, put Zipkin high on your evaluation list.
AppDash: Custom Built for Troubleshooting
The key insight behind AppDash is that configuration work delays developers who are fixing thorny issues, so it keeps setup to a minimum. AppDash auto-instruments your services via compiled language "collators". There's no need for manual code changes.
The collators also perform trace collection and aggregation. This avoids configuring a separate tracing backend. AppDash focuses on surfacing the data developers need when hunting down latency spikes or error storms.
AppDash directs attention to outliers and anomalies with advanced visualizations like flame graphs. These make performance hotspots instantly obvious without digging through massive data sets.
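A flame graph is essentially a picture of where time is spent once child calls are subtracted from their parents. The sketch below computes that "self time" per span name from parent and child relationships. The `Span` fields and the sample trace are made up for illustration, and the code assumes the children of a span do not overlap in time; it is not how AppDash itself is implemented.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_times(spans):
    """Map span name -> time spent in the span itself, excluding children."""
    child_total = defaultdict(float)
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] += span.duration_ms
    result = defaultdict(float)
    for span in spans:
        result[span.name] += span.duration_ms - child_total[span.span_id]
    return dict(result)


def hotspots(spans, top=3):
    """Return the span names with the largest self time, biggest first."""
    ranked = sorted(self_times(spans).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]


# Hypothetical trace: a checkout request that is dominated by one query.
trace = [
    Span("a", None, "GET /checkout", 0, 120),
    Span("b", "a", "auth.verify", 5, 20),
    Span("c", "a", "cart.load", 30, 110),
    Span("d", "c", "db.query", 35, 105),
]
```

Calling `hotspots(trace)` ranks `db.query` first, even though `GET /checkout` has the longest total duration, which is the same conclusion a flame graph lets you reach at a glance.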
Integrations for logging (Logstash, Papertrail) and metrics (Datadog) also assist troubleshooting by providing context around traces. Ultimately, the goal is perfecting the debugging workflow for microservices incidents, rather than maximum configurability.
LightStep – SaaS Distributed Tracing Up Leveled
LightStep offers a commercial distributed tracing solution focusing on large-scale production environments. LightStep delivers tracing as a service, combining easy instrumentation with enterprise-scale operationalization.
The LightStep platform emphasizes advanced filters, statistical insights, and machine learning-driven anomaly detection. The goal is surfacing the signals that really matter when parsing millions of traces.
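I don't know the internals of LightStep's detection, so treat the following as a generic illustration of statistical outlier detection on span latencies, not as their method. It uses the modified z-score based on the median absolute deviation, a standard robust technique; the function name and the sample data are my own.

```python
import statistics


def latency_outliers(durations_ms, threshold=3.5):
    """Return durations whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. A threshold of 3.5 is a common rule of thumb.
    """
    median = statistics.median(durations_ms)
    mad = statistics.median(abs(d - median) for d in durations_ms)
    if mad == 0:
        # All values are (nearly) identical, so nothing stands out.
        return []
    return [d for d in durations_ms if 0.6745 * abs(d - median) / mad > threshold]


# Hypothetical latencies in milliseconds for one endpoint.
sample = [100, 102, 98, 101, 99, 103, 97, 100, 950]
slow = latency_outliers(sample)
```

Using the median instead of the mean keeps a single extreme request from hiding itself by inflating the baseline, which matters when you are scanning millions of traces for the few that deserve attention.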
LightStep supports all modern standards including OpenCensus, OpenTracing, and OpenTelemetry. This flexibility ensures instrumentation libraries exist for all languages and frameworks. Custom instrumentation can also feed into the LightStep pipeline.
On the backend, LightStep manages the infrastructure for scalable trace collection, storage, and analysis. Teams avoid the undifferentiated heavy lifting of standing up their own distributed tracing backend.
LightStep integrates with all major APM solutions including Datadog, Dynatrace, and New Relic. For visualization, it includes its own Stream product delivering real-time performance dashboards.
By combining easy instrumentation with enterprise-grade backend management, LightStep allows your team to focus on building applications rather than tracing infrastructure.
If you need managed, scalable tracing that goes far beyond what most teams can build themselves, LightStep delivers.
Instana Extends APM Capabilities
Instana provides in-depth monitoring, automation, and insight for microservices and containers. Distributed tracing is a natural extension of their capabilities.
Instana automatically discovers applications and maps infrastructure dependencies. This contextual awareness allows accurately connecting traces to the underlying services and containers.
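Dependency mapping of this kind can be derived from trace data itself: whenever a span's parent belongs to a different service, one service has called another. The sketch below is a simplified illustration with made-up span tuples and a hypothetical `service_edges` helper; it does not reflect how Instana's discovery actually works.

```python
from collections import Counter


def service_edges(spans):
    """Count caller -> callee edges from (span_id, parent_id, service) tuples."""
    service_of = {span_id: service for span_id, _parent, service in spans}
    edges = Counter()
    for _span_id, parent_id, service in spans:
        parent_service = service_of.get(parent_id)
        # Skip root spans and calls that stay inside one service.
        if parent_service is not None and parent_service != service:
            edges[(parent_service, service)] += 1
    return edges


# Hypothetical spans from a single request.
spans = [
    ("1", None, "gateway"),
    ("2", "1", "checkout"),
    ("3", "2", "checkout"),
    ("4", "3", "payments"),
    ("5", "2", "inventory"),
]
dependency_map = service_edges(spans)
```

Aggregating these edge counts over many traces gives a call graph weighted by traffic, which is the raw material for the dependency views described above.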
The APM dashboards integrate relevant traces to provide code-level visibility into performance. Teams can pivot seamlessly from metrics and logs into the correlated tracing views.
Powerful automation also eliminates tedious tracing configuration. The agents automatically instrument services and API endpoints via code instrumentation. There are no vendor libraries to update manually and no OpenTelemetry configurations to manage.
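To show the general idea behind automatic instrumentation, here is a small Python sketch that wraps existing methods at load time so the call sites never change. It is only an analogy: `traced`, `instrument`, `RECORDED`, and `PriceService` are hypothetical names, and a commercial agent such as Instana's uses its own, far more sophisticated mechanisms.

```python
import functools
import time

RECORDED = []  # stand-in for a real span exporter


def traced(func):
    """Wrap a callable so every call records its name, duration, and error."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            error = type(exc).__name__
            raise
        finally:
            RECORDED.append(
                {
                    "name": func.__qualname__,
                    "duration_s": time.perf_counter() - start,
                    "error": error,
                }
            )

    return wrapper


def instrument(cls, method_names):
    """Replace the named methods on a class with traced versions, in place."""
    for name in method_names:
        setattr(cls, name, traced(getattr(cls, name)))


class PriceService:
    def lookup(self, sku):
        return {"apple": 3}.get(sku, 0)


# The "agent" step: no edits to PriceService or to its callers.
instrument(PriceService, ["lookup"])
```

After `instrument` runs, any call to `PriceService().lookup(...)` appends a record to `RECORDED`, while the service code and its callers stay exactly as they were written.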
For organizations already leveraging Instana APM, the distributed tracing feature comes built-in. The capabilities perfectly complement the existing visibility with code-level performance context.
If you are seeking an automated tracing solution tightly integrated with other critical monitoring data, Instana should be high on your list.
Key Differences Summarized
| Tool | Open Source | Backends | Standout Features | Sweet Spot |
|---|---|---|---|---|
| Jaeger | Yes | Cassandra, ES | Flexible and scalable, feature-rich | Large-scale microservices, open source philosophy |
| Zipkin | Yes | In-memory, etc. | UX polish, simplicity | Getting started quickly, ease of use |
| AppDash | Yes | None (embedded) | Focused troubleshooting, flame graphs | Hunting down latency spikes |
| LightStep | No | Managed | Enterprise SaaS, advanced filtering and insights | Scaling high volume tracing without DIY overhead |
| Instana | No | N/A | Automated instrumentation, seamless APM integration | Adding tracing to existing container monitoring stack |
This summarizes how the tools compare based on some key selection criteria. There's enough diversity that most organizations can find a solution fitting their needs.
Adding high quality distributed tracing can elevate engineering productivity and system reliability to new levels. The capacity to visualize requests flowing through a complex topology unlocks debugging superpowers.
In this post, we dug into the most capable solutions available today. My goal was to provide deeper technical insight into these tools based on hands-on experience.
There's no universally best option. Each organization should consider its existing stack, team skills, and operational constraints. That said, I hope the analysis provided gives you a head start on picking the right tracing tool for your microservices environment.
The need for distributed tracing will only grow as cloud native architectures increase in popularity. Investing in these tools now will pay dividends by preventing many late-night on-call war rooms down the road.