Code4IT

The place for .NET enthusiasts, Azure lovers, and backend developers

Davide's Code and Architecture Notes - Metrics, Logs, and Traces: the three pillars of Observability

2025-11-18 8 min read Code and Architecture Notes

Learn the differences between metrics, logs, and traces - the three pillars of observability in distributed systems - and how to use them effectively


Understanding what is happening inside a software system is becoming increasingly difficult, given that most applications are now distributed and composed of multiple services.

To address this challenge, we can (and should!) use Observability: a set of tools and practices used to have information about what is really going on in the system, from different standpoints: resource usage, errors, components interaction, and logs.

Observability is usually based on three elements: metrics, logs, and traces. The problem is that they are often confused, misused, or used for the wrong purpose.

In this article, we will clarify the differences between metrics, logs, and traces, and explain how they complement each other to provide a comprehensive view of a system’s behaviour.

I have added some questions for you to reason about: I don’t have clear answers for these questions, so I’d really like to learn from you: feel free to share your thoughts in the comments!

Metrics: plain numbers that describe your system over time

Metrics are just pure numbers. They are quantitative measurements that represent specific aspects of the system, allowing you to understand how it changes over time. Metrics are typically collected at regular intervals and stored in a time-series database for analysis and visualisation.

It’s thanks to metrics that we can monitor trends of specific measurements, such as CPU usage, memory consumption, request latency, error rates, and throughput. As metrics are just numbers, they are often represented as graphs or charts, allowing us to identify anomalies or patterns in the data quickly.

What you can do with metrics depends on the type of metric you are collecting. Common types of metrics include:

  • Counters: these are metrics that only increase over time, such as the number of requests received or the number of errors encountered.
  • Gauges: these are metrics that can go up or down, such as CPU usage or memory consumption.
  • Histograms: these are metrics that capture the distribution of values, such as request latency or response size: metrics are grouped into “buckets” that represent ranges of values.
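To make the distinction between the three types concrete, here is a minimal, hypothetical sketch in Python. A real system would use a metrics library (such as a Prometheus client), but the behaviour of each type is the same idea:

```python
import bisect

class Counter:
    """A metric that only ever increases (e.g. total requests received)."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """A metric that can go up or down (e.g. memory in use)."""
    def __init__(self):
        self.value = 0
    def set(self, value):
        self.value = value

class Histogram:
    """Groups observed values into buckets (e.g. request latency)."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)           # upper bound of each bucket
        self.counts = [0] * (len(buckets) + 1)   # last slot = +infinity
    def observe(self, value):
        i = bisect.bisect_left(self.buckets, value)
        self.counts[i] += 1

requests = Counter()
requests.inc()
requests.inc()

memory = Gauge()
memory.set(512)   # can later go down again

latency = Histogram(buckets=[0.1, 0.5, 1.0])  # seconds
latency.observe(0.07)
latency.observe(0.3)
latency.observe(2.5)  # falls into the +infinity bucket
```

Note how the counter refuses to decrease: that invariant is what lets you compute rates over time from it.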

In short, the most important thing to do with metrics is to monitor them over time, set up alerts for anomalies, and use them to identify trends and patterns in your system’s behaviour. The single value has no meaning by itself, but the trend over time is what really matters.

Therefore, ensure that you collect the appropriate metrics for your system and utilise them to gain valuable insights into its performance and behaviour. Compare current values with historical data to identify trends and anomalies. But, most of all, collect only metrics that are actually relevant to your system and avoid collecting too many metrics that can lead to noise and confusion.
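As a toy illustration of comparing a current value against historical data, here is a sketch that flags a sample as anomalous when it deviates too far from the recent trend (the window and the 3-sigma threshold are arbitrary assumptions, not a recommendation):

```python
from statistics import mean, stdev

def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it deviates from the recent history
    by more than `threshold` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold

cpu_history = [41.0, 43.5, 42.2, 40.8, 44.1]  # % CPU, last 5 samples
print(is_anomalous(cpu_history, 43.0))  # normal fluctuation
print(is_anomalous(cpu_history, 95.0))  # sudden spike
```

This is exactly why the single value has no meaning by itself: 95% CPU is only "wrong" relative to what the system usually does.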

❓ A question for you: would you monitor metrics like “number of new accounts” or “number of posts published” in a social media application? Why, or why not? On the one hand, it is a value that most companies should keep track of to understand the health of their website. On the other hand, it’s not related to the infrastructure of the system, but to user behaviour. So… I don’t know!

Logs: detailed records of events

Logs are text records that capture detailed information about events that occur within a system. Logs are typically generated by applications, services, or infrastructure components and can include information such as timestamps, log levels, messages, and contextual data.

However, logs without timestamps are nearly useless: timestamps enable us to correlate events across different components and systems, helping us understand the sequence of events that led to a particular outcome.

Logs are often used for troubleshooting and debugging purposes, as they provide a detailed record of what happened in the system at a specific point in time. Logs can also be used for auditing and compliance purposes, as they can provide a record of user activity and system changes.

When working with logs, it’s important to consider the following best practices:

  • Log levels: Use log levels (e.g., DEBUG, INFO, WARN, ERROR) to categorise logs based on their severity and importance. This helps you filter and prioritise logs during analysis.
  • Structured logging: Use structured logging formats (e.g., JSON) to make it easier to parse and analyse logs programmatically.
  • Contextual information: Include relevant contextual information in logs, such as user IDs, request IDs, and session IDs, to help correlate events across different components and systems. But pay attention to not log sensitive information!
  • Log retention: Define a log retention policy to manage the storage and lifecycle of logs, ensuring that you retain logs for an appropriate period while minimising storage costs.
  • Sampling: In high-throughput systems, consider sampling logs to reduce the volume of data while still capturing representative events.
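To show what structured logging with contextual information can look like, here is a minimal sketch using Python’s standard `logging` module (the field names, such as `request_id`, are illustrative assumptions):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Attach contextual fields (e.g. request_id) passed via `extra`.
        for key in ("request_id", "user_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Order created", extra={"request_id": "abc-123"})
```

Because each line is valid JSON, a log aggregator can filter by level or correlate all lines sharing the same `request_id` without fragile text parsing.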

Log messages should be complete and contain all the necessary info

Always keep in mind that logs are a common threat vector: review them carefully to avoid leaking sensitive information, such as passwords, API keys, or personally identifiable information (PII). Secure the log storage by implementing Access Control and avoiding exposure to external systems or actors. And don’t forget to sanitise the data you are going to log! Attackers can exploit logs to gain unauthorised access to systems or data (using a technique called Log Injection).
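Log Injection typically works by smuggling newline characters into a logged value, so the attacker can forge extra, legitimate-looking log lines. A simple (illustrative, not exhaustive) mitigation is to strip control characters from user-supplied values before logging them:

```python
def sanitize_for_log(value) -> str:
    """Replace newlines and other control characters so that
    attacker-controlled input cannot forge extra log entries."""
    return "".join(ch if ch.isprintable() else " " for ch in str(value))

# A malicious username trying to inject a fake log line:
username = "alice\nINFO: admin logged in"
print(f"Login attempt for user: {sanitize_for_log(username)}")
```

The forged "INFO: admin logged in" now stays on the same line as the real entry, where it is obviously part of the user input.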

If your system processes data from European users, ensure compliance with GDPR regulations when handling logs. You cannot store personal data without users’ consent, and you must ensure that logs are stored securely and deleted when no longer needed.

❓ A question for you: suppose that your application throws an exception, which is logged and thrown to the caller. Would you log it only where the exception was thrown, or would you log it also in the upper layers that caught it? Why?

Traces: following the journey of a request

Traces are records that capture the journey of a request as it flows through the system. They provide a detailed view of how requests are processed across multiple services and components, allowing you to understand the end-to-end flow of a request and identify bottlenecks or failures.

Each step of a trace is called a Span: a span covers one or more operations that are executed as part of the request. Spans can be nested, allowing you to capture the hierarchical structure of a request as it flows through different services and components.
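Here is a minimal, hypothetical sketch of how nested spans share a trace ID and point back to their parent (a real system would use an instrumentation library such as OpenTelemetry rather than hand-rolling this):

```python
import time
import uuid

class Span:
    """One step of a trace; child spans capture nested operations."""
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # shared by the whole trace
        self.span_id = uuid.uuid4().hex[:16]          # unique per step
        self.parent_id = parent_id
        self.children = []
        self.start = time.monotonic()
        self.duration = None

    def child(self, name):
        # Children inherit the trace ID and reference their parent span.
        span = Span(name, trace_id=self.trace_id, parent_id=self.span_id)
        self.children.append(span)
        return span

    def end(self):
        self.duration = time.monotonic() - self.start

root = Span("GET /orders")           # the whole incoming request
db = root.child("query orders DB")   # a nested operation inside it
db.end()
root.end()
```

The shared `trace_id` is what lets a backend stitch spans from different services back into one tree.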

Traces are important for two main reasons:

  1. They help you understand the whole end-to-end flow of a request, including the interactions between different services and components. When paired with Logs, they can provide detailed context about what happened at each step of the request.
  2. They help you identify performance bottlenecks and latency issues in your system. By analysing traces, you can pinpoint which services or components are causing delays and optimise them for better performance.

Diagram showing a distributed trace with multiple spans across different services, illustrating the hierarchical structure of request flow

When working with traces, it’s important to propagate the context: you should propagate the trace context (e.g., trace IDs, span IDs) across service boundaries to maintain the continuity of traces. This trick, when performed on a system where all traces are stored in a single location, allows you to reconstruct the entire journey of a request, even if it spans multiple services and components.
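In practice, propagating the trace context over HTTP is standardised by the W3C Trace Context `traceparent` header (`version-traceid-spanid-flags`). Here is a sketch of building and parsing it; the surrounding service code is an assumption for illustration:

```python
import secrets

def make_traceparent(trace_id=None):
    """Build a W3C `traceparent` header, reusing an existing trace ID
    if one is provided, and minting a fresh span ID for this hop."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars
    return f"00-{trace_id}-{span_id}-01", trace_id, span_id

def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    return trace_id, span_id

# Service A starts a trace and sends the header to service B...
header, trace_id, span_id = make_traceparent()
# ...service B continues the SAME trace with a new span of its own:
incoming_trace, incoming_parent = parse_traceparent(header)
next_header, _, _ = make_traceparent(trace_id=incoming_trace)
```

As long as every hop copies the trace ID forward, the backend can reconstruct the whole journey even across many services.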

Finally, consider that traces can generate a large amount of data, especially in high-throughput systems. To manage the volume of trace data, you can implement sampling strategies to reduce the amount of data collected while still capturing representative requests.
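A common sampling strategy is deterministic head sampling keyed on the trace ID, so that every service independently makes the same keep/drop decision for a given trace. A sketch (the 10% rate is an arbitrary assumption):

```python
import hashlib

def should_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministically keep ~`rate` of traces, based on the trace ID,
    so all services agree on whether a given trace is sampled."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

kept = sum(should_sample(f"trace-{i}") for i in range(10_000))
print(f"kept {kept} of 10000 traces")  # roughly 1000
```

Hashing the trace ID (instead of calling a random generator per service) is what guarantees you never end up with half a trace: a trace is either fully kept or fully dropped.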

Further readings

In this article, we talked about the three pillars of Observability: metrics, logs, and traces. If you want to delve deeper into the topic, here are some resources to help you get started.

First, you might want to add logging to your ASP.NET applications. For this, I wrote an article that explains how to do it using Seq and ILogger.

πŸ”— Easy logging management with Seq and ILogger in ASP.NET | Code4IT

But what if you are just using a console application? Well, in this case, you can have a look at this article:

πŸ”— How to add Dependency Injection, Configurations, and Logging in a .NET 7 Console Application


If you want to get started with local tracing without setting up a full observability stack, you can use Rin or MiniProfiler: these tools helped me locate performance bottlenecks in my applications, as I explained in this (old) article.

πŸ”— 14 to 2 seconds: how I improved the performance of an endpoint by 82% - part 1 | Code4IT

And, for a quick introduction to MiniProfiler:

πŸ”— C# Tip: use Miniprofiler instead of Stopwatch to profile code performance | Code4IT

Still talking about distributed tracing, one of the most important things to do is to correlate traces: you can do it by adding a CorrelationId to your HTTP requests, as I explained in this article:

πŸ”— How to propagate HTTP Headers (and Correlation IDs) using HttpClients in C# | Code4IT

Wrapping up

In this article, we explored the three pillars of Observability: metrics, logs, and traces. We discussed their differences, how they complement each other, and best practices for using them effectively in distributed systems.

But, here’s the trick: you don’t need to implement all three pillars at once! Start with what makes the most sense for your application and gradually expand your observability capabilities as needed. Also, remember that the goal of Observability is to gain insights into your system’s behaviour and performance, so focus on collecting and analysing data that is relevant to your specific use cases and requirements.

Observability is most valuable in distributed systems, but it can also be applied to monolithic applications: even a single application can benefit from metrics, logs, and traces to monitor its performance and behaviour.

So, what should we implement first?

In my opinion, logging is the place to start, as it’s the easiest to implement and provides the most immediate value for troubleshooting and debugging. Once you have a solid logging strategy in place, consider adding tracing to gain insights into the end-to-end flow of requests.

Finally, if you have specific KPIs to track, you can start adding metrics to the system and keep track of how they change over time.

I hope you enjoyed this article! Let’s keep in touch on LinkedIn, Twitter or BlueSky! πŸ€œπŸ€›

Happy coding!

🐧

About the author

Davide Bellone is a Principal Backend Developer with more than 10 years of professional experience with Microsoft platforms and frameworks.

He loves learning new things and sharing these learnings with others: that’s why he writes on this blog and speaks at tech conferences.

He's a Microsoft MVP πŸ†, conference speaker (here's his Sessionize Profile) and content creator on LinkedIn.