In this article, we will examine some best practices to follow while logging in microservices and the architecture to handle distributed logging in the microservices world.
Microservices architecture has become one of the most popular choices for large-scale applications in the world of Software Design, Development, and Architecture, largely because of its benefits over its traditional counterpart, the monolithic architecture. These benefits arise from a shift from one single large, tightly coupled unit (monolith) to multiple small, loosely coupled services, wherein each service has limited and specific functionality to deliver. So, with smaller codebases, we get to leverage the power of distributed teams due to decreased dependencies and coupling, in turn reducing the time to market (or production) of any application. Other advantages include a language-agnostic stack and selective scaling.
Along with the advantages that ship with this architecture, logging in microservices comes with its own set of complexities, because a single request could span multiple services and might even travel back and forth between them. To trace the end-to-end flow of a request through our systems and identify where its errors originate, we need a logging and monitoring system in place. We can adopt one of two solutions: a centralized logging service, or an individual logging service for each service.
Individual logging service vs. centralized logging service
Individual logging solutions for each service can become a pain point when the number of services starts growing. For every process flow whose logs you want to inspect, you might need to go through the logs of each service involved in serving that request, which makes issue identification and resolution a tough job. With a centralized logging service, on the other hand, you have a single go-to place for those logs, which, backed by enough information in each log entry and a well-thought-out design, makes the same investigation far easier.
At Skeps, we use centralized logging solutions for our applications running on microservice architecture.
Centralized logging service
A single centralized logging service that aggregates logs from all the services is the preferred solution in a microservices architecture. In the software world, unique or previously unseen problems are not rare, and we certainly do not want to be juggling multiple log files or dashboards to find out what caused them. While designing a standard centralized logging scheme, one could, and in fact should, refer to the following norms:
Using a Correlation Id for each request
A correlation id is a unique id that can be assigned to an incoming request, which can help us to identify this request uniquely in each service.
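A minimal Python sketch of this idea follows. It assumes the id travels between services in an HTTP header; the header name `X-Correlation-Id` and the helper names `start_request` and `outgoing_headers` are illustrative choices, not part of any standard or of the setup described in this article.

```python
import contextvars
import logging
import uuid

# Holds the correlation id of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

HEADER = "X-Correlation-Id"  # assumed header name, chosen for illustration


def start_request(headers: dict) -> str:
    """Reuse the caller's id if one was sent, otherwise mint a new one."""
    cid = headers.get(HEADER) or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def outgoing_headers() -> dict:
    """Headers to attach to calls made to downstream services."""
    return {HEADER: correlation_id.get()}


class CorrelationIdFilter(logging.Filter):
    """Copies the current correlation id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True
```

The first service that receives a request calls `start_request`, every outgoing call carries `outgoing_headers()`, and attaching the filter to a logger or handler stamps each log entry with the same id.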
Defining a standard log structure
Defining the log structure is the most crucial part of logging effectively. First, we need to identify why we are enabling logging. A few questions to answer could be:
- How did each service respond while doing its part: did it succeed or cause errors? Whatever the case may be, our aim should be to capture as much context around that as possible.
- What process/function from the service generated the log?
- At what time during the process was the log generated?
- How crucial is the process that generated the log?
While answering these questions, we get to derive a format, which can include, but is not limited to, the following fields:
- Service name
- Correlation Id
- Log String (can include a short description, name of the generating method)
- Log Metadata (can include error stacks, successful execution response(s) of a subtask)
- Log Timestamp
- Log Level (DEBUG, WARN, INFO, ERROR)
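One possible way to emit the fields listed above is a JSON formatter for Python's standard `logging` module. This is only a sketch: the key names (`service`, `metadata`, and so on) and the `JsonFormatter` class are assumptions made for illustration, and `correlation_id` and `metadata` are expected to be attached to the record by the caller or by a filter.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON object with a fixed set of fields."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "service": self.service_name,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
            "function": record.funcName,
            "metadata": getattr(record, "metadata", {}),
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
        }
        if record.exc_info:
            # Keep the error stack next to the rest of the metadata.
            stack = self.formatException(record.exc_info)
            entry["metadata"] = {**entry["metadata"], "stack": stack}
        return json.dumps(entry)
```

A call such as `logger.info("payment authorized", extra={"correlation_id": cid, "metadata": {"order": 7}})` would then produce one self-describing line per event, which is easier for an aggregator to index than free-form text.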
Customized Alerts
When something critical breaks, we do not want it to sit in our databases without us knowing about it in real time. So, it is pivotal to set up notifications on events that indicate a possible problem in the system. Such events can be categorized by reserving a dedicated log level for them.
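As a rough sketch of a reserved log level, the snippet below registers a custom level with Python's `logging` module and routes only records at or above it to a notifier. The level number 45, the name `ALERT`, and the `notify` callable (which could post to a chat channel or a paging service) are all hypothetical.

```python
import logging
from typing import Callable

ALERT = 45  # hypothetical reserved level, between ERROR (40) and CRITICAL (50)
logging.addLevelName(ALERT, "ALERT")


class AlertHandler(logging.Handler):
    """Forwards records at or above the reserved level to a notifier."""

    def __init__(self, notify: Callable[[str], None]):
        super().__init__(level=ALERT)
        self.notify = notify

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self.notify(self.format(record))
        except Exception:
            # Never let a failing notifier break the code that logged.
            self.handleError(record)
```

With this handler attached, `logger.log(ALERT, "payment gateway unreachable")` triggers a notification, while ordinary `logger.error(...)` calls are stored as usual without alerting anyone.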
A set of rich querying APIs or Dashboard
After storing all the information, it is important to make sense of it. Rich APIs can be developed to filter logs by correlation id, log level, timestamp, or any other parameter that helps to identify and resolve issues at a much faster pace.
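The core of such a filtering API can be sketched as a single function. The version below works on an in-memory iterable of dictionaries purely for illustration; a real service would translate the same parameters into a query against its log store. The function name `query_logs` and the entry keys are assumptions.

```python
from datetime import datetime
from typing import Iterable, Iterator, Optional


def query_logs(
    entries: Iterable[dict],
    correlation_id: Optional[str] = None,
    level: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
) -> Iterator[dict]:
    """Yield the entries that match every filter that was supplied."""
    for entry in entries:
        if correlation_id is not None and entry["correlation_id"] != correlation_id:
            continue
        if level is not None and entry["level"] != level:
            continue
        if since is not None and entry["timestamp"] < since:
            continue
        if until is not None and entry["timestamp"] > until:
            continue
        yield entry
```

Filtering by correlation id alone returns every entry a single request produced across all services, which is usually the first query run when investigating an issue.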
Decide a Timeline to Clean Logs
Decide upon an appropriate timeline to clear the clutter and cut down on your storage usage. The right timeline depends on your application's needs and on how reliable your shipped code has been in the past. This applies when you are not storing any sensitive information; otherwise, you need to follow the compliance requirements in place. There is a workaround for that, however: you can keep sensitive information in a separate, additional field, and clear only that field from each stored log as per the compliance timeline.
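The two timelines can be applied in one clean-up pass, sketched below under stated assumptions: entries are dictionaries with a `timestamp`, sensitive data lives in a single field named `sensitive`, and both retention periods are passed in by the caller. All of these names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def apply_retention(
    entries: List[dict],
    now: datetime,
    keep_logs_for: timedelta,
    keep_sensitive_for: timedelta,
) -> List[dict]:
    """Drop expired entries and strip the sensitive field from older ones."""
    kept = []
    for entry in entries:
        age = now - entry["timestamp"]
        if age > keep_logs_for:
            continue  # the whole entry has outlived the clean-up timeline
        if age > keep_sensitive_for:
            # Compliance window passed: keep the log, clear only the field.
            entry = {k: v for k, v in entry.items() if k != "sensitive"}
        kept.append(entry)
    return kept
```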
How to make the log aggregator fail-safe?
There can be instances when the logging service (log aggregator) goes down, and it becomes a single point of failure for our analysis and debugging needs. We do not want to miss out on any logs that were generated during that outage. So, we need to develop an alternate mechanism that stores those logs until our aggregator is back online.
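One simple alternate mechanism is to spool entries to a local file whenever delivery fails and replay them later. The sketch below assumes a `send` callable that delivers one entry to the aggregator and raises `OSError` (for example `ConnectionError`) when it is unreachable; `BufferedShipper` and the spool file are illustrative names, not an existing library.

```python
import json
from pathlib import Path
from typing import Callable


class BufferedShipper:
    """Ships entries to the aggregator, spooling to disk during outages."""

    def __init__(self, send: Callable[[dict], None], spool: Path):
        self.send = send    # hypothetical delivery call, raises OSError when down
        self.spool = spool  # local file that holds undelivered entries

    def ship(self, entry: dict) -> None:
        try:
            self.send(entry)
        except OSError:
            with self.spool.open("a", encoding="utf-8") as f:
                f.write(json.dumps(entry) + "\n")

    def replay(self) -> int:
        """Resend spooled entries in order; returns how many were delivered."""
        if not self.spool.exists():
            return 0
        lines = self.spool.read_text(encoding="utf-8").splitlines()
        sent = 0
        for line in lines:
            try:
                self.send(json.loads(line))
            except OSError:
                break  # still down, keep the rest for the next attempt
            sent += 1
        remaining = lines[sent:]
        if remaining:
            self.spool.write_text("\n".join(remaining) + "\n", encoding="utf-8")
        else:
            self.spool.unlink()
        return sent
```

Calling `replay()` periodically, or once a health check reports the aggregator is back, drains the spool so that no logs from the outage are lost. A production setup might prefer a message queue in front of the aggregator for the same purpose.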
Additional Pointers
- It is advisable to generate logs asynchronously, to reduce the latencies of our process flows.
- A pattern known as observability eases the process of monitoring distributed environments. Observability includes application log aggregation, tracing, deployment logs, and metrics, which together give a holistic view of our system.
- Tracing identifies the latencies of the involved processes/components/services and helps to identify bottlenecks. This allows us to make better decisions regarding the application's scaling needs.
- Deployment logs: since a microservices system has multiple independently deployable services, we should also compile logs of the deployments that are made.
- Metrics refer to the performance and health of the infrastructure on which our application/services rely. Examples include current CPU and memory usage.
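The first pointer above, generating logs asynchronously, can be sketched with the queue-based handlers in Python's standard library. Here the request path only enqueues the record, while a background thread hands it to the slow handler (for example, one that ships to the aggregator). The helper name `make_async_logger` is an assumption for illustration.

```python
import logging
import logging.handlers
import queue


def make_async_logger(name: str, target: logging.Handler):
    """Return a logger whose calls only enqueue; `target` runs on a background thread."""
    log_queue = queue.Queue(-1)  # unbounded queue between caller and listener
    listener = logging.handlers.QueueListener(log_queue, target)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener.start()
    return logger, listener
```

The caller should invoke `listener.stop()` at shutdown so that records still in the queue are flushed before the process exits.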