In this article, we will examine some best practices to follow while logging microservices and the architecture to handle distributed logging in the microservices world.
Microservices architecture has become one of the most popular choices for large scale applications in the world of Software Design, Development, and Architecture, basically due to the benefits over its traditional counterpart, the monolithic architecture. These benefits arise due to a shift from one single large, tightly coupled unit (monolith) to multiple small loosely coupled services, wherein each service has limited and specific functionality to deliver. So, with smaller codebases, we get to leverage the power of distributed teams due to decreased dependencies and coupling, in turn reducing the time to market (or production) of any application. Other advantages include language-agnostic stack and selective scaling.
Logging in microservices comes with other advantages that can be shipped with the architecture, it also comes with its own set of complexities – the reason being a single request in this architecture could span across multiple services, and it might even travel back and forth. To trace the end-to-end flow and identify the source of originating errors of a request through our systems, we need a logging and monitoring system in place. We can adopt one of the two solutions, a centralized logging service and an individual logging service for each service.
Individual logging solutions for each service can become a pain point when the number of services starts growing. Because for every process-flow that you intend to look the logs for, you might need to go through each of the service logs involved in serving this process request, hence, making issue identification and resolution a tough job. However, on the other hand, in a centralized logging service, you have a single go-to place for the same, which, backed by enough information around logs and a thought-off design, can do wonders at achieving the same.
At Skeps, we use centralized logging solutions for our applications running on microservice architecture.
A single centralized logging service that aggregates logs from all the services should be a more preferred solution in a microservices architecture. In the software world, unique/unseen problems are not seldom, and we certainly do not want to be juggling through multiple log files or developed dashboards to get insights about what caused the same. While designing a standard centralized logging scheme, one could or in fact should refer to the following norms:
A correlation id is a unique id that can be assigned to an incoming request, which can help us to identify this request uniquely in each service.
Defining the log structure is the most crucial part of logging effectively. In the first place, we need to identify why are we enabling logging? A few points could be:
While answering these questions, we get to derive a format, which can include, but is not limited to the following things:
When something critical breaks, we do not want it to get stored in our databases without us getting to know about it in real-time. So, it is pivotal to set up notifications on events that indicate a possible problem in the system, and its categorization can be done by keeping a reserved log level.
Post storing all the information, it is important to make sense out of the stored information. Rich APIs can be developed to filter based on correlation id, log level, timestamp or any other parameter that can help to identify and resolve issues at a much faster pace.
Decide upon an appropriate timeline to clear the clutter and do cut down on your storage usage. This also depends on your application’s need and the reliability of your shipped code that has been in the past. This is when you are not storing any sensitive information. Otherwise, you need to follow the compliances in place. However, there are workarounds for that. You can keep an additional field to store such information, and only that can be cleared from each of the stored logs as per the compliance timeline.
There can be instances when the logging service (log aggregator) goes down, and it becomes a single point of failure for our analysis and debugging needs. We do not want to miss out on any logs that were generated during that outage. So, we need to develop an alternate mechanism that stores those logs until our aggregator is back online.