Clustering tools for log analytics

Marco Calizzi on May 20, 2022

Continuing our blog series on Machine Learning applications for log management, today I want to discuss clustering techniques. As we learned in our previous article, log parsing is a great tool for sorting millions of logs into hundreds of patterns, but that is still a relatively high number of items for a person to work with. This is where clustering techniques become useful, further reducing the complexity from those hundreds of patterns down to a few insights that IT staff can manage more easily.

First of all, we need to define the problem from a Data Science point of view: we have log patterns, which are essentially strings of text with other attributes (like timestamps, hosts, severity level, and other fields), and we want to cluster them into insights. This is an inherently unsupervised problem, because there is no dataset with correct answers on which a model can train. Moreover, we don't even know how many clusters there are!

Depending on the needs, there are several ways to tackle this problem. In some cases ML solutions are not even needed: for example, if users want to organize the patterns by time, or if logs already carry information about which topic they belong to, then it is sufficient to separate them based on simple rules.
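To make the rule-based case concrete, here is a minimal Python sketch that groups patterns by an existing topic field and by hour. The record layout and the field names (`template`, `topic`, `first_seen`) are assumptions made up for illustration, not a real Logmind schema.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical pattern records; field names are illustrative assumptions.
patterns = [
    {"template": "Connection to <*> timed out", "topic": "network",
     "first_seen": "2022-05-20T10:05:00"},
    {"template": "Disk usage at <*>%", "topic": "storage",
     "first_seen": "2022-05-20T10:40:00"},
    {"template": "Retrying connection to <*>", "topic": "network",
     "first_seen": "2022-05-20T11:15:00"},
]


def group_by(records, key_fn):
    """Group pattern templates under the key returned by key_fn."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record["template"])
    return dict(groups)


def hour_bucket(record):
    """Truncate the first-seen timestamp to the hour."""
    seen = datetime.fromisoformat(record["first_seen"])
    return seen.strftime("%Y-%m-%d %H:00")


# Rule 1: the logs already carry a topic, so we just split on it.
by_topic = group_by(patterns, lambda r: r["topic"])

# Rule 2: organize the patterns by time, here in one-hour buckets.
by_hour = group_by(patterns, hour_bucket)
```

No model is trained here: the grouping key is already in the data, which is exactly why ML is unnecessary in this situation.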

Usually, though, this is not the case, and ML techniques can be applied to discover new relations in the available data. Classic ML techniques like K-means, spectral clustering or hierarchical clustering can be used. The rationale behind these algorithms is always the same: patterns that are close to each other are grouped together, while distant patterns are not. To evaluate closeness, a distance function between patterns must be defined, and this is the real challenge, as logs and log patterns do not have properties that can naturally be converted into numerical quantities and compared.
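As an illustration of the idea, the sketch below defines one possible distance between pattern templates (Jaccard distance on their word sets) and feeds it to a simple single-linkage hierarchical clustering that stops merging at a distance threshold, so the number of clusters does not have to be known in advance. The token-based distance, the threshold value and the example templates are all assumptions chosen for the example; this is not the method we use at Logmind.

```python
def tokenize(template):
    """Split a template into a set of lowercase whitespace-separated tokens."""
    return set(template.lower().split())


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of the two token sets (0 = identical)."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


def cluster(templates, threshold=0.7):
    """Single-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters until the closest pair
    is farther apart than `threshold` (an assumed, tunable value).
    """
    clusters = [[t] for t in templates]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(jaccard_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical templates, with <*> marking the variable part.
templates = [
    "connection to <*> timed out",
    "connection to <*> refused",
    "disk usage at <*> percent",
    "disk usage above threshold on <*>",
]
clusters = cluster(templates)
```

With these four templates the two connection patterns end up in one cluster and the two disk patterns in another. A word-overlap distance ignores attributes such as host or timestamp, which is part of why designing a good distance is the hard step.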

Finally, I want to mention deep clustering, i.e. neural networks that learn how to cluster. Models of this type are scalable and perform well when the dimensionality is high, which is the case for log patterns, as they have many attributes. The drawbacks are the difficulty of tuning the model's hyper-parameters and the lack of interpretability, which forces users to place a significant level of trust in the results produced.

At Logmind we developed our own clustering method that is accurate, flexible, easy to set up, and fast, the last being a key requirement for our live monitoring platform.

