Continuing our blog series on Machine Learning applications for log management, today I want to discuss clustering techniques. As we learned in our previous article, log parsing is a great tool to sort millions of logs into hundreds of patterns, but this is still a relatively high number of items for a person to work with. This is where clustering techniques become useful, further reducing the complexity from those hundreds of patterns down to a few insights that can be more easily managed by IT staff.
First of all, we need to define the problem from a Data Science point of view: we have log patterns, which are essentially strings of text with other attributes (like timestamps, hosts, criticality level, and other fields), and we want to cluster them into insights. This is an inherently unsupervised problem, because there is no dataset with correct answers on which a model can train. Moreover, we don’t even know how many clusters there are!
Depending on the needs, there are several ways to tackle this problem. In some cases ML solutions are not even needed: for example, if users want to organize the patterns by time, or if logs already carry information about which topic they belong to, then it is sufficient to separate them based on simple rules.
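To make the rule-based case concrete, here is a minimal Python sketch. The field names (`template`, `first_seen`, `topic`) and the sample patterns are assumptions made up for this example, not a real log schema.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log patterns: field names and values are illustrative only.
patterns = [
    {"template": "Connection to <*> timed out",
     "first_seen": "2023-05-01T10:02:11", "topic": "network"},
    {"template": "Disk usage at <*> percent",
     "first_seen": "2023-05-01T10:47:30", "topic": "storage"},
    {"template": "Retrying connection to <*>",
     "first_seen": "2023-05-01T11:05:02", "topic": "network"},
]


def group_by(items, key_fn):
    """Group pattern templates by the key returned by a simple rule."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item["template"])
    return dict(groups)


# Rule 1: the logs already carry a topic, so we just split on it.
by_topic = group_by(patterns, lambda p: p["topic"])

# Rule 2: organize by time, here in one-hour buckets.
by_hour = group_by(
    patterns,
    lambda p: datetime.fromisoformat(p["first_seen"]).strftime("%Y-%m-%d %H:00"),
)

print(by_topic)
print(by_hour)
```

No model is trained here: the grouping is fully determined by the rule, which is why this approach only works when the relevant attribute is already present in the data.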
Usually, though, this is not the case, and ML techniques can be applied to discover new relations in the available data. Classic ML techniques like K-means, spectral or hierarchical clustering can be used. The rationale behind these algorithms is always that patterns that are close to each other will be grouped together, while distant patterns will not. To evaluate closeness, a distance function between patterns must be defined, and this is the real challenge, as logs and log patterns do not have properties that can naturally be converted into numerical quantities and compared.
Finally, I want to mention deep clustering, i.e. neural networks that learn how to cluster. This type of ML model is scalable and performs well when the dimensionality is high, which is the case for log patterns that have a lot of attributes. The drawbacks are the difficulty of tuning the model's hyper-parameters and the lack of interpretability, which forces users to place a significant level of trust in the results produced.
At Logmind we developed our own clustering method that is accurate, flexible, easy to set up and fast, the latter being a key requirement for our live monitoring platform.