slct

Section: User Commands (1)
Updated: October 2003
 

NAME

slct - simple logfile clustering tool  

SYNOPSIS

slct
[-b <byte offset>]
[-c <clustertable size>]
[-d <regexp>]
[-f <regexp>]
[-g <slice size>]
[-i <seed>]
[-j]
[-o <outliers file>]
[-r]
[-t <template>]
[-v <wordvector size>]
[-w <wordtable size>]
[-z <clustervector size>]
-s <support> <input files>
 

DESCRIPTION

SLCT is a tool for finding clusters in logfile(s), such that each cluster corresponds to a certain line pattern that occurs frequently in the data. Here are some examples of the clusters that SLCT is able to detect:

Dec 18 * myhost.mydomain sshd[*]: log: Connection from * port *
Dec 18 * myhost.mydomain sshd[*]: log: Password authentication for * accepted.

With the help of SLCT, one can quickly build a model of logfile(s), and also identify rare lines that do not fit the model (and are possibly anomalous).

When SLCT is searching for clusters, it processes logfile(s) at the word level. Every cluster description that is reported by SLCT consists of constant words and variable words (e.g., in the above examples the words 'Dec', 'myhost.mydomain', and 'log:' are constant words, while the words '*' and 'sshd[*]' are variable words). A cluster contains lines that match the cluster description, and the number of lines in the cluster is called the support of the cluster. Lines that do not belong to any of the clusters detected by SLCT are called outliers.

The task of SLCT is to identify a partitioning of logfile lines into clusters, so that each cluster has a support equal to or greater than the support threshold (the value given with the -s option). In other words, if the support threshold is N, SLCT reports clusters that contain at least N lines. Clusters are reported by printing their descriptions only, without outputting the individual lines that belong to each detected cluster. In order to find the clusters, the following key observation is employed: every constant word of each cluster must be frequent, i.e., the word must occur N or more times in the logfile(s).
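For example, the following hypothetical invocation (the logfile path is illustrative) would report clusters supported by at least 100 lines, refine their descriptions, and write the outliers to a separate file:

slct -r -o outliers.txt -s 100 /var/log/messages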

In order to identify clusters, SLCT first makes a pass over the data, and finds occurrence times of all words. Every distinct word is stored to in-memory vocabulary together with its occurrence counter. The word position is taken into account, e.g., if the word 'debug' is found to be the 1st word of some lines, and the 3rd word of other lines, those two occurrences are considered different. To minimize access times, the vocabulary is implemented as a move-to-front hash table. After the vocabulary has been built, SLCT finds frequent words from it (i.e., words with occurrence times equal or greater than the support threshold).
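As an illustration, the first pass could be sketched in Python as follows (a simplification, not SLCT's actual C implementation; a plain dictionary stands in for the move-to-front hash table):

import re
from collections import Counter

DELIM = re.compile(r'[ \t]+')            # the default -d delimiter

def find_frequent_words(lines, support):
    vocabulary = Counter()
    for line in lines:
        for pos, word in enumerate(DELIM.split(line.strip())):
            # the position matters: 'debug' as the 1st word and 'debug'
            # as the 3rd word are counted separately
            vocabulary[(pos, word)] += 1
    return {key for key, count in vocabulary.items() if count >= support}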

When frequent words have been found, SLCT makes a second pass over the data, and builds cluster candidates. For every input line that contains frequent word(s), a new candidate is created from these words, e.g., if the line is

Connection from 10.10.10.1 closed

and the words 'Connection', 'from', and 'closed' are frequent, then the candidate is

Connection from * closed

If the candidate is not present in the candidate hash table, it will be inserted there with a support value of 1; if the candidate is present, its support value will be incremented. In both cases, the input line is assigned to the cluster candidate. After the second data pass, SLCT processes the candidate table in order to find clusters (candidates with support values equal to or greater than the support threshold). If the -j option was given, SLCT first checks for every cluster candidate C whether the candidate table contains any of C's subclusters. A cluster C1 is considered to be a subcluster of C2 if all lines that belong to C1 match the description of C2, e.g., the cluster

Dec 18 * myhost.mydomain *: log: Connection from 10.10.10.1 port *

is a subcluster of the cluster

Dec 18 * myhost.mydomain *: log: Connection from * port *

If the cluster candidate C is found to have subclusters, all the lines belonging to the subclusters are also assigned to C (so that every such line will belong to two cluster candidates simultaneously), and the support values of the subclusters are added to the support value of C. After all candidates have been inspected, SLCT finds the clusters from the set of candidates by comparing the final support values with the support threshold. If the -j option was not given, the step described above is omitted, which means that candidates (and clusters) are not allowed to intersect.
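Continuing the Python sketch above (again a simplification of the actual implementation), the second pass and the subcluster test could look like this:

def build_candidates(lines, frequent):
    candidates = Counter()
    for line in lines:
        words = DELIM.split(line.strip())
        desc = tuple(w if (pos, w) in frequent else '*'
                     for pos, w in enumerate(words))
        if any(w != '*' for w in desc):      # line contains frequent word(s)
            candidates[desc] += 1
    return candidates

def is_subcluster(c1, c2):
    # every line that matches c1 also matches c2: wherever c2 has a
    # constant word, c1 must have the same constant word
    return (c1 != c2 and len(c1) == len(c2) and
            all(w2 == '*' or w1 == w2 for w1, w2 in zip(c1, c2)))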

If the -r option was given, SLCT makes another pass over the data, in order to refine cluster descriptions. If the -o option was given together with -r, SLCT also finds outliers (lines that do not belong to any detected cluster) during the pass, and writes them to the file given with the -o option. Cluster description refinement means that variable words of cluster descriptions are inspected more closely for constant heads and tails, e.g., the cluster description

Dec 18 * myhost.mydomain *: log: Connection from * port *

is converted to

Dec 18 * myhost.mydomain sshd[*]: log: Connection from * port *
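A refinement step of this kind could be sketched as follows (a simplification of SLCT's actual heuristic); words_seen holds the words that actually occupied one variable position of a cluster:

import os    # os.path.commonprefix accepts any list of strings

def refine_position(words_seen):
    head = os.path.commonprefix(words_seen)
    tail = os.path.commonprefix([w[::-1] for w in words_seen])[::-1]
    if len(head) + len(tail) > min(len(w) for w in words_seen):
        tail = ''                 # do not let the head and the tail overlap
    return (head + '*' + tail) if (head or tail) else '*'

For example, refine_position(['sshd[1344]', 'sshd[9912]']) returns 'sshd[*]'.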

The algorithm described above could consume a lot of memory when applied to a large data set, because when building the vocabulary, every distinct word has to be kept in memory together with its occurrence counter. Unfortunately, many words present in real-life logfile(s) tend to appear only a few times (quite often just once). Therefore, storing those very infrequent words in memory is a waste of space; however, before the data set has been completely processed, it is impossible to tell which words will finally be infrequent.

To cope with this problem, SLCT can be instructed to make an extra pass over the data as its very first step, and build a word summary vector of size M before constructing the actual vocabulary. The size of the vector is given with the -v option (if the -v option is omitted or the vector size is set to zero, no summary vector will be built). The word summary vector is made up of M counters that are initialized to zero. During the pass over the data, every word is hashed to an integer value between 0 and M-1. Each time the hash value I is calculated for a word, the Ith counter in the vector is incremented. Since efficient string hashing functions are known to distribute their input values uniformly across the output range, each counter in the vector will correspond to roughly W / M words (where W is the total number of different words). The value of the Ith counter equals the sum of the occurrence counts of all words that hash to the value I.

After SLCT has constructed the word summary vector, it will start building the vocabulary, but only those words are inserted into the vocabulary whose counter value in the summary vector is equal to or greater than the support threshold. Words that do not fulfil this criterion cannot be frequent, since they necessarily occur fewer times than the support threshold.
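Continuing the earlier Python sketch (with the built-in hash() standing in for SLCT's string hashing function), the summary vector pass and the filtered vocabulary could be sketched like this:

def build_summary_vector(lines, M):
    summary = [0] * M
    for line in lines:
        for pos, word in enumerate(DELIM.split(line.strip())):
            summary[hash((pos, word)) % M] += 1
    return summary

def build_vocabulary(lines, summary, support):
    M = len(summary)
    vocabulary = Counter()
    for line in lines:
        for pos, word in enumerate(DELIM.split(line.strip())):
            # a counter below the threshold proves that the word is
            # infrequent; the converse does not hold, because several
            # words may collide in the same counter
            if summary[hash((pos, word)) % M] >= support:
                vocabulary[(pos, word)] += 1
    return vocabulary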

Although employing the summary vector involves an extra pass over the data set, it is a convenient way to reduce the memory costs, because even a large vector consumes a relatively small amount of memory (e.g., if the vector has 1,000,000 counters, and the size of a counter is 4 bytes on your architecture, the size of the vector is still less than 4MB). If many of the words in the data set are infrequent, many of the counter values will not reach the support threshold, and the resulting vocabulary will consume much less memory. If the vector size is M, and K counter values reach the support threshold, then the resulting vocabulary will be about M / K times smaller than a vocabulary built without the summary vector.

In order to determine the right size for the summary vector, one must take into account the value of the support threshold, and also the estimated total number of different words. If the support threshold is small, the vector must be larger in order to be effective; otherwise a relatively small vector can be used. Also, the more different words there are, the larger the vector should be. The best way to determine the right size is to run SLCT several times with different vector size values, and to check which value gives the best result in terms of total memory cost.

When SLCT is invoked with a small support threshold value, the number of cluster candidates can sometimes be much larger than the number of frequent words and the number of actual clusters. In order to lower the memory cost of storing the candidate hash table, the summary vector method described above can be applied to cluster candidates as well, i.e., before the actual candidates are generated, an extra pass is made over the data, and a summary vector is built. The size of this summary vector is given with the -z option.
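For example, a hypothetical invocation that employs a 1,000,000-counter summary vector both for words and for cluster candidates (note that -z rules out -j) could look as follows:

slct -v 1000000 -z 1000000 -s 100 /var/log/messages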

SLCT writes its output to standard output, and logs its messages to standard error.  

OPTIONS

-b <byte offset>
when processing the input file(s), ignore the first <byte offset> bytes of every line. This option can be used to filter out possibly irrelevant information at the beginning of every line (e.g., timestamp and hostname). The default value for the option is zero, i.e., no bytes are ignored.
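For example, for hypothetical syslog lines like
Dec 18 09:01:10 myhost.mydomain sshd[1344]: log: Connection from 10.10.10.1 port 1022
giving -b 16 would skip the 16-byte timestamp 'Dec 18 09:01:10 ' at the beginning of every line.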
-c <clustertable size>
the number of slots in the cluster candidate hash table (note that this option does not limit the actual size of the hash table, since multiple elements can be connected to a single slot). The default value for the option is (100 * the number of frequent words).
-d <regexp>
the regular expression describing the word delimiter. The default value for the option is '[ \t]+', i.e., words are separated from each other by one or more space or tabulation characters.
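For example, a hypothetical -d '[ \t=,]+' would additionally treat the '=' and ',' characters as word delimiters.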
-f <regexp>
when processing the input file(s), ignore all lines that do not match the regular expression. The regular expression can contain ()-subexpressions, and when the -t <template> option has also been given, the values of those subexpressions can be retrieved in <template> through $-variables. When the -f and -b options are used together, the -b option is applied first.
-g <slice size>
when the -j option has been given, SLCT inspects the table of candidates and compares each candidate with the others, in order to find its subclusters. This task has quadratic complexity, and if the candidate table is large, a substantial amount of time could be required to complete it. In order to reduce the time complexity, SLCT divides the candidate table into slices, with each slice having <slice size> candidates, and all candidates in the same slice having the same number of constant words in their descriptions. A descriptive bit vector is then calculated for every slice, listing all constant words that the candidates of the slice have. If SLCT is inspecting the cluster candidate C for subclusters, and the descriptive vector of a slice S does not contain all the constant words of C, the candidates from slice S are skipped (i.e., they are not compared with the candidate C). If the -j option has been given, the default value for the -g option is (the number of cluster candidates / 100 + 1).
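A simplified Python sketch of this pruning (reusing is_subcluster() from the sketches in the DESCRIPTION section, and representing the descriptive vector of a slice as a set of constant (position, word) pairs):

def subclusters_of(C, slices):
    # slices is a list of (slice_words, slice_candidates) pairs, where
    # slice_words is the union of the constant (position, word) pairs
    # of all candidates in the slice
    wanted = {(pos, w) for pos, w in enumerate(C) if w != '*'}
    for slice_words, slice_candidates in slices:
        if not wanted <= slice_words:   # descriptive vector test fails:
            continue                    # no candidate here can qualify
        for cand in slice_candidates:
            if is_subcluster(cand, C):
                yield cand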
-i <seed>
the value that is used to initialize the rand(3) based random number generator which is used to generate seed values for string hashing functions inside SLCT. The default value for the option is 1.
-j
when processing the table of cluster candidates, also consider the relations between candidates, and allow the candidates (and clusters) to intersect. This option and the option -z are mutually exclusive, since -j requires the presence of all candidates in order to produce correct results, but with -z not all candidates are inserted into the candidate table.
-o <outliers file>
the file where outliers are written. This option is meaningless without the -r option.
-r
after the clusters have been found from the set of candidates, refine the cluster descriptions.
-t <template>
template that is used to convert input lines, after they have matched the regular expression given with the -f option. The template is a string that replaces the original input line, after the $-variables in the template have been replaced with the values of the ()-subexpressions from the regular expression. For example, if the regular expression given with the -f option is 'sshd\[[0-9]+\]: (.+)', and the template is "$1", then the line
sshd[1344]: connect from 192.168.1.1
will be converted to
connect from 192.168.1.1
This option is meaningless without the -f option.
-v <wordvector size>
the size of the word summary vector. The default value for the option is zero, i.e., no summary vector will be generated.
-w <wordtable size>
the number of slots in the vocabulary hash table. The default value for the option is 100,000.
-z <clustervector size>
the size of the cluster candidate summary vector. The default value for the option is zero, i.e., no summary vector will be generated. This option and the option -j are mutually exclusive, since -j requires the presence of all candidates in order to produce correct results, but when the summary vector is employed, not all candidates are inserted into the candidate table.
-s <support>
the support threshold value. The value can be either an integer, or a real number with a trailing %-sign.
 

AUTHOR

Risto Vaarandi <risto.vaarandi@eyp.ee>  

ACKNOWLEDGMENTS

This software uses the fast Shift-Add-Xor string hashing algorithm by M. V. Ramakrishna and Justin Zobel.


 
