Dec 18 * myhost.mydomain sshd[*]: log: Connection from * port *
Dec 18 * myhost.mydomain sshd[*]: log: Password authentication for * accepted.
With the help of SLCT, one can quickly build a model of logfile(s), and also identify rare lines that do not fit the model (and are possibly anomalous).
When SLCT is searching for clusters, it processes logfile(s) at the word level. Every cluster description that is reported by SLCT consists of constant words and variable words (e.g., in the above examples the words 'Dec', 'myhost.mydomain', and 'log:' are constant words, while the words '*' and 'sshd[*]' are variable words). A cluster contains lines that match the cluster description, and the number of lines in the cluster is called the support of the cluster. Lines that do not belong to any of the clusters detected by SLCT are called outliers.
The task of SLCT is to identify a partitioning of logfile lines into clusters, so that each cluster has a support equal to or greater than the support threshold (the value given with the -s option). In other words, if the support threshold is N, SLCT reports clusters that contain at least N lines. Clusters are reported by printing their descriptions only, without outputting the individual lines that belong to each detected cluster. In order to find the clusters, the following key observation is employed: every constant word of each cluster must be frequent, i.e., the word must occur N or more times in logfile(s).
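To make the notion of a line matching a cluster description concrete, the rule can be sketched in Python (an illustration only; SLCT itself is written in C, and this simple matcher handles whole-word wildcards like '*' but not refined words such as 'sshd[*]'):

```python
def matches(line, description):
    """Return True if the line matches the cluster description,
    comparing word by word; '*' in the description matches any word."""
    line_words = line.split()
    desc_words = description.split()
    if len(line_words) != len(desc_words):
        return False
    return all(d == "*" or d == w
               for d, w in zip(desc_words, line_words))

print(matches("Connection from 10.10.10.1 closed",
              "Connection from * closed"))   # True
```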
In order to identify clusters, SLCT first makes a pass over the data and counts the occurrences of all words. Every distinct word is stored in an in-memory vocabulary together with its occurrence counter. The word position is taken into account: e.g., if the word 'debug' occurs as the 1st word of some lines and as the 3rd word of other lines, those two occurrences are counted separately. To minimize access times, the vocabulary is implemented as a move-to-front hash table. After the vocabulary has been built, SLCT finds the frequent words in it (i.e., words with occurrence counts equal to or greater than the support threshold).
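The first data pass can be sketched as follows (a Python illustration; SLCT's actual vocabulary is a move-to-front hash table written in C, for which a plain Counter stands in here, and the function names are made up for the sketch):

```python
from collections import Counter

def build_vocabulary(lines):
    """First data pass: count the occurrences of every (position, word)
    pair, so that the same word at different positions is counted
    separately."""
    vocab = Counter()
    for line in lines:
        for pos, word in enumerate(line.split()):
            vocab[(pos, word)] += 1
    return vocab

def frequent_words(vocab, support):
    """Words whose occurrence count reaches the support threshold."""
    return {key for key, count in vocab.items() if count >= support}

lines = ["Connection from 10.10.10.1 closed",
         "Connection from 10.10.10.2 closed",
         "debug: link down"]
vocab = build_vocabulary(lines)
print(sorted(frequent_words(vocab, 2)))
# [(0, 'Connection'), (1, 'from'), (3, 'closed')]
```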
When frequent words have been found, SLCT makes a second pass over the data, and builds cluster candidates. For every input line that contains frequent word(s), a new candidate is created from these words, e.g., if the line is
Connection from 10.10.10.1 closed
and the words 'Connection', 'from', and 'closed' are frequent, then the candidate is
Connection from * closed
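The candidate construction step of the second pass can be sketched like this (a Python illustration using the positional (position, word) pairs from the vocabulary sketch; make_candidate is a made-up name, not an SLCT function):

```python
def make_candidate(line, frequent):
    """Build a cluster candidate from a line: keep the frequent words
    and replace every infrequent word with the wildcard '*'. Returns
    None when the line contains no frequent word at all."""
    words = line.split()
    if not any((pos, w) in frequent for pos, w in enumerate(words)):
        return None
    return " ".join(w if (pos, w) in frequent else "*"
                    for pos, w in enumerate(words))

frequent = {(0, "Connection"), (1, "from"), (3, "closed")}
print(make_candidate("Connection from 10.10.10.1 closed", frequent))
# Connection from * closed
```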
If the candidate is not present in the candidate hash table, it will be inserted there with the support value 1; if the candidate is already present, its support value will be incremented. In both cases, the input line is assigned to the cluster candidate. After the second data pass, SLCT processes the candidate table in order to find clusters (candidates with support values equal to or greater than the support threshold). If the -j option was given, SLCT first checks for every cluster candidate C whether the candidate table contains any of C's subclusters. A cluster C1 is considered to be a subcluster of C2 if all lines that belong to C1 also match the description of C2, e.g., the cluster
Dec 18 * myhost.mydomain *: log: Connection from 10.10.10.1 port *
is a subcluster of the cluster
Dec 18 * myhost.mydomain *: log: Connection from * port *
If the cluster candidate C is found to have subclusters, all the lines belonging to the subclusters are also assigned to C (so that every such line will belong to two cluster candidates simultaneously), and the support values of the subclusters are added to the support value of C. After all candidates have been inspected, SLCT finds the clusters from the set of candidates by comparing the final support values with the support threshold. If the -j option was not given, the step described above is omitted, which means that candidates (and clusters) are not allowed to intersect.
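The subcluster test and the joining of support values can be sketched as follows (a simplified Python illustration of what the -j option does; it compares cluster descriptions word by word, does not handle refined words like 'sshd[*]', and the function names are made up for the sketch):

```python
def is_subcluster(desc1, desc2):
    """True if every line matching desc1 also matches desc2, i.e.
    desc2 is the same description except that some constant words of
    desc1 may be wildcards in desc2."""
    w1, w2 = desc1.split(), desc2.split()
    return (len(w1) == len(w2) and desc1 != desc2 and
            all(b == "*" or a == b for a, b in zip(w1, w2)))

def join_supports(candidates):
    """For every candidate, add the support values of its subclusters
    to its own support value (candidates maps descriptions to
    supports)."""
    joined = dict(candidates)
    for c2 in candidates:
        for c1, s1 in candidates.items():
            if is_subcluster(c1, c2):
                joined[c2] += s1
    return joined

candidates = {"Connection from 10.10.10.1 closed": 3,
              "Connection from * closed": 2}
print(join_supports(candidates))
# {'Connection from 10.10.10.1 closed': 3, 'Connection from * closed': 5}
```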
If the -r option was given, SLCT makes another pass over the data, in order to refine cluster descriptions. If the -o option was given together with -r, SLCT also finds outliers (lines that do not belong to any detected cluster) during the pass, and writes them to the file given with the -o option. Cluster description refinement means that variable words of cluster descriptions are inspected more closely for constant heads and tails, e.g., the cluster description
Dec 18 * myhost.mydomain *: log: Connection from * port *
is converted to
Dec 18 * myhost.mydomain sshd[*]: log: Connection from * port *
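The refinement of a single variable word can be sketched like this (a Python illustration; refine_slot is a made-up name, the input is the set of words actually observed at one variable position of a cluster, and edge cases such as overlapping heads and tails are ignored):

```python
import os

def refine_slot(words):
    """Refine one variable position of a cluster description: find the
    constant head and tail shared by every word seen at that position,
    and keep a wildcard for the varying middle part."""
    head = os.path.commonprefix(words)
    tail = os.path.commonprefix([w[::-1] for w in words])[::-1]
    return head + "*" + tail

print(refine_slot(["sshd[1344]", "sshd[2231]"]))
# sshd[*]
```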
The algorithm described above can consume a lot of memory when applied to a large data set, because while the vocabulary is being built, every distinct word has to be kept in memory together with its occurrence counter. Unfortunately, many words in real-life logfiles appear only a few times (quite often just once). Storing these very infrequent words in memory is therefore a waste of space; however, before the data set has been completely processed, it is impossible to tell which words will turn out to be infrequent.
To cope with this problem, SLCT can be instructed to make an extra pass over the data as its very first step, and build a word summary vector of size M before constructing the actual vocabulary. The size of the vector is given with the -v option (if the -v option is omitted or the vector size is set to zero, no summary vector will be built). The word summary vector is made up of M counters that are initialized to zero. During the pass over the data, every word is hashed to an integer value between 0 and M-1. Each time the hash value I is calculated for a word, the Ith counter in the vector is incremented. Since efficient string hashing functions are known to distribute their input values uniformly across the output range, each counter in the vector will correspond to roughly W / M words (where W is the total number of different words). The value of the Ith counter equals the sum of the occurrence counts of all words that hash to the value I.
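The summary vector technique can be sketched as follows (a Python illustration; the built-in hash() merely stands in for the string hash function that SLCT uses, and the function names are made up for the sketch):

```python
def build_summary_vector(lines, m):
    """Extra first pass: hash every (position, word) occurrence into
    one of m counters, so only the vector is kept in memory."""
    vector = [0] * m
    for line in lines:
        for pos, word in enumerate(line.split()):
            vector[hash((pos, word)) % m] += 1
    return vector

def maybe_frequent(vector, pos, word, support):
    """A word can only be frequent if its counter reaches the support
    threshold. The converse does not hold: hash collisions may inflate
    a counter, so the vocabulary pass still counts exactly."""
    return vector[hash((pos, word)) % len(vector)] >= support

lines = ["Connection from 10.10.10.1 closed",
         "Connection from 10.10.10.2 closed"]
vector = build_summary_vector(lines, 1000)
print(maybe_frequent(vector, 0, "Connection", 2))   # True
```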
After SLCT has constructed the word summary vector, it starts building the vocabulary, but a word is inserted into the vocabulary only if its counter value in the summary vector is equal to or greater than the support threshold. Words that do not fulfil this criterion cannot be frequent, since a word's occurrence count can never exceed the value of its counter, and that value is below the support threshold.
Though the use of the summary vector involves an extra pass over the data set, it is a convenient way to reduce the memory costs, because even a large vector consumes a relatively small amount of memory (e.g., if the vector has 1,000,000 counters, and the size of a counter is 4 bytes on your architecture, the size of the vector is still less than 4MB). If many of the words in the data set are infrequent, many of the counter values will not reach the support threshold, and the resulting vocabulary will consume much less memory. If the vector size is M, and K counter values reach the support threshold, then the resulting vocabulary will be about M / K times smaller than a vocabulary built without the summary vector.
In order to determine the right size for the summary vector, one must take into account the value of the support threshold and the estimated total number of different words. If the support threshold is small, the vector must be large to be effective; otherwise a relatively small vector suffices. The more different words there are, the larger the vector should be. The best way to determine the right size is to run SLCT several times with different vector sizes, and check which value gives the best result in terms of total memory cost.
When SLCT is invoked with a small support threshold value, the number of cluster candidates can sometimes be much larger than the number of frequent words and the number of actual clusters. In order to lower the memory cost of storing the candidate hash table, the summary vector method described above can be applied to cluster candidates as well, i.e., before the actual candidates are generated, an extra pass is made over the data and a summary vector is built. The size of this summary vector is given with the -z option.
SLCT writes its output to standard output, and logs its messages to standard error.