Usage: logcluster.pl [options] Options: --input= ... --support= --rsupport= --separator= --lfilter= --template= --lcfunc= --syslog= --wsize= --csize= --wweight= --weightf= --wfreq= --wfilter= --wsearch= --wreplace= --wcfunc= --outliers= --readdump= --writedump= --readwords= --writewords= --color[=] --wildcard --wfileint --wordsonly --aggrsup --debug --help, -? --version --input= Find clusters from files matching the , where each cluster corresponds to some line pattern, and print patterns to standard output. For example, --input=/var/log/remote/*.log finds clusters from all files with the .log extension in /var/log/remote. This option can be specified multiple times. Note that special file name - means finding line patterns from data from standard input, and reporting detected line patterns when EOF is read from standard input. However, processing data from standard input only makes sense for clustering modes which involve single pass over input data. For example, consider the following stream mining scenario: logcluster.pl --input=- --support=100 --readwords=dictionary.txt --support= Find clusters (line patterns) that match at least lines in input file(s). Each line pattern consists of word constants and variable parts, where individual words occur at least times in input files(s). For example, --support=1000 finds clusters (line patterns) which consist of words that occur at least in 1000 log file lines, with each cluster matching at least 1000 log file lines. --rsupport= This option takes a real number from the range 0..100 for its value, and sets relative support threshold in percentage of total number of input lines. For example, if 20000 lines are read from input file(s), --rsupport=0.1 is equivalent to --support=20. --separator= Regular expression which matches separating characters between words. Default value for is \s+ (i.e., regular expression that matches one or more whitespace characters). --lfilter= When clustering log file lines from file(s) given with --input option(s), process only lines which match the regular expression. For example, --lfilter='sshd\[\d+\]:' finds clusters for log file lines that contain the string sshd[]: (i.e., sshd syslog messages). This option can not be used with --lcfunc option. --template= After the regular expression given with --lfilter option has matched a line, convert the line by substituting match variables in . For example, if --lfilter='(sshd\[\d+\]:.*)' option is given, only sshd syslog messages are considered during clustering, e.g.: Apr 15 12:00:00 myhost sshd[123]: this is a test When the above line matches the regular expression (sshd\[\d+\]:.*), $1 match variable is set to: sshd[123]: this is a test If --template='$1' option is given, the original input line Apr 15 12:00:00 myhost sshd[123]: this is a test is converted to sshd[123]: this is a test (i.e., the timestamp and hostname of the sshd syslog message are ignored). Please note that supports not only numeric match variables (such as $2 or ${12}), but also named match variables with $+{name} syntax (such as $+{ip} or $+{hostname}). This option can not be used without --lfilter option. --lcfunc= Similarly to --lfilter and --template options, this option is used for filtering and converting log file lines. This option takes the definition of an anonymous perl function for its value. The function receives the log file line as its only input parameter, and the value returned by the function replaces the original log file line. In order to indicate that the line should not be processed, 'undef' must be returned from the function. For example, with --lcfunc='sub { if ($_[0] =~ s/192\.168\.\d{1,3}\.\d{1,3}/IP-address/g) { return $_[0]; } return undef; }' only lines that contain the string "192.168.." are considered during clustering, and all strings "192.168.." are replaced with the string "IP-address" in such lines. Longer filtering and conversion functions can be defined in a separate perl module. For example, with --lcfunc='require "/home/user/TestModule.pm"; sub { TestModule::myfilter(@_); }' all function parameters are passed to the function myfilter() from TestModule, and the return value from myfilter() is used during clustering. This option can not be used with --lfilter option. --syslog= Log messages about the progress of clustering to syslog, using the given facility. For example, --syslog=local2 logs to syslog with local2 facility. --wsize= Instead of finding frequent words by keeping each word with an occurrence counter in memory, use a sketch of counters for filtering out infrequent words from the word frequency estimation process. This option requires an additional pass over input files, but can save large amount of memory, since most words in log files are usually infrequent. For example, --wsize=250000 uses a sketch of 250,000 counters for filtering. --csize= Instead of finding clusters by keeping each cluster candidate with an occurrence counter in memory, use a sketch of counters for filtering out cluster candidates which match less than lines in input file(s). This option requires an additional pass over input files, but can save memory if many cluster candidates are generated. For example, --csize=100000 uses a sketch of 100,000 counters for filtering. This option can not be used with --aggrsup option. --wweight= This option enables word weight based heuristic for joining clusters. The option takes a positive real number not greater than 1 for its value. With this option, an additional pass over input files is made, in order to find dependencies between frequent words. For example, if 5% of log file lines that contain the word 'Interface' also contain the word 'eth0', and 15% of the log file lines with the word 'unstable' also contain the word 'eth0', dependencies dep(Interface, eth0) and dep(unstable, eth0) are memorized with values 0.05 and 0.15, respectively. Also, dependency dep(eth0, eth0) is memorized with the value 1.0. Dependency information is used for calculating the weight of words in line patterns of all detected clusters. The function for calculating the weight can be set with --weightf option. For instance, if --weightf=1 and the line pattern of a cluster is 'Interface eth0 unstable', then given the example dependencies above, the weight of the word 'eth0' is calculated in the following way: (dep(Interface, eth0) + dep(eth0, eth0) + dep(unstable, eth0)) / number of words = (0.05 + 1.0 + 0.15) / 3 = 0.4 If the weights of 'Interface' and 'unstable' are 1, and the word weight threshold is set to 0.5 with --wweight option, the weight of 'eth0' remains below threshold. If another cluster is identified where all words appear in the same order, and all words with sufficient weight are identical, two clusters are joined. For example, if clusters 'Interface eth0 unstable' and 'Interface eth1 unstable' are detected where the weights of 'Interface' and 'unstable' are sufficient in both clusters, but the weights of 'eth0' and 'eth1' are smaller than the word weight threshold, the clusters are joined into a new cluster 'Interface (eth0|eth1) unstable'. In order to quickly evaluate different word weight threshold values and word weight functions on the same set of clusters, clusters and word dependency information can be dumped into a file during the first run of the algorithm, in order to reuse these data during subsequent runs (see --readdump and --writedump options). --weightf= This option takes an integer for its value which denotes a word weight function, with the default value being 1. The function is used for finding weights of words in cluster line patterns if --wweight option has been given. If W1,...,Wk are words of the cluster line pattern, value 1 denotes the function that finds the weight of the word Wi in the following way: (dep(W1, Wi) + ... + dep(Wk, Wi)) / k Value 2 denotes the function that will first find unique words U1,...Up from W1,...Wk (p <= k, and if Ui = Uj then i = j). The weight of the word Ui is then calculated as follows: if p>1 then (dep(U1, Ui) + ... + dep(Up, Ui) - dep(Ui, Ui)) / (p - 1) if p=1 then 1 Value 3 denotes a modification of function 1 which calculates the weight of the word Wi as follows: ((dep(W1, Wi) + dep(Wi, W1)) + ... + (dep(Wk, Wi) + dep(Wi, Wk))) / (2 * k) Value 4 denotes a modification of function 2 which calculates the weight of the word Ui as follows: if p>1 then ((dep(U1, Ui) + dep(Ui, U1)) + ... + (dep(Up, Ui) + dep(Ui, Up)) - 2*dep(Ui, Ui)) / (2 * (p - 1)) if p=1 then 1 --wfreq= This option enables frequent word identification in detected line patterns. The option takes a positive real number not greater than 1 for its value. If the total number of line patterns which are reported to the user is N, the word W is regarded frequent if it appears in K detected line patterns and (K / N) >= . Setting to a higher value allows for identifying words that are shared by many detected line patterns (such as hostnames or specific parts of timestamps). In order to highlight frequent words in line patterns, use --color option. --wfilter= --wsearch= --wreplace= These options are used for generating additional words during the clustering process, in order to detect frequent words that match the same template. If the regular expression matches the word, all substrings in the word that match the regular expression are replaced with the string . The result of search- and-replace operation is treated like a regular word, and can be used as a part of a cluster candidate. However, when both the original word and the result of search-and-replace are frequent, original word is given a preference during the clustering process. For example, if the following options are provided --wfilter='[.:]' --wsearch='[0-9]+' --wreplace=N the words 10.1.1.1 and 10.1.1.2:80 are converted into N.N.N.N and N.N.N.N:N Note that --wfilter option requires the presence of --wsearch and --wreplace, while --wsearch and --wreplace are ignored without --wfilter. --wcfunc= Similarly to --wfilter, --wsearch and --wreplace options, this option is used for generating additional words during the clustering process. This option takes the definition of an anonymous perl function for its value. The function receives the word as its only input parameter, and returns a list of 0 or more words (all 'undef' values in the list are ignored). During the clustering process, the original word is used if it is frequent, otherwise the first frequent word from the list replaces the original word. For example, with --wcfunc='sub { if ($_[0] =~ /^Chrome\/(\d+)/) { return ("Chrome/$1", "Chrome"); } }' the word list ("Chrome/49", "Chrome") is generated for the word "Chrome/49.0.2623.87". If words "Chrome/49.0.2623.87" and "Chrome/49" are infrequent but "Chrome" is frequent, the word "Chrome" replaces the word "Chrome/49.0.2623.87" during the clustering process. This option can not be used with --wfilter option. --outliers= If this option is given, an additional pass over input files is made, in order to find outliers. All outlier lines are written to the given file. --readdump= Read clusters and frequent word dependencies from a dump file that has been previously created with the --writedump option. This option is useful for quick evaluation of different word weight thresholds and word weight functions (see --wweight and --weightf options), without the need of repeating the entire clustering process during each evaluation. --writedump= Write clusters and frequent word dependencies to a dump file . This file can be used during later runs of the algorithm, in order to quickly evaluate different word weight thresholds and functions for joining clusters. --readwords= Don't detect frequent words from file(s) given with --input option(s), but read them in from a word file . With --wfileint option, has to be in binary format previously produced with --writewords option, otherwise it is assumed that is a text file where each frequent word is provided in a separate line. --writewords= After frequent words have been detected, write them to . With --wfileint option, will be in binary format, otherwise a text format will be used where each word is provided in a separate line. --color[=[]:[]] If --wweight option has been used for enabling word weight based heuristic for joining clusters, words with insufficient weight are highlighted in detected line patterns with color . If --wfreq option has been used for finding frequent words in line patterns, frequent words are highlighted in detected line patterns with color . If --color option is used without a value, it is equivalent to --color=green:red. --wildcard If --wweight option has been used for enabling word weight based heuristic for joining clusters, words with insufficient weight in detected line patterns are replaced with wildcards when patterns are reported to the user. --wfileint If --readwords or --writewords option has been used for reading or writing the word file, it is assumed that the word file is in binary format. --wordsonly Terminate after frequent words have been detected. If --writewords option is also provided, this option allows for identifying and storing frequent words only, so that no extra time is spent for finding clusters. --aggrsup If this option is given, for each cluster candidate other candidates are identified which represent more specific line patterns. After detecting such candidates, their supports are added to the given candidate. For example, if the given candidate is 'Interface * down' with the support 20, and candidates 'Interface eth0 down' (support 10) and 'Interface eth1 down' (support 5) are detected as more specific, the support of 'Interface * down' will be set to 35 (20+10+5). This option can not be used with --csize option. --debug Increase logging verbosity by generating debug output. --help, -? Print this help. --version Print the version information.