If LogHound is started without the -g option, it assumes that the input file(s) are raw event logs, and it mines frequent line patterns from the file(s). In that case, each line of the input file(s) is considered a transaction with word-position pairs as items. E.g., LogHound views the line
Password authentication for john accepted
as a transaction {(Password, 1), (authentication, 2), (for, 3), (john, 4), (accepted, 5)}. Note that each frequent itemset corresponds to a certain line pattern, e.g., the itemset {(Password, 1), (authentication, 2), (for, 3), (accepted, 5)} represents the line pattern
Password authentication for * accepted
If LogHound is started with the -g option, it assumes that each line of the input file(s) represents a set of events (e.g., events from a user ftp session), and it mines frequent event type patterns from the file(s). In that case, each line of the input file(s) is considered a transaction with events as items (the order of events is not important). E.g., LogHound views the line
FanFailure LinkDown LinkUp
as a transaction {FanFailure, LinkDown, LinkUp}.
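To make the transaction representation concrete, the following C sketch splits a log line into (word, position) items in the manner described above for the default (no -g) mode. It is a minimal illustration, not LogHound's actual code, and all names in it are hypothetical:

  #include <stdio.h>
  #include <string.h>

  /* Illustrative only: split a log line into (word, position) items,
     mirroring how LogHound treats a line as a transaction when the
     -g option is not given. */
  int main(void)
  {
      char line[] = "Password authentication for john accepted";
      int pos = 1;
      char *word = strtok(line, " \t");

      while (word != NULL) {
          /* each item is a word paired with its position in the line */
          printf("(%s, %d)\n", word, pos);
          pos++;
          word = strtok(NULL, " \t");
      }
      return 0;
  }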
In order to mine frequent itemsets (patterns), LogHound uses a breadth-first search strategy and keeps detected frequent itemsets in a trie data structure. Although the popular Apriori algorithm also works in a breadth-first manner and its implementations often employ an itemset trie, the algorithm built into LogHound differs from Apriori in several respects. The differences are described below.
As its first step, LogHound detects frequent items. In order to do that, it makes a pass over the input data set and counts how many times each item occurs. The item occurrence counters are kept in a move-to-front hash table data structure (the number of slots in the table can be specified with the -w option). Unfortunately, since the number of items can be very large in event log data sets, this step could consume a lot of memory. If many items are infrequent, storing counters for them in memory is a waste of space; however, before the data set has been completely processed, it is impossible to tell which items will turn out to be infrequent.
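The following C sketch shows one possible shape of a move-to-front hash table for item counting, assuming chaining for collision resolution; LogHound's actual implementation may differ, and all names here are hypothetical. The move-to-front step keeps frequently seen items near the heads of their chains, so the items that dominate the data set are found with few string comparisons:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define NSLOTS 100003   /* stands in for the value of the -w option */

  struct node {
      char *item;
      unsigned long count;
      struct node *next;
  };

  static struct node *slots[NSLOTS];

  /* hypothetical hash function mapping an item to a slot number */
  static unsigned long hash(const char *s)
  {
      unsigned long h = 5381;
      while (*s)
          h = h * 33 + (unsigned char) *s++;
      return h % NSLOTS;
  }

  /* Increment the counter of an item, creating a node on first sight.
     A node that is found is moved to the front of its chain. */
  static void count_item(const char *item)
  {
      unsigned long h = hash(item);
      struct node *n = slots[h], *prev = NULL;

      while (n != NULL) {
          if (strcmp(n->item, item) == 0) {
              n->count++;
              if (prev != NULL) {        /* move-to-front step */
                  prev->next = n->next;
                  n->next = slots[h];
                  slots[h] = n;
              }
              return;
          }
          prev = n;
          n = n->next;
      }
      n = malloc(sizeof(*n));
      n->item = malloc(strlen(item) + 1);
      strcpy(n->item, item);
      n->count = 1;
      n->next = slots[h];
      slots[h] = n;
  }

  int main(void)
  {
      const char *items[] = { "(Password, 1)", "(for, 3)", "(Password, 1)" };
      unsigned long i;
      struct node *n;

      for (i = 0; i < 3; i++)
          count_item(items[i]);
      for (i = 0; i < NSLOTS; i++)
          for (n = slots[i]; n != NULL; n = n->next)
              printf("%s occurs %lu times\n", n->item, n->count);
      return 0;
  }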
To cope with this problem, LogHound can be instructed to make an extra pass over the input data set as its very first step, and build an item summary vector of size M before the item counting (the value of M is given with the -v option). The item summary vector is made up of M counters that are initialized to zero. During the pass over the data set, every item is hashed to an integer value from 0 to M-1. Each time the hash value I is calculated for an item, the Ith counter in the vector is incremented. Since efficient hash functions are known to distribute their input values uniformly across the output range, each counter in the vector corresponds to roughly N / M items (where N is the total number of items). The value of the Ith counter equals the sum of the occurrence counts of all items that hash to the value I. After LogHound has constructed the item summary vector, it will count the occurrences of only those items for which the counter values in the summary vector are equal to or greater than the support threshold. If many of the items in the data set are infrequent, many of the counter values will stay below the support threshold, and a significant amount of memory will be saved during the item counting.
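The following C sketch illustrates the technique, assuming a fixed list of items in place of a real input pass; the hash function and all names are hypothetical. Note that a hash collision can only inflate a counter, never deflate it, so an item that is actually frequent is never skipped:

  #include <stdio.h>

  #define M 8          /* stands in for the value of the -v option */
  #define SUPPORT 2    /* stands in for the support threshold */

  /* hypothetical hash function mapping an item to 0 .. M-1 */
  static unsigned long hash(const char *s)
  {
      unsigned long h = 5381;
      while (*s)
          h = h * 33 + (unsigned char) *s++;
      return h % M;
  }

  int main(void)
  {
      /* stand-in for the items seen during a pass over the data set */
      const char *items[] = { "(for, 3)", "(for, 3)", "(john, 4)" };
      unsigned long vector[M] = { 0 };
      unsigned long i;

      /* extra pass: the counter for hash value I accumulates the
         occurrence counts of all items hashing to I */
      for (i = 0; i < 3; i++)
          vector[hash(items[i])]++;

      /* item counting pass: allocate a real counter only for items
         whose summary counter reaches the support threshold */
      for (i = 0; i < 3; i++)
          if (vector[hash(items[i])] >= SUPPORT)
              printf("count the item %s\n", items[i]);
          else
              printf("skip the item %s (cannot be frequent)\n", items[i]);
      return 0;
  }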
After frequent items have been detected, LogHound ignores infrequent items in transactions during further mining (if an item is infrequent, it can't be part of any frequent itemset). In order to speed up the mining process, LogHound loads a part of the data set into memory and stores it in the cache trie. For each transaction, LogHound considers the set of all frequent items from that transaction. The cache trie is guaranteed to contain every such set that occurs for at least N transactions in the data set (the value of N is given with the -n option). Sets that are not stored in the cache trie are written to the out-of-cache file (the location of this file is given with the -o option). To determine which sets should be loaded into memory, LogHound uses the summary vector technique described above, and loads a set into memory only if the corresponding counter in the transaction summary vector is not below N (note that simply counting the occurrences of each set would require keeping all sets in memory together with their occurrence counters). In that way, more frequently used transaction data are guaranteed to be in main memory, while less frequently used data remain on disk.
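The sketch below applies the same summary vector idea to whole frequent-item sets, assuming each set has been serialized into a single string key; again, the names, hash function, and data are hypothetical stand-ins, not LogHound's actual code:

  #include <stdio.h>

  #define M 8     /* size of the transaction summary vector */
  #define N 2     /* stands in for the value of the -n option */

  /* hypothetical hash function mapping a serialized set to 0 .. M-1 */
  static unsigned long hash(const char *s)
  {
      unsigned long h = 5381;
      while (*s)
          h = h * 33 + (unsigned char) *s++;
      return h % M;
  }

  int main(void)
  {
      /* frequent-item sets of three transactions, after infrequent
         items have been dropped (hypothetical data) */
      const char *sets[] = {
          "(Password, 1)(authentication, 2)(for, 3)(accepted, 5)",
          "(Password, 1)(authentication, 2)(for, 3)(accepted, 5)",
          "(Connection, 1)(from, 2)"
      };
      unsigned long vector[M] = { 0 };
      unsigned long i;

      /* extra pass: build the transaction summary vector */
      for (i = 0; i < 3; i++)
          vector[hash(sets[i])]++;

      /* main pass: a set is cached only if its summary counter is
         not below N; other sets go to the out-of-cache file */
      for (i = 0; i < 3; i++)
          if (vector[hash(sets[i])] >= N)
              printf("cache trie: %s\n", sets[i]);
          else
              printf("out-of-cache file: %s\n", sets[i]);
      return 0;
  }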
During the mining process, LogHound detects frequent itemsets in a level-wise manner, building the itemset trie layer by layer. If the -l option is given, LogHound does not look for frequent itemsets that contain more items than specified by the option. Unlike Apriori, LogHound uses a heuristic for reducing the number of nodes in the trie, in order to reduce memory consumption and runtime. When the mining is complete, LogHound derives all frequent itemsets from the nodes of the itemset trie and outputs them. Since a node in the trie can represent several itemsets that all have the same support, LogHound outputs a concise description for these itemsets, with variable items in parentheses. E.g., the description
(Nov) * * (myserver) * log:
Support: 1249
represents four frequent itemsets with support 1249:
Nov * * myserver * log:
Nov * * * * log:
* * * myserver * log:
* * * * * log:
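The following C sketch shows how such a description can be expanded: each of the k variable items may either be present or be replaced with a wildcard, yielding 2^k itemsets (here k = 2). The expansion logic is illustrative, not LogHound's actual code:

  #include <stdio.h>

  int main(void)
  {
      /* the two variable items of the description
         (Nov) * * (myserver) * log: */
      const char *var[] = { "Nov", "myserver" };
      int mask;

      /* bit i of mask decides whether variable item i is present
         or replaced with the wildcard */
      for (mask = 0; mask < 4; mask++)
          printf("%s * * %s * log:\n",
                 (mask & 1) ? var[0] : "*",
                 (mask & 2) ? var[1] : "*");
      return 0;
  }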
If the -c option is given, LogHound will output only closed frequent itemsets (a closed itemset is an itemset that has no superset with the same support). E.g., of the four itemsets above, only the first one can be closed, since each of the other three has a superset (the first itemset) with the same support.
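As a minimal illustration of the closedness test, the following C sketch checks, for the four itemsets above, whether a proper superset with the same support exists among them (a real implementation would of course consider all mined itemsets); each itemset is encoded as a bitmask over the two variable items:

  #include <stdio.h>

  int main(void)
  {
      /* the four itemsets above, encoded as bitmasks over the two
         variable items: bit 0 = "Nov", bit 1 = "myserver" */
      int sets[] = { 3, 1, 2, 0 };
      long support[] = { 1249, 1249, 1249, 1249 };
      int i, j, closed;

      for (i = 0; i < 4; i++) {
          closed = 1;
          for (j = 0; j < 4; j++)
              if (i != j &&
                  (sets[i] & sets[j]) == sets[i] && /* j is a superset of i */
                  sets[i] != sets[j] &&             /* ... a proper one */
                  support[j] == support[i])         /* ... with equal support */
                  closed = 0;
          printf("itemset with bitmask %d is %s\n",
                 sets[i], closed ? "closed" : "not closed");
      }
      return 0;
  }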
LogHound writes its output to standard output, and logs its messages to standard error.