With large volumes of network traffic, counting "stuff" becomes an expensive operation that needs lots of memory. For example, to report how many different IP addresses we see, we need to keep a set of all those we have already observed, which quickly gets huge. Likewise, to find scanners, we need to count for each source address how many distinct destinations it has already talked to. All of that gets really expensive in the cluster setting, where we need to track all counters in a distributed fashion and then aggregate them on the manager.
However, it turns out that if we can accept a small error in our results (e.g., up to 3% for a set’s cardinality), there are smarter probabilistic data structures than the naive set-based approaches. That’s what we want to add to Zeek: probabilistic estimation of cardinality and frequency that trades off accuracy for memory.
A crucial constraint is that we need to support the cluster’s distributed counting: each worker counts what it sees, and the manager then merges all those sets into a single one.
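To make the memory/accuracy trade-off and the merge requirement concrete, here is a minimal sketch of linear counting, one of the simplest probabilistic cardinality estimators. This is purely illustrative, not proposed code (a real implementation would likely use HyperLogLog or similar): a fixed-size bitmap replaces the full set, and the fraction of zero bits yields the estimate. Crucially, two bitmaps of equal size merge via a bitwise OR, which is exactly the property the cluster setting needs.

#include <cmath>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Linear counting: estimate the number of distinct elements with a
// fixed-size bitmap instead of storing the elements themselves.
class LinearCounter {
public:
    explicit LinearCounter(size_t bits) : bitmap_(bits, false) {}

    // Hash the element to a bit position; duplicates hit the same bit.
    void insert(const std::string& elem) {
        bitmap_[std::hash<std::string>{}(elem) % bitmap_.size()] = true;
    }

    // Estimate: n ~ -m * ln(V), with m the bitmap size and V the
    // fraction of bits still zero.
    uint64_t size() const {
        size_t zeros = 0;
        for (size_t i = 0; i < bitmap_.size(); ++i)
            if (! bitmap_[i])
                ++zeros;
        if (zeros == 0) // Saturated; the estimate is no longer reliable.
            return bitmap_.size();
        double v = double(zeros) / double(bitmap_.size());
        return uint64_t(-double(bitmap_.size()) * std::log(v));
    }

    // Distributed counting: merging two counters of equal size is just
    // a bitwise OR, so workers count locally and the manager combines.
    void merge(const LinearCounter& other) {
        for (size_t i = 0; i < bitmap_.size(); ++i)
            bitmap_[i] = bitmap_[i] || other.bitmap_[i];
    }

private:
    std::vector<bool> bitmap_;
};

The same merge-by-OR (or merge-by-add) property holds for the more sophisticated estimators, which is what makes per-worker counting with aggregation on the manager feasible in the first place.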
Generally, the following primitives would be great to have available in Zeek:
1. Set cardinality: how many different instances of something do we see?
2. Frequency estimation:
   a. Determine the top-k most frequent elements in a (multi-)set.
   b. Report all elements with frequency larger than a threshold t; and/or trigger a callback when an element crosses the threshold.
3. Set membership: have we seen element X already?
4. Range queries: elements with a value inside a range [min, max], assuming we have an ordering.
5. Quantiles: given a quantile q in [0,1], find the value whose rank in the sorted sequence of the n values is qn. E.g., the median corresponds to q=.5. (It seems quantiles could provide range queries as well.)
Some of these are more important than others; we should probably start with 1, then 2a, then 2b, and skip 4 and 5 for the time being.
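To make 2a/2b more concrete: the classic structure for frequency estimation is the count-min sketch (stream-lib, referenced below, implements it too). The following is a rough, illustrative sketch only; sizing the width/depth from an error margin, proper pairwise-independent hash functions, and the top-k bookkeeping on top are all omitted. The idea is a small two-dimensional array of counters where each row hashes the element independently, and a query takes the minimum across rows, so estimates may overcount due to collisions but never undercount.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Count-min sketch: per-element frequency estimates in sub-linear
// space; estimates can overcount (collisions) but never undercount.
class CountMinSketch {
public:
    CountMinSketch(size_t width, size_t depth)
        : width_(width), counters_(depth, std::vector<uint64_t>(width, 0)) {}

    void insert(const std::string& elem) {
        for (size_t row = 0; row < counters_.size(); ++row)
            ++counters_[row][index(elem, row)];
    }

    // Frequency estimate: the minimum count across all rows.
    uint64_t count(const std::string& elem) const {
        uint64_t result = UINT64_MAX;
        for (size_t row = 0; row < counters_.size(); ++row)
            result = std::min(result, counters_[row][index(elem, row)]);
        return result;
    }

    // Merging sketches of equal dimensions is cell-wise addition,
    // matching the worker/manager aggregation pattern.
    void merge(const CountMinSketch& other) {
        for (size_t row = 0; row < counters_.size(); ++row)
            for (size_t col = 0; col < width_; ++col)
                counters_[row][col] += other.counters_[row][col];
    }

private:
    // Stand-in for d independent hash functions: salt the element with
    // the row number. A real implementation would do better.
    size_t index(const std::string& elem, size_t row) const {
        return std::hash<std::string>{}(std::to_string(row) + elem) % width_;
    }

    size_t width_;
    std::vector<std::vector<uint64_t>> counters_;
};

Top-k (2a) and threshold reporting (2b) would then layer on top, e.g., by keeping a small list of candidate heavy hitters next to the sketch.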
The following URLs provide background on corresponding approaches and data structures:
- Introductory blog article: http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
- Another blog article: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
- A Java library implementing much of this: https://github.com/clearspring/stream-lib (Note the paper list there).
- This paper seems to be missing from the stream-lib paper list, although they use it: http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf
- A related book chapter: http://charuaggarwal.net/synopsis.pdf
- Matthias’ Bloom filter implementation (a minimal sketch of the basic variant follows after this list): http://matthias.vallentin.net/blog/2011/06/a-garden-variety-of-bloom-filters/
- Zeek Cluster concept: http://www.icir.org/robin/papers/raid07.pdf
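Since the set-membership primitive (3) will almost certainly be some flavor of Bloom filter, here is a minimal sketch of the textbook variant for illustration only (Matthias’ post above covers the many refinements): each element sets k bit positions derived via double hashing; lookups can return false positives but never false negatives, and merging two filters with identical configuration is once more a bitwise OR.

#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Textbook Bloom filter: probabilistic set membership with false
// positives but no false negatives.
class BloomFilter {
public:
    BloomFilter(size_t bits, size_t num_hashes)
        : bits_(bits, false), num_hashes_(num_hashes) {}

    void insert(const std::string& elem) {
        for (size_t i = 0; i < num_hashes_; ++i)
            bits_[index(elem, i)] = true;
    }

    // True if the element *may* have been inserted; false means it
    // definitely was not.
    bool hasElement(const std::string& elem) const {
        for (size_t i = 0; i < num_hashes_; ++i)
            if (! bits_[index(elem, i)])
                return false;
        return true;
    }

private:
    // Double hashing: position i is h1 + i * h2, a standard way to
    // derive k hash functions from two.
    size_t index(const std::string& elem, size_t i) const {
        size_t h1 = std::hash<std::string>{}(elem);
        size_t h2 = std::hash<std::string>{}(elem + "\x01");
        return (h1 + i * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    size_t num_hashes_;
};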
Generally, the stream-lib library is pretty much what we need in terms of functionality. However, it’s written in Java and hence not of much direct use. The proposal is to implement similar data structures in C++.
It seems useful to do that outside of Zeek initially: let’s develop a generic C++ library that provides an interface suitable for Zeek to leverage. That way, we can develop and maintain the functionality independently, and others can leverage it too. We then integrate it into Zeek by writing a set of BiFs that provide the functionality to the script layer.
What follows is a straw-man of a possible C++ API for such a library.
It seems that, generally, all the data structures use hashing as their initial step. That’s probably best split out so that one can use different hash functions tailored to the element type and data structure. (For Zeek, we may just use its existing hash functions, as long as no specific properties are required.)
typedef uint32_t hash_t;

template<typename Element>
hash_t hash(const Element& elem);
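As an illustration of type-tailored hashing (not a proposal to use this particular function), a specialization for strings could be 32-bit FNV-1a, which happens to match the hash_t above; the declarations are repeated here so the snippet stands on its own:

#include <cstdint>
#include <string>

typedef uint32_t hash_t;

template<typename Element>
hash_t hash(const Element& elem);

// Illustrative specialization: 32-bit FNV-1a over the string's bytes,
// chosen only because it is simple and well known.
template<>
hash_t hash<std::string>(const std::string& elem) {
    hash_t h = 2166136261u;          // FNV offset basis
    for (unsigned char c : elem) {
        h ^= c;                      // XOR in each byte...
        h *= 16777619u;              // ...then multiply by the FNV prime
    }
    return h;
}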
Below is a proposed set interface for supporting the primitives listed above. For now, I’m putting everything into a single class. In practice, a single data structure is unlikely to support all the operations, so we should split up the class as appropriate, e.g., have one class that supports only cardinality estimation and another that does top-k. The overall interface would be the same for each, except with the non-supported operations removed. Ideally, that will eventually all turn into a class hierarchy, with the common methods encapsulated in a base class and child classes providing different subsets of functionality. But I don’t have a clear enough picture of the similarities and differences right now to lay that out already (though we may borrow from stream-lib here).
That said, here’s a suggested draft interface (I’m sure it’s too naive, but it should give us a starting point):
template<typename Element, typename Hash>
class ProbabilisticSet {
    // Constructor for an initially empty set, with a specified
    // error margin for its operations and a maximum number of
    // elements we expect.
    ProbabilisticSet(double error_margin, uint64_t max_size);

    // Inserts an element into the set.
    void insert(const Element& elem);

    // Returns the current size estimate.
    uint64_t size();

    // Returns the *k* most frequent elements, with their frequencies.
    // (Note: I'm sure in practice, we need to pass *k* to the
    // constructor. Similar note applies to other methods below.)
    list<pair<Element, uint64_t>> topK(int k);

    // Returns all elements with frequency larger than a threshold *t*.
    list<Element> largerThan(uint64_t t);

    // More general than *largerThan*: Returns all elements *e*
    // with *min <= freq(e) <= max*.
    list<Element> range(uint64_t min, uint64_t max);

    // Executes a callback when an element's frequency exceeds a
    // threshold *t*. (Specifics of the callback need to be figured
    // out.)
    void frequencyThreshold(uint64_t t, Callback callback);

    // Returns true if the element is a member of the set.
    bool hasElement(const Element& elem);

    // Removes all elements from the set.
    void clear();

    // Merges another set into this one.
    void merge(const ProbabilisticSet& other);

    // Serializes the current set content into a
    // platform-independent binary representation that we can send
    // around over the network.
    void serialize(ostream& out);

    // Restores a set's content from the binary representation
    // that serialize() returns.
    void unserialize(istream& in);

    // Reports the current memory usage of this set in bytes;
    // mainly for statistical purposes.
    size_t memoryUsage();
};
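To show how this interface would map onto the cluster workflow described at the beginning, here’s a hypothetical usage sketch. Nothing in it is definitive: "Hasher" is a placeholder for whatever the Hash parameter turns out to be, the error margin and size are arbitrary, and the draft class would need its members made public.

#include <cstdint>
#include <sstream>
#include <string>

using IPSet = ProbabilisticSet<std::string, Hasher>;

void cluster_example() {
    // On a worker: count locally, with a 3% error margin and an
    // expected maximum of one million elements.
    IPSet worker_set(0.03, 1000000);
    worker_set.insert("192.168.1.1");
    worker_set.insert("10.0.0.42");

    // Ship the worker's state to the manager as an opaque blob.
    std::ostringstream blob;
    worker_set.serialize(blob);

    // On the manager: restore the worker's state and fold it into
    // the cluster-wide set.
    IPSet from_worker(0.03, 1000000);
    std::istringstream in(blob.str());
    from_worker.unserialize(in);

    IPSet global_set(0.03, 1000000);
    global_set.merge(from_worker);

    uint64_t distinct = global_set.size(); // cluster-wide estimate
}

The serialize()/merge() pair is the critical piece: each worker keeps a small fixed-size summary, and the manager obtains a cluster-wide estimate without ever seeing the raw elements.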