The primary option for binary logging so far seems to be HP’s DataSeries, as documented here:
Source is here:
DataSeries comes with a BSD license (which is what we need).
The general advantages of DataSeries in terms of speed, data volume, and analysis flexibility seem to be clear. However, there are a number of issues/questions that we need to address/answer before going forward:
It seems that DataSeries works on collections of records, called "extents", not on individual records. That means we need to buffer a number of records before we can pass them on to DataSeries. HP uses temporary files for that, which is not great. We could buffer in memory, I guess?
But: Does this introduce CPU spikes at the time when we write the data out? We must not drop any packets just because we happen to be writing some data out …
How about using a separate thread or process that buffers log messages (in memory) and then passes them on to DataSeries?
If we buffer, we need to limit the maximum amount of time that data can stay in memory before being written out. Otherwise, behavior becomes unpredictable, and some entries might never be written out (except at termination).
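To make the buffering question more concrete, here is a minimal sketch (deliberately not tied to the DataSeries API) of an in-memory buffer that flushes a batch either when a row-count threshold is reached or when the oldest buffered record exceeds a maximum age, which bounds the write-out lag discussed above:

```python
# Illustrative sketch only, not the DataSeries API: group log records into
# extent-sized batches and flush on row count or age. flush_extent stands in
# for whatever actually hands the batch to the binary writer.

import threading
import time

class ExtentBuffer:
    def __init__(self, flush_extent, max_rows=10000, max_age_secs=1.0):
        self._flush_extent = flush_extent   # callable taking a list of records
        self._max_rows = max_rows
        self._max_age = max_age_secs
        self._rows = []
        self._oldest = None
        self._lock = threading.Lock()

    def append(self, record):
        with self._lock:
            if not self._rows:
                self._oldest = time.monotonic()
            self._rows.append(record)
            if len(self._rows) >= self._max_rows:
                self._flush_locked()

    def flush_if_stale(self):
        # Called periodically (e.g., from a timer) to bound the write-out lag.
        with self._lock:
            if self._rows and time.monotonic() - self._oldest >= self._max_age:
                self._flush_locked()

    def _flush_locked(self):
        rows, self._rows, self._oldest = self._rows, [], None
        self._flush_extent(rows)
```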
This buffering is another argument in favor of keeping an ASCII version as well. With the binary format, there will always be a lag until things get written to disk, and there are situations where one doesn’t want that (in particular, when testing/debugging a script). That’s why we already have file-flush.bro.
Gregor: I would prefer to only use a single logging option. I.e., not using ASCII anymore. Can’t we have a file-flush.bro that works for DataSeries?
What are the options for working with a DataSeries log file? How can I write a shell script crunching the data in there? Python? Perl? Is it easy to filter for specific conditions? Do we need to write our own tools for this, or can we use existing ones coming with DataSeries?
How about having a (more or less universal) tool that reads DataSeries log files and outputs ASCII, which can then be processed with Python, Perl, awk, sed, etc.? In the long run, language bindings for some of these scripting languages might be nice, though.
One pretty cool thing to have would be consistent cross-log analysis, like "extract all activity from IP 1.2.3.4". How can we do that? Do we just need to make sure that we name fields consistently (e.g., all originator addresses are consistently called "orig" and we can filter for "orig==1.2.3.4" across all logs)?
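As a sketch of what such a cross-log query could look like, assuming the binary logs have been dumped to CSV (e.g., by the converter tool proposed above) and every log consistently names the originator address column "orig"; the file names are illustrative:

```python
# Filter all activity for one originator IP across several per-protocol logs
# that have already been converted to CSV. Relies only on consistent column
# naming across logs.

import csv

def activity_for(ip, csv_files):
    for path in csv_files:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if row.get("orig") == ip:
                    yield path, row

for log, row in activity_for("1.2.3.4", ["conn.csv", "http.csv", "dns.csv"]):
    print(log, row)
```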
Assuming we keep an ASCII format as well, can we provide a tool that converts a binary log file to exactly the same format as the ASCII? Likewise, can we convert an ASCII log to an equivalent binary log? If we can do both ways, ASCII and binary logs are completely interchangeable, which would be the ideal case.
Gregor: I don’t think that this will work in the long run. I think bit rot will ultimately take its toll and make the ASCII and binary format diverge. However, having a DataSeries to ASCII converter should do the trick.
How do we define the schemas? I don’t really want to put the burden of defining a schema on the person writing a Zeek script. If that person just defines fields for his log file, can we build the schema automatically? How does DataSeries deal with changes to the schema in this case?
Follow-up question: if we do that, what about additional {opt,pack}_* attributes?
Fields should be "nullable", i.e., potentially absent. For example, for connection sizes in conn.log, we want "number of bytes or ‘?’".
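A sketch of how the schema could be derived automatically from the fields a script declares, emitting an extent-type description in the XML style DataSeries uses. The Zeek-to-DataSeries type mapping and the exact XML layout are assumptions to be checked against the DataSeries documentation; the primitive types are the ones listed below, and opt_nullable marks columns that may be absent:

```python
# Hypothetical mapping from Zeek script types to DataSeries primitive types.
ZEEK_TO_DS = {
    "bool":   "bool",
    "count":  "int64",      # forced into a signed type; see the data-type notes below
    "int":    "int64",
    "double": "double",
    "time":   "double",     # or int64 microseconds; see the time-column discussion
    "string": "variable32",
    "addr":   "variable32", # IP addresses as strings until a native type exists
}

def extent_type_xml(name, fields):
    # fields: list of (field_name, zeek_type, nullable)
    lines = ['<ExtentType name="%s">' % name]
    for fname, ztype, nullable in fields:
        attrs = 'type="%s" name="%s"' % (ZEEK_TO_DS[ztype], fname)
        if nullable:
            attrs += ' opt_nullable="yes"'
        lines.append("  <field %s/>" % attrs)
    lines.append("</ExtentType>")
    return "\n".join(lines)

print(extent_type_xml("conn", [
    ("ts", "time", False),
    ("orig", "addr", False),
    ("orig_bytes", "count", True),   # "number of bytes or '?'"
]))
```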
Do the pre-provided data types suffice? DataSeries offers:
bool (0 or 1), byte (0-255), int32 (signed 32-bit integer), int64 (signed 64-bit integer), double (IEEE 64-bit floating point), and variable32 (up to 2^31 bytes of variable-length data, such as strings).
What do we do with IP addresses (and in particular IPv6 addresses)? Convert to strings? Or do we need to add a new datatype?
Is variable32 limited to 2^31 bytes, or literally 231 bytes? Counters are generally uint32 or uint64; maybe adding these types would make sense (or we somehow force them into signed integers). IPv4 addresses could be stored as unsigned 32-bit integers, but given IPv6, IPs should probably be stored as strings.
The DataSeries TR says:
Supporting additional data types is a straightforward extension that does not require changing the version of DataSeries. In practice we have found these data types to be sufficient.
What about ports (which include the protocol, i.e., tcp/udp/icmp)? int32? That sounds reasonable: use 24 bits of the int32 for protocol + port.
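Purely as an illustration of the packing suggested above (port in the low 16 bits, protocol number in the next 8 bits, so the value stays within the positive range of a signed int32):

```python
# Hypothetical int32 encoding of (protocol, port); not an existing Zeek or
# DataSeries convention.

def pack_port(proto, port):
    assert 0 <= proto <= 255 and 0 <= port <= 65535
    return (proto << 16) | port

def unpack_port(value):
    return (value >> 16) & 0xFF, value & 0xFFFF

assert unpack_port(pack_port(6, 443)) == (6, 443)   # tcp/443
```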
Can we apply opt_doublebase=base-value to just time columns? Do we want a separate time type? Is pack_relative already all we need?
We need to integrate "remote printing": the cluster workers currently send all their log output as strings to the manager, which then prints them into its local log file. We need to have a similar mechanism for binary output.
One option here would be to send an event from the worker to the manager, which then passes the information on to its local logging system. However, that can get quite expensive, as we suddenly need to do a lot more serialization/deserialization to send log information (because we can't just send a string anymore).
I’m not sure I understand what Lintel does. Part of it seems to be enabling parallel log processing?
There are other libraries out there for storing binary data:
- Google’s Protocol Buffers.
- What else?
Is any of these a better fit? Can one of them address/answer the questions above more easily?
- buffer in memory: yes, that's what we do for things that natively write DataSeries. Most of the published work talks about that form; the Chirp work in MASCOTS '09 does direct DataSeries logging, but the only copies are behind a paywall. Work on each extent is done in a separate thread. If you do lzf compression, it can actually be a CPU win because you push less data into the filesystem.
- Just wanted to clarify that the use of temporary files is not a limitation of DataSeries; we used them initially in the Apache work because we were concerned about messing with/up Apache's memory management. Buffering in memory is definitely the preferred option (which we tried to approximate in that work using tmpfs).
- buffer limiting: yes, you need to decide when to flush extents. We always did it row-count based, but our direct DS logging was generating tons of output. We also only analyzed at termination.
- keep ascii?: I’d probably keep ascii not for buffering but for debugging. It’s simple and easy and has fewer failure cases.
- limited lag: you could set a very short lag (e.g. 1s) with the only problem being that you’ll get bigger files (there was a file format mistake that made it inefficient for very small extents)
- working with ds log file: ds2txt converts to text, including csv format. We haven't done scripting interfaces because all the intended uses were for large enough things that using a scripting language wouldn't be plausible; the few times we've done it for comparison, it's really just been ds2txt in csv mode and then parsing in the scripting language.
- cross-log: so long as the column names and types you access are the same, ds should be able to operate on multiple log files. Having similar extent type names would help.
- ASCII <-> binary: After a few early mistakes of not doing this, we always wrote precise, reversible converters. We actually found the main problem was bad ASCII descriptions that required more complex DS conversion to make them reversible, but we tested to make sure that orig-format -> ds -> orig-format produced bit-identical files. ds2txt is a first-cut converter, but if you want some special formatting you either have to extend the ds2txt printspecs or write your own.
- define schemas: you can build schemas automatically; you probably want to name them separately. If people want to evolve their schemas sanely, they can try the versioning stuff. In practice, so long as the column names/types are the same, DS will operate on extents with different types. It will complain loudly, though, if you try to access the source_ip column and that column isn't present in all the extents.
- opt/pack: opt should probably come from the writer; opt_nullable is the main one and says whether values in the column can be null. pack_unique should be on for all variable32 columns (small cost, usually huge benefit, should have been the default); pack_relative is mainly useful for integer timestamp columns (I don't know if you can auto-detect those).
- IPv6: there is a new fixed-width entry that is now present; I don't know if it's in the released version, we got lazy about doing releases. (If it is not available in the public release, please let me know.)
- variable32: 2^31 bytes. It should have been 2^32-1, but there was a slight mis-coding; I haven't checked to see if it's fixable without a file format break.
- Data types: We went with the "use signed integers everywhere" philosophy because some languages, e.g., Java, don't support unsigned integers.
- Data types for ports: I'd use byte for the protocol and int32 for the port. Because of the compression, compact values tend not to matter too much. Support for stuff like that has made us think about adding int16 from time to time.
- Time columns: I'd use int64 + (microseconds | nanoseconds | 2^-32 seconds) + Unix epoch. Floating point has a bunch of issues with representing microsecond precision since you only get 53 bits; see doc/os-review-2008/*.tex, "We have now chosen to explicitly represent the units…". If you use floating point for time, you also need pack_scale; otherwise random low bits make pack_relative not work so well. (A small illustration follows after this list.)
- Remote printing: You could pass things around as extents.
- Lintel is a support library, it’s just got utility bits in it that are used by internal HP projects besides DataSeries.
- Other choices: Protocol Buffers, Apache Thrift, Cisco Etch, Apache Avro. PB/Thrift/Etch have much more flexible data models (they can represent trees directly), but they are much slower and likely to give you bigger files. Avro is more tabular/DS-like and is supposed to be faster, but at least a year or so ago, when I talked to Cloudera and benchmarked it, it was much slower (5-20x IIRC). CSV always works; I'm assuming you have large enough data that it matters to make processing faster.
- Better fit: I suspect a good intermediate plan is to log with text/csv, write a bro-csv -> DataSeries converter, play around with things there, see how it works, and then decide whether it's worth doing direct logging. How much time are you spending writing/analyzing now? If writing ascii is cheap, I'd consider a long-term plan to write ascii, convert it in hourly (or something like that) chunks to ds, and do analysis by running over the pre-converted logs plus some instantaneously converted ones. If writing ascii is expensive, then I'd consider the all-binary-all-the-time approach.
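To illustrate the 53-bit point from the time-column note above: a Unix timestamp with a microsecond fraction is generally not exactly representable as a double, which is where the "random low bits" come from, whereas an int64 count of microseconds stays exact. The values below are hypothetical and purely for illustration:

```python
# A double has a 53-bit significand; with ~31 bits consumed by the integer
# seconds, the stored fraction is only an approximation of the intended
# microseconds. That is also why pack_relative on floating-point time columns
# needs pack_scale to work well.

from decimal import Decimal

t_us = 1_300_000_000_123_456        # microseconds since the epoch, exact as an int64
t_dbl = t_us / 1_000_000            # the same instant as seconds in a double

print(Decimal(t_dbl))               # exact stored value; differs from
                                    # 1300000000.123456 in the low digits
```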
© 2014 The Bro Project.