Bro’s file analysis system exposes a small interface for supplying information about a given file, but it can be complex to use because the caller has to make sure that information about one file is kept distinct from information about another.
A string of characters called a "file ID" is the abstraction used to associate supplied information with the file analysis processes that are ongoing over the lifetime of a file (e.g. as it’s seen being transferred over a network connection). Authors of new code that interfaces with the file analysis system have some freedom in how file IDs are chosen, which can make the task confusing. The rest of this guide clarifies the available options.
In the general sense, a file ID is a string of characters, such as FkGyAQ385wUTKMP51k, that uniquely identifies a file that Bro is in the process of analyzing. The convention is that it begins with F (to set it apart from connection unique IDs) and that the rest of the string is the base62 encoding (alphanumeric representation) of an MD5 hash, truncated to 96 bits (the value of bits_per_uid at the time of writing), of a static salt string (given by Files::salt) concatenated with a long string that can uniquely represent the file (such as a connection 5-tuple, the timestamp of the connection, and the network protocol over which the file is transferred). Described more functionally:
"F" + Base62(Truncate(MD5(salt + unique_string_handle)))
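The expression above can be sketched in Python as follows. This is an illustration under stated assumptions, not Bro's implementation: the salt and handle values are made up, truncation is assumed to keep the leading 96 bits of the digest, and the base62 alphabet order is a guess, so the IDs it produces will not match those generated by Bro itself.

```python
import hashlib

# Assumed alphabet order; Bro's own base62 encoder may order its digits differently.
BASE62_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
BITS_PER_UID = 96  # the value of bits_per_uid cited in the text


def base62(value: int) -> str:
    """Encode a non-negative integer using the assumed base62 alphabet."""
    if value == 0:
        return BASE62_ALPHABET[0]
    digits = []
    while value > 0:
        value, remainder = divmod(value, 62)
        digits.append(BASE62_ALPHABET[remainder])
    return "".join(reversed(digits))


def file_id(salt: str, handle: str, bits: int = BITS_PER_UID) -> str:
    """Return "F" + Base62(Truncate(MD5(salt + handle)))."""
    digest = hashlib.md5((salt + handle).encode("utf-8")).digest()
    # Assumption: truncate by keeping the leading `bits` bits of the 128-bit digest.
    truncated = int.from_bytes(digest, "big") >> (128 - bits)
    return "F" + base62(truncated)


if __name__ == "__main__":
    # Hypothetical salt and file handle, for illustration only.
    print(file_id("example-salt", "tcp 10.0.0.1:1234 > 10.0.0.2:80 @ 1400000000.0"))
```

A 96-bit value needs at most 17 base62 digits, which is consistent with the 18-character example ID shown earlier (the leading F plus 17 characters).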
It’s done this way because humans are expected to interact with file IDs. For example, they may have to search for a file ID across log files or communicate with other humans about a particular file. For these tasks, using the full, unique string handle associated with the file is unwieldy.
While a "file ID" is a condensed string meant for human consumption, a "file handle" is a long string that is unique to a given file. It is chosen by the author of the code that feeds file information into the file analysis system.
As already mentioned, it’s up to the coder to devise a naming scheme for files that accurately differentiates them. For example, HTTP’s naming scheme may differ from FTP’s: because several files can be transferred over the same HTTP connection, HTTP may include extra information (e.g. depth of the MIME entity) in the file handle, while FTP just relies on connection-related information (e.g. the 5-tuple and a timestamp).
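As a rough illustration of two such schemes, the sketch below builds handle strings in Python. The `Conn` type, the function names, the field order and the `|` separator are all hypothetical and are not taken from Bro's HTTP or FTP scripts. The only point is that the HTTP-style handle carries one extra field, so two files seen on the same connection end up with different handles.

```python
from typing import NamedTuple


class Conn(NamedTuple):
    """Minimal stand-in for connection state (hypothetical layout)."""
    proto: str
    orig_h: str
    orig_p: int
    resp_h: str
    resp_p: int
    start_time: float


def conn_part(c: Conn) -> str:
    # Connection-related information: the 5-tuple plus the connection timestamp.
    return f"{c.start_time:.6f}|{c.proto}|{c.orig_h}:{c.orig_p}|{c.resp_h}:{c.resp_p}"


def ftp_handle(c: Conn) -> str:
    # FTP-style scheme: connection-related information alone identifies the file.
    return "FTP|" + conn_part(c)


def http_handle(c: Conn, mime_depth: int) -> str:
    # HTTP-style scheme: the MIME entity depth separates files on one connection.
    return f"HTTP|{conn_part(c)}|{mime_depth}"
```

Handles like these are long and awkward to read, which is why they are hashed into the shorter file ID rather than logged directly.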
All new code that interacts with the file analysis system will have a scheme for picking file handles.
Every time file-related information is supplied to the file analysis system via file_analysis::Manager::DataIn(), file_analysis::Manager::EndOfFile(), file_analysis::Manager::Gap(), or file_analysis::Manager::SetSize(), it needs an associated file ID. It can either be given explicitly in the method call or determined on-demand through script-layer mechanisms. Choosing which way to go depends on the use-case.
Leaving the choice of file handle up to script-layer logic may be desirable in some cases:
When such an on-demand file ID is needed, the internal order in which things happen is:
As an example, the author of a new network protocol analyzer may implement a script-layer function to derive file handles from connection-related information and then simply register that function as a callback via Files::register_protocol, which will invoke the callback from a get_file_handle handler whenever a file handle is needed for a file being transferred over that protocol.
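The registration pattern described in this example can be modelled in a few lines of Python. This is a toy model of the idea and not Bro's API: the functions below are plain Python functions that only borrow the names `register_protocol` and `get_file_handle`, the `TOY_PROTO` tag and the dictionary-based connection record are invented, and returning `None` for an unregistered protocol is a choice made for this sketch.

```python
from typing import Callable, Dict, Optional

# A callback derives a file handle from connection information.
HandleCallback = Callable[[dict], str]

# Maps a protocol tag to the callback registered for that protocol.
_callbacks: Dict[str, HandleCallback] = {}


def register_protocol(tag: str, callback: HandleCallback) -> None:
    """Remember which callback derives file handles for a protocol."""
    _callbacks[tag] = callback


def get_file_handle(tag: str, conn: dict) -> Optional[str]:
    """Dispatch to the registered callback whenever a handle is needed."""
    callback = _callbacks.get(tag)
    if callback is None:
        return None  # sketch-specific choice: no callback, no handle
    return callback(conn)


def toy_proto_handle(conn: dict) -> str:
    # Hypothetical scheme: protocol tag plus connection-related information.
    return f"TOY_PROTO|{conn['start_time']}|{conn['orig']}|{conn['resp']}"


register_protocol("TOY_PROTO", toy_proto_handle)
```

With the callback registered once, the analyzer author does not have to compute a handle at every call site; the dispatcher looks it up whenever data for that protocol arrives.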
Taking a detour through the script-layer to get a file handle can be avoided by passing a file ID directly into the file analysis API calls, but code that goes that route would most likely fall under the following criteria:
Going this route just means making all file analysis API calls supply a pre-computed file ID.
One can also decide to take a hybrid approach of retrieving file handles on-demand via the script-layer the first time a file is seen and then caching the resulting file ID (return values from the file analysis API calls) for later re-use in the file analysis API.
It can be useful to do things this way if it’s easy to associate incoming data with a particular file ID (e.g. streaming protocols), but the choice of file ID is still somewhat arbitrary (e.g. a user may want to implement their own file handle naming scheme because the default one isn’t "right" by their definition). There’s a performance benefit this way since it eliminates the overhead of many script-layer call outs that end up just arriving at the same file handle over and over again.
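The hybrid approach can be sketched as follows. Everything here is a hypothetical stand-in: `ToyFileManager` is not Bro's `file_analysis::Manager`, the `get_handle` callable only models the script-layer call out, and the ID derivation is a placeholder (hex rather than base62). The sketch shows only the caching pattern, where the first call triggers a handle lookup and later calls reuse the returned file ID.

```python
import hashlib
from typing import Callable, Optional


class ToyFileManager:
    """Hypothetical stand-in for the file analysis API (not Bro's Manager)."""

    def __init__(self, get_handle: Callable[[str], str]) -> None:
        self._get_handle = get_handle  # models the script-layer callback
        self.callouts = 0              # counts how often the callback ran

    def data_in(self, data: bytes, source: str,
                file_id: Optional[str] = None) -> str:
        """Accept data; derive the file ID on demand only if none is given."""
        if file_id is None:
            self.callouts += 1
            handle = self._get_handle(source)
            # Placeholder ID derivation: 96 bits of MD5 as hex, not base62.
            file_id = "F" + hashlib.md5(handle.encode("utf-8")).hexdigest()[:24]
        # A real implementation would pass `data` on to the file's analyzers.
        return file_id


class StreamFeeder:
    """Feeds one stream's chunks, caching the file ID after the first call."""

    def __init__(self, manager: ToyFileManager, source: str) -> None:
        self._manager = manager
        self._source = source
        self.file_id: Optional[str] = None

    def feed(self, chunk: bytes) -> None:
        # First call: file_id is None, so the manager asks the callback.
        # Later calls: the cached ID is passed explicitly, skipping the callback.
        self.file_id = self._manager.data_in(chunk, self._source, self.file_id)
```

Feeding three chunks through one `StreamFeeder` leaves `callouts` at 1, which is the saving the paragraph above describes: the handle is derived once per file instead of once per chunk.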
© 2014 The Bro Project.