Note: Our current website is at www.zeek.org. This website is unmaintained and contains outdated information.

File Analysis Interface and File IDs

Contents

Concept
What is a "File ID" ?
What is a "File Handle" ?
How to Supply File Analysis with File ID/Handle ?

Concept

Bro’s file analysis system has a small interface for supplying information about a given file, but it can be complex to use because the caller decides how information about one file is differentiated from that of another.

A string of characters called a "file ID" is the abstraction used to associate supplied information with the file analysis processes that are ongoing over the lifetime of a file (e.g. as it’s seen being transferred over a network connection). Authors of new code that interfaces with the file analysis system have some freedom in how file IDs are chosen, which can make the task confusing. The rest of this guide clarifies the available options.

What is a "File ID" ?

In the general sense, it’s a string of characters, such as FkGyAQ385wUTKMP51k, that uniquely identifies a file that Bro is in the process of analyzing. By convention it begins with F (to set it apart from connection unique IDs), and the rest of the string is the base62 encoding (an alphanumeric representation) of a truncated MD5 hash. The hash is truncated to 96 bits, the value of bits_per_uid at the time of writing. Its input is a static salt string (given by Files::salt) concatenated with a long string that can uniquely represent the file (such as a connection 5-tuple, the timestamp of the connection, and the network protocol over which the file is transferred). Described more functionally:

"F" + Base62(Truncate(MD5(salt + unique_string_handle)))

It’s done this way because humans are expected to interact with file IDs. For example, they may have to search for a file ID across log files or communicate with other humans about a particular file. For these tasks, the full, unique string handle associated with the file is unwieldy.
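The derivation above can be sketched in a few lines of Python. This is only an illustration of the formula, not Bro's implementation: the digit order of the base62 alphabet, the choice of which 96 bits to keep, and the salt and handle strings are all assumptions, so the output should not be expected to match the IDs Bro itself produces.

```python
import hashlib

# Assumed digit order; the alphabet Bro actually uses may differ.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62(value: int) -> str:
    """Encode a non-negative integer using the 62 alphanumeric characters."""
    if value == 0:
        return ALPHABET[0]
    digits = []
    while value > 0:
        value, remainder = divmod(value, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def file_id(salt: str, handle: str, bits: int = 96) -> str:
    """Return "F" + Base62(Truncate(MD5(salt + handle)))."""
    digest = hashlib.md5((salt + handle).encode("utf-8")).digest()  # 16 bytes
    # Assumption: keep the leading `bits` bits of the digest.
    truncated = int.from_bytes(digest[: bits // 8], "big")
    return "F" + base62(truncated)


# Hypothetical salt and handle, for illustration only.
example = file_id("example salt", "tcp 10.0.0.1:1234 > 10.0.0.2:80 @ 1371654711.1")
print(example)
```

A 96-bit value needs at most 17 base62 digits, so an ID built this way is at most 18 characters long including the leading F, which is short enough to search for in logs or paste into a message.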

What is a "File Handle" ?

While a "file ID" is a condensed string meant for human consumption, a "file handle" is a long string that is unique to a given file. It is chosen by the author of the code that feeds file information into the file analysis system.

As already mentioned, it’s up to the coder to figure out a naming scheme for files that accurately differentiates them. For example, HTTP’s naming scheme may differ from FTP’s because several files can be transferred over the same HTTP connection, so HTTP may include extra information (e.g. depth of the MIME entity) in the file handle while FTP just relies on connection-related information (e.g. the 5-tuple and a timestamp).

All new code that interacts with the file analysis system will have a scheme for picking file handles.
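To make the HTTP versus FTP contrast concrete, here is a minimal Python sketch of two handle-building functions. The Conn record, the field names, and the exact string layout are hypothetical and chosen only for illustration; they are not the handles Bro's own scripts build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conn:
    """Hypothetical connection record: 5-tuple plus start time."""
    proto: str
    orig_h: str
    orig_p: int
    resp_h: str
    resp_p: int
    start_time: float


def conn_string(c: Conn) -> str:
    """Flatten the connection-related information into one string."""
    return f"{c.proto} {c.orig_h}:{c.orig_p} > {c.resp_h}:{c.resp_p} @ {c.start_time}"


def ftp_data_handle(c: Conn) -> str:
    # Connection-related information alone identifies the file.
    return f"FTP_DATA {conn_string(c)}"


def http_handle(c: Conn, is_orig: bool, mime_depth: int) -> str:
    # Several files may share one connection, so per-file detail is added.
    return f"HTTP {conn_string(c)} is_orig={is_orig} mime_depth={mime_depth}"


conn = Conn("tcp", "10.0.0.1", 1234, "10.0.0.2", 80, 1371654711.1)
print(ftp_data_handle(conn))
print(http_handle(conn, is_orig=False, mime_depth=1))
print(http_handle(conn, is_orig=False, mime_depth=2))
```

The last two handles differ only in the MIME depth, which is what keeps two files on the same connection from being merged into one.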

How to Supply File Analysis with File ID/Handle ?

Every time file-related information is supplied to the file analysis system via file_analysis::Manager::DataIn(), file_analysis::Manager::EndOfFile(), file_analysis::Manager::Gap(), or file_analysis::Manager::SetSize(), it needs an associated file ID. It can either be given explicitly in the method call or determined on-demand through script-layer mechanisms. Choosing which way to go depends on the use-case.

File IDs Supplied On-Demand via Script-Layer

Leaving the choice of file handle up to script-layer logic may be desirable in some cases:

- If there’s not an absolute "right" way to do it, the script-layer mechanism allows a user to supply their own logic.
- If the logic for determining a file handle requires tracking a lot of connection or network protocol state, then Bro’s scripting language is often better equipped to do that.

When such an on-demand file ID is needed, the internal order in which things happen is:

1. The get_file_handle event is generated and the event queue is flushed.
2. The last get_file_handle event handler at the script-layer to call set_file_handle determines the file handle for the file-related information currently being fed to the file analysis system.
3. The file handle is converted into a file ID via the process described earlier.
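The ordering described above, and in particular the "last caller wins" rule, can be modelled with a small Python class. This is a toy model and not Bro's event engine: the FileHandleResolver class, its handlers list, and the use of hex digits instead of base62 for the final ID are all assumptions made to keep the sketch short.

```python
import hashlib
from typing import Callable, List, Optional


class FileHandleResolver:
    """Toy model of the on-demand sequence; not Bro's implementation."""

    def __init__(self, salt: str) -> None:
        self.salt = salt
        self.handlers: List[Callable[["FileHandleResolver", str], None]] = []
        self.current_handle: Optional[str] = None

    def set_file_handle(self, handle: str) -> None:
        # Each call overwrites the previous one, so the last caller wins.
        self.current_handle = handle

    def resolve(self, conn: str) -> Optional[str]:
        self.current_handle = None
        # "Generate" get_file_handle and run every handler immediately,
        # standing in for flushing the event queue.
        for handler in self.handlers:
            handler(self, conn)
        if self.current_handle is None:
            return None  # no handler supplied a handle
        # Convert the handle to an ID (hex here, not base62, for brevity).
        digest = hashlib.md5((self.salt + self.current_handle).encode("utf-8"))
        return "F" + digest.hexdigest()[:24]  # 24 hex characters = 96 bits


resolver = FileHandleResolver(salt="example salt")
resolver.handlers.append(lambda r, conn: r.set_file_handle("default " + conn))
resolver.handlers.append(lambda r, conn: r.set_file_handle("custom " + conn))
fid = resolver.resolve("tcp 10.0.0.1:1234 > 10.0.0.2:21")
print(resolver.current_handle, fid)
```

Because the second handler runs last, its handle is the one that gets hashed. This is how a user-supplied handler can override a default naming scheme.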

As an example, the author of a new network protocol analyzer may implement a script-layer function that derives file handles from connection-related information and register it as a callback via Files::register_protocol. A get_file_handle handler then invokes the callback whenever a file handle is needed for a file being transferred over that protocol.

Explicitly Supplying Pre-Computed File IDs

Taking a detour through the script-layer to get a file handle can be avoided by passing a file ID directly into the file analysis API calls. Code that goes that route would most likely meet the following criteria:

- There’s only one way of file handle calculation that makes sense.
- The file handle logic doesn’t require much connection state tracking, and it’s a large performance burden to call out to the script-layer for file handles.

Going this route just means making all file analysis API calls supply a pre-computed file ID.

Combining Approaches

One can also take a hybrid approach: retrieve the file handle on-demand via the script-layer the first time a file is seen, then cache the resulting file ID (the return value of the file analysis API calls) for later re-use in the file analysis API.

It can be useful to do things this way if it’s easy to associate incoming data with a particular file ID (e.g. streaming protocols), but the choice of file ID is still somewhat arbitrary (e.g. a user may want to implement their own file handle naming scheme because the default one isn’t "right" by their definition). This brings a performance benefit, since it eliminates the overhead of many script-layer callouts that would just arrive at the same file handle over and over again.
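The caching pattern can be sketched as follows. The CachingFeeder class, the stream_key argument, and the lookup_file_id callable are hypothetical stand-ins: the callable represents the slow script-layer callout, and a real caller would pass the data and file ID on to the file analysis API where the comment indicates.

```python
from typing import Callable, Dict


class CachingFeeder:
    """Toy model of the hybrid approach; all names here are hypothetical."""

    def __init__(self, lookup_file_id: Callable[[str], str]) -> None:
        self.lookup_file_id = lookup_file_id  # stands in for the script-layer callout
        self.cache: Dict[str, str] = {}
        self.callouts = 0

    def data_in(self, stream_key: str, data: bytes) -> str:
        file_id = self.cache.get(stream_key)
        if file_id is None:
            # First chunk of this file: take the slow path once.
            self.callouts += 1
            file_id = self.lookup_file_id(stream_key)
            self.cache[stream_key] = file_id
        # A real caller would now pass `data` and `file_id` to the analysis API.
        return file_id

    def end_of_file(self, stream_key: str) -> None:
        # Forget the ID so a later file on the same stream is resolved afresh.
        self.cache.pop(stream_key, None)


feeder = CachingFeeder(lambda key: "F-for-" + key)
ids = [feeder.data_in("stream-1", b"chunk") for _ in range(3)]
print(ids, feeder.callouts)
```

Three chunks are fed in but the lookup runs only once. Dropping the cached entry at end of file is one possible design choice; it avoids reusing a stale ID if the same stream later carries a different file.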