There has been an increasing desire on the part of network security analysts to be able to extract and analyze files transferred over various protocols. Some changes can be made within Zeek that make this not only possible for a number of protocols, but also very easy from the point of view of writing policy scripts.
The current model is that each protocol analyzer in the core generates its own events if it is able to extract files. The primary example is the HTTP analyzer's http_entity_data event. So far, a number of examples of file analysis have been built around that event, including MD5 summing, file extraction, and "magic" byte file type detection. Much of that effort has to be duplicated for each additional protocol over which files may be transferred.
The following sections propose a new model for doing file analysis in a more generic way, with the C++ layer implementing a file analysis engine to do most of the work, but with the script layer allowing feedback to drive the core engine in different ways.
The script layer mostly just needs to define how a user sets the file analysis policy, which in turn drives the analysis engine at the C++ layer.
scripts/base/frameworks/file-analysis/main.bro:
module FileAnalysis;

@load base/frameworks/logging

export {
    redef enum Log::ID += {
        ## Logging stream for file analysis.
        LOG
    };

    ## The default buffer size used to reassemble files.
    # TODO: what's a reasonable default?
    const default_reassembly_buffer_size: count = 1024*1024 &redef;

    ## The default buffer size used for storing the beginning of files.
    # TODO: what's a reasonable default?
    const default_bof_buffer_size: count = 256 &redef;

    ## The default amount of time file analysis will wait for new file data
    ## before giving up.
    # TODO: what's a reasonable default?
    const default_timeout_interval: interval = 2 mins &redef;

    ## The default amount of data that a user is allowed to extract
    ## from a file to an event with the
    ## :bro:see:`FileAnalysis::ACTION_DATA_EVENT` action.
    # TODO: what's a reasonable default?
    const default_data_event_len: count = 1024*1024 &redef;

    ## An enumeration of possibly-interesting "events" that can occur over
    ## the course of analyzing files.  The :bro:see:`FileAnalysis::policy`
    ## table is consulted to determine any actions to take.
    type Trigger: enum {
        ## Raised when any part of a new file is detected.
        TRIGGER_NEW,
        ## Raised when file analysis has likely seen a complete file.  That
        ## is, when the number of bytes indicated by the *total_bytes*
        ## field of :bro:see:`FileAnalysis::Info` have been processed.
        ## Note that the *undelivered* field does not have to be zero for
        ## this to have occurred.
        TRIGGER_DONE,
        ## Raised when file analysis for a given file is aborted due to
        ## not seeing any data for it recently.  Note that this doesn't
        ## necessarily mean the full file wasn't seen (e.g. if the
        ## :bro:see:`FileAnalysis::Info` record indicates the file's
        ## *total_bytes* isn't known).
        ## It should be possible to extend the timeout when "handling"
        ## this trigger.
        TRIGGER_TIMEOUT,
        ## Raised when the beginning of a file is detected.
        TRIGGER_BOF,
        ## Raised when the beginning of a file is available and that
        ## beginning is at least the number of bytes indicated by the
        ## *bof_buffer_size* field of :bro:see:`FileAnalysis::Info`.
        TRIGGER_BOF_BUFFER_AVAIL,
        ## Raised when the MIME type of a file is matched based on magic
        ## numbers.  TODO: re-purposing protocols/http/file-ident.sig for
        ## doing this is tricky since the signature engine doesn't expect
        ## to be decoupled from connections, so figure out what work needs
        ## to be done there.
        TRIGGER_MIME_TYPE,
        ## Raised when the end of a file is detected.  If the file is not
        ## being transferred linearly, this doesn't have to mean the full
        ## file has been transferred.
        TRIGGER_EOF,
        ## The reassembly buffer for the file filled and had to be
        ## discarded.  The *undelivered* field of
        ## :bro:see:`FileAnalysis::Info` will indicate the number of
        ## bytes, if any, that were not all-in-sequence.
        ## TODO: Is it possible to extend the reassembly buffer when
        ## "handling" this trigger?
        TRIGGER_REASSEMBLY_BUFFER_FULL,
    } &redef;

    ## An enumeration of different types of actions that can be performed
    ## while analyzing a given file.  These actions act as flags telling
    ## the file analysis engine that it needs to do something, but the
    ## fields of :bro:see:`FileAnalysis::Info` are used by it to determine
    ## more about how and what exactly needs to be done for a given action.
    type Action: enum {
        ## All current and future file data is forwarded to the Zeek
        ## instance named by the *delegate_node* field of
        ## :bro:see:`FileAnalysis::Info`.  This effectively delegates file
        ## analysis to a single node, which is useful for protocols that
        ## distribute file data over many connections (e.g. BitTorrent).
        ## TODO: better describe what the node name refers to in the
        ## cluster layout config.
        ACTION_DELEGATE,
        ## If the file is a compressed file of a format supported for
        ## decompression, this will enable the decompression and analysis
        ## of files contained within.  There will need to be limits on the
        ## depth of analysis and support to deal with files like this:
        ## http://www.unforgettable.dk/
        ## Maybe this should just be supported by file analyzers?
        ACTION_DECOMPRESS,
        ## Attach file analyzers to this file.  Similar to the existing
        ## Analyzer class in Zeek, but only intended for unidirectional
        ## byte streams.
        ACTION_ANALYZE,
        ## Begin calculating an MD5 digest of the file.  It will be
        ## available in the *md5* field of :bro:see:`FileAnalysis::Info`
        ## at the time :bro:see:`FileAnalysis::TRIGGER_DONE` is raised.
        ## This digest will only be available if, in
        ## :bro:see:`FileAnalysis::Info`, the *seen_bytes* field is equal
        ## to *total_bytes* (or it's unknown) and *undelivered* is zero.
        ## But if :bro:see:`FileAnalysis::ACTION_EXTRACT` was added at a
        ## time when *undelivered* was zero, the digest can still be
        ## calculated from the file on disk even if *undelivered* becomes
        ## nonzero later, provided all the bytes were seen.
        ACTION_HASH_MD5,
        ## Begin calculating a SHA1 digest of the file.  It will be
        ## available in the *sha1* field of :bro:see:`FileAnalysis::Info`
        ## at the time :bro:see:`FileAnalysis::TRIGGER_DONE` is raised,
        ## with the same availability caveats as
        ## :bro:see:`FileAnalysis::ACTION_HASH_MD5`.
        ACTION_HASH_SHA1,
        ## Begin calculating a SHA256 digest of the file.  It will be
        ## available in the *sha256* field of
        ## :bro:see:`FileAnalysis::Info` at the time
        ## :bro:see:`FileAnalysis::TRIGGER_DONE` is raised, with the same
        ## availability caveats as
        ## :bro:see:`FileAnalysis::ACTION_HASH_MD5`.
        ACTION_HASH_SHA256,
        ## The file is extracted and written to the local filesystem in a
        ## file named according to the value of the *extracted_file_name*
        ## field of :bro:see:`FileAnalysis::Info`.
        ACTION_EXTRACT,
        ## After either :bro:see:`FileAnalysis::TRIGGER_DONE` or
        ## :bro:see:`FileAnalysis::TRIGGER_TIMEOUT` has been processed,
        ## the file named by the value of the *extracted_file_name* field
        ## of :bro:see:`FileAnalysis::Info` will be deleted from the local
        ## filesystem.
        ACTION_EXTRACT_CLEANUP,
        ## Some portion of the data of the file can be provided to a
        ## script at the event layer.  This action can be influenced with
        ## the *data_event* and *data_event_len* fields of the file's
        ## :bro:see:`FileAnalysis::Info` record.
        ACTION_DATA_EVENT,
        # TODO: other action types to support right now?
    } &redef;

    ## This table can be used to set up dependencies between
    ## :bro:see:`FileAnalysis::Action` values.  Whenever an action is
    ## added to the *actions* field of :bro:see:`FileAnalysis::Info`, any
    ## dependencies of the action specified in this table are also added.
    const action_dependencies: table[Action] of set[Action] = {} &redef;

    ## Contains all metadata related to the analysis of a given file, some
    ## of which is logged.
    type Info: record {
        ## Unique identifier associated with a single file.
        file_id: string &log;
        ## Unique identifier associated with the file if it was extracted
        ## from a container file as part of the analysis.
        parent_file_id: string &log &optional;
        ## The network protocol over which the file was transferred.
        protocol: string &log &optional;
        ## The set of connections over which the file was transferred,
        ## indicated by UID strings.
        conn_uids: set[string] &log;
        ## The set of connections over which the file was transferred,
        ## indicated by 5-tuples.
        conn_ids: set[conn_id];
        ## Number of bytes provided to the file analysis engine for the
        ## file.
        seen_bytes: count &log &optional;
        ## Total number of bytes that are supposed to comprise the file
        ## content.
        total_bytes: count &log &optional;
        ## The MIME type of the file if any could be matched.
        mime_type: string &log &optional;
        ## MD5 digest of the file.
        md5: string &log &optional;
        ## SHA1 digest of the file.
        sha1: string &log &optional;
        ## SHA256 digest of the file.
        sha256: string &log &optional;
        ## Current set of actions that are being processed by the file
        ## analysis engine for the file as new data arrives for it.
        actions: set[Action];
        ## Set of all actions that were ever requested over the course of
        ## analyzing the file.
        actions_requested: set[Action] &log;
        ## The Zeek node to which file analysis is delegated.
        delegate_node: string &log &optional;
        ## The name of the file as it has been extracted to the local
        ## filesystem.
        extracted_file_name: string &log &optional;
        ## Maximum number of bytes to dedicate to reassembling a file as
        ## it is transferred.  A value of zero indicates no reassembly
        ## occurs.
        reassembly_buffer_size: count &default=default_reassembly_buffer_size;
        ## The number of bytes at the beginning of a file to save for
        ## later inspection in *bof_buffer*.
        bof_buffer_size: count &default=default_bof_buffer_size;
        ## The content of the beginning of a file up to *bof_buffer_size*
        ## bytes.
        bof_buffer: string &default="";
        ## The number of not all-in-sequence bytes over the course of the
        ## analysis that had to be discarded due to a reassembly buffer of
        ## size *reassembly_buffer_size* being filled.
        undelivered: count &default=0;
        ## The amount of time between receiving new data for this file
        ## that the analysis engine will wait before giving up on it.
        timeout_interval: interval &default=default_timeout_interval;
        ## Set in conjunction with
        ## :bro:see:`FileAnalysis::ACTION_DATA_EVENT` to say what event
        ## should be raised when new file data is available.
        ##
        ## f: file metadata produced through the analysis process.
        ##
        ## data: the next available linear chunk of the file, which
        ##       may be smaller than the full requested amount in the
        ##       *data_event_len* field.
        data_event: event(f: Info, data: string) &optional;
        ## The amount of file data, in bytes, to make available to scripts
        ## handling the event specified by the *data_event* field.
        data_event_len: count &default=default_data_event_len;
    } &redef;

    ## A record to define the items that make up the policy that the file
    ## analysis engine uses to decide how to process a file.
    type PolicyItem: record {
        ## The "event" at which the *callback* function should be
        ## evaluated to determine if any actions need to be taken for the
        ## file.
        trigger: Trigger;
        ## Evaluated each time the file analysis engine receives file data
        ## that causes *trigger* to occur.  The function may modify the
        ## :bro:see:`FileAnalysis::Info` record, including removing things
        ## from *actions*, but the set of actions it returns is
        ## automatically merged into the *actions* field by the file
        ## analysis engine.
        callback: function(rec: Info): set[Action];
    };

    ## Defines the file analysis policy that is extensible on a per-site
    ## basis.  The file analysis engine consults this value each time it
    ## receives new data that causes a :bro:see:`FileAnalysis::Trigger`,
    ## and the feedback from it is used by the engine to determine how to
    ## proceed with the analysis.
    const policy: set[PolicyItem] = {} &redef;

    ## Defines a callback to be evaluated whenever the core file analysis
    ## engine is about to take an action.
    ##
    ## rec: The file metadata produced so far as a result of the analysis.
    ##
    ## Returns: T if the file analysis engine should proceed with the
    ##          action, or F if it should not.  In either case, the action
    ##          is not automatically removed from the *actions* field of
    ##          *rec*, but that can be done explicitly in the function
    ##          body.  This can be used to abort processing of
    ##          on-going/persistent actions, or to define an action as a
    ##          one-time thing (e.g. the policy specifies an action to
    ##          take place upon a trigger, but the callback for the action
    ##          returns T and removes the action from the *actions* set).
    type ActionCallback: function(rec: Info): bool;

    ## A table of callback functions per action type that can be used to
    ## implement script-only action plugins or to otherwise change the
    ## behavior of existing actions being performed on a given file's
    ## data stream.
    const action_callbacks: table[Action] of ActionCallback = {} &redef;
}
An example of how redefining the file analysis policy looks:
redef FileAnalysis::policy += {
    [$trigger = FileAnalysis::TRIGGER_MIME_TYPE,
     $callback(rec: FileAnalysis::Info) =
        {
        if ( rec$mime_type == "application/pdf" )
            return set(FileAnalysis::ACTION_EXTRACT);
        else
            return set();
        }],
};
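The action_callbacks table described above could be redefined in a similar way. A hypothetical sketch, assuming a site wants to veto extraction of very large files (the size threshold and the choice to also remove the action are just illustrations):

redef FileAnalysis::action_callbacks += {
    [FileAnalysis::ACTION_EXTRACT] =
        function(rec: FileAnalysis::Info): bool
            {
            # Hypothetical site policy: don't extract files over 100 MB.
            if ( rec?$total_bytes && rec$total_bytes > 100 * 1024 * 1024 )
                {
                # Also remove the action so it isn't re-evaluated for
                # subsequent data of this file.
                delete rec$actions[FileAnalysis::ACTION_EXTRACT];
                return F;
                }
            return T;
            },
};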
Questions:
Part of this work would also be to deprecate the protocol-specific events that deliver file data streams to scripts (e.g. http_entity_data).
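As a sketch of what the migration might look like, a script that currently handles http_entity_data could instead request the ACTION_DATA_EVENT action and receive file data through its own event (the event name here is hypothetical):

event my_file_data(f: FileAnalysis::Info, data: string)
    {
    print fmt("saw %d bytes of file %s", |data|, f$file_id);
    }

redef FileAnalysis::policy += {
    [$trigger = FileAnalysis::TRIGGER_NEW,
     $callback(rec: FileAnalysis::Info) =
        {
        # Route chunks of file data to the handler above.
        rec$data_event = my_file_data;
        return set(FileAnalysis::ACTION_DATA_EVENT);
        }],
};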
Some pseudo code:
// Wrapper class around FileAnalysis::Info record values from the script
// layer.
class FileAnalysisInfo {
public:
    // TODO: maybe some common getter/setter methods

protected:
    RecordVal* val;
};

class FileAnalysisInfoTimer : public Timer {
    // TODO: some stuff here to implement inactivity timeouts and trigger
    // the FileAnalysisManager instance to clean up and evaluate timeout
    // policy.
};

// Singleton class for managing file analysis.
class FileAnalysisManager {
public:
    // Receives file data from a protocol analyzer.  *file_id* is a handle
    // the protocol analyzer is using to uniquely identify a single file.
    // If *file_id* isn't in *file_map*, then a new FileAnalysisInfo is
    // created and put in the map.  Some of the "default" processing
    // happens here to determine if the new data causes a trigger.  If a
    // trigger occurs, then the policy is evaluated and actions are added
    // to the FileAnalysisInfo's *actions* set (taking into account
    // *action_dependencies*).  Then it iterates over the *actions* set,
    // evaluates the script-layer callback, and if necessary performs the
    // C++ implementation of the action.
    void DataIn(string file_id, ConnID conn_id, string conn_uid,
                AnalyzerTag at, u_char* data, uint64 offset, uint64 len);

    // Like the DataIn() above, but it's implied that the data is being
    // sent in linearly since offset is not a parameter.  For these types
    // of file analysis, the reassembly buffer can just be disabled, or
    // maybe other optimizations done.
    void DataIn(string file_id, ConnID conn_id, string conn_uid,
                AnalyzerTag at, u_char* data, uint64 len);

    // A protocol analyzer calls this to provide the value for a
    // FileAnalysisInfo's *total_bytes* field.  This may cause creation of
    // a new record (and thus may cause some triggers to happen), so it
    // does much of the same processing as DataIn() as far as policy and
    // action callback evaluation.
    void SetSize(string file_id, ConnID conn_id, string conn_uid,
                 AnalyzerTag at, uint64 size);

    // TODO: the above methods can also have versions that don't take the
    // conn_id, conn_uid, or AnalyzerTag arguments.  Those would be used
    // by the input framework if we wanted to support analysis of files
    // that are being read in from disk (as opposed to being extracted
    // from network traffic that Zeek is analyzing).

protected:
    // Maps a file_id string to a FileAnalysis::Info record value
    // instance.
    map<string, FileAnalysisInfo*> file_map;
};
In addition, each protocol analyzer that supports file analysis needs to be modified to interface with the FileAnalysisManager by sending it pieces of a file as it sees them. Each analyzer is also responsible for generating a string that uniquely identifies the file(s) it sends to the FileAnalysisManager.
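As a rough illustration, in the same pseudocode style as above, an HTTP analyzer hook might look like this (the method names, member variables, and file ID scheme are assumptions for illustration, not existing interfaces):

// Pseudo code: hypothetical hook in the HTTP analyzer's entity-body path.
void HTTP_Analyzer::DeliverEntityData(u_char* data, uint64 len)
    {
    // One possible ID scheme: connection UID plus a per-connection entity
    // counter, so each message body gets its own file_id.
    string file_id = conn_uid + "-" + to_string(entity_count);

    // Entity bodies arrive in order, so the linear DataIn() variant can
    // be used and reassembly buffering skipped.
    file_analysis_mgr->DataIn(file_id, conn_id, conn_uid, tag, data, len);
    }

// Pseudo code: when a Content-Length header is seen, the analyzer can
// also provide the expected total size up front, letting the engine raise
// TRIGGER_DONE once that many bytes have been processed.
void HTTP_Analyzer::SawContentLength(uint64 content_length)
    {
    file_analysis_mgr->SetSize(file_id, conn_id, conn_uid, tag,
                               content_length);
    }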
In order to support file analysis in a cluster environment where certain files can be carried over many connections (e.g. BitTorrent), Zeek instances can choose to delegate file analysis to a single node. This would also allow file extraction to occur on a single node instead of being scattered among the filesystems of various worker nodes.
Delegation is indicated by the file analysis policy simply setting the delegate_node field for files that must be analyzed centrally; the analysis engine would then automatically send data corresponding to that file_id to the remote Zeek instance. The remote communication protocol probably just needs support for messages similar to what's sent to the FileAnalysisManager::DataIn() and FileAnalysisManager::SetSize() functions.
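A sketch of what such a policy entry could look like (the protocol label "BITTORRENT" and node name "manager" are assumptions about protocol naming and the cluster layout):

redef FileAnalysis::policy += {
    [$trigger = FileAnalysis::TRIGGER_NEW,
     $callback(rec: FileAnalysis::Info) =
        {
        # Hypothetical: send all BitTorrent-carried files to one node.
        if ( rec?$protocol && rec$protocol == "BITTORRENT" )
            {
            rec$delegate_node = "manager";
            return set(FileAnalysis::ACTION_DELEGATE);
            }
        return set();
        }],
};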
Using libmagic for file type detection hasn’t been the most robust solution, primarily because it can produce different results depending on platform/version.
One idea would be to maintain a signature file of "magic numbers", similar to what is currently in scripts/base/protocols/http/file-ident.sig, but with protocol-independent signatures.
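For instance, such a protocol-independent signature might take a shape like the following, assuming a new condition keyword (here called file-magic) that matches against a file's data stream rather than a connection's payload:

# Hypothetical: "file-magic" does not exist yet; it stands in for
# whatever condition ends up matching against file data streams.
signature file-magic-pdf {
    file-magic /^%PDF-/
    event "application/pdf"
}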
The challenge is that the signature engine currently doesn't support being decoupled from connections very well, and it doesn't have a way to provide immediate feedback on a match; results are only delivered through events. This needs to be investigated more, but at a glance I didn't see a decent way for FileAnalysisInfo to interface with the signature engine. Suggestions?
The above leaves out recursing the analysis on files within container file types. Is there anything about the above design that's a roadblock to that, or can this functionality be implemented later without significant design changes?
We could have an analysis tree separate from the network protocol analysis tree, but this new tree would analyze file data streams and instantiate/attach specific file type analyzers as they’re matched (Seth, do you have more detail here?).
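A minimal pseudocode sketch of that idea, with all names hypothetical:

// Pseudo code: base class for a file analysis tree, parallel to the
// existing protocol analyzer tree but fed unidirectional byte streams.
class FileAnalyzer {
public:
    virtual ~FileAnalyzer() { }

    // Deliver the next in-sequence chunk of the file's byte stream.  An
    // analyzer that recognizes an embedded file (e.g. an entry inside a
    // ZIP container) can attach a child analyzer chosen by magic-number
    // matching and forward the embedded content to it.
    virtual void DeliverStream(u_char* data, uint64 len) = 0;

    void AttachChild(FileAnalyzer* child)
        { children.push_back(child); }

protected:
    vector<FileAnalyzer*> children;
};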