BinPAC

BinPAC is a high-level language for describing protocol parsers; the binpac compiler generates C++ code from these descriptions. It is currently maintained and distributed with the Zeek Network Security Monitor distribution; however, the generated parsers may also be used with programs other than Zeek.

You can find the latest BinPAC release for download at https://www.zeek.org/download.

BinPAC's git repository is located at https://github.com/zeek/binpac

This document describes BinPAC 0.61.0-18. See the CHANGES file for version history.

BinPAC relies on the following libraries and tools, which need to be installed before you begin:

  • Flex (Fast Lexical Analyzer)
    Flex is already installed on most systems, so with luck you can skip having to install it yourself.
  • Bison (GNU Parser Generator)
    Bison is also already installed on many systems.
  • CMake 2.8.12 or greater
    CMake is a cross-platform, open-source build system, typically not installed by default. See http://www.cmake.org for more information regarding CMake and the installation steps below for how to use it to build this distribution. CMake generates native Makefiles that depend on GNU Make by default.

To build and install into /usr/local:

./configure
cd build
make
make install

This will perform an out-of-source build into the build directory using the default build options and then install the binpac binary into /usr/local/bin.

You can specify a different installation directory with:

./configure --prefix=<dir>

Run ./configure --help for more options.

To make this document easier to read, the following glossary and conventions are used.

  • PAC grammar - .pac file written by user.
  • PAC source - _pac.cc file generated by binpac
  • PAC header - _pac.h file generated by binpac
  • Analyzer - Protocol decoder generated by compiling PAC grammar
  • Field - a member of a record
  • Primary field - member of a record as direct result of parsing
  • Derivative field - member of a record evaluated through post processing

The BinPAC language consists of:

  • analyzer
  • type - a data-structure-like definition describing a parsing unit. Types can build on each other to form more complex types, similar to yacc productions.
  • flow - defines how data is fed into the analyzer and names the top-level parsing unit.
  • Keywords
  • Built-in macros

There are two components to an analyzer definition: the top level context and the connection definition.

Each analyzer requires a top level context defined by the following syntax:

analyzer <ContextName> withcontext {
... context members ...
}

Typically the top-level context contains pointers to the top-level connection and flow definitions, as below:

analyzer HTTP withcontext {
   connection : HTTP_analyzer;
   flow     : HTTP_flow;
};

A "connection" defines the entry point into the analyzer. It consists of two "flow" definitions, an "upflow" and a "downflow".

connection <AnalyzerName>(optional parameter) {
 upflow = <UpflowConstructor>;
 downflow = <DownflowConstructor>;
}

Example:

connection HTTP_analyzer {
   upflow = HTTP_flow (true);
   downflow = HTTP_flow (false);
};
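
As a rough illustration of the idea (a hand-written sketch, not code generated by binpac), a connection can be pictured as an object that owns one flow per direction and routes each chunk of input by direction. The class names SketchFlow and SketchConnection are hypothetical, and treating the upflow as the originator side is an assumption based on the true/false constructor arguments above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a generated flow class: it only counts bytes.
class SketchFlow {
public:
    explicit SketchFlow(bool is_orig) : is_orig_(is_orig), bytes_seen_(0) {}

    void NewData(const uint8_t* begin, const uint8_t* end) {
        assert(begin <= end);
        bytes_seen_ += static_cast<std::size_t>(end - begin);
    }

    bool is_orig() const { return is_orig_; }
    std::size_t bytes_seen() const { return bytes_seen_; }

private:
    bool is_orig_;
    std::size_t bytes_seen_;
};

// Hypothetical stand-in for a generated connection class.
class SketchConnection {
public:
    SketchConnection() : upflow_(true), downflow_(false) {}

    // Route data to the upflow (assumed: originator) or the downflow.
    void NewData(bool is_orig, const uint8_t* begin, const uint8_t* end) {
        (is_orig ? upflow_ : downflow_).NewData(begin, end);
    }

    const SketchFlow& upflow() const { return upflow_; }
    const SketchFlow& downflow() const { return downflow_; }

private:
    SketchFlow upflow_;
    SketchFlow downflow_;
};
```

Each direction keeps its own state, so data fed with is_orig set to true never touches the downflow.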

A "type" is the basic building block of a binpac-generated parser and describes the structure of a byte segment. Each non-primitive "type" generates a C++ class that can independently parse the structure it describes.

Syntax:

type <typeName>{(<optional type parameter(s)>)} = <compositor or primitive class>{
  cases or members declaration.
} <optional attribute(s)>;

Example:

PAC grammar:

type myType = record {
   data:uint8;
};

PAC header:

class myType{
public:
   myType();
   ~myType();
   int Parse(const_byteptr const t_begin_of_data, const_byteptr const t_end_of_data);
   uint8 data() const  { return data_; }
protected:
   uint8 data_;
};

A primitive type can be thought of as a #define in C. Primitive types are embedded into the types that reference them and do not generate any parsing code of their own. Available primitive types are:

  • int8
  • int16
  • int32
  • uint8
  • uint16
  • uint32
  • Regular expression ( type HTTP_URI = RE/[[:alnum:][:punct:]]+/; )
  • bytestring

Examples:

type foo = record { x: number; };

is equivalent to the following (assuming "number" was declared elsewhere as uint8[3]):

type foo = record { x: uint8[3]; };

(Note: this behavior may change in future versions of binpac.)

A "record" composes primitive types and other records to create a new "type". This new "type" can in turn be used as part of a parent type or directly for parsing.

Example:

type SMB_body = record {
   word_count  : uint8;
   parameter_words : uint16[word_count];
   byte_count  : uint16;
};
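
To make the record above concrete, here is a hand-written C++ sketch of the parsing steps it describes: read word_count, then that many 16-bit words, then byte_count. This is not binpac output. The names SmbBodySketch, read_u16_le, and parse_smb_body are hypothetical, and little-endian byte order is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical hand-written equivalent of the SMB_body record.
struct SmbBodySketch {
    uint8_t word_count = 0;
    std::vector<uint16_t> parameter_words;
    uint16_t byte_count = 0;
};

// Read a little-endian uint16 (byte order is an assumption of this sketch).
inline uint16_t read_u16_le(const uint8_t* p) {
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

// Parses [begin, end) and returns the number of bytes consumed.
inline std::size_t parse_smb_body(const uint8_t* begin, const uint8_t* end,
                                  SmbBodySketch& out) {
    assert(begin <= end);
    const std::size_t avail = static_cast<std::size_t>(end - begin);
    if (avail < 1)
        throw std::out_of_range("SMB_body: missing word_count");
    out.word_count = begin[0];

    // 1 byte word_count + 2 bytes per parameter word + 2 bytes byte_count.
    const std::size_t need =
        1 + 2 * static_cast<std::size_t>(out.word_count) + 2;
    if (avail < need)
        throw std::out_of_range("SMB_body: truncated input");

    out.parameter_words.clear();
    const uint8_t* p = begin + 1;
    for (uint8_t i = 0; i < out.word_count; ++i, p += 2)
        out.parameter_words.push_back(read_u16_le(p));
    out.byte_count = read_u16_le(p);
    return need;
}
```

The point of the sketch is that the array length is taken from a field parsed earlier in the same record, which is what uint16[word_count] expresses in the grammar.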

The "case" compositor allows switching between different parsing methods.

type SMB_string(unicode: bool, offset: int) = case unicode of {
   true  -> u: SMB_unicode_string(offset);
   false -> a: SMB_ascii_string;
};

A "case" supports an optional "default" label, which is taken when none of the other labels match. If no fields follow a given label, the user can specify an arbitrary field name with the "empty" type, as in the following example.

type HTTP_Message(expect_body: ExpectBody) = record {
       headers:     HTTP_Headers;
       body_or_not: case expect_body of {
               BODY_NOT_EXPECTED -> none: empty;
               default           -> body: HTTP_Body(expect_body);
       };
};

Note that only one field is allowed after a given label. If multiple fields are to be specified, they should be packed in another "record" type first. The other usages of case are described later.

A type can be defined as a sequence of "single-type elements". By default, an array type keeps parsing array elements in an infinite loop. Alternatively, an array size can be specified to control the number of elements matched, or &until can be used to end parsing conditionally:

# This will match exactly 10 elements
type HTTP_Headers = HTTP_Header [10];

# This will match until the condition is met
type HTTP_Headers = HTTP_Header [] &until(/*Some condition*/);

Arrays can also be used directly inside a "record". For example:

type DNS_message = record {
 header:      DNS_header;
 question:    DNS_question(this)[header.qdcount];
 answer:      DNS_rr(this, DNS_ANSWER)[header.ancount];
 authority:   DNS_rr(this, DNS_AUTHORITY)[header.nscount];
 additional:  DNS_rr(this, DNS_ADDITIONAL)[header.arcount];
} &byteorder = bigendian, &exportsourcedata;

A "flow" defines how data is fed into the analyzer. It also maintains custom state information declared by %member. A flow is configured by specifying the type of its data unit.

Syntax:

flow <Flow name>(<optional attribute>) {
  <flowunit|datagram> = <top level data unit> withcontext (<context constructor parameter>);
};

When a "flow" is added to the top-level analyzer context, it enables the use of &oneline and &length in "record" types. The flow buffers data when there is not enough to evaluate the record and dispatches the data for evaluation once the threshold is reached.

When flowunit is used, the analyzer uses a flow buffer to handle incremental input and to support &oneline/&length. For further detail on this, see Buffering.

flowunit = HTTP_PDU(is_orig) withcontext (analyzer, this);

In contrast to flowunit, declaring the data unit as datagram opts out of the flow buffer. This results in faster parsing but provides no incremental input or buffering support.

datagram = HTTP_PDU(is_orig) withcontext (analyzer, this);

A "padding" field can align the data that follows it to a byte boundary:

type RPC_Opaque = record {
   length: uint32;
   data:   uint8[length];
   pad:    padding align 4;    # pad to 4-byte boundary
};
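
The alignment arithmetic can be sketched in C++ as follows. This is a hand-written illustration rather than binpac output: the names pad_to, RpcOpaqueSketch, and parse_rpc_opaque are hypothetical, the length is assumed to be big-endian, and offsets are assumed to be measured from the start of the record.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Pad bytes needed to move `offset` up to the next multiple of `align`.
inline std::size_t pad_to(std::size_t offset, std::size_t align) {
    assert(align > 0);
    return (align - offset % align) % align;
}

struct RpcOpaqueSketch {
    uint32_t length = 0;
    std::vector<uint8_t> data;
};

// Parses [begin, end) and returns bytes consumed, including the padding.
inline std::size_t parse_rpc_opaque(const uint8_t* begin, const uint8_t* end,
                                    RpcOpaqueSketch& out) {
    assert(begin <= end);
    const std::size_t avail = static_cast<std::size_t>(end - begin);
    if (avail < 4)
        throw std::out_of_range("RPC_Opaque: missing length");

    // Big-endian 32-bit length (byte order is an assumption of this sketch).
    out.length = (uint32_t(begin[0]) << 24) | (uint32_t(begin[1]) << 16) |
                 (uint32_t(begin[2]) << 8) | uint32_t(begin[3]);
    if (out.length > avail - 4)
        throw std::out_of_range("RPC_Opaque: truncated data");

    const std::size_t body = 4 + static_cast<std::size_t>(out.length);
    const std::size_t total = body + pad_to(body, 4);
    if (total > avail)
        throw std::out_of_range("RPC_Opaque: truncated padding");

    out.data.assign(begin + 4, begin + body);
    return total;
}
```

For a 5-byte payload the record occupies 4 + 5 = 9 bytes, so 3 pad bytes bring the next field to offset 12.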

Users can define functions in binpac. A function can be declared in one of three ways:

A PAC-style function prototype with the body embedded using %{ %}:

function print_stuff(value :const_bytestring):bool
%{
   printf("Value [%s]\n", std_str(value).c_str());
%}

A PAC-style function with a case body; this form of declaration is useful for later extension with casefunc:

function RPC_Service(prog: uint32, vers: uint32): EnumRPCService =
   case prog of {
       default -> RPC_SERVICE_UNKNOWN;
   };

A function can be completely inlined by using %code:

%code{
EnumRPCService RPC_Service(const RPC_Call* call)
   {
   return call ? call->service() : RPC_SERVICE_UNKNOWN;
   }
%}

PAC code can be extended by using "refine". This is useful for code reuse and for splitting functionality for parallel development.

A record can be extended with additional attribute(s) by using "refine typeattr". A typical use is to add &let in order to separate protocol parsing from protocol analysis.

refine typeattr HTTP_RequestLine += &let {
   process_request: bool =
       process_func(method, uri, version);
};

A "case" type can be extended with additional cases by using "refine casetype":

refine casetype RPC_Params += {
   RPC_SERVICE_PORTMAP -> portmap: PortmapParams(call);
};

A function declared as a PAC case can be extended by adding additional cases to the switch.

refine casefunc RPC_BuildCallVal += {
   RPC_SERVICE_PORTMAP ->
       PortmapBuildCallVal(call, call.params.portmap);
};

A connection can be extended to add functions and members. Example:

refine connection RPC_Conn += {
   function ProcessPortmapReply(results: PortmapResults): bool
       %{
       %}
};

State is maintained by extending the parsing class with derivative fields. State lasts until the top-level parsing unit (flowunit/datagram) is destroyed.

C++ code can be embedded within the .pac file using the following directives. This code is copied into the final generated code.

  • %header{...%}

    Code to be inserted in binpac generated header file.

  • %code{...%}

    Code to be inserted at the beginning of binpac generated C++ file.

  • %member{...%}

    Add additional member(s) to connection (?) and flow class.

  • %init{...%}

    Code to be inserted in flow constructor.

  • %cleanup{...%}

    Code to be inserted in flow destructor.

  • ${
  • $set{
  • $type{
  • $typeof{
  • $const_def{

"&until" is used in conjunction with array declaration. It specifies exit condition for array parsing.

type HTTP_Headers = HTTP_Header[] &until($input.length() == 0);

"&requires" processes data dependencies before evaluating a field.

Example: typically, a derivative field is evaluated after the primary fields. However, "&requires" is used here to force evaluation of length before msg_body.

type RPC_Message = record {
   xid:        uint32;
   msg_type:   uint32;
   msg_body:   case msg_type of {
       RPC_CALL    -> call:    RPC_Call(this);
       RPC_REPLY   -> reply:   RPC_Reply(this);
   } &requires(length);
} &let {
   length = sourcedata.length();   # length of the RPC_Message
} &byteorder = bigendian, &exportsourcedata, &refcount;

"&if" evaluates a field only if the condition is met.

type DNS_label(msg: DNS_message) = record {
   length:     uint8;
   data:       case label_type of {
       0 ->    label:  bytestring &length = length;
       3 ->    ptr_lo: uint8;
   };
} &let {
   label_type: uint8   = length >> 6;
   last: bool      = (length == 0) || (label_type == 3);
   ptr: DNS_name(msg)
       withinput $context.flow.get_pointer(msg.sourcedata,
           ((length & 0x3f) << 8) | ptr_lo)
       &if(label_type == 3);
   clear_pointer_set: bool = $context.flow.reset_pointer_set()
       &if(last);
};
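
The bit arithmetic in the &let block above can be checked in isolation. The following C++ sketch restates the three expressions as plain functions; the function names are hypothetical and chosen only to mirror the field names in the grammar.

```cpp
#include <cassert>
#include <cstdint>

// The top two bits of the length byte select the label type
// (0 = ordinary label, 3 = compression pointer).
inline uint8_t label_type(uint8_t length) {
    return static_cast<uint8_t>(length >> 6);
}

// Combine the low six bits of `length` with the following byte
// into a 14-bit offset, as in ((length & 0x3f) << 8) | ptr_lo.
inline uint16_t pointer_offset(uint8_t length, uint8_t ptr_lo) {
    return static_cast<uint16_t>(((length & 0x3f) << 8) | ptr_lo);
}

// Mirrors `last`: a zero-length label or a pointer ends the name.
inline bool is_last(uint8_t length) {
    return length == 0 || label_type(length) == 3;
}
```

For example, the byte pair 0xC0 0x0C has label type 3 and offset 12, so only then does the &if(label_type == 3) condition allow the pointer to be followed.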

There are two uses of the "case" keyword.

  • As part of a record field. In this scenario, it allows alternative methods of parsing a field. Example:

    type RPC_Reply(msg: RPC_Message) = record {
      stat:       uint32;
      reply:      case stat of {
          MSG_ACCEPTED -> areply:  RPC_AcceptedReply(call);
          MSG_DENIED   -> rreply:  RPC_RejectedReply(call);
      };
    } &let {
      call: RPC_Call = $context.connection.FindCall(msg.xid);
      success: bool = (stat == MSG_ACCEPTED && areply.stat == SUCCESS);
    };
    
  • As a function definition. Example:

    function RPC_Service(prog: uint32, vers: uint32): EnumRPCService =
        case prog of {
                default -> RPC_SERVICE_UNKNOWN;
        };
    

Note that one can "refine" both types of cases:

refine casefunc RPC_Service += {
       100000  -> RPC_SERVICE_PORTMAP;
};

$input refers to the data that was passed into the ParseBuffer function. When $input is used, binpac generates a const_bytestring which contains the start and end pointers of the input.

PAC grammar:

&until($input.length()==0);

PAC source:

const_bytestring t_val__elem_input(t_begin_of_data, t_end_of_data);
if (  ( t_val__elem_input.length() == 0 )  )

$element provides access to an entry of an array type. It can be used in the following ways.

  • Current element. Checks the value of the most recently parsed entry. The check is executed each time an entry is parsed. Example:

    type SMB_ascii_string       = uint8[] &until($element == 0);
    
  • Current element's field. Example:

    type DNS_label(msg: DNS_message) = record {
       length:     uint8;
       data:       case label_type of {
           0 ->    label:  bytestring &length = length;
           3 ->    ptr_lo: uint8;
       };
    } &let {
       label_type: uint8 = length >> 6;
       last:       bool  = (length == 0) || (label_type == 3);
    };
    type DNS_name(msg: DNS_message) = record {
       labels:     DNS_label(msg)[] &until($element.last);
    };
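
The first form, uint8[] &until($element == 0), amounts to a loop that checks each element right after it is parsed. A minimal hand-written C++ sketch follows; it is not binpac output, the name parse_until_zero is hypothetical, and keeping the terminating element in the array is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Parse bytes into `out` until the element just parsed is 0 or the
// input runs out. Returns the number of bytes consumed. The terminating
// 0 is consumed and kept in `out` (an assumption of this sketch).
inline std::size_t parse_until_zero(const uint8_t* begin, const uint8_t* end,
                                    std::vector<uint8_t>& out) {
    assert(begin <= end);
    const uint8_t* p = begin;
    while (p < end) {
        const uint8_t element = *p++;  // parse one element
        out.push_back(element);
        if (element == 0)              // the &until($element == 0) check
            break;
    }
    return static_cast<std::size_t>(p - begin);
}
```

The second form, $element.last, follows the same pattern with the check reading a field of the element instead of the element itself.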
    

$context refers to the analyzer context class (the Context<Name> class is generated from analyzer <Name> withcontext {}). Using this macro, users can gain access to the "flow" object and the "analyzer" object.

"&transient" avoids creating a copy of the bytestring:

type MIME_Line = record {
   line:   bytestring &restofdata &transient;
} &oneline;

"&let" adds derivative fields to a record:

type ncp_request(length: uint32) = record {
   data        : uint8[length];
} &let {
   function    = length > 0 ? data[0] : 0;
   subfunction = length > 1 ? data[1] : 0;
};
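
In C++ terms, the derivative fields above are values computed from the primary field once it has been parsed. The following hand-written sketch shows the same guarded expressions; NcpRequestSketch and parse_ncp_request are hypothetical names, and the int type for the derivative fields is an assumption, since the grammar does not state one.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct NcpRequestSketch {
    std::vector<uint8_t> data;  // primary field: filled directly by parsing
    int function = 0;           // derivative field: computed afterwards
    int subfunction = 0;        // derivative field: computed afterwards
};

// `begin` must point at no fewer than `length` readable bytes.
inline NcpRequestSketch parse_ncp_request(const uint8_t* begin,
                                          uint32_t length) {
    assert(begin != nullptr);
    NcpRequestSketch r;
    r.data.assign(begin, begin + length);

    // Same guards as in the &let block: never index past the data.
    r.function = length > 0 ? r.data[0] : 0;
    r.subfunction = length > 1 ? r.data[1] : 0;
    return r;
}
```

The length checks matter because a request of zero or one byte would otherwise read outside the parsed data.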

"let" declares a global value. If the user does not specify a type, the compiler assumes the "int" type.

PAC grammar:

let myValue:uint8=10;

PAC source:

uint8 const myValue = 10;

PAC header:

extern uint8 const myValue;

"&restofdata" grabs the rest of the data available in the FlowBuffer.

PAC grammar:

onebyte: uint8;
value: bytestring &restofdata &transient;

PAC source:

// Parse "onebyte"
onebyte_ = *((uint8 const *) (t_begin_of_data));
// Parse "value"
int t_value_string_length;
t_value_string_length = (t_end_of_data) - ((t_begin_of_data + 1));
int t_value__size;
t_value__size = t_value_string_length;
value_.init((t_begin_of_data + 1), t_value_string_length);

"&length" can appear in two different contexts: as a property of a field or as a property of a record. Example of &length as a field property:

protocol    : bytestring &length = 4;

translates into:

const_byteptr t_end_of_data = t_begin_of_data + 4;
int t_protocol_string_length;
t_protocol_string_length = 4;
int t_protocol__size;
t_protocol__size = t_protocol_string_length;
protocol_.init(t_begin_of_data, t_protocol_string_length);

"&check" was originally intended to implement the behavior of the "&enforce" attribute that supersedes it. It has always been, and will remain, a no-op, so that anything using it does not suddenly and unintentionally break.

"&enforce" checks a condition and raises an exception if it is not met.

When parsing a long field of variable length, "&chunked" can be used to improve performance. However, chunked fields are not buffered across packets. The data for the chunk in the current packet can be accessed using "$chunk".

The data matched for a particular type can be retained by using "&exportsourcedata".

.pac file

type myType = record {
   data:uint8;
} &exportsourcedata;

_pac.h

class myType
{
public:
   myType();
   ~myType();
   int Parse(const_byteptr const t_begin_of_data, const_byteptr const t_end_of_data);
   uint8 data() const    { return data_; }
   const_bytestring const & sourcedata() const { return sourcedata_; }
protected:
   uint8 data_;
   const_bytestring sourcedata_;
};

_pac.cc

sourcedata_ = const_bytestring(t_begin_of_data, t_end_of_data);
sourcedata_.set_end(t_begin_of_data + 1);

Source data can be used within the type that matches it or in the parent type.

type myParentType (child:myType) = record {
    somedata:uint8;
} &let{
   do_something:bool = print_stuff(child.sourcedata);
};

translates into

do_something_ = print_stuff(child()->sourcedata());

binpac supports incremental input to deal with packet fragmentation. This is done through the FlowBuffer class and by maintaining buffering and parsing states.

FlowBuffer provides two modes of buffering: line and frame. Line mode is useful for parsing line-based protocols like HTTP. Frame mode is best for fixed-length messages. The buffering mode can be switched during parsing, transparently to the grammar writer.

At compile time, binpac calculates the number of bytes required to evaluate each field. At run time, data is buffered in FlowBuffer until there is enough to evaluate the "record". To optimize buffering, if FlowBuffer has enough data to evaluate the record on the first NewData, it only marks the start and end pointers instead of copying.
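
The mark-instead-of-copy optimization can be sketched for frame mode as follows. This is a deliberately simplified, hypothetical class (FrameBufferSketch), not binpac's FlowBuffer: it handles a single frame of known length and omits line mode, chunking, and mode switching.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified frame-mode buffer; not binpac's FlowBuffer.
class FrameBufferSketch {
public:
    explicit FrameBufferSketch(std::size_t frame_length)
        : frame_length_(frame_length), marked_begin_(nullptr) {}

    // Feed one chunk of input; returns the number of bytes taken from it.
    std::size_t NewData(const uint8_t* begin, const uint8_t* end) {
        assert(begin <= end);
        if (ready())
            return 0;
        const std::size_t avail = static_cast<std::size_t>(end - begin);
        if (copy_.empty() && avail >= frame_length_) {
            marked_begin_ = begin;  // whole frame present: mark, do not copy
            return frame_length_;
        }
        const std::size_t missing = frame_length_ - copy_.size();
        const std::size_t take = avail < missing ? avail : missing;
        copy_.insert(copy_.end(), begin, begin + take);  // partial: copy
        return take;
    }

    bool ready() const {
        return marked_begin_ != nullptr || copy_.size() == frame_length_;
    }
    bool is_marked() const { return marked_begin_ != nullptr; }

    // Frame bytes; a marked frame is valid only while the caller's chunk is.
    const uint8_t* begin() const {
        return marked_begin_ ? marked_begin_ : copy_.data();
    }
    const uint8_t* end() const { return begin() + frame_length_; }

private:
    std::size_t frame_length_;
    const uint8_t* marked_begin_;
    std::vector<uint8_t> copy_;
};
```

When the first chunk already holds the whole frame, begin() points straight into the caller's data; only a frame split across chunks pays for a copy.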

  • void NewMessage();
    • Advances the orig_data_begin_ pointer depending on the current mode_: by 1 or 2 characters in LINE_MODE, by frame_length_ in FRAME_MODE, and not at all in UNKNOWN_MODE (the default mode).
    • Sets buffer_n_ to 0
    • Resets message_complete_
  • void NewLine();
    • Resets frame_length_ and chunked_, sets mode_ to LINE_MODE
  • void NewFrame(int frame_length, bool chunked_);
  • void GrowFrame(int new_frame_length);
  • void AppendToBuffer(const_byteptr data, int len);
    • Reallocates buffer_ to make room for the new data, then copies the data
  • void ExpandBuffer(int length);
    • Reallocates buffer_ to the new size if the new size is bigger than the current size.
    • Sets the minimum size to 512 (optimization?)
  • void MarkOrCopyLine();
    • Seeks the end of line in the current input (CR, LF, or CRLF, depending on the line break mode). If found, appends the data found to the buffer if one has already been created, or marks it (sets frame_length_) if not, to minimize copying. If the end of line is not found, appends the partial data up to the end of input to the buffer, creating the buffer if there is none.
  • const_byteptr begin()/end()
    • Returns buffer_ and buffer_n_ if a buffer exists, otherwise orig_data_begin_ and orig_data_begin_ + frame_length_.
  • buffering_state_ - each parsing class contains a flag indicating whether enough data is buffered to evaluate the next block.
  • parsing_state_ - each parsing class which consists of multiple parsing data units (lines/frames) has this flag indicating the parsing stage. Each time new data comes in, it invokes the parsing function and switches on parsing_state_ to determine which sub-parser to use next.

To run binpac-generated code independently of Zeek, the regex library must be substituted. Below is one way of doing it, using the following three header files.

/*Dummy file to replace Zeek's file*/
#include "binpac_pcre.h"
#include "bro_dummy.h"

#ifndef BRO_DUMMY
#define BRO_DUMMY
#define DEBUG_MSG(x...)  fprintf(stderr, x)
/*Dummy to link, this function is supposed to be in Zeek*/
double network_time();
#endif

#ifndef bro_pcre_h
#define bro_pcre_h
#include <stdio.h>
#include <assert.h>
#include <string>
using namespace std;
// TODO: use configure to figure out the location of pcre.h
#include "pcre.h"
class RE_Matcher {
public:
   RE_Matcher(const char* pat){
       pattern_ = "^";
       pattern_ += "(";
       pattern_ += pat;
       pattern_ += ")";
       pcre_   = NULL;
       pextra_ = NULL;
   }
   ~RE_Matcher() {
       if (pcre_) {
           pcre_free(pcre_);
       }
   }
   int Compile() {
       const char *err = NULL;
       int erroffset = 0;
       pcre_ = pcre_compile(pattern_.c_str(),
                                    0,  // options,
                                    &err,
                                    &erroffset,
                                    NULL);
       if (pcre_ == NULL) {
           fprintf(stderr,
                   "Error in RE_Matcher::Compile(): %d:%s\n",
                   erroffset, err);
           return 0;
       }
       return 1;
   }

   int MatchPrefix (const char* s, int n){
       const char *err=NULL;
       assert(pcre_);
       const int MAX_NUM_OFFSETS = 30;
       int offsets[MAX_NUM_OFFSETS];
       int ret = pcre_exec(pcre_,
                                   pextra_,  // pcre_extra
                                   //NULL,  // pcre_extra
                                   s, n,
                                   0,     // offset
                                   0,     // options
                                   offsets,
                                   MAX_NUM_OFFSETS);
       if (ret < 0) {
           return -1;
       }
       assert(offsets[0] == 0);
       return offsets[1];
   }
protected:
   pcre *pcre_;
   pcre_extra *pextra_;
   string pattern_;
};
#endif

In your main source, add this dummy stub.

/*Dummy to link, this function suppose to be in Zeek*/
double network_time(){
   return 0;
}
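
If PCRE is not available, a matcher with the same Compile/MatchPrefix shape can be sketched on top of the C++ standard library. This swaps in a different regex engine (std::regex with ECMAScript syntax), so patterns written for Zeek's matcher or for PCRE may not behave identically. The class name StdRegexMatcher is hypothetical and is not part of binpac.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical std::regex-based stand-in for the PCRE matcher above.
class StdRegexMatcher {
public:
    explicit StdRegexMatcher(const char* pat)
        : pattern_(pat), compiled_(false) {}

    // Returns 1 on success and 0 if the pattern is rejected.
    int Compile() {
        try {
            re_.assign(pattern_, std::regex::ECMAScript);
            compiled_ = true;
        } catch (const std::regex_error&) {
            compiled_ = false;
        }
        return compiled_ ? 1 : 0;
    }

    // Returns the length of the match anchored at s, or -1 if none.
    int MatchPrefix(const char* s, int n) const {
        assert(compiled_);
        std::cmatch m;
        // match_continuous only accepts matches that start at s.
        if (!std::regex_search(s, s + n, m, re_,
                               std::regex_constants::match_continuous))
            return -1;
        return static_cast<int>(m.length(0));
    }

private:
    std::string pattern_;
    std::regex re_;
    bool compiled_;
};
```

The match_continuous flag plays the role of the "^" that the PCRE version prepends to the pattern: a match is accepted only if it begins at the first byte of the input.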

  • Does &oneline only work when "flow" is used?

    Yes. binpac uses the flowunit definition in "flow" to figure out which types require buffering. For those that do, the parse function is:

    bool ParseBuffer(flow_buffer_t t_flow_buffer, ContextHTTP * t_context);
    

    And the code of flow_buffer_t provides the functionality of buffering up to one line. That's why &oneline is only active when "flow" is used and the type requires buffering.

    In certain cases we would want to use &oneline even if the type does not require buffering; binpac currently does not provide such functionality.

  • How would incremental input work in the case of regex?

    A regex should not take incremental input. (The binpac compiler will complain when that happens.) It should always appear below some type that has either &length=... or &oneline.

  • What is the role of the Context<Name> class (generated by analyzer <Name> withcontext)?

  • What is the difference between using "withcontext" and omitting it?

    withcontext should always be there. It's fine to have an empty context.

  • Elaborate on $context and how it is related to "withcontext".

    A "context" parameter is passed to every type. It provides a vehicle to pass something to every type without adding a parameter to every type. In that sense, it's optional. It exists for convenience.

  • Example usage of composite type array.

    Please see HTTP_Headers in http-protocol.pac in the Zeek source code.

  • Clarification on "connection" keyword (binpac paper).

  • Need a new way to attach additional hook code to each class besides &let.

  • &transient: how is this different from declaring an anonymous field? Currently it doesn't seem to do much.

    type HTTP_Header = record {
        name:   HTTP_HEADER_NAME &transient;
        :       HTTP_WS;
        value:  bytestring &restofdata &transient;
    } &oneline;
    
    // Parse "name"
    int t_name_string_length;
    t_name_string_length =
        HTTP_HEADER_NAME_re_011.MatchPrefix(
            t_begin_of_data,
            t_end_of_data - t_begin_of_data);
    if ( t_name_string_length < 0 )
        {
        throw ExceptionStringMismatch( "./http-protocol.pac:96",
             "|([^: \\t]+:)",
             string((const char *) (t_begin_of_data), (const char *) t_end_of_data).c_str()
             );
        }
    int t_name__size;
    t_name__size = t_name_string_length;
    name_.init(t_begin_of_data, t_name_string_length);
    
  • Detail on the globals ($context, $element, $input...etc)

  • How does BinPAC work with dynamic protocol detection?

    Well, you can use the code in DNS-binpac.cc as a reference. First, create a pointer to the connection. (See the example in DNS-binpac.cc)

    interp = new binpac::DNS::DNS_Conn(this);
    

    Pass the data received from "DeliverPacket" or "DeliverStream" to "interp->NewData()". (Again, see the example in DNS-binpac.cc)

    void DNS_UDP_Analyzer_binpac::DeliverPacket(int len, const u_char* data, bool orig, int seq, const IP_Hdr* ip, int caplen)
        {
        Analyzer::DeliverPacket(len, data, orig, seq, ip, caplen);
        interp->NewData(orig, data, data + len);
        }
    
  • Explanation of &withinput

  • Difference between using flow and not using flow (binpac generates Parse method instead of ParseBuffer)

  • &check currently working?

  • Difference between flowunit and datagram, datagram and &oneline, &length?

  • Go over TODO list in binpac release

  • How would input get handled/buffered when the length is not known (chunked)?

  • More features for multi-byte characters? UTF-16, UTF-32, etc.

  • Provide a method to match simple ASCII text.
  • Allow use of fixed-length arrays in addition to vectors.
  • Remove anonymous field bytestring assignment.
  • Redundant overflow checking/more efficient fixed length text copying.

Things that the compiler should flag at code generation time

  • Give a warning when &transient is used on a non-bytestring
  • Give a warning when &oneline or &length is used and flowunit is not.
  • Give a warning when more than one "connection" is defined