Handling protocol versions

When you work on protocol-aware tools, sooner or later you will have to make them handle different versions of the same protocol.

Why (create a tool)

Let’s say I have a tool that can decode messages of SOP v.1 and I am about to release SOP v.2. Most likely the tool will not decode the new messages. A simple way to handle this is to update the tool to be v.2 compatible. But what if you still want to process v.1 messages? And what if you want to analyze your old log files?

How (to deal with this problem)

You can either:

a. keep two different versions of the tool, or
b. make the tool version-aware.

The former (a) is simple but can become a headache after you release a few versions. For example, a bug has to be fixed in every version of the tool. Another example is adding a new feature to the latest version of your tool: what happens to the older versions? Will you back-port the feature and re-process your old log files?

The latter (b) is more complicated but can lead to the separation of policy from mechanism and, in the long run, is easier to test and maintain. Personally, I apply this separation by keeping the protocol specifications in CSV files and by creating tools that are ignorant of the protocol, since the protocol resides in the CSV files and not in the tools. As an example, have a look at how you can use CSV files in a Wireshark Dissector.
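For example, the CSV spec of a single message type can be as simple as the following (a hypothetical file, using the column names that appear throughout this post; the fields are those of SOP’s rejection message):

Field,Length,Type,Description
msgType,2,STRING,Defines message type
clientId,16,STRING,Unique identifier for the order
rejectionCode,3,NUMERIC,Rejection code
text,48,STRING,Text explaining the rejection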

Design

No matter which approach you choose, you can still handle multiple protocol versions by:

  1. isolating each version (protocol specs and/or scripts) in a directory
  2. updating your tool to use an environment variable pointing to one of these directories
  3. setting the environment variable depending on the protocol version you want to work with
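For example, a hypothetical layout with one directory per released version of the specs could look like this:

specs/
    1.0/
        NO.csv  OC.csv  RJ.csv  TR.csv
    2.0/
        NO.csv  OC.csv  RJ.csv  TR.csv

Pointing the environment variable (SOP_SPECS_PATH in the implementation below) at specs/1.0 then makes the tool behave as a v.1 decoder.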

Implementation

For an example implementation, let’s look at init.lua and sop.lua, the Wireshark Dissector for the SOP protocol.

As you can see in the method call to loadSpecs, the second parameter, SOP_SPECS_PATH, is declared in init.lua and set to the value of the environment variable of the same name.

-- From init.lua
SOP_SPECS_PATH = os.getenv("SOP_SPECS_PATH")
-- From sop.lua
local msg_specs, msg_parsers = helper:loadSpecs(msg_types, SOP_SPECS_PATH,
                                                columns, header:len(),
                                                ',', header, trailer)

Then we can set SOP_SPECS_PATH to the directory with the specs of a specific version of SOP.

$ SOP_SPECS_PATH=/path/to/version tshark -Y sop -r file_containing_sop_msgs.cap

Usage

The source code is on GitHub. The easiest way to test it is using Vagrant. After starting up the Vagrant box and connecting to it (ssh):

# Trying out an older capture file with the latest SOP version:
tshark -Y sop -r /vagrant/logs/sop_2016-01-21.pcapng | wc -l
# output: 8
# Then with the correct SOP version:
SOP_SPECS_PATH=$SOP_SPECS_PATH/1.0 tshark -Y sop -r /vagrant/logs/sop_2016-01-21.pcapng | wc -l
# output: 9

# To test a new capture file that supports the latest SOP version:
tshark -Y sop -r /vagrant/logs/sop_2017-07-17.pcapng | wc -l
# output: 9
# Then with the older SOP version:
SOP_SPECS_PATH=$SOP_SPECS_PATH/1.0 tshark -Y sop -r /vagrant/logs/sop_2017-07-17.pcapng | wc -l
# output: 8

Taking it one step further

We can take it one step further and automate the setup of this environment variable when a date is found in the capture filename. A script can then look up the version that was active on that date and set SOP_SPECS_PATH accordingly.

So, with SOP protocol, we can:

  1. add versions.csv with all SOP versions:
    version release_date info
    1.0 2016-01-01 First release of SOP.
    2.0 2017-07-01 Second release with larger RJ.text!!!!
  2. create a script, sop_specs_path.sh, that, given a filename containing a date (YYYY-MM-DD or YYYY_MM_DD), looks in versions.csv for the latest SOP version released before that date (a sketch of the idea follows this list).

  3. update cap2sop.sh, to use sop_specs_path.sh before calling tshark.
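Here is a minimal sketch of the idea behind sop_specs_path.sh (the script in the repo may differ); it assumes versions.csv is comma-separated and sorted by release_date:

#!/bin/bash
# Usage: sop_specs_path.sh CAPTURE_FILE
# Pull a YYYY-MM-DD or YYYY_MM_DD date out of the filename.
capture_date=$(basename "$1" | grep -oE '[0-9]{4}[-_][0-9]{2}[-_][0-9]{2}' | tr '_' '-')
# Print the latest version released on or before that date.
awk -F, -v d="$capture_date" 'NR > 1 && $2 <= d { v = $1 } END { print v }' versions.csv

cap2sop.sh could then run something like SOP_SPECS_PATH=$SPECS_ROOT/$(./sop_specs_path.sh "$CAP_FILE") before calling tshark, where SPECS_ROOT is wherever the version directories live.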

Then we can even call cap2sop.sh with capture files of different SOP versions.

cap2sop.sh /vagrant/logs/sop_20*.pcapng

Here’s the output. Notice that all RJ messages are decoded although sop_2016-01-21.pcapng has messages of SOP v.1 and sop_2017-07-17.pcapng has messages of v.2.

frame dateTime msgType clientId ethSrc ethDst ipSrc ipDst capFile
3 14:15:16.608027 NO SomeClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
3 14:15:16.608027 OC SomeClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
3 14:15:16.608027 NO AnotherClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
3 14:15:16.608027 RJ AnotherClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
3 14:15:16.608027 NO AnotherClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
3 14:15:16.608027 OC AnotherClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
7 14:15:18.617432 NO SomeClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
9 14:15:19.622369 OC SomeClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
11 14:15:20.627967 RJ AnotherClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
17 14:15:31.103467 NO SomeClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
17 14:15:31.103467 OC SomeClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
17 14:15:31.103467 NO AnotherClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
17 14:15:31.103467 RJ AnotherClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
17 14:15:31.103467 NO AnotherClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
17 14:15:31.103467 OC AnotherClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
17 14:15:31.103467 NO SomeClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
17 14:15:31.103467 OC SomeClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
17 14:15:31.103467 RJ AnotherClientId 00:0c:29:c8:76:1d 00:50:56:c0:00:00 192.168.58.128 192.168.58.1 sop_2016-01-21.pcapng
3 09:17:09.240402 NO SomeClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng
4 09:17:10.246698 OC SomeClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng
5 09:17:11.248273 NO AnotherClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng
6 09:17:12.249938 RJ AnotherClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng
7 09:17:13.252035 NO AnotherClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng
8 09:17:14.254140 OC AnotherClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng
11 09:17:17.260084 NO SomeClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng
11 09:17:17.260084 OC SomeClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng
11 09:17:17.260084 NO AnotherClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng
11 09:17:17.260084 RJ AnotherClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng
11 09:17:17.260084 NO AnotherClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng
11 09:17:17.260084 OC AnotherClientId 08:00:27:6f:12:9e 0a:00:27:00:00:0e 192.168.56.11 192.168.56.1 sop_2017-07-17.pcapng

Viewing/Editing files of fixed width format

In an earlier post, I described a way to load a file with lines of fixed-width format into R. Using RStudio (and the View function), it is easy to view the data. On the other hand, it is not easy to edit the file when it has lines of different format. My multifwf package has a read.multi.fwf function but not a write.multi.fwf to write changes back to the file. On top of that, it is not easy for someone unfamiliar with R to filter/subset the loaded data.

Why

Reading a fixed-width log file can be very frustrating, especially while you are troubleshooting. For example, look at three lines from a SOP log file:

09:20:05.034 < NOSLMT0000666    EVILCORP00010.77SomeClientId    SomeAccountId   
09:20:05.099 > OC000001BLMT0000666    EVILCORP00010.77SomeClientId    SomeAccountId   
09:20:13.421 < NOBLMT0000666 EVILCORP00001.10AnotherClientId AnotherAccountId

Confusing, right? And these lines are from a very simple protocol. At the place where I work, we have much longer messages (some longer than 300 characters). Imagine having to find a certain field in a 350-character message; under pressure this can be hard even for experienced employees. It would definitely help to be able to view the message fields in a clear way, to quickly and safely locate any field in a given line even if you are new to the message protocol, and to run queries that filter out any lines you don’t need.

Apart from reading a fixed-width log file, sometimes you might need to edit one. For example, you might want to replay a log file but only after changing some values.

How

A solution is to use Record Editor or its little brother, reCsvEditor. Both of these editors can be used to view/edit fixed-width data files as long as they know how the data is laid out. This layout can be described in several formats, called Copybooks. After some experimentation I decided to use the XML Copybook, since it allows for multiple data layouts in a single file. This way a single XML file can hold the specs for all the message types of the protocol. Another reason behind this choice is that the XML examples of Record Editor were the most intuitive.

Having decided on the layout format, I only had to create one for each of the four protocols I work with. Of course, this is tedious and error prone, and since I already have the message specs in CSV format, I decided to create a tool that converts CSV specs to the XML Copybook format.

Here’s a simplified version of the “ams PO Download.Xml” example that comes with Record Editor:

<?xml version="1.0" ?>
<RECORD RECORDNAME="ams PO Download" COPYBOOK="" DELIMITER="<Tab>" FILESTRUCTURE="Default" STYLE="0"
        RECORDTYPE="GroupOfRecords" LIST="Y" QUOTE="" RecSep="default">
    <RECORDS>
        <RECORD RECORDNAME="ams PO Download: Detail" COPYBOOK="" DELIMITER="<Tab>"
                DESCRIPTION="PO Download: Detail" FILESTRUCTURE="Default" STYLE="0" RECORDTYPE="RecordLayout"
            LIST="N" QUOTE="" RecSep="default" TESTFIELD="Record Type" TESTVALUE="D1">
            <FIELDS>
                <FIELD NAME="Record Type"  POSITION="1" LENGTH="2" TYPE="Char"/>
                <FIELD NAME="Pack Qty"     POSITION="3" LENGTH="9" DECIMAL="4" TYPE="Num Assumed Decimal (Zero padded)"/>
                <FIELD NAME="Pack Cost"    POSITION="12" LENGTH="13" DECIMAL="4" TYPE="Num Assumed Decimal (Zero padded)"/>
                <FIELD NAME="APN"          POSITION="25" LENGTH="13" TYPE="Num (Right Justified zero padded)"/>
                <FIELD NAME="Filler"       POSITION="38" LENGTH="1" TYPE="Char"/>
                <FIELD NAME="Product"      POSITION="39" LENGTH="8" TYPE="Num (Right Justified zero padded)"/>
                <FIELD NAME="pmg dtl tech key" POSITION="72" LENGTH="15" TYPE="Char"/>
                <FIELD NAME="Case Pack id" POSITION="87" LENGTH="15" TYPE="Char"/>
                <FIELD NAME="Product Name" POSITION="101" LENGTH="50" TYPE="Char"/>
            </FIELDS>
        </RECORD>
        <RECORD RECORDNAME="ams PO Download: Header" COPYBOOK="" DELIMITER="<Tab>"
                DESCRIPTION="PO Download: Header" FILESTRUCTURE="Default" STYLE="0" RECORDTYPE="RecordLayout" LIST="N"
            QUOTE="" RecSep="default" TESTFIELD="Record Type" TESTVALUE="H1">
            <FIELDS>
                <FIELD NAME="Record Type"     POSITION="1" LENGTH="2" TYPE="Char"/>
                <FIELD NAME="Sequence Number" POSITION="3" LENGTH="5" DECIMAL="3" TYPE="Num Assumed Decimal (Zero padded)"/>
                <FIELD NAME="Vendor"          POSITION="8" LENGTH="10" TYPE="Num (Right Justified zero padded)"/>
                <FIELD NAME="PO"              POSITION="18" LENGTH="12" TYPE="Num Assumed Decimal (Zero padded)"/>
                <FIELD NAME="Entry Date" DESCRIPTION="Format YYMMDD" POSITION="30" LENGTH="6" TYPE="Char"/>
                <FIELD NAME="Filler"          POSITION="36" LENGTH="8" TYPE="Char"/>
                <FIELD NAME="beg01 code"      POSITION="44" LENGTH="2" TYPE="Char"/>
                <FIELD NAME="beg02 code"      POSITION="46" LENGTH="2" TYPE="Char"/>
                <FIELD NAME="Department"      POSITION="48" LENGTH="4" TYPE="Char"/>
                <FIELD NAME="Expected Reciept Date" DESCRIPTION="Format YYMMDD" POSITION="52" LENGTH="6" TYPE="Char"/>
                <FIELD NAME="Cancel by date" DESCRIPTION="Format YYMMDD" POSITION="58" LENGTH="6" TYPE="Char"/>
                <FIELD NAME="EDI Type"       POSITION="68" LENGTH="1" TYPE="Char"/>
                <FIELD NAME="Add Date" DESCRIPTION="Format YYMMDD" POSITION="69" LENGTH="6" TYPE="Char"/>
                <FIELD NAME="Filler"         POSITION="75" LENGTH="1" TYPE="Char"/>
                <FIELD NAME="Department Name" POSITION="76" LENGTH="10" TYPE="Char"/>
                <FIELD NAME="Prcoess Type" DESCRIPTION="C/N Conveyable/Non-Conveyable" POSITION="86" LENGTH="1" TYPE="Char"/>
                <FIELD NAME="Order Type" POSITION="87" LENGTH="2" TYPE="Char"/>
            </FIELDS>
        </RECORD>
    </RECORDS>
</RECORD>

It is easy to see that for each record type there is a separate RECORD tag. Each RECORD has a FIELDS tag that contains the fields of the record type. For each field there is a separate FIELD tag with NAME, POSITION, LENGTH and TYPE attributes. Finally, each RECORD tag has TESTFIELD and TESTVALUE attributes. These two help Record Editor decide which RECORD to use for parsing a line, by comparing the value of the field named in TESTFIELD with the value of TESTVALUE. When they are equal, the FIELDS of that RECORD are used to parse the line.

Design

Hence the following design:

  1. Print the XML part up to, and including, the opening RECORDS tag. The RECORDNAME attribute of the main RECORD tag should be the name of the protocol
  2. for each CSV file:
    1. add a RECORD tag with the RECORDNAME and DESCRIPTION attributes set to the protocol name followed by the CSV filename
    2. for each field:
      1. add a FIELD tag with the NAME and LENGTH attributes taken from the CSV and the POSITION set to the offset of the field. TYPE will be Char for every field.
  3. close the RECORDS and main RECORD tags.

Implementation

Here’s a BASH implementation for the SOP protocol.

Step 1:

cat <<EOF
<?xml version="1.0" ?>
<RECORD RECORDNAME="SOP" COPYBOOK="" DELIMITER="<Tab>" FILESTRUCTURE="Default" STYLE="0" 
        RECORDTYPE="GroupOfRecords" LIST="Y" QUOTE="" RecSep="default">
    <RECORDS>
EOF

Step 2:

for s in $*; do
    SPEC_NAME=${s/.csv/}
cat <<EOF
        <RECORD RECORDNAME="SOP: $SPEC_NAME" COPYBOOK="" DELIMITER="<Tab>" 
                DESCRIPTION="SOP: $SPEC_NAME" FILESTRUCTURE="Default" STYLE="0" RECORDTYPE="RecordLayout"
            LIST="N" QUOTE="" RecSep="default" TESTFIELD="MessageType" TESTVALUE="$SPEC_NAME">
            <FIELDS>
EOF

    awk '
    BEGIN {
        FS = ","
        f_position = 1
    }
    NR != 1 {
        f_name = $1
        f_length = $2
        printf "\t\t\t\t<FIELD NAME=\"%s\"  POSITION=\"%d\" LENGTH=\"%d\" TYPE=\"Char\"/>\n", f_name, f_position, f_length
        f_position += f_length
    }
    ' $s
cat <<EOF
            </FIELDS>
        </RECORD>
EOF
done

And finally step 3:

cat <<EOF
    </RECORDS>
</RECORD>
EOF

The complete csv2xmlcopybook.sh script can be found here. It includes options for the protocol name and header length.

Usage

To try out csv2xmlcopybook.sh on SOP:

  1. Download reCsvEditor.
  2. git clone https://github.com/prontog/SOP
  3. cd SOP
  4. ./csv2xmlcopybook.sh -H 15 -p "SOP log" *.csv > sop.xml (-H 15 skips the 15-character timestamp/direction prefix seen in the log lines above)
  5. Run reCsvEditor
  6. On the left part of the Open File window, select the sopsrv_2016_12_06.log found in the log directory of the SOP repo. Do not click the Open button yet.
  7. On the right part of the Open File window, select the Fixed Width tab.
  8. Select the sop.xml Copybook created in step 4.
  9. Click the Open button.

Fig 1: The Opened *sopsrv_2016_12_06.log* file

Notice that only a couple of lines are displayed correctly. This is because the SOP log we opened has lines of different formats: one line has an OC message while another has a TR, and for each message type there is a different data layout. You can use the Layouts combobox to select the layout of the line you want to parse.

Fig 2: Changing layout

Clicking the small button on the left of the row will open a detail tab.

Fig 3: A detailed view of a line

It is also easy to filter lines. Click the filter button (with the horizontal arrows) to open the filter window.

Fig 4: Keeping only OC messages (filter)

Finally you can make changes and save them back to the original file!

For further info, have a look at the locally installed help file. You can access it from the Help menu.

Adding verbosity to your tool

Sometimes you will need to know what your tool is currently doing. This is often described as transparency, and many popular tools offer it. cURL and ssh have the -v option, while phantomjs has the --debug option. In any case, the effect is the same: making the program more talkative.

Why

Most likely this will be for troubleshooting: perhaps a command takes forever to complete, or finishes without doing what it was supposed to do. Personally, I often need it to troubleshoot networking programs and complicated scripts that perform actions on many files/directories.

How

The usual way to add verbosity to a CLI program is by adding an option (such as -v or --verbose) and then, as the program executes, printing useful info to the terminal if this option was passed as an argument.

Applications with a GUI can either append info to a log file or open a new window containing a text-area that plays the role of a “live” log.

For the rest of this article I’ll focus on CLI programs, and specifically BASH scripts, since BASH is my first choice whenever I decide to make a new tool.

Note that most tools (especially ones used in pipelines) follow the rule of silence and have verbosity disabled by default.

Design

Hence the following design:

  1. By default, verbosity is disabled. The option -v will enable it.
  2. When this option is used, an environment variable will be set.
  3. If this environment variable is present, the tool will echo details to stderr. Echoing to stderr is necessary for programs that are usually part of a pipeline.

Implementation

Here’s a simple implementation for BASH:

  • Function set_verbosity will set the environment variable for verbosity. This should be called in the case statement handling the CLI options.
  • Function trace will echo to stderr if verbosity is enabled.
# Handle the verbosity option. Use it in the `case` 
# statement handling the program options.
set_verbosity() {
    verbosity=1
}

# Echo all passed arguments to stderr if verbosity is on.
trace() {
    [[ $verbosity -gt 0 ]] && echo $* >&2
}

A complete sample can be found here.

There are also tools that might need multiple levels of verbosity. For an example, look at the -v option of ssh. Here are new versions of set_verbosity and trace, extended to support multiple levels of verbosity:

# Handle the verbosity option. Use it in the `case` 
# statement handling the program options.
set_verbosity() {
    verbosity_level=$((verbosity_level + 1))
}

# Echo all passed arguments to stderr if verbosity is on. 
# If the first parameter is numeric then the message will 
# only be echoed if the verbosity level is >= to it (the 
# message level). Otherwise the message level will be 1.
trace() {
    local msg_level=$(($1 + 0))
    if [[ $msg_level -gt 0 ]]; then
        shift
    else
        msg_level=1
    fi

    verbosity_level=$(($verbosity_level + 0))

    if [[ $verbosity_level -ge $msg_level ]]; then
        echo $* >&2
    fi
}

A complete sample can be found here.

Usage

  1. Copy the set_verbosity and trace functions to your BASH script or, even better, move them to your .bashrc file.
  2. Handle the -v option in the beginning of your script. A common approach is with the getopts BASH built-in:
    # Handle CLI options.
    while getopts "v" option
    do
    case $option in
        v) set_verbosity
        ;;
    esac
    done
    shift $(( $OPTIND - 1 ))
    
  3. Use trace in your script to print useful info:
    trace "This will go to stderr if -v is passed!"
    trace 2 This also if verbosity level is at least 2
    

Finally, you might want to enable verbosity in a program called inside your script. This can be done by setting a variable to the option that enables verbosity in that program. We only need to make a small addition to the case statement:

case $option in
    v) set_verbosity
       curl_verbosity=-v # This is for cURL!
    ;;
esac

Then use this variable when you call the program:

curl $curl_verbosity https://duckduckgo.com
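Putting it together, if the snippets above lived in a hypothetical script called fetch.sh, it could be run like this:

./fetch.sh                # silent, following the rule of silence
./fetch.sh -v             # trace messages go to stderr and cURL is called with -v
./fetch.sh -vv            # level-2 messages (trace 2 ...) are printed as well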

You will find more details concerning transparency in Chapter 6 (Transparency) of The Art of Unix Programming by Eric S. Raymond.

Converting packet data to CSV using TShark

If you have created your own Wireshark dissector, you might want to further analyze your network captures, say to measure performance. Although you could do this using Wireshark (MATE, listeners, statistics), it can be complicated and inflexible compared to a statistics-friendly environment (R, Octave, PSPP, iPython etc.). Furthermore, if you already use another tool to analyze your application’s log files, it’s easier to extract the packet info to a text format and use your own toolset. In my case the most suitable text format is CSV, since I can easily load it into R and use the same functions I use to analyze my application’s logs.

Why (create a tool)

TShark has options (-T, -E and -e) for printing packet fields in a delimited text format, but there’s a catch: if your application batches messages, TShark will export all messages of a packet to a single line. What you will most likely need is one message per line.

Let’s see an example using a capture file containing packets of the SOP protocol:

tshark -Y sop -Tfields -e frame.number -e _ws.col.Time \
  -e sop.msgtype -E separator=',' -E aggregator=';'    \
  -E header=y -r sop.pcapng

This command prints the fields frame.number, _ws.col.Time and sop.msgtype (selected with -e, output as delimited text via -T fields) from the sop packets (-Y), using a comma as the field separator and a semicolon as the aggregator. Here’s the output:

frame.number _ws.col.Time sop.msgtype
1 14:15:15.603164 BO
3 14:15:16.608027 NO;OC;NO;RJ;NO;OC;TR;TR
5 14:15:17.612279 EN
7 14:15:18.617432 NO
9 14:15:19.622369 OC
11 14:15:20.627967 RJ
13 14:15:21.632463 TR
15 14:15:30.903667 BO
17 14:15:31.103467 NO;OC;NO;RJ;NO;OC;TR;TR;EN;NO;OC;RJ;TR

As you can see, there are two frames (packets) containing multiple SOP messages. Notice that TShark outputs a single line per frame, with fields that appear multiple times (in our case, once per message in the frame) joined by the character set with the aggregator option.

Furthermore, watch what happens when you export a field (sop.clientid) not present in every message type of the protocol.

tshark -Y sop -Tfields -e frame.number -e _ws.col.Time \
  -e sop.msgtype -e sop.clientid -E separator=','      \
  -E aggregator=';' -E header=y -r sop.pcapng

Here’s the output, with the aggregated fields sop.msgtype and sop.clientid slightly reformatted so that each aggregated value is on its own line:

frame.number _ws.col.Time sop.msgtype sop.clientid
1 14:15:15.603164 BO
3 14:15:16.608027 NO;
OC;
NO;
RJ;
NO;
OC;
TR;
TR
SomeClientId;
SomeClientId;
AnotherClientId;
AnotherClientId;
AnotherClientId;
AnotherClientId
5 14:15:17.612279 EN
7 14:15:18.617432 NO SomeClientId
9 14:15:19.622369 OC SomeClientId
11 14:15:20.627967 RJ AnotherClientId
13 14:15:21.632463 TR
15 14:15:30.903667 BO
17 14:15:31.103467 NO;
OC;
NO;
RJ;
NO;
OC;
TR;
TR;
EN;
NO;
OC;
RJ;
TR
SomeClientId;
SomeClientId;
AnotherClientId;
AnotherClientId;
AnotherClientId;
AnotherClientId;
SomeClientId;
SomeClientId;
AnotherClientId

You can see that in the frames containing multiple messages, the number of sop.msgtype values and the number of sop.clientid values are not equal. For example, frame 3 has 8 msgtypes but only 6 clientids, which you can verify by counting the semicolons (the aggregator) and adding 1. This means that TShark does not add empty (or NA) values for missing fields. In frame 17 it’s even more misleading: it looks like the messages with the “TR” msgtype have a clientid value, which is not true; the last three clientids are extracted from the “NO”, “OC” and “RJ” messages that follow.
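To illustrate the counting, here is the check for frame 3 done with awk on the two aggregated columns shown above (NF counts the values directly, i.e. the semicolons plus one):

echo 'NO;OC;NO;RJ;NO;OC;TR;TR' | awk -F';' '{ print NF }'
# output: 8
echo 'SomeClientId;SomeClientId;AnotherClientId;AnotherClientId;AnotherClientId;AnotherClientId' | awk -F';' '{ print NF }'
# output: 6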

How

A solution is a BASH script that calls TShark and then awk to split the frames with multiple messages into separate lines. The script can take one or more capture files as arguments and outputs a single line per message.

Design

Hence the following design:

  1. Print the header. This must be done once since the script accepts many files as arguments.
  2. for each capture file:
    1. run TShark to output fields in CSV format
    2. run awk to split frames with multiple messages into separate lines. Filter out any messages that don’t contain every exported field. This solves the problem of aggregated fields of unequal length. Of course, this last part is not necessary if the exported fields are common to every message of your protocol.

Implementation

Part 1 is a simple echo.

echo frame,dateTime,msgType,clientId

Part 2.1 is also straightforward. Note that if an argument is not a file, a warning is echoed to stderr (the echo to /dev/stderr). This way stdout is not polluted by warnings, and only an actual error will stop the script (set -o errexit).

until [[ -z $1 ]]
do
    if [[ ! -f $1 ]]; then
        echo $1 is not a file > /dev/stderr
    fi

    CAP_FILE=$1

    set -o errexit

    tshark -Y sop -Tfields -e frame.number -e _ws.col.Time \
        -e sop.msgtype -e sop.clientid -E separator=','    \
        -E aggregator=';' -E header=y -r $CAP_FILE | awk '
        # Part 2.2. See below for a detailed description.'

    shift
done

Now let’s focus on the AWK script of part 2.2. This is also straightforward if the exported fields are present in every message of your protocol:

BEGIN {
    # Input and output should be CSV.
    FS = ","
    OFS = ","
}

{
    frame = $1
    dateTime = $2
    # Message types are split into an array using the
    # aggregator delimiter from part 2.1.
    split($3, msgTypes, ";")

    # Print a separate line for each message type in 
    # the packet (frame).
    for(i in msgTypes) {
        print frame, dateTime, msgTypes[i]
    }
}

Nothing special here: the msgType column is split using TShark’s aggregator delimiter, and a separate line is printed for each msgType. Note that for this simple case the 4th column from part 2.1 is ignored.

Let’s move to the more complicated case where the exported fields are not present in every message:

BEGIN { 
    FS = ","
    OFS = ","
    # These are the msg types that contain the clientId
    # field. All other message types will be discarded.
    msgTypesToPrint = "NO,OC,RJ"
}

{
    frame = $1
    dateTime = $2
    # Message types and clientIds are split into arrays.
    split($3, msgTypes, ";")
    split($4, clientIds, ";")

    # Copy the messages types that are included in 
    # msgTypesToPrint to array filteredMsgTypes.
    fi = 0
    for(i in msgTypes) {
        if (match(msgTypesToPrint, msgTypes[i])) {
            fi++
            filteredMsgTypes[fi] = msgTypes[i]          
        }
    }

    # Skip the line if there were no messages to print.
    if (fi == 0) {
        next
    }

    # filteredMsgTypes should have the same length
    # as clientIds.
    if (length(filteredMsgTypes) != length(clientIds)) {
        printf("Skipping frame %d because of missing fields (%d, %d).", 
                frame, 
                length(filteredMsgTypes), 
                length(clientIds)) > "/dev/stderr"
        next
    }

    for(i in filteredMsgTypes) {
        print frame, dateTime, 
              filteredMsgTypes[i], clientIds[i]
    }

    # Clean up array filteredMsgTypes before moving to 
    # the next line.
    delete filteredMsgTypes
}

As you can see, the string msgTypesToPrint includes the names of the message types that contain every exported field. The rest of the messages are filtered out and the new array filteredMsgTypes holds the messages to output. If the number of remaining message types is not equal to the number of clientIds (a field not present in every message), the frame is skipped entirely and a warning is printed to stderr. Finally, the array filteredMsgTypes is deleted before moving to the next line. This is necessary since it’s a global variable.

That’s about it. You can examine the whole script cap2sop.sh here.

Usage

To extract SOP capture data from a single file:

./cap2sop.sh sop.pcapng

Here’s the final output.

frame dateTime msgType clientId
3 14:15:16.608027 NO SomeClientId
3 14:15:16.608027 OC SomeClientId
3 14:15:16.608027 NO AnotherClientId
3 14:15:16.608027 RJ AnotherClientId
3 14:15:16.608027 NO AnotherClientId
3 14:15:16.608027 OC AnotherClientId
7 14:15:18.617432 NO SomeClientId
9 14:15:19.622369 OC SomeClientId
11 14:15:20.627967 RJ AnotherClientId
17 14:15:31.103467 NO SomeClientId
17 14:15:31.103467 OC SomeClientId
17 14:15:31.103467 NO AnotherClientId
17 14:15:31.103467 RJ AnotherClientId
17 14:15:31.103467 NO AnotherClientId
17 14:15:31.103467 OC AnotherClientId
17 14:15:31.103467 NO SomeClientId
17 14:15:31.103467 OC SomeClientId
17 14:15:31.103467 RJ AnotherClientId

To extract SOP capture data from multiple files:

./cap2sop.sh *.pcapng

Using pandoc and make to extract specs from a Word document

Pandoc is a fantastic document-conversion tool that supports most common markup formats. I first learned about it when I started using R Markdown but didn’t use it directly until I read ‘From Word to Markdown to InDesign’ by Dr. Wouter Soudan.

Why (create a tool)

As I’ve mentioned in earlier posts, in the company where I work we maintain several text protocols for inter-process communication. The specification of each protocol is described in a Word document, from which we create a PDF file every time we make a new release. Each message type is described in a separate section in table format. For an example, have a look at the specification for an imaginary protocol named SOP. Notice that for each type of message there is a table such as:

Field Length Type Description
msgType 2 STRING Defines message type ALWAYS FIRST FIELD IN MESSAGE PAYLOAD.
clientId 16 STRING Unique identifier for Order as assigned by the client.
rejectionCode 3 NUMERIC Rejection Code
text 48 STRING Text explaining the rejection

About my CSV addiction

As you can see from my post on creating Wireshark dissectors, I use CSV files to create dissectors for our protocols. I also use them to load log files into R and analyze them. The reason I use CSV files so extensively is that they allow me to separate the protocol specs from the tools I use. This way I can create a CSV-aware tool for fixed-width text protocols and then use it with any protocol of this type with no changes (hopefully).

Perhaps you’ve already noticed the problem in my workflow.

Managing the CSV specs

The specs are stored in Word documents, which means that I have to manually extract them into CSV files. Here’s my initial workflow:

  1. Copy a table (describing a message type) from the Word document.
  2. Paste it into a new Sheet in an Excel file. Name the Sheet after the message type.
  3. Use in2csv (part of csvkit) to extract each Sheet to a CSV file.

Then I had to make sure these steps were repeated every time a Word spec was edited. Not nice. And prone to errors.

How

The solution is to use Pandoc to convert the Word document to another format, one that will allow me to create the CSV files using common text manipulation tools (grep, sed, awk etc.). The most suitable format I can think of is a Markdown variant that can handle tables, such as markdown_github, GitHub-Flavored Markdown.

Design

The design is simple:

For each message type:

  1. Extract the spec table in Markdown format.
  2. Save it to a file using the message type for a name.
  3. Transform the Markdown table into CSV format.

Implementation

After many experiments I ended up with the following command to convert a Word document, such as sop.docx, to Markdown:

#         1      2                           3                       4
pandoc --smart --filter ./despan.py --to markdown_github sop.docx | iconv -f utf8 -t ascii//TRANSLIT > sop.md

Let’s break it down:

  1. The smart option will produce typographically correct output.
  2. The filter option allows us to use a Python filter to remove all span elements that pandoc occasionally produces. See issue #1893 in the Pandoc repo. This step requires the pandocfilters Python module.
  3. The to option is the most important, specifying the markup of the transformed output. As I mentioned earlier, markdown_github supports tables and is a perfect choice.
  4. Convert characters from UTF8 to ASCII with transliteration, so that a character that can’t be represented in the target character set is approximated by one or more similar-looking characters. You can omit this if you want, but in my experience it’s the safest choice since some of the tools I use with the CSV files prefer ASCII characters.

By examining sop.md we can see that each message table is preceded by a header with the format: “### MT ” where MT is the message type. Hence we can extract each spec table with the following awk script:

/### / {
    header = $0
    match(header, /^### ([A-Z]{2}) /, results)
    messageType = results[1]
    if (messageType) {
        spec_file = sprintf("%s.mdtable", messageType)
        print "" > spec_file
    }
}
# Print the message table into a different file.
/^\| /,/^$/{
    if (messageType) {
        print >> spec_file
    }

    if ($0 ~ /^$/) {
        messageType = 0
    }
}

The script extracts the message type using the regex /^### ([A-Z]{2}) / on each line containing the pattern ‘### ’ and stores it in the variable messageType. It then creates the variable spec_file with the format MT.mdtable, where MT is the message type. Finally, it prints all lines starting with ‘| ’ (/^\| /) to spec_file and stops at the first blank line (/^$/).

In our example, the script will create 7 new files with extension mdtable.
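For illustration, RJ.mdtable would then contain something roughly like this (the exact alignment and bold markers depend on the Word document):

| **Field**     | **Length** | **Type** | **Description**               |
|---------------|------------|----------|-------------------------------|
| msgType       | 2          | STRING   | Defines message type          |
| clientId      | 16         | STRING   | Unique identifier for Order   |
| rejectionCode | 3          | NUMERIC  | Rejection Code                |
| text          | 48         | STRING   | Text explaining the rejection |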

Then we need to transform the mdtable files to CSV format. The following sed script does the job:

# Delete ** from the first line.
s/\*//g
# Delete lines that start with space. These
# are multirow cells from Remarks column.
/^[[:space:]]/d
# Delete rows with |---.
/^|---/d
# Remove first | and trim.
s/^|[[:space:]]*//
# Remove final | and trim.
s/[[:space:]]*|[[:space:]]*$//
# Trim middle |.
s/[[:space:]]*|[[:space:]]*/|/g
# Delete empty rows.
/^$/d
# Replace | separator with ,
s/|/,/g

This might seem like overkill, but after using Pandoc with 4 different Word documents, all with different formatting, I ended up needing all of these replacements and deletions.
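For illustration, assuming the substitutions above are saved in a sed script file (hypothetically named mdtable_to_csv.sed here), a single table row is converted like this:

echo '| msgType       | 2          | STRING   | Defines message type          |' | sed -f mdtable_to_csv.sed
# output: msgType,2,STRING,Defines message type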

Adding make to the mix

As a task runner I used GNU make, which is ideal for such cases since it works with file time-stamps and allows for multiple transformation steps. Here’s the final Makefile:

sop_types := NO OC TR RJ EN BO
sop_mdtables := $(foreach t, $(sop_types), $(t).mdtable)
sop_specs := $(foreach t, $(sop_types), $(t).csv)
sop_md := sop.md

sop: $(sop_specs)
    touch $@

%.csv: %.mdtable
    ./mdtable_to_csv.sh $?

$(sop_mdtables): $(sop_md)
    # Extract the md tables for each message type.
    ./sop_split_to_mdtable.sh $?

%.md: %.docx
    # Convert documentation from docx format to md.
    pandoc --smart --filter ./despan.py --to markdown_github $? | iconv -f utf8 -t ascii//TRANSLIT > $@

# Clean up rules.   

clean: clean_csv clean_mdtable clean_md 
    rm sop

clean_csv:
    rm $(sop_specs)

clean_mdtable: 
    rm $(sop_mdtables)

clean_md: 
    rm $(sop_md)

Note that:

  1. Rule %.md: %.docx is for conversion from docx to md.
  2. Rule $(sop_mdtables): $(sop_md) is for extracting the mdtable files from a single md. This is done by BASH script sop_split_to_mdtable.sh.
  3. Rule %.csv: %.mdtable is for converting from mdtable to csv. This is done by BASH script mdtable_to_csv.sh.
  4. Rule sop: $(sop_specs) is the final rule that simply touches the dummy file sop.
  5. The rest of the rules are for cleaning up.

Trying it yourself

  1. Install Pandoc.
  2. Install pandocfilters
  3. git clone https://github.com/prontog/SOP
  4. cd SOP/specs
  5. make clean
  6. make

A simpler way to create Wireshark dissectors in Lua

Wireshark is an amazing tool. It is open source, works on most major platforms, has powerful capture/display filters, has strong developer and user communities, and even has an annual conference. At the company where I work, coworkers use it daily to analyze packets and troubleshoot our network. Personally, I don’t use Wireshark daily, but when I need to troubleshoot the communication between our programs it becomes a valuable tool to have.

Why (create a tool)

As I mentioned before, Wireshark has filtering capabilities, which you can use to search for distinctive parts of your messages. For example, you can use tcp.port == 9001 to get the communication on port 9001 (source or destination). This type of filtering works because a TCP dissector is installed with Wireshark. In the Protocols section of the Preferences dialog you will find all the available dissectors.

If you want to filter messages of a protocol with no dissector, you can use the frame object. For example, to look for messages containing the string “EVIL” you can use frame contains "EVIL". To be exact, this filter will return all frames containing the string, not the actual messages. If, for example, each frame has 10 messages, then good luck finding them. As you can imagine, this can become tiresome and sometimes give you headaches.

How

A solution is to create a custom dissector for your protocol, and there are many ways to do so: using the C API, the Lua API, from a CORBA IDL file, or using the Generic Dissector. The last two have their own format for the protocol specification, something I wanted to avoid since I already had the specs in CSV format. This left the two APIs, both of which are very flexible and would allow me to use the CSV specs. In the end I chose Lua, as an excuse to try out the language.

Preparation

After going through an introduction to Lua, I searched for a function/module to read a CSV file. The standard libraries do not include such a function, so I chose Geoff Leyland’s lua-csv, which had all the features I needed (and more). The next step was finding and reading tutorials and examples on Lua dissectors. Here’s a list of the ones that helped me most:

  • The Lua/Dissectors from Wireshark’s Wiki. Apart from an example, it includes links to pages describing the most useful objects and a section describing TCP reassembly.
  • Lua Scripting in Wireshark by Stig Bjorlykke. A presentation covering not only the basics but also introducing protocol preferences, post-dissectors and listeners.
  • The Athena dissector by FlavioJS and the Athena Dev Teams. A complete implementation of a dissector that was a great influence.

As I mentioned earlier, in the company where I work we maintain many text protocols. Each protocol includes many message types with different formats, described in CSV files with columns for name, length, type and description.

Design

Hence the following design:

  1. For each message type, read its CSV spec and create field objects and a message parser
  2. Create a dissector function that locates the message type, reassembles it if needed, and calls the appropriate parser.

Before moving to the implementation, I decided that the first part could be encapsulated in a common module to be used by all dissectors.

Implementation

The first dissector I created was for a fixed width text protocol with fields of fixed size and type STRING.

Then I worked on another text protocol which included repeating groups. These are groups of fields that are repeated N times, where N is specified by another field in the message.

For example, imagine a message describing a contact. The contact can have many phone numbers, each described by a name and number:

Field Length Type Description
First name 20 STRING First name of the contact
Last name 40 STRING Last name of the contact
Number of phones 2 NUMERIC The number of phones in the message
Phone Number of phones REPEATING
PhoneName 16 STRING The phone name (Home, Work, Mobile1, Mobile2 etc.)
PhoneNumber 16 STRING The phone number

As we can see:

  1. The ‘Phone’ field is of type REPEATING. This signifies a repeating group.
  2. The Length of the ‘Phone’ field is not a number, but it references another field in the message.
  3. The end of the repeating group is implicitly the last field of the message. Otherwise we need to add a “fake” field with type REPEATING-END. Then the group will contain all fields between the ones with type REPEATING and REPEATING-END.
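As an illustration, the contact spec above could be captured in a CSV file along these lines (the exact conventions expected by ws_dissector_helper may differ):

Field,Length,Type,Description
First name,20,STRING,First name of the contact
Last name,40,STRING,Last name of the contact
Number of phones,2,NUMERIC,The number of phones in the message
Phone,Number of phones,REPEATING,Start of the repeating phone group
PhoneName,16,STRING,The phone name
PhoneNumber,16,STRING,The phone number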

The third dissector was for a protocol containing strings of variable length, so I added the type VARLEN. Fields of this type have to reference another field that specifies their length, the same way a repeating group references the number of repeats:

Field Length Type Description
Last name length 2 NUMERIC The length of the Last Name field.
Last name Last name length VARLEN Last name of the contact
First name 20 STRING First name of the contact

Here we see that:

  1. The ‘Last name’ field is of type VARLEN.
  2. The Length of the ‘Last name’ field is not a number, but it references another field in the message.
  3. Other fields can follow a ‘VARLEN’ field.

Naturally, implementing dissectors for three protocols helped me locate more parts that could be moved to the common module, which I named ws_dissector_helper.

Did you know: That Wireshark can be used in the command line with the TShark utility?

Usage

The source code is available on GitHub. You can clone the repo or download it as a zip file.

Here’s an example of how ws_dissector_helper can help create a Wireshark dissector for an imaginary protocol called SOP (Simple Order Protocol):

Create a Lua script for our new dissector. Let’s name it sop.lua, after the SOP protocol we will be dissecting.

Add the following lines at the end of Wireshark’s init.lua script:

WSDH_SCRIPT_PATH='path to the directory src of the repo'
SOP_SPECS_PATH='path to the directory of the CSV specs'
dofile('path to sop.lua')

Then in the sop.lua file:

Create a Proto object for your dissector. The Proto class is part of Wireshark’s Lua API.

sop = Proto('SOP', 'Simple Order Protocol')

Load the ws_dissector_helper script. We will use the wsdh object to access various helper functions.

local wsdh = dofile(WSDH_SCRIPT_PATH..'ws_dissector_helper.lua')

Create the proto helper. Note that we pass the Proto object to the createProtoHelper factory function.

local helper = wsdh.createProtoHelper(sop)

Create a table with the values for the default settings. The values can be changed from the Protocols section of Wireshark’s Preferences dialog.

local defaultSettings = {
    ports = '9001-9010',
    trace = true
}
helper:setDefaultPreference(defaultSettings)

Define the protocol’s message types. Each message type has a name and file property. The file property is the filename of the CSV file that contains the specification of the fields for the message type. Note that the CSV files should be located in SOP_SPECS_PATH.

local msg_types = { { name = 'NO', file = 'NO.csv' }, 
                    { name = 'OC', file = 'OC.csv' },
                    { name = 'TR', file = 'TR.csv' },
                    { name = 'RJ', file = 'RJ.csv' } }

Define fields for the header and trailer. If your CSV files contain all the message fields then there is no need to manually create fields for the header and trailer. In our example, the CSV files contain the specification of the payload of the message.

local SopFields = {
    SOH = wsdh.Field.FIXED(1,'sop.header.SOH', 'SOH', '\x01','Start of Header'),
    LEN = wsdh.Field.NUMERIC(3,'sop.header.LEN', 'LEN','Length of the payload (i.e. no header/trailer)'), 
    ETX = wsdh.Field.FIXED(1, 'sop.trailer.ETX', 'ETX', '\x03','End of Message')
}

Then define the Header and Trailer objects. Note that these objects are actually composite fields.

local header = wsdh.Field.COMPOSITE{
    title = 'Header',
    SopFields.SOH,
    SopFields.LEN
}

local trailer = wsdh.Field.COMPOSITE{
    title = 'Trailer',    
    SopFields.ETX
}

Now let’s load the specs using the loadSpecs function of the helper object. The parameters of this function are:

  1. msgTypes this is a table of message types. Each type has two properties: name and file.
  2. dir the directory where the CSV files are located
  3. columns is a table with the mapping of columns:
    • name is the name of the field name column.
    • length is the name of the field length column.
    • type is the name of the field type column. Optional. Defaults to STRING.
    • desc is the name of the field description column. Optional.
  4. offset the starting value for the offset column. Optional. Defaults to 0.
  5. sep the separator used in the CSV file. Optional. Defaults to ','.
  6. header a composite or fixed length field to be added before the fields found in spec.
  7. trailer a composite or fixed length field to be added after the fields found in spec.

The function returns two tables: one containing the message specs and another containing parsers for them. Each message spec has an id, a description and all the fields created from the CSV, in a similar fashion to the way we created SopFields above. Each message parser is specialized for a specific message type and includes the boilerplate code needed to handle the parsing of a message.

-- Column mapping. As described above.
local columns = { name = 'Field', 
                  length = 'Length', 
                  type = 'Type',
                  desc = 'Description' }

local msg_specs, msg_parsers = helper:loadSpecs(msg_types,
                                                SOP_SPECS_PATH,
                                                columns,
                                                header:len(),
                                                ',',
                                                header,
                                                trailer)

Now let’s create a few helper functions that will simplify the main parse function.

-- Returns the length of the message from the end of header up to the start 
-- of trailer.
local function getMsgDataLen(msgBuffer)
    return helper:getHeaderValue(msgBuffer, SopFields.LEN)
end

-- Returns the length of the whole message. Includes header and trailer.
local function getMsgLen(msgBuffer)
    return header:len() + 
           getMsgDataLen(msgBuffer) + 
           trailer:len()
end

One of the last steps, and definitely the most complicated, is to create the function that validates a message, parses it using one of the automatically generated message parsers and finally populates the tree in the Packet Details pane.

local function parseMessage(buffer, pinfo, tree)
    -- The minimum buffer length that can be used to identify a message
    -- must include the header and the MessageType.
    local msgTypeLen = 2
    local minBufferLen = header:len() + msgTypeLen
    -- Messages start with SOH.

    if SopFields.SOH:value(buffer) ~= SopFields.SOH.fixedValue then
        helper:trace('Frame: ' .. pinfo.number .. ' No SOH.')
        return 0
    end 

    -- Return missing message length in the case when the header is split 
    -- between packets. 
    if buffer:len() <= minBufferLen then
        return -DESEGMENT_ONE_MORE_SEGMENT
    end

    -- Look for valid message types.
    local msgType = buffer(header:len(), msgTypeLen):string()
    local msgSpec = msg_specs[msgType]
    if not msgSpec then
        helper:trace('Frame: ' .. pinfo.number .. 
                     ' Unknown message type: ' .. msgType)
        return 0
    end

    -- Return missing message length in the case when the data is split 
    -- between packets.
    local msgLen = getMsgLen(buffer)
    local msgDataLen = getMsgDataLen(buffer)
    if buffer:len() < msgLen then
        helper:trace('Frame: ' .. pinfo.number .. ' buffer:len < msgLen')
        return -DESEGMENT_ONE_MORE_SEGMENT
    end

    local msgParse = msg_parsers[msgType]
    -- If no parser is found for this type of message, reject the whole 
    -- packet.
    if not msgParse then
        helper:trace('Frame: ' .. pinfo.number .. 
                     ' Not supported message type: ' .. msgType)
        return 0
    end

    local bytesConsumed, subtree = msgParse(buffer, pinfo, tree, 0)
    -- Parsing might fail if a field validation fails. For example the
    -- validation of a field of type Field.FIXED.
    if bytesConsumed ~= 0 then
        subtree:append_text(', Type: ' .. msgType)    
        subtree:append_text(', Len: ' .. msgLen)

        pinfo.cols.protocol = sop.name  
    else
        helper:trace('Frame: ' .. pinfo.number .. ' Parsing did not complete.')
    end

    return bytesConsumed
end

Now that the parse function for the SOP protocol is ready, we need to create the dissector function using the getDissector helper function which returns a dissector function containing the basic while loop that pretty much all dissectors need to have.

sop.dissector = helper:getDissector(parseMessage)

Finally, enable the dissector. enableDissector registers the ports with the TCP dissector table.

helper:enableDissector()

Testing your dissector

What I usually do to test my dissector is to create a text file with many messages and do the following:

  1. Start a server with nc -l 9001
  2. Start tshark with a display filter on the protocol name: tshark -Y 'sop'. Note that sometimes this approach might hide some Lua errors; in that case, repeat the test using Wireshark instead of tshark.
  3. Connect with a client and send one or more messages from a file: cat messages.txt | nc SERVER_IP 9001
  4. If lines appear in the filtered tshark output then the test was successful.

When you finish testing, you can save the captured frames to a file for future tests.
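For example, something along these lines would keep the test traffic for later runs (the interface name and output filename are assumptions; the port matches the nc example above):

# Capture the test session to a file...
tshark -i eth0 -f 'tcp port 9001' -w sop_test.pcapng
# ...and later re-run the dissector against it:
tshark -Y sop -r sop_test.pcapng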

Installing your dissector

Add the following lines at the end of Wireshark's init.lua script:

WSDH_SCRIPT_PATH='path to the directory src of the repo'
SOP_SPECS_PATH='path to the directory of the CSV specs'
dofile('path to your dissector file')

Reading a file with lines of different fixed width formats

During the last couple of years, I’ve been relearning statistics through various online courses, most of which had assignments in the R programming language. Soon enough I grew fond of the simplicity of the language, its powerful interpreter, its documentation, and its base library with many useful functions for importing, analyzing, plotting and exporting data.

Why (create a tool)

As I was exploring the import functions (most of them are named read.XXX) I came across read.fwf, which reads files with lines of fixed width format. You just pass it a filename and two vectors, one with the names of the columns and another with the width (in characters) of each column. It returns a data.frame with each line split into columns. Nice! In the company where I work we maintain several text protocols for inter-process communication, and one of them is fixed width. Yoo-hoo, I could read a log file into R and fool around with it. But wait, I forgot, these logs are more complicated. Each line is fixed width, but the protocol includes different message types, each with different formatting. This means that the log file has lines of different fixed width formats and read.fwf cannot handle such a file. What I needed was a function capable of handling multiple fixed width formats in a single file. Searching R’s base package, CRAN, GitHub and Google didn’t turn up anything useful.

How

I decided to create it myself and to build it around read.fwf, since the base package developers have done such a good job with it.

Design

After some thought, I arrived at the following design:

  1. Split the file according to message type.
  2. Read each new file with read.fwf.

So instead of passing the name and width vectors, I would pass a list of name and width vector pairs, with each pair specifying a message type. On success, the function returns a list of data.frames. Here’s the signature of the function:

read.multi.fwf(file, multi.specs, select, header = FALSE, sep = "", skip = 0, n = -1, buffersize = 2000, ...)

Before starting with the implementation, I studied read.fwf thoroughly and decided to follow its basic structure and argument handling.

Implementation

The first implementation (v0.1) was buggy because of an optimization in the handling of the temp files (from step 1 of the design). To write to these files I use the cat function for each line read from the whole file. Instead of calling cat with the temp filename as the file parameter, I used connection objects (think of them as file descriptors). This meant that I had to use the open function to open each temp file, use the connection and close it (using the close function) before returning. This might sound simple but in reality it wasn’t. Although the function returned the correct result, some files were left open (and were eventually closed by R) while others were closed multiple times (throwing exceptions). After trying to debug this for a while, I gave up and used cat with filenames instead of file handles, which solved the problem. In hindsight, this was a classic case of premature optimization. On the downside, I measured a performance drop of around 20%, which, to be honest, is fine by me.

Did you know: That you can see the source of an R function simply by typing its name on the R console?

Usage

The function is available on CRAN in the multifwf package, and the source code is on GitHub.

Here’s an example on how to use read.multi.fwf:

library(multifwf)

# Create a temp file with a few lines from a SOP (Simple Order Protocol,
# an imaginary protocol) log file.
ff <- tempfile()
cat(file = ff,
'10:15:03:279NOSLMT0000666    EVILCORP00010.77SomeClientId    SomeAccountId   ',
'10:15:03:793OC000001BLMT0000666    EVILCORP00010.77SomeClientId    SomeAccountId   ',
'10:17:45:153NOBLMT0000666    EVILCORP00001.10AnotherClientId AnotherAccountId',
'10:17:45:487RJAnotherClientId 004price out of range ',
'10:18:28:045NOBLMT0000666    EVILCORP00011.00AnotherClientId AnotherAccountId',
'10:18:28:472OC000002BLMT0000666    EVILCORP00011.00AnotherClientId AnotherAccountId',
'10:18:28:642TR0000010000010000666    EVILCORP00010.77',
'10:18:28:687TR0000010000020000666    EVILCORP00010.77',
sep = '\n')

# Create a list of specs. Each item contains the specification for
# each message type of this simple protocol.
specs <- list()
specs[['newOrder']] <- data.frame(widths = c(12, 2, 1, 3, 7, 12, 8, 16, 16),
                                  col.names = c('timestamp', 'msgType', 'side', 'type', 'volume',
                                                'symbol', 'price', 'clientId', 'accountId'))
specs[['orderConf']] <- data.frame(widths = c(12, 2, 6, 1, 3, 7, 12, 8, 16, 16),
                                   col.names = c('timestamp', 'msgType', 'orderId', 'side', 'type',
                                                 'volume', 'symbol', 'price', 'clientId', 'accountId'))

specs[['rejection']] <- data.frame(widths = c(12, 2, 16, 3, 48),
                                   col.names = c('timestamp', 'msgType', 'clientId',
                                                 'rejectionCode', 'text'))

specs[['trade']] <- data.frame(widths = c(12, 2, 6, 6, 7, 12, 8),
                               col.names = c('timestamp', 'msgType', 'tradeId', 'orderId',
                                             'volume', 'symbol', 'price'))

# The selector function is responsible for identifying the message type
# of a line.
myselector <- function(line, specs) {
    # The message type is the 2-character field right after the
    # 12-character timestamp.
    switch(substr(line, 13, 14),
           'NO' = 'newOrder',
           'OC' = 'orderConf',
           'RJ' = 'rejection',
           'TR' = 'trade')
}

s <- read.multi.fwf(ff, multi.specs = specs, select = myselector)
s
#> $newOrder
#> timestamp msgType side type volume symbol price
#> 1 10:15:03:279 NO S LMT 666 EVILCORP 10.77
#> 2 10:17:45:153 NO B LMT 666 EVILCORP 1.10
#> 3 10:18:28:045 NO B LMT 666 EVILCORP 11.00
#> clientId accountId
#> 1 SomeClientId SomeAccountId
#> 2 AnotherClientId AnotherAccountId
#> 3 AnotherClientId AnotherAccountId
#>
#> $orderConf
#> timestamp msgType orderId side type volume symbol price
#> 1 10:15:03:793 OC 1 B LMT 666 EVILCORP 10.77
#> 2 10:18:28:472 OC 2 B LMT 666 EVILCORP 11.00
#> clientId accountId
#> 1 SomeClientId SomeAccountId
#> 2 AnotherClientId AnotherAccountId
#>
#> $rejection
#> timestamp msgType clientId rejectionCode
#> 1 10:17:45:487 RJ AnotherClientId 4
#> text
#> 1 price out of range
#>
#> $trade
#> timestamp msgType tradeId orderId volume symbol price
#> 1 10:18:28:642 TR 1 1 666 EVILCORP 10.77
#> 2 10:18:28:687 TR 1 2 666 EVILCORP 10.77

unlink(ff)