Sunday, December 8, 2013

Advanced Analysis of Modbus Traffic

It can be said that a proper network trace (from, say, Wireshark or TCPDump) probably has the answer to the network problem we are facing.  Of course this is not always true, but at the very least it more than likely cuts the problem in half.  With a proper understanding of the network protocol, we can usually figure out which end is not living up to its requirements.

By way of example, let's have a look at a trace with the following features, as captured from a mirror port at the logical 'master' end, also known as the ModbusTCP client.

The stats are (from Statistics->Summary option in Wireshark):


So we have more than a quarter of a million packets, taken over 11 minutes.  95% of them are ModbusTCP packets, so for an analysis of timing, we have quite a way to go if we want to determine the response time of every query, as well as the time between queries.  This is possible by hand, but it could take quite some time.  As an example, to find the response time for a specific query, I will use 'Follow TCP Stream' in Wireshark: the default display lists the packets in chronological order, but I need to find a particular query and its matching response.  The default view:


We can see many IPs in the source and destination fields.  Since a particular query and its response come from the same IP address pair, and most often from the same TCP connection, I can right-click on a frame in the window and select Follow TCP Stream:


Once we follow the TCP stream, we see only packets for this TCP connection (that is, a single IPaddress,Port <-> IPaddress,Port pair).  For this case we have:


Now, in general, the queries and responses are directly related and we can use Wireshark to find some timing information for us.  Choosing a query - recall this is just a sample - I select frame 39 and right click to select Set Time Reference:


Doing this, we can directly find the response time and have Wireshark calculate it for us:


We can see here (helped by the fact that Transaction ID is in use, though this is optional per the ModbusTCP spec so it could always be zero) that the Modbus Response came out 0.038635sec after the query.  This would be the response time, and is then about 39ms.  Great - for sure, Wireshark gave us useful information.  But here's the catch.  I want to use the timings to see the big picture.  We have 150200 total Modbus transactions (a transaction being a query/response pair, which provides for a single response time measurement) across many different hosts (recall we saw different IP addresses in use before filtering to a single TCP connection).
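The same pairing can be sketched in a few lines of Python.  This is only an illustration of the idea, not the tool itself: it assumes the segments have already been extracted from the trace into simple records (the `Segment` tuple, the IP addresses and the response frame number below are made up), that the server listens on the standard Modbus port 502, and that the Transaction ID in the 7-byte MBAP header actually varies between queries.

```python
import struct
from collections import namedtuple

# Hypothetical record for one captured ModbusTCP segment (illustrative names).
Segment = namedtuple("Segment", "frame time src sport dst dport payload")

MODBUS_PORT = 502  # standard ModbusTCP server port


def mbap_transaction_id(payload):
    """Return the Transaction ID from the 7-byte MBAP header (big endian)."""
    trans_id, proto_id, length, unit_id = struct.unpack(">HHHB", payload[:7])
    return trans_id


def response_times(segments):
    """Pair queries with responses; return (query_frame, response_frame, seconds)."""
    pending = {}   # (client, client_port, server, server_port, trans_id) -> query
    results = []
    for seg in sorted(segments, key=lambda s: s.time):
        if len(seg.payload) < 7:
            continue                      # no complete MBAP header, skip
        tid = mbap_transaction_id(seg.payload)
        if seg.dport == MODBUS_PORT:      # client -> server: a query
            pending[(seg.src, seg.sport, seg.dst, seg.dport, tid)] = seg
        elif seg.sport == MODBUS_PORT:    # server -> client: a response
            query = pending.pop((seg.dst, seg.dport, seg.src, seg.sport, tid), None)
            if query is not None:
                results.append((query.frame, seg.frame, seg.time - query.time))
    return results


# Made-up example: a query in frame 39 answered 0.038635 sec later.
query = Segment(39, 0.0, "10.0.0.1", 49152, "10.0.0.21", 502,
                struct.pack(">HHHB", 7, 0, 6, 1) + b"\x03\x00\x00\x00\x01")
reply = Segment(40, 0.038635, "10.0.0.21", 502, "10.0.0.1", 49152,
                struct.pack(">HHHB", 7, 0, 5, 1) + b"\x03\x02\x00\x2a")

for q_frame, r_frame, dt in response_times([query, reply]):
    print(f"frame {q_frame} -> frame {r_frame}: {dt * 1000:.1f} ms")
```

Keying on the full address/port pair plus the Transaction ID keeps transactions from different TCP connections apart.  If the Transaction ID is always zero, this sketch only works when each connection has a single query outstanding at a time.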

The tool developed automates these calculations and writes its output in CSV (comma-separated values) format, so the results can be sent to nearly any graphing tool that can import CSV files.  Excel will work, but is not optimal.  A better tool is Minitab, which has great graphs for exploratory data analysis.  The idea is to get the big picture so we can compare what normal might look like for this network with what is happening when a problem is indicated.
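To show what such a CSV might look like, here is a minimal Python sketch.  The column names and the sample row are assumptions for illustration only; the actual columns written by the tool are not shown here.

```python
import csv
import io


def write_timing_csv(rows, out):
    """Write one line per transaction; the column names are hypothetical."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["frame", "time_s", "master_ip", "slave_ip", "response_s"])
    for row in rows:
        writer.writerow(row)


# Made-up sample row: query frame, time in trace, master, slave, response time.
buf = io.StringIO()
write_timing_csv([(39, 12.5, "10.0.0.1", "10.0.0.21", 0.038635)], buf)
print(buf.getvalue())
```

Keeping the frame number and both IP addresses in every row is what later makes it possible to split the data by device and to index back into the trace.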

The tool runs in Perl and acts like a processor - it natively reads pcap files (the libpcap format, which both Wireshark and TCPDump can use), sorts the packets and does the calculations for us.  Note that newer versions of Wireshark use the pcapng format, which is not the same - so if this is in use, open the file in Wireshark and save it in pcap format.  Also note that the tool expects little-endian format, as would be the case with Intel or other x86 processors.  If you have a packet capture from an early Mac (before they went to Intel chips) or another big-endian processor (such as SPARC on Sun, MIPS, etc.), there may be trouble due to endianness issues (these would be rare, but could happen).
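A quick way to check a capture file before feeding it to a pcap-only reader is to look at its first four bytes.  The Python sketch below is not part of the tool; the function name is made up, but the magic numbers are the standard ones: 0xa1b2c3d4 (or 0xa1b23c4d for nanosecond timestamps) for pcap, written in the byte order of the capturing host, and 0x0a0d0d0a for the first block of a pcapng file.

```python
import struct


def capture_format(first_bytes):
    """Classify a capture file from its first four bytes (hypothetical helper)."""
    magic = first_bytes[:4]
    if len(magic) < 4:
        return "unknown"
    if magic == b"\x0a\x0d\x0d\x0a":
        return "pcapng"                     # Section Header Block type
    if magic in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
        return "pcap, little endian"        # written on x86 and similar
    if magic in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
        return "pcap, big endian"           # written on SPARC, older Macs, etc.
    return "unknown"


# Example with in-memory headers instead of real files.
print(capture_format(struct.pack("<I", 0xA1B2C3D4)))
print(capture_format(struct.pack(">I", 0xA1B2C3D4)))
print(capture_format(b"\x0a\x0d\x0d\x0a"))
```

For a real file, pass in the result of `open(path, "rb").read(4)`.  Anything other than "pcap, little endian" would need converting before use with a reader that has the restrictions described above.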

Executing the program, we get some information, but the real value is in the generated CSV files, which contain the timing data we want.  Taking a look at the response time as a dot plot (similar to a histogram) for ALL the data, we have:

   
Based on this, we might have three separate populations.  I expect random error in network communications - a small amount of deviation based on things we have no control over.  Random error is typically Gaussian in nature so we might expect to see some type of bell curve.  At the scaling we have here, it's hard to tell.  Zooming in (remember, this is exploratory data analysis - we don't know what we will find, so we are just looking around the data set to see what it can tell us):


In this range from 0.14 to 0.28sec, it's clear it is not a single population but likely two.  Be careful: scaling has an impact on how the results are viewed.  It is possible (and recommended) to continue down this path to see what is happening, but it's also useful to have a look at what information is provided.  These past two graphs are for all response time data that has the MasterIP (or client) as shown.  But how many servers (or slaves, in old Modbus terminology) are there?  Can we make separate plots for each one of these to compare?  The power of Minitab is in the by-variables feature - those familiar with SQL will have seen this before as it is similar to the group-by feature - so instead of plotting versus a single master (or many, though in this case there is only a single one), let's compare dotplots for each server:
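For readers without Minitab, the by-variable split itself is easy to reproduce.  The Python sketch below groups response times by server and prints a few summary numbers per group as a rough numeric stand-in for a dotplot.  The function name, IP addresses and timings are made up for illustration.

```python
from collections import defaultdict
from statistics import median


def summarize_by_server(rows):
    """rows: iterable of (slave_ip, response_s); returns {ip: (count, median, max)}."""
    groups = defaultdict(list)
    for slave_ip, response_s in rows:
        groups[slave_ip].append(response_s)
    return {ip: (len(times), median(times), max(times))
            for ip, times in groups.items()}


# Made-up data: one server answering in about 40 ms, another far slower.
rows = [
    ("10.0.0.21", 0.038), ("10.0.0.21", 0.040), ("10.0.0.21", 0.042),
    ("10.0.0.22", 0.150), ("10.0.0.22", 0.250),
]
for ip, (count, med, worst) in sorted(summarize_by_server(rows).items()):
    print(f"{ip}: n={count} median={med:.3f}s max={worst:.3f}s")
```

A summary like this only hints at which servers stand out.  The plots remain the better tool for seeing multiple populations, since a median and a maximum can hide a bimodal distribution.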


Here we can start to see that many slave or server IPs are typical - we don't know what is correct or incorrect - we are only noting what is 'different'.  But some are not typical - some look very different from the others.  This greatly narrows down the issues.  For all those that look normal or are consistent with the majority, there is usually little need to dig through them, so they can be set aside.  We are looking for the needle in a haystack, but the haystack is getting significantly smaller, with very little effort.  Another view of the data is to look at the response times as a function of time.  Does it vary?  Are there trends or patterns?  The trace is about 11min long - is the response time at the end the same as at the beginning?  Let's plot all the response times as a function of time:



Here we have all response times for all Modbus transactions as a function of time in the trace.  Based on the dot plots, we know we will have several different modes and we can see them here.  But which specific devices are different from the others?  Using the by-variables feature again in Minitab, we have:


Immediately apparent from this is that some devices (see yellow circles on graph) have no data - there is no red data point because there is no transaction here.  This doesn't tell us why there is no data - we may need to go back into the trace, but we have narrowed the problem down from 150000+ transactions from 29 slave devices to only those transactions with four slave devices.  This becomes a much more tractable problem.
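The same kind of gap can be found numerically.  The Python sketch below is a rough illustration, not the tool's method: it splits the trace into fixed windows and lists, per slave, the windows that contain no transaction at all.  The 60-second window, the function name and the sample data are assumptions.

```python
import math
from collections import defaultdict


def silent_windows(rows, trace_length_s, window_s=60.0):
    """rows: iterable of (time_s, slave_ip).

    Returns {slave_ip: [start times of windows with no transactions]}.
    A slave that never appears in rows cannot be reported by this sketch.
    """
    n_windows = max(1, math.ceil(trace_length_s / window_s))
    seen = defaultdict(set)
    for time_s, slave_ip in rows:
        index = min(int(time_s // window_s), n_windows - 1)
        seen[slave_ip].add(index)
    gaps = {}
    for slave_ip, windows in seen.items():
        missing = [w * window_s for w in range(n_windows) if w not in windows]
        if missing:
            gaps[slave_ip] = missing
    return gaps


# Made-up data: slave B is silent during the second minute of a 3 minute trace.
rows = [(5.0, "A"), (65.0, "A"), (125.0, "A"), (5.0, "B"), (125.0, "B")]
print(silent_windows(rows, trace_length_s=180.0))
```

The window size matters in the same way plot scaling does: too wide and short outages disappear, too narrow and normal polling intervals start to look like gaps.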

The tool can also help us index into the trace.  Next time, we will use Minitab to tell us exactly which frame in the trace we should look at.