Sunday, February 23, 2014

TCP Timing and the RTO

There are two general transport layer protocols in use with Industrial Fieldbus Protocols, TCP and UDP.

Both of these are at layer 4 of the 7-layer ISO-OSI network model, and they provide different services to hosts desiring to send data.  In general, layer 2 and layer 3 (datalink layer and IP layer, respectively) send frames on Ethernet as 'best effort'.  There is no guarantee that they arrive at their destination - quality and reliability are generally high in Ethernet networks, so frames sent usually get to their destination, but this is not a guarantee.  At the lowest layers (physical and media access), it is possible to have some amount of reliability built in to the protocol.  For example, if a half-duplex Ethernet interface detects a collision with CSMA/CD, there is a simple retransmission algorithm that resends the frame a limited number of times until it gets through.  Also with 802.11 wireless networks, which are CSMA/CA, most frames sent over the wireless link are acknowledged, and failing an acknowledgement, data is resent.

UDP does not provide any type of reliability service.  Data is packaged with a simple header and sent down the stack to the IP layer to be forwarded along to its destination.  It can be thought of as almost using the IP layer directly.  One example of UDP use in ICS protocols is Class 1 EtherNet/IP messaging, also commonly known as I/O or Implicit messaging.  It is designed to be high speed, and there are a series of counters and a timeout mechanism at the application layer (layer 7) which indicate packet loss.  Schneider Electric's Global Data service also uses UDP, as does Codesys' Network Variable concept; in both cases, data is published using UDP, often using multicast addressing at layer 3 (IP layer).  Usually a single UDP datagram contains a single message or some related collection of data.  This is not always true, as the application using UDP can spread data over several UDP datagrams, which end up in several IP packets and are delivered over multiple Ethernet frames.  Still, a UDP datagram can be used as a type of container, where the application infers a message or data boundary from the data being contained in a single datagram.  For instance, if we want to send many short messages of only a few bytes each, we could use distinct UDP datagrams, and the receiving application would be able to tell that each datagram (or packet, or frame) was a unique message.  The datagram itself provides information that a logical boundary exists.
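To make the datagram-as-container idea concrete, here is a minimal Python sketch.  It assumes a loopback interface is available and uses made-up payloads; the function name udp_boundaries_demo is my own and not part of any ICS product.  Each sendto() call produces one datagram, and each recvfrom() call returns exactly one datagram, so the receiver sees the message boundaries without parsing anything.

```python
import socket

def udp_boundaries_demo(messages):
    """Send each message as its own UDP datagram over loopback and
    return what the receiver gets from each recvfrom() call."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        rx.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        rx.settimeout(2.0)          # UDP is unreliable: do not wait forever
        addr = rx.getsockname()
        for msg in messages:
            tx.sendto(msg, addr)    # one sendto() call = one datagram
        # one recvfrom() call = one datagram, never a merge of two
        return [rx.recvfrom(2048)[0] for _ in messages]
    finally:
        tx.close()
        rx.close()

received = udp_boundaries_demo([b"\x01\x02", b"\x03", b"\x04\x05\x06"])
print(received)
```

Three short messages go in and three separate byte strings come out.  Nothing here recovers a lost datagram; that would be left to the application, as with the counters and timeouts mentioned above.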

TCP, on the other hand, provides a reliable byte stream of data, as well as many other services such as flow control, full duplex operation, and in-order delivery.  One point of note in comparison to UDP is the byte stream concept.  TCP guarantees a stream of data, delivered in order, with no gaps.  There is no concept here of a message boundary - all boundary assumptions have to come from the data itself based on the specific protocol and how it is defined at the application layer.  TCP counts each separate byte of data, not segments, packets or frames, and this is fine based on the service provided - the same byte stream needs to arrive at the receiver, but it does not have to be in a single message.
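Since TCP carries no message boundaries, the application protocol has to supply them.  As a sketch of how this can look for ModbusTCP, the Python below splits a byte stream into messages using the length field of the 7-byte MBAP header (transaction id, protocol id, length, unit id).  The helper names and the sample query are my own, for illustration only.

```python
import struct

MBAP_LEN = 7  # transaction id (2) + protocol id (2) + length (2) + unit id (1)

def split_modbus_frames(buffer):
    """Return (frames, leftover): complete ModbusTCP messages found at the
    start of buffer, plus any trailing bytes of an incomplete message."""
    frames = []
    while len(buffer) >= MBAP_LEN:
        _tid, _pid, length = struct.unpack(">HHH", buffer[:6])
        total = 6 + length  # the length field counts the unit id plus the PDU
        if len(buffer) < total:
            break           # message not fully received yet, wait for more bytes
        frames.append(buffer[:total])
        buffer = buffer[total:]
    return frames, buffer

def query(tid):
    # Read Holding Registers (function 3), address 0, quantity 10, unit id 1
    return struct.pack(">HHHBBHH", tid, 0, 6, 1, 3, 0, 10)

stream = query(1) + query(2)
chunks = [stream[:5], stream[5:15], stream[15:]]  # arbitrary TCP read sizes

pending = b""
frames = []
for chunk in chunks:
    complete, pending = split_modbus_frames(pending + chunk)
    frames.extend(complete)

print(len(frames), pending)
```

The chunk sizes are deliberately unrelated to the message sizes: the receiver still recovers both queries, because the boundaries come from the data and not from how TCP happened to deliver it.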

There are several ways TCP provides reliability to the byte stream it is transporting.  In particular, when data is put on the wire (a generalization for network media; it could just as well be a wireless network, with the data sent out over the airwaves) a timer is started.  This timer is called the RTO, or retransmission timeout.  This is a very important number in TCP when considering reliability because it controls retransmissions, which is how lost data is recovered in the byte stream.

If the RTO expires without the data being explicitly acknowledged it is usually retransmitted (acknowledgment is done through the receiver providing feedback to the sender that it has, in fact, received the data as indicated by acknowledging the sequence number of the sent data).  Lost data will be retransmitted a certain number of times until it is either acknowledged, or the sender gives up and moves on.  In either case, there are a number of parameters, some dynamically calculated, others fixed by configuration, that control the specifics of this process.
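As a rough illustration of this retry process, the Python sketch below lists when a segment that is never acknowledged would be resent.  It assumes the timer doubles after each expiry (the backoff described in RFC 6298) up to a cap; the retry count and the 120 second cap are parameters I picked for the example, and real stacks set their own values.

```python
def retransmit_schedule(rto, max_retries, rto_max=120.0):
    """Return the times (seconds after the original send) at which a segment
    that is never acknowledged would be retransmitted."""
    times = []
    elapsed = 0.0
    for _ in range(max_retries):
        elapsed += rto                # wait one full timer period
        times.append(round(elapsed, 3))
        rto = min(rto * 2, rto_max)   # back off: double the timer, up to the cap
    return times

schedule = retransmit_schedule(0.2, 5)
print(schedule)
```

Starting from a 200ms RTO, five retries fall at 0.2, 0.6, 1.4, 3.0 and 6.2 seconds after the original transmission, so the starting value of the timer has a large effect on how long recovery takes.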

Where does this RTO timing come from?  Often when a TCP connection is initiated, little is known about the path between the two communicating hosts, and little may be known about the receiver in general.  Therefore it is difficult to estimate what the optimum value should be.  Too small, and we get retransmissions when we don't need them, which just consumes more bandwidth.  Too long, and lost data takes longer to recover, which slows down communications.  In fact there is much research in optimizing this and other TCP timer-based mechanisms to improve data flow.

Most hosts, when initiating a TCP connection, will set a default RTO of some value.  Historically, this was 3 seconds, and still is for some hosts (almost always set by the operating system in use).  So if a host puts some data on the wire at the beginning of the communication stream, and no ACK is received for these bytes in 3sec, the data would be resent.  It is also a dynamic mechanism - what if data is delivered almost immediately with almost no packet loss: then 3 sec is very long to wait in this case.  So the RTO value is adjusted as data is transferred: measurements are taken, called the RTT, or round trip time, and a formula is used to adjust the RTO to optimize the communication channel.  If data is very slow to be ACKed, then we would want to move to a longer RTO.  If data is ACKed very fast, we would want a smaller RTO.  There are many good references on how this is calculated.  Many hosts now have an initial RTO of 1sec, while there are some ICS products designed for high speed use that have initial RTOs in the 25-50ms range.
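One widely cited reference for the calculation is RFC 6298.  The Python sketch below follows its smoothed RTT and RTT variation formulas; the class name and the default limits are my own choices, the clock granularity term from the RFC is left out for brevity, and this is not meant to be Linux's exact implementation.

```python
class RtoEstimator:
    ALPHA = 1 / 8   # gain for the smoothed RTT (SRTT)
    BETA = 1 / 4    # gain for the RTT variation (RTTVAR)
    K = 4           # multiplier applied to RTTVAR

    def __init__(self, initial_rto=1.0, rto_min=0.2, rto_max=120.0):
        self.srtt = None
        self.rttvar = None
        self.rto_min = rto_min
        self.rto_max = rto_max
        self.rto = initial_rto      # used until the first RTT sample arrives

    def sample(self, rtt):
        """Feed one RTT measurement (seconds) and return the updated RTO."""
        if self.srtt is None:
            # first measurement
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        rto = self.srtt + self.K * self.rttvar
        self.rto = min(max(rto, self.rto_min), self.rto_max)  # clamp to limits
        return self.rto

est = RtoEstimator()
for rtt in (0.002, 0.0015, 0.0025, 0.0015):
    est.sample(rtt)
print(est.rto)
```

With round trip times of a few milliseconds, the computed value is far below the lower limit, so the clamp decides the result.  This is why the minimum RTO matters so much on a small, fast network.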

However, though dynamic, there are limits to what values the RTO can take.  ICS protocols are typically 'local' applications, with devices in close proximity on generally small networks (not always true, but it's common to have these communication protocols running on a single machine, or in a single building, or on a campus).  It's much less common, though possible, to have these protocols running around the world over high latency satellite links.  Therefore, the parameter for us to focus on is the minimum allowed RTO.  There is a maximum as well, but in my experience the minimum RTO has a bigger impact on communications in the industrial space.

As a test, I used a relatively recent Linux system running on a PC, communicating with ModbusTCP to a Schneider Electric Momentum PLC.  This simulates typical SCADA traffic, where a SCADA system, usually running on some type of PC, uses an ICS protocol to pull data from a controller for visualization/storage and also sends data to the controller to implement specific actions.  Many SCADA systems run on Windows, but some run on other OSs, including Linux, Solaris, etc.

For a recent Linux kernel, the initial RTO seems to be around 1sec.  This is difficult to see with on-wire behavior because usually when the connection builds, data starts to send quickly, and the dynamic mechanism of the RTO based on the RTT starts to work.  It is easiest to visualize the steady-state RTO value, which in my test, is hopefully the minimum RTO.

Baseline RTO Test

As a baseline, this is what we get from the Linux client:

[george@linux george]$ uname -a
Linux linux 3.11.2-201.fc19.x86_64 #1 SMP Fri Sep 27 19:20:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux 

This is the kernel in use.  Linux actually allows us to adjust the minimum RTO on a per-route basis, so let's have a look at the routing table now before any adjustments are made:

[george@linux george]$ ip route show
default via 10.171.182.1 dev wlan0  proto static
10.171.182.0/23 dev wlan0  proto kernel  scope link  src 10.171.183.155  metric 9
192.168.10.0/24 dev em1  proto kernel  scope link  src 192.168.10.11  metric 1

In this case, the 192.168.10.0/24 network is in use.  With all of this in place, start communications with the Linux PC as the ModbusTCP client and the Momentum PLC as the server.  We are able to look at the current RTO for this particular TCP connection:

[george@linux george]$ watch -n 1 ss -tn -o dst 192.168.10.99




With the ss command wrapped inside the watch command, the current RTO is displayed every second.  In this case, it is at 203ms, which is the middle number of the timer field.

Looking at the raw data from Wireshark, we get the following for the ModbusTCP Response time:


In this case, we can see the typical response is on average about 1.5ms, but there is a significant mode of response up around 2.5ms.  Based on the typical RTOs observed with this data set in Linux, I propose the RTO is calculated for this case as minRTO + RTT.  The minimum RTO is 200ms by default (as evidenced by Linux kernel source code):


#define TCP_RTO_MIN ((unsigned)(HZ/5))

from include/net/tcp.h.
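The define is expressed in jiffies (kernel timer ticks), where HZ is the number of ticks per second.  A quick Python check of the arithmetic, using a few common HZ settings that I picked for illustration, shows why HZ/5 works out to 200ms regardless of how the kernel was configured:

```python
def tcp_rto_min_ms(hz):
    jiffies = hz // 5            # TCP_RTO_MIN = HZ/5 (integer division, as in C)
    return jiffies * 1000 / hz   # one jiffy lasts 1000/HZ milliseconds

results = {hz: tcp_rto_min_ms(hz) for hz in (100, 250, 300, 1000)}
print(results)
```

Each setting gives 200.0 milliseconds, which matches the floor seen in the ss output once the few milliseconds of round trip time are added.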

For this network, there are only a few devices and everything is in close proximity with high quality cabling in a lab environment.  Therefore, we expect essentially no packet loss, so the RTO should settle at the minimum value very quickly, which is what was observed.

Packet Drop Test at Default RTO

Now it would be useful to see the RTO in action.  However, this presents a minor problem as it is quite difficult to just drop packets, on command, to test retransmission behavior.  Most of the time I see engineers pull the cable from a device and try to see what the TCP stack does once this link is lost.  This is a valid failure event, but I argue it is only one type of event, and actually not the most common.  What I have observed to be more common is the loss of a single Ethernet frame, which then requires TCP to initiate recovery operations (usually a retransmission).  Pulling the cable does not represent this, and the single packet drop is usually not caused by the network itself.  Though the network can drop a frame if it is too busy, nearly all switches and bridges in use today make use of dedicated switch chips which can forward frames in hardware at full line rate.  In contrast, most Ethernet devices used in industrial control operate at MUCH lower levels than, say, 100 Mbps Ethernet is capable of - most of the time less than 10% of line rate.  Therefore, it's likely not the network dropping a packet, but an end device that cannot handle the frame within its TCP/IP stack and drops it due to overload or a bug.

So how to drop a single packet?  The best way is with custom hardware designed to do just this - one such machine is a Spirent or Ixia GEM impairment tool.  These have great control over a TCP connection and can be programmed to drop specific frames and the retransmission behavior can then be analyzed experimentally.

But these can cost thousands of dollars - so what if we don't have one?  Aside from pulling the cable, can I just drop a frame now and again to see what retransmission will be, and see how the RTO affects these retransmissions?

A small Latvian company, Mikrotik, makes some very nice, inexpensive routers and other networking equipment.  Perusing one of their $120 (street price) routers, the RB-450G, I ran across a feature in the IP Firewall configuration section that could be of use in this case.  Note the product has three different firewalls - one at the switch chip, one at the bridge level, and the third at the IP layer.  Here is the configuration of the IP Firewall rule - the first part is the selection rule for choosing TCP data with a destination port of 502.  This should be familiar to those who know ModbusTCP and TCP/IP as a selector for client traffic, or Modbus queries.  The ModbusTCP server, which accepts queries and issues responses, is usually listening on TCP port 502, while the client will usually use an ephemeral port (some number, perhaps greater than 32K) as its source port.



Selecting the forward chain, as data is passing through the firewall device, and TCP with a destination port of 502 for client-to-server traffic:


The interesting part for us is the random field:


This will select random packets at this frequency (based on 100%) - so this is to select 1/100 packets, on average, from the traffic selector we have, that of packets (mostly ModbusTCP queries) destined for the server device. 

Finally, we need to disposition our selected packets... in this case we want to drop, and see what the TCP stack does with the RTO setting:


Let's be clear: this is not as good as a dedicated machine purpose-built to drop packets.  With the GEM tool, you have very specific traffic selectors with much better control, and can do a lot more impairment (multiple drops, change any bit within the frame, duplicate packets, reorder, delay, etc.)  However, for the price, the Mikrotik is tough to beat.  Note that if you are faced with a specific issue, it may take some time to actually drop the specific frame you are looking for with this method due to the lack of richness in the selector tools.  However, probability says if we let it run long enough, we will probably see the event we are looking for.
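To put a number on "long enough", here is a small Python calculation.  It assumes each selected packet is dropped independently with the 1% probability configured above; the function names are mine.

```python
import math

def p_at_least_one_drop(n_packets, p_drop=0.01):
    """Chance that at least one of n_packets is dropped."""
    return 1 - (1 - p_drop) ** n_packets

def packets_needed(confidence, p_drop=0.01):
    """Smallest packet count giving at least one drop with this probability."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_drop))

needed = packets_needed(0.95)
print(needed, p_at_least_one_drop(needed))
```

Under these assumptions, about 300 queries give a 95% chance of at least one drop.  If you need one specific query dropped rather than any query, the wait scales with how rarely that query appears in the traffic.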

So now we can drop packets and look at the RTO in action.  

Hypothesis: we will use the Mikrotik to drop a ModbusTCP query, and the Linux client will wait about one RTO period and then retransmit.  

Let's have a look with Wireshark, from the viewpoint of the Linux PC client.  Where you sniff traffic is important: we know the ModbusTCP software will produce the query, and it will get transmitted onto the network.  But before it gets to the server, it will be dropped by the firewall.  So if we sniff traffic at the client end, we will see this query, and it will appear to have been ignored, as Wireshark will then show the same frame again, indicating a retransmission.  If we sniffed traffic at the server end, we would never see the original query, so Wireshark would not indicate a retransmission.  Note that the network has no concept of retransmission: it is a state held by a particular TCP/IP stack.  Even these two devices do not have to agree on what a retransmission is: based on where we sniff in this test, the client knows it is a retransmission and Wireshark shows it.  But I propose that, from the server's point of view, the retransmitted query is not a retransmission - the first one never arrived, so to this device the retransmitted query is the original query.  With a query dropped downstream of the capture point, we see a delay of 204ms before the retransmit, just as expected:
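The point about capture location can be sketched in a few lines of Python.  This is a deliberately simplified stand-in for Wireshark's analysis (it only looks for a repeated starting sequence number), and the timestamps and sequence numbers are hypothetical values I made up to mirror the test.

```python
def find_retransmissions(segments):
    """segments: (timestamp_s, seq, payload_len) tuples for data sent by ONE
    host, in capture order.  Returns (seq, delay_s) for every data segment
    whose starting sequence number was already seen at this capture point."""
    first_seen = {}
    found = []
    for ts, seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data and are not retransmissions
        if seq in first_seen:
            found.append((seq, round(ts - first_seen[seq], 6)))
        else:
            first_seen[seq] = ts
    return found

# Hypothetical client-side capture: the query at seq 13 appears twice.
client_view = [(0.000, 1, 12), (0.100, 13, 12), (0.304, 13, 12), (0.400, 25, 12)]
# Hypothetical server-side capture: the first copy of seq 13 never arrived.
server_view = [(0.0005, 1, 12), (0.3045, 13, 12), (0.4005, 25, 12)]

print(find_retransmissions(client_view))
print(find_retransmissions(server_view))
```

The same traffic yields one retransmission with a 204ms delay at the client end and none at the server end, which is the sense in which a retransmission is a matter of viewpoint.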


Adjusting the Default minimum RTO

Linux allows us to adjust the min RTO for a given route using the ip suite of tools.  To do this we perform the following.  Display route:

[george@linux george]$ ip route show
default via 10.171.182.1 dev wlan0  proto static
10.171.182.0/23 dev wlan0  proto kernel  scope link  src 10.171.183.155  metric 9
192.168.10.0/24 dev em1  proto kernel  scope link  src 192.168.10.11  metric 1
192.168.10.20 via 192.168.10.11 dev em1

Note in particular the addition of a specific host route now - the last entry, for 192.168.10.20.  To change the minimum RTO:

[root@linux george]# ip route change 192.168.10.20/32 via 192.168.10.11 dev em1 rto_min 50 

The documentation for this command indicates I need to add units to the new RTO value, in this case, 50.  However, the command errors when I add units (such as 50ms, or 50 ms, etc.), while the bare number seems to work and is interpreted as milliseconds.  Verifying the change:

[george@linux george]$ ip route show
default via 10.171.182.1 dev wlan0  proto static
10.171.182.0/23 dev wlan0  proto kernel  scope link  src 10.171.183.155  metric 9
192.168.10.0/24 dev em1  proto kernel  scope link  src 192.168.10.11  metric 1
192.168.10.20 via 192.168.10.11 dev em1  rto_min lock 50ms

Notice we need root privileges to make the route change.  The last host specific route now shows the minimum RTO at 50ms.  So time to try it. 
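If you want to check this setting from a script rather than by eye, a small parser is enough.  The Python below is a sketch that assumes the output format shown above (an rto_min field, optionally followed by lock, with a value in ms or s); the function name is my own.

```python
import re

def rto_min_by_route(ip_route_output):
    """Map each route's destination to its rto_min in milliseconds.
    Routes without an rto_min setting are left out."""
    result = {}
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        match = re.search(r"\brto_min\s+(?:lock\s+)?(\d+(?:\.\d+)?)(ms|s)\b", line)
        if match:
            value = float(match.group(1))
            # normalize seconds to milliseconds
            result[fields[0]] = value * 1000 if match.group(2) == "s" else value
    return result

sample = """default via 10.171.182.1 dev wlan0  proto static
10.171.182.0/23 dev wlan0  proto kernel  scope link  src 10.171.183.155  metric 9
192.168.10.0/24 dev em1  proto kernel  scope link  src 192.168.10.11  metric 1
192.168.10.20 via 192.168.10.11 dev em1  rto_min lock 50ms
"""
print(rto_min_by_route(sample))
```

On a live system the text could come from running ip route show through subprocess instead of the sample string used here.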

Packet Drop Test at 50ms RTO

Using our ss command, as we did earlier, we can see what Linux has for the current RTO now:

The RTO is now 53ms, in line with the earlier result: minRTO + RTT.

Wireshark shows, on packet drop:


As expected, a dropped query resulted in a retransmission at the RTO time, 53ms.  

Summary

We have explored the use of TCP with ModbusTCP and how a TCP stack retransmits data.  We have also shown how it is possible to use Linux to adjust the minimum RTO on a per-route basis, and how inexpensive hardware can be used to drop selected packets in a TCP stream to study the TCP protocol under actual use conditions.