Monday, May 26, 2014

Packet Generation at Layer 2

Packet generation / reception

Looking at different ways to generate the three basic types of layer-2 Ethernet traffic on a wired network:

1. Unicast: destination MAC address to a specific NIC
2. Multicast: destination MAC address starting with 01:00:5e
3. Broadcast: destination MAC address that is ff:ff:ff:ff:ff:ff

I found some tools that work well to send/receive various types of traffic.

 

 Background

While working on some course material for Wireshark, I found that being able to create these three traffic types is useful for testing various concepts related to:

1. Promiscuous mode
2. IGMP snooping
3. Switching/bridging
4. Hubs
5. Broadcast domains
6. How CPU load varies with traffic type
7. OS and driver differences

 

Some tools

I categorize packet creation utilities into two broad categories: raw packet generators, and tools that utilize host services.  A raw packet generator simply builds a frame and puts it on the wire, so all fields must be defined by the user.  There are no OS defaults, as there is really no interaction with the host OS's networking stack.  Consider how a typical program sends data: the programmer selects the destination host by hostname or IP address.  Even if a hostname is used, DNS (or the hosts file) is used to look up the IP address, so in either case data is sent to an IP address.  However, with the next-hop model of Ethernet and TCP/IP, actually forwarding a frame toward another host requires a destination MAC address, so the sending host OS looks up the destination IP in its route table to determine the next hop.  To find the MAC address of that next hop, the sending host OS normally uses ARP.  With raw packet generation tools, none of this support infrastructure is used - there is no DNS lookup, no route table to consult, no ARP protocol or ARP cache, etc.
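On a Linux host you can see this support infrastructure directly; a quick sketch (the destination address and interface are just examples from the test network described later):

[george@clt]$ ip route get 192.168.10.101        # which next hop and interface the kernel would use
[george@clt]$ ip neigh show                      # the ARP cache: IP-to-MAC mappings the kernel has learned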

The second category of tools also generates packets, but uses host services to varying degrees to actually send and/or receive frames - the ARP protocol and ARP cache, the host route table and gateways, and so on.

These lists are not exhaustive; they are just some of the tools I have used with varying degrees of success.  Indeed, a Google search will yield MANY more options for sending test data on a network.

 

Pure Packet Generators

Ostinato - this is nice, and works on Windows and Linux.  It creates custom packets and can define streams to send.  It is obviously limited by system resources, so very high frame rates can be difficult to sustain depending on the PC hardware.  The concept of interface selection is not that obvious, either.  I have only used the GUI; I am not sure if anything else is available.  It supports sending packets from a pcap file, but struggles with a file of any real size.

packETH - similar to Ostinato, but appears native to Linux (there may be ports to other OSs, but I don't know how current they are).  It is rather simple to use, has both GUI and CLI capability, and can read pcap files (functionality not tested).

Colasoft Packet Builder - free Windows tool.

SmartBits, Ixia, and other hardware tools - these solutions tend to be the most expensive, but are purpose-designed for this type of operation and so have great capabilities.  Always nice to have one of these if you can fit it into the budget.

Bit-Twist - multi-OS support, and will easily send the packets in a pcap capture file onto the wire; this is actually my preferred tool for capture file replay.  It includes a tool to edit capture files - say you want to change all the destination MAC addresses in a pcap trace file in order to replay them to a different host: Bit-Twist makes these types of changes easy.  Note that replay of TCP really is not effective without significant support that I have not seen in any of these tools (it may exist, but I have not verified it for replay).
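For example, rewriting the destination MAC in a trace and replaying it might look something like this (a sketch; the file names and MAC address are placeholders, and the exact options should be verified against the Bit-Twist documentation):

[george@clt]$ bittwiste -I original.pcap -O edited.pcap -T eth -d 00:11:22:33:44:55    # rewrite destination MACs
[root@clt]# bittwist -i eth0 edited.pcap                                               # replay the edited trace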

 

Generators that utilize host OS services

netcat, ncat, nc - tools for sending and receiving either unicast or multicast traffic.  They would not work with broadcast traffic in my testing.  All of these tools do much more than what I need for this project, so a search of their various features based on what you need is recommended.
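As a minimal sketch of the send/receive pattern (flag syntax differs between the traditional netcat, the OpenBSD nc, and Nmap's ncat, so check your variant):

[root@pidora ~]# nc -u -l 54321                          # listen for UDP on port 54321 (traditional netcat wants -l -p 54321)
[george@clt]$ echo hello | nc -u 192.168.10.101 54321    # send a UDP datagram to the listener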

iperf/jperf - command line tool (or Java GUI with jperf, which is recommended as it has graphing capability) for testing performance.  While doing performance testing over Ethernet with unicast or multicast, it is sending packets, so it can serve as a packet generator depending on what is required: if only the layer-2 destination type matters, it could fit the bill; if more customization is needed, it might not.  It runs on nearly all platforms, including Windows, Linux, the Raspberry Pi with Pidora, Android tablets and phones, Mac OS X...
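For example, a simple multicast test using iperf2-style options might look like this (a sketch; jperf exposes the same settings in its GUI):

[root@pidora ~]# iperf -s -u -B 224.10.20.30 -i 1        # server: bind to / join multicast group 224.10.20.30
[george@clt]$ iperf -c 224.10.20.30 -u -T 3 -t 10        # client: send UDP to the group, TTL 3, for 10 seconds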

hping3 - CLI tool to send and receive packets.  It is functionally flexible, and has the somewhat unique capability to send unicast, multicast, and broadcast traffic.  It also has listener services, but I have not tested those here.

socat - CLI tool for Linux that makes a great generic server.  Will listen on a port and accept unicast, multicast or broadcast traffic and echo the contents to the shell.  The broadcast reception capability seems to be somewhat unique.

 

Test Setup



To see how other hosts might view the layer-2 test traffic under various configurations, we can attach other devices to the L2 switch and run Wireshark or another packet-sniffing tool (tcpdump, OmniPeek, etc.).
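For example, tcpdump with the -e option prints the Ethernet header, so the destination MAC of each frame (unicast, 01:00:5e multicast, or ff:ff:ff:ff:ff:ff broadcast) is visible directly; a sketch, with the interface name as an example only:

[root]# tcpdump -n -e -i eth0 udp port 54321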

On the server side, to be able to receive all three types of layer-2 traffic - unicast, multicast, or broadcast - we use the socat command:

[root]# socat - UDP4-RECV:54321,ip-add-membership=224.10.20.30:0.0.0.0,broadcast

and we can see that an IGMP join is generated for this group:


This command will accept data on UDP port 54321, whether it is sent via layer-2 unicast, multicast, or broadcast.  Here are some diagnostic commands to see how things are going.  First, which multicast groups the pidora server is now listening to:

[root@pidora ~]# netstat -g
IPv6/IPv4 Group Memberships
Interface       RefCnt Group
--------------- ------ ---------------------
lo              1      all-systems.mcast.net
eth0            1      224.10.20.30
eth0            1      all-systems.mcast.net
lo              1      ff02::1
lo              1      ff01::1
eth0            1      ff02::1:ff0f:8b09
eth0            1      ff02::1
eth0            1      ff01::1

Which UDP ports is the pidora server listening on (note 54321 in the list):

[root@pidora ~]# netstat -lnu
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
udp        0      0 0.0.0.0:46359           0.0.0.0:*                          
udp        0      0 0.0.0.0:54321           0.0.0.0:*                          
udp        0      0 127.0.0.1:323           0.0.0.0:*                          
udp        0      0 0.0.0.0:68              0.0.0.0:*                          
udp        0      0 0.0.0.0:123             0.0.0.0:*                          
udp6       0      0 ::1:323                 :::*                               
udp6       0      0 :::3690                 :::*                               
udp6       0      0 :::123                  :::*
Now on the client side, we need different commands to create the different traffic streams.  If we use a pure traffic generator, we have to fill in all the information: source and destination MAC and IP addresses, select packet types, etc.  If we use a tool that can utilize host services, we need to provide less information.  As the command will utilize the routing table and ARP services/cache, we just need to worry about layer-3 information and above (destination IP, UDP port, etc.).

To create unicast traffic from the client:

[george@clt]$ hping3 192.168.10.101 -2 -c 100 -i u10000 -p 54321 -s 1025 -k -e 1
[open_sockraw] socket(): Operation not permitted
[main] can't open raw socket

Obviously this command results in an error - we need to be root:

[root@clt]# hping3 192.168.10.101 -2 -c 100 -i u10000 -p 54321 -s 1025 -k -e 1
HPING 192.168.10.101 (p16p1 192.168.10.101): udp mode set, 28 headers + 1 data bytes
[main] memlockall(): Success

Some of the options used:

Destination IP: 192.168.10.101
-2 for UDP mode
-c 100 as packet count, send 100 packets
-i u10000 as packet send delay, in microseconds, so send a packet every 10ms
-p 54321 is UDP destination port 
-s 1025 is UDP source port
-k is to fix the source port - if not used, each packet sent will increment source port by 1
-e 1 is the data to send - so the UDP payload will contain the ASCII character '1', or 0x31

So this is better - we can now see traffic flowing across the network, and the console on the pidora server side prints a '1' as each packet comes in - socat displays the UDP data that was sent with the -e switch of the hping3 command:

[root]# socat - UDP4-RECV:54321,ip-add-membership=224.10.20.30:0.0.0.0,broadcast
11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

Note that the hping3 command gives frequent segmentation faults when running on this platform (Fedora 20 in a VirtualBox VM, on a MacBook host with bridged networking to the Apple Thunderbolt GigE adapter).

To create multicast traffic:

[root@clt]# hping3 224.10.20.30 -2 -c 100 -i u10000 -p 54321 -s 1025 -k -e 2
HPING 224.10.20.30 (p16p1 224.10.20.30): udp mode set, 28 headers + 1 data bytes
[main] memlockall(): Success

The only differences from the unicast command are the destination IP address and the data - an ASCII '2' this time.  Note that since we are using a tool that relies on OS services, specifying a multicast destination IP causes the frame to be generated with the correct multicast MAC address, per RFC 1112.  It does not need to be specified; the Linux kernel on the client knows how to do this mapping and does it for us.
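As a quick sanity check of that mapping - the low 23 bits of the multicast IP are copied into a MAC address that starts with 01:00:5e - a bit of shell arithmetic reproduces the MAC we should see on the wire for 224.10.20.30:

[george@clt]$ printf '01:00:5e:%02x:%02x:%02x\n' $((10 & 0x7f)) 20 30
01:00:5e:0a:14:1e

Broadcast traffic is generated the same way: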

[root@clt]# hping3 192.168.10.255 -2 -c 100 -i u10000 -p 54321 -s 1025 -k -e 3
HPING 192.168.10.255 (p16p1 192.168.10.255): udp mode set, 28 headers + 1 data bytes
[main] memlockall(): Success

sending an ASCII '3' with this data set, and choosing a subnet broadcast address.  The host OS will map this subnet-local broadcast IP address to the broadcast MAC ff:ff:ff:ff:ff:ff.  The subnet-local broadcast address is defined by the IP address and the subnet mask, where the mask here is the class C default, /24 or 255.255.255.0.
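If in doubt, the subnet broadcast address is easy to confirm; assuming the Red Hat/Fedora ipcalc utility is installed, something like this should do it:

[george@clt]$ ipcalc -b 192.168.10.11/24
BROADCAST=192.168.10.255

And the results from the server: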

[root]# socat - UDP4-RECV:54321,ip-add-membership=224.10.20.30:0.0.0.0,broadcast
111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111122222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222223333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333

Here we can see that all three packet types were accepted, shown by the ASCII data received from each.

The broadcast address used above is the subnet broadcast.  The Linux client has two interfaces, and when the general broadcast address (255.255.255.255) is used, I could not generate traffic, even when specifying the interface.  Disabling the interface not used in the test:

[root@clt]# ifconfig p7p1 down

allows this command to now work:

[root@clt]# hping3 255.255.255.255 -2 -c 100 -i u10000 -p 54321 -s 1025 -k -e 4
HPING 255.255.255.255 (p16p1 255.255.255.255): udp mode set, 28 headers + 1 data bytes
[main] memlockall(): Success

With these results:

[root]# socat - UDP4-RECV:54321,ip-add-membership=224.10.20.30:0.0.0.0,broadcast
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222233333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333333334444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444444

Wireless

This proved to be much more difficult.  Packet injection capability is needed, which requires specific hardware support and the right software tools.  Next time, we will show how to inject wireless packets on Windows and Linux, and cover some of the issues that come up when doing so.  Also up for discussion is sniffing traffic on 802.11 wireless interfaces, which can be challenging.

Sunday, February 23, 2014

TCP Timing and the RTO

There are two transport-layer protocols in general use with Industrial Fieldbus Protocols: TCP and UDP.

Both of these are at layer 4 of the 7-layer ISO-OSI network model, and they provide different services to hosts that want to send data.  In general, layer 2 and layer 3 (the data link and IP layers, respectively) send frames on Ethernet as 'best effort'.  There is no guarantee that they arrive at their destination - quality and reliability are generally high in Ethernet networks, so frames sent usually reach their destination, but it is not guaranteed.  At the lower layers it is possible to have some amount of reliability built into the protocol.  For example, when collisions are detected on a CSMA/CD network, there is a simple retransmission algorithm that resends the frame a number of times until it arrives safely.  Similarly, on 802.11 wireless networks, which use CSMA/CA, most frames sent over the wireless link are acknowledged, and failing an acknowledgement, the data is resent.

UDP does not provide any type of reliability service.  Data is packaged with a simple header and sent down the stack to the IP layer to be forwarded along to its destination.  It can be thought of as almost using the IP layer directly.  Examples of UDP use in ICS protocols include Class 1 EtherNet/IP messaging, also commonly known as I/O or Implicit messaging; it is designed to be high speed, and there is a series of counters and a timeout mechanism at the application layer (layer 7) that indicates packet loss.  Schneider Electric's Global Data service also uses UDP, as does Codesys' Network Variable concept; in both cases data is published using UDP, often with multicast addressing at layer 3 (the IP layer).  Usually a single UDP datagram contains a single message or some related collection of data.  This is not always true - the application using UDP can spread data over several UDP datagrams, which can end up in several IP packets and be delivered over multiple Ethernet frames - but a UDP datagram can still be used as a type of container, where the application infers a message or data boundary from the datagram itself.  For instance, if we want to send many short messages of a few bytes each, we can use distinct UDP datagrams, and the receiving application can tell that each datagram (or packet, or frame) is a unique message.  The packet itself provides the information that a logical boundary exists.

TCP, on the other hand, provides a reliable byte stream of data, as well as many other services such as flow control, full duplex operation, and in-order delivery.  One point of note in comparison to UDP is the byte stream concept.  TCP guarantees a stream of data, delivered in order, with no gaps.  There is no concept here of a message boundary - all boundary assumptions have to come from the data itself based on the specific protocol and how it is defined at the application layer.  TCP counts each separate byte of data, not segments, packets or frames, and this is fine based on the service provided - the same byte stream needs to arrive at the receiver, but it does not have to be in a single message.

There are several ways TCP provides reliability for the byte stream it is transporting.  In particular, when data is put on the wire (a generalization for the network media - it could just as well be sent out over the airwaves on a wireless link), a timer is started.  This timer is the RTO, or retransmission timeout.  It is a very important number in TCP when considering reliability, because it controls retransmissions, which are how lost data is recovered in the byte stream.

If the RTO expires without the data being explicitly acknowledged it is usually retransmitted (acknowledgment is done through the receiver providing feedback to the sender that it has, in fact, received the data as indicated by acknowledging the sequence number of the sent data).  Lost data will be retransmitted a certain number of times until it is either acknowledged, or the sender gives up and moves on.  In either case, there are a number of parameters, some dynamically calculated, others fixed by configuration, that control the specifics of this process.

Where does this RTO timing come from?  When a TCP connection is initiated, little is known about the path between the two communicating hosts, and little may be known about the receiver in general.  It is therefore difficult to estimate the optimum value: too small, and we get retransmissions when we don't need them, which just consumes more bandwidth; too long, and it slows down the data communications and degrades the quality of the data link.  In fact, there is much research into optimizing this and other TCP timer-based mechanisms to improve data flow.

Most hosts, when initiating a TCP connection, will set a default RTO of some value.  Historically this was 3 seconds, and it still is for some hosts (it is almost always set by the operating system in use).  So if a host puts some data on the wire at the beginning of the communication stream and no ACK is received for these bytes within 3 seconds, the data is resent.  The RTO is also a dynamic mechanism - if data is delivered almost immediately with almost no packet loss, then 3 seconds is a very long time to wait.  So the RTO value is adjusted as data is transferred: measurements of the RTT, or round trip time, are taken, and a formula is used to adjust the RTO to optimize the communication channel.  If data is slow to be ACKed, we want a longer RTO; if data is ACKed very quickly, we want a smaller RTO.  There are many good references on how this is calculated.  Many hosts now have an initial RTO of 1 second, while some ICS products designed for high-speed use have initial RTOs in the 25-50ms range.
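For reference, the standard calculation (RFC 6298) keeps a smoothed RTT estimate and a variance estimate, and derives the RTO roughly as follows, with alpha = 1/8, beta = 1/4, and G the clock granularity:

RTTVAR = (1 - beta) * RTTVAR + beta * |SRTT - measured RTT|
SRTT   = (1 - alpha) * SRTT + alpha * measured RTT
RTO    = SRTT + max(G, 4 * RTTVAR)

Implementations then clamp the result to configured limits.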

However, though dynamic, there are limits on what values the RTO can take.  ICS protocols typically serve 'local' applications in close proximity on generally small networks (not always true, but it is common to have these protocols running within a single machine, a single building, or a campus).  It is much less common, though possible, to have them running around the world over high-latency satellite links.  Therefore, the parameter to focus on is the minimum allowed RTO.  There is a maximum as well, but in my experience the minimum RTO has a bigger impact on communications in the industrial space.

As a test, I used a relatively recent Linux system running on a PC, communicating via ModbusTCP with a Schneider Electric Momentum PLC.  This simulates typical SCADA traffic, where a SCADA system, usually running on some type of PC, uses an ICS protocol to pull data from a controller for visualization/storage and also sends data to the controller to implement specific actions.  Many SCADA systems run on Windows, but some run on other OSs, including Linux, Solaris, etc.

For a recent Linux kernel, the initial RTO appears to be around 1 second.  This is difficult to see in on-wire behavior because usually, once the connection is established, data starts flowing quickly and the dynamic adjustment of the RTO based on the RTT takes over.  It is easiest to observe the steady-state RTO value, which in my test should be the minimum RTO.

Baseline RTO Test

As a baseline, this is what we get from the Linux client:

[george@linux george]$ uname -a
Linux linux 3.11.2-201.fc19.x86_64 #1 SMP Fri Sep 27 19:20:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux 

This is the kernel in use.  Linux actually allows us to adjust the minimum RTO on a per-route basis, so let's have a look at the routing table before any adjustments are made:

[george@linux george]$ ip route show
default via 10.171.182.1 dev wlan0  proto static
10.171.182.0/23 dev wlan0  proto kernel  scope link  src 10.171.183.155  metric 9
192.168.10.0/24 dev em1  proto kernel  scope link  src 192.168.10.11  metric 1

In this case, the 192.168.10.0/24 network is in use.  With all of this in place, we start communications with the Linux PC as the ModbusTCP client and the Momentum PLC as the server.  We can then look at the current RTO for this particular TCP connection:

[george@linux george]$ watch -n 1 ss -tn -o dst 192.168.10.99




With the ss command wrapped inside the watch command, the current RTO is displayed every second.  In this case it is 203ms, which is the middle number of the timer field.

Looking at the raw data from Wireshark, we get the following for the ModbusTCP Response time:


In this case, we can see the typical response is on average about 1.5ms, but there is a significant mode of responses up around 2.5ms.  Based on the RTOs observed with this data set in Linux, I propose the RTO in this case is calculated as minRTO + RTT.  The minimum RTO is 200ms by default, as evidenced by the Linux kernel source code:


#define TCP_RTO_MIN ((unsigned)(HZ/5))

from include/net/tcp.h.  Since HZ is the number of jiffies per second, HZ/5 works out to 200ms.
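For context, the same header in this kernel generation also defines the maximum and initial RTO values (these can differ between kernel versions, so treat them as representative rather than definitive):

#define TCP_RTO_MAX ((unsigned)(120*HZ))
#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))

The 1*HZ initial value lines up with the roughly 1 second initial RTO mentioned earlier.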

For this network there are only a few devices, everything is in close proximity, and the cabling is high quality in a lab environment.  Therefore we expect essentially no packet loss, so the RTO should migrate to the minimum value very quickly, which is what was observed.

Packet Drop Test at Default RTO

Now it would be useful to see the RTO in action.  However, this presents a minor problem, as it is quite difficult to drop packets on command to test retransmission behavior.  Most of the time I see engineers pull the cable from a device and watch what the TCP stack does once the link is lost.  That is one particular type of failure event, but it is only one type, and actually not the most common.  What I have observed to be more common is the loss of a single Ethernet frame, which then requires TCP to initiate recovery operations (usually a retransmission).  Pulling the cable does not represent this, and the single packet drop is usually not done by the network itself.  Though the network can drop a frame if it is too busy, nearly all switches and bridges in use today use dedicated switch chips that forward frames in hardware at full line rate.  In contrast, most Ethernet devices used in industrial control operate at MUCH lower levels than, say, 100Mbps Ethernet is capable of - most of the time less than 10% of line rate.  Therefore it is likely not the network dropping a packet, but an end device that cannot handle it within its TCP/IP stack and so drops a frame due to overload or a bug.

So how to drop a single packet?  The best way is with custom hardware designed to do just this - one such machine is a Spirent or Ixia GEM impairment tool.  These have great control over a TCP connection and can be programmed to drop specific frames and the retransmission behavior can then be analyzed experimentally.

But these can cost thousands of dollars - so what if we don't have one?  Aside from pulling the cable, can I just drop a frame now and again to see what the retransmission behavior will be, and how the RTO affects those retransmissions?
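As an aside, if the traffic already passes through (or originates from) a Linux machine, the kernel's netem queueing discipline can drop a percentage of packets as well; a rough sketch, with the interface name as an example only:

[root@clt]# tc qdisc add dev em1 root netem loss 1%
[root@clt]# tc qdisc del dev em1 root

This lacks the per-flow selectivity of the approach described next, but it costs nothing to try.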

A small Latvian company, Mikrotik, makes some very nice, inexpensive routers and other networking equipment.  Perusing one of their $120 (street price) routers, the RB-450G, I ran across a feature in the IP Firewall configuration section that could be of use here.  Note the product has three different firewalls - one at the switch chip, one at the bridge level, and a third at the IP layer.  Here is the configuration of the IP Firewall rule - the first part is the selection rule for choosing TCP data with a destination port of 502.  This should be familiar to those who know ModbusTCP and TCP/IP as a selector for client traffic, i.e. Modbus queries.  The ModbusTCP server, which accepts queries and issues responses, usually listens on TCP port 502, while the client usually uses an ephemeral port (some number, perhaps greater than 32K) as its port.



Selecting the forward chain, since the data is passing through the firewall device, and TCP with a destination port of 502 to match client-to-server traffic:


The interesting part for us is the random field:


This selects random packets at the given frequency (out of 100%), so a value of 1 selects 1 in 100 packets, on average, from the traffic we have already matched - packets (mostly ModbusTCP queries) destined for the server device.

Finally, we need to choose a disposition for the selected packets... in this case we want to drop them and see what the TCP stack does with the RTO setting:


Let's be clear: this is not as good as a dedicated machine purpose-built to drop packets.  With the GEM tool, you have very specific traffic selectors with much better control, and can do a lot more impairment (multiple drops, changing any bit within the frame, duplicating packets, reordering, delaying, etc.).  However, for the price, the Mikrotik is tough to beat.  Note that if you are chasing a specific issue, it may take some time to actually drop the specific frame you are looking for with this method, due to the limited richness of the selector tools.  However, probability says that if we let it run long enough, we will probably see the event we are looking for.
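For reference, a rough RouterOS CLI equivalent of the rule built in the screenshots above might look like the following (a sketch from memory; property names and syntax should be checked against the RouterOS documentation for your version):

/ip firewall filter add chain=forward protocol=tcp dst-port=502 random=1 action=drop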

So now we can drop packets and look at the RTO in action.  

Hypothesis: we will use the Mikrotik to drop a ModbusTCP query, and the Linux client will wait about one RTO period and then retransmit.  

Let's have a look with Wireshark, from the viewpoint of the Linux PC client.  Where you sniff traffic is important: we know the ModbusTCP software will produce the query and it will be transmitted onto the network, but before it gets to the server it will be dropped by the firewall.  So if we sniff traffic at the client end, we will see this query, and it will appear to have been ignored, since Wireshark will then show the same frame again, indicating a retransmission.  If we sniffed traffic at the server end, we would never see the original query, so Wireshark would not indicate a retransmission.  Note that the network has no concept of retransmission: it is state held by a particular TCP/IP stack.  Even these two devices do not have to agree on what a retransmission is: based on where we sniff in this test, the client knows it is a retransmission and Wireshark shows it.  But from the server's point of view the retransmitted query is not a retransmission - the first one never arrived, so to that device the second, retransmitted query is the original query.  Dropping a query downstream of the client, we see a 204ms delay before the retransmission, just as expected:


Adjusting the Default minimum RTO

Linux allows us to adjust the minimum RTO for a given route using the ip suite of tools.  To do this, we first display the routes:

[george@linux george]$ ip route show
default via 10.171.182.1 dev wlan0  proto static
10.171.182.0/23 dev wlan0  proto kernel  scope link  src      10.171.183.155  metric 9
192.168.10.0/24 dev em1  proto kernel  scope link  src    192.168.10.11  metric 1
192.168.10.20 via 192.168.10.11 dev em1
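If such a host route does not already exist, one can be added ahead of time with something like the following (a sketch, assuming the same addressing as in the output above):

[root@linux george]# ip route add 192.168.10.20 via 192.168.10.11 dev em1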

Note in particular the specific host route that is now present - the last entry, for 192.168.10.20.  To change the minimum RTO on it:

[root@linux george]# ip route change 192.168.10.20/32 via 192.168.10.11 dev em1 rto_min 50

The documentation for this command indicates that units need to be added to the new RTO value, in this case 50.  However, the command errors when I add units (such as 50ms, or 50 ms, etc.), yet the form above seems to work.  Verifying the change:

[george@linux george]$ ip route show
default via 10.171.182.1 dev wlan0  proto static
10.171.182.0/23 dev wlan0  proto kernel  scope link  src 10.171.183.155  metric 9
192.168.10.0/24 dev em1  proto kernel  scope link  src 192.168.10.11  metric 1
192.168.10.20 via 192.168.10.11 dev em1  rto_min lock 50ms

Notice that we need root privileges to make the route change.  The last, host-specific route now shows the minimum RTO at 50ms.  So, time to try it.

Packet Drop Test at 50ms RTO

Using our ss command, as we did earlier, we can see what Linux has for the current RTO now:

The RTO is now 53ms, in line with before: minRTO + RTT.

Wireshark shows, on packet drop:


As expected, a dropped query resulted in a retransmission at the RTO time, 53ms.  

Summary

We have explored the use of TCP with ModbusTCP and how a TCP system retransmits data.  We have also shown how Linux can be used to adjust the minimum RTO on a per-route basis, and how inexpensive hardware can be used to selectively drop packets in a TCP stream to study the TCP protocol under actual use conditions.