Tuesday, December 25, 2018

802.11 Channel Utilization


I was recently asked about a product that requires 90Mbps (megabits per second) of throughput. With a Gig link, no problem. If the link is 100Mbps, technically it could work, but that runs past my engineering safety margin of 20%: the most I would want to see on a link is 80% of the line rate, or 80Mbps. But if we put in all Gig links, it should be fine.

Then they broke the news that what they really want is to be able to go wireless, too. That's a lot of throughput for wireless; even with new devices that claim super speeds:

https://www.linksys.com/ru/p/P-EA9500/

“Wi-Fi speeds up to 5.3 Gbps*”

Tracking down the asterisk note at the bottom of the page (not singling out Linksys for any particular reason; all the commodity product vendors do similar things):

*The standard transmission rates–1000 Mbps or 2166 Mbps (for 5 GHz), 1000 Mbps (for 2.4 GHz), 54 Mbps, and 11 Mbps–are the physical data rates. Actual data throughput will be lower and may depend on the mix of wireless products used and external factors.

So you don't really get 5.3Gbps… these are devices with three radios:

Tri-Band (5 GHz + 5GHz + 2.4 GHz)

And the speeds listed are datarates; the headline number simply sums the maximum datarate on each radio (1000 + 2166 + 2166 = 5332 Mbps, or about 5.3 Gbps):

Wi-Fi Speed: AC5400 (N1000 + AC2166 + AC2166)

I ask: what type of WiFi system do you have available? It turns out the product will have an 802.11a/b/g/n 1x1:1 radio that supports 20MHz channels and SGI (short guard interval). So now we know some capabilities. From http://mcsindex.com/, we know the maximum datarate this device can support is 72.2Mbps:


The description tells us:
  1. 802.11a means the device can utilize the 5GHz band. Exactly how many channels can be used depends on a number of factors: regulatory domain (e.g. the US follows FCC rules, Europe ETSI, etc.), whether DFS channels are supported, and so on. So some information here, but not the complete picture; see https://en.wikipedia.org/wiki/List_of_WLAN_channels for more details.
  2. 802.11bg means the device works on 2.4GHz. There is less room for interpretation here, but it still depends on the regulatory domain. For example, the FCC provides for channels 1-11, of which three are practically usable simultaneously in an infrastructure deployment (the so-called three-channel plan: channels 1/6/11). Other regions get additional 2.4GHz channels.
  3. 802.11n means the device uses enhancements to both the 2.4 and 5GHz bands for better performance. Note that n is not a band, but a series of performance-enhancing features. For practical reasons, those enhancements do more for 5GHz channels than 2.4GHz, but even 2.4GHz operation benefits. To see the specific performance capabilities of n, look at the MCS Index table for the HT (high throughput) index values. There are many options and levels of support, so saying “n support” alone does not say much about the maximum datarate capability of a device. In this case, the device supports 1 spatial stream, 20MHz channel bandwidth, SGI=400ns (short guard interval), and a maximum MCS index of 7, so the maximum datarate is 72.2Mbps. One popular device used in the WiFi diagnostic industry claims

802.11 a/b/g/n packet injection at all rates



yet does not support SGI, so how could it support all rates?
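
To see where 72.2Mbps comes from (and why a device without SGI cannot do “all rates”), here is a minimal sketch of the HT datarate math in Python. The subcarrier counts and symbol times are the standard 802.11n OFDM parameters; the function itself is just an illustration, not any vendor tool:

# Sketch: 802.11n (HT) PHY datarate from MCS parameters.
# datarate = data subcarriers * bits/subcarrier * coding rate * streams / symbol time

MCS_MOD = {  # HT MCS (per stream) -> (bits per subcarrier, coding rate)
    0: (1, 1/2), 1: (2, 1/2), 2: (2, 3/4), 3: (4, 1/2),
    4: (4, 3/4), 5: (6, 2/3), 6: (6, 3/4), 7: (6, 5/6),
}

def ht_rate_mbps(mcs, streams=1, bw_mhz=20, sgi=True):
    """Approximate HT datarate in Mbps; MCS 8-15 reuse 0-7 per stream."""
    subcarriers = 52 if bw_mhz == 20 else 108      # data subcarriers, 20 vs 40 MHz
    bits, coding = MCS_MOD[mcs % 8]
    symbol_us = 3.6 if sgi else 4.0                # 3.2us symbol + 0.4/0.8us guard
    return subcarriers * bits * coding * streams / symbol_us

print(round(ht_rate_mbps(7, streams=1, sgi=True), 1))   # 72.2: the 1x1:1 radio above
print(round(ht_rate_mbps(7, streams=1, sgi=False), 1))  # 65.0: best possible without SGI
print(round(ht_rate_mbps(15, streams=2, sgi=True), 1))  # 144.4: used later (Intel 7265)

Without SGI, the best a 1x1 20MHz device can do is 65Mbps, which is why “all rates” plus “no SGI” doesn't add up.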

So inevitably, after showing the MCS Index table, the gap to close is 90Mbps required vs. 72.2Mbps capability. But not so fast… who is actually getting 72.2Mbps?

“But you just showed us the table and told us we can do 72.2!”  they say.



Not really. There are liars, damn liars, and 802.11 WiFi engineers. No one gets real throughput that matches the datarate. Let's be clear on the choice of words: datarate is the rate at which data can be transferred. Throughput is the actual amount of data that is transferred; make no mistake, users care about throughput:

“How fast can I download the file?”
“The Internet seems slow today”
“Why is my network game so slow?”
“This YouTube video keeps stopping!”

Of course there is a relation: datarate is the upper bound on throughput, and the confusion stems from this fact; in the wired world, throughput generally matches datarate.

Let's try iperf between two wired hosts with Gig links on an all-switched network:

user@host:~$ iperf3 -c 192.168.20.21 -f m -i 1 -t 10
Connecting to host 192.168.20.21, port 5201
[ 4] local 192.168.20.14 port 41333 connected to 192.168.20.21 port 5201
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-1.00 sec 113 MBytes 944 Mbits/sec
...
[ 4] 9.00-10.00 sec 113 MBytes 949 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth
[ 4] 0.00-10.00 sec 1.10 GBytes 949 Mbits/sec sender
[ 4] 0.00-10.00 sec 1.10 GBytes 948 Mbits/sec receiver

So throughput here is about 950Mbps on a link speed of 1000Mbps, roughly 95% of line rate (where line rate is the datarate). In this case, throughput is essentially the datarate. There is still some overhead when using TCP/IP, so we never reach 100% for actual data throughput: the link may be saturated, but not all of it is available for data. For full-size frames with an assumed MTU of 1500 bytes (so a maximum frame size of 1514 bytes, MTU plus Ethernet header):

TCP/IP Frame Component    Size [bytes]
Ethernet                  14
IP (no options)           20
TCP (no options)          20
Data (TCP MSS)            1460

We can calculate the efficiency from these numbers, though it's not that important for this topic: 95% of line rate is good enough for our needs at this point to assume throughput = datarate.
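
For the curious, the back-of-the-envelope works out neatly; a quick sketch, adding the standard on-wire extras the table above omits (preamble, FCS, inter-frame gap):

# TCP goodput ceiling on gigabit Ethernet with full-size frames.
# Adds the on-wire extras the table omits: preamble (8B), FCS (4B), IFG (12B).

mss = 1460                                   # TCP payload per frame
wire = 8 + 14 + 20 + 20 + mss + 4 + 12       # preamble + headers + data + FCS + IFG
eff = mss / wire
print(f"{eff:.1%} of line rate -> {eff * 1000:.0f} Mbps on a Gig link")
# 94.9% of line rate -> 949 Mbps, right where iperf3 landed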

However, this is not always the case in the wired world. It's very common, but sometimes we run into products where the bottleneck is upstream of the network interface, so the maximum throughput can be less than the datarate. For example, take an i.MX6 maker board, available from several manufacturers.


The errata (https://www.nxp.com/docs/en/errata/IMX6DQCE.pdf) for the CPU includes a note regarding maximum throughput:

ERR004512
Description: The theoretical maximum performance of 1 Gbps ENET is limited to 470 Mbps (total for Tx and Rx). The actual measured performance in an optimized environment is up to 400 Mbps.

And indeed, with flow control enabled on the infrastructure switch (needed to even attain these speeds), we get the following for the forward and reverse directions:

user@host:~$ iperf -c 192.168.20.21 -f m -i 1 -t 10
------------------------------------------------------------
Client connecting to 192.168.20.21, TCP port 5001
TCP window size: 0.02 MByte (default)
------------------------------------------------------------
[ 3] local 192.168.20.224 port 35568 connected with 192.168.20.21 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 46.8 MBytes 392 Mbits/sec
...
[ 3] 0.0-10.0 sec 501 MBytes 420 Mbits/sec
user@host:~$ iperf -c 192.168.20.21 -f m -i 1 -t 10 -R
------------------------------------------------------------
Client connecting to 192.168.20.21, TCP port 5001
TCP window size: 0.02 MByte (default)
------------------------------------------------------------
[ 3] local 192.168.20.224 port 35569 connected with 192.168.20.21 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 47.8 MBytes 401 Mbits/sec
...
[ 3] 0.0-10.0 sec 504 MBytes 423 Mbits/sec

So we see in this case that we have a link speed of 1Gig, so the datarate is 1Gbps, but the maximum throughput is somewhat less. We know we do not have a 100Mbps link, or the maximum would be far lower than what is observed. This validates the errata from the chip manufacturer, and we match the advertised performance. However, this is an edge case. It does come up, probably most notably in the consumer space when adding an Ethernet interface via USB: with USB 1.1 or 2.0, the bus itself is much slower than 1Gbps, so we run into the same problem, with the data flow limited by the USB bus rather than the network interface datarate.

So where does that leave us with WiFi and 802.11? It turns out actual throughput will be MUCH less than sticker (i.e. datarate), for three reasons:

  1. Sticker performance is always the maximum datarate physically possible from the system, under best-case conditions. Best case in this context means a fantastic signal to noise ratio (SNR): basically, if you are in a sealed chamber sitting on top of the access point (AP), this is the datarate you'd see. It's not really that bad in practice, but the maximum requires a very healthy SNR, and the datarate falls as one moves away from the AP. 5GHz signals do not travel as far as 2.4GHz signals, so their datarate drops faster for a given change in distance.
  2. Protocol overhead is significant in the 802.11 world, in the name of robustness; one of the biggest issues with wireless is pervasive packet loss. In the wired world, especially on LANs, packet loss from the network is minimal (not always true, but usually on a good network). Wireless networks are transient, though, as things come and go that affect the RF environment, so we have to manage this packet loss as well as provide for other signaling. For example, in general, every unicast frame is ACKed per the 802.11 protocol, which takes time and bandwidth. APs also send beacons and other control and management frames, either on a schedule or as needed, to keep the network functioning (e.g. beacons, probe responses, RTS/CTS). All of this extra traffic takes bandwidth away from clients trying to send data; a rough airtime sketch after this list shows how much.
  3. Finally, the RF environment used by WiFi on a given channel is shared. It's inherently half duplex: if one station is communicating, all others must defer; only one gets the network at a time. Obviously, the more stations trying to communicate, the less time available for any one station to transmit data. This is evaluated as channel utilization, which can be a somewhat elusive number; in some ways it is useful to think of channel capacity as time: there is only so much time available to access the channel, and when it is gone, it is gone. So we want to optimize what is done during that time; the more we can transfer in a given period, the more we get out of the limited resource.
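
To put rough numbers on points 2 and 3, here is a sketch of the airtime for a single 1500-byte frame at our 72.2Mbps rate. The constants are ballpark 5GHz OFDM timings (DIFS, average backoff, PHY preamble, SIFS, a legacy-rate ACK), meant to be illustrative rather than exact:

# Ballpark airtime budget for one 1500-byte frame at 72.2 Mbps on 5 GHz.
# All timing constants are approximate; treat as illustrative only.

DIFS = 34.0          # us: SIFS (16) + 2 slots (9 each)
AVG_BACKOFF = 67.5   # us: ~7.5 slots average with CWmin = 15
PREAMBLE = 36.0      # us: HT-mixed PHY preamble, roughly
SIFS = 16.0          # us
ACK = 28.0           # us: legacy preamble (20) + ACK frame at 24 Mbps (~8)

data_us = 1500 * 8 / 72.2   # ~166 us of actual payload bits
total_us = DIFS + AVG_BACKOFF + PREAMBLE + data_us + SIFS + ACK
print(f"data {data_us:.0f} us of {total_us:.0f} us -> {data_us/total_us:.0%} efficient")
# data 166 us of 348 us -> 48% efficient, before anyone else even contends

Frame aggregation (A-MPDU) and block ACKs in n/ac exist largely to amortize this per-frame overhead, but the point stands: the datarate only applies while payload bits are actually in the air.
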
So what can we get out of a given WiFi radio in a particular environment? What types of throughput are possible? What is the limiting resource that we have to manage?

In evaluating the maximum throughput, we can look at our three reasons why we don't get sticker performance and see if we can control for some of them. For those that remain, we can design experiments to observe and evaluate the behavior.

For SNR, we can fix the test devices in known locations, ensuring a healthy SNR so that communications will typically be at the highest possible datarates. For protocol overhead, there is not much to do: the protocol behaves in a particular way, per the standard, and that is that. We will just avoid design choices that would exacerbate this effect and use typical best-practice settings. For instance, beacon intervals are typically at just about 100ms; we could manipulate the amount of protocol overhead in a number of ways:

  1. Decrease the beacon interval, so more beacons are sent in a given time period. This also raises channel utilization, since more frames means higher utilization.
  2. Decrease the datarate of the control and management frames: for a fixed size frame, a lower datarate will take more time to transmit. This takes time away from other stations that want access to the shared RF resource.
  3. Add additional SSIDs, where each SSID sends its own beacon at the given datarate. There are spreadsheets available on the Internet that calculate the overhead due to beacon datarate and number of SSIDs (http://revolutionwifi.blogspot.com/p/ssid-overhead-calculator.html); the sketch below does a rough version of the same math.
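
A hedged sketch of what those spreadsheets compute; the 300-byte beacon size is an assumption (real beacons vary with the IEs present), and the preamble times are the standard DSSS/OFDM values:

# Approximate beacon airtime overhead per channel.
# Assumes ~300-byte beacons and the usual 102.4 ms beacon interval.

def beacon_overhead(n_ssids, rate_mbps, beacon_bytes=300, interval_ms=102.4):
    preamble_us = 192 if rate_mbps <= 2 else 20   # DSSS long preamble vs OFDM
    airtime_us = preamble_us + beacon_bytes * 8 / rate_mbps
    return n_ssids * airtime_us / (interval_ms * 1000)

print(f"{beacon_overhead(1, 6):.1%}")   # 1 SSID at 6 Mbps: ~0.4% of airtime
print(f"{beacon_overhead(8, 1):.1%}")   # 8 SSIDs at 1 Mbps: ~20%, gone before any data moves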

However, if we fix these items at what best practice might be, we can eliminate some factors from our evaluation. So we can plan to:
  1. Leave beacon timing at default
  2. Use a channel with few other interfering devices
  3. Use a datarate for management and control frames that is consistent with a high density design, so we want a high datarate here (the physics shows that higher-datarate frames do not travel as far, and for high density designs we want small, high performance WiFi cells). On 5GHz, let's choose 24Mbps by setting the lowest mandatory rate to 24. All other 802.11a rates are supported.

So the last factor to evaluate is channel utilization. This is in fact the precious resource that needs to be managed; there is only so much, and we really want to optimize the time that we have when accessing the resource to maximize the throughput.

Thought experiment – in a vacuum, compare the following scenarios:
  1. Sending 100Mb at a 1Mbps datarate: expect it to take 100sec to transfer the 100Mb (megabits, just so the numbers are easy to calculate). What would channel utilization be?
  2. Sending 100Mb at a 100Mbps datarate: expect it to take 1sec to transfer the 100Mb. This is obviously better than the 1Mbps case as the transfer completes much faster. So what would channel utilization be in this case?

For case 1, since the station is transmitting for 100sec, the network is blocked the whole time (ignoring other network effects for simplicity). So the channel utilization will be 100% for this 100sec; there is no time left for other devices to do anything.

For case 2, utilization is only counted while the station is transmitting: for that 1sec it is 100%, and for the other 99sec it is 0%. A weighted average over the same 100sec window gives just 1%, far lower than case 1.
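
Spelled out as code (trivial arithmetic, but it makes the point):

# Channel utilization for the thought experiment, over a 100-second window.

def utilization(megabits, rate_mbps, window_s):
    tx_s = megabits / rate_mbps              # time spent transmitting
    return min(tx_s, window_s) / window_s

print(f"{utilization(100, 1, 100):.0%}")     # 100%: channel blocked the whole time
print(f"{utilization(100, 100, 100):.0%}")   # 1%: 99 s left for everyone else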

I often hear, for low throughput requirements: “we don't need 802.11n, or 802.11ac – we don't move a lot of traffic”. It's not really true; using the higher datarate modulations available from 802.11n/ac allows for more efficient use of the channel. Though the throughput requirement may be low, a high datarate lets a station access the channel, transmit, and release the channel faster, leaving more time for others to access the channel and do useful work.

A couple of points:
  1. It's tough to get 100% channel utilization; the maximum usable appears to be about 85% under heavy operation. Higher values usually indicate some type of problem, and utilization that high will starve out stations, so operating even close to this level is problematic.
  2. It would be nice to trend channel utilization over time on all channels to see what is happening. In the event of an issue, we could look here first to see if heavy utilization correlates to the problem.
For a simple test, we can use the beacons to tell us channel utilization if the QBSS load element is available:
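
If you would rather script it than eyeball a capture, here is a sketch using Scapy. The QBSS Load element (element ID 11) carries a station count and a channel-utilization byte scaled to 255; "wlan0mon" is a placeholder monitor-mode interface name:

# Pull channel utilization from the QBSS Load element (ID 11) in beacons.
# Requires a monitor-mode interface; "wlan0mon" is a placeholder name.
import struct
from scapy.all import sniff, Dot11Beacon, Dot11Elt

def show_qbss(pkt):
    if not pkt.haslayer(Dot11Beacon):
        return
    elt = pkt.getlayer(Dot11Elt)
    while elt is not None:
        if elt.ID == 11 and len(elt.info) >= 3:          # QBSS Load element
            stations, cu = struct.unpack_from("<HB", elt.info)
            print(f"{pkt.addr3}: {stations} STAs, channel utilization {cu/255:.0%}")
        elt = elt.payload.getlayer(Dot11Elt)

sniff(iface="wlan0mon", prn=show_qbss, timeout=30)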



Experiment: how do actual throughput and channel utilization change as we vary the datarate? We can vary the datarate by changing the supported MCS Index values at the AP, since WiFi client and AP communication is a sort of negotiation (both sides announce what they support, and the highest common parameters are usually chosen). For the test we can use a laptop with an Intel 7265 WiFi radio; this is an 802.11abgn-ac 2x2:2 adapter, and we will work on a 5GHz/20MHz channel. The chipset will actually do 40 and 80MHz bandwidth, but for testing we will limit it to 20MHz so we can easily see the effect on channel utilization on a single channel.

Per the MCS Index table, 144.4Mbps should be the maximum with 802.11ac capability disabled on the Intel 7265 (2 spatial streams, MCS 15, SGI). The wireless client is the iperf3 test client (so the TCP sender), and we report the receiver values (they do differ slightly, perhaps due to packet loss):


admin@kali:~$ iperf3 -c 192.168.30.21 -f m -i 1 -t 10
Connecting to host 192.168.30.21, port 5201
[ 5] local 192.168.30.157 port 38148 connected to 192.168.30.21 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 5.63 MBytes 47.2 Mbits/sec 1 269 KBytes
[ 5] 1.00-2.00 sec 13.7 MBytes 115 Mbits/sec 0 570 KBytes
[ 5] 2.00-3.00 sec 13.5 MBytes 114 Mbits/sec 0 706 KBytes
[ 5] 3.00-4.00 sec 13.8 MBytes 115 Mbits/sec 0 706 KBytes
[ 5] 4.00-5.00 sec 12.5 MBytes 105 Mbits/sec 0 748 KBytes
[ 5] 5.00-6.00 sec 12.5 MBytes 105 Mbits/sec 0 783 KBytes
[ 5] 6.00-7.00 sec 5.00 MBytes 41.9 Mbits/sec 41 256 KBytes
[ 5] 7.00-8.00 sec 12.5 MBytes 105 Mbits/sec 0 554 KBytes
[ 5] 8.00-9.00 sec 12.5 MBytes 105 Mbits/sec 0 621 KBytes
[ 5] 9.00-10.00 sec 13.8 MBytes 115 Mbits/sec 0 738 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 115 MBytes 96.8 Mbits/sec 42 sender
[ 5] 0.00-10.04 sec 112 MBytes 93.3 Mbits/sec receiver

To get at RSSI we want to look at both sides, since frames need to go back and forth. If the AP power is much higher than the wireless client's, it's entirely possible for the client to continue to hear the AP while the AP cannot hear the client; this causes roaming problems, as roaming is typically a client decision. Buying a more powerful AP for better coverage is not always the answer; turning down the AP power to balance with the clients is often the better design decision. From the client we can get the RSSI of the AP:

admin@kali:~$ iwconfig

wlan0 IEEE 802.11 ESSID:"LWAPV6"
Mode:Managed Frequency:5.825 GHz Access Point: A0:E0:AF:4E:B0:6F
Bit Rate=144.4 Mb/s Tx-Power=22 dBm
Retry short limit:7 RTS thr:off Fragment thr:off
Power Management:on
Link Quality=66/70 Signal level=-44 dBm
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
Tx excessive retries:0 Invalid misc:1 Missed beacon:0

The Cisco AP will give us the signal strength of the wireless client from the AP point of view:


It's also common that AP power is greater than wireless client power, so the client RSSI as measured at the AP will likely be the minimum of the two values.  However, there is no guarantee of this being true.

The Intel client will use the highest datarate available, legacy or HT, not necessarily the highest MCS Index rate. Therefore, to configure the test, let's set the following:


For a typical high density deployment, I would likely choose different values here.

Results






To interpret: as the MCS Index increases, so does the actual throughput (red squares). Indices 8-11 show an anomaly, explained below. Notice, though, that the channel utilization is not sensitive at all to the actual throughput or the datarate in use; it hovers just over 90% for all tests.

To explain the anomaly at indices 8-11, note that the maximum projected datarate actually falls. Referring to the MCS Index table: HT (i.e. 802.11n) indices 0-7 are single stream, while indices 8-15 are two spatial streams. But index 8 has a maximum datarate of 14.4Mbps, well below the index 7 value of 72.2Mbps. What is observed via an OTA capture is that the client (in this case an Intel 7265) seems to choose the highest datarate available, not necessarily the highest MCS Index. So until the two-stream datarate exceeds the single-stream rate, the maximum single-stream rate of 72.2Mbps is selected for transmission. This behavior could be very chipset- and version-dependent; other systems could behave very differently, and it appears the Cisco AP uses the maximum MCS Index as part of its rate selection algorithm. Since the bulk data transfer is upstream, from the wireless client to a wired server, the maximum throughput depends heavily on the rate selection of the client, in this case the Intel chipset. Perhaps this explains the slight increase in throughput as the MCS Index increases from 8 to 12: the client's Tx datarate is fixed (MCS 7 single-stream, 72.2Mbps), but the AP increases its datarate. The last datapoint is with 802.11ac rates enabled; in that case it is 2SS (spatial stream) VHT index 8, for a datarate of 173.3Mbps.
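
The “highest datarate, not highest index” behavior is easy to mimic; the rate values below are from the standard HT table (20MHz, SGI), while the selection rule is inferred from the OTA captures, not from any Intel documentation:

# Why the client sits at 72.2 Mbps until the 2-stream rates catch up.
# Rates: 20 MHz / SGI values from the HT MCS table; MCS 0-7 assumed always on.

RATES = {7: 72.2, 8: 14.4, 9: 28.9, 10: 43.3, 11: 57.8, 12: 86.7}  # Mbps

for cap in range(8, 13):                      # AP caps the max supported MCS
    allowed = {m: r for m, r in RATES.items() if m <= cap}
    best = max(allowed, key=allowed.get)      # pick fastest rate, not biggest index
    print(f"max MCS {cap}: client transmits MCS {best} at {allowed[best]} Mbps")
# MCS 12 (86.7 Mbps) is the first two-stream rate to beat 1SS MCS 7 (72.2)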


Observe also that the actual iperf throughput (in TCP mode) is always below the datarate, due to the extensive control and management traffic, which is sent at relatively low basic rates; this makes it difficult for actual throughput to ever approach the datarate. Note, also, that this implies a few things when trying to capture traffic: it's very unlikely you will see nothing at all from a client. If you do, it's probably not a modulation mismatch but rather a capture on the wrong channel, an adapter that doesn't support monitor mode or SGI, or the like. ACKs and Block ACKs are sent at low datarates, so they would be picked up even if the data frames (usually QoS-Data) are missed due to too high a modulation; evidence for the MAC address will still exist in the form of control traffic.

So we can get more real throughput with higher datarates, yet the channel utilization stays the same: we consume channel time (the precious resource to conserve) when transmitting, so the higher the datarates in use, the more real throughput is available for a given amount of channel utilization.

Some ideas going forward:
  1. Datarate is usually not artificially limited by configuration as it was in this test. What really happens is that SNR is reduced by distance or other obstructions in the RF path, which reduces the datarate and, with it, the throughput.
  2. It would be nice to get a rollup of channel utilization, on all channels, as part of a comprehensive data collection initiative, the kind a typical network manager might provide. Something like this:

Graphs for all channels would be useful.