Simulating Network Latency

One of the hardest things to come to terms with when working in a globally distributed organization is the latency of communication. Everyone understands the latency of human communication when working across geographically distant regions, but for a programmer or a systems engineer, understanding how network latency affects applications or services is trickier.

A simple example to put this in perspective: 

You have a tool/script that queries employee data like Name, Department, UserName etc. from a database. On most LANs, if the tool is run on a desktop or workstation, its network latency to the database server (which is also on the LAN) is typically on the order of 0.25 milliseconds or lower. Assuming the database server's processing overhead adds another 0.5 milliseconds, you expect the tool to return its results in about 1 millisecond i.e.

  1. 0.25 ms to send the request to the Database
  2. 0.5 ms for the Database to process the query request
  3. 0.25 ms to return the resultant data back to the tool

Add it all up and you get 1 millisecond.

The above example assumes you are requesting data for a single employee, i.e. you expect a single row of data back. Now suppose the database server is geographically far away from the machine running the tool (e.g. the database is located in Los Angeles and the tool is running in Mumbai):

  1. 125 ms to send the request to the Database
  2. 0.5 ms for the Database to process the query request
  3. 125 ms to return the resultant data back to the tool

Now with this shift of the database, your turnaround time for the tool has gone from 1 ms to 250.5 ms, which may not be perceivable to a user, but is still a 250-fold increase in time.

Building on the above example, if we were to query details for, say, 100 employees, we would see the following latency flow:

  1. 125 ms to send the request to the Database
  2. 50 ms for the Database to process the query request
  3. 200 ms to return the resultant data back to the tool

Now the turnaround time has jumped further, to 375 ms. Even this latency would not be perceivable and is completely within the tolerance of a user expecting to get back some data. But most tools or applications are not this simple. As you add more queries of greater complexity, querying copious amounts of data, this latency starts to really affect application responsiveness.

It's very clear that I have generalized a lot in the above example, but if you are building distributed applications where data sources like file servers, web servers or databases are geographically distant from the client application, you are going to have to spend a lot of time optimizing your application to handle network latency efficiently while still providing your users with a responsive application.

There are numerous ways to deal with this in the application, from client caching and database/file replication to higher-end network connections that lower latency. There are numerous websites and blogs dedicated to delving into such solutions. The first step towards dealing with such problems is to have a test environment that can simulate such network latency.
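As a toy illustration of the first of those options, client-side caching, here is a minimal shell sketch (the hostname, query and cache path are hypothetical) that reuses a query result for 60 seconds instead of paying the round trip on every invocation:

    #!/bin/sh
    # Hypothetical sketch: serve a cached copy of an employee query if it
    # is less than 60 seconds old, otherwise pay the round trip and cache it.
    CACHE=/tmp/employee_query.cache
    MAX_AGE=60
    if [ -f "$CACHE" ] && [ $(( $(date +%s) - $(stat -c %Y "$CACHE") )) -lt $MAX_AGE ]; then
        cat "$CACHE"    # cache hit: no network round trip at all
    else
        mysql -h DB-1 -e "SELECT Name, Department FROM Employees" | tee "$CACHE"
    fi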

This blog post describes a simple way of emulating various network conditions, like latency and packet loss, in your LAN environment.

To set up this environment you will need 3 Linux machines (or VMs). One will serve as your development machine (DEV-1), which runs the client tool/application; the second will serve as the destination server (DB-1), hosting a database, web server or file server. Sitting between these two machines is the third Linux machine, which plays the role of the gateway (GW-1) and can inject latency and simulate packet drops.

(Image: network topology - DEV-1 connecting to DB-1 through the GW-1 gateway)
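Throughout the rest of this post the three machines are assumed to have the following addresses (these are the values that appear in the rules below; substitute your own):

    192.168.1.5     DEV-1  (client machine running the tool)
    192.168.1.254   GW-1   (the gateway)
    192.168.1.20    DB-1   (database server, MySQL on port 3306)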

We will be using the Linux Netfilter tool iptables and the traffic control tool tc. Most modern Linux distributions come with both preinstalled.

For this example we assume we have a script which queries a database running on DB-1 for some data. Assuming the database server running on DB-1 is MySQL, it defaults to port 3306 for client connections. We would like to inject latency into all packets flowing between DEV-1 and DB-1.

The way we inject latency into the packets between DEV-1 and DB-1 is to redirect the traffic through GW-1. The simplest way to do this is to point the client tool at GW-1 as if it were running the database, i.e. GW-1 accepts packets on port 3306 and forwards them along to DB-1, and the results flow back down the same path. Along the way, latency is injected into the data flow.

We start by configuring GW-1.

  1. The first real step is to switch on IP forwarding; it wouldn't be much of a gateway if it did not forward IP traffic. (Note that this only lasts until the next reboot; a persistent alternative follows this list.)

    echo 1 > /proc/sys/net/ipv4/ip_forward
    
  2. Next it's a good idea to flush any existing rules set up in iptables

    iptables -t filter --flush
    iptables -t nat --flush
    iptables -t mangle --flush
    
  3. Now we need to NAT (both DNAT and SNAT) packets flowing between DEV-1 and DB-1 on port 3306. This ensures that packets arriving on port 3306 at GW-1 are sent across to port 3306 on DB-1, and the same goes for the return traffic. The first line below DNATs all traffic arriving from DEV-1 on port 3306 and marks its destination as DB-1. The second line performs a source NAT so that all traffic heading to DB-1 looks like its source was GW-1 (this is so that the return traffic flows back to GW-1 and not directly to DEV-1).

    iptables -t nat -A PREROUTING -p tcp -s 192.168.1.5 -d 192.168.1.254 --dport 3306 -j DNAT --to-destination 192.168.1.20:3306
    iptables -t nat -A POSTROUTING -p tcp -s 192.168.1.5 -d 192.168.1.20 --dport 3306 -j SNAT --to-source 192.168.1.254
    
  4. Finally we need to mark/tag all traffic that's flowing between DEV-1 and DB-1 (through GW-1) with an identifier, so that our traffic control rules can apply latency only to that traffic. If you look at the second line you will see that we still match on the source address 192.168.1.20. This is because the 'mangle' table is evaluated before the 'nat' table in the POSTROUTING chain, so the return packet's source address has not yet been rewritten to GW-1 when our mark rule sees it. Take a gander at the image below; it explains the flow of traffic through the various stages of traffic control and filtering in Linux.

    iptables -t mangle -A PREROUTING -p tcp -s 192.168.1.5 -d 192.168.1.254 --dport 3306 -j MARK --set-mark 115
    iptables -t mangle -A POSTROUTING -p tcp -d 192.168.1.5 -s 192.168.1.20 --sport 3306 -j MARK --set-mark 115
    
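As mentioned in step 1, the echo into /proc only enables IP forwarding until the next reboot. A persistent alternative (assuming your distribution reads /etc/sysctl.conf at boot) is:

    sysctl -w net.ipv4.ip_forward=1
    echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf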

This ends the iptables configuration. Before we move on to the traffic control configuration, it's wise to verify that your settings are correct.

Flow of packets through Linux Traffic Control and Filtering pipeline - Courtesy l7-filter (http://l7-filter.sourceforge.net/PacketFlow.png)
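You can list the installed rules along with their packet counters, which is a quick way to confirm both that the rules are in place and (once traffic starts flowing) that packets are actually hitting them:

    iptables -t nat -L -n -v
    iptables -t mangle -L -n -v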

Now, assuming that you have a MySQL server running on DB-1, you can test that your NATing is working as expected by pointing the mysql command-line client at GW-1. Run the following command and you should see output similar to this:

    > mysql -h GW-1 -P 3306
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 89178093
    Server version: 5.0.67-log SUSE MySQL RPM

    Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

    mysql>
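It is also worth recording a baseline turnaround time now, before any latency is injected, so you have a number to compare against later. A crude but effective way (assuming the mysql client can log in without prompting) is to time a trivial query through the gateway:

    time mysql -h GW-1 -P 3306 -e "SELECT 1"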

To recap, what we have done so far: 

  1. We have 3 Linux machines (DEV-1, GW-1 & DB-1)
  2. We have configured GW-1 to NAT MySQL connections arriving on port 3306
  3. We have configured GW-1 to tag all packets flowing between DEV-1 & DB-1 with the identifier '115'

Now on to the fun bit of injecting latency into this traffic.

To perform traffic control on your Linux machine you use the 'tc' tool. Below are the steps to perform on GW-1:

  1. Start by looking at the traffic control queues on your machine. By default you should have a pfifo_fast queue set up. Following is a sample output:

    > sudo tc qdisc show
       qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    
  2. Next you create a root PRIO queue; this will be the default queue through which all untagged packets flow. We add this queue to the 'eth0' interface with the handle '1:'. By default a PRIO queue has 3 bands (classes 1:1, 1:2 and 1:3).

    tc qdisc add dev eth0 handle 1: root prio
    
  3. Next we create a NETEM (network emulation) queue attached to band 3 of the PRIO queue (class 1:3), where we specify that we want to add 100ms of latency with a random variation of ±10ms.

    tc qdisc add dev eth0 handle 30: parent 1:3 netem delay 100ms 10ms
    
  4. Finally we need to catch the packets tagged by iptables. We match all packets marked with '115' and direct them into class 1:3, which means they will pass through the NETEM queue and have the latency applied.

    tc filter add dev eth0 protocol ip parent 1:0 prio 3 handle 115 fw flowid 1:3
    
  5. That should be it. Any traffic on port 3306 between DEV-1 and DB-1 will have 100ms of latency injected into its flow. Something to remember: with a 100ms delay the RTT becomes 200ms, i.e. 100ms is added to the packet going forward and another 100ms to the return packet, since both directions egress through eth0 on GW-1.
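At this point you can re-run the timed query from the baseline step above; the turnaround time should have grown by roughly 200ms. NETEM can also emulate conditions beyond delay. For example, to add 1% random packet loss on top of the existing delay (the percentage here is just an illustrative value):

    tc qdisc change dev eth0 parent 1:3 handle 30: netem delay 100ms 10ms loss 1%

When you are done testing, the whole setup can be torn down by deleting the root qdisc and flushing the iptables rules:

    tc qdisc del dev eth0 root
    iptables -t nat --flush
    iptables -t mangle --flush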

Having a testing platform where you can simulate network conditions is invaluable to any developer or systems engineer. I highly recommend keeping this setup handy if you are writing globally distributed applications.