Hunting down a SSD latency bug
Swapping out my old 500GB SATA HDD with a Corsair M4 256GB SSD was by far the biggest upgrade in terms of performance my 2 year old 15" Macbook Pro has ever received.
Switching to an SSD was like a breadth of fresh air to my aged laptop. I saw an immediate improvement in application load times and boot times. For a while the CPU on my laptop was throttled by the I/O latency of the 5400RPM SATA HDD, the shift to SSD meant my CPU was able to spend more time doing something useful rather than waiting for I/O.
Everything was great with my laptop, until i started to hit the 256GB space barrier (which was really easy to fill up). I am able to keep about 10-15GB free by being very disciplined about cleaning up regularly, but its a constant fight.
Running your laptop close to the edge means that you SSD controller has to work extra hard to do more frequent garbage collection runs as well as minimize the effect of write amplification(great article about SSD's). At near full capacity i started to a notice "sticky mouse" syndrome, which meant every once in a while i would notice my mouse pointer freeze up while trying to open a new tab in Chrome or reading some email. Initially it was infrequent enough for me to ignore it, but with the frequent fill and clean cycles i began to notice the frequency of the problem was increasing.
I was fairly certain that this sticky mouse problem was due to an sudden spike in IO latency or the SSD controller not responding while it tried to perform a GC run. To try and better understand i turned to dtrace. DTrace is an amazing instrumentation and profiling system built first on Solaris but is also available on Mac OS X and other UNIX operating systems. Linux however is still lacking native support for it, but there is work underway
If anyone is using a Mac and needs to get a peek at what their system is doing, i strongly recommend using dtrace and some of the example scripts that ship with it. You can get a quick preview of the various dtrace tools available on your machine by running a simple 'man -k dtrace'.
If you want to learn more about Dtrace and the above utilities you can read this excellent blog post by Brendan Greg.
For this investigation i picked the iosnoop dtrace script. This tool gives you the I/O requests being issued to the underlying hardware i.e. it excludes any filesystem cache hits. Some of the important information reported by iosnoop is
- Time taken for each I/O request
- PID & Name of program issuing the I/O request
- Type of I/O reuest (Read or Write)
- Full path and file name for the I/O request
- Size of I/O request
There is a lot more info reported by iosnoop, but with the above data i was able to gather some great latency data.
Given that i spent a large amount of time surfing the net, i end up using Chrome (my browser of choice) a lot. Therefore the sticky mouse pointer problems were fairly visible within it.
For my data gathering i ran the following. Below is screenshot of what the output would look like.
sudo iosnoop -A -n Google | tee -a chrome-io.csv
I dragged the output from the CSV file into Google Docs. The key thing that i was looking for was some way to judge what the average latency (DELTA column) i was seeing for a particular kind of IO request. So for the spreadsheet i gathered all 4KB (4096 byte) Read requests from Chrome and graphed them. What i found was rather interesting.
Just to verify my results i ran another gather operation
In both of the above graphs the majority of I/O requests were turned around in the 250 - 320 microseconds range with few outliers. Then there are samples that are way off the average, some of them are 10x or 20x the average. These spikes clearly cause the application to stall while it waits for the read to return. I would expect variable reads from an magnetic HDD but not from SSD's. This clearly seems to indicate that the SSD controller is stalling those reads while it shuffles data around.
Next Step:
- Even with application stalling for certain reads im not certain why would the mouse pointer drawing stall. Maybe the SSD controller fails to respond to the kernel while its executing a critical code path, causing all operations to stall.
- I need to run some synthetic tests via tools like iozone and measure I/O latency.
- Measure Read & Write latencies as well