07-14-2016 01:40 PM
We have been testing two Intel DC P3700 U.2 800GB NVMe SSDs to see the impact of the emulated sector size (512 vs 4096 bytes) on throughput. Using fio 2.12, we observed a puzzling collapse in performance. The steps to reproduce are given below.
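(For context, the emulated sector size is selected by reformatting the NVMe namespace. A rough sketch with nvme-cli follows; /dev/nvme0n1 and the LBA format index are examples rather than our exact commands, and the format command destroys all data on the namespace.)
# list the supported LBA formats and which one is currently in use
nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
# reformat the namespace to the LBA format index that reports a 4096-byte data size
nvme format /dev/nvme0n1 --lbaf=1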
Steps:
1. Copy or sequentially write a single large file (300 GB or larger); see the sketch after the results below
2. Start fio test with the following config:
[readtest]
thread=1
blocksize=2m
filename=/export/beegfs/data0/file_000000
rw=randread
direct=1
buffered=0
ioengine=libaio
nrfiles=1
gtod_reduce=0
numjobs=32
iodepth=128
runtime=360
group_reporting=1
percentage_random=90
3. Observe extremely slow performance:
fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5097: Thu Jul 14 13:00:25 2016
read : io=65536KB, bw=137028B/s, iops=0, runt=489743msec
slat (usec): min=4079, max=7668, avg=5279.19, stdev=662.80
clat (msec): min=3, max=25, avg=18.97, stdev= 6.16
lat (msec): min=8, max=31, avg=24.25, stdev= 6.24
clat percentiles (usec):
| 1.00th=[ 3280], 5.00th=[ 4320], 10.00th=[ 9664], 20.00th=[17536],
| 30.00th=[18816], 40.00th=[20352], 50.00th=[20608], 60.00th=[21632],
| 70.00th=[21632], 80.00th=[22912], 90.00th=[25472], 95.00th=[25472],
| 99.00th=[25472], 99.50th=[25472], 99.90th=[25472], 99.95th=[25472],
| 99.99th=[25472]
lat (msec) : 4=3.12%, 10=9.38%, 20=25.00%, 50=62.50%
cpu : usr=0.00%, sys=74.84%, ctx=792583, majf=0, minf=16427
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=32/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: io=65536KB, aggrb=133KB/s, minb=133KB/s, maxb=133KB/s, mint=489743msec, maxt=489743msec
Disk stats (read/write):
nvme0n1: ios=0/64317, merge=0/0, ticks=0/1777871, in_queue=925406, util=0.19%
4. Repeat the test
5. Performance is much higher:
fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5224: Thu Jul 14 13:11:58 2016
read : io=861484MB, bw=2389.3MB/s, iops=1194, runt=360564msec
slat (usec): min=111, max=203593, avg=26742.15, stdev=21321.98
clat (msec): min=414, max=5176, avg=3391.05, stdev=522.29
lat (msec): min=414, max=5247, avg=3417.79, stdev=524.75
clat percentiles (msec):
| 1.00th=[ 1614], 5.00th=[ 2376], 10.00th=[ 2802], 20.00th=[ 3097],
| 30.00th=[ 3228], 40.00th=[ 3359], 50.00th=[ 3458], 60.00th=[ 3556],
| 70.00th=[ 3654], 80.00th=[ 3785], 90.00th=[ 3949], 95.00th=[ 4080],
| 99.00th=[ 4359], 99.50th=[ 4424], 99.90th=[ 4752], 99.95th=[ 4883],
[remaining output truncated]
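For reference, steps 1 and 2 can be reproduced roughly as follows (a sketch: the path matches the job file above, the size and file-creation method are placeholders, and readtest.fio is a hypothetical name for the config shown in step 2):
# step 1: sequentially write a single ~300 GB file (a sequential copy of an existing file works as well)
dd if=/dev/zero of=/export/beegfs/data0/file_000000 bs=1M count=307200
# step 2: run the random-read job against that file
fio readtest.fio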
07-28-2016 08:54 AM
Hello,
At first we skipped flushing the page cache and ran the fio test right after creating the large file. With that approach, the results of the direct-read tests were very poor.
But as I mentioned above, we found that flushing the page cache after file creation improves the situation. This is confusing, because O_DIRECT mode is supposed to bypass the page cache entirely.
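(The usual way to flush the page cache on Linux is along these lines; a sketch, requires root:)
# write out dirty data, then drop the clean page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches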
Later I reviewed the Linux kernel code and found that it still performs some operations on the page cache even in direct mode (for example, invalidating cached pages that overlap the range of a direct I/O request). So now I suspect that this issue is related to the Linux kernel rather than to the drives.
07-28-2016 02:58 PM
AlexNZ,
Thanks for the information provided. We will continue with our testing here and will let you know soon.
NC
08-04-2016 03:28 PM
Hello Everyone,
Our engineering team is investigating this report, and we will share any results once we have them. Thanks.
NC
08-08-2016 07:54 AM
Hi AlexNZ,
Chances are that your findings in the kernel are the reason for this drop. We understand that Linux users can submit kernel questions, findings, and bugs here: https://bugzilla.kernel.org/. Here are some instructions that we found: https://www.kernel.org/pub/linux/docs/lkml/reporting-bugs.html
It is very important to bear in mind that the benchmarking we describe using fio (or IOMeter for Windows) in the evaluation guide (shared in a previous post) is not done the same way you have reported doing it. The guide states that these are synthetic tools used against the raw disk, and you seem to be getting the numbers we publish in the SSD's specifications when measuring the raw disk. The drop you see appears once a file system is created, and, as you may know, different file systems can yield different SSD performance numbers. Some interesting articles on this (which you may already be aware of, but are still worth sharing):
http://www.linux-magazine.com/Issues/2015/172/Tuning-Your-SSD
https://wiki.archlinux.org/index.php/Solid_State_Drives
http://www.phoronix.com/scan.php?page=article&item=linux-43-ssd&num=1
Please let us know if you have any questions.
NC
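P.S. In case it helps with comparing against the raw-disk numbers, a raw-device read test can look roughly like the line below (a sketch: /dev/nvme0n1 is an assumed device name, the parameters mirror your job file, and only reads are issued, so it is non-destructive):
fio --name=rawread --filename=/dev/nvme0n1 --rw=randread --bs=2m --direct=1 --ioengine=libaio --iodepth=128 --numjobs=32 --runtime=60 --group_reporting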
08-08-2016 02:14 PM
Hello NC,
Thanks for your reply.
I'll consider asking the kernel community about it. But since I know how to avoid this effect, and I know that the kernel actually manipulates the page cache in direct mode, I think it's no longer so important.
Alex