07-14-2016 01:40 PM
We have been testing two Intel DC P3700 U.2 800GB NVMe SSDs to see the impact of the emulated sector size (512 vs. 4096 bytes) on throughput. Using fio 2.12, we observed a puzzling collapse in performance. The steps are given below.
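For reference, the active logical sector size can be checked and changed with nvme-cli. This is only a sketch: the device name /dev/nvme0n1 is an example, the LBA format index varies per drive, and nvme format erases the namespace.
# list the supported LBA formats and show which one is currently in use
nvme id-ns -H /dev/nvme0n1 | grep 'LBA Format'
# reformat the namespace to a 4096-byte LBA format (destroys all data;
# pick the index reported for "Data Size: 4096" by the command above)
nvme format /dev/nvme0n1 --lbaf=3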
Steps:
1. Copy or sequentially write a single large file (300 GB or larger); a sample prefill command is sketched after the outputs below.
2. Start the fio test with the following config:
[readtest]
thread=1
blocksize=2m
filename=/export/beegfs/data0/file_000000
rw=randread
direct=1
buffered=0
ioengine=libaio
nrfiles=1
gtod_reduce=0
numjobs=32
iodepth=128
runtime=360
group_reporting=1
percentage_random=90
3. Observe extremely slow performance:
fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5097: Thu Jul 14 13:00:25 2016
read : io=65536KB, bw=137028B/s, iops=0, runt=489743msec
slat (usec): min=4079, max=7668, avg=5279.19, stdev=662.80
clat (msec): min=3, max=25, avg=18.97, stdev= 6.16
lat (msec): min=8, max=31, avg=24.25, stdev= 6.24
clat percentiles (usec):
| 1.00th=[ 3280], 5.00th=[ 4320], 10.00th=[ 9664], 20.00th=[17536],
| 30.00th=[18816], 40.00th=[20352], 50.00th=[20608], 60.00th=[21632],
| 70.00th=[21632], 80.00th=[22912], 90.00th=[25472], 95.00th=[25472],
| 99.00th=[25472], 99.50th=[25472], 99.90th=[25472], 99.95th=[25472],
| 99.99th=[25472]
lat (msec) : 4=3.12%, 10=9.38%, 20=25.00%, 50=62.50%
cpu : usr=0.00%, sys=74.84%, ctx=792583, majf=0, minf=16427
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=32/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: io=65536KB, aggrb=133KB/s, minb=133KB/s, maxb=133KB/s, mint=489743msec, maxt=489743msec
Disk stats (read/write):
nvme0n1: ios=0/64317, merge=0/0, ticks=0/1777871, in_queue=925406, util=0.19%
4. Repeat the test
5. Performance is much higher:
fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5224: Thu Jul 14 13:11:58 2016
read : io=861484MB, bw=2389.3MB/s, iops=1194, runt=360564msec
slat (usec): min=111, max=203593, avg=26742.15, stdev=21321.98
clat (msec): min=414, max=5176, avg=3391.05, stdev=522.29
lat (msec): min=414, max=5247, avg=3417.79, stdev=524.75
clat percentiles (msec):
| 1.00th=[ 1614], 5.00th=[ 2376], 10.00th=[ 2802], 20.00th=[ 3097],
| 30.00th=[ 3228], 40.00th=[ 3359], 50.00th=[ 3458], 60.00th=[ 3556],
| 70.00th=[ 3654], 80.00th=[ 3785], 90.00th=[ 3949], 95.00th=[ 4080],
| 99.00th=[ 4359], 99.50th=[ 4424], 99.90th=[ 4752], 99.95th=[ 4883],
[output truncated]
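For step 1, the test file can be created with a sequential fio write pass, for example (a sketch only; the path matches the job file above, and the size and flags can be adjusted):
# sequentially prefill a single 300 GB file before the random-read test
fio --name=prefill --filename=/export/beegfs/data0/file_000000 --rw=write --bs=1M --size=300G --direct=1 --ioengine=libaio --iodepth=32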
07-15-2016 08:11 AM
Since we deal with a similar situation, I tried the steps above and confirmed the issue on our machine. In fact, I also tried it with both XFS and EXT4; the symptom showed up regardless of filesystem.
07-15-2016 08:54 AM
AlexNZ,
Thanks for bringing this situation to our attention; we would like to verify this and provide a solution as quickly as possible. Please allow us some time to check on this and we will keep you all posted.
NC
07-20-2016 06:37 AM
Hello,
After reviewing the settings, we would like to verify the following. For the read test, could you please try:
fio --output=test_result.txt --name=myjob --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --norandommap --randrepeat=0 --runtime=600 --blocksize=4K --rw=randread --iodepth=32 --numjobs=4 --group_reporting
It is important to note that we normally run these tests with 4 threads and iodepth=32 for blocksize=4K. Please let us know, as we may need to keep researching this.
NC
07-20-2016 07:32 AM
Hello,
With the proposed settings I received the following result:
myjob: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
...
fio-2.12
Starting 4 processes
myjob: (groupid=0, jobs=4): err= 0: pid=23560: Wed Jul 20 07:06:08 2016
read : io=1092.2GB, bw=1863.1MB/s, iops=477156, runt=600001msec
slat (usec): min=1, max=63, avg= 2.76, stdev= 1.57
clat (usec): min=14, max=3423, avg=260.81, stdev=90.86
lat (usec): min=18, max=3426, avg=263.68, stdev=90.84
clat percentiles (usec):
| 1.00th=[ 114], 5.00th=[ 139], 10.00th=[ 157], 20.00th=[ 185],
| 30.00th=[ 207], 40.00th=[ 229], 50.00th=[ 251], 60.00th=[ 274],
| 70.00th=[ 298], 80.00th=[ 326], 90.00th=[ 374], 95.00th=[ 422],
| 99.00th=[ 532], 99.50th=[ 588], 99.90th=[ 716], 99.95th=[ 788],
| 99.99th=[ 1048]
bw (KB /s): min= 5400, max=494216, per=25.36%, avg=484036.11, stdev=14017.77
lat (usec) : 20=0.01%, 50=0.01%, 100=0.23%, 250=49.61%, 500=48.54%
lat (usec) : 750=1.55%, 1000=0.06%
lat (msec) : 2=0.01%, 4=0.01%
cpu : usr=15.00%, sys=41.78%, ctx=77056567, majf=0, minf=264
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=286294132/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: io=1092.2GB, aggrb=1863.1MB/s, minb=1863.1MB/s, maxb=1863.1MB/s, mint=600001msec, maxt=600001msec
Disk stats (read/write):
nvme0n1: ios=286276788/29109, merge=0/0, ticks=72929877/10859607, in_queue=84848144, util=99.33%
But in this case the test ran against the raw device (/dev/nvme0n1), whereas in our case it was a file on XFS on the NVMe drive.
Also, during our latest tests we determined that flushing the page cache (echo 1 > /proc/sys/vm/drop_caches) solves the problem.
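A minimal sketch of the workaround, assuming the job file from the first post was saved as readtest.fio (the file name is just an example); the flush must be run as root:
# flush dirty pages to disk, then drop the page cache before starting fio
sync
echo 1 > /proc/sys/vm/drop_caches
fio readtest.fio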
Why the page cache affects direct I/O is still an open question.
Can it be something specific to NVMe drivers?
AlexNZ