Critical performance drop on newly created large file

ANaza
New Contributor II
  • NVMe drive model: Intel SSD DC P3700 (U.2)
  • Capacity: 764G
  • FS: XFS
  • Other HW:
    • AIC SB122A-PH
    • 8 x Intel DC P3700 NVMe SSDs (2 on CPU 0, 6 on CPU 1)
    • 128 GiB RAM (8 x 16 GiB DDR4 2400 MHz DIMMs)
    • 2 x Intel Xeon E5-2620 v3 2.4 GHz CPUs
    • 2 x Intel DC S2510 SATA SSDs (one is used as a system drive)
    • Note that these drives are engineering samples provided by Intel NSG, but all have been updated to the latest firmware using isdct 3.0.0.
  • OS: CentOS Linux release 7.2.1511 (Core)
  • Kernel: Linux fs00 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

We have been testing two Intel DC P3700 U.2 800 GB NVMe SSDs to see the impact of the emulated sector size (512 vs. 4096 bytes) on throughput. Using fio 2.12, we observed a puzzling collapse in performance. The steps are given below.

Steps:

1. Copy or sequentially write a single large file (300 GB or larger).
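The exact command is not shown in this thread; as a sketch, assuming dd and the target path from the fio job below, a buffered sequential write of roughly 300 GiB would look like:

# Buffered write: the data passes through (and stays in) the page cache
dd if=/dev/zero of=/export/beegfs/data0/file_000000 bs=1M count=307200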

2. Start the fio test with the following config:

[readtest]
thread=1
blocksize=2m
filename=/export/beegfs/data0/file_000000
rw=randread
direct=1
buffered=0
ioengine=libaio
nrfiles=1
gtod_reduce=0
numjobs=32
iodepth=128
runtime=360
group_reporting=1
percentage_random=90
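With the config saved as, say, readtest.fio (the file name here is ours), the run is started with:

fio readtest.fio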

3. Observe extremely slow performance:

fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5097: Thu Jul 14 13:00:25 2016
read : io=65536KB, bw=137028B/s, iops=0, runt=489743msec
slat (usec): min=4079, max=7668, avg=5279.19, stdev=662.80
clat (msec): min=3, max=25, avg=18.97, stdev= 6.16
lat (msec): min=8, max=31, avg=24.25, stdev= 6.24
clat percentiles (usec):
| 1.00th=[ 3280], 5.00th=[ 4320], 10.00th=[ 9664], 20.00th=[17536],
| 30.00th=[18816], 40.00th=[20352], 50.00th=[20608], 60.00th=[21632],
| 70.00th=[21632], 80.00th=[22912], 90.00th=[25472], 95.00th=[25472],
| 99.00th=[25472], 99.50th=[25472], 99.90th=[25472], 99.95th=[25472],
| 99.99th=[25472]
lat (msec) : 4=3.12%, 10=9.38%, 20=25.00%, 50=62.50%
cpu : usr=0.00%, sys=74.84%, ctx=792583, majf=0, minf=16427
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=32/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
READ: io=65536KB, aggrb=133KB/s, minb=133KB/s, maxb=133KB/s, mint=489743msec, maxt=489743msec

Disk stats (read/write):
nvme0n1: ios=0/64317, merge=0/0, ticks=0/1777871, in_queue=925406, util=0.19%

4. Repeat the test.

5. Observe that performance is much higher:

fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5224: Thu Jul 14 13:11:58 2016
read : io=861484MB, bw=2389.3MB/s, iops=1194, runt=360564msec
slat (usec): min=111, max=203593, avg=26742.15, stdev=21321.98
clat (msec): min=414, max=5176, avg=3391.05, stdev=522.29
lat (msec): min=414, max=5247, avg=3417.79, stdev=524.75
clat percentiles (msec):
| 1.00th=[ 1614], 5.00th=[ 2376], 10.00th=[ 2802], 20.00th=[ 3097],
| 30.00th=[ 3228], 40.00th=[ 3359], 50.00th=[ 3458], 60.00th=[ 3556],
| 70.00th=[ 3654], 80.00th=[ 3785], 90.00th=[ 3949], 95.00th=[ 4080],
| 99.00th=[ 4359], 99.50th=[ 4424], 99.90th=[ 4752], 99.95th=[ 4883],
<e...

22 Replies

ANaza
New Contributor II

Hello,

At first, we skipped flushing the page cache and ran the fio test right after creating the large file. With that approach, the results of the direct-read tests were very poor.

But as I mentioned above, we found that flushing the page cache after file creation improves the situation. This is confusing, because O_DIRECT mode is supposed to bypass the page cache entirely.
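For reference, the flush we perform between file creation and the fio run is the standard page-cache drop (requires root):

sync
echo 3 > /proc/sys/vm/drop_caches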

Later, I reviewed the Linux kernel code and found that it does perform some page-cache operations even in direct mode. So now I suspect that this issue is related to the Linux kernel rather than to the drives.
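As a quick cross-check outside of fio, a direct read against the same file can also be spot-checked with dd (block size and count are illustrative):

dd if=/export/beegfs/data0/file_000000 of=/dev/null bs=2M count=1000 iflag=direct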

idata
Esteemed Contributor III

AlexNZ,

Thanks for the information provided. We will continue with our testing here and will let you know soon.

NC

idata
Esteemed Contributor III

Hello Everyone,

Our engineering team is investigating this report, and we will share any results as soon as we get them. Thanks.

NC

idata
Esteemed Contributor III

Hi AlexNZ,

Chances are that your findings in the kernel are the reason for this drop. We understand that Linux users can submit kernel questions, findings, and bugs here: https://bugzilla.kernel.org/. Here are some instructions that we found: https://www.kernel.org/pub/linux/docs/lkml/reporting-bugs.html

It is very important to bear in mind that the benchmarking we provide using fio (or IOMeter for Windows), as per the evaluation guide shared in a previous post, is not done the same way you have reported doing it. The evaluation guide states that these are synthetic tools meant to be run against the raw disk, and you appear to be getting the numbers from the SSD's specs when measuring the raw disk. The drop you see appears once the file system is created and, as you may know, different file systems can yield different SSD performance numbers. Some interesting articles on this (which you may already be aware of, but still worth sharing):

http://www.linux-magazine.com/Issues/2015/172/Tuning-Your-SSD
https://wiki.archlinux.org/index.php/Solid_State_Drives
http://www.phoronix.com/scan.php?page=article&item=linux-43-ssd&num=1

Please let us know if you have any questions.

NC
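For comparison against the spec-sheet numbers, a raw-device run needs only the filename changed to point at the block device instead of a file. A minimal sketch (the device name is illustrative; this particular job is read-only, but write tests against a raw device destroy any file system on it):

[readtest-raw]
filename=/dev/nvme0n1
(all other parameters as in the job file above)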

ANaza
New Contributor II

Hello NC,

Thanks for your reply.

I'll consider asking the kernel community about it. But since I know how to avoid this effect, and since the kernel does manipulate the page cache in direct mode, I think it is no longer so important.

Alex