07-14-2016 01:40 PM
We have been testing two Intel DC P3700 U.2 800GB NVMe SSDs to measure the impact of the emulated sector size (512 vs. 4096 bytes) on throughput. Using fio 2.12, we observed a puzzling collapse in performance. The steps are given below.
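For context, the emulated sector size is switched by reformatting the namespace with a different LBA format. A sketch using nvme-cli follows; the device path is an example and the LBA format index varies per drive, so treat this as an illustration rather than the exact procedure used here:
# list the LBA formats the namespace supports (ds:9 = 512-byte, ds:12 = 4096-byte sectors)
nvme id-ns /dev/nvme0n1
# reformat the namespace with the chosen LBA format index -- this destroys all data on the drive
nvme format /dev/nvme0n1 --lbaf=0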
Steps:
1. Sequentially copy or write a single large file (300 GB or larger); a command sketch is given after the results below
2. Start fio test with the following config:
[readtest]
thread=1
blocksize=2m
filename=/export/beegfs/data0/file_000000
rw=randread
direct=1
buffered=0
ioengine=libaio
nrfiles=1
gtod_reduce=0
numjobs=32
iodepth=128
runtime=360
group_reporting=1
percentage_random=90
3. Observe extremely slow performance:
fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5097: Thu Jul 14 13:00:25 2016
read : io=65536KB, bw=137028B/s, iops=0, runt=489743msec
slat (usec): min=4079, max=7668, avg=5279.19, stdev=662.80
clat (msec): min=3, max=25, avg=18.97, stdev= 6.16
lat (msec): min=8, max=31, avg=24.25, stdev= 6.24
clat percentiles (usec):
| 1.00th=[ 3280], 5.00th=[ 4320], 10.00th=[ 9664], 20.00th=[17536],
| 30.00th=[18816], 40.00th=[20352], 50.00th=[20608], 60.00th=[21632],
| 70.00th=[21632], 80.00th=[22912], 90.00th=[25472], 95.00th=[25472],
| 99.00th=[25472], 99.50th=[25472], 99.90th=[25472], 99.95th=[25472],
| 99.99th=[25472]
lat (msec) : 4=3.12%, 10=9.38%, 20=25.00%, 50=62.50%
cpu : usr=0.00%, sys=74.84%, ctx=792583, majf=0, minf=16427
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=32/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: io=65536KB, aggrb=133KB/s, minb=133KB/s, maxb=133KB/s, mint=489743msec, maxt=489743msec
Disk stats (read/write):
nvme0n1: ios=0/64317, merge=0/0, ticks=0/1777871, in_queue=925406, util=0.19%
4. Repeat the test
5. Performance is much higher:
fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5224: Thu Jul 14 13:11:58 2016
read : io=861484MB, bw=2389.3MB/s, iops=1194, runt=360564msec
slat (usec): min=111, max=203593, avg=26742.15, stdev=21321.98
clat (msec): min=414, max=5176, avg=3391.05, stdev=522.29
lat (msec): min=414, max=5247, avg=3417.79, stdev=524.75
clat percentiles (msec):
| 1.00th=[ 1614], 5.00th=[ 2376], 10.00th=[ 2802], 20.00th=[ 3097],
| 30.00th=[ 3228], 40.00th=[ 3359], 50.00th=[ 3458], 60.00th=[ 3556],
| 70.00th=[ 3654], 80.00th=[ 3785], 90.00th=[ 3949], 95.00th=[ 4080],
| 99.00th=[ 4359], 99.50th=[ 4424], 99.90th=[ 4752], 99.95th=[ 4883],
...
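For reference, step 1 above can be done with any sequential copy; a sketch using dd, with the same path as the fio job and an assumed size of roughly 300 GiB:
# write a ~300 GiB file sequentially (size is illustrative; oflag=direct bypasses the page cache)
dd if=/dev/zero of=/export/beegfs/data0/file_000000 bs=1M count=307200 oflag=direct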
07-20-2016 08:52 AM
I read this thread with strong interest. I concur with AlexNZ: testing files residing on a file system is far more relevant to production situations. We do so to figure out the overhead of the file system over raw devices (individual and aggregated).
The following suggestion from NC is only for testing raw devices.
fio --output=test_result.txt --name=myjob --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 --norandommap --randrepeat=0 --runtime=600 --blocksize=4K --rw=randread --iodepth=32 --numjobs=4 --group_reporting
On our end, we have run many hundreds of raw-device tests, and the results are always in line with what Intel has published. But this particular file-based result, as I posted on July 15, is a "shocker"!
It would be great to know why fio reading a regular file from an NVMe SSD with direct=1 is still affected by data in the page cache.
Another point: we understand why numjobs=4 and iodepth=32 are usually recommended for Intel NVMe SSDs. But such settings are only optimal for raw devices, right? When it comes to reading and writing regular files, IMHO we should configure fio with parameter values that match those of the actual workloads as closely as possible. NC, your view?
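To make that last point concrete, a file-based job could look roughly like the sketch below; the directory, file size, block size, and queue depths are assumptions meant to mirror an application's access pattern, not recommended values:
; file-based random-read job shaped after the application rather than the drive datasheet
[app_like_read]
directory=/export/beegfs/data0
size=300G
rw=randread
blocksize=1m
direct=1
ioengine=libaio
iodepth=8
numjobs=4
runtime=300
group_reporting=1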
07-20-2016 02:29 PM
Hello all,
Having reviewed the situation and all the information provided, we will be escalating this case and will post updates here. Please expect a response soon.
NC
07-22-2016 02:04 PM
Hello all,
We would like you to run the test again, but before that, could you please TRIM the drives first? Once you have done that, please share the results with us.
Also, please make sure you are using the correct driver from this link: https://downloadcenter.intel.com/download/23929/Intel-SSD-Data-Center-Family-for-NVMe-Drivers
Something important to mention is that the performance tools we use are synthetic benchmarking tools, as explained in the Intel® Solid-State Drive DC P3700 evaluation guide. These are intended to measure the behavior of the SSD without taking into consideration other components in the system that would add "bottlenecks"; synthetic benchmarks measure raw drive I/O transfer rates. Here is the evaluation guide: http://manuals.ts.fujitsu.com/file/12176/fujitsu_intel-ssd-dc-pcie-eg-en.pdf
Please let us know.
NC
07-24-2016 10:38 AM
Thanks for your follow-up. I did try fstrim on a DC P3700 NVMe SSD here.
First of all, let's get the driver and firmware question out of the way. The server runs CentOS 7.2:
[root@fs11 ~]# uname -a
Linux fs11 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@fs11 ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
We also use the latest isdct:
[root@fs11 ~]# isdct version
- Version Information -
Name: Intel(R) Data Center Tool
Version: 3.0.0
Description: Interact and configure Intel SSDs.
And, according to the tool, the drive is healthy:
[root@fs11 ~]# isdct show -intelssd 2
- Intel SSD DC P3700 Series CVFT515400401P6JGN -
Bootloader : 8B1B0131
DevicePath : /dev/nvme2n1
DeviceStatus : Healthy
Firmware : 8DV10171
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 2
ModelNumber : INTEL SSDPE2MD016T4
ProductFamily : Intel SSD DC P3700 Series
SerialNumber : CVFT515400401P6JGN
While the drive still had an XFS file system with data on it, I ran fstrim:
[root@fs11 ~]# fstrim -v /export/beegfs/data2
fstrim: /export/beegfs/data2: FITRIM ioctl failed: Input/output error
So I unmounted the XFS file system, used isdct delete to remove all data, recreated the file system, mounted it again, and ran fstrim once more.
Same outcome. Please see the session log below (a summary of the commands appears after it):
[root@fs11 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 192G 2.4G 190G 2% /
devtmpfs 63G 0 63G 0% /dev
tmpfs 63G 0 63G 0% /dev/shm
tmpfs 63G 26M 63G 1% /run
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/sda1 506M 166M 340M 33% /boot
/dev/sdb 168G 73M 157G 1% /export/beegfs/meta
tmpfs 13G 0 13G 0% /run/user/99
/dev/nvme2n1 1.5T 241G 1.3T 17% /export/beegfs/data2
tmpfs 13G 0 13G 0% /run/user/0
[root@fs11 ~]# umount /export/beegfs/data2
[root@fs11 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 192G 2.4G 190G 2% /
devtmpfs 63G 0 63G 0% /dev
tmpfs 63G 0 63G 0% /dev/shm
tmpfs 63G 26M 63G 1% /run
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/sda1 506M 166M 340M 33% /boot
/dev/sdb ...
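For clarity, the sequence described above amounts to roughly the following. The device path, mount point, and drive index come from the outputs above; the exact isdct and mkfs options are assumptions and may differ by tool version:
umount /export/beegfs/data2
# erase all data on drive index 2 (destructive)
isdct delete -intelssd 2
# recreate the XFS file system and remount it
mkfs.xfs -f /dev/nvme2n1
mount /dev/nvme2n1 /export/beegfs/data2
# retry the TRIM
fstrim -v /export/beegfs/data2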
07-24-2016 10:46 AM
Hello,
I can confirm that after TRIM the result is still poor.
Actually, after a quick look at the Linux kernel code, including the XFS implementation, I found that the page cache is still involved even during direct reads.
But such poor performance still looks weird.
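One way to narrow this down is to check how much of the test file is resident in the page cache around the fio runs, and to drop the cache between runs. A sketch, assuming the vmtouch utility is installed (fincore from newer util-linux would also work) and using the file path from the original job as an example:
# show how much of the test file is currently resident in the page cache
vmtouch -v /export/beegfs/data0/file_000000
# flush dirty data and drop clean page-cache pages, then re-run fio and compare first and second runs
sync
echo 3 > /proc/sys/vm/drop_caches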