07-14-2016 01:40 PM
We have been testing two Intel DC P3700 U.2 800GB NVMe SSDs to see the impact of the emulated sector size (512 vs. 4096 bytes) on throughput. Using fio 2.12, we observed a puzzling collapse of performance. The steps are given below.
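For context, one common way the emulated sector size is switched on an NVMe namespace is with nvme-cli. This is only a sketch: the device name and LBA format index below are examples and must be checked against what the drive actually reports, and the format wipes all data.
# list the LBA formats the namespace supports ("lbaf" entries; lbads:9 = 512B, lbads:12 = 4096B)
nvme id-ns /dev/nvme0n1 | grep lbaf
# reformat the namespace to the chosen LBA format index (index 3 is only an example; this destroys all data)
nvme format /dev/nvme0n1 --lbaf=3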
Steps:
1. Copy or sequentially write a single large file (300GB or larger); a minimal sketch is given after the fio results below
2. Start fio test with the following config:
[readtest]
thread=1
blocksize=2m
filename=/export/beegfs/data0/file_000000
rw=randread
direct=1
buffered=0
ioengine=libaio
nrfiles=1
gtod_reduce=0
numjobs=32
iodepth=128
runtime=360
group_reporting=1
percentage_random=90
3. Observe extremely slow performance:
fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5097: Thu Jul 14 13:00:25 2016
read : io=65536KB, bw=137028B/s, iops=0, runt=489743msec
slat (usec): min=4079, max=7668, avg=5279.19, stdev=662.80
clat (msec): min=3, max=25, avg=18.97, stdev= 6.16
lat (msec): min=8, max=31, avg=24.25, stdev= 6.24
clat percentiles (usec):
| 1.00th=[ 3280], 5.00th=[ 4320], 10.00th=[ 9664], 20.00th=[17536],
| 30.00th=[18816], 40.00th=[20352], 50.00th=[20608], 60.00th=[21632],
| 70.00th=[21632], 80.00th=[22912], 90.00th=[25472], 95.00th=[25472],
| 99.00th=[25472], 99.50th=[25472], 99.90th=[25472], 99.95th=[25472],
| 99.99th=[25472]
lat (msec) : 4=3.12%, 10=9.38%, 20=25.00%, 50=62.50%
cpu : usr=0.00%, sys=74.84%, ctx=792583, majf=0, minf=16427
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=32/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=128
Run status group 0 (all jobs):
READ: io=65536KB, aggrb=133KB/s, minb=133KB/s, maxb=133KB/s, mint=489743msec, maxt=489743msec
Disk stats (read/write):
nvme0n1: ios=0/64317, merge=0/0, ticks=0/1777871, in_queue=925406, util=0.19%
4. Repeat the test
5. Performance is much higher:
fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5224: Thu Jul 14 13:11:58 2016
read : io=861484MB, bw=2389.3MB/s, iops=1194, runt=360564msec
slat (usec): min=111, max=203593, avg=26742.15, stdev=21321.98
clat (msec): min=414, max=5176, avg=3391.05, stdev=522.29
lat (msec): min=414, max=5247, avg=3417.79, stdev=524.75
clat percentiles (msec):
| 1.00th=[ 1614], 5.00th=[ 2376], 10.00th=[ 2802], 20.00th=[ 3097],
| 30.00th=[ 3228], 40.00th=[ 3359], 50.00th=[ 3458], 60.00th=[ 3556],
| 70.00th=[ 3654], 80.00th=[ 3785], 90.00th=[ 3949], 95.00th=[ 4080],
| 99.00th=[ 4359], 99.50th=[ 4424], 99.90th=[ 4752], 99.95th=[ 4883],
...
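For reference, a minimal sketch of how the large file in step 1 can be written sequentially; the path and size mirror the fio config above and are only examples, and O_DIRECT keeps the write itself out of the page cache.
# sequentially write a ~300GB file, bypassing the page cache (path and size are examples)
dd if=/dev/zero of=/export/beegfs/data0/file_000000 bs=1M count=300000 oflag=direct
# flush anything still pending before starting the fio run
sync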
07-24-2016 10:57 AM
Just a quick supplement regarding the I/O errors that I reported in my last reply: I even tried to do the following:
I still got:
[root@fs11 ~]# dmesg |tail -11
[987891.677911] nvme2n1: unknown partition table
[987898.749260] XFS (nvme2n1): Mounting V4 Filesystem
[987898.752844] XFS (nvme2n1): Ending clean mount
[987948.612051] blk_update_request: I/O error, dev nvme2n1, sector 3070890712
[987948.612088] blk_update_request: I/O error, dev nvme2n1, sector 3087667912
[987948.612151] blk_update_request: I/O error, dev nvme2n1, sector 3121222312
[987948.612193] blk_update_request: I/O error, dev nvme2n1, sector 3062502112
[987948.612211] blk_update_request: I/O error, dev nvme2n1, sector 3104445112
[987948.612228] blk_update_request: I/O error, dev nvme2n1, sector 3079279312
[987948.612296] blk_update_request: I/O error, dev nvme2n1, sector 3096056512
[987948.612314] blk_update_request: I/O error, dev nvme2n1, sector 3112833712
So, unlike the SCSI drives I used years ago, formatting didn't remap the "bad" sectors. I would appreciate a hint as to how to get this issue resolved as well.
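In case it helps, one way to check whether the drive itself is logging media errors is the NVMe SMART/health log. This is a sketch, assuming nvme-cli is installed; the device name matches the dmesg output above.
# dump the SMART / health log; non-zero media_errors or a shrinking available_spare would point at the drive
nvme smart-log /dev/nvme2n1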
07-24-2016 12:08 PM
I tried to narrow down the cause of the fstrim issue further. It seems to me that the hardware (i.e. the NVMe SSD itself) is responsible, rather than the software layer on top of it (XFS). So I decided to add a partition table first and then create XFS on the partition. As is evident below, adding the partition didn't help.
Is the drive faulty? If so, why does isdct still deem its DeviceStatus Healthy?
[root@fs11 ~]# isdct delete -f -intelssd 2
Deleting...
- Intel SSD DC P3700 Series CVFT515400401P6JGN -
Status : Delete successful.
[root@fs11 ~]# parteed -a optimal /dev/nvme2n1 mklabel gpt
-bash: parteed: command not found
[root@fs11 ~]# parted -a optimal /dev/nvme2n1 mklabel gpt
Information: You may need to update /etc/fstab.
[root@fs11 ~]# parted /dev/nvme2n1 mkpart primary 1048576B 100%
Information: You may need to update /etc/fstab.
[root@fs11 ~]# parted /dev/nvme2n1
GNU Parted 3.1
Using /dev/nvme2n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: Unknown (unknown)
Disk /dev/nvme2n1: 1600GB
Sector size (logical/physical): 4096B/4096B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 1600GB 1600GB primary
(parted) quit
[root@fs11 ~]# mkfs.xfs -K -f -d agcount=24 -l size=128m,version=2 -i size=512 -s size=4096 /dev/nvme2n1
meta-data=/dev/nvme2n1 isize=512 agcount=24, agsize=16279311 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=390703446, imaxpct=5
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
<p style="color: # 222222; font-family: arial, sans-serif; fon...07-25-2016 09:35 AM
Hello,
Thanks everyone for trying the suggestion. We would like to gather all these inputs and research them with our department in order to work on a resolution for all of you. Please allow us some time to do the research; we will keep you posted.
NC
07-25-2016 02:54 PM
Thanks for following up. I reviewed what I had done regarding fstrim, and the tests that I have run, and came up with two additional plausible causes:
My tests indicate that the Intel DC P3700 firmware, the Linux NVMe driver, or both may have a bug. The following is my evidence; please review.
We use a lot of Intel DC P3700 SSDs of various capacities - 800GB and 1.6TB are two common ones - and have run hundreds of tests on them.
We also understand that with Intel DC P3700 NVMe SSDs there is no need to run trim at all; the firmware's garbage collection takes care of such needs transparently, behind the scenes. Still, IMHO it's a good idea that, when the sector size is changed, well-known Linux utilities continue to work as anticipated. We ran into this issue by serendipity, and got a "nice" surprise along the way.
Case 1. mkfs.xfs without -K
We will pick one drive, /dev/nvme2n1, unmount it, delete all data on it with isdct, run mkfs.xfs without the -K flag, and then run fstrim.
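In compact form, the sequence for this case is roughly the following (a sketch: the mount point and isdct index match the transcript below, the mkfs.xfs options mirror the earlier run minus -K, and the mount step is there because fstrim operates on a mounted filesystem):
umount /export/beegfs/data2
isdct delete -f -intelssd 2
mkfs.xfs -f -d agcount=24 -l size=128m,version=2 -i size=512 -s size=4096 /dev/nvme2n1
mount /dev/nvme2n1 /export/beegfs/data2
fstrim -v /export/beegfs/data2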
[root@fs11 ~]# man mkfs.xfs
[root@fs11 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 223.6G 0 disk
├─sda1 8:1 0 512M 0 part /boot
├─sda2 8:2 0 31.5G 0 part [SWAP]
└─sda3 8:3 0 191.6G 0 part /
sdb 8:16 0 223.6G 0 disk /export/beegfs/meta
sdc 8:32 0 59.6G 0 disk
sdd 8:48 0 59.6G 0 disk
sr0 11:0 1 1024M 0 rom
nvme0n1 259:2 0 1.5T 0 disk /export/beegfs/data0
nvme1n1 259:6 0 1.5T 0 disk /export/beegfs/data1
nvme2n1 259:7 0 1.5T 0 disk /export/beegfs/data2
nvme3n1 259:5 0 1.5T 0 disk /export/beegfs/data3
nvme4n1 259:0 0 1.5T 0 disk /export/beegfs/data4
nvme5n1 259:3 0 1.5T 0 disk
nvme6n1 259:1 0 1.5T 0 disk
nvme7n1 259:4 0 1.5T 0 disk
[root@fs11 ~]# umount /export/beegfs/data2
[root@fs11 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 223.6G 0 disk
├─sda1 8:1 0 512M 0 part /boot
├─sda2 8:2 0 31.5G 0 part [SWAP]
└─sda3 8:3 0 191.6G 0 part /
sdb 8:16 0 223.6G 0 disk /export/beegfs/meta
sdc 8:32 0 59.6G 0 disk
sdd 8:48 0 59.6G 0 disk
sr0 11:0 1 1024M 0 rom
nvme0n1 259:2 0 1.5T 0 disk /export/beegfs/data0
nvme1n1 259:6 0 1.5T 0 disk /export/beegfs/data1
nvme2n1 259:7 0 1.5T 0 disk
nvme3n1 259:5 0 1.5T 0 disk /export/beegfs/data3
nvme4n1 259:0 0 1.5T 0 disk /export/beegfs/data4
<p style="padding-left: 3...07-28-2016 08:20 AM
Hello everyone,
We would like to address the performance drop questions first so we don't mix the situations. Can you please confirm if this was the process you followed:
- Create large file
- Flush page cache
- Run FIO
Now, at which step are you flushing the page cache to avoid the performance drop? Please let us know.
NC
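For reference, a common way to flush the page cache on Linux between the file creation and the fio run (run as root) is:
# write out dirty pages first, then drop the clean page cache, dentries and inodes
sync
echo 3 > /proc/sys/vm/drop_caches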