Critical performance drop on newly created large file

ANaza
New Contributor II
  • NVMe drive model: Intel SSD DC P3700 U.2 NVMe SSD
  • Capacity: 764G
  • FS: XFS
  • Other HW:
    • AIC SB122A-PH
    • 8 Intel NVMe DC P3700 SSDs: 2 on CPU 0, 6 on CPU 1
    • 128 GiB RAM (8 x 16 GiB DDR4 2400 MHz DIMMs)
    • 2 x Intel E5-2620v3 2.4 GHz CPUs
    • 2 x Intel DC S2510 SATA SSDs (one is used as the system drive).
    • Note that both are engineering samples provided by Intel NSG, but all have had their firmware updated to the latest using isdct 3.0.0.
  • OS: CentOS Linux release 7.2.1511 (Core)
  • Kernel: Linux fs00 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

We have been testing two Intel DC P3700 U.2 800GB NVMe SSDs to see the impact of the emulated sector size (512 vs 4096 bytes) on throughput. Using fio 2.12, we observed a puzzling collapse in performance. The steps to reproduce are given below.
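
For context, the emulated sector size on these drives is selected when the namespace is formatted. A minimal sketch of how it is switched with isdct (the drive index 0 is illustrative; the LBAformat index mapping is drive-specific, but LBAformat=3 is what later posts in this thread use for the 4096-byte format, and the drives ship with the 512-byte format):

# Reformat to the 4096-byte logical sector; this destroys all data on the drive.
isdct start -intelssd 0 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetaDataSettings=0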

Steps:

1. Copy or sequentially write a single large file (300G or larger), e.g. along the lines of the dd sketch just below.
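
A minimal dd sketch of such a sequential write (the path matches the fio filename used in step 2; the size and block size are illustrative):

# Sequentially write a 300 GiB file using direct I/O to bypass the page cache.
dd if=/dev/zero of=/export/beegfs/data0/file_000000 bs=1M count=307200 oflag=direct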

2. Start the fio test with the following config:

[readtest]
thread=1
blocksize=2m
filename=/export/beegfs/data0/file_000000
rw=randread
direct=1
buffered=0
ioengine=libaio
nrfiles=1
gtod_reduce=0
numjobs=32
iodepth=128
runtime=360
group_reporting=1
percentage_random=90
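
Saved to a job file (the name readtest.fio is ours, not from the original post), the test is launched as:

fio readtest.fio

Note that with numjobs=32 and iodepth=128 this configuration can keep up to 32 x 128 = 4096 outstanding 2 MiB requests in flight against a single file.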

3. Observe extremely slow performance:

fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5097: Thu Jul 14 13:00:25 2016
  read : io=65536KB, bw=137028B/s, iops=0, runt=489743msec
    slat (usec): min=4079, max=7668, avg=5279.19, stdev=662.80
    clat (msec): min=3, max=25, avg=18.97, stdev= 6.16
     lat (msec): min=8, max=31, avg=24.25, stdev= 6.24
    clat percentiles (usec):
     |  1.00th=[ 3280],  5.00th=[ 4320], 10.00th=[ 9664], 20.00th=[17536],
     | 30.00th=[18816], 40.00th=[20352], 50.00th=[20608], 60.00th=[21632],
     | 70.00th=[21632], 80.00th=[22912], 90.00th=[25472], 95.00th=[25472],
     | 99.00th=[25472], 99.50th=[25472], 99.90th=[25472], 99.95th=[25472],
     | 99.99th=[25472]
    lat (msec) : 4=3.12%, 10=9.38%, 20=25.00%, 50=62.50%
  cpu : usr=0.00%, sys=74.84%, ctx=792583, majf=0, minf=16427
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued : total=r=32/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  READ: io=65536KB, aggrb=133KB/s, minb=133KB/s, maxb=133KB/s, mint=489743msec, maxt=489743msec

Disk stats (read/write):
  nvme0n1: ios=0/64317, merge=0/0, ticks=0/1777871, in_queue=925406, util=0.19%

In other words, issued: total=r=32 with blocksize=2m means the whole run moved only 32 x 2 MiB = 65536 KB in roughly 490 s (each of the 32 threads completed just a single 2 MiB read), which matches the reported 133 KB/s aggregate bandwidth.

4. Repeat the test

5. Performance is much higher:

fio-2.12
Starting 32 threads
readtest: (groupid=0, jobs=32): err= 0: pid=5224: Thu Jul 14 13:11:58 2016
  read : io=861484MB, bw=2389.3MB/s, iops=1194, runt=360564msec
    slat (usec): min=111, max=203593, avg=26742.15, stdev=21321.98
    clat (msec): min=414, max=5176, avg=3391.05, stdev=522.29
     lat (msec): min=414, max=5247, avg=3417.79, stdev=524.75
    clat percentiles (msec):
     |  1.00th=[ 1614],  5.00th=[ 2376], 10.00th=[ 2802], 20.00th=[ 3097],
     | 30.00th=[ 3228], 40.00th=[ 3359], 50.00th=[ 3458], 60.00th=[ 3556],
     | 70.00th=[ 3654], 80.00th=[ 3785], 90.00th=[ 3949], 95.00th=[ 4080],
     | 99.00th=[ 4359], 99.50th=[ 4424], 99.90th=[ 4752], 99.95th=[ 4883],
...

22 Replies

idata
Esteemed Contributor III

Just a quick supplement regarding the I/O errors that I reported in my last reply. I even tried the following (sketched as shell commands below):

  1. Unmount the drive.
  2. Do an NVMe format: isdct start -intelssd 2 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetaDataSettings=0
  3. Recreate the XFS filesystem.
  4. Mount the XFS filesystem.
  5. Run fstrim -v on the mount point.
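
Concretely, the sequence looks like this (a sketch, not a verbatim transcript; the mount point and the intelssd 2 to /dev/nvme2n1 mapping are taken from the transcripts elsewhere in this thread):

umount /export/beegfs/data2
isdct start -intelssd 2 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetaDataSettings=0
mkfs.xfs -f /dev/nvme2n1
mount /dev/nvme2n1 /export/beegfs/data2
fstrim -v /export/beegfs/data2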

I still got:

[root@fs11 ~]# dmesg | tail -11
[987891.677911] nvme2n1: unknown partition table
[987898.749260] XFS (nvme2n1): Mounting V4 Filesystem
[987898.752844] XFS (nvme2n1): Ending clean mount
[987948.612051] blk_update_request: I/O error, dev nvme2n1, sector 3070890712
[987948.612088] blk_update_request: I/O error, dev nvme2n1, sector 3087667912
[987948.612151] blk_update_request: I/O error, dev nvme2n1, sector 3121222312
[987948.612193] blk_update_request: I/O error, dev nvme2n1, sector 3062502112
[987948.612211] blk_update_request: I/O error, dev nvme2n1, sector 3104445112
[987948.612228] blk_update_request: I/O error, dev nvme2n1, sector 3079279312
[987948.612296] blk_update_request: I/O error, dev nvme2n1, sector 3096056512
[987948.612314] blk_update_request: I/O error, dev nvme2n1, sector 3112833712

So, unlike the SCSI drives that I used years ago, a format didn't remap the "bad" sectors. I would appreciate a hint as to how to get this issue resolved too.
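
For what it's worth, one way to check whether the controller itself is logging media errors is the NVMe SMART log (a sketch; this assumes the nvme-cli package, which is not otherwise used in this thread):

# media_errors should stay at 0 on a healthy drive.
nvme smart-log /dev/nvme2n1 | grep -i media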

idata
Esteemed Contributor III

I tried to narrow down the cause of the fstrim issue further. It seems to me that the hardware (i.e. the NVMe SSD itself) is responsible, rather than the software layer on top of it (XFS). So I decided to add a partition table first and create XFS on the partition. As is evident below, adding the partition didn't help.

Is the drive faulty? If so, then why does isdct still deem its DeviceStatus Healthy?
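
(The health check referred to here is along these lines; a sketch using the same drive index as the transcript below:)

isdct show -intelssd 2 | grep -i DeviceStatus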

[root@fs11 ~]# isdct delete -f -intelssd 2
Deleting...
- Intel SSD DC P3700 Series CVFT515400401P6JGN -
Status : Delete successful.

[root@fs11 ~]# parteed -a optimal /dev/nvme2n1 mklabel gpt
-bash: parteed: command not found
[root@fs11 ~]# parted -a optimal /dev/nvme2n1 mklabel gpt
Information: You may need to update /etc/fstab.
[root@fs11 ~]# parted /dev/nvme2n1 mkpart primary 1048576B 100%
Information: You may need to update /etc/fstab.

[root@fs11 ~]# parted /dev/nvme2n1
GNU Parted 3.1
Using /dev/nvme2n1
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: Unknown (unknown)
Disk /dev/nvme2n1: 1600GB
Sector size (logical/physical): 4096B/4096B
Partition Table: gpt
Disk Flags:

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  1600GB  1600GB               primary

(parted) quit

[root@fs11 ~]# mkfs.xfs -K -f -d agcount=24 -l size=128m,version=2 -i size=512 -s size=4096 /dev/nvme2n1
meta-data=/dev/nvme2n1           isize=512    agcount=24, agsize=16279311 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=390703446, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
...

idata
Esteemed Contributor III

Hello,

Thanks, everyone, for trying the suggestion. We would like to gather all these inputs and research them here with our department in order to work on a resolution for all of you. Please allow us some time to do the research; we will keep you posted.

NC

idata
Esteemed Contributor III

Thanks for following up. I reviewed what I had done regarding fstrim, and the tests that I have run, and came up with two additional plausible causes:

  1. In the way I run mkfs.xfs, I always use the -K option; what if I don't use it? (See the sketch after this list.)
  2. I would like to take advantage of the variable sector size support provided by the DC P3700, so we are evaluating the performance benefits of using a large SectorSize these days. Thus, the NVMe SSDs that I tested fstrim on have a 4096-byte sector size. What happens if I retain the default 512?
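
For clarity, the two variables in question look like this (a sketch; the device name and drive index are illustrative, and the LBAformat index mapping is drive-specific, with 0 as the 512-byte factory default on these drives and 3 the 4096-byte format used elsewhere in this thread):

# Variable 1: mkfs.xfs with vs. without -K
mkfs.xfs -K -f /dev/nvme2n1   # -K: do not discard blocks at mkfs time
mkfs.xfs -f /dev/nvme2n1      # default: blocks are discarded during mkfs

# Variable 2: 512-byte vs. 4096-byte logical sector, chosen at NVMe format time
isdct start -intelssd 2 -nvmeformat LBAformat=0 SecureEraseSetting=0 ProtectionInformation=0 MetaDataSettings=0   # 512
isdct start -intelssd 2 -nvmeformat LBAformat=3 SecureEraseSetting=0 ProtectionInformation=0 MetaDataSettings=0   # 4096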

My tests indicate that the Intel DC P3700 firmware, the NVMe Linux driver, or both may have a bug. The following is my evidence; please review.

We use a lot of Intel DC P3700 SSDs of various capacities - 800GB and 1.6TB are two common ones - and have run hundreds of tests on them.

We also understand that with Intel NVMe DC P3700 SSDs there is no need to run trim at all: the firmware's garbage collection takes care of such needs transparently and behind the scenes. Still, IMHO it is a good idea for well-known Linux utilities to keep working as anticipated when the sector size is changed. We ran into this issue by serendipity, and got a "nice" surprise along the way.

Case 1. mkfs.xfs without -K

We will pick one drive, /dev/nvme2n1: umount it, delete all data on it with isdct, run mkfs.xfs without the -K flag, and then run fstrim.

[root@fs11 ~]# man mkfs.xfs
[root@fs11 ~]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda       8:0    0 223.6G  0 disk
├─sda1    8:1    0   512M  0 part /boot
├─sda2    8:2    0  31.5G  0 part [SWAP]
└─sda3    8:3    0 191.6G  0 part /
sdb       8:16   0 223.6G  0 disk /export/beegfs/meta
sdc       8:32   0  59.6G  0 disk
sdd       8:48   0  59.6G  0 disk
sr0      11:0    1  1024M  0 rom
nvme0n1 259:2    0   1.5T  0 disk /export/beegfs/data0
nvme1n1 259:6    0   1.5T  0 disk /export/beegfs/data1
nvme2n1 259:7    0   1.5T  0 disk /export/beegfs/data2
nvme3n1 259:5    0   1.5T  0 disk /export/beegfs/data3
nvme4n1 259:0    0   1.5T  0 disk /export/beegfs/data4
nvme5n1 259:3    0   1.5T  0 disk
nvme6n1 259:1    0   1.5T  0 disk
nvme7n1 259:4    0   1.5T  0 disk

[root@fs11 ~]# umount /export/beegfs/data2
[root@fs11 ~]# lsblk
NAME    MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda       8:0    0 223.6G  0 disk
├─sda1    8:1    0   512M  0 part /boot
├─sda2    8:2    0  31.5G  0 part [SWAP]
└─sda3    8:3    0 191.6G  0 part /
sdb       8:16   0 223.6G  0 disk /export/beegfs/meta
sdc       8:32   0  59.6G  0 disk
sdd       8:48   0  59.6G  0 disk
sr0      11:0    1  1024M  0 rom
nvme0n1 259:2    0   1.5T  0 disk /export/beegfs/data0
nvme1n1 259:6    0   1.5T  0 disk /export/beegfs/data1
nvme2n1 259:7    0   1.5T  0 disk
nvme3n1 259:5    0   1.5T  0 disk /export/beegfs/data3
nvme4n1 259:0    0   1.5T  0 disk /export/beegfs/data4
...

idata
Esteemed Contributor III

Hello everyone,

We would like to address the performance drop question first so we don't mix the situations. Can you please confirm that this was the process you followed:

- Create a large file
- Flush the page cache
- Run fio

Now, at which step are you flushing the page cache to avoid the performance drop? Please let us know.

NC
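
For reference, the usual way to flush the Linux page cache between benchmark runs (run as root; a generic sketch, not a step prescribed in this reply):

# Flush dirty pages to disk, then drop the page cache, dentries, and inodes.
sync
echo 3 > /proc/sys/vm/drop_caches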