12-18-2018 05:26 PM
I have six P4510 drives in a RAID 6 array. Under seemingly random circumstances, I am getting kernel messages such as:
Dec 18 01:34:26 dimebox kernel: nvme nvme0: I/O 55 QID 52 timeout, reset controller
The issue seems to be triggered more frequently during periods of high I/O with many simultaneous reads and writes. The machine has yet to fail outright, but while the controller is resetting, all I/O operations stall.
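One knob worth knowing about is the NVMe driver's I/O timeout: raising it does not address whatever is stalling the commands, but it can keep a single slow command from escalating into a full controller reset. A sketch, assuming the RHEL 7.6 kernel exposes the parameter under nvme_core (older kernels use nvme instead):

# Check the current timeout in seconds (the default is 30)
cat /sys/module/nvme_core/parameters/io_timeout

# Raise it on every installed kernel's command line; 255 is the maximum
# on kernels where this is a byte-sized module parameter
grubby --update-kernel=ALL --args="nvme_core.io_timeout=255"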
The pertinent operating system information is:
[root@dimebox ~]# cat /etc/redhat-release ; uname -a
CentOS Linux release 7.6.1810 (Core)
Linux dimebox.stata.com 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@dimebox ~]# df -hl | grep dev
/dev/md127 7.0T 1.4T 5.3T 21% /
devtmpfs 63G 0 63G 0% /dev
tmpfs 63G 4.0K 63G 1% /dev/shm
/dev/md125 249M 12M 238M 5% /boot/efi
[root@dimebox ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md125 : active raid1 nvme5n1p3[5] nvme2n1p3[2] nvme4n1p3[4] nvme3n1p3[3] nvme0n1p3[0] nvme1n1p3[1]
254912 blocks super 1.0 [6/6] [UUUUUU]
bitmap: 0/1 pages [0KB], 65536KB chunk
md126 : active raid6 nvme3n1p2[3] nvme1n1p2[1] nvme5n1p2[5] nvme4n1p2[4] nvme0n1p2[0] nvme2n1p2[2]
16822272 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
md127 : active raid6 nvme1n1p1[1] nvme3n1p1[3] nvme5n1p1[5] nvme4n1p1[4] nvme0n1p1[0] nvme2n1p1[2]
7516188672 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
bitmap: 8/14 pages [32KB], 65536KB chunk
unused devices: <none>
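Something like the following fio job approximates the mixed read/write pattern that seems to trigger the resets; the target file, sizes, and queue depths here are hypothetical, so adjust them for your own array before running it:

fio --name=mixed --filename=/scratch/fio.test --ioengine=libaio --direct=1 \
    --rw=randrw --rwmixread=70 --bs=4k --iodepth=32 --numjobs=4 \
    --size=10G --runtime=300 --time_based --group_reporting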
As you can see, I also overprovisioned the drives, leaving approximately 70 GB unpartitioned on each:
[root@dimebox ~]# parted /dev/nvme0n1 unit MB print
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 2000399MB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1.05MB 1924281MB 1924280MB raid
2 1924281MB 1928592MB 4312MB raid
3 1928592MB 1928853MB 261MB fat16 raid
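If you want parted to show the spare area explicitly rather than inferring it from the partition ends, the free-space listing confirms the roughly 70 GB left unpartitioned on each member:

parted /dev/nvme0n1 unit MB print free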
In the attached nvme.txt you can see the output of isdct show -a -intelssd; nvme2.txt contains the kernel's ring buffer from the last boot, filtered on nvme entries. What I find most interesting is that not every "timeout, aborting" entry triggers a reset of the controller. I am also not certain whether those abort entries are noticeable to the user, but the "timeout, reset controller" events definitely are.
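Alongside the isdct dump, the drives' own SMART and error logs are worth pulling after each reset to rule out media-level trouble; a sketch using nvme-cli, assuming that package is installed:

nvme smart-log /dev/nvme0    # media errors, temperature, percentage used
nvme error-log /dev/nvme0    # the controller's persistent error-log entries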
Does Intel have any idea what could be triggering these events and, more importantly, how to avoid them?
12-27-2018 10:50 PM
Josh,
As it turns out, the server was incorrectly configured. Once the proper amount of RAM was installed, the I/O timeout errors went away. Please feel free to close this case; thank you.
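For anyone else chasing this, a quick sanity check of the populated DIMMs is a sketch like the following, assuming dmidecode is available on the box:

dmidecode -t memory | grep -E 'Size|Locator'    # what is actually installed in each slot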
-Pete