02-06-2018 08:04 AM
Hi,
We are experiencing persistent I/O request timeouts on Linux with P3520/P4600 SSDs. We have tried several kernels (3.10, 4.4, 4.9) and see the timeouts on all of them. The P4600 seems more prone to them than the P3520, though we see them on the latter as well. Both drives have the latest firmware installed and are housed in the same machine (Supermicro 5018R-WR with X10SRW-F motherboard and E5-1650 V4 CPU). We can reproduce the timeouts simply by running mkfs -t xfs on the drive.
Here is the output from isdct (version isdct-3.0.9.400-17.x86_64):
- Intel SSD DC P3520 Series CVPF717100L01P2JGN -
Bootloader : MB1B0105
DevicePath : /dev/nvme0n1
DeviceStatus : Healthy
Firmware : MDV10271
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 0
ModelNumber : INTEL SSDPEDMX012T7
ProductFamily : Intel SSD DC P3520 Series
SerialNumber : CVPF717100L01P2JGN
- Intel SSD DC P4600 Series BTLE736007F54P0KGN -
Bootloader : 0110
DevicePath : /dev/nvme1n1
DeviceStatus : Healthy
Firmware : QDV10150
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 1
ModelNumber : INTEL SSDPEDKE040T7
ProductFamily : Intel SSD DC P4600 Series
SerialNumber : BTLE736007F54P0KGN
Here are the messages the 4.9 kernel prints when using the P4600:
[ 151.297903] nvme nvme1: I/O 568 QID 1 timeout, aborting
[ 151.303130] nvme nvme1: I/O 569 QID 1 timeout, aborting
[ 151.308347] nvme nvme1: I/O 570 QID 1 timeout, aborting
[ 151.313562] nvme nvme1: I/O 571 QID 1 timeout, aborting
[ 151.355465] nvme nvme1: completing aborted command with status: 0000
[ 151.411273] nvme nvme1: completing aborted command with status: 0000
[ 151.466903] nvme nvme1: completing aborted command with status: 0000
[ 151.522609] nvme nvme1: completing aborted command with status: 0000
[ 151.578226] nvme nvme1: completing aborted command with status: 0000
...
[ 165.395295] nvme nvme1: Abort status: 0x0
[ 165.399296] nvme nvme1: Abort status: 0x0
[ 165.403299] nvme nvme1: Abort status: 0x0
[ 165.407304] nvme nvme1: Abort status: 0x0
We would appreciate your help in resolving this issue.
Regards,
Shantanu Goel
03-02-2018 05:03 AM
I had similar problems and discovered they had to do with discards. In my case I was creating an LVM volume, and I got the timeouts when destroying it (lvremove). My problems disappeared after setting "issue_discards = 0" in /etc/lvm/lvm.conf.
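To see what a given machine is currently set to, something like the following could be used. It is only a rough sketch: the function name is made up for illustration, and it does a simple line scan rather than a full parse of the lvm.conf syntax.

```python
import re


def issue_discards_setting(conf_text):
    """Return the issue_discards value found in lvm.conf text, or None if unset.

    Hypothetical helper: scans line by line and ignores '#' comments.
    """
    for line in conf_text.splitlines():
        # Strip trailing comments, which start with '#' in lvm.conf.
        line = line.split("#", 1)[0]
        match = re.search(r"\bissue_discards\s*=\s*(\d+)", line)
        if match:
            return int(match.group(1))
    return None
```

Calling it on the contents of /etc/lvm/lvm.conf should return 0 once the setting above is in place, and None if the option only appears in comments.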
mkfs.xfs also has an option "-K", which tells it not to attempt to discard blocks at mkfs time. You might want to try it; the problem might vanish.
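If the filesystem is created from a script, the -K variant could be wrapped roughly like this. The function names are my own, and the device path in the usage note is just the one from the isdct output above.

```python
import subprocess


def mkfs_xfs_command(device, skip_discard=True):
    """Build the mkfs.xfs argument list; -K skips the discard pass at mkfs time."""
    cmd = ["mkfs.xfs"]
    if skip_discard:
        cmd.append("-K")
    cmd.append(device)
    return cmd


def make_xfs(device, skip_discard=True):
    """Run mkfs.xfs on the device. Destroys existing data and needs root."""
    subprocess.run(mkfs_xfs_command(device, skip_discard), check=True)
```

For example, make_xfs("/dev/nvme1n1") would run "mkfs.xfs -K /dev/nvme1n1". If the timeouts only show up with skip_discard=False, that points at the discard pass.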
It seems those SSDs do not support discards; maybe Intel can confirm here.
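In the meantime, one way to check that guess from the kernel side is to read discard_max_bytes from sysfs, where a value of 0 means the kernel will not send discards to the device. This is a sketch only: the helper names and the sysfs_root parameter are my own additions, the latter so the code can be pointed at a test directory.

```python
from pathlib import Path


def discard_max_bytes(device, sysfs_root="/sys/block"):
    """Return queue/discard_max_bytes for a block device such as 'nvme1n1'."""
    path = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    return int(path.read_text().strip())


def supports_discard(device, sysfs_root="/sys/block"):
    """True if the kernel reports a nonzero discard limit for the device."""
    return discard_max_bytes(device, sysfs_root) > 0
```

A nonzero value only shows that the drive advertises discard support, not that large discards complete within the kernel's I/O timeout.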