02-06-2018 08:04 AM
Hi,
We are experiencing persistent I/O request timeouts on Linux with P3520/P4600 SSDs. We have tried multiple kernels (3.10, 4.4, 4.9) and see the timeouts on all of them. The P4600 seems more prone to them than the P3520, though we see them on the latter as well. Both drives run the latest firmware and are housed in the same machine (Supermicro 5018R-WR with X10SRW-F motherboard and E5-1650 V4 CPU). We can reproduce the timeouts simply by running mkfs -t xfs on the drive.
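For reference, the reproduction is just the following (the device path matches our P4600 listed below; note that mkfs destroys any data on the target drive):

```shell
# Triggers the I/O timeouts on the P4600 in our setup
# (WARNING: this formats the drive and destroys its contents)
mkfs -t xfs /dev/nvme1n1
```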
Here is the output from isdct (version isdct-3.0.9.400-17.x86_64):
- Intel SSD DC P3520 Series CVPF717100L01P2JGN -
Bootloader : MB1B0105
DevicePath : /dev/nvme0n1
DeviceStatus : Healthy
Firmware : MDV10271
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 0
ModelNumber : INTEL SSDPEDMX012T7
ProductFamily : Intel SSD DC P3520 Series
SerialNumber : CVPF717100L01P2JGN
- Intel SSD DC P4600 Series BTLE736007F54P0KGN -
Bootloader : 0110
DevicePath : /dev/nvme1n1
DeviceStatus : Healthy
Firmware : QDV10150
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 1
ModelNumber : INTEL SSDPEDKE040T7
ProductFamily : Intel SSD DC P4600 Series
SerialNumber : BTLE736007F54P0KGN
Here are the messages the 4.9 kernel prints when using the P4600:
[ 151.297903] nvme nvme1: I/O 568 QID 1 timeout, aborting
[ 151.303130] nvme nvme1: I/O 569 QID 1 timeout, aborting
[ 151.308347] nvme nvme1: I/O 570 QID 1 timeout, aborting
[ 151.313562] nvme nvme1: I/O 571 QID 1 timeout, aborting
[ 151.355465] nvme nvme1: completing aborted command with status: 0000
[ 151.411273] nvme nvme1: completing aborted command with status: 0000
[ 151.466903] nvme nvme1: completing aborted command with status: 0000
[ 151.522609] nvme nvme1: completing aborted command with status: 0000
[ 151.578226] nvme nvme1: completing aborted command with status: 0000
...
[ 165.395295] nvme nvme1: Abort status: 0x0
[ 165.399296] nvme nvme1: Abort status: 0x0
[ 165.403299] nvme nvme1: Abort status: 0x0
[ 165.407304] nvme nvme1: Abort status: 0x0
We would appreciate your help in resolving this issue.
Regards,
Shantanu Goel
02-22-2018 04:54 PM
Hello Shantanu,
I would like to inform you that we have performed several tests to try to reproduce the issue you are experiencing, and we are still researching what could be causing the timeout error messages. I'll contact you as soon as we find something relevant.
Regards,
Andres V.
03-01-2018 04:31 PM
Hello Shantanu,
I just want to let you know that we have been trying to reproduce the issue you are experiencing, but so far we have not seen the same output that you do.
Would it be possible for you to run the command on CentOS 7 with kernel 4.15 and share the output?
In case you have any update, don't hesitate to contact us. I'll be waiting for your response.
Regards,
Andres V.
03-02-2018 09:40 AM
Hello Shantanu,
I was wondering if you would be interested in trying the workaround kindly suggested by community member berthierp?
In case you do, please share your results with us.
Regards,
Andres V.
03-05-2018 09:02 AM
Hi,
Thank you both for your suggestions. I can confirm that either increasing the timeout to 30 seconds, as the upstream kernel.org tree has done, or passing the -K flag to mkfs fixes the issue.
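For completeness, here is a sketch of both workarounds (the device path is our P4600; `nvme_core.io_timeout` is the kernel module parameter behind the timeout, though on some older kernels it may live under the `nvme` module instead):

```shell
# Workaround 1: pass -K to mkfs.xfs to skip the discard (TRIM) pass,
# which is what triggers the timeouts during mkfs
mkfs -t xfs -K /dev/nvme1n1

# Workaround 2: raise the NVMe I/O timeout to 30 seconds.
# At runtime (if the parameter is writable on your kernel):
echo 30 > /sys/module/nvme_core/parameters/io_timeout

# Or persistently via a modprobe option (rebuild the initramfs afterwards):
echo "options nvme_core io_timeout=30" > /etc/modprobe.d/nvme-timeout.conf
```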
Regards,
Shantanu
03-06-2018 10:18 AM
Hello Shantanu,
I'm glad to hear that you found a solution to the issue. Thank you for sharing the workaround; the community really appreciates it.
In case you have another question, don't hesitate to contact us.
Regards,
Andres V.