I/O request timeouts on Linux with Intel P3520/P4600 NVMe PCIe SSDs
02-06-2018 08:04 AM
Hi,
We are experiencing persistent I/O request timeouts on Linux with P3520/P4600 SSDs. We have tried several kernels (3.10, 4.4, 4.9) and see the timeouts on all of them. The P4600 seems to be more prone to them than the P3520, though we see them on the latter as well. We have the latest firmware installed on both drives, which are housed in the same machine (Supermicro 5018R-WR with X10SRW-F motherboard and E5-1650 V4 CPU). We can reproduce the timeouts simply by running mkfs -t xfs on the drive.
Here is the output from isdct (version isdct-3.0.9.400-17.x86_64):
- Intel SSD DC P3520 Series CVPF717100L01P2JGN -
Bootloader : MB1B0105
DevicePath : /dev/nvme0n1
DeviceStatus : Healthy
Firmware : MDV10271
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 0
ModelNumber : INTEL SSDPEDMX012T7
ProductFamily : Intel SSD DC P3520 Series
SerialNumber : CVPF717100L01P2JGN
- Intel SSD DC P4600 Series BTLE736007F54P0KGN -
Bootloader : 0110
DevicePath : /dev/nvme1n1
DeviceStatus : Healthy
Firmware : QDV10150
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 1
ModelNumber : INTEL SSDPEDKE040T7
ProductFamily : Intel SSD DC P4600 Series
SerialNumber : BTLE736007F54P0KGN
Here are the messages the 4.9 kernel prints when using the P4600:
[ 151.297903] nvme nvme1: I/O 568 QID 1 timeout, aborting
[ 151.303130] nvme nvme1: I/O 569 QID 1 timeout, aborting
[ 151.308347] nvme nvme1: I/O 570 QID 1 timeout, aborting
[ 151.313562] nvme nvme1: I/O 571 QID 1 timeout, aborting
[ 151.355465] nvme nvme1: completing aborted command with status: 0000
[ 151.411273] nvme nvme1: completing aborted command with status: 0000
[ 151.466903] nvme nvme1: completing aborted command with status: 0000
[ 151.522609] nvme nvme1: completing aborted command with status: 0000
[ 151.578226] nvme nvme1: completing aborted command with status: 0000
...
[ 165.395295] nvme nvme1: Abort status: 0x0
[ 165.399296] nvme nvme1: Abort status: 0x0
[ 165.403299] nvme nvme1: Abort status: 0x0
[ 165.407304] nvme nvme1: Abort status: 0x0
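To see how the timeouts are distributed when comparing runs or kernels, the messages above can be tallied per controller and queue ID. This is a minimal sketch, assuming the exact message format quoted above (`nvme nvmeX: I/O <tag> QID <qid> timeout, aborting`); the function name `tally_timeouts` is hypothetical, and in practice the input would be captured `dmesg` output.

```python
import re
from collections import Counter

# Matches lines like "[ 151.297903] nvme nvme1: I/O 568 QID 1 timeout, aborting".
# The format is assumed from the 4.9 kernel messages quoted in this thread.
TIMEOUT_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout, aborting"
)


def tally_timeouts(log_text):
    """Count timed-out I/O commands per (controller, queue ID) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            counts[(m.group("ctrl"), int(m.group("qid")))] += 1
    return counts


if __name__ == "__main__":
    sample = """\
[ 151.297903] nvme nvme1: I/O 568 QID 1 timeout, aborting
[ 151.303130] nvme nvme1: I/O 569 QID 1 timeout, aborting
[ 151.355465] nvme nvme1: completing aborted command with status: 0000
[ 165.395295] nvme nvme1: Abort status: 0x0
"""
    # Only the two "timeout, aborting" lines are counted.
    print(tally_timeouts(sample))  # Counter({('nvme1', 1): 2})
```

The "completing aborted command" and "Abort status" lines are deliberately ignored here; matching them as well would need separate patterns.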
We would appreciate your help in resolving this issue.
Regards,
Shantanu Goel
02-09-2018 12:26 PM
Hi,
I power-cycled the system and ran the load command again, but the tool still reports that the drive has the latest firmware. When I first downloaded and ran isdct 3.0.10, it did report newer firmware and successfully updated the drive. All commands were run as root.
Here is the version of the tool:
# isdct version
- Version Information -
Name: Intel(R) Data Center Tool
Version: 3.0.10
Description: Interact and configure Intel SSDs.
When I attempt to load the firmware now, this is the output I get from the tool:
# isdct load -intelssd 1
WARNING! You have selected to update the drives firmware!
Proceed with the update? (Y|N): Y
Updating firmware...
- Intel SSD DC P4600 Series BTLE736007F54P0KGN -
Status : The selected Intel SSD contains current firmware as of this tool release.
# isdct show -intelssd 1
- Intel SSD DC P4600 Series BTLE736007F54P0KGN -
Bootloader : 0122
DevicePath : /dev/nvme1n1
DeviceStatus : Healthy
Firmware : QDV10170
FirmwareUpdateAvailable : The selected Intel SSD contains current firmware as of this tool release.
Index : 1
ModelNumber : INTEL SSDPEDKE040T7
ProductFamily : Intel SSD DC P4600 Series
SerialNumber : BTLE736007F54P0KGN
The RHEL version is 6.9.
Thanks,
Shantanu
02-09-2018 01:08 PM
Hello Shantanu,
There may be a software compatibility issue causing this: the Intel® SSD Data Center Tool is supported only on a specific set of operating systems, and RHEL 6.9 is not one of them. Do you have access to a PC running one of the supported operating systems? If so, could you please try again to install the latest firmware using the official tool?
It's important for us to find out whether version QDV10190 solves the issue you are experiencing. I'll be waiting for your response.
Regards,
Andres V.
02-09-2018 01:46 PM
Hi,
RHEL 6.6 is very old (released in 2014) and we have long since upgraded our systems to 6.9, so I am unable to test on that release. I am surprised your tool releases have not kept up with vendor OS releases. Both isdct versions 3.0.9 and 3.0.10 did update the firmware to a newer release without complaint, so it is not clear what the nature of the incompatibility is here, since the tool itself does not print a message indicating as such.
Shantanu
02-09-2018 02:35 PM
Hello Shantanu,
Thank you for your feedback.
Regarding your comment:
"Both isdct versions 3.0.9 and 3.0.10 did update the firmware to a newer release without complaint, so it is not clear what the nature of the incompatibility is here, since the tool itself does not print a message indicating as such."
Are you referring to an update to firmware version QDV10170 or to firmware version QDV10190? Have you been able to update the SSDs that do not show the persistent I/O request timeouts? Do you have any Intel® SSD DC P4600 with firmware version QDV10190?
Regards,
Andres V.
02-09-2018 03:11 PM
Hi,
I was referring to the fact that on the test machine we initially used isdct 3.0.9 to upgrade the P4600 firmware version from QDV10130 to QDV10150 and isdct 3.0.10 subsequently from QDV10150 to QDV10170. As I posted in the output above isdct 3.0.10 shows QDV10170 as the latest revision of the firmware available and states that the drive already has that revision installed on it. It does not report QDV10190 as being available. Could this be a discrepancy in the firmware revision between the documentation and the tool itself?
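When checking which revision each drive actually reports, the `Firmware` field can be pulled out of `isdct show` output and compared against the revision support asked about (QDV10190). This is a minimal sketch, assuming the `Key : Value` layout and `- <drive> -` header lines shown earlier in this thread; `parse_isdct_show` and `is_older` are hypothetical helpers. The comparison also assumes revisions share a fixed-width format with the same prefix (as QDV10130 through QDV10190 do), so plain string ordering matches release ordering; that is not guaranteed in general, and mixed families such as MDV10271 (P3520) are rejected.

```python
def parse_isdct_show(text):
    """Map each drive header to a dict of its 'Key : Value' fields."""
    drives = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("- ") and line.endswith(" -"):
            # Header such as "- Intel SSD DC P4600 Series BTLE736007F54P0KGN -"
            current = line[2:-2].strip()
            drives[current] = {}
        elif current is not None and " : " in line:
            # Split only once: some values (e.g. FirmwareUpdateAvailable) are sentences.
            key, value = line.split(" : ", 1)
            drives[current][key.strip()] = value.strip()
    return drives


def is_older(installed, target):
    """Return True if `installed` sorts before `target`.

    Assumption: both revisions use the same fixed-width format and prefix
    (e.g. "QDV10170" vs "QDV10190"), so string comparison orders them.
    """
    if len(installed) != len(target) or installed[:3] != target[:3]:
        raise ValueError("revisions are not directly comparable")
    return installed < target


if __name__ == "__main__":
    sample = """\
- Intel SSD DC P4600 Series BTLE736007F54P0KGN -
Bootloader : 0122
DevicePath : /dev/nvme1n1
Firmware : QDV10170
"""
    for name, fields in parse_isdct_show(sample).items():
        fw = fields["Firmware"]
        print(name, fw, "older than QDV10190:", is_older(fw, "QDV10190"))
```

Run against the output above, this flags QDV10170 as older than QDV10190, which is the gap between what the tool offers and what support expects.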
The P4600s we tried deploying in production have firmware QDV10130 and they all exhibit the timeouts, so until this issue is resolved, these drives are unusable for us. We have had great success with your SATA SSDs (S3700, S3600, S3610, S3520) on various versions of the OS and Linux kernel, which is why we purchased their NVMe counterparts. As of now, however, our experience with them has been disappointing, so we would really appreciate help in resolving the issue.
Thanks,
Shantanu