
S3610 SSDs have failed "READ/WRITE FPDMA QUEUED" ATA commands, frozen, then link reset

ASmit32
New Contributor II

Hi,

I have a new Linux machine with two DC S3610 1.6TB SSDs. It's running Debian jessie, so the kernel is 3.16 (3.16.0-4-amd64). About a month after installation, these errors started appearing:

Jul 30 16:30:59 snaps kernel: [186914.249429] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen

Jul 30 16:30:59 snaps kernel: [186914.250465] ata1.00: failed command: WRITE FPDMA QUEUED

Jul 30 16:30:59 snaps kernel: [186914.251505] ata1.00: cmd 61/08:00:39:db:8e/00:00:09:00:00/40 tag 0 ncq 4096 out

Jul 30 16:30:59 snaps kernel: [186914.251505] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jul 30 16:30:59 snaps kernel: [186914.253613] ata1.00: status: { DRDY }

Jul 30 16:30:59 snaps kernel: [186914.254781] ata1.00: failed command: WRITE FPDMA QUEUED

Jul 30 16:30:59 snaps kernel: [186914.255810] ata1.00: cmd 61/08:08:71:fc:4e/00:00:66:00:00/40 tag 1 ncq 4096 out

Jul 30 16:30:59 snaps kernel: [186914.255810] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jul 30 16:30:59 snaps kernel: [186914.257940] ata1.00: status: { DRDY }

Jul 30 16:30:59 snaps kernel: [186914.259086] ata1: hard resetting link

Jul 30 16:31:00 snaps kernel: [186914.577366] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Jul 30 16:31:00 snaps kernel: [186914.578307] ata1.00: configured for UDMA/133

Jul 30 16:31:00 snaps kernel: [186914.578310] ata1.00: device reported invalid CHS sector 0

Jul 30 16:31:00 snaps kernel: [186914.578311] ata1.00: device reported invalid CHS sector 0

Jul 30 16:31:00 snaps kernel: [186914.578316] ata1: EH complete

The error is always the same, and the only thing on ata1.00 is one of the SSDs. I switched the two SSDs around and the problem followed the same SSD.
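For anyone trying to read the `cmd` lines above, here is a minimal Python sketch that decodes them. It assumes the field order that the kernel's libata error handler prints (command/feature:nsect:lbal:lbam:lbah/hob fields/device) and the NCQ convention that the sector count travels in the feature fields while the tag sits in the upper bits of the sector-count field. The function name `decode_ncq_cmd` and the 512-byte sector size default are illustrative choices, not anything taken from the kernel.

```python
import re

# Field order as printed by libata's error handler:
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)

NCQ_OPCODES = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}


def decode_ncq_cmd(line, sector_size=512):
    """Decode an NCQ 'cmd' log line; return None if it is not one."""
    m = CMD_RE.search(line)
    if m is None:
        return None
    command = int(m.group(1), 16)
    if command not in NCQ_OPCODES:
        return None
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    # 48-bit LBA: the "hob" (high order byte) fields carry bits 24..47.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    # For NCQ commands the sector count is in the feature fields; 0 means 65536.
    sectors = ((hob_feature << 8) | feature) or 65536
    return {
        "command": NCQ_OPCODES[command],
        "tag": nsect >> 3,  # NCQ tag occupies bits 7:3 of the count field
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * sector_size,
    }


if __name__ == "__main__":
    sample = "ata1.00: cmd 61/08:00:39:db:8e/00:00:09:00:00/40 tag 0 ncq 4096 out"
    print(decode_ncq_cmd(sample))
```

Decoded this way, both failed commands in the log are 8-sector (4096-byte) writes, which agrees with the `ncq 4096 out` suffix, and the decoded tags (0 and 1) agree with the `tag` values the kernel printed.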

I can't force the error to happen on demand; it just seems to happen every other day or so, though not at the same time of day. All IO is held up briefly while the link is reset. The drive passes a SMART long self-test.

So is this drive faulty? If not, what can I try to fix this? If so, is there an easy way to prove it for RMA purposes?
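One way to document how often this happens, for example when putting together an RMA request, is to tally the events per port from the kernel log. The following is only a rough Python sketch: it assumes syslog-style lines shaped like the ones quoted above, and the function name `summarize` and the log path in the usage note are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# "Jul 30 16:30:59 snaps kernel: [186914.259086] ata1: hard resetting link"
EVENT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: \[\s*[\d.]+\] "
    r"(?P<dev>ata\d+(?:\.\d+)?): (?P<msg>.*)$"
)


def summarize(lines):
    """Count exceptions, failed commands and hard resets per ATA port/device."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m is None:
            continue
        msg = m.group("msg")
        if msg.startswith("exception "):
            kind = "exception"
        elif msg.startswith("failed command: "):
            kind = msg[len("failed command: "):]
        elif msg == "hard resetting link":
            kind = "hard reset"
        else:
            continue
        key = (m.group("dev"), kind)
        counts[key] += 1
        # Remember the timestamp of the first occurrence of each event type.
        first_seen.setdefault(key, m.group("stamp"))
    return counts, first_seen
```

It could be run over a saved copy of the kernel log, e.g. `with open("kern.log") as f: counts, first_seen = summarize(f)` (the file name is an assumption), and the per-device counts compared between the two SSDs.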

Jul 27 05:59:30 snaps kernel: [ 33.054376] ata1.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133

Jul 27 05:59:30 snaps kernel: [ 33.054474] ata1.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)

Jul 27 05:59:30 snaps kernel: [ 33.054567] ata2.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133

Jul 27 05:59:30 snaps kernel: [ 33.054657] ata2.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)

$ sudo smartctl -i /dev/sda

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)

Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===

Device Model: INTEL SSDSC2BX016T4

Serial Number: BTHC511604V41P6PGN

LU WWN Device Id: 5 5cd2e4 04b7b1bfa

Firmware Version: G2010110

User Capacity: 1,600,321,314,816 bytes [1.60 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Rotation Rate: Solid State Device

Form Factor: 2.5 inches

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: ACS-2 T13/2015-D revision 3

SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is: Fri Jul 31 11:04:09 2015 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

$ sudo smartctl -i /dev/sdb

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)

Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===

Device Model: INTEL SSDSC2BX016T4

Serial Number: BTHC511604SD1P6PGN

LU WWN Device Id: 5 5cd2e4 04b7b1ba2

Firmware Version: G2010110

User Capacity: 1,600,321,314,816 bytes [1.60 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Rotation Rate: Solid State Device

Form Factor: 2.5 inches

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: ACS-2 T13/2015-D revision 3

SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is: Fri Jul 31 11:04:35 2015 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

Message was edited by: Andy Smith. Now seeing the same problems with the other SSD, so this is not restricted to a single drive.

45 REPLIES

jbenavides
Valued Contributor II

Hello gschoenberger,

We understand your concern and we will consider your feedback about this matter for the future.

Please remember that if you have any doubts, or if you require assistance to locate documentation, you can also Contact Support (http://www.intel.com/p/en_US/support/contactsupport) to engage the support center in your region.

GSchö
New Contributor

Just to let you know, our "task abort" problem was solved with the latest firmware update!

It would be nice if Intel could publish a statement about the problem and confirm that it is fixed with the new firmware.

Cheers Georg

YArab
New Contributor

Alright, so we're basically hitting the same problem.

The setup is the following: LSI3008 in IT mode.

82:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)

mpt3sas0: LSISAS3008: FWVersion(10.00.00.00), ChipRevision(0x02), BiosVersion(08.25.00.00)

mpt3sas version 04.100.00.00-rh loaded

4x Intel 730 Series SSDs, SSDSC2BP480G4 (480GB), running their factory firmware (no firmware update is available on the Intel site).

Software md raid10.

LVM on top of that md raid.

Using the randomio utility (http://arctic.org/~dean/randomio/) in write mode (32 threads):

./randomio /dev/vg/200g 32 1 0.1 4096 60 1

This results in the following messages (a lot of them):

Nov 29 04:11:03 x41-1 kernel: sd 7:0:0:0: attempting task abort! scmd(ffff880056294a80)

Nov 29 04:11:03 x41-1 kernel: sd 7:0:0:0: [sda] CDB: Write(10): 2a 00 04 bd 4c d8 00 00 08 00

Nov 29 04:11:03 x41-1 kernel: scsi target7:0:0: handle(0x0009), sas_address(0x4433221100000000), phy(0)

Nov 29 04:11:03 x41-1 kernel: scsi target7:0:0: enclosure_logical_id(0x5003048019734c00), slot(0)

Nov 29 04:11:03 x41-1 kernel: sd 7:0:0:0: task abort: SUCCESS scmd(ffff880056294a80)

Nov 29 04:11:03 x41-1 kernel: sd 7:0:0:0: attempting task abort! scmd(ffff88004a3d6dc0)

Nov 29 04:11:03 x41-1 kernel: sd 7:0:0:0: [sda] CDB: Write(10): 2a 00 01 f7 6f d8 00 00 08 00

Nov 29 04:11:03 x41-1 kernel: scsi target7:0:0: handle(0x0009), sas_address(0x4433221100000000), phy(0)

Nov 29 04:11:03 x41-1 kernel: scsi target7:0:0: enclosure_logical_id(0x5003048019734c00), slot(0)

Nov 29 04:11:03 x41-1 kernel: sd 7:0:0:0: task abort: SUCCESS scmd(ffff88004a3d6dc0)
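To see which writes were being aborted, the `CDB: Write(10)` bytes can be decoded. Below is a small Python sketch based on the standard SCSI WRITE(10) layout (opcode 0x2A, a 4-byte big-endian LBA in bytes 2-5, and a 2-byte transfer length in bytes 7-8); the helper name `decode_write10` is just for illustration.

```python
def decode_write10(cdb_hex):
    """Return (lba, blocks) for a WRITE(10) CDB given as a hex string."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return lba, blocks


if __name__ == "__main__":
    for cdb in ("2a 00 04 bd 4c d8 00 00 08 00", "2a 00 01 f7 6f d8 00 00 08 00"):
        print(decode_write10(cdb))
```

Both aborted commands come out as 8-block writes, which, assuming 512-byte logical blocks, would be 4096 bytes each and so consistent with the 4096 passed to randomio.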

When testing on a bare drive (/dev/sda), or on an LV created on a drive partition, the issue doesn't manifest itself. So it's some kind of weird combination of the LSI HBA, MD raid and the Intel SSDs.

Update: I was able to reproduce the hang on bare block devices (without LVM and MD raid). So it's down to just the combination of the LSI HBA and the drives now.

Suggestions are welcome.

jbenavides
Valued Contributor II

Hello yuri.a,

The issue you reported is different from the original topic of this thread.

This thread is related to "READ/WRITE FPDMA QUEUED" ATA commands and link resets with the Intel® SSD DC S3610 and Intel® SSD DC S3710 Series. We are glad to inform you that this has been addressed with the latest firmware updates for these drive models.

In your case, the error is being reported for the HBA, not the SSD. In this thread, the user Telbizov reported the same issue with the LSI3008 HBA, but using different SSD models. So we would advise you to contact LSI support for assistance on this problem.

Here are the recommendations we originally provided to Telbizov about this matter:

The errors "attempting task abort! scmd" may be caused by different reasons. You might want to check with the Computer Manufacturer Support (http://www.intel.com/support/oems.htm), or LSI, to confirm if the SSDs that you are using were tested/validated to be used with the storage adapter, drive cage and server you have.

Here are some actions that may help in this case:

- Update the Firmware and driver of your LSI storage controller.

- Update the system BIOS.

- Confirm if the combination of server, controller, drive cage, etc. is supported by the Computer Manufacturer Support (http://www.intel.com/support/oems.htm).

That's exactly what I did -- contacted the vendor (Supermicro). Let's see if this gets solved. For now, I'd recommend that anyone looking to buy an LSI 3008 (IT mode) based adapter for use with SSDs hold off until LSI/Avago implements a fix.