S3610 SSDs have failed "READ/WRITE FPDMA QUEUED" ATA commands, frozen, then link reset

ASmit32
New Contributor II

Hi,

I have a new Linux machine with two DC S3610 1.6TB SSDs. It runs Debian jessie, so the kernel is 3.16 (3.16.0-4-amd64). About a month after installation, these errors started appearing:

Jul 30 16:30:59 snaps kernel: [186914.249429] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
Jul 30 16:30:59 snaps kernel: [186914.250465] ata1.00: failed command: WRITE FPDMA QUEUED
Jul 30 16:30:59 snaps kernel: [186914.251505] ata1.00: cmd 61/08:00:39:db:8e/00:00:09:00:00/40 tag 0 ncq 4096 out
Jul 30 16:30:59 snaps kernel: [186914.251505] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 30 16:30:59 snaps kernel: [186914.253613] ata1.00: status: { DRDY }
Jul 30 16:30:59 snaps kernel: [186914.254781] ata1.00: failed command: WRITE FPDMA QUEUED
Jul 30 16:30:59 snaps kernel: [186914.255810] ata1.00: cmd 61/08:08:71:fc:4e/00:00:66:00:00/40 tag 1 ncq 4096 out
Jul 30 16:30:59 snaps kernel: [186914.255810] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 30 16:30:59 snaps kernel: [186914.257940] ata1.00: status: { DRDY }
Jul 30 16:30:59 snaps kernel: [186914.259086] ata1: hard resetting link
Jul 30 16:31:00 snaps kernel: [186914.577366] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 30 16:31:00 snaps kernel: [186914.578307] ata1.00: configured for UDMA/133
Jul 30 16:31:00 snaps kernel: [186914.578310] ata1.00: device reported invalid CHS sector 0
Jul 30 16:31:00 snaps kernel: [186914.578311] ata1.00: device reported invalid CHS sector 0
Jul 30 16:31:00 snaps kernel: [186914.578316] ata1: EH complete

The error is always the same, and the only thing on ata1.00 is one of the SSDs. I switched the two SSDs around and the problem followed the same SSD.
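For anyone wanting to read the `cmd` lines in detail, here is a minimal Python sketch that unpacks the bytes of a failed NCQ command. It assumes the byte order that libata's error handler prints (command/feature:count:LBA low:mid:high, then the five high-order counterparts, then device) and the NCQ convention that the sector count is carried in the feature fields and the tag in the upper bits of the count field. The names `decode_ncq_cmd` and `ATA_OPCODES` are made up for this example.

```python
import re

# Only the two NCQ opcodes named in the thread title; anything else is shown as hex.
ATA_OPCODES = {0x60: "READ FPDMA QUEUED", 0x61: "WRITE FPDMA QUEUED"}

# Twelve hex bytes separated by '/' or ':' after the word "cmd".
CMD_RE = re.compile(r"cmd ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_ncq_cmd(line):
    """Decode the taskfile bytes of a libata 'cmd' line (assumed NCQ layout)."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no libata cmd field found")
    fields = [int(x, 16) for x in re.split(r"[/:]", m.group(1))]
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device) = fields
    # 48-bit LBA: the high-order ("hob") bytes form the upper 24 bits.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    return {
        "command": ATA_OPCODES.get(command, hex(command)),
        # NCQ commands carry the sector count in the feature fields.
        "sectors": hob_feature << 8 | feature,
        # The NCQ tag sits in bits 7:3 of the count field.
        "tag": nsect >> 3,
        "lba": lba,
    }
```

Applied to the first failed command in the log, this reports a WRITE FPDMA QUEUED of 8 sectors (8 x 512 = 4096 bytes, matching "ncq 4096 out") with tag 0, and the second command decodes to tag 1, which agrees with what the kernel printed.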

I can't force the error to happen on demand. It seems to happen every other day or so, though not at the same time of day. All IO is held up briefly while the link is reset. The drive passes a SMART long self-test.
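Since the errors are intermittent, a tally per day and per device can help show the pattern. The following is a rough Python sketch that counts libata "exception" lines in syslog-formatted kernel messages like the ones above. The function name `count_exceptions` is invented for this example, and the regular expression only assumes the line format seen in this post.

```python
import re
from collections import Counter

# Matches e.g. "Jul 30 16:30:59 snaps kernel: [186914.249429] ata1.00: exception ..."
EXC_RE = re.compile(
    r"^(\w{3}\s+\d+) \d{2}:\d{2}:\d{2} \S+ kernel: \[\s*[\d.]+\] "
    r"(ata\d+\.\d+): exception "
)


def count_exceptions(lines):
    """Count libata 'exception' events per (day, device) pair."""
    counts = Counter()
    for line in lines:
        m = EXC_RE.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts
```

It can be fed any iterable of log lines, for example an open file handle on wherever the system stores kernel messages (the location varies between setups, so no path is assumed here).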

So is this drive faulty? If not, what can I try to fix this? If so, is there an easy way to prove it for RMA purposes?

Jul 27 05:59:30 snaps kernel: [ 33.054376] ata1.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133
Jul 27 05:59:30 snaps kernel: [ 33.054474] ata1.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)
Jul 27 05:59:30 snaps kernel: [ 33.054567] ata2.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133
Jul 27 05:59:30 snaps kernel: [ 33.054657] ata2.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)

$ sudo smartctl -i /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: INTEL SSDSC2BX016T4
Serial Number: BTHC511604V41P6PGN
LU WWN Device Id: 5 5cd2e4 04b7b1bfa
Firmware Version: G2010110
User Capacity: 1,600,321,314,816 bytes [1.60 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jul 31 11:04:09 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

$ sudo smartctl -i /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: INTEL SSDSC2BX016T4
Serial Number: BTHC511604SD1P6PGN
LU WWN Device Id: 5 5cd2e4 04b7b1ba2
Firmware Version: G2010110
User Capacity: 1,600,321,314,816 bytes [1.60 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jul 31 11:04:09 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
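When comparing several drives, it can be easier to pull the model and firmware fields out of the `smartctl -i` text than to eyeball them. This is a small Python sketch under the assumption that the output keeps the "Key: value" shape shown above; `parse_smartctl_info` is a hypothetical helper name, not part of smartmontools.

```python
def parse_smartctl_info(text):
    """Parse 'Key: value' lines from `smartctl -i` output into a dict.

    Lines without a colon (banner, section header) are skipped. When a key
    repeats, as "SMART support is" does, the first value is kept.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info.setdefault(key.strip(), value.strip())
    return info
```

With the two outputs above, `parse_smartctl_info(...)["Firmware Version"]` would give `G2010110` for both drives, which makes it simple to check a whole set of machines for drives that have not yet been updated.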

Message was edited by: Andy Smith

Now seeing the same problems with the other SSD, so this is not restricted to a single drive.

45 REPLIES

RTelb
New Contributor

Hello,

I am another one of those unfortunate users of Intel S3710's. I have 12 SuperMicro servers, each with 2 SSDs in software mdraid. I run Debian 8.1. SSDs are connected to an LSI3008 HBA controller running in IT mode. I have been experiencing the same kind of problems as everyone else in this thread across all machines and all drives.

Yesterday I flashed the firmware on 8 of those 12 servers (16 drives total) with the latest one provided in the Data Center Tool 2.2.4 from August 21st.

The firmware revision of those SSDs is now G2010140. I would say that things are generally more stable than before but nevertheless there are still problems reported like the ones below:

[76133.744944] sd 0:0:1:0: attempting task abort! scmd(ffff881fb7735640)
[76133.744951] sd 0:0:1:0: [sdb] CDB:
[76133.744954] Synchronize Cache(10): 35 00 00 00 00 00 00 00 00 00
[76133.744966] scsi target0:0:1: handle(0x000a), sas_address(0x4433221101000000), phy(1)
[76133.744969] scsi target0:0:1: enclosure_logical_id(0x500304801acd7500), slot(1)
[76133.778338] sd 0:0:1:0: task abort: SUCCESS scmd(ffff881fb7735640)
[76133.778346] sd 0:0:1:0: attempting task abort! scmd(ffff881fc392cc40)
[76133.778350] sd 0:0:1:0: [sdb] CDB:
[76133.778352] Write(10): 2a 00 00 4d 6e 90 00 00 08 00
[76133.778363] scsi target0:0:1: handle(0x000a), sas_address(0x4433221101000000), phy(1)
[76133.778366] scsi target0:0:1: enclosure_logical_id(0x500304801acd7500), slot(1)
[76133.778395] sd 0:0:1:0: task abort: SUCCESS scmd(ffff881fc392cc40)
[76133.778399] sd 0:0:1:0: attempting task abort! scmd(ffff8817022f16c0)
[76133.778401] sd 0:0:1:0: [sdb] CDB:
[76133.778403] Write(10): 2a 00 00 4d 6e a0 00 00 10 00
[76133.778411] scsi target0:0:1: handle(0x000a), sas_address(0x4433221101000000), phy(1)
[76133.778413] scsi target0:0:1: enclosure_logical_id(0x500304801acd7500), slot(1)
[76133.778421] sd 0:0:1:0: task abort: SUCCESS scmd(ffff8817022f16c0)
[76133.778424] sd 0:0:1:0: attempting task abort! scmd(ffff881f1edf6040)
[76133.778426] sd 0:0:1:0: [sdb] CDB:
[76133.778427] Write(10): 2a 00 00 4d 6f 18 00 00 68 00
[76133.778436] scsi target0:0:1: handle(0x000a), sas_address(0x4433221101000000), phy(1)
[76133.778438] scsi target0:0:1: enclosure_logical_id(0x500304801acd7500), slot(1)
[76133.778446] sd 0:0:1:0: task abort: SUCCESS scmd(ffff881f1edf6040)
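The CDB bytes in the log above can be decoded to see which blocks the aborted commands touched. Below is a minimal Python sketch for 10-byte CDBs, assuming the standard layout shared by WRITE(10) and SYNCHRONIZE CACHE(10): opcode in byte 0, a big-endian LBA in bytes 2 to 5, and a big-endian block count in bytes 7 and 8. The helper name `decode_cdb10` and the small opcode table are illustrative only.

```python
# A few 10-byte opcodes relevant to this log; others are shown as hex.
SCSI_OPCODES = {
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
    0x35: "SYNCHRONIZE CACHE(10)",
}


def decode_cdb10(line):
    """Decode a 10-byte SCSI CDB printed as space-separated hex bytes."""
    # The hex bytes follow the last colon on the log line.
    hex_bytes = line.rsplit(":", 1)[-1].split()
    cdb = bytes(int(b, 16) for b in hex_bytes)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "opcode": SCSI_OPCODES.get(cdb[0], hex(cdb[0])),
        "lba": int.from_bytes(cdb[2:6], "big"),
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }
```

For the first aborted write, `2a 00 00 4d 6e 90 00 00 08 00`, this yields a WRITE(10) of 8 blocks starting at LBA 0x004d6e90. The aborted writes sit close together on the disk, which fits a burst of queued commands timing out at once.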

As far as I can tell the problem has not been solved. I am still experiencing the same kind of problems. Any further ideas? Did anyone else see a complete improvement?

Thank you,

Rumen Telbizov

http://telbizov.com/ Unix Systems Administrator

RTelb
New Contributor

Is anyone else still experiencing the same problems after flashing the firmware to G2010140? Can anyone report success with no further problems, just so we know we're on the right track?

jonathan_intel, have you guys tested G2010140 on SSDs behind an LSI3008 HBA?

I am experiencing the same problem when running S3700's as well and I don't see any newer firmware for that one.

jbenavides
Valued Contributor II

Hello Telbizov,

The issue reported in your system is different from the original one described in this thread. The error is not the same, and since it also happens with the Intel® SSD DC S3700 Series, it appears to have a different origin.

I did a quick online search for the error:

"attempting task abort! scmd"

There are multiple results, mentioning different reasons why this may be happening. You might want to check with Supermicro or LSI to confirm whether the SSDs that you are using were tested/validated for use with the storage adapter, drive cage and server you have.

Here are some actions that may help in this case:

- Update the Firmware and driver of your LSI storage controller.

- Update the system BIOS.

- Confirm whether the combination of server, controller, drive cage, etc. is supported by the Computer Manufacturer Support (http://www.intel.com/support/oems.htm).

I would like to add that the Intel® Solid-State Drive Data Center Tool also has firmware for the Intel® SSD DC S3700 Series. If there is no update in the latest ISDCT, it would mean that the drive's firmware is up-to-date.

For new issues, please create a different thread in the Solid State Drives community (/community/tech/solidstate/content).

ASmit32
New Contributor II

It's been a week now since I applied the firmware update, and I haven't had a recurrence of this issue since then. This is the longest it has gone without a problem, so I think the firmware update has fixed it.

Thanks,

Andy

ASmit32
New Contributor II

By the way, I'd just like to reiterate that I think it's extremely counter-productive not to mention this problem clearly in the release notes for the firmware. While you might advise anyone having problems to update their firmware, you could save a lot of people a lot of time by putting something in the release notes. Otherwise many people will not upgrade firmware just in the hope that it fixes what they see, and will end up asking you for support.