Solidigm

ASmit32 · 07-31-2015

Hi,

I have a new Linux machine with two DC S3610 1.6TB SSDs. It's Debian jessie, so kernel 3.16. About a month after installation, these errors started appearing:

Jul 30 16:30:59 snaps kernel: [186914.249429] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen

Jul 30 16:30:59 snaps kernel: [186914.250465] ata1.00: failed command: WRITE FPDMA QUEUED

Jul 30 16:30:59 snaps kernel: [186914.251505] ata1.00: cmd 61/08:00:39:db:8e/00:00:09:00:00/40 tag 0 ncq 4096 out

Jul 30 16:30:59 snaps kernel: [186914.251505] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jul 30 16:30:59 snaps kernel: [186914.253613] ata1.00: status: { DRDY }

Jul 30 16:30:59 snaps kernel: [186914.254781] ata1.00: failed command: WRITE FPDMA QUEUED

Jul 30 16:30:59 snaps kernel: [186914.255810] ata1.00: cmd 61/08:08:71:fc:4e/00:00:66:00:00/40 tag 1 ncq 4096 out

Jul 30 16:30:59 snaps kernel: [186914.255810] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jul 30 16:30:59 snaps kernel: [186914.257940] ata1.00: status: { DRDY }

Jul 30 16:30:59 snaps kernel: [186914.259086] ata1: hard resetting link

Jul 30 16:31:00 snaps kernel: [186914.577366] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Jul 30 16:31:00 snaps kernel: [186914.578307] ata1.00: configured for UDMA/133

Jul 30 16:31:00 snaps kernel: [186914.578310] ata1.00: device reported invalid CHS sector 0

Jul 30 16:31:00 snaps kernel: [186914.578311] ata1.00: device reported invalid CHS sector 0

Jul 30 16:31:00 snaps kernel: [186914.578316] ata1: EH complete
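For anyone trying to read the `cmd` lines above, here is a minimal sketch of a decoder. It assumes the usual libata field order (command/feature:nsect:lbal:lbam:lbah/hob fields/device) and the NCQ convention that the sector count travels in the FEATURE field while the tag sits in bits 7:3 of NSECT. The function names `decode_ncq_cmd` and `active_tags` are made up for illustration and are not part of any tool.

```python
import re

# Assumed libata layout of a "cmd" line:
# cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HFEAT:HNSECT:HLBAL:HLBAM:HLBAH/DEV tag N
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2}) tag (\d+)"
)


def decode_ncq_cmd(line):
    """Decode the taskfile of a READ/WRITE FPDMA QUEUED 'cmd' log line."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no libata cmd taskfile found in line")
    command = int(m.group(1), 16)
    feat, nsect, lbal, lbam, lbah = (int(b, 16) for b in m.group(2).split(":"))
    hfeat, _hnsect, hlbal, hlbam, hlbah = (
        int(b, 16) for b in m.group(3).split(":")
    )
    # NCQ commands: sector count is in FEATURE (low) and HOB FEATURE (high).
    sectors = (hfeat << 8) | feat
    # 48-bit LBA, most significant byte first: HLBAH HLBAM HLBAL LBAH LBAM LBAL.
    lba = (
        (hlbah << 40) | (hlbam << 32) | (hlbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "command": command,
        "sectors": sectors,
        "lba": lba,
        "tag": int(m.group(5)),      # tag as printed by the kernel
        "nsect_tag": nsect >> 3,     # tag as encoded in the NSECT field
    }


def active_tags(sact):
    """Return the NCQ tags whose bits are set in the SAct bitmask."""
    return [t for t in range(32) if (sact >> t) & 1]
```

Applied to the first failed command, this gives 8 sectors (8 × 512 = 4096 bytes, matching `ncq 4096`) at LBA 0x098edb39, and `SAct 0x3` decodes to tags 0 and 1, which are the two commands that timed out.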

The error is always the same, and the only thing on ata1.00 is one of the SSDs. I switched the two SSDs around and the problem followed the same SSD.

I can't force the error to happen on demand, it just seems to happen every other day or so, though not at the same time of day. All IO is held up briefly while the link is reset. The drive passes a SMART long self-test.
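To put a number on "every other day or so", something like the following could tally link resets per day and per port. This is only a sketch: it assumes the syslog line format shown in the excerpt above, and `count_link_resets` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches lines like:
# Jul 30 16:30:59 snaps kernel: [186914.259086] ata1: hard resetting link
RESET_RE = re.compile(
    r"^(\w{3}\s+\d+) [\d:]+ \S+ kernel: \[\s*[\d.]+\] (ata\d+): hard resetting link"
)


def count_link_resets(lines):
    """Count 'hard resetting link' events keyed by (day, ata port)."""
    counts = Counter()
    for line in lines:
        m = RESET_RE.match(line)
        if m:
            # Collapse the double space syslog uses for single-digit days.
            day = " ".join(m.group(1).split())
            counts[(day, m.group(2))] += 1
    return counts


# Example use (the log path is an assumption; adjust for your system):
# with open("/var/log/kern.log") as f:
#     for (day, port), n in sorted(count_link_resets(f).items()):
#         print(day, port, n)
```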

So is this drive faulty? If not, what can I try to fix this? If so, is there an easy way to prove it for RMA purposes?

Jul 27 05:59:30 snaps kernel: [ 33.054376] ata1.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133

Jul 27 05:59:30 snaps kernel: [ 33.054474] ata1.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)

Jul 27 05:59:30 snaps kernel: [ 33.054567] ata2.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133

Jul 27 05:59:30 snaps kernel: [ 33.054657] ata2.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)

$ sudo smartctl -i /dev/sda

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)

=== START OF INFORMATION SECTION ===

Device Model: INTEL SSDSC2BX016T4

Serial Number: BTHC511604V41P6PGN

LU WWN Device Id: 5 5cd2e4 04b7b1bfa

Firmware Version: G2010110

User Capacity: 1,600,321,314,816 bytes [1.60 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Rotation Rate: Solid State Device

Form Factor: 2.5 inches

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: ACS-2 T13/2015-D revision 3

SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is: Fri Jul 31 11:04:09 2015 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

$ sudo smartctl -i /dev/sdb

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)

=== START OF INFORMATION SECTION ===

Device Model: INTEL SSDSC2BX016T4

Serial Number: BTHC511604SD1P6PGN

LU WWN Device Id: 5 5cd2e4 04b7b1ba2

Firmware Version: G2010110

User Capacity: 1,600,321,314,816 bytes [1.60 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Rotation Rate: Solid State Device

Form Factor: 2.5 inches

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: ACS-2 T13/2015-D revision 3

SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)

Local Time is: Fri Jul 31 11:04:35 2015 UTC

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

Message was edited by: Andy Smith

Now seeing same problems with other SSD, so this is not restricted to a single drive.

jbenavides · 09-04-2015

Hello Grifferz,

Thank you for the update. I am glad to know the issue did not reoccur after the firmware update was applied.

We will take your feedback regarding the release notes documentation to the proper resources. In the meantime, we are sure that this community thread will be a good reference for other users facing this issue.

RBade2 · 09-07-2015

Hi Telbizov,

We also have Supermicro with LSI3008 HBA and are seeing this SCSI Task Abort on S3710 and S3700. However the same servers also have an Intel SATA controller.

We have 8 drives per server on the LSI and 2 on the Intel.

When we see the IO delay (4x more often on S3710 than S3700) we get either the ATA "hard resetting link" (on Intel) or a SCSI Task Abort (on LSI), so I disagree that it's a different issue just because the message is different. IMO the ATA and SCSI messages describe the same event, just in different terminology due to ATA vs SCSI.

The pattern of events is the same when this happens.

When you updated the firmware on your S3710s, did the frequency of Task Aborts decrease?

I will report back with more info after I've updated the firmware.

RTelb · 09-08-2015

Hi rich.bade,

When you updated the firmware on your S3710s, did the frequency of Task Aborts decrease?

It did actually. I noticed that errors got drastically reduced. Nevertheless with the LSI's I was still experiencing problems.

Here's my update: since last Friday I had all 12 LSI3008 HBAs replaced with simple arms which connect the backplanes to the motherboard. So with that I am running all disks directly connected to the motherboard (X10DRT-PT) controller. I also made sure that I am running the latest G2010140 firmware on those S3710s.
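For checking many drives at once, a small sketch like this could pull the firmware string out of saved `smartctl -i` output and flag anything not on the expected version. The helper names are hypothetical, and the default target `G2010140` is simply the version mentioned in this post for the S3710; other models may need a different value. It only tests for inequality and makes no attempt to order version strings.

```python
import re


def firmware_version(smartctl_info):
    """Return the 'Firmware Version' field from `smartctl -i` text, or None."""
    m = re.search(r"^Firmware Version:\s*(\S+)", smartctl_info, re.MULTILINE)
    return m.group(1) if m else None


def needs_update(smartctl_info, wanted="G2010140"):
    """True if a firmware version was found and it differs from `wanted`.

    `wanted` defaults to the version named in this thread (an assumption
    for any drive model other than the S3710).
    """
    found = firmware_version(smartctl_info)
    return found is not None and found != wanted
```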

I am happy to report that after 3 days of heavy load on those systems I see no errors whatsoever. It looks like there were indeed 2 problems. One: the firmware had a bug, which seems to be fixed now. Two: when the LSI3008 is involved there are further issues. Where that remaining problem lies I don't know. It might be the LSI card itself, the HBA/backplane connection, or an interaction between the LSI and the Intel S3710s.

In my case I could go production without the LSI HBA but if I were Intel I would still test the above configuration combo myself just to make sure that there isn't another remaining SSD firmware bug specifically when paired with LSI3008. Something to dwell on jonathan_intel ...

Cheers,

Rumen Telbizov

http://telbizov.com/ Unix Systems Administrator

RBade2 · 09-13-2015

Hi Everyone,

I updated the firmware on 3 drives (one host) last Tuesday and have not seen any ATA resets or Task Aborts on that host in the 5 days since. I have seen 18 ATA resets or Task Aborts on the two hosts that I made no changes on.

It looks like this firmware has fixed my issues.

Thanks everyone for the info in this thread.

SHawk · 09-16-2015

I updated to the new firmware a few weeks ago now and my problem hasn't recurred since.

S3610 SSDs have failed "READ/WRITE FPDMA QUEUED" ATA commands, frozen, then link reset