07-31-2015 04:07 AM
Hi,
I have a new Linux machine with two DC S3610 1.6TB SSDs. It runs Debian jessie, so the kernel is 3.16 (3.16.0-4-amd64). About a month after installation, these errors started appearing:
Jul 30 16:30:59 snaps kernel: [186914.249429] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
Jul 30 16:30:59 snaps kernel: [186914.250465] ata1.00: failed command: WRITE FPDMA QUEUED
Jul 30 16:30:59 snaps kernel: [186914.251505] ata1.00: cmd 61/08:00:39:db:8e/00:00:09:00:00/40 tag 0 ncq 4096 out
Jul 30 16:30:59 snaps kernel: [186914.251505] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 30 16:30:59 snaps kernel: [186914.253613] ata1.00: status: { DRDY }
Jul 30 16:30:59 snaps kernel: [186914.254781] ata1.00: failed command: WRITE FPDMA QUEUED
Jul 30 16:30:59 snaps kernel: [186914.255810] ata1.00: cmd 61/08:08:71:fc:4e/00:00:66:00:00/40 tag 1 ncq 4096 out
Jul 30 16:30:59 snaps kernel: [186914.255810] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jul 30 16:30:59 snaps kernel: [186914.257940] ata1.00: status: { DRDY }
Jul 30 16:30:59 snaps kernel: [186914.259086] ata1: hard resetting link
Jul 30 16:31:00 snaps kernel: [186914.577366] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 30 16:31:00 snaps kernel: [186914.578307] ata1.00: configured for UDMA/133
Jul 30 16:31:00 snaps kernel: [186914.578310] ata1.00: device reported invalid CHS sector 0
Jul 30 16:31:00 snaps kernel: [186914.578311] ata1.00: device reported invalid CHS sector 0
Jul 30 16:31:00 snaps kernel: [186914.578316] ata1: EH complete
The error is always the same, and the only thing on ata1.00 is one of the SSDs. I switched the two SSDs around and the problem followed the same SSD.
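For reference, the `cmd` lines of the failed commands can be decoded. This is a rough Python sketch, assuming the taskfile is printed in the order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, and that for NCQ commands the feature fields carry the sector count while nsect carries the tag shifted left by 3. That layout is my reading of the kernel's libata code and should be checked against the kernel source; `decode_ncq_cmd` is a made-up helper name.

```python
import re

# Assumed field order of the libata "cmd" line (verify against your kernel):
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)


def decode_ncq_cmd(line):
    """Decode an NCQ 'cmd' log line into command, LBA, sector count and tag.

    Returns None if the line does not contain a taskfile dump.
    """
    m = CMD_RE.search(line)
    if m is None:
        return None
    command = int(m.group(1), 16)
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    # 48-bit LBA: the "hob" (high-order byte) registers hold bits 24..47.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    # Assumption: for NCQ the sector count lives in the feature registers
    # and the queue tag in bits 7:3 of nsect.
    sectors = (hob_feature << 8) | feature
    return {"command": command, "lba": lba, "sectors": sectors, "tag": nsect >> 3}


first = decode_ncq_cmd("ata1.00: cmd 61/08:00:39:db:8e/00:00:09:00:00/40 tag 0 ncq 4096 out")
second = decode_ncq_cmd("ata1.00: cmd 61/08:08:71:fc:4e/00:00:66:00:00/40 tag 1 ncq 4096 out")
```

Under those assumptions, both entries decode to 8-sector writes (4096 bytes at 512 bytes per logical sector) with tags 0 and 1 at two different LBAs, which agrees with the "tag 0 ncq 4096 out" and "tag 1 ncq 4096 out" text in the log.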
I can't force the error to happen on demand; it just seems to happen every other day or so, though not at the same time of day. All I/O is held up briefly while the link is reset. The drive passes a SMART long self-test.
So is this drive faulty? If not, what can I try to fix this? If so, is there an easy way to prove it for RMA purposes?
Jul 27 05:59:30 snaps kernel: [ 33.054376] ata1.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133
Jul 27 05:59:30 snaps kernel: [ 33.054474] ata1.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)
Jul 27 05:59:30 snaps kernel: [ 33.054567] ata2.00: ATA-9: INTEL SSDSC2BX016T4, G2010110, max UDMA/133
Jul 27 05:59:30 snaps kernel: [ 33.054657] ata2.00: 3125627568 sectors, multi 1: LBA48 NCQ (depth 31/32)
$ sudo smartctl -i /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: INTEL SSDSC2BX016T4
Serial Number: BTHC511604V41P6PGN
LU WWN Device Id: 5 5cd2e4 04b7b1bfa
Firmware Version: G2010110
User Capacity: 1,600,321,314,816 bytes [1.60 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jul 31 11:04:09 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
$ sudo smartctl -i /dev/sdb
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: INTEL SSDSC2BX016T4
Serial Number: BTHC511604SD1P6PGN
LU WWN Device Id: 5 5cd2e4 04b7b1ba2
Firmware Version: G2010110
User Capacity: 1,600,321,314,816 bytes [1.60 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 2.6, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jul 31 11:04:35 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Message was edited by: Andy Smith
Now seeing the same problems with the other SSD, so this is not restricted to a single drive.
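With both drives now affected, it may help to tally the resets per port and per day. This is a minimal Python sketch, assuming syslog-style kernel lines like the ones pasted above; the function name `count_exceptions` and the regular expression are my own and not part of any tool.

```python
import re
from collections import Counter

# Matches lines such as:
# "Jul 30 16:30:59 snaps kernel: [186914.249429] ata1.00: exception Emask 0x0 ..."
# Captures the date ("Jul 30") and the ATA device ("ata1.00").
EXC_RE = re.compile(
    r"^(\w{3}\s+\d+) \S+ \S+ kernel: \[\s*[\d.]+\] (ata\d+\.\d+): exception Emask"
)


def count_exceptions(lines):
    """Count libata 'exception' events per (date, ATA device)."""
    counts = Counter()
    for line in lines:
        m = EXC_RE.match(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


sample = [
    "Jul 30 16:30:59 snaps kernel: [186914.249429] ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen",
    "Jul 30 16:30:59 snaps kernel: [186914.250465] ata1.00: failed command: WRITE FPDMA QUEUED",
    "Jul 30 16:30:59 snaps kernel: [186914.259086] ata1: hard resetting link",
]
sample_counts = count_exceptions(sample)
```

Passing it the lines of the kernel log (on Debian this is typically /var/log/kern.log, though that path is an assumption about the logging setup) gives a per-day, per-port count that could accompany an RMA request.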
07-31-2015 07:15 AM
Hello grifferz,
We are going to check on this and will reply as soon as possible.
08-06-2015 10:09 AM
Hello grifferz,
Please make sure the BIOS of your system is up-to-date, and that you are using the drivers recommended by the system manufacturer.
If the issue persists, please let us know the following:
- Smart Attributes output (smartctl -A)
- PC make and model
- Motherboard model
- BIOS version
- Type of Storage controller where the drive is plugged into.
08-06-2015 10:45 AM
Hi Jonathan,
> Please make sure the BIOS of your system is up-to-date
Yes, it is the latest BIOS.
> and that you are using the drivers recommended by the system manufacturer.
Well, this is a Debian Linux 8.0 system, with the latest kernel package, so I don't think there are any other recommended drivers.
> Smart Attributes output (smartctl -A)
$ sudo smartctl -A /dev/sda
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-3.16.0-4-amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   099   099   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       630
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       2
175 Program_Fail_Count_Chip 0x0033   100   100   010    Pre-fail  Always       -       5164180714
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   076   071   000    Old_age   Always       -       24 (Min/Max 24/30)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       24
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
225 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       80726
226 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       20
227 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       62
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       37677
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
234 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       80726
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       131822
> PC make and model
A Supermicro server
> Motherboard model
Supermicro X10SDV-F
> BIOS version
AMI BIOS R 1.0a
> Type of Storage controller where the drive is plugged into.
Directly into motherboard SATA.
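In the SMART output above, the raw values for reallocated sectors (5), end-to-end errors (184), reported uncorrectable errors (187), pending sectors (197) and UDMA CRC errors (199) are all 0. To repeat that check on both drives over time, here is a small Python sketch that parses `smartctl -A` text. The choice of attribute IDs to watch is my own assumption, not a vendor recommendation, and `parse_smart_attributes` and `nonzero_watched` are hypothetical helper names.

```python
def parse_smart_attributes(text):
    """Parse the attribute table from `smartctl -A` output.

    Returns {attribute_id: (name, raw_value_string)}. Works on both
    column-aligned and single-space-separated output.
    """
    attrs = {}
    for line in text.splitlines():
        # 10 columns; the last (RAW_VALUE) may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[9].strip())
    return attrs


# Assumed watch list: counters where a non-zero raw value would be suspicious.
WATCH = (5, 184, 187, 197, 199)


def nonzero_watched(attrs, watch=WATCH):
    """Return the watched attributes whose raw value starts with a non-zero number."""
    flagged = {}
    for attr_id in watch:
        if attr_id in attrs:
            name, raw = attrs[attr_id]
            first = raw.split()[0]
            if first.isdigit() and int(first) != 0:
                flagged[attr_id] = (name, raw)
    return flagged


sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 099 099 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 076 071 000 Old_age Always - 24 (Min/Max 24/30)
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
"""
attrs = parse_smart_attributes(sample)
flagged = nonzero_watched(attrs)
```

For the sample lines taken from the output above, nothing is flagged, so SMART alone does not point at the media or the cabling here.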
08-10-2015 03:58 AM
Same issue for me, but with brand-new S3710s. Seemingly all of our samples are 'defective' and tend to reset the bus once or twice per day under a very moderate workload. S3700s and S3500s previously worked flawlessly in the same setup (same SATA port, motherboard revision, and BIOS version). I had to ask both Supermicro and Intel support privately about possible actions, though most likely the issue is specific to the 22nm SSD generation.
Edit: I would be very grateful for RMA hints as well, possibly involving direct communication with a retailer. The risk of using these devices is too high right now; we'd prefer to replace the entire batch with the well-known S3700 by way of a defect return, and then start a detailed investigation on selected samples.