
How to Fix Bad Blocks in SSD X25 160GB?

idata
Esteemed Contributor III

Hey All!

Yesterday I experienced an NTFS BSOD. Today I had another one, and this time I got a dump file. When I analyzed it, I got:

KERNEL_DATA_INPAGE_ERROR (7a)
The requested page of kernel data could not be read in. Typically caused by a bad block in the paging file or disk controller error. Also see KERNEL_STACK_INPAGE_ERROR.

I tried to run CHKDSK and fix the blocks, but I'm not sure it was successful:

CHKDSK is verifying files (stage 1 of 3)...
729344 file records processed.
File verification completed.
1138 large file records processed.
0 bad file records processed.
2 EA records processed.
197 reparse records processed.
CHKDSK is verifying indexes (stage 2 of 3)...
1018110 index entries processed.
Index verification completed.
0 unindexed files scanned.
0 unindexed files recovered.
CHKDSK is verifying security descriptors (stage 3 of 3)...
729344 file SDs/SIDs processed.
Security descriptor verification completed.
144384 data files processed.
CHKDSK is verifying Usn Journal...
37265120 USN bytes processed.
Usn Journal verification completed.
Windows has checked the file system and found no problems.

  156185599 KB total disk space.
  126836176 KB in 578548 files.
     312828 KB in 144385 indexes.
          0 KB in bad sectors.
     838979 KB in use by the system.
      65536 KB occupied by the log file.
   28197616 KB available on disk.
       4096 bytes in each allocation unit.
   39046399 total allocation units on disk.
    7049404 allocation units available on disk.

I ran HD Tune:

HD Tune: INTEL SSDSA2M160G2GC Health

ID   Attribute                      Current  Worst  Threshold  Data      Status
(03) Spin Up Time                   100      100    0          0         Ok
(04) Start/Stop Count               100      100    0          0         Ok
(05) Reallocated Sector Count       100      100    0          3         Ok
(09) Power On Hours Count           100      100    0          3384      Ok
(0C) Power Cycle Count              100      100    0          1581      Ok
(C0) Power Off Retract Count        100      100    0          27        Ok
(E1) Load/Unload Cycle Count        200      200    0          145807    Ok
(E2) Load-in time                   100      100    0          2526      Ok
(E3) Torque Amplification Count     100      100    0          0         Ok
(E4) Power-Off Retract Cycle        100      100    0          -8120237  Ok
(E8) (unknown attribute)            99       99     10         0         Ok
(E9) (unknown attribute)            98       98     0          0         Ok
(B8) (unknown attribute)            100      100    99         0         Ok

Power On Time : 3384
Health Status : Ok

But when I ran an error scan, it found 2 damaged blocks:

HD Tune: INTEL SSDSA2M160G2GC Error Scan
Scanned data   : 152566 MB
Damaged Blocks : 0.1 %
Elapsed Time   : 19:06

I downloaded the SSD Toolbox from the Intel site and ran full diagnostics. While the Data Integrity Scan succeeded, the Read Scan test failed.

How do you recommend I proceed?

Thanks,
Ariel
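In case anyone wants to reproduce the kind of read scan HD Tune and the Toolbox run, here is a minimal Python sketch. It assumes Unix-style raw device access (os.pread is Unix-only) and a made-up device path; it is read-only, but needs root:

```python
import os
import sys

DEVICE = "/dev/sdb"   # assumption: the drive under test, NOT your OS disk
CHUNK = 1024 * 1024   # scan in 1 MiB chunks for speed
SECTOR = 512

def read_scan(device):
    """Return the LBAs (512-byte sectors) that cannot be read."""
    bad = []
    fd = os.open(device, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(CHUNK, size - offset)
            try:
                os.pread(fd, length, offset)   # positional read, Unix-only
            except OSError:
                # The chunk failed; retry sector by sector to pinpoint LBAs.
                for pos in range(offset, offset + length, SECTOR):
                    try:
                        os.pread(fd, SECTOR, pos)
                    except OSError:
                        bad.append(pos // SECTOR)
            offset += length
    finally:
        os.close(fd)
    return bad

if __name__ == "__main__":
    device = sys.argv[1] if len(sys.argv) > 1 else DEVICE
    bad = read_scan(device)
    print(f"{len(bad)} unreadable sector(s)", bad if bad else "")
```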
5 Replies

idata
Esteemed Contributor III

I would do an image-based backup AND a file-based backup.

Wipe the drive.

Run a full diagnostic. I would run the full diagnostic several times before trusting live data on it.

A bad block so early is concerning, and I wouldn't trust the drive thereafter. It probably won't qualify for a refund until it outright fails to start or fails a diagnostic.
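For the image-based backup step, a rough sketch of the idea follows; the device and destination paths are assumptions, and a real tool like ddrescue handles retries and logging far better. The point is that unreadable chunks are zero-filled rather than aborting, so one bad LBA doesn't kill the whole backup:

```python
import os

SOURCE = "/dev/sdb"             # assumption: the failing SSD
IMAGE = "/mnt/backup/ssd.img"   # assumption: file on a *different*, healthy disk
CHUNK = 1024 * 1024

def image_backup(source, image):
    """Copy the raw device to an image file; zero-fill unreadable chunks
    instead of aborting, and report how many were skipped."""
    skipped = 0
    src = os.open(source, os.O_RDONLY)
    try:
        size = os.lseek(src, 0, os.SEEK_END)
        with open(image, "wb") as out:
            offset = 0
            while offset < size:
                length = min(CHUNK, size - offset)
                try:
                    data = os.pread(src, length, offset)  # Unix-only
                except OSError:
                    data = b"\x00" * length  # unreadable: note it, keep going
                    skipped += 1
                out.write(data)
                offset += length
    finally:
        os.close(src)
    return skipped

if __name__ == "__main__":
    print(f"{image_backup(SOURCE, IMAGE)} chunk(s) were unreadable and zero-filled")
```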

idata
Esteemed Contributor III

I am not quite sure how it works, but Intel SSDs, like many others, are equipped with a lot of spare flash, so when a bad block occurs the firmware must remap the bad sector to a good one. I am not sure how this is implemented in Intel SSDs, whether they use the same method as ordinary hard drives (the Reallocated Sector Count in S.M.A.R.T.) or something else; I don't know. As for me, I have zero reallocated (remapped) sectors, but my Available Reserved Space is already down to 87 (it was 100), and I do a secure erase periodically too.
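If you want to watch those counters move over time, something like this sketch works. It assumes smartmontools is installed, and that attribute 5 is Reallocated Sector Count and 232 (E8 in hex) is Available Reserved Space on this drive generation; verify both against your own `smartctl -A` output:

```python
import json
import os
import subprocess

DEVICE = "/dev/sda"                                  # assumption: your SSD
BASELINE = os.path.expanduser("~/.smart_baseline.json")
WATCHED = {"5", "232"}                               # attribute IDs (decimal)

def read_attributes(device):
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    values = {}
    for line in out.splitlines():
        fields = line.split()
        # Table columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED FAILED RAW
        if fields and fields[0] in WATCHED and len(fields) >= 10:
            values[fields[0]] = {"name": fields[1],
                                 "value": int(fields[3]),
                                 "raw": fields[9]}
    return values

if __name__ == "__main__":
    current = read_attributes(DEVICE)
    previous = {}
    if os.path.exists(BASELINE):
        with open(BASELINE) as f:
            previous = json.load(f)
    for attr_id, attr in current.items():
        old = previous.get(attr_id)
        delta = "" if old is None else f" (was raw {old['raw']} last run)"
        print(f"{attr['name']} ({attr_id}): value {attr['value']}, "
              f"raw {attr['raw']}{delta}")
    with open(BASELINE, "w") as f:
        json.dump(current, f)
```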

Anyway, if the built-in SSD Toolbox test fails, I recommend RMAing the drive if possible, instead of "fixing" it. If people end up asking about it in forums, I think they must be having some problems with the RMA process.

idata
Esteemed Contributor III

SMART attribute 5 on these drives indicates that a NAND cell is no longer usable (writes to it failed). The NAND cell won't be used going forward. However, the data stored in that NAND cell may or may not have been transparently moved to another cell by the drive itself (possibly via the FTL).

The problem here is that Intel doesn't implement SMART attribute 198 (Offline_Uncorrectable), which is used to indicate remaps which failed. Instead, all we have is basically a counter that says "there's 3 NAND cells which aren't available for writes any more". There's no indicator of whether or not data was transparently remapped (read then re-written to a working cell) or not.

Your drive has a power-on hours count of 3384, which is ~141 days. People here are getting spun up over the loss of 3 NAND cells in ~141 days? Give me a break. I wouldn't bother with an RMA unless this number starts growing at a higher rate (say, 50+). 3 is more than reasonable given how unreliable MLC NAND is. http://www.oempcworld.com/support/SLC_vs_MLC.htm SLC is significantly better (read, don't skim).

Now, to answer your actual question re: "how to fix bad blocks": you can't fix bad LBAs on an SSD. On an MHDD you can't "fix" them either; suspect LBAs are marked unusable (unreadable) until a write is issued to them for re-analysis. If the re-analysis passes, the LBA is marked usable again, but its contents are now zeroed, which will obviously cause data loss or a broken filesystem (keep reading). If the re-analysis fails, the LBA is marked unusable and that's that.
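To make that "re-analysis on write" concrete: it's just an ordinary write to the suspect sector. A deliberately blunt sketch, assuming Unix-style raw access; the device path and LBA are made up, and it destroys the 512 bytes at that LBA, so only ever point it at a sector you already know is bad, on a backed-up drive:

```python
import os

DEVICE = "/dev/sdb"   # assumption: the drive under test
BAD_LBA = 123456789   # assumption: an LBA a read scan reported as unreadable
SECTOR = 512

# Overwrite the suspect sector; the drive re-evaluates the medium on write and
# either revalidates the sector or remaps it to a spare.
fd = os.open(DEVICE, os.O_WRONLY)
try:
    os.pwrite(fd, b"\x00" * SECTOR, BAD_LBA * SECTOR)
    os.fsync(fd)
finally:
    os.close(fd)

# Read it back: success means the LBA was revalidated or remapped.
fd = os.open(DEVICE, os.O_RDONLY)
try:
    os.pread(fd, SECTOR, BAD_LBA * SECTOR)
    print("sector reads OK now (revalidated or remapped)")
except OSError as exc:
    print(f"sector still unreadable: {exc}")
finally:
    os.close(fd)
```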

In either situation with MHDDs, your drive has to be reformatted. CHKDSK /F will not fix the problem, and fsck will not fix the problem. Here's why: neither CHKDSK nor fsck examines the actual *data* stored in a filesystem; they simply check the integrity of the filesystem itself (e.g. file allocation tables and internal filesystem structures). So if the bad LBA is located within a data region on the drive (not a filesystem table section), CHKDSK/fsck won't find the problem and will tell you everything is fine. Therefore, the only way to figure out which file got damaged is to find out which file uses the LBA in question; that takes time and is a manual process (a sketch of the lookup follows). The easy solution is to simply format the disk (you do not have to zero out all LBAs); it's the only way to guarantee that nothing remains pointing at the LBA which was previously bad and was remapped, or is permanently bad.
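The manual lookup looks roughly like this on NTFS. The bad LBA and the partition offset are made-up numbers, the 4096-byte cluster size comes from your CHKDSK output, and I'm assuming your Windows version ships `fsutil volume querycluster`; run it from an elevated prompt:

```python
import subprocess

BAD_LBA = 123456789         # assumption: LBA the drive reported as bad
PARTITION_START_LBA = 2048  # assumption: first sector of the C: partition
SECTOR = 512
CLUSTER_SIZE = 4096         # "4096 bytes in each allocation unit" per CHKDSK

# Disk-relative LBA -> byte offset within the partition -> NTFS cluster number.
cluster = (BAD_LBA - PARTITION_START_LBA) * SECTOR // CLUSTER_SIZE
print(f"LBA {BAD_LBA} lands in cluster {cluster} of C:")

# Ask NTFS which file, if any, is allocated on that cluster (needs elevation).
result = subprocess.run(
    ["fsutil", "volume", "querycluster", "C:", str(cluster)],
    capture_output=True, text=True)
print(result.stdout or result.stderr)
```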

Your next question/comment will be "Wow, that sucks, how does a person deal with this?" The answer is to use a checksum-based filesystem. The only two filesystems to my knowledge which do this are ZFS (Solaris/OpenSolaris/FreeBSD) and Btrfs (Linux), and both require either a mirror configuration (e.g. RAID-1) or a parity configuration (e.g. RAID-5). They can auto-correct errors of this sort in real-time. A journalling or soft-updates filesystem (NTFS, ext3, ext4, FreeBSD gjournal, FreeBSD UFS + softupdates, etc.) will not address this problem.
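To illustrate the difference, here's a toy model of the checksum-plus-mirror idea; real ZFS/Btrfs are vastly more sophisticated, this just shows why a silently corrupted copy can be detected and healed from its twin:

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class MirroredStore:
    """Two 'disks'; every block is stored on both, with a checksum."""
    def __init__(self):
        self.disks = [{}, {}]   # block_id -> (checksum, data)

    def write(self, block_id: int, data: bytes) -> None:
        for disk in self.disks:
            disk[block_id] = (digest(data), data)

    def read(self, block_id: int) -> bytes:
        for i, disk in enumerate(self.disks):
            checksum, data = disk[block_id]
            if digest(data) == checksum:        # this copy is intact
                twin = self.disks[1 - i]
                t_sum, t_data = twin[block_id]
                if digest(t_data) != t_sum:     # heal the corrupt twin
                    twin[block_id] = (checksum, data)
                    print(f"block {block_id}: repaired corrupt mirror copy")
                return data
        raise IOError(f"block {block_id}: both copies failed their checksums")

store = MirroredStore()
store.write(7, b"important data")
# Simulate silent corruption (bit rot) on disk 0 only.
store.disks[0][7] = (store.disks[0][7][0], b"bit-rotted junk!")
assert store.read(7) == b"important data"   # served from disk 1; disk 0 healed
```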

Does the same situation/advice apply to SSDs? Yes. The difference with an SSD is that there's an FTL which translates physical NAND cells to LBAs and the FTL can remap what goes where. However, based on what you've shown here, if the FTL was able to transparently remap the bad LBAs (those which utilise the bad NAND cell), e.g. read data from the bad NAND cell and stick the data in a working NAND cell and then use the FTL to remap the old LBAs to the new/working NAND cell, then you wouldn't have seen a BSOD. So my advice is to format the filesystem entirely and start fresh; it's the only way to be sure.
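To make the FTL argument concrete, here's a toy model of the remap path; everything in it is made up for illustration. The point is the last two lines: when the NAND page dies before its data can be moved, the read fails all the way up the stack, which is exactly the KERNEL_DATA_INPAGE_ERROR situation:

```python
class ToyFTL:
    """Toy flash translation layer: LBAs map to physical NAND pages, with a
    pool of spare pages for remapping."""

    def __init__(self, pages: int, spares: int):
        self.nand = [b""] * (pages + spares)           # physical page contents
        self.healthy = [True] * (pages + spares)       # per-page health
        self.map = {lba: lba for lba in range(pages)}  # LBA -> physical page
        self.spares = list(range(pages, pages + spares))

    def write(self, lba: int, data: bytes) -> None:
        self.nand[self.map[lba]] = data

    def read(self, lba: int) -> bytes:
        page = self.map[lba]
        if not self.healthy[page]:
            raise IOError(f"LBA {lba}: NAND page {page} is bad, data lost")
        return self.nand[page]

    def fail_page(self, page: int, recoverable: bool) -> None:
        """A NAND page wears out. If its data is still readable, copy it to a
        spare and update the map; the host never notices."""
        self.healthy[page] = False
        lba = next(l for l, p in self.map.items() if p == page)
        if recoverable and self.spares:
            spare = self.spares.pop()
            self.nand[spare] = self.nand[page]
            self.map[lba] = spare

ftl = ToyFTL(pages=4, spares=2)
ftl.write(0, b"hello")
ftl.fail_page(ftl.map[0], recoverable=True)
assert ftl.read(0) == b"hello"        # remapped transparently: no error

ftl.write(1, b"world")
ftl.fail_page(ftl.map[1], recoverable=False)
# ftl.read(1) now raises IOError: the KERNEL_DATA_INPAGE_ERROR analogue.
```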