We've got a couple of servers, each with one of the DC P3700 2TB AIC drives, and we see a serious performance regression after a couple of hours.
Initially, before running the application tests, we quickly tested the I/O performance using `dd`, writing 100 8GB files at a consistent rate of 2GB/s. After that, another `dd` run read these 100 files with direct I/O at a consistent rate of 1.1GB/s. The file system is XFS, but we also tested ext4. (We're aware that this is not a solid test, but it's good enough to get some indication that the drive's working with consistent write and read throughput.)
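For anyone who wants to repeat a quick check like the one described above, here is a minimal sketch that only builds the `dd` command lines rather than running them. The target directory, file names, the 1 MiB block size, `conv=fsync`, and treating "8GB" as 8 GiB are all assumptions for illustration; the exact commands used in the original test were not posted.

```python
# Sketch only: builds dd command lines for a quick sequential write/read check.
# Directory, file names, block size and flags are assumptions, not the
# commands actually used in the test described above.

def dd_write_cmd(path, size_gib=8):
    """dd command that writes size_gib GiB of zeros to path in 1 MiB blocks."""
    count = size_gib * 1024  # number of 1 MiB blocks
    return ["dd", "if=/dev/zero", f"of={path}", "bs=1M",
            f"count={count}", "conv=fsync"]

def dd_read_cmd(path):
    """dd command that reads path back with direct I/O (bypasses page cache)."""
    return ["dd", f"if={path}", "of=/dev/null", "bs=1M", "iflag=direct"]

def build_commands(directory, n_files=100, size_gib=8):
    """Return (write_commands, read_commands) for n_files test files."""
    paths = [f"{directory}/ddtest_{i:03d}.bin" for i in range(n_files)]
    writes = [dd_write_cmd(p, size_gib) for p in paths]
    reads = [dd_read_cmd(p) for p in paths]
    return writes, reads

# "/mnt/nvme" is a hypothetical mount point for the SSD's file system.
writes, reads = build_commands("/mnt/nvme")
print(" ".join(writes[0]))
print(" ".join(reads[0]))
```

Running the same command set before and after the long application run gives directly comparable numbers.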
The actual test is an application that usually just writes at 100MB/s with no reads, periodically peaking at 600MB/s writes and 150MB/s reads. Everything is sequential I/O. This works for a couple of hours, but after that I/O performance degrades to a few MB/s. Even after the application has been stopped, the same `dd` tests show that write throughput has degraded to maybe 200MB/s and reads to 100MB/s.
We would have expected the drive to eventually degrade a bit, but not to 200/100 MB/s.
ext4 generally performed worse than XFS, but the general behaviour (throughput regression) is the same on all machines.
Generally, we can also reproduce kernel panics in combination with isdct. One way to cause a kernel panic is to issue "isdct delete -intelssd"; the command completes but shortly after that, the kernel panic occurs.
Do you have any idea what may cause these behaviours and how to fix them?
We understand you have a couple of Intel® SSD DC P3700 Series drives, which are experiencing performance drops after being used for a short period of time.
Before we can provide reasons and suggestions, we would like to have some more information:
1. What OS are you using with these drives?
2. These are NVMe* SSDs, which driver are you using?
3. Which firmware version are the drives currently using?
4. What are the make and models of your servers?
5. What program are you using to perform the benchmarking tests?
6. Would it be possible for you to attach the SMART details from at least one of these drives? (using the advanced editor at the top right when replying will allow you to attach files to your post).
We look forward to hearing back from you.
Best regards,
Carlos A.
1: SLES 12 SP1 (3.12.59-60.45)
2: standard kernel driver (nvme.ko)
3: 8DV1LP11 and 8DV1LP10 (with 8DV1LP11 interrupt coalescing is disabled, with 8DV1LP10 it's not)
4: Lenovo x3550 M5
5: `dd` and `hdparm`. We also ran some fio and bonnie++ tests before, but these looked ok (they didn't run for many hours though).
6: (see attachment smart-details.txt)
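Since the earlier fio runs were too short to show the slowdown, a longer time-based run with a bandwidth log might make the onset visible. The sketch below only assembles a fio command line; the file path, 6-hour runtime, 100 MiB/s rate cap (to mimic the base workload), queue depth, and log prefix are assumptions chosen for illustration. Note that with `--time_based` and a fixed `--size`, fio keeps rewriting the same file region, which is not identical to the application's pattern.

```python
# Sketch only: assemble a long-running sequential-write fio command.
# All parameter values are illustrative assumptions, not a tested recipe.

def fio_long_run_cmd(filename, hours=6, rate="100m", log_prefix="seqwrite"):
    """Return the argv list for a time-based sequential write job."""
    runtime_s = int(hours * 3600)
    return [
        "fio",
        "--name=seqwrite-longrun",
        f"--filename={filename}",
        "--rw=write",               # sequential writes
        "--bs=1M",
        "--size=8G",
        "--direct=1",               # bypass the page cache
        "--ioengine=libaio",
        "--iodepth=16",
        f"--rate={rate}",           # cap bandwidth to mimic the base load
        "--time_based",
        f"--runtime={runtime_s}",
        f"--write_bw_log={log_prefix}",  # per-interval bandwidth log
        "--log_avg_msec=1000",           # average log entries over 1 s
    ]

# "/mnt/nvme/fio-test.bin" is a hypothetical path on the SSD's file system.
print(" ".join(fio_long_run_cmd("/mnt/nvme/fio-test.bin")))
```

The bandwidth log can then be plotted against time to see whether throughput falls off gradually or suddenly.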
The benchmarking itself was a Cassandra stress test with a mixed read/write workload. But the reads never actually hit the disk since the files were cached in memory.
The (constant) base workload against the SSD of ~100MB/s is just sequentially writing the commit log. The irregular additional workload on top of that is compaction (i.e. sequential read + write).
After approx. 5 hours, the disk was not able to provide a throughput higher than 200MB/s for writes and 100MB/s for reads. This was measured with `dd`.
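As a rough back-of-the-envelope check (not a diagnosis), the figures above can be turned into a total write volume. The sketch assumes the 100MB/s base rate is constant, ignores the additional compaction writes, and uses decimal units, so the real volume would be higher.

```python
# Rough estimate of data written by the base workload before the slowdown.
# Assumes a constant rate and ignores compaction peaks (so this is a lower bound).

BASE_WRITE_MB_S = 100     # commit-log write rate from the post
HOURS_UNTIL_DROP = 5      # approximate time until throughput dropped
DRIVE_CAPACITY_TB = 2.0   # nominal capacity of the drive

def data_written_tb(rate_mb_s, hours):
    """Decimal terabytes written at a constant rate (1 TB = 1e6 MB)."""
    return rate_mb_s * hours * 3600 / 1e6

written = data_written_tb(BASE_WRITE_MB_S, HOURS_UNTIL_DROP)
fraction = written / DRIVE_CAPACITY_TB
print(f"{written:.2f} TB written, {fraction:.0%} of nominal capacity")
# prints: 1.80 TB written, 90% of nominal capacity
```

So by the time the slowdown appears, the base workload alone has written close to one full drive capacity. Whether that is related to the drop is only a guess, but it may be worth keeping in mind when comparing against steady-state figures.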
Hello BreakStuff,
In this case we can start by recommending that you install and use the Intel® SSD DC NVMe* Drivers instead of the built-in version provided with your OS. You will need to install driver version 220.127.116.112, as the latest release did not include an updated driver for Linux*.
- https://downloadcenter.intel.com/download/26167/Intel-SSD-Data-Center-Family-for-NVMe-Drivers Intel® SSD Data Center Family for NVMe Drivers.
- http://www.intel.com/content/dam/support/us/en/documents/ssdc/data-center-ssds/Intel_Linux_NVMe_Guid... NVMe* Driver Installation Guide for Linux*.
Aside from that, based on your firmware version, we can tell that your SSDs are Lenovo* OEM drives. Due to this, our firmware versions will not apply; you would need to check with your OEM to find out if you're using the latest firmware, or if there are any new releases.
When it comes to SSD benchmarking, our recommended tool for Linux* is FIO (which I see that you've used). We also recommend using FIO Visualizer if you'd like a GUI.
- https://itpeernetwork.intel.com/how-to-benchmark-ssds-with-fio-visualizer/ How to benchmark SSDs with FIO Visualizer.
From your SMART details, we were unable to find any red flags. It seems that your drive's overall health is ok, as expected.
Note: Any links provided for third party tools or sites are offered for your convenience and should not be viewed as an endorsement by Intel® of the content, products, or services offered there. We do not offer support for any third party tool mentioned here.
Please let us know if this helps.
Best regards,
Carlos A.
Thanks for your quick reply!
I'll ask the ops guys to install the Intel NVMe driver.
We already looked at the SMART details and nothing jumped out at us.
Let's see how it works with the NVMe driver.
BTW: Is it sufficient to over-provision the drive by just using less capacity (e.g. using only 1.5T instead of 2T), or is it better to adjust the Max-LBA setting?
Do you have any idea why interrupt coalescing is enabled in the one firmware revision and not in the other?
Generally speaking, can I expect the SSD to achieve the specified performance numbers in steady state?
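For the over-provisioning question, the 1.5T-of-2T example works out as follows. This is plain arithmetic on the nominal figures from the post; it does not account for any spare area the drive already reserves internally, and it says nothing about which of the two methods is preferable.

```python
# Extra spare area gained by using only part of the nominal capacity.
# Nominal figures from the post; factory-reserved spare area is not included.

def over_provisioning(total_tb, used_tb):
    """Return spare capacity as (share of total, share relative to used)."""
    spare = total_tb - used_tb
    return spare / total_tb, spare / used_tb

share_of_total, share_of_used = over_provisioning(2.0, 1.5)
print(f"spare: {share_of_total:.0%} of total, "
      f"{share_of_used:.0%} relative to used capacity")
# prints: spare: 25% of total, 33% relative to used capacity
```

Both ratios are quoted in practice, so it helps to state which one is meant when comparing numbers.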