01-17-2017 12:27 PM
Hi,
we've got a couple of servers, each with one DC P3700 2TB AIC drive, and we're seeing a serious performance regression after a couple of hours.
Initially, before running the application tests, we quickly checked I/O performance with `dd`: writing 100 8GB files at a consistent rate of 2GB/s, then another `dd` run reading those 100 files with direct I/O at a consistent 1.1GB/s. The file system is XFS - but we also tested ext4. (We're aware this isn't a rigorous benchmark, but it's good enough to indicate that the drive sustains consistent write and read throughput.)
The actual test is an application that usually just writes at 100MB/s with no reads, periodically peaking at 600MB/s writes and 150MB/s reads - everything is sequential I/O. This works for a couple of hours, but after that I/O performance degrades to a few MB/s. Even after the application has been stopped, the same `dd` tests show that write throughput has degraded to maybe 200MB/s and reads to 100MB/s.
We would have expected the drive to degrade somewhat over time, but not down to 200/100 MB/s.
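(For reference, the steady write portion of that workload can be approximated roughly with fio along the following lines; the path, size and rate limit are only illustrative, not the exact test we run:)
# approximate the application's steady ~100MB/s sequential write stream
fio --name=app-writes --filename=/nvme-disk/scratch/app-sim.dat --rw=write --bs=1M --size=50G --direct=1 --ioengine=libaio --iodepth=4 --rate=100m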
ext4 generally performed worse than XFS, but the overall behaviour (the throughput regression) is the same on all machines.
We can also reproduce kernel panics in combination with isdct. One way to trigger one is to issue "isdct delete -intelssd"; the command completes, but shortly afterwards the kernel panics.
Do you have any idea what might cause this behaviour and how to fix it?
01-28-2017 03:24 AM
Hi Carlos,
we just used
for i in `seq 1 100`; do time sh -c "dd if=/dev/zero of=/nvme-disk/scratch/dd-${i} bs=32k count=262144" ; done
for i in `seq 1 100`; do time sh -c "dd of=/dev/null if=/nvme-disk/scratch/dd-${i} bs=32k iflag=direct count=262144" ; done
as a "quick" check of basic functionality and to get an idea of sequential write and read throughput (despite the non-optimal block sizes).
The kernel panic (with isdct delete) is reproducible on SLES 12 SP1. That kernel version runs into trouble when the drive is reset and no partition table is found afterwards. This does not happen on recent kernel versions (at least not with 4.9; 4.4 not tested yet).
From our experience it looks like the Linux NVMe driver/firmware prioritizes writes over trims/GC, and trims/GC over reads. If you execute both for-loops above concurrently, write throughput stays at (nearly) 2.0GB/s, but read throughput drops to just a few MB/s. The same seems to happen when an `fstrim` runs alongside the reading dd loop. Is this observation correct? Is there a way to get a fair prioritization of writes and reads (i.e. keep the driver/firmware from preferring writes over reads)?
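(This is roughly how we ran the two loops concurrently to see the imbalance - the files already exist from a previous write pass, and the iostat device name is just what it happens to be on our machines:)
# start the write loop and the direct-I/O read loop at the same time
for i in `seq 1 100`; do dd if=/dev/zero of=/nvme-disk/scratch/dd-${i} bs=32k count=262144; done &
for i in `seq 1 100`; do dd of=/dev/null if=/nvme-disk/scratch/dd-${i} bs=32k iflag=direct count=262144; done &
wait
# watch per-device throughput in another terminal, e.g.
# iostat -xm 1 nvme0n1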
The current performance is fine for us, as we won't push the drive to its limits in production - but if there's something we can optimize, that would be great.
Thanks for the tip regarding fstrim (i.e. that it's not necessary to schedule it manually)!
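(In case it helps others: on systemd-based distributions the periodic trim is usually handled by fstrim.timer - the commands below assume such a setup:)
# check whether the periodic trim timer is enabled and active
systemctl status fstrim.timer
# enable it if it isn't
systemctl enable --now fstrim.timer
# or trigger a one-off trim of the mount point manually
fstrim -v /nvme-disk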
01-30-2017 03:32 PM
Hello Break Stuff,
These commands are OK as a quick test, or even as a well-being check. Just don't trust these results too much for performance benchmarking; for that we recommend a dedicated benchmarking tool such as FIO*.
Unfortunately this is all a bit outside of our support scope, since your drives don't use our firmware and the Linux* NVMe* driver is not made by us either. Our recommendation in this case would be to contact your Linux* Support Community, or your computer manufacturer, for more details regarding the drive's firmware.
As far as I'm aware, there's no command to make your drive lean towards reads or writes, but then again this is not our normal area of support.
Best regards,
Carlos A.
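(As a rough illustration of the kind of FIO* run meant above - the file path, block size and queue depth here are assumptions, not an official recommendation:)
# sequential write pass
fio --name=seqwrite --filename=/nvme-disk/scratch/fio-test.dat --rw=write --bs=128k --size=8G --direct=1 --ioengine=libaio --iodepth=32
# sequential read pass over the same file
fio --name=seqread --filename=/nvme-disk/scratch/fio-test.dat --rw=read --bs=128k --size=8G --direct=1 --ioengine=libaio --iodepth=32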