The most reliable way to test disks is down-and-dirty, on the command line.
Jim Salter – Feb 6
We are not going to be quite that specific here, but we will use fio to model and report on some key usage patterns common to desktop and server storage. The most important of these is 4K random I/O, which we discussed at length above. 4K random is where the pain lives — it’s the reason your nice fast computer with a conventional hard drive suddenly sounds like it’s grinding coffee and makes you want to defenestrate it in frustration. Next, we look at 64K random I/O, in sixteen parallel processes. This is sort of a middle-of-the-road workload for a busy computer — there are a lot of requests for relatively small amounts of data, but there are also lots of parallel processes; on a modern system, that high number of parallel processes is good, because it potentially allows the OS to aggregate lots of small requests into a few larger requests. Although nowhere near as punishing as 4K random I/O, 64K random I/O is sufficient to significantly slow most storage systems down. Finally, we look at high-end throughput — some of the biggest numbers you can expect to see out of the system — by way of 1MB random I/O. Technically, you could still get a (slightly) bigger number by asking fio to generate truly sequential requests — but in the real world, those are vanishingly rare. If your OS needs to write a couple of lines to a system log, or read a few KB of data from a system library, your “sequential” read or write immediately becomes, effectively, 1MB random I/O as it shares time with the other process.
macOS
On a Mac, you’ll want to install fio via Homebrew. If you don’t already have Homebrew installed, issue the following command at the Terminal:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

On the one hand, the above is abominable procedure; on the other hand, you can confirm that the script being pulled down tells you everything it’s going to do, before it does it, and pauses to allow you to consent to it. If you’re sufficiently paranoid, you may wish to download the file, inspect it, and then run it as separate steps instead. Note that the Homebrew install script does not need sudo privileges — and will, in fact, refuse to run at all if you try to execute it with sudo. With Homebrew installed, you can now install fio easily:

brew install fio
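If you prefer the more cautious download-inspect-run route described above, it might look roughly like this (the local filename is just an example):

curl -fsSL -o homebrew-install.rb https://raw.githubusercontent.com/Homebrew/install/master/install
less homebrew-install.rb
/usr/bin/ruby homebrew-install.rb

Either way, once brew install fio finishes, fio --version is a quick way to confirm the tool is on your path.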
Using fio

Now you can use fio to benchmark storage. First, change directory to the location you actually want to test: if you run fio in your home directory, you’ll be testing your computer’s internal disk, and if you run it in a directory located on a USB portable disk, you’ll be benchmarking that portable disk. Once you’ve got a command prompt somewhere in the disk you want to test, you’re ready to actually run fio.
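For example (these paths are purely illustrative; use whatever mount point your target disk actually has):

cd ~                      # test the internal disk holding your home directory
cd /Volumes/PortableSSD   # macOS: test a USB disk mounted under /Volumes
cd /mnt/usbdrive          # Linux: test a disk mounted at /mnt/usbdrive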
Baby’s first fio run
First, we’ll examine the syntax needed for a simple 4K random write test. (Windows users: substitute --ioengine=windowsaio for --ioengine=posixaio in both this and future commands.)

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1

Let’s break down what each argument does.

--name= is a required argument, but it’s basically human-friendly fluff — fio will create files based on that name to test with, inside the working directory you’re currently in.

--ioengine=posixaio sets the mode fio uses to interact with the filesystem. POSIX is a standard Windows, Macs, Linux, and BSD all understand, so it's great for portability — although inside fio itself, Windows users need to invoke --ioengine=windowsaio, not --ioengine=posixaio, unfortunately. AIO stands for Asynchronous Input Output and means that we can queue up multiple operations to be completed in whatever order the OS decides to complete them. (In this particular example, later arguments effectively nullify this.)

--rw=randwrite means exactly what it looks like it means: we're going to do random write operations to our test files in the current working directory. Other options include seqread, seqwrite, randread, and randrw, all of which should hopefully be fairly self-explanatory.

--bs=4k Blocksize 4K. These are very small individual operations. This is where the pain lives; it's hard on the disk, and it also means a ton of extra overhead in the SATA, USB, SAS, SMB, or whatever other command channel lies between us and the disks, since a separate operation has to be commanded for each 4K of data.

--size=4g Our test file(s) will be 4GB in size apiece. (We're only creating one — see the next argument.)

--numjobs=1 We're only creating a single file, and running a single process commanding operations within that file. If we wanted to simulate multiple parallel processes, we'd do, e.g., --numjobs=16, which would create 16 separate test files of --size size, and 16 separate processes operating on them at the same time.

--iodepth=1 This is how deep we're willing to try to stack commands in the OS's queue. Since we set this to 1, this is effectively pretty much the same thing as the sync IO engine — we're only asking for a single operation at a time, and the OS has to acknowledge receipt of every operation we ask for before we can ask for another. (It does not have to satisfy the request itself before we ask it to do more operations, it just has to acknowledge that we actually asked for it.)

--runtime=60 --time_based Run for sixty seconds — and even if we complete sooner, just start over again and keep going until sixty seconds is up.

--end_fsync=1 After all operations have been queued, keep the timer going until the OS reports that the very last one of them has been successfully completed — i.e., actually written to disk.
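Incidentally, if long command lines get tiresome, fio can read the same parameters from a job file. Here's a sketch equivalent to the command above (the filename random-write.fio is arbitrary); run it with fio random-write.fio from the directory you want to test:

[random-write]
ioengine=posixaio
rw=randwrite
bs=4k
size=4g
numjobs=1
iodepth=1
runtime=60
time_based
end_fsync=1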
Interpreting fio's output

This is the output from the 4K random I/O run on my Ubuntu workstation:

root@banshee:/tmp# fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.17
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=28672: Wed Feb  5 ...
  write: IOPS=..., BW=... MiB/s (... MB/s)(... MiB/... msec); 0 zone resets
    slat (nsec): min=..., max=..., avg=..., stdev=...
    clat (nsec): min=..., max=..., avg=..., stdev=...
     lat (usec): min=..., max=..., avg=..., stdev=...
    clat percentiles (usec): ...
   bw (KiB/s): min=..., max=..., per=...%, avg=..., stdev=..., samples=...
   iops: min=..., max=..., avg=..., stdev=..., samples=...
  lat (nsec): ...
  lat (usec): ...
  lat (msec): ...
  cpu: usr=...%, sys=...%, ctx=..., majf=..., minf=...
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=..., short=0,0,0,0 dropped=0,0,0,0
     latency: target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=... MiB/s (... MB/s), ... MiB/s-... MiB/s (... MB/s-... MB/s), io=... MiB (... MB), run=...-...msec

Disk stats (read/write):
  md0: ios=..., merge=..., ticks=..., in_queue=..., util=...%, aggrios=..., aggrmerge=..., aggrticks=..., aggrin_queue=..., aggrutil=...%
  sdb: ios=..., merge=..., ticks=..., in_queue=..., util=...%
  sda: ios=..., merge=..., ticks=..., in_queue=..., util=...%
This may seem like a lot. It is a lot! But there's only one piece you'll likely care about in most cases — the line directly under "Run status group 0 (all jobs):" is the one with the aggregate throughput. Fio is capable of running as many wildly different jobs in parallel as you'd like, to execute complex workload models. But since we're only running one job group, we've only got one line of aggregates to look through:

Run status group 0 (all jobs):
  WRITE: bw=... MiB/s (... MB/s), ... MiB/s-... MiB/s (... MB/s-... MB/s), io=... MiB (... MB), run=...-...msec

First, we're seeing output in both MiB/sec and MB/sec. MiB means "mebibytes," measured in powers of two, where MB means "megabytes," measured in powers of ten. Mebibytes — 1,048,576 (1,024 x 1,024) bytes apiece — are what operating systems and filesystems actually measure data in, so that's the reading you care about. (As a quick sanity check: a drive reporting 200 MiB/sec is moving roughly 210 MB/sec, since a mebibyte is about 4.9 percent larger than a megabyte.)

In addition to only having a single job group, we only have a single job in this test — we did not ask fio to, for example, run sixteen parallel 4K random write processes — so although the second bit shows a minimum-to-maximum range, in this case it's just a repeat of the overall aggregate. If we'd had multiple processes, we'd see the slowest process through the fastest process represented here.

Finally, we get the total I/O for the run: the io= figure is the total amount written to disk, and run= is how long the run took, in milliseconds. Divide the MiB written by the runtime in seconds and, surprise surprise, you get the same MiB/sec figure fio reported as aggregate throughput in the first block of the line.

If you're wondering why fio wrote quite a bit more than 4GiB in this run — despite our --size argument being 4g, and only having one process running — it's because we used --time_based and --runtime=60. And since we're testing on a fast storage medium, we managed to loop through the full write run twice before terminating.

You can cherry-pick lots more interesting stats out of the full fio output, including utilization percentages, IOPS per process, and CPU utilization — but for our purposes, we're just going to stick with the aggregate throughput from here on out.
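If you'd rather pull those aggregates out programmatically instead of eyeballing the text, fio can also emit machine-readable JSON. Here's a rough sketch, assuming you have the common jq utility installed (double-check the field names against your own fio version's JSON output):

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1 --output-format=json --output=result.json
jq '.jobs[0].write | {bw, iops}' result.json   # write bandwidth (KiB/s) and IOPS for the lone job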
Ars recommended tests

Single 4KiB random write process

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1

This is a single process doing random 4K writes. This is where the pain really, really lives; it's basically the worst possible thing you can ask a disk to do. Where this happens most frequently in real life: copying home directories and dotfiles, manipulating email stuff, some database operations, source code trees.

When I ran this test against the high-performance SSDs in my Ubuntu workstation, they pushed several times the throughput that the server just beneath it in the rack managed on its "high-performance" conventional rust disks ... but even then, the vast majority of that speed is because the data is being written asynchronously, allowing the operating system to batch it up into larger, more efficient write operations.

If we add the argument --fsync=1, forcing the operating system to perform synchronous writes (calling fsync after each block of data is written), the picture gets much more grim: 2.6MiB/sec on the high-performance SSDs but only 342KiB/sec on the "high-performance" rust.
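For reference, that synchronous run is simply the same command with --fsync=1 appended:

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --size=4g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1 --fsync=1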
The SSDs were about four times faster than the rust when data was written asynchronously, but a whopping fourteen times faster when reduced to the worst-case scenario.

16 parallel 64KiB random write processes

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=64k --size=256m --numjobs=16 --iodepth=16 --runtime=60 --time_based --end_fsync=1

This time, we're creating 16 separate 256MB files (still totaling 4GB, when all put together) and we're issuing 64KB blocksized random write operations. We're doing it with sixteen separate processes running in parallel, and we're queuing up to 16 simultaneous asynchronous ops before we pause and wait for the OS to start acknowledging their receipt.

This is a pretty decent approximation of a significantly busy system. It's not doing any one particularly nasty thing — like running a database engine or copying tons of dotfiles from a user's home directory — but it is coping with a bunch of applications doing moderately demanding stuff all at once.

This is also a pretty good, slightly pessimistic approximation of a busy, multi-user system like a NAS, which needs to handle multiple 1MB operations simultaneously for different users. If several people or processes are trying to read or write big files (photos, movies, whatever) at once, the OS tries to feed them all data simultaneously. This pretty quickly devolves down to a pattern of multiple random small-block access. So in addition to "busy desktop with lots of apps," think "busy fileserver with several people actively using it."

You will see a lot more variation in speed as you watch this operation play out on the console. For example, the 4K single-process test we tried first wrote at a pretty consistent rate on my MacBook Air's internal drive — but this 16-process job fluctuated considerably over the course of the run before settling on its average. Most of the variation you're seeing here is due to the operating system and SSD firmware sometimes being able to aggregate multiple writes. When it manages to aggregate them helpfully, it can write them in a way that allows parallel writes to all the individual physical media stripes inside the SSD. Sometimes, it still ends up having to give up and write to only a single physical media stripe at a time — or a garbage collection or other maintenance operation at the SSD firmware level needs to run briefly in the background, slowing things down.

Single 1MiB random write process

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=1m --size=16g --numjobs=1 --iodepth=1 --runtime=60 --time_based --end_fsync=1

This is pretty close to the best-case scenario for a real-world system doing real-world things. No, it's not quite as fast as a single, truly contiguous write ... but the 1MiB blocksize is large enough that it's quite close.
Besides, if literally any other disk activity is requested simultaneously with a contiguous write, the "contiguous" write devolves to this level of performance pretty much instantly, so this is a much more realistic test of the upper end of storage performance on a typical system.

You'll see some kooky fluctuations on SSDs when doing this test. This is largely due to the SSD's firmware having better luck or worse luck at any given time, when it's trying to queue operations so that it can write across all physical media stripes cleanly at once. Rust disks will tend to provide a much more consistent, though typically lower, throughput across the run.

You can also see SSD performance fall off a cliff here if you exhaust an onboard write cache — TLC and QLC drives tend to have small write cache areas made of much faster MLC or SLC media. Once those get exhausted, the disk has to drop to writing directly to the much slower TLC/QLC media where the data eventually lands. This is the major difference between, for example, Samsung EVO and Pro SSDs — the EVOs have slow TLC media with a fast MLC cache, where the Pros use the higher-performance, higher-longevity MLC media throughout the entire SSD.

If you have any doubt at all about a TLC or QLC disk's ability to sustain heavy writes, you may want to experimentally extend your time duration here. If you watch the throughput live as the job progresses, you'll see the impact immediately when you run out of cache — what had been a fairly steady, several-hundred-MiB/sec throughput will suddenly plummet to half the speed or less and get considerably less stable as well.

However, you might choose to take the opposite position — you might not expect to do sustained heavy writes very frequently, in which case you actually are more interested in the on-cache behavior. What's important here is that you understand both what you want to test, and how to test it accurately.
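One straightforward way to run that cache-exhaustion experiment is to stretch the same 1MiB job out much longer and watch the live throughput as it goes (the ten-minute runtime below is just an arbitrary illustration, not a magic number):

fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=1m --size=16g --numjobs=1 --iodepth=1 --runtime=600 --time_based --end_fsync=1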
Conclusions

Using fio is definitely an exercise for the true nerd (or professional). It won't hold your hand, and although it provides incredibly detailed results, they're not automatically made into pretty graphs for you.

If all of this feels like far too much work, you can also find simpler-to-use graphical tools, such as HD Tune Pro for Windows. HD Tune Pro is paid software, or there's a limited-capability non-Pro version that is free for personal use. It's a good tool, and it'll make shiny graphs — but it's considerably more limited for advanced users, and the price of the make-it-easy user interface is that you're much further removed from the technical reality of what you're doing.

Learning to use fio means really learning the difference between asynchronous and synchronous writes, and knowing for absolute certain what it's going to do at a very low level on an individual-argument basis. You can't be as certain of what tools like HD Tune Pro are actually doing under the hood — and having to deal with different tools on different operating systems makes it more difficult to directly compare and contrast results as well.