Let the Compiler Do the Work

Learn Rust the Dangerous Way, Part 6


  • A review
  • Back to basics: a simple idiomatic version
      • Rusty bits
      • Fundamentals
      • offset_momentum
      • output_energy and sqr
      • advance
      • main
  • Building the code
  • Inspecting the results
  • Performance evaluation
  • The pros and cons of auto-vectorization
      • Pro
      • Con
  • Conclusion

    In this series so far, we've taken a C program and converted it into a faster, smaller, and reasonably robust Rust program. The Rust program is a recognizable descendant of the C program, and that was deliberate: my goal was to compare and contrast the two languages for optimized code.

    In this bonus section, I'll walk through how we'd write the program from scratch in Rust. In particular, I'm going to rely on compiler auto-vectorization to produce a program that is shorter, simpler, portable, and significantly faster... and without any unsafe.

    Can it be? Read on...

A review

    Recall from part 5 that we had a mostly-safe Rust program which nevertheless contained two kinds of unsafe code:

  • Explicit Intel SSE vector intrinsics, which only work properly on certain processors, rendering us non-portable.
  • Type-punning of memory between f64 and vector types, to support the vector intrinsics.

    As for performance, we were doing alright; here's where things stood, using Clang as a baseline:

    Command                  Mean [s]       Min [s]    Max [s]    Ratio
    ./nbody.gcc-8.bench      6.… ± 0.0…     6.…        6.…        1.…x
    ./nbody.clang-8.bench    5.… ± 0.…      5.…        5.…        …x
    ./nbody-5.bench          5.… ± 0.…      5.…        5.…        0.…x

Back to basics: a simple idiomatic version

    From the original C program, we inherited a bunch of optimizations. Let's undo them, and implement the essential algorithm without making too many assumptions. Because the algorithm can be implemented with a few simple loops, processing array elements that don't depend on one another, it's an excellent candidate for auto-vectorization.

Rusty bits

    I'm sticking to fairly basic Rust here, since people reading this may not be fluent yet, but I'm also writing this program authentically - this is really how I'd solve this problem. The problem does not require a lot of whizbang features.

    The main thing you're going to be seeing below is iterators. Iterators in Rust play a similar role to iterators in C++, but that comparison will quickly lead you astray, as the two are quite different. You don't need to deeply understand iterators to read this section, but I'd suggest at least reading over the Iterators section of the Rust book to familiarize yourself with the pattern. I'll discuss some of the nuanced bits as we encounter them.

    Finally, a word on code style: I'm using the Rust standard conventions as implemented by the rustfmt tool, because doing this is automatic and super easy.

Fundamentals

    Each body in the solar system simulation is represented by this struct:

        #[derive(Clone, Debug)]
        struct Body {
            position: [f64; 3],
            velocity: [f64; 3],
            mass: f64,
        }

    I'm deriving Clone (meaning it's okay if this type gets copied explicitly) and Debug (for pretty debugging output, should we need it). This is by habit; neither feature is required here.
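    As a small sketch of what those two derives buy you: the struct below mirrors the Body type from the text, while the `demo_body` helper and its field values are made up purely for illustration.

```rust
#[derive(Clone, Debug)]
struct Body {
    position: [f64; 3],
    velocity: [f64; 3],
    mass: f64,
}

/// Hypothetical helper: builds a body with arbitrary illustration values.
fn demo_body() -> Body {
    Body {
        position: [1.0, 2.0, 3.0],
        velocity: [0.0; 3],
        mass: 1.5,
    }
}

fn main() {
    let original = demo_body();

    // Clone: an explicit, independent copy. Changing it leaves `original` alone.
    let mut copy = original.clone();
    copy.mass = 2.0;
    copy.position[0] += copy.velocity[0];

    // Debug: `{:?}` prints the struct name and every field.
    println!("{:?}", original);
    println!("{:?}", copy);
}
```

    Without the derives, both the `.clone()` call and the `{:?}` format would be compile errors, which is why deriving them by habit is cheap insurance.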

    We'll use the following constants:

        /// Number of bodies modeled in the simulation.
        const BODIES_COUNT: usize = 5;

        const SOLAR_MASS: f64 = 4. * std::f64::consts::PI * std::f64::consts::PI;
        const DAYS_PER_YEAR: f64 = 365.24;

        /// Number of body-body interactions.
        const INTERACTIONS: usize = BODIES_COUNT * (BODIES_COUNT - 1) / 2;
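    To see where the INTERACTIONS formula comes from, here is a minimal sketch that enumerates every unordered pair of bodies and checks the count against n * (n - 1) / 2. The `pairs` helper is a name introduced only for this illustration; it is not part of the program.

```rust
const BODIES_COUNT: usize = 5;
const INTERACTIONS: usize = BODIES_COUNT * (BODIES_COUNT - 1) / 2;

/// Hypothetical helper: lists every unordered pair (i, j) with i < j.
fn pairs() -> Vec<(usize, usize)> {
    let mut out = Vec::with_capacity(INTERACTIONS);
    for i in 0..BODIES_COUNT {
        // Starting at i + 1 skips self-pairs and mirrored duplicates.
        for j in i + 1..BODIES_COUNT {
            out.push((i, j));
        }
    }
    out
}

fn main() {
    let all = pairs();
    // Five bodies give 5 * 4 / 2 = 10 distinct pairs.
    assert_eq!(all.len(), INTERACTIONS);
    println!("{} interactions: {:?}", all.len(), all);
}
```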

    Finally, we have the constant table giving the starting state of the simulation. This is the same data as the C program used, but I've reformatted the constants:

  • I've put in thousands separators, so 4.84143144246472090e+00 becomes 4.841_431_442_464_72e0, because I find that easier to read.
  • The constants had more digits than the actual precision of a double / f64 allows. I've rounded them to their actual values.

    Rust's linter, Clippy, pointed that second one out, which I appreciated - I don't like my constants to be dishonest!
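    Both points can be checked directly. The sketch below is only an illustration: `shortest_round_trip` is a helper name I've made up, and it leans on the fact that Rust's `Display` output for an `f64` parses back to exactly the same value.

```rust
/// Hypothetical helper: formats a float with `Display` and parses it back.
/// The printed digits are enough to recover the identical f64.
fn shortest_round_trip(x: f64) -> f64 {
    x.to_string().parse().unwrap()
}

fn main() {
    // Underscores are purely visual: both literals denote the same number.
    let with_separators = 4.841_431_442_464_72e0_f64;
    let without_separators = 4.84143144246472e0_f64;
    assert_eq!(with_separators, without_separators);

    // Digits beyond f64 precision change nothing: this literal rounds to
    // exactly the same value as plain 0.1.
    let excessive = 0.10000000000000000555_f64;
    assert_eq!(excessive, 0.1);

    // Printing and re-parsing never loses information.
    assert_eq!(shortest_round_trip(with_separators), with_separators);
    println!("{}", with_separators);
}
```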

        /// Initial state of the simulation.
        const STARTING_STATE: [Body; BODIES_COUNT] = [
            // Sun
            Body {
                mass: SOLAR_MASS,
                position: [0.; 3],
                velocity: [0.; 3],
            },
            // Jupiter
            Body {
                position: [
                    4.841_431_442_464_72e0,
                    -1.160_320_044_027_428_4e0,
                    -1.036_220_444_711_231_1e-1,
                ],
                velocity: [
                    1.660_076_642_744_037e-3 * DAYS_PER_YEAR,
                    7.699_011_184_197_404e-3 * DAYS_PER_YEAR,
                    -6.904_600_169_720_63e-5 * DAYS_PER_YEAR,
                ],
                mass: 9.547_919_384_243_266e-4 * SOLAR_MASS,
            },
            // Saturn
            Body {
                position: [
                    8.343_366_718_244_58e0,
                    4.124_798_564_124_305e0,
                    -4.035_234_171_143_214e-1,
                ],
                velocity: [
                    -2.767_425_107_268_624e-3 * DAYS_PER_YEAR,
                    4.998_528_012_349_172e-3 * DAYS_PER_YEAR,
                    2.304_172_975_737_639_3e-5 * DAYS_PER_YEAR,
                ],
                mass: 2.858_859_806_661_308e-4 * SOLAR_MASS,
            },
            // Uranus
            Body {
                position: [
                    1.289_436_956_213_913_1e1,
                    -1.511_115_140_169_863_1e1,
                    -2.233_075_788_926_557_3e-1,
                ],
                velocity: [
                    2.964_601_375_647_616e-3 * DAYS_PER_YEAR,
                    2.378_471_739_594_809_5e-3 * DAYS_PER_YEAR,
                    -2.965_895_685_402_375_6e-5 * DAYS_PER_YEAR,
                ],
                mass: 4.366_244_043_351_563e-5 * SOLAR_MASS,
            },
            // Neptune
            Body {
                position: [
                    1.537_969_711_485_091_1e1,
                    -2.591_931_460_998_796_4e1,
                    1.792_587_729_503_711_8e-1,
                ],
                velocity: [
                    2.680_677_724_903_893_2e-3 * DAYS_PER_YEAR,
                    1.628_241_700_382_423e-3 * DAYS_PER_YEAR,
                    -9.515_922_545_197_159e-5 * DAYS_PER_YEAR,
                ],
                mass: 5.151_389_020_466_114_5e-5 * SOLAR_MASS,
            },
        ];

offset_momentum

    Since the order of items in a file does not matter in Rust, I'm once again visiting the functions from shortest to longest, starting with offset_momentum.

    The algorithm uses offset_momentum once, at startup, to adjust the sun's velocity. This means it's probably not a hot spot. My Rust version of the operation looks like this:

        /// Adjust the Sun's velocity to offset system momentum.
        fn offset_momentum(bodies: &mut [Body; BODIES_COUNT]) {
            let (sun, planets) = bodies.split_first_mut().unwrap();
            //                          ^-------- Note 1
            sun.velocity = [0.; 3];
            for planet in planets {
                for m in 0..3 {
                    sun.velocity[m] -= planet.velocity[m] * planet.mass / SOLAR_MASS;
                }
            }
        }

    This is a slightly different algorithm from what C used, but the numeric results are the same. C subtracted each body's velocity from the sun's ... including the sun's ... which was a complex way of initializing the sun's velocity to zero. This not only seemed odd, but it made it hard to use an iterator.

    You see, in each iteration of the C loop, two bodies are touched:

  • bodies[0], the sun, gets written.
  • bodies[i], which can be either the sun or a planet, gets read.

    On iteration 0, bodies[0] and bodies[i] are the same - the names alias, and Rust has restrictive checks around aliasing and mutation to prevent data races and iterator invalidation, a common C++ bug.

    To short-circuit this, I've made use of a common Rust pattern, which is to split the array into two non-overlapping sections (Note 1 above), the sun on one side, and all the planets on the other. (This is another example of an operation that uses unsafe under the hood, but presents an API that we can't misuse from safe code.)

    Now I can mutate the sun freely even while iterating over the planets.
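    The same splitting pattern works on any slice. Here is a minimal sketch on plain numbers rather than bodies; `subtract_rest_from_first` is a name invented for this example, and `split_first_mut` is the standard slice method.

```rust
/// Hypothetical helper: subtracts every remaining element from the first
/// one, in place. Does nothing on an empty slice.
fn subtract_rest_from_first(values: &mut [f64]) {
    // `first` is a mutable reference to element 0, and `rest` covers the
    // other elements. The two never overlap, so the borrow checker lets us
    // write through one while reading the other.
    if let Some((first, rest)) = values.split_first_mut() {
        for v in rest.iter() {
            *first -= *v;
        }
    }
}

fn main() {
    let mut values = [10.0, 1.0, 2.0, 3.0];
    subtract_rest_from_first(&mut values);
    println!("{:?}", values); // prints [4.0, 1.0, 2.0, 3.0]
}
```

    Using `if let` instead of `unwrap()` is a choice made for this sketch so that an empty slice is handled quietly; with a fixed-size, non-empty array the `unwrap()` can never fail.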

    Using iterators like this is a great way of avoiding bounds checks, by not requesting them in the first place - the iterator is by definition restricted to valid bounds.

output_energy and sqr

    This routine gets called to print the energy before and after the simulation has been run. The output gets used to check that the program ran correctly.

    At this point, let me introduce the utility function sqr, because I got tired of writing things like body.velocity[0] * body.velocity[0].

        /// A convenient way of computing `x` squared.
        fn sqr(x: f64) -> f64 {
            x * x
        }

    sqr is a place where one might be tempted to use a macro in C, but writing such a macro in a way that x doesn't get evaluated twice is tricky. You'll basically never see a macro used for something like this in Rust code. A function is easier to write, less error-prone to apply, and Rust is fairly aggressive about inlining, so a function generally has no runtime cost over a macro. (If you're concerned that a function might not get inlined, you can slap the #[inline(always)] attribute on it and move on.)

    Right - now that we've got sqr, on to output_energy.

        /// Print the system energy.
        fn output_energy(bodies: &mut [Body; BODIES_COUNT]) {
            let mut energy = 0.;
            //                       v------------------------ Note 1
            for (i, body) in bodies.iter().enumerate() {
                // Add the kinetic energy for each body.
                energy += 0.5
                    * body.mass
                    * (sqr(body.velocity[0])
                        + sqr(body.velocity[1])
                        + sqr(body.velocity[2]));

                // Add the potential energy between this body and every other
                // body.
                for body2 in &bodies[i + 1..BODIES_COUNT] {
                    //       ^--------------------------- Note 2
                    energy -= body.mass * body2.mass
                        / f64::sqrt(
                            sqr(body.position[0] - body2.position[0])
                                + sqr(body.position[1] - body2.position[1])
                                + sqr(body.position[2] - body2.position[2]),
                        );
                }
            }
            println!("{:.9}", energy);
        }

    The method being used here is the same as in C, but there are some unfamiliar parts to the code.

    Note 1: The Rust code

        for (i, body) in bodies.iter().enumerate() {

    is a common way to process all the items in a collection with their indices. The iter() call gets an iterator over each body, and the enumerate() operation (available on any iterator) extends it with an upwards-counting number.

    Okay, so Rust provides a new weird way to write a for loop. Why did I use it here?

    Because, as in offset_momentum, this approach does not request or require bounds checks, so I don't have to measure the results to tell if the compiler was able to eliminate them. It will also adapt to any changes in the length of the bodies array, without even needing to rely on a common constant. And because, in practice, it's generally faster than an explicit indexing loop.
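    To make the comparison concrete, here is a minimal sketch of the same computation written both ways. The function names and the sample data are hypothetical, chosen only for illustration; they are not part of the n-body program.

```rust
/// Index-based loop: each `masses[i]` is an indexing operation that may
/// carry a bounds check unless the compiler can prove it unnecessary.
fn weighted_indexed(masses: &[f64]) -> f64 {
    let mut total = 0.0;
    for i in 0..masses.len() {
        total += i as f64 * masses[i];
    }
    total
}

/// Iterator-based loop: `enumerate` hands over the index together with
/// a reference to the element, so the loop body never indexes the slice.
fn weighted_enumerated(masses: &[f64]) -> f64 {
    let mut total = 0.0;
    for (i, mass) in masses.iter().enumerate() {
        total += i as f64 * mass;
    }
    total
}

fn main() {
    // Hypothetical data standing in for the `bodies` array.
    let masses = [4.0, 2.5, 1.0];
    println!("{}", weighted_indexed(&masses));
    println!("{}", weighted_enumerated(&masses));
}
```

    Both functions compute the same sum; the second one simply never asks for an element by index, so there is nothing for the compiler to check.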

    You'll see this pattern a lot, so it's worth understanding.

    Note 2: The other significant Rust-ism here is how we express the body2 loop, which was j in the C version. The goal is, for each body, we want to iterate over every body with a higher index. We express this by slicing the array starting at i + 1 and iterating over the result. This technically asks for a check that i + 1 is in bounds, and BODIES_COUNT is equal to or less than the length - but that's better than checking every time through the loop, and in practice, the compiler is great at optimizing this away.

    I (strongly) prefer this phrasing of the loop, because we no longer have to keep i and j straight in the equations below. I have an easier time telling body from body2 than bodies[i] from bodies[j].

    advance

    The advance operation pushes the simulation forward by one time step. It's the biggest hot spot in the program, executed millions of times during a benchmark run. It's where most of the optimizations were applied in C. But let's take a step back and consider what it actually does.

    Using the method from the C program, advance consists of four phases.

    1. Compute the vector between each pair of bodies in the system.

    2. Compute the magnitude of the gravitational force given those vectors.

    3. Apply gravitation from each body to every other body's velocity.

    4. Update each body's position based on its velocity.

    Let's visit each of those steps in turn.

        /// Steps the simulation forward by one time-step.
        fn advance(bodies: &mut [Body; BODIES_COUNT]) {
            // Compute point-to-point vectors between each unique pair of
            // bodies.
            let mut position_deltas = [[0.; 3]; INTERACTIONS];
            {
                let mut k = 0;
                for i in 0..BODIES_COUNT - 1 {
                    for j in i + 1..BODIES_COUNT {
                        for (m, pd) in position_deltas[k].iter_mut().enumerate() {
                            *pd = bodies[i].position[m] - bodies[j].position[m];
                        }
                        k += 1;
                    }
                }
            }

    That loop is nearly unmodified from the C translation. I played with replacing the range-based loops with iterators, but I didn't stick with the result, for two reasons. First, I thought it was kind of hard to read. Second, it messed up the compiler's ability to reason about k. Right now, the compiler is smart enough to look at the way k is maintained and determine that it's always in bounds for position_deltas[k]; adding some iterator implementations to the mix seemed to break this. Usually, iterators help avoid bounds checks, but this is a complex case.

    I am, however, using an iterator to loop over each dimension of the vector, which actually improved code generation a bit.

    Note that when we reach the end of that code snippet, position_deltas is still mutable (mut). We don't need to change it ever again, so it does not have to be mut. We'll see an idiom for cases like this in just a moment.
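    As an illustration of how k tracks the unique pairs, here is a small sketch that records which (i, j) pair each value of k corresponds to, and checks it against the slicing formulation used in output_energy. It assumes five bodies and that INTERACTIONS counts the unique unordered pairs, n(n-1)/2; the two function names are hypothetical.

```rust
// Assumptions for this sketch: five bodies, and INTERACTIONS equal to
// the number of unique unordered pairs among them.
const BODIES_COUNT: usize = 5;
const INTERACTIONS: usize = BODIES_COUNT * (BODIES_COUNT - 1) / 2;

/// Walks the pairs with explicit indices, the way `advance` does,
/// returning the (i, j) pair visited at each value of `k`.
fn pairs_by_index() -> [(usize, usize); INTERACTIONS] {
    let mut pairs = [(0, 0); INTERACTIONS];
    let mut k = 0;
    for i in 0..BODIES_COUNT - 1 {
        for j in i + 1..BODIES_COUNT {
            pairs[k] = (i, j);
            k += 1;
        }
    }
    pairs
}

/// Visits the same pairs by slicing from `i + 1`, the way
/// `output_energy` does.
fn pairs_by_slicing(ids: &[usize; BODIES_COUNT]) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for (i, a) in ids.iter().enumerate() {
        for b in &ids[i + 1..] {
            pairs.push((*a, *b));
        }
    }
    pairs
}

fn main() {
    let by_index = pairs_by_index();
    let by_slice = pairs_by_slicing(&[0, 1, 2, 3, 4]);
    // k ends up covering every slot exactly once.
    assert_eq!(by_index.to_vec(), by_slice);
    println!("{:?}", by_index);
}
```

    The index-based version needs k because it writes into a flat array of per-pair results; the slicing version has no such array to address, which is why it reads more simply.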

    Moving on:

            // Compute the `1 / d^3` magnitude between each pair of bodies.
            let magnitudes = {
                let mut magnitudes = [0.; INTERACTIONS];
                for (i, mag) in magnitudes.iter_mut().enumerate() {
                    let distance_squared = sqr(position_deltas[i][0])
                        + sqr(position_deltas[i][1])
                        + sqr(position_deltas[i][2]);
                    *mag = 0.01 / (distance_squared * distance_squared.sqrt());
                }
                magnitudes
            };

    This time, I'm using the "freeze" pattern to update a mutable array in-place within the block, but then assign it to a non-mut binding, preventing accidental further mutation. This array is small enough that rustc does the right thing. (I didn't do this for position_deltas because doing so generated a call to memcpy. This is likely to be a compiler bug, and illustrates why benchmarking your programs is important.)
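    Stripped of the physics, the "freeze" pattern looks like the sketch below. The function name and the table of squares are made up for illustration.

```rust
/// Builds a small table of squares, mutating it only inside the block.
fn squares() -> [u32; 4] {
    let table = {
        // The mutable binding exists only within this block.
        let mut table = [0u32; 4];
        for (i, slot) in table.iter_mut().enumerate() {
            *slot = (i * i) as u32;
        }
        table // the block's value: the finished array
    };

    // From here on `table` is an immutable binding. Uncommenting the
    // next line produces a compile error.
    // table[0] = 99;
    table
}

fn main() {
    println!("{:?}", squares());
}
```

    The inner and outer bindings share a name on purpose: the outer one shadows nothing the rest of the function needs, and the mutable version is gone as soon as the block ends.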

    If you've been following the translation of the C program in the rest of this series, you might notice that this code is really short. For performance, the previous programs used a faster but less accurate "reciprocal square root" operation, followed by two iterations of the Newton-Raphson method for approximating square roots to improve its precision.

    That's clever, but:

    1. Using that "reciprocal square root" instruction makes us non-portable, because it's Intel-specific, and

    2. I'm keeping things simple until data shows me I can't.

    So, I just used sqrt.
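    For readers who haven't seen the earlier parts, the sketch below contrasts the two approaches on a single squared distance. It is not the code from the previous programs: the rough starting estimate is a made-up constant standing in for what a hardware reciprocal-square-root instruction might return, and the function names and the time step of 0.01 are assumptions for illustration.

```rust
/// One Newton-Raphson step refining an estimate `y` of 1/sqrt(x).
fn refine_rsqrt(x: f64, y: f64) -> f64 {
    y * (1.5 - 0.5 * x * y * y)
}

/// dt / d^3 computed the simple way, with a plain `sqrt`.
fn magnitude_simple(distance_squared: f64, dt: f64) -> f64 {
    dt / (distance_squared * distance_squared.sqrt())
}

/// dt / d^3 computed from a rough 1/sqrt estimate plus two refinements.
fn magnitude_refined(distance_squared: f64, rough_rsqrt: f64, dt: f64) -> f64 {
    let once = refine_rsqrt(distance_squared, rough_rsqrt);
    let twice = refine_rsqrt(distance_squared, once);
    dt * twice * twice * twice
}

fn main() {
    // Squared distance for a position delta of (3, 4, 0), so d = 5.
    let d2 = 25.0;
    // Deliberately imprecise stand-in for a hardware estimate of 0.2.
    let rough = 0.19;

    let simple = magnitude_simple(d2, 0.01);
    let refined = magnitude_refined(d2, rough, 0.01);
    println!("simple  = {:e}", simple);
    println!("refined = {:e}", refined);
}
```

    The refined value gets close to the simple one, but only after extra arithmetic and with a small residual error, which is the trade the earlier programs were making for speed.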

    Now, the velocity update loop:

            // Apply every other body's gravitation to each body's velocity.
            {
                let mut k = 0;
                for i in 0..BODIES_COUNT - 1 {
                    for j in i + 1..BODIES_COUNT {
                        let i_mass_mag = bodies[i].mass * magnitudes[k];
                        let j_mass_mag = bodies[j].mass * magnitudes[k];
                        for (m, pd) in position_deltas[k].iter().enumerate() {
                            bodies[i].velocity[m] -= *pd * j_mass_mag;
                            bodies[j].velocity[m] += *pd * i_mass_mag;
                        }
                        k += 1;
                    }
                }
            }

    Again, that loop is substantially similar to the version from part 5, but using an iterator where possible. (As in the first loop in advance, I haven't found an iterator-based formulation for the i / j / k loops that improves code generation.)

    And finally, position updates:

            // Update each body's position.
            for body in bodies {
                for (m, pos) in body.position.iter_mut().enumerate() {
                    *pos += 0.01 * body.velocity[m];
                }
            }
        }

    This is identical to the code from part 5 except for the iterators.

    main

    Finally, we need a main function to drive the whole shebang.

        fn main() {
            let c = std::env::args().nth(1).unwrap().parse().unwrap();

            let mut solar_bodies = STARTING_STATE;

            offset_momentum(&mut solar_bodies);
            output_energy(&mut solar_bodies);
            for _ in 0..c {
                advance(&mut solar_bodies)
            }
            output_energy(&mut solar_bodies);
        }

    That's hardly changed from part 5 - in fact, I only changed some spelling and capitalization to match Rust conventions.

    Here's the completed program: nbody-6.rs

    Building the code

    Normally, we'd build this program using Cargo, and that's how I built it when writing this article, but because we don't have any dependencies, you can just compile the code with:

        $ rustc --edition=2018 \
            -C opt-level=3 -C codegen-units=1 -C debuginfo=2 -C target-cpu=core2 \
            --target x86_64-unknown-linux-musl \
            nbody-6.rs -o nbody-6

    That being said, because I want to encourage you to use Cargo, here's how you would do it.

    First, we need a directory containing a Cargo.toml file defining a build with the same options as our command line above.

        [package]
        name = "nbody"
        version = "0.1.0"
        authors = ["…"]
        edition = "2018"

        [profile.release]
        opt-level = 3
        codegen-units = 1
        debug = 2

    Next, because Cargo.toml is portable and unconcerned with certain build details, we need a .cargo/config file customizing our machine-specific parts of the build:

        [build]
        target = "x86_64-unknown-linux-musl" # static linking
        rustflags = ["-C", "target-cpu=core2"]

    Finally, place the source code in src/main.rs and run

        $ cargo build --release

    Cargo is good. You should use it if possible.

    Okay, end of evangelism. Moving right along...

    Inspecting the results

    When you want to know if a compiler is behaving a certain way, you check the output. One way to check it is to measure its performance; another is to disassemble it and read the result.

    We can disassemble the binary with

        $ objdump -d --demangle nbody-6 | less

    The listing is long, but inside nbody_6::main we find code that looks like this:

        402cc0:   … 0f … c9    mulpd  %xmm9,%xmm9
        402cc5:   … 0f … cc    addpd  %xmm4,%xmm9
        402cca:   … 0f … d1    sqrtpd %xmm9,%xmm2
        402ccf:   … 0f … d1    mulpd  %xmm9,%xmm2

    Those are SSE vector instructions, alright. They even end in pd, for "packed doubles," which means they're operating on two f64s at a time. Since we didn't ask for that anywhere in our code, this is a sign that auto-vectorization is happening.
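    As a rough illustration of the kind of code that invites this treatment, the hypothetical function below applies the same multiply and square root to every element of a small fixed-size array. Whether a given rustc actually emits packed instructions for it depends on the compiler version, optimization level, and target-cpu, so treat it as something to check in a disassembly rather than a guarantee.

```rust
/// Computes d2 * sqrt(d2) for each element. The same two operations are
/// applied uniformly to every lane, with no dependence between elements.
fn distance_cubed(d2: &[f64; 4]) -> [f64; 4] {
    let mut out = [0.0; 4];
    for (o, x) in out.iter_mut().zip(d2.iter()) {
        *o = x * x.sqrt();
    }
    out
}

fn main() {
    println!("{:?}", distance_cubed(&[1.0, 4.0, 9.0, 16.0]));
}
```

    Nothing in the source mentions vectors or SSE; the loop just has a fixed trip count and independent iterations, which is what gives the optimizer room to process two f64s per instruction.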

    rustc is quite aggressive about auto-vectorizing programs, which is one reason why it's important to set target-cpu to something reasonable. By default, it assumes a very generic processor - which ensures that your binaries will run on your friend's computer, but may leave some performance on the table by not taking advantage of recent processor features.

    But is auto-vectorization actually helping? To find out, we'll have to run it.

    Performance evaluation

    Let's take a baseline measurement - surely we'll need to optimize it further, right?

        Command                  Mean [s]       Min [s]   Max [s]   Ratio
        ./nbody.gcc-8.bench      6.… ± 0.0…     6.…       6.…       1.…x
        ./nbody.clang-8.bench    5.… ± 0.…      5.…       5.…       0.…x
        ./nbody-5.bench          5.… ± 0.…      5.…       5.…       0.…x
        ./nbody-6.bench          3.… ± 0.0…     3.…       3.…       0.…x

    No, we won. The simpler auto-vectorized program is significantly faster than the hand-optimized version. For each simulation step performed by the original program compiled by gcc, the clang version can do 1.2 steps, and the new one can do 1.6.

    We've produced the fastest implementation so far, by doing the least work.

    Pro

    Our new program is much simpler and clearer than the original C program. It expresses the algorithm in a straightforward way, which can be read and understood without knowing anything about SIMD instruction sets.

    The program uses no unsafe of any kind, so we know without thinking very much that it's not going to crash, violate memory safety, or introduce security flaws like stack smash opportunities.

    The program is entirely portable. There is no Intel-specific stuff in it, and in fact, no 64-bit-specific stuff. It compiles and runs fine on x86, x86-64, and 32- and 64-bit ARM, among others.

    This program is also more accurate than the C program, because we used real sqrt instead of an approximation. (You could make it faster, but non-portable, by re-introducing the reciprocal square root trick.)

    Finally, the program can automatically benefit from new instruction set improvements, simply by changing the target-cpu. For example, if you set target-cpu=sandybridge or later, the program will use AVX instead of SSE.
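    One way to confirm that a target-cpu setting took effect is to ask the compiler which vector extensions it enabled. The sketch below is an illustration, not code from this series: the function name compiled_vector_features is made up, and it only checks three x86 feature names using the standard cfg! macro.

```rust
/// Lists the x86 vector extensions that were enabled when this code was
/// compiled. `compiled_vector_features` is a hypothetical helper name.
/// The checks are resolved at compile time, so the result reflects the
/// build settings (such as target-cpu), not the machine the binary runs on.
fn compiled_vector_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    if cfg!(target_feature = "sse2") {
        features.push("sse2");
    }
    if cfg!(target_feature = "avx") {
        features.push("avx");
    }
    if cfg!(target_feature = "avx2") {
        features.push("avx2");
    }
    features
}
```

    Printing the result with println!("{:?}", compiled_vector_features()) in a build that passes -C target-cpu=sandybridge to rustc should show avx in the list, while a default x86-64 build should show only sse2.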

    Con

    So auto-vectorization is amazing and we never have to write vector code again, right?

    That depends.

    Auto-vectorization is magic. One of the down-sides of magic is that, when it doesn't work, it can be hard to tell why.

    For example, I mentioned above that you can recompile the program to use AVX. At face value, this should make the program twice as fast, because AVX instructions can chew on four f64s at a time instead of SSE's two. But, on my Skylake-based machine at least, the AVX version becomes slower.

    By inspecting the disassembly and applying Linux's perf tool, I was able to discover why: when targeting AVX, the version of rustc I used fails to vectorize the sqrt in the magnitude computation loop. sqrt is the most expensive instruction we have, and instead of getting twice as cheap, it got twice as expensive. I was able to detect this with a benchmark, but to diagnose it I had to read the assembly listing - and that got me no closer to explaining why it fails.

    Auto-vectorization may also require you to be careful about how you write your loops. The LLVM folks have some docs on auto-vectorization that show patterns to use and avoid (in C, but they're similar in Rust).
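    As a rough illustration of what "careful about how you write your loops" can mean in Rust, here are two ways to write a magnitude-style loop. This is a sketch, not the code from this series: both function names are hypothetical, and whether either form actually vectorizes depends on the compiler version and target, so check the disassembly rather than trusting the shape of the source.

```rust
/// Index-based loop. `d2[i]` carries a bounds check that can panic if `d2`
/// is shorter than `out`, and that extra branch in the loop body can get in
/// the vectorizer's way.
fn magnitudes_indexed(out: &mut [f64], d2: &[f64], dt: f64) {
    for i in 0..out.len() {
        out[i] = dt / (d2[i] * d2[i].sqrt());
    }
}

/// Iterator-based loop. `zip` stops at the shorter of the two slices, so
/// there is no per-element bounds check and no panic path inside the loop.
fn magnitudes_zipped(out: &mut [f64], d2: &[f64], dt: f64) {
    for (o, &d) in out.iter_mut().zip(d2) {
        *o = dt / (d * d.sqrt());
    }
}
```

    Both functions compute the same values when the slices have equal length. Using fixed-size arrays, so that the compiler knows every length up front, is another way to remove the bounds checks.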

    We've written a simpler program that runs much faster than our hand-optimized code, thanks to the compiler. But we've also seen that auto-vectorization has drawbacks. So should we rely on it in real-world situations?

    Personally? My answer is yes. In my experience, the amount of time I spend inspecting and profiling auto-vectorization hiccups is much less than the amount of time I could spend hand-optimizing an algorithm for a particular processor's vector unit. And even if I were to invest the extra work, I'd have to do it all again if I wanted to switch to ARM or AVX later.

    When we write high-performance code nowadays, we're always relying on some collection of optimization passes in the compiler. Any of them can glitch out and let us down, because compilers are, in the end, dumb pattern-matching machines. Auto-vectorization is one such optimization.

    You should always have regression tests for your software's important features, and if performance is one of your features, you should treat it as one:

    Document the compiler version you tested with, or ideally, pin it in your build system. Rust has the rust-toolchain file to do this.

    Include a benchmark test that will indicate if things have suddenly gotten slower, so you can at least know to investigate. Ensure that this test runs as part of the build or continuous integration flow. (For Rust projects using Cargo, I recommend Criterion.)

    Make sure to run the benchmarks when you upgrade the compiler. I haven't had a rustc update break auto-vectorization before, but I know it's possible, because I've had gcc upgrades break me several times.

    But you don't have to agree with me. Maybe auto-vectorization asks you to put too much trust in the compiler. That's fine!
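    For those who do rely on auto-vectorization, the benchmark test recommended above can be sketched with nothing but the standard library. Everything here is an assumption made for illustration: workload stands in for the real simulation, and best_of, within_budget, the baseline and the tolerance are hypothetical. A proper benchmark harness such as Criterion handles warm-up and statistics far better than this does.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the code under test: Newton's iteration
/// converging on the square root of 2.
fn workload(steps: usize) -> f64 {
    let mut x = 1.0_f64;
    for _ in 0..steps {
        x = (x + 2.0 / x) * 0.5;
    }
    x
}

/// Runs `f` `runs` times and returns the fastest observed wall-clock time.
/// Taking the minimum reduces noise from other activity on the machine.
fn best_of<T>(runs: usize, mut f: impl FnMut() -> T) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        // black_box keeps the optimizer from discarding the result.
        black_box(f());
        best = best.min(start.elapsed());
    }
    best
}

/// True if `measured` is no more than `tolerance` (0.1 means 10%) slower
/// than a recorded `baseline`.
fn within_budget(measured: Duration, baseline: Duration, tolerance: f64) -> bool {
    measured.as_secs_f64() <= baseline.as_secs_f64() * (1.0 + tolerance)
}
```

    A regression check would record a baseline on a known-good compiler, then fail the build when within_budget(best_of(10, || workload(1_000_000)), baseline, 0.1) returns false. The baseline only means something on the machine it was recorded on.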

    In the next section, I'll look at how we can explicitly vector-optimize our algorithms in Rust, without using non-portable or unsafe code. Stay tuned!