
Swift: Google’s bet on differentiable programming

Two years ago, a small team at Google started working on making Swift the first mainstream language with first-class language-integrated differentiable programming capabilities. The scope and initial results of the project have been remarkable, and general public usability is not very far off. Despite this, the project hasn’t received a lot of interest in the machine learning community and remains unknown to most practitioners. This can be attributed in part to the choice of language, which has largely been met with confusion and indifference, as Swift has almost no presence in the data science ecosystem and has mainly been used for building iOS apps.

This is unfortunate though, as even a cursory glance at Google’s project shows that it’s a massive and ambitious undertaking, one that could establish Swift as a key player in the area. Furthermore, even though we mainly work with Python at Tryolabs, we think that choosing Swift was a superb idea, and decided to write this (short) post to help spread the word about Google’s plans.

But before we get into Swift and what the term differentiable programming actually means, we should first review the current state of affairs.

What is wrong with you, Python?!

Python is by far the most used language in machine learning, and Google has a ton of machine learning libraries and tools written in it. So, why Swift? What’s wrong with Python?

To put it bluntly, Python is slow. Also, Python is not great for parallelism.

To get around these facts, most machine learning projects run their compute-intensive algorithms via libraries written in C / C++ / Fortran / CUDA, and use Python to glue the different low-level operations together. For the most part, this has worked really well, but as with all abstractions, it can create some problems. Let’s go over some of those.

External binaries

Calling external binaries for every compute-intensive operation limits developers to working on a small portion of the algorithm’s surface area. Writing a custom way to perform convolutions, for example, becomes off limits unless the developer is willing to step down into a language like C. Most programmers choose not to do so, either because they have no experience with writing low level performant code, or because jumping back and forth between Python’s development environment and some low level language’s environment becomes too cumbersome to justify.

This leads to the unfortunate situation in which programmers are motivated to write the least amount of sophisticated code they can, and default to calling external library operations. This is the opposite of what’s desirable in an area as dynamic as machine learning, where so much is still not settled, and new ideas are very much needed.

Library abstractions

Having your Python code call lower level code is not as easy as mapping Python’s functions to C functions. The unfortunate reality is that the creators of machine learning libraries have had to make certain development choices in the name of performance, and that has complicated matters a bit. For example, in TensorFlow graph mode, which is the only performant mode in the library, your Python code doesn’t generally run when you think it would. Python actually acts as a sort of metaprogramming language for the underlying TensorFlow graph.

The development flow is as follows: the developer first defines the network using Python, then the TensorFlow backend uses this definition to build the network and compile it into a blob whose internals the developer can no longer access. After compilation, the network is finally ready to run, and the developer can start feeding it data for training / inference jobs. This way of working makes debugging quite hard, as you can’t use Python to dig into what’s happening inside your network as it runs. You can’t use something like pdb. Even if you wish to engage in good old print debugging, you’ll have to use tf.print and build a print node into your network, which has to connect to another node in your network, and be compiled before anything can be printed.
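To make that workflow concrete, here is a minimal sketch using the TF 1.x-style API via tf.compat.v1; the shapes and values are made up for illustration, and the (now deprecated) tf.Print op stands in for the "print node" idea described above:

import tensorflow.compat.v1 as tf

tf.disable_eager_execution()  # force graph mode

x = tf.placeholder(tf.float32, shape=[None, 4])   # symbolic graph input
w = tf.Variable(tf.random_normal([4, 1]))         # graph parameter
y = tf.matmul(x, w)                               # nothing has actually run yet

# A plain print() only shows the symbolic tensor, not its values,
# e.g. something like: Tensor("MatMul:0", shape=(?, 1), dtype=float32)
print(y)

# To see values, a print node has to be wired into the graph itself...
y = tf.Print(y, [y], message="y = ")

# ...and nothing is printed until the graph is executed in a session.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0, 4.0]]})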

More straightforward solutions exist, though. In PyTorch, your code runs imperatively, as is expected in Python, and the only non-transparent abstraction is that operations that run on the GPU execute asynchronously. This is generally a non-issue, as PyTorch is smart about this and waits for all the async calls that are dependencies of any user-interactive operations to finish before ceding control. Still, this is something to keep in mind, especially with things such as benchmarking.
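As a small illustration of that benchmarking pitfall (the sizes here are arbitrary), a naive timer can end up measuring only the kernel launch unless you explicitly synchronize:

import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

start = time.perf_counter()
c = a @ b                         # returns as soon as the kernel is queued on the GPU
naive = time.perf_counter() - start

start = time.perf_counter()
c = a @ b
if device == "cuda":
    torch.cuda.synchronize()      # wait for the GPU to actually finish the work
synced = time.perf_counter() - start

print(f"naive: {naive:.6f}s, synchronized: {synced:.6f}s")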

Industry lag

All these usability problems aren't just making it more difficult to write code; they are unnecessarily causing the industry to lag behind academia. There have been several papers that tweak low level operations used in neural networks, introducing improvements in accuracy of a few percentage points in the process, and have still taken a long time for the industry to adopt.

One reason for this is that even though these algorithmic changes tend to be quite simple themselves, the tooling problems mentioned above make them extremely difficult to implement. Hence, they may not be deemed worth the effort for what often results in only a 1% improvement in accuracy. This is especially problematic for small machine learning dev shops that usually lack the economies of scale to justify paying for their implementation / integration.

Therefore, companies tend to ignore these improvements until they get added to a library like PyTorch or TensorFlow. This saves them the implementation and integration costs, but it also causes industry to lag behind academia by 1 or 2 years, as the library maintainers can't be expected to immediately implement the findings of every new paper that is published.

One concrete example of this issue is Deformable Convolutions, which seem to improve the performance of most Convolutional Neural Networks (CNNs). An open source implementation appeared about 2 years ago. Nevertheless, the implementation was cumbersome to integrate into PyTorch / TensorFlow and the algorithm didn’t gain widespread use. Only just recently has PyTorch added support for it, and as of yet I am not aware of there being an official TensorFlow version.
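For reference, recent versions of torchvision expose this operation; the following is only a rough sketch with arbitrary shapes, assuming a torchvision release that includes torchvision.ops.DeformConv2d:

import torch
from torchvision.ops import DeformConv2d

x = torch.randn(1, 16, 32, 32)
# 2 offset coordinates per 3x3 kernel position, per output pixel
offset = torch.zeros(1, 2 * 3 * 3, 32, 32)

conv = DeformConv2d(16, 32, kernel_size=3, padding=1)
y = conv(x, offset)   # with zero offsets this behaves like a regular convolution
print(y.shape)        # torch.Size([1, 32, 32, 32])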

Now, let’s say this happens for several papers that each contribute a performance enhancement of 2%; the industry could be missing out on significant accuracy improvements of 1.02^n for no reason other than inadequate tooling. This is regrettable, considering that n could be quite high.

Speed

Using Python with fast libraries can still be slow in some cases. Yes, for CNNs running classification on images, using Python and PyTorch / TensorFlow will be really fast. What’s more, there is probably not much performance to be gained by coding your whole network in CUDA, as most of the inference time is spent on big convolutions that are already running in well-optimized implementations. This isn’t always the case though.

Networks that consist of many small operations are often the most susceptible to taking performance hits if they are not fully implemented in a low level language. As an example, in a blogpost in which he professes his love for using Swift for deep learning, Fast.AI’s Jeremy Howard reports that despite using PyTorch’s great JIT compiler, he still couldn’t make a particular RNN work as fast as a version completely implemented in pure CUDA.
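To give a rough feel for the kind of code being discussed (this is only a toy sketch with made-up sizes, not Howard's actual benchmark), a hand-written RNN step loop is exactly the many-small-ops pattern that PyTorch's JIT tries, with mixed success, to speed up:

import torch

def rnn_loop(x, h, w_ih, w_hh):
    # x: (seq_len, batch, input_size), h: (batch, hidden_size)
    for t in range(x.size(0)):
        h = torch.tanh(x[t] @ w_ih + h @ w_hh)   # many small matmuls and tanhs
    return h

rnn_loop_jit = torch.jit.script(rnn_loop)        # TorchScript can fuse some of these small ops

x = torch.randn(128, 32, 64)      # 128 timesteps, batch of 32, 64 input features
h = torch.randn(32, 256)          # 256 hidden units
w_ih = torch.randn(64, 256)
w_hh = torch.randn(256, 256)
out = rnn_loop_jit(x, h, w_ih, w_hh)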

Furthermore, Python is not a very good language for cases where latency is important, nor for very low level tasks such as communicating with sensors. The way some companies choose to get around this is to start by developing their models in PyTorch / TensorFlow using Python. In this way, they take advantage of Python’s ease of use when experimenting with and training new models. After this, they rewrite their model in C++ for production purposes. I’m not sure if they rewrite it completely, or if they simply serialize it using PyTorch’s tracing functionality or TensorFlow’s graph mode, and then rewrite the Python code around it in C++. Either way, a lot of Python code would need to be rewritten, which oftentimes is too costly for small companies to do.
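A small sketch of that serialization path using PyTorch's tracing (the model here is just a stand-in; any nn.Module would work the same way):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1)).eval()
example_input = torch.randn(1, 8)

traced = torch.jit.trace(model, example_input)   # records the ops actually executed
traced.save("model_traced.pt")                   # this file can be loaded from C++ via LibTorch

The C++ side then loads the traced module with LibTorch and runs it without any Python involved, which is the "rewrite the surrounding code in C++" scenario described above.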

All these problems are well known. Yann LeCun, who is widely considered one of the godfathers of deep learning, has stated that there is a need for a new machine learning language. In a thread, he and PyTorch co-creator Soumith Chintala discussed several languages as possible candidates, with Julia, Swift, and even improving Python being mentioned. On the other hand, Fast.AI’s Jeremy Howard seems to be decidedly on the Swift train.

Google accepts the challenge

Lucky for us, Google’s Swift for TensorFlow (S4TF) team took the matter into their own hands. What’s even better, they have been remarkably transparent about the whole process. In an extremely thorough document, they detail the journey that got them to this decision, explaining which languages they considered for the task and why they ended up using Swift.

Most notably, they considered:

Go: In the document, they state that Go relies too heavily on the dynamic dispatching that its interfaces provide, and that they would have had to make big changes to the language in order to implement the features they wanted. This goes against Go’s philosophy of being simple and having a small surface area. Conversely, Swift’s protocols & extensions afford a lot of freedom in terms of how static you want your dispatch to be. Also, Swift is already quite complex (and getting more complex every day), so making it a bit more complex to accommodate the features that Google wanted wouldn’t pose as big of a problem.
