1. CUDA Toolkit Major Components
This section provides an overview of the major components of the CUDA Toolkit and points to their locations after installation.
- Compiler
- The CUDA-C and CUDA-C++ compiler, nvcc, is found in the bin/ directory. It is built on top of the NVVM optimizer, which is itself built on top of the LLVM compiler infrastructure. Developers who want to target NVVM directly can do so using the Compiler SDK, which is available in the nvvm/ directory.
- Please note that the following files are compiler-internal and subject to change without any prior notice.
- any file in include/crt and bin/crt
- include/common_functions.h, include/device_double_functions.h, include/device_functions.h, include/host_config.h, include/host_defines.h, and include/math_functions.h
- nvvm/bin/cicc
- bin/cudafe++, bin/bin2c, and bin/fatbinary
- Tools
- The following development tools are available in the bin/ directory (except for Nsight Visual Studio Edition (VSE), which is installed as a plug-in to Microsoft Visual Studio; Nsight Compute and Nsight Systems are available in a separate directory).
- IDEs: nsight (Linux, Mac), Nsight VSE (Windows)
- Debuggers: cuda-memcheck, cuda-gdb (Linux), Nsight VSE (Windows)
- Profilers: Nsight Systems, Nsight Compute, nvprof, nvvp, Nsight VSE (Windows)
- Utilities: cuobjdump, nvdisasm
- Libraries
- The scientific and utility libraries listed below are available in the lib/ directory (DLLs on Windows are in bin/), and their interfaces are available in the include/ directory.
- cublas (BLAS)
- cublas_device (BLAS Kernel Interface)
- cuda_occupancy (Kernel Occupancy Calculation [header file implementation])
- cudadevrt (CUDA Device Runtime)
- cudart (CUDA Runtime)
- cufft (Fast Fourier Transform [FFT])
- cupti (CUDA Profiling Tools Interface)
- curand (Random Number Generation)
- cusolver (Dense and Sparse Direct Linear Solvers and Eigen Solvers)
- cusparse (Sparse Matrix)
- libcu++ (CUDA Standard C++ Library)
- nvJPEG (JPEG encoding/decoding)
- npp (NVIDIA Performance Primitives [image and signal processing])
- nvblas (“Drop-in” BLAS)
- nvcuvid (CUDA Video Decoder [Windows, Linux])
- nvgraph (CUDA nvGRAPH [accelerated graph analytics])
- nvml (NVIDIA Management Library)
- nvrtc (CUDA Runtime Compilation)
- nvtx (NVIDIA Tools Extension)
- thrust (Parallel Algorithm Library [header file implementation])
- CUDA Samples
- Code samples that illustrate how to use various CUDA and library APIs are available in the samples/ directory on Linux and Mac, and are installed to C:\ProgramData\NVIDIA Corporation\CUDA Samples on Windows. On Linux and Mac, the samples/ directory is read-only and the samples must be copied to another location if they are to be modified. Further instructions can be found in the Getting Started Guides for Linux and Mac.
- Documentation
- The most current version of these release notes can be found online at http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html. Also, the version.txt file in the root directory of the toolkit contains the version and build number of the installed toolkit.
- Documentation can be found in PDF form in the doc/pdf/ directory, in HTML form at doc/html/index.html, and online at http://docs.nvidia.com/cuda/index.html.
- CUDA Driver
- Running a CUDA application requires a system with at least one CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit. See Table 1. For more information on various GPU products that are CUDA-capable, visit https://developer.nvidia.com/cuda-gpus.
Each release of the CUDA Toolkit requires a minimum version of the CUDA driver. The CUDA driver is backward compatible, meaning that applications compiled against a particular version of CUDA will continue to work on subsequent (later) driver releases.
More information on compatibility can be found at https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-runtime-and-driver-api-version.
- For convenience, the NVIDIA driver is installed as part of the CUDA Toolkit installation. Note that this driver is for development purposes and is not recommended for use in production with Tesla GPUs.
For running CUDA applications in production with Tesla GPUs, it is recommended to download the latest driver for Tesla GPUs from the NVIDIA driver downloads site at http://www.nvidia.com/drivers.
- During the installation of the CUDA Toolkit, the installation of the NVIDIA driver may be skipped on Windows (when using the interactive or silent installation) or on Linux (by using meta packages).
For more information on customizing the install process on Windows, see http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#install-cuda-software.
For meta packages on Linux, see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-metas
- CUDA-GDB Sources
- CUDA-GDB sources are available as follows:
- For CUDA Toolkit 7.0 and newer, in the installation directory extras/. The directory is created by default during the toolkit installation unless the .rpm or .deb package installer is used. In this case, the cuda-gdb-src package must be manually installed.
- For CUDA Toolkit 6.5, 6.0, and 5.5, at https://github.com/NVIDIA/cuda-gdb.
- For CUDA Toolkit 5.0 and earlier, at ftp://download.nvidia.com/CUDAOpen64/.
- Upon request, by sending an e-mail to [email protected].
2. CUDA 10.2 Release Notes
2.1. General CUDA
- Added support for CUDA Virtual Memory Management APIs.
- CUDA 10.2 now includes libcu++, a parallel standard C++ library for GPUs.
- The following new operating systems are supported by CUDA. See the System Requirements section in the NVIDIA CUDA Installation Guide for Linux for a full list of supported operating systems. Note that support for RHEL 6.x is deprecated and will be dropped in the next release of CUDA.
- Fedora
- Red Hat Enterprise Linux (RHEL) 7.x and 8.x
- OpenSUSE 15.x
- SUSE SLES 12.4 and SLES 15.x
- Ubuntu 16.04.6 LTS and Ubuntu 18.04.3 LTS
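The Virtual Memory Management APIs mentioned above decouple virtual address reservation from physical memory allocation. The following is a rough sketch of the typical call sequence using the CUDA driver API as documented for this release; error handling is omitted and the flow is illustrative, not a definitive recipe.

```cuda
#include <cuda.h>

// Sketch: reserve a virtual address range, back it with physical memory,
// map it, and enable access. All error checking omitted for brevity.
void vmm_sketch(int device, size_t size) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;

    // Sizes must be a multiple of the allocation granularity.
    size_t granularity = 0;
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t padded = ((size + granularity - 1) / granularity) * granularity;

    CUdeviceptr ptr;
    cuMemAddressReserve(&ptr, padded, 0, 0, 0);   // reserve a VA range

    CUmemGenericAllocationHandle handle;
    cuMemCreate(&handle, padded, &prop, 0);       // allocate physical memory
    cuMemMap(ptr, padded, 0, handle, 0);          // map it into the range

    CUmemAccessDesc access = {};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(ptr, padded, &access, 1);      // enable read/write access

    // ... use ptr from kernels or memcpy calls ...

    cuMemUnmap(ptr, padded);
    cuMemRelease(handle);
    cuMemAddressFree(ptr, padded);
}
```

Because mapping is separate from reservation, the same reserved range can later be grown by mapping additional physical allocations at an offset, which is the main motivation for these APIs.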
2.3. CUDA Libraries
This release of the CUDA toolkit is packaged with libraries that deliver new and extended functionality, bug fixes, and performance improvements for single and multi-GPU environments.
Also in this release, the soname of the libraries has been modified to not include the minor toolkit version number. For example, the cuFFT library soname has changed from libcufft.so.10.1 to libcufft.so.10. This was done to facilitate future library updates that do not include API-breaking changes without the need to relink.
2.3.1. cuBLAS Library
- Improved the performance on some large and other GEMM sizes (mostly M*N > 100) due to increased internal workspace size.
2.3.2. cuSOLVER Library
- cusolverMgGetrf and cusolverMgGetrs have been added to the cusolverMg library to support multi-GPU LU factorization.
- A new Tensor Cores Accelerated Iterative Refinement Solver (TCAIRS) is introduced. This is a linear solver for AX = B, similar to the LAPACK Xgesv (or Xgetrf/Xgetrs) functions, but different in that it uses reduced precision internally for acceleration and then refines the solution to achieve the corresponding accuracy. This solver supports real and complex data types as well as single and multiple right-hand-side (RHS) systems of equations. Depending on the problem size and data types used, the observed speedup can reach over 5X. There are two types of APIs to access this solver:
- The basic user-friendly LAPACK-style APIs:
- cusolverDn<P1><P2>gesv_bufferSize
- cusolverDn<P1><P2>gesv
Where P1 is the final solution precision and P2 is the lowest precision used in the solver. For example, cusolverDnZKgesv_bufferSize will use tensor-core-accelerated half-precision compute while the final solution will be accurate to double precision.
- Added the following experimental expert APIs: cusolverDnIRSXgesv, cusolverDnIRSXgesv_bufferSize, and related APIs.
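The LAPACK-style entry points above are typically used as sketched below. This is an illustrative host-side sequence for the double/half (DHgesv) variant; consult the cuSOLVER documentation for the exact signatures, and note that allocation of the device matrices and error checking are assumed to happen elsewhere.

```cuda
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Sketch: solve A X = B in double precision with half-precision internal
// compute via the DHgesv variant (D = final precision, H = lowest internal
// precision). Device pointers dA, dB, dX, dIpiv, dInfo are assumed to be
// allocated already; error checks are omitted for brevity.
void tcairs_sketch(cusolverDnHandle_t handle, int n, int nrhs,
                   double *dA, int ldda, int *dIpiv,
                   double *dB, int lddb, double *dX, int lddx,
                   int *dInfo) {
    // Query workspace size, then allocate it on the device.
    size_t lwork_bytes = 0;
    cusolverDnDHgesv_bufferSize(handle, n, nrhs, dA, ldda, dIpiv,
                                dB, lddb, dX, lddx, NULL, &lwork_bytes);
    void *dWork = NULL;
    cudaMalloc(&dWork, lwork_bytes);

    // Factorize in reduced precision, then iteratively refine to double.
    int niters = 0;  // number of refinement iterations performed
    cusolverDnDHgesv(handle, n, nrhs, dA, ldda, dIpiv, dB, lddb,
                     dX, lddx, dWork, lwork_bytes, &niters, dInfo);

    cudaFree(dWork);
}
```

Other precision pairs (e.g., ZKgesv for complex double with half-precision compute, or DSgesv for double with single-precision compute) follow the same call pattern.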
2.3.4. CUDA Math Library
- Added two absolute value APIs for the half-precision __half and __half2 data types: __habs and __habs2.
- Improved performance and accuracy in the following math functions: tanhf, round, roundf, erf, erff, sinf, cosf, sincosf, tanf, sinpif, cospif, sincospif, j0f, j1f, y0f, y1f
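The new half-precision absolute value intrinsics can be used from device code as in the sketch below (the kernel and its launch configuration are illustrative, not part of the release):

```cuda
#include <cuda_fp16.h>

// Sketch: element-wise absolute value over a half2-packed array using the
// new __habs2 intrinsic; __habs is the scalar __half counterpart.
__global__ void habs_kernel(__half2 *data, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        data[i] = __habs2(data[i]);  // |x| applied to both packed halves
    }
}

// Typical launch over n2 packed elements:
//   habs_kernel<<<(n2 + 255) / 256, 256>>>(d_data, n2);
```

Operating on __half2 processes two values per thread, which is the usual idiom for half-precision throughput.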
2.4. Deprecated Features
The following features are deprecated in the current release of the CUDA software. The features still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.
- General CUDA
- Support for RHEL 6.x is deprecated with CUDA 10.1. It will be dropped in the next release of CUDA. Customers are encouraged to adopt RHEL 7.x to use new versions of CUDA.
- Microsoft Visual Studio versions 2010, 2012, and 2013 are now deprecated as host compilers for nvcc. Support for these compilers may be removed in a future release of CUDA.
- Support for the following compute capabilities is deprecated in CUDA 10.2. Note that support for these compute capabilities may be removed in a future release of CUDA.
- sm_3x (Kepler)
- sm_5x (Maxwell)
Support for Kepler sm_30 architecture-based products will be dropped starting with the next release of CUDA.
For more information on GPU products and compute capability, see this page.
- For WMMA operations with floating-point accumulators, the satf (saturate-to-finite-value) mode parameter is deprecated. Using it can lead to unexpected results. See http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#wmma-description for details.
- Support for Linux cluster packages is deprecated and may be dropped in a future release of CUDA.
- CUDA 10.2 (Toolkit and NVIDIA driver) is the last release to support macOS for developing and running CUDA applications. Support for macOS will not be available starting with the next release of CUDA.
- CUDA Libraries – General
- The nvGRAPH library is deprecated. The library will no longer be shipped in future releases of the CUDA toolkit.
- Support for the Kepler and Maxwell architectures (compute capability sm_35 through sm_52) is deprecated.
- CUDA Libraries – NPP
- NPP Compression Primitives (JPEG Encode/Decode) are being deprecated and will be removed in the next release. Users of these functions are encouraged to use the nvJPEG library.
- CUDA Libraries – nvJPEG
- The following APIs are deprecated in this release:
- nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseOne
- nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseTwo
- nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseThree
- nvjpegStatus_t NVJPEGAPI nvjpegDecodeBatchedPhaseOne
- nvjpegStatus_t NVJPEGAPI nvjpegDecodeBatchedPhaseTwo
- nvjpegStatus_t NVJPEGAPI nvjpegDecodeBatchedPhaseThree
The functionality provided by these APIs will be offered by new functions in the next release.
- CUDA Libraries – cuSOLVER
- The 32-bit API of the cusolverMg multi-GPU library will be removed in the next release. Instead, a 64-bit API will be adopted for the following:
- Symmetric eigensolver routine cusolverMgSyevd()
- LU factorization and solver routines cusolverMgGetrf and cusolverMgGetrs
- The expert interface cusolverDnIRSXgesv of the TCAIRS solver and its helper functions, which are offered as experimental expert APIs in this release, will undergo minor changes. The basic user-friendly API will remain the same. In summary:
- The expert API will remove input_data_type from its arguments, since it is part of the Params structure.
- The cusolverDnIRSInfosXXXX helper functions will no longer need Params in their arguments.
- All instances of cudaDataType will be replaced by cusolverPrecType_t.
2.5. Resolved Issues
2.5.1. CUDA Compilers
- Fixed an issue where ptxas could, in some cases, optimize arithmetic shifts incorrectly.
- Fixed a crash during compilation when using -DCONSTEXPR=constexpr with the --expt-relaxed-constexpr option.
- Added documentation for the --time flag for nvcc. This flag can be used to measure the time taken by nvcc and its internal sub-components. See nvcc --help for details.
- Fixed an issue where ptxas crashed when assembling a PTX file with DWARF debug info generated by clang.
2.5.2. CUDA Libraries
The following issues have been resolved across the CUDA Libraries.
- CUDA Libraries – NPP
- Fixed a race condition in NPP when the user provided a custom-created CUDA stream (default or non-blocking).
- Fixed an issue where incorrect values were returned by NPP Histogram helper functions. These values are used for defining the size of side buffers used by the NPP Histogram main functions.
- NPP does not support non-blocking streams on Windows for devices working in WDDM mode.
- CUDA Libraries – nvJPEG
- Fixed an issue with the NVJPEG_BACKEND_HYBRID backend when restart markers are enabled.
- CUDA Libraries – cuSOLVER
- Resolved a conflict of symbols between liblapack_static.a and libf2c.
- Resolved missing GKlib/string.o in libmetis_static.a.
- CUDA Libraries – cuBLAS
- Resolved an issue where CUDA Graph capture with cuBLAS routines on multiple concurrent streams could cause hangs or data corruption in some cases.
- Resolved an issue where strided batched GEMM routines could cause misaligned read errors.
- CUDA Libraries – cuFFT
- Added missing documentation for the following functions:
- cufftXtExecDescriptorR2C
- cufftXtExecDescriptorC2R
- cufftXtExecDescriptorZ2D
- cufftXtExecDescriptorD2Z
- Resolved an issue where multi-GPU supported functionality omitted the in-place restriction for all FFT plan types.
- Refer to this page for the temporary restriction on C2R/Z2D plans.
- CUDA Libraries – cuRAND
- Starting with CUDA 10, the ordering of random numbers returned by the MTGP32 and MRG32k3a generators is no longer the same as in previous releases, despite this being guaranteed by the documentation for the CURAND_ORDERING_PSEUDO_DEFAULT setting. This issue will be addressed in the next release by providing a new non-default option that returns the same ordering as previous releases; the default option will continue to provide the best performance.