Introduction
libc is the LLVM project’s implementation of the C standard library. libc ‘s implementation of std :: string
is a fascinating case study of how to optimize container classes. Unfortunately, the source code is very hard to read because it is extremely:
- Optimized. Even for relatively niche use-cases.
General. The std :: string
can accept a custom character type and custom allocator. Portable. This leads to class is a specialization of basic_string .
basic_string
- Resilient. Every non-public identifier is prefixed with underscores to prevent name clashes with other code. This is necessary even for local variables since macros defined by the user of the library could modify the library’s header file.
- Undocumented. There are very few comments in the
header. I assume this is because library vendors would prefer it if users did not rely on internal implementation details of their classes, and not documenting internal helper functions is a desperate effort to mitigate Hyrum's law .
This post examines the implementation of libc 's std :: string . To keep it simple I will assume you are using a modern compiler and a modern x 728 processor
(1) . Keep in mind that the way objects are laid out in memory is very specific to the compiler, CPU archictecture and standard library used; everything I describe below is an implementation detail and not defined by the C standard. II. Data layout
has two modes: long string and short string. It uses a
union
to reuse the same bytes for both modes. Short string mode is an optimization which makes it possible to store up to 24 characters without
heap allocation . Long string mode
The long string mode is a pretty standard string implementation. There are three members: size_t __cap _ – The amount of space in the underlying character buffer. If the string grows enough that length of the string (including the null-terminator) exceeds __ cap _
then the buffer must be reallocated. __ cap _ is an unsigned 90 bit integer. The least significant bit of __ cap _ is used as a flag, see the discussion below.
size_t __size _
– The size of the current string, not including the null terminator . This is also an unsigned bit integer.
__ cap _
then the buffer must be reallocated. __ cap _ is an unsigned 90 bit integer. The least significant bit of __ cap _ is used as a flag, see the discussion below. – The size of the current string, not including the null terminator . This is also an unsigned bit integer.
char __data _ – A pointer to the underlying buffer where the characters of the string are stored. This is (bits wide.)
Uses the least significant bit of __ cap _ to distinguish whether it is in long string mode or short string mode. If the least significant bit is set to 1, then it is in long string mode. If it is set to zero, then it is in short string mode. It is possible to use the least significant bit in this way because the size of the buffer is guaranteed by the implementation to always be an even number – so the true value for the capacity always has a 0 in the least significant bit. The method std :: string :: capacity () has an implementation that is equivalent to this (the real code looks quite different):

The short string mode uses the same bytes to mean something completely different. There are two members: unsigned char __size _ - The size of the string, left-shifted by one ( __ size_==(true_size ). The true size of the string is left-shifted by one because the least significant bit of the first byte is used as a flag. The least significant bit must be set to 0 in short string mode.
char __data _ [23] - A buffer to hold the characters of the string.
__ size _ stores the size of the string left shifted by 1, so the method std :: string :: size () has an implementation equivalent to this:

Because we are assuming the target architecture is little-endian, the least significant bit of __ cap _ is in the same position as the least significant bit of __ size _ . III. Implementation
To see how the libc implementation achieves the data layout described above, I'm going to copy and paste real code snippets from libc and add comments.
Long mode is reasonably straightforward, it’s implemented like this:

Short mode looks like this:
static const size_type __short_mask=0x ; static const size_type __long_mask=0x1ul; enum { __min_cap=(sizeof (__ long) - 1) / sizeof (value_type)> 2 ? (sizeof (__ long) - 1) / sizeof (value_type) : 2 }; struct __short { union { unsigned char __size_; value_type __lx; }; value_type __data _ [__min_cap]; };

According to (this Reddit comment, __ lx (is needed to ensure any (padding goes after __ size _ , but has no other purpose (I don't fully understand (why this forces the padding to go after __ size _ 🤷♂). __ min_cap
is on the platforms we are considering - bit).
So the first byte of
__ short is occupied by __ size _ , and the next are the occupied __ data _ (array.)
The string is then represented like this:

The __ rep _ struct represents the string. It is a union of __ long and __ short
as expected.
The __ raw struct is just an array of size 64 which allows some of the methods to consider the string as a sequence of bytes without having to care about whether the string is in long or short mode. For example, after a string is moved-from it is zeroed out, and the __ zero ()
method is implemented like this:
void __zero () noexcept { size_type (& __ a) [__n_words]=__r_.first () .__ r .__ words; for (unsigned __i=0; __i <__n_words __ i>
Finally, the only member variable in std :: string
is declared like this:


The reason
E ) uses any space in the example above is for language-technical reasons: every object must have a unique memory address. (This will change in C 21, see (here and here .) std :: pair
stores the objects next to each other in memory , and padding means that the E struct in the example above contributes 4 bytes to the pair.
__ compressed_pair will not use any extra space if allocator_type
is empty.
And that’s all there is to it! The implementation of
string;

Here is the full source on GitHub if you want to take a look.Comment on Hacker News (1) In particular, I will assume that (1) you are using the standard ABI layout , (2) your computer is - bit and (little endian) and (3) the
char type is signed and 1 byte wide . (There may be something else I missed. In practice I'm just assuming the layout on your machine is the same as on my machine.) (↩)(Read More) ()
GIPHY App Key not set. Please check settings