Faster integer formatting - James Anhalt (jeaiii)’s algorithm


This post is about an ingenious algorithm for printing integers into decimal strings. It sounds like an extremely simple problem, but it is in fact more complicated than one might imagine. Let us define more precisely what we want to do: we take an integer of a specific bit-width and a byte buffer, convert the input integer into a string consisting of its decimal digits, and write that string into the given buffer. For simplicity, we will assume that the integer is unsigned and 32 bits wide. So, we want to implement the following function written in C++:

char* itoa(std::uint32_t n, char* buffer) {
  // Convert n into decimal digit string and write it into buffer.
  // Returns the position right next to the last character written.
}

There are numerous algorithms for doing this, and I will dig into a clever one invented by James Anhalt (jeaiii), which seems to be the fastest known algorithm at the time of writing this post.

Disclaimer

I actually have not looked (and will not look) carefully at his code (MACROS, oh my god 😱) and have no idea what precisely was the method of analysis he had in mind. All I write here is purely my own analysis inspired by reading these lines of comment he wrote:

// 1. form a 7.32 bit fixed point numner: t = u * 2^32 / 10^log10(u)
// 2. convert 2 digits at a time [00, 99] by lookup from the integer portion of the fixed point number (the upper 32 bits)
// 3. multiply the fractional part (the lower 32 bits) by 100 and repeat until 1 or 0 digits left
// 4. if 1 digit left mulptipy by 10 and convert it (add '0')
//
// N == log10(u)
// finding N by binary search for a 32bit number N = [0, 9]
// this is fast and selected such 1 & 2 digit numbers are faster (2 branches) than long numbers (4 branches) for the 10 cases
//          
//      /\____________
//     /  \______     \______
//    /\   \     \     \     \
//   0  1  /\    /\    /\    /\
//        2  3  4  5  6  7  8  9

So it is totally possible that what is written in this post has in fact nothing to do with what he actually did; however, I strongly believe that what I ended up with is more or less equivalent to what his code is doing, modulo some minor differences.

Naïve implementations

The very first problem that anyone who tries to implement such a function will face is that we want to write digits from left to right, but we naturally compute them from right to left. Hence, unless we know the number of decimal digits in the input upfront, we do not know the exact position in the buffer where we can write our digits. There are several different strategies to cope with this issue. Probably the simplest (and quite effective) one is to print the digits from right to left into a temporary buffer, and after we get all digits of $n$, copy the temporary buffer into the destination buffer. At the point of obtaining the left-most decimal digit of $n$, we also know the length of the string, so we know exactly which bytes to copy.

With this strategy, we can think of the following implementation:

char* itoa_naive(std::uint32_t n, char* buffer) {
  char temp[10];
  char* ptr = temp + sizeof(temp) - 1;
  while (n >= 10) {
    *ptr = char('0' + (n % 10));
    n /= 10;
    --ptr;
  }
  *ptr = char('0' + n);
  auto length = temp + sizeof(temp) - ptr;
  std::memcpy(buffer, ptr, length);
  return buffer + length;
}

(Demo: https://godbolt.org/z/7G7ecs7r4)

The size of the temporary buffer is set to 10, because that’s the maximum possible decimal length for std::uint32_t.

The mismatch between the order of computation and the order of desired output is indeed quite a nasty problem, but let us forget about this issue for a while, because there is something more interesting to discuss here.

There are several performance issues in this code, and one of them is the division by 10. Of course, since the divisor is a known constant, our lovely compiler will automatically convert the division into a multiply-and-shift (see this classic paper for example), so we do not need to worry about the dreaded idiv instruction, which is infamous for its poor performance. However, for simple enough algorithms like this, multiplication can still be a performance killer, so it is reasonable to expect that we will get better performance by reducing the number of multiplications.
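For a 32-bit unsigned operand, for instance, the transformation replaces n / 10 with a 64-bit multiplication by a precomputed reciprocal followed by a shift; a hand-written sketch of what the compiler effectively emits (not code you would normally write yourself) is:

#include <cstdint>

// 3435973837 = ceil(2^35 / 10). For every 32-bit unsigned n,
// floor(n * 3435973837 / 2^35) == n / 10, so the division becomes
// a single multiplication plus a shift.
std::uint32_t div10(std::uint32_t n) {
  return std::uint32_t((n * std::uint64_t(3435973837)) >> 35);
}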

Regarding this, Andrei Alexandrescu popularized the idea of generating two digits per division, thus halving the required number of multiplications. That is, we first prepare a lookup table for converting 2-digit integers into strings. Then, in the loop, we perform divisions by 100 rather than 10, to get 2 digits per division. The following code illustrates this:

static constexpr char radix_100_table[] = {
    '0', '0', '0', '1', '0', '2', '0', '3', '0', '4',
    '0', '5', '0', '6', '0', '7', '0', '8', '0', '9',
    '1', '0', '1', '1', '1', '2', '1', '3', '1', '4',
    '1', '5', '1', '6', '1', '7', '1', '8', '1', '9',
    '2', '0', '2', '1', '2', '2', '2', '3', '2', '4',
    '2', '5', '2', '6', '2', '7', '2', '8', '2', '9',
    '3', '0', '3', '1', '3', '2', '3', '3', '3', '4',
    '3', '5', '3', '6', '3', '7', '3', '8', '3', '9',
    '4', '0', '4', '1', '4', '2', '4', '3', '4', '4',
    '4', '5', '4', '6', '4', '7', '4', '8', '4', '9',
    '5', '0', '5', '1', '5', '2', '5', '3', '5', '4',
    '5', '5', '5', '6', '5', '7', '5', '8', '5', '9',
    '6', '0', '6', '1', '6', '2', '6', '3', '6', '4',
    '6', '5', '6', '6', '6', '7', '6', '8', '6', '9',
    '7', '0', '7', '1', '7', '2', '7', '3', '7', '4',
    '7', '5', '7', '6', '7', '7', '7', '8', '7', '9',
    '8', '0', '8', '1', '8', '2', '8', '3', '8', '4',
    '8', '5', '8', '6', '8', '7', '8', '8', '8', '9',
    '9', '0', '9', '1', '9', '2', '9', '3', '9', '4',
    '9', '5', '9', '6', '9', '7', '9', '8', '9', '9'
};

char* itoa_two_digits_per_div(std::uint32_t n, char* buffer) {
  char temp[8];
  char* ptr = temp + sizeof(temp);
  while (n >= 100) {
    ptr -= 2;
    std::memcpy(ptr, radix_100_table + (n % 100) * 2, 2);
    n /= 100;
  }
  if (n >= 10) {
    std::memcpy(buffer, radix_100_table + n * 2, 2);
    buffer += 2;
  }
  else {
    buffer[0] = char('0' + n);
    buffer += 1;
  }
  auto remaining_length = temp + sizeof(temp) - ptr;
  std::memcpy(buffer, ptr, remaining_length);
  return buffer + remaining_length;
}

(Demo: https://godbolt.org/z/vnMTf7s9r)

The core idea of James Anhalt’s algorithm

So, with the above trick of grouping 2 digits, how many multiplications do we need for integers of, say, 10 decimal digits? Note that we need to compute both the quotient and the remainder, and as far as I know at least 2 multiplications must be performed to get both of them; there is no way to do it with just one multiplication. Hence, for each pair of 2 digits we need to perform 2 multiplications, so for integers with 10 digits we need 8 multiplications, since we need 4 divisions to separate the 5 pairs of 2 digits.

Quite surprisingly, we can in fact almost halve that number again, down to 5, which (I believe) is the core benefit of James Anhalt's algorithm. The crux of the idea can be summarized as follows: given $n$, we find an integer $y$ satisfying

$$n = \left\lfloor \frac{10^{k} y}{2^{D}} \right\rfloor \tag{1}$$

for some nonnegative integer constants $k$ and $D$.

This transformation is the real deal, because once we have such a $y$, we can extract 2 digits of $n$ per multiplication. To see how, recall that in general

$$\left\lfloor \frac{a}{bc} \right\rfloor = \left\lfloor \frac{\lfloor a/b \rfloor}{c} \right\rfloor \tag{2}$$

holds for any positive integers $a, b, c$; that is, dividing $a$ by $bc$ is equivalent to dividing $a$ by $b$ first and then by $c$. Therefore, for any $l \leq k$, we have

$$\left\lfloor \frac{10^{k-l} y}{2^{D}} \right\rfloor = \left\lfloor \frac{n}{10^{l}} \right\rfloor. \tag{3}$$

So, for example, let $k = l = 8$; then

$$\left\lfloor \frac{n}{10^{8}} \right\rfloor = \left\lfloor \frac{y}{2^{D}} \right\rfloor. \tag{4}$$

Assuming that $n$ is of 10 digits, the left-hand side is precisely the first 2 digits of $n$, while the right-hand side is just the right-shift of $y$ by $D$ bits.

On the other hand, the next 2 digits of $n$ can be computed as

$$\left(\left\lfloor \frac{n}{10^{6}} \right\rfloor \bmod 10^{2}\right) = \left(\left\lfloor \frac{10^{2} y}{2^{D}} \right\rfloor \bmod 10^{2}\right). \tag{5}$$

Note that, if we write $y$ as $y = 2^{D} q + r$ where $q$ is the quotient and $r$ is the remainder, then

$$10^{2} y = 2^{D} (10^{2} q) + 10^{2} r, \tag{6}$$

so

$$\left(\left\lfloor \frac{10^{2} y}{2^{D}} \right\rfloor \bmod 10^{2}\right) = \left(\left(10^{2} q + \left\lfloor \frac{10^{2} r}{2^{D}} \right\rfloor\right) \bmod 10^{2}\right) = \left(\left\lfloor \frac{10^{2} r}{2^{D}} \right\rfloor \bmod 10^{2}\right). \tag{7}$$

Also, since $r < 2^{D}$, $\left\lfloor \frac{10^{2} r}{2^{D}} \right\rfloor$ is strictly less than $10^{2}$, so we get

$$\left(\left\lfloor \frac{n}{10^{6}} \right\rfloor \bmod 10^{2}\right) = \left\lfloor \frac{10^{2} r}{2^{D}} \right\rfloor. \tag{8}$$

This means that, in order to compute the next 2 digits of $n$, we first obtain the remainder of $y$ divided by $2^{D}$, multiply it by $10^{2}$, and then obtain the quotient of the result divided by $2^{D}$. In other words, we just need to take the lowest $D$ bits, multiply by $10^{2}$, and then right-shift the result by $D$ bits. As you can see, we only need 1 multiplication here.

This trend continues: to compute the next 2 digits of $n$, we again only need to take the lowest $D$ bits, multiply by $10^{2}$, and then right-shift the result by $D$ bits, so in particular we only need 1 multiplication for generating each pair of 2 digits. Indeed, it can be inductively shown that if we write $y_{0} = y$ and $y_{i+1} = 10^{2} (y_{i} \bmod 2^{D})$, then

$$\left(\left\lfloor \frac{n}{10^{k-2i}} \right\rfloor \bmod 10^{2}\right) = \left\lfloor \frac{y_{i}}{2^{D}} \right\rfloor \tag{9}$$

holds for each $i = 0, 1, 2, 3, 4$.
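To make this concrete, here is a tiny worked example with made-up parameters (not the values used later; they are just small enough to check by hand). Take $D = 7$, $k = 2$ and $n = 4321$; then $y = 5531$ satisfies $n = \left\lfloor 10^{2} y / 2^{7} \right\rfloor$, and

$$\left\lfloor \frac{y}{2^{7}} \right\rfloor = \left\lfloor \frac{5531}{128} \right\rfloor = 43, \qquad y_{1} = 10^{2} (5531 \bmod 2^{7}) = 100 \cdot 27 = 2700, \qquad \left\lfloor \frac{y_{1}}{2^{7}} \right\rfloor = \left\lfloor \frac{2700}{128} \right\rfloor = 21,$$

recovering the digit pairs 43 and 21 of $n = 4321$, with a single multiplication for the second pair.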

How to compute y?

Alright, so we have seen that having $y$ satisfying

$$n = \left\lfloor \frac{10^{k} y}{2^{D}} \right\rfloor \tag{10}$$

is pretty useful for our purpose. The next question is how to find such a $y$. Note that the above equality is equivalent to the inequality

$$n \leq \frac{10^{k} y}{2^{D}} < n + 1, \tag{11}$$

or equivalently (since $y$ is an integer),

$$\left\lceil \frac{2^{D} n}{10^{k}} \right\rceil \leq y < \left\lceil \frac{2^{D} (n + 1)}{10^{k}} \right\rceil. \tag{$\star$}$$

Assuming $2^{D} \geq 10^{k}$, $y = \left\lceil \frac{2^{D} n}{10^{k}} \right\rceil$ will obviously do the job, but computing $\left\lceil \frac{2^{D} n}{10^{k}} \right\rceil$ might be nontrivial. Hence, let us first try something easier:

$$y = n \left\lceil \frac{2^{D}}{10^{k}} \right\rceil. \tag{12}$$

Here, $\left\lceil \frac{2^{D}}{10^{k}} \right\rceil$ is just a constant and does not depend on $n$, so computing $y$ is just a single integer multiplication.

With $k = 8$ and $n < 2^{32}$, we can show that this $y$ always satisfies $(\star)$ if we take $D \geq 57$; we just need to find the smallest $D$ satisfying $(2^{32} - 1)\left(\left\lceil \frac{2^{D}}{10^{k}} \right\rceil - \frac{2^{D}}{10^{k}}\right) < \frac{2^{D}}{10^{k}}$. Hence, choosing $D = 57$, we get the magic number

$$\left\lceil \frac{2^{D}}{10^{k}} \right\rceil = \left\lceil \frac{2^{57}}{10^{8}} \right\rceil = 1441151881. \tag{13}$$
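As a quick sanity check (my own arithmetic, not from the original post): for $D = 57$ we have

$$\left\lceil \frac{2^{57}}{10^{8}} \right\rceil - \frac{2^{57}}{10^{8}} = 1441151881 - 1441151880.75855872 = 0.24144128,$$

so

$$(2^{32} - 1) \cdot 0.24144128 \approx 1.04 \times 10^{9} < \frac{2^{57}}{10^{8}} \approx 1.44 \times 10^{9},$$

whereas for $D = 56$ the gap $\left\lceil \frac{2^{56}}{10^{8}} \right\rceil - \frac{2^{56}}{10^{8}} = 0.62072064$ is too large, since $(2^{32} - 1) \cdot 0.62072064 \approx 2.67 \times 10^{9}$ exceeds $\frac{2^{56}}{10^{8}} \approx 7.21 \times 10^{8}$.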

This leads us to the following code, which always prints 10 digits of the given $n$, with possible leading zeros:

char* itoa_always_10_digits(std::uint32_t n, char* buffer) {
    constexpr auto mask = (std::uint64_t(1) << 57) - 1;
    auto y = n * std::uint64_t(1441151881);
    std::memcpy(buffer + 0, radix_100_table + int(y >> 57) * 2, 2);
    y &= mask;
    y *= 100;
    std::memcpy(buffer + 2, radix_100_table + int(y >> 57) * 2, 2);
    y &= mask;
    y *= 100;
    std::memcpy(buffer + 4, radix_100_table + int(y >> 57) * 2, 2);
    y &= mask;
    y *= 100;
    std::memcpy(buffer + 6, radix_100_table + int(y >> 57) * 2, 2);
    y &= mask;
    y *= 100;
    std::memcpy(buffer + 8, radix_100_table + int(y >> 57) * 2, 2);

    return buffer + 10;
}

(Demo: https://godbolt.org/z/9c4Mb76hc)

The constant 1441151881 is only 31 bits and the multiplications are performed in 64 bits, so there is no overflow: the first product is less than $2^{32} \cdot 2^{31} = 2^{63}$, and each subsequent product is less than $2^{57} \cdot 10^{2} < 2^{64}$.

Consideration of variable length

One can easily modify the above algorithm to omit the leading decimal zeros and align the output to the left-most position of the buffer. However, the resulting algorithm is not very nice; although it never performs more than 5 multiplications, it always performs exactly 5, even for short numbers like $n = 15$.
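For example, such a modification might look like the following sketch (my own code, not from the original post): generate all 10 digits into a scratch buffer and copy from the first significant digit.

char* itoa_skip_leading_zeros(std::uint32_t n, char* buffer) {
  // Always generate 10 digits (with leading zeros), then copy only the
  // significant part. Correct, but it always performs all 5 multiplications.
  char temp[10];
  itoa_always_10_digits(n, temp);
  int first_nonzero = 0;
  while (first_nonzero < 9 && temp[first_nonzero] == '0') {
    ++first_nonzero;
  }
  auto length = 10 - first_nonzero;
  std::memcpy(buffer, temp + first_nonzero, length);
  return buffer + length;
}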

What James Anhalt did about this is to completely separate the code paths for all possible lengths of $n$. I mean, something like this:

//      /\____________
//     /  \______     \______
//    /\   \     \     \     \
//   0  1  /\    /\    /\    /\
//        2  3  4  5  6  7  8  9
char* itoa_var_length(std::uint32_t n, char* buffer) {
  if (n < 100) {
    if (n < 10) {
      // 1 digit.
    }
    else {
      // 2 digits.
    }
  }
  else if (n < 100'0000) {
    if (n < 1'0000) {
      if (n < 1'000) {
        // 3 digits.
      }
      else {
        // 4 digits.
      }
    }
    else {
      if (n < 10'0000) {
        // 5 digits.
      }
      else {
        // 6 digits.
      }
    }
  }
  else if (n < 1'0000'0000) {
    if (n < 1000'0000) {
      // 7 digits.
    }
    else {
      // 8 digits.
    }
  }
  else if (n < 10'0000'0000) {
    // 9 digits.
  }
  else {
    // 10 digits.
  }
}

It sounds pretty crazy, but it does the job quite well.

Now, recall that our main job is to find $y$ satisfying

$$n = \left\lfloor \frac{10^{k} y}{2^{D}} \right\rfloor. \tag{14}$$

Note that the choice $k = 8$ was to make sure that $\left\lfloor \frac{y}{2^{D}} \right\rfloor$ is the first 2 digits, given that $n$ is of 10 digits. Since $n$ is not always of 10 digits, we may take different $k$'s for each branch. For example, when $n$ is of 3 digits, we may want to choose $k = 2$. Then, since $n \leq 999$ in this case, the choice

$$y = n \left\lceil \frac{2^{D}}{10^{k}} \right\rceil \tag{15}$$

is valid for any $D \geq 12$, as $999 \left(\left\lceil \frac{2^{12}}{10^{2}} \right\rceil - \frac{2^{12}}{10^{2}}\right) < \frac{2^{12}}{10^{2}}$ holds. In this case, it is in fact better to choose $D = 32$ rather than $D = 12$, because on platforms such as x86, obtaining the lower half of a 64-bit integer is basically a no-op.

As a result, the first digit of $n$ can be computed as

$$\left\lfloor \frac{n}{10^{2}} \right\rfloor = \left\lfloor \frac{y}{2^{32}} \right\rfloor = \left\lfloor \frac{\left\lceil 2^{32}/10^{2} \right\rceil n}{2^{32}} \right\rfloor = \left\lfloor \frac{42949673 \cdot n}{2^{32}} \right\rfloor, \tag{16}$$

and the remaining 2 digits can be computed as

$$\left(n \bmod 10^{2}\right) = \left\lfloor \frac{10^{2} \left(y \bmod 2^{32}\right)}{2^{32}} \right\rfloor. \tag{17}$$

Similarly, we can choose $D = 32$ (with the above $y$ with different $k$'s) for $n$'s of up to 6 digits (so $k = 0, 2, 4$), but for larger $n$ our simplistic analysis does not allow us to do so. For $n$'s with 7 or 8 digits, we set $k = 6$, and it can be shown that $D = 47$ does the job. For $n$'s with 9 or 10 digits, we set $k = 8$, and as we have already seen, $D = 57$ does the job. With these choices of parameters, we get the following code:

//      /\____________
//     /  \______     \______
//    /\   \     \     \     \
//   0  1  /\    /\    /\    /\
//        2  3  4  5  6  7  8  9
char* itoa_var_length(std::uint32_t n, char* buffer) {
  if (n < 100) {
    if (n < 10) {
      // 1 digit.
      buffer[0] = char('0' + n);
      return buffer + 1;
    }
    else {
      // 2 digits.
      std::memcpy(buffer, radix_100_table + n * 2, 2);
      return buffer + 2;
    }
  }
  else if (n < 100'0000) {
    if (n < 1'0000) {
      // 3 or 4 digits.
      // 42949673 = ceil(2^32 / 10^2)
      auto y = n * std::uint64_t(42949673);
      if (n < 1'000) {
        // 3 digits.
        buffer[0] = char('0' + int(y >> 32));
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 1, radix_100_table + int(y >> 32) * 2, 2);
        return buffer + 3;
      }
      else {
        // 4 digits.
        std::memcpy(buffer + 0, radix_100_table + int(y >> 32) * 2, 2);
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 2, radix_100_table + int(y >> 32) * 2, 2);
        return buffer + 4;
      }
    }
    else {
      // 5 or 6 digits.
      // 429497 = ceil(2^32 / 10^4)
      auto y = n * std::uint64_t(429497);
      if (n < 10'0000) {
        // 5 digits.
        buffer[0] = char('0' + int(y >> 32));
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 1, radix_100_table + int(y >> 32) * 2, 2);
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 3, radix_100_table + int(y >> 32) * 2, 2);
        return buffer + 5;
      }
      else {
        // 6 digits.
        std::memcpy(buffer + 0, radix_100_table + int(y >> 32) * 2, 2);
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 2, radix_100_table + int(y >> 32) * 2, 2);
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 4, radix_100_table + int(y >> 32) * 2, 2);
        return buffer + 6;
      }
    }
  }
  else if (n < 1'0000'0000) {
    // 7 or 8 digits.
    // 140737489 = ceil(2^47 / 10^6)
    auto y = n * std::uint64_t(140737489);
    constexpr auto mask = (std::uint64_t(1) << 47) - 1;
    if (n < 1000'0000) {
      // 7 digits.
      buffer[0] = char('0' + int(y >> 47));
      y = (y & mask) * 100;
      std::memcpy(buffer + 1, radix_100_table + int(y >> 47) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 3, radix_100_table + int(y >> 47) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 5, radix_100_table + int(y >> 47) * 2, 2);
      return buffer + 7;
    }
    else {
      // 8 digits.
      std::memcpy(buffer + 0, radix_100_table + int(y >> 47) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 2, radix_100_table + int(y >> 47) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 4, radix_100_table + int(y >> 47) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 6, radix_100_table + int(y >> 47) * 2, 2);
      return buffer + 8;
    }
  }
  else {
    // 9 or 10 digits.
    // 1441151881 = ceil(2^57 / 10^8)
    constexpr auto mask = (std::uint64_t(1) << 57) - 1;
    auto y = n * std::uint64_t(1441151881);
    if (n < 10'0000'0000) {
      // 9 digits.
      buffer[0] = char('0' + int(y >> 57));
      y = (y & mask) * 100;
      std::memcpy(buffer + 1, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 3, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 5, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 7, radix_100_table + int(y >> 57) * 2, 2);
      return buffer + 9;

    }
    else {
      // 10 digits.
      std::memcpy(buffer + 0, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 2, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 4, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 6, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 8, radix_100_table + int(y >> 57) * 2, 2);
      return buffer + 10;
    }
  }
}

(Demo: https://godbolt.org/z/froGhEn3s)

Note: The paths for the $(2k-1)$-digit case and the $2k$-digit case share a lot of code, so one might try to merge the code for printing the $(2k-2)$ remaining digits, leaving only the code for printing the leading 1 or 2 digits in separate branches. However, it seems that such a refactoring makes the code perform worse, probably because the number of additions performed increases. Nevertheless, it is also a viable option, especially with regard to code size.

Better choices for y

The above code is pretty good for $n$'s of up to 6 digits, but not so much for longer $n$'s, as we have to perform masking in addition to the multiplication and shift for each pair of 2 digits. This is due to $D$ not being equal to 32, so it will be beneficial if we can choose $D = 32$ even for $n$'s with more than 6 digits. And it turns out that we can.

The reason we had to choose $D > 32$ was our poor choice of $y$:

$$y = n \left\lceil \frac{2^{D}}{10^{k}} \right\rceil. \tag{18}$$

Recall that we do not need to choose $y$ like this; all we need to ensure is that $y$ satisfies the inequality

$$\left\lceil \frac{2^{D} n}{10^{k}} \right\rceil \leq y < \left\lceil \frac{2^{D} (n + 1)}{10^{k}} \right\rceil. \tag{$\star$}$$

Slightly generalizing what we have done, let us suppose that we want to obtain $y$ by computing

$$y = \left\lfloor \frac{n m}{2^{L}} \right\rfloor \tag{19}$$

for some positive integer constants $m$ and $L$. Then the inequality $(\star)$ can be rewritten as

$$\frac{1}{n} \left\lceil \frac{2^{D} n}{10^{k}} \right\rceil \leq \frac{m}{2^{L}} < \frac{1}{n} \left\lceil \frac{2^{D} (n + 1)}{10^{k}} \right\rceil. \tag{$\star\star$}$$

At the time of writing this post, I am not quite sure if there is an elegant way to obtain the precise admissible range of $m$ and $L$ for the above inequality over any given range of $n$, but a reasonable guess is that

$$m = \left\lceil \frac{2^{D+L}}{10^{k}} \right\rceil + 1 \tag{20}$$

will often do the job. Indeed, in this case we have

$$\frac{m n}{2^{L}} \geq \frac{2^{D} n}{10^{k}} + \frac{n}{2^{L}}, \tag{21}$$

so the left-hand side of $(\star\star)$ is always satisfied if

$$2^{L} \leq n \tag{22}$$

holds for all $n$ in the range. On the other hand, we have

$$\frac{m n}{2^{L}} < \frac{2^{D} n}{10^{k}} + \frac{n}{2^{L-1}}, \tag{23}$$

so the right-hand side of $(\star\star)$ is always satisfied if

$$\frac{n}{2^{L-1}} \leq \frac{2^{D}}{10^{k}}, \tag{24}$$

or equivalently,

$$10^{k} n \leq 2^{D-1} 2^{L} \tag{25}$$

holds for all $n$ in the range.

Thus, for example, when $n$ is of 7 or 8 digits (so $n \in [10^{6}, 10^{8} - 1]$), $k = 6$, and $D = 32$, it is enough to have

$$10^{6} (10^{8} - 1) \leq 2^{31} \cdot 2^{L}, \qquad 2^{L} \leq 10^{6}, \tag{26}$$

which is equivalent to

$$16 \leq L \leq 19. \tag{27}$$

Hence, we take $L = 16$ and accordingly $m = 281474978$. Of course, since $m$ is even, we can equivalently take $L = 15$ and $m = 140737489$ as well, so this method of analysis is clearly far from tight.

(In fact, it can be exhaustively verified that the left-hand side of $(\star\star)$ is maximized when $n = 1000795$, while the right-hand side is minimized when $n = 10^{8} - 1$, which together yield the inequality

$$\frac{4298381796}{1000795} \leq \frac{m}{2^{L}} < \frac{429496729600}{99999999}. \tag{28}$$

The minimum $L$ allowing an integer solution for $m$ to the above inequality is $L = 15$, and in this case $m = 140737489$ is the unique solution.)
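For the skeptical reader, here is a small brute-force checker (my own code, not from the original post) that verifies the claim for $L = 15$ and $m = 140737489$ by testing every 7- and 8-digit $n$:

#include <cstdint>
#include <iostream>

int main() {
  // Verify that y = floor(n * m / 2^L) satisfies n = floor(10^6 * y / 2^32)
  // for every n in [10^6, 10^8 - 1], with m = 140737489 and L = 15.
  constexpr std::uint64_t m = 140737489;
  constexpr int L = 15;
  for (std::uint64_t n = 1000000; n <= 99999999; ++n) {
    auto y = (n * m) >> L;
    if ((y * 1000000) >> 32 != n) {
      std::cout << "failed at n = " << n << '\n';
      return 1;
    }
  }
  std::cout << "verified for all 7- and 8-digit n\n";
}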

When $n$ is of 9 or 10 digits, a similar analysis does not give the best result; especially for 10 digits, it fails to give any admissible choice of $L$ and $m$. Nevertheless, it can be exhaustively verified that, when $k = 8$ and $D = 32$, if we set $L = 25$ and

$$m = \left\lceil \frac{2^{D+L}}{10^{k}} \right\rceil + 1 = 1441151882, \tag{29}$$

then

$$n = \left\lfloor \frac{10^{k} \left\lfloor n m / 2^{L} \right\rfloor}{2^{D}} \right\rfloor \tag{30}$$

holds for all $n \in [10^{8}, 10^{9} - 1]$, and similarly, if we set $L = 25$ and

$$m = \left\lceil \frac{2^{D+L}}{10^{k}} \right\rceil = 1441151881, \tag{31}$$

then

$$n = \left\lfloor \frac{10^{k} \left\lfloor n m / 2^{L} \right\rfloor}{2^{D}} \right\rfloor \tag{32}$$

holds for all $n \in [10^{9}, 2^{32} - 1]$, so we can still take

$$y = \left\lfloor \frac{n m}{2^{L}} \right\rfloor \tag{33}$$

with the above $L$ and $m$.

(In fact, applying what we have done for 7 or 8 digits to the case of 9 digits gives $L = 26$ and $m = 2882303763$. However, 1441151882 is a better magic number than 2882303763, because the former is 31 bits while the latter is 32 bits. This trivial-looking difference actually matters quite a bit on platforms like x86: when computing $y = \left\lfloor \frac{n m}{2^{L}} \right\rfloor$, we want to leverage the fast imul instruction, but imul sign-extends its immediate operand when performing a 64-bit multiplication. Hence, if the magic number is 32 bits, the multiplication cannot be done in a single instruction, and the magic number must first be loaded (zero-extended) into a register.)

Therefore, we are indeed able to always choose D=32, which results in the following code:

char* itoa_better_y(std::uint32_t n, char* buffer) {
  std::uint64_t prod;

  auto get_next_two_digits = [&]() {
    prod = std::uint32_t(prod) * std::uint64_t(100);
    return int(prod >> 32);
  };
  auto print_1 = [&](int digit) {
    buffer[0] = char(digit + '0');
    buffer += 1;
  };
  auto print_2 = [&] (int two_digits) {
    std::memcpy(buffer, radix_100_table + two_digits * 2, 2);
    buffer += 2;
  };
  auto print = [&](std::uint64_t magic_number, int extra_shift, auto remaining_count) {
    prod = n * magic_number;
    prod >>= extra_shift;
    auto two_digits = int(prod >> 32);

    if (two_digits < 10) {
      print_1(two_digits);
      for (int i = 0; i < remaining_count; ++i) {
        print_2(get_next_two_digits());
      }
    }
    else {
      print_2(two_digits);
      for (int i = 0; i < remaining_count; ++i) {
        print_2(get_next_two_digits());
      }
    }
  };

  if (n < 100) {
    if (n < 10) {
      // 1 digit.
      print_1(n);
    }
    else {
      // 2 digits.
      print_2(n);
    }
  }
  else {
    if (n < 100'0000) {
      if (n < 1'0000) {
        // 3 or 4 digits.
        // 42949673 = ceil(2^32 / 10^2)
        print(42949673, 0, std::integral_constant<int, 1>{});
      }
      else {
        // 5 or 6 digits.
        // 429497 = ceil(2^32 / 10^4)
        print(429497, 0, std::integral_constant<int, 2>{});
      }
    }
    else {
      if (n < 1'0000'0000) {
        // 7 or 8 digits.
        // 281474978 = ceil(2^48 / 10^6) + 1
        print(281474978, 16, std::integral_constant<int, 3>{});
      }
      else {
        if (n < 10'0000'0000) {
          // 9 digits.
          // 1441151882 = ceil(2^57 / 10^8) + 1
          prod = n * std::uint64_t(1441151882);
          prod >>= 25;
          print_1(int(prod >> 32));
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
        }
        else {
          // 10 digits.
          // 1441151881 = ceil(2^57 / 10^8)
          prod = n * std::uint64_t(1441151881);
          prod >>= 25;
          print_2(int(prod >> 32));
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
        }
      }
    }
  }
  return buffer;
}

(Demo: https://godbolt.org/z/7TaqYa9h1)

Note: Looking at a port of James Anhalt's original algorithm, it seems that the above code is probably a little bit better than the original implementation (which can be confirmed in the benchmark below), because the original algorithm performs an additional addition after the first multiplication-and-shift for digit lengths above a certain threshold. With our choice of magic numbers, that is not necessary.

Benchmark

Alright, now let’s compare the performance of these implementations.

(Benchmark results plot: 2022-02-16-itoa_bench)

Link: https://quick-bench.com/q/ieIpsBhWC751YUhyVS2OE83dTO4

itoa_var_length_naive is a straightforward variation of itoa_var_length doing the naive quotient/remainder computation instead of playing with y. Well, compared to itoa_var_length_naive, the performance benefit of itoa_better_y seems not very impressive to be honest. Nevertheless, I still think the idea behind the algorithm is pretty intriguing.

Back to fixed-length case

So far we have only looked at the case of 32-bit unsigned integers. For 64-bit integers, what people typically do is to first split the input number into several 32-bit segments, and then print each segment using the methods for 32-bit integers. In that case, typically only the first segment has variable length and the remaining segments have fixed length. As we can see in the benchmark above, when the length is known we can do a lot better. For simplicity of the following discussion, let us suppose that the input integer is of at most 9 digits and we want to always print 9 digits with possible leading zeros.
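To fix ideas, here is a minimal sketch (my own code, not from the original post) of that segmentation for 64-bit integers; it reuses itoa_better_y from above and assumes a fixed-length helper itoa_always_9_digits, a possible implementation of which is sketched at the end of this section:

char* itoa_u64(std::uint64_t n, char* buffer) {
  constexpr std::uint64_t seg = 1'000'000'000;  // 10^9, one 9-digit segment
  if (n < seg) {
    // At most 9 digits: a single variable-length segment.
    return itoa_better_y(std::uint32_t(n), buffer);
  }
  if (n < seg * seg) {
    // 10 to 18 digits: variable-length head + one fixed 9-digit segment.
    buffer = itoa_better_y(std::uint32_t(n / seg), buffer);
    return itoa_always_9_digits(std::uint32_t(n % seg), buffer);
  }
  // 19 or 20 digits: variable-length head + two fixed 9-digit segments.
  buffer = itoa_better_y(std::uint32_t(n / (seg * seg)), buffer);
  buffer = itoa_always_9_digits(std::uint32_t((n / seg) % seg), buffer);
  return itoa_always_9_digits(std::uint32_t(n % seg), buffer);
}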

While what we have done in itoa_always_10_digits is not so bad, we can certainly do better by choosing $D = 32$, which eliminates the need for masking at each generation of 2 digits. Recall that all we need to do is to find an integer $y$ satisfying

$$\left\lceil \frac{2^{D} n}{10^{k}} \right\rceil \leq y < \left\lceil \frac{2^{D} (n + 1)}{10^{k}} \right\rceil \tag{$\star$}$$

for the given $n$. Since we want to always print 9 digits, we take $k = 8$. What is different from the previous case is that now $n$ can be any integer in the range $[1, 10^{9} - 1]$ (ignoring the $n = 0$ case, which is not a big deal), and in particular it can be very small. In this case, one can show by exhaustively checking all possible $n$'s that the inequality

$$\frac{1}{n} \left\lceil \frac{2^{D} n}{10^{k}} \right\rceil \leq \frac{m}{2^{L}} < \frac{1}{n} \left\lceil \frac{2^{D} (n + 1)}{10^{k}} \right\rceil \tag{$\star\star$}$$

does not have a solution, because the maximum value of the left-hand side is bigger than the minimum value of the right-hand side. Therefore, it is not possible to compute $y$ by just performing a multiplication followed by a shift.

Instead, we choose

$$y = \left\lfloor \frac{n m}{2^{L}} \right\rfloor + 1, \tag{34}$$

which means that we can indeed omit masking at each generation of 2 digits, but at the cost of additionally performing an increment in the initial step of computing $y$.

With this choice of $y$, $(\star)$ becomes

$$\frac{1}{n} \left\lceil \frac{2^{D} n}{10^{k}} \right\rceil - \frac{1}{n} \leq \frac{m}{2^{L}} < \frac{1}{n} \left\lceil \frac{2^{D} (n + 1)}{10^{k}} \right\rceil - \frac{1}{n} \tag{$\star\star\star$}$$

instead of $(\star\star)$. Then we can perform a similar analysis to conclude that $L = 25$ with

$$m = \left\lceil \frac{2^{D+L}}{10^{k}} \right\rceil = 1441151881 \tag{35}$$

does the job. In fact, an exhaustive check shows that we can even take $L = 24$ and $m = 720575941$.

EDIT: Recall that in the above, the range of $n$ is constrained to $[0, 10^{9} - 1]$. It turns out that $L = 24$ with $m = 720575941$ works only up to $n = 1133989877$, and it starts to produce errors for $n \geq 1133989878$.

I derived a better, but still incomplete, analysis here. After a while, I eventually obtained a complete algorithm for computing the optimal bounds, and wrote a program that does the analysis. The algorithm itself is not published anywhere at this moment.
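Putting the pieces together, here is a sketch (again my own code, not from the original post) of the fixed-length 9-digit printer, using $L = 25$, $m = 1441151881$, $D = 32$ and $y = \left\lfloor n m / 2^{L} \right\rfloor + 1$ as derived above; it reuses radix_100_table and could serve as the itoa_always_9_digits helper assumed earlier:

char* itoa_always_9_digits(std::uint32_t n, char* buffer) {
  // y = floor(n * m / 2^25) + 1 with m = 1441151881; no masking is needed
  // afterwards because D = 32 and taking the lower 32 bits is free.
  auto y = ((n * std::uint64_t(1441151881)) >> 25) + 1;
  buffer[0] = char('0' + int(y >> 32));
  y = std::uint32_t(y) * std::uint64_t(100);
  std::memcpy(buffer + 1, radix_100_table + int(y >> 32) * 2, 2);
  y = std::uint32_t(y) * std::uint64_t(100);
  std::memcpy(buffer + 3, radix_100_table + int(y >> 32) * 2, 2);
  y = std::uint32_t(y) * std::uint64_t(100);
  std::memcpy(buffer + 5, radix_100_table + int(y >> 32) * 2, 2);
  y = std::uint32_t(y) * std::uint64_t(100);
  std::memcpy(buffer + 7, radix_100_table + int(y >> 32) * 2, 2);
  return buffer + 9;
}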

Concluding remarks

I applied a minor variation of the algorithm explained here to my Dragonbox implementation to speed up digit generation, and the result was quite satisfactory. I think the complete branching over all possible lengths of the input is probably not a brilliant idea for generic applications, but the idea of coming up with a $y$ that enables generating each pair of 2 digits with only a multiply-and-shift is very clever and useful. I expect that the same trick might be applicable to other problems as well, for example fixed-precision floating-point formatting.
