# Advent of Code 2019 in 130 ms: day 16

*This is part of a series of posts, starting with Advent of Code 2019 in 130 ms: introduction.*

## Day 16: Flawed Frequency Transmission

Challenge 16 is probably the one I’ve spent the most time on optimizing, but my solution still takes 8.6 ms to run. There’s quite a lot to unpack here, so this one gets a whole post of its own!

Let `d(p, i)` be the `i`th signal digit after `p` phases, where `p = 0, 1, ..., 100` and `i = 0, 1, ..., (L - 1)` with `L` as the number of digits. `d(0, ...)` is the puzzle input, and `d(p + 1, ...)` is computed from `d(p, ...)`:

```
d(p + 1, i) = sum(d(p, j) * k(i, j) for j = 0, 1, ..., (L - 1))
```

`k(i, j)` is the repeating pattern of multipliers. To generate `k(i, ...)`, we start with the base sequence `0, 1, 0, -1` repeating infinitely. `k(0, ...)` takes this sequence unmodified, `k(1, ...)` stretches it so each element appears twice in a row, `k(2, ...)` stretches it to three times in a row, and so on; finally, each `k(i, ...)` ignores the very first element. We can visualize this as a matrix as shown in figure 4.
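Spelled out directly from these definitions, one phase can be sketched in Rust like this (a naive reference implementation for clarity, not the optimized solution; `fft_phase` is a name of my own):

```rust
/// One naive FFT phase: the next digit i is the last digit of
/// |sum over j of d(p, j) * k(i, j)|, with k(i, j) generated from
/// the base pattern 0, 1, 0, -1.
fn fft_phase(digits: &[i32]) -> Vec<i32> {
    const BASE: [i32; 4] = [0, 1, 0, -1];
    (0..digits.len())
        .map(|i| {
            let sum: i32 = digits
                .iter()
                .enumerate()
                // Stretch the base pattern (i + 1)-fold and skip its very
                // first element by indexing with (j + 1) / (i + 1).
                .map(|(j, &d)| d * BASE[((j + 1) / (i + 1)) % 4])
                .sum();
            sum.abs() % 10 // keep the last digit of the absolute value
        })
        .collect()
}
```

On the example signal `12345678` from the problem statement, this yields `48226158` after one phase and `01029498` after four.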

For part 1, we need to compute `d(100, 0, 1, ..., 7)`. My solution doesn’t do anything particularly fancy, but it uses a few tricks to avoid computing multiplications. First, it computes the contributions from positive and negative terms separately and completely skips the zero terms. By inspecting figure 4 we can work out the pattern: positive terms appear in runs of `i + 1` elements every `(i + 1) * 4` terms, starting with the `i`th, and negative terms appear in runs of `i + 1` elements every `(i + 1) * 4` terms, starting with the `(i + (i + 1) * 2)`th. In Rust, this can be expressed fairly easily using the `step_by`, `flat_map` and `sum` methods of the `Range` type. This optimization approximately halves the runtime compared to computing the multiplication for each element.
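As a sketch of how that looks (the function name and exact slicing are my own; the real solution is structured similarly but not identically):

```rust
/// One output digit, skipping all zero multipliers: positive runs of
/// i + 1 elements start at index i, negative runs at 3i + 2, and both
/// recur every (i + 1) * 4 elements.
fn next_digit(digits: &[i32], i: usize) -> i32 {
    let l = digits.len();
    let step = (i + 1) * 4;
    let pos: i32 = (i..l)
        .step_by(step)
        .flat_map(|start| &digits[start..(start + i + 1).min(l)])
        .sum();
    let neg: i32 = (3 * i + 2..l)
        .step_by(step)
        .flat_map(|start| &digits[start..(start + i + 1).min(l)])
        .sum();
    (pos - neg).abs() % 10
}
```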

We can go further, though: we can see in figure 4 that if `i` is past `L / 2`, then the multipliers will be 0 for `j < i` and 1 for `j >= i`, so we can simplify the formula to

```
d(p + 1, i) = sum(d(p, j) for j = i, i + 1, ..., (L - 1)) mod 10
where i >= floor(L / 2)
```

Similarly, if `i` is past `L / 3`, then the multipliers are just a stretch of ones on the middle third:

```
d(p + 1, i) = sum(d(p, j) for j = i, i + 1, ..., 2i) mod 10
where i >= floor(L / 3) and i < floor(L / 2)
```

Finally, if `i` is past `L / 4`, then the multipliers are one stretch of ones and one stretch of negative ones:

```
d(p + 1, i) = ( sum(d(p, j) for j = i, i + 1, ..., 2i)
              - sum(d(p, j) for j = (3i + 2), (3i + 3), ..., (4i + 2))
              ) mod 10
where i >= floor(L / 4) and i < floor(L / 3)
```

These three additional optimizations approximately halve the runtime again, bringing us down to about one fourth of the runtime of the plain multiplication method.

Also interesting is that although we’re working with single-digit numbers, it turns out to be faster to store the digits as `i32` rather than `i16` or `i8`. With `i8` you need to compute a modulo operation after each addition to avoid overflow, while with `i16` and `i32` you can compute the sum over all the digits and take the modulo afterwards. I’m guessing the reason `i32` is slightly faster than `i16` is that my processor is more optimized for 32-bit than for 16-bit arithmetic. Furthermore, Rust’s `Iterator::sum` turns out to be much faster than summing with `Iterator::fold`.

### Part 2

For part 2, it gets more complicated. The digit sequence is now repeated 10,000 times, which means our part 1 solution would take 100,000,000 times longer to run since every digit of every phase depends on every other digit in the previous phase. We need to be a lot smarter about this one.

Fortunately, this time we’re not computing the first 8 digits of the 100th phase, but the 8 digits starting at an offset defined by the puzzle input. The message offset is the first 7 digits of the puzzle input, which in my case is 5,975,093. My puzzle input is 650 digits long, so the digit sequence is 6,500,000 digits long. A crucial observation here is that the offset is past the halfway point of the digit sequence. If we assume this will always be the case, we can take some huge shortcuts.

Recall that in figure 4, the lower half of the matrix
is all ones above the diagonal, and all zeroes below the diagonal -
what’s known as an *upper triangular matrix*.
If we expand the formulae for each digit in the next phase,
starting from the end, we get this:

```
d(p + 1, 10) = d(p, 10)
d(p + 1, 9) = d(p, 9) + d(p, 10)
d(p + 1, 8) = d(p, 8) + d(p, 9) + d(p, 10)
d(p + 1, 7) = d(p, 7) + d(p, 8) + d(p, 9) + d(p, 10)
```
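These expansions amount to repeated suffix sums over the tail of the sequence; iterating them in place can be sketched like this (an illustration with my own naming, not the approach the rest of this post optimizes):

```rust
/// Run `phases` FFT phases over the tail of the sequence (everything at
/// or past the message offset): each phase replaces every element with
/// the suffix sum of the previous phase, mod 10.
fn run_phases_tail(tail: &mut [i32], phases: usize) {
    for _ in 0..phases {
        let mut sum = 0;
        for d in tail.iter_mut().rev() {
            sum = (sum + *d) % 10;
            *d = sum;
        }
    }
}
```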

If we proceed a few more phases, we get figure 5:

Here we see a couple of sequences appear: `1, 2, 3, 4`, `1, 3, 6, 10`, `1, 4, 10, 20`. If those seem familiar, it’s because those are exactly the diagonals of Pascal’s triangle. This means that if we denote by `P(p, i)` the `i`th element (starting from `P(p, 0) = 1`) of the `p`th diagonal (starting from `P(1, ...) = 1, 1, 1, ...`, the coefficients after one phase) of Pascal’s triangle, we can compute `d(p, i)` as

```
d(p, i) = sum(P(p, j) * d(0, i + j) for j = 0, 1, ..., (L - 1 - i)) mod 10
where i >= floor(L / 2)
```

Since we’re working modulo 10, we also only need to compute `P(p, j)` mod 10, which means we won’t have any overflow issues despite `P(p, j)` growing very quickly for large `p` or `j`. This reduces the number of operations from `100 * (650 * 10,000)^2 = 4,225,000,000,000,000` to about `100 * 8 * (6,500,000 - 5,975,093) = 419,925,600`, a factor of 10 million.
This is enough to bring runtime down to a quite feasible ~280 ms,
unlike the naïve method which would take many, many hours.
I believe that finding this first trick, or something similar,
is necessary to solve part 2 at all.
But there are a few more tricks we can use to go even faster!

Even with the previous trick, we still need to compute 524,907 elements each of 100 diagonals of Pascal’s triangle. There are methods to compute a diagonal on its own without needing the previous ones, but they rely on multiplication to generate one element from the previous - and repeated multiplication does not play well under modulo like addition does, so you quickly run into overflow issues. I was able to work around this by representing numbers by their prime factorization, which means you can multiply by simply adding the exponentials, but this turned out to be slower than just computing each diagonal from the previous by addition.
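Computing each diagonal from the previous one by addition is a prefix-sum pass, and doing the additions mod 10 sidesteps overflow entirely. A sketch (the function name is mine):

```rust
/// Apply `passes` prefix-sum passes to the all-ones diagonal, mod 10.
/// Each pass turns one diagonal of Pascal's triangle into the next,
/// and reducing mod 10 at every step keeps all values single-digit.
fn pascal_diagonal_mod10(passes: usize, len: usize) -> Vec<i32> {
    let mut diag = vec![1; len];
    for _ in 0..passes {
        let mut sum = 0;
        for x in diag.iter_mut() {
            sum = (sum + *x) % 10;
            *x = sum;
        }
    }
    diag
}
```

With 99 passes this produces the coefficients for 100 phases.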

So let’s take a look at what `P(100, ...) mod 10` looks like:

```
1, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 4, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
```

Pretty sparse. Maybe we can find some pattern in there?
If we plot `P(100, 0 ... 100000) mod 10`, we get figure 6:

The bands at values 1 and 9 suggest that this sequence may be periodic. So I cobbled together a simple MATLAB/Octave script to check:

```
function [ P ] = find_period(seq)
  P = -1;
  for period = 1:floor((length(seq) + 1) / 2)
    repetitions = floor(length(seq) / period);
    if all(seq == [repmat(seq(1:period), 1, repetitions), seq(1:mod(length(seq), period))])
      P = period;
      break
    end
  end
end
```
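The same brute-force check is compact in Rust, too (my own port; it returns `None` when no candidate period fits at least twice):

```rust
/// Smallest period p such that every element equals the element p
/// positions earlier; the period must fit at least twice into `seq`.
fn find_period(seq: &[i32]) -> Option<usize> {
    (1..=seq.len() / 2).find(|&p| (p..seq.len()).all(|i| seq[i] == seq[i - p]))
}
```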

It turns out that `P(100, ...) mod 10` is indeed periodic, with a period of 16,000 elements! This means we only need to compute 100 * 16,000 elements instead of 100 * 524,907, which reduces the runtime of part 2 from ~280 ms to ~28 ms.

The next step is to make even more use of this periodicity: since our digit sequence is also periodic, with 650 elements, the sequence of products is periodic too, with at most `lcm(650, 16000) = 208000` elements. We have 524,907 digits to process, which is about two and a half cycles, so we only need to sum one full cycle (reusing its total for every full repetition) plus the trailing partial cycle. We can thus reduce our formula to:

```
c = 16000
C = lcm(c, 650) = 208000
N = floor((L - message_offset) / C)
d(p, i) = ( N * sum(P(p, j mod c) * d(0, i + j) for j = 0, 1, ..., (C - 1))
          + sum(P(p, j mod c) * d(0, i + j) for j = N * C, N * C + 1, ..., (L - 1 - i))
          ) mod 10
where i >= floor(L / 2)
```
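A sketch of that computation in Rust (all names are mine; the real solution fixes `c = 16000` and my 650-digit input, while this version takes the diagonal and input as parameters):

```rust
fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// One message digit: the products are periodic with period
/// lcm(c, input_len), so N full cycles contribute N times the first
/// cycle's sum and only the trailing partial cycle is summed explicitly.
fn tail_digit(diag: &[i32], input: &[i32], total_len: usize, i: usize) -> i32 {
    let c = diag.len(); // period of the Pascal diagonal mod 10
    let cycle = c / gcd(c, input.len()) * input.len(); // lcm(c, input_len)
    let terms = total_len - i;
    let n = terms / cycle; // number of full cycles
    let product = |j: usize| (diag[j % c] * input[(i + j) % input.len()]) as i64;
    let cycle_sum: i64 = (0..cycle.min(terms)).map(product).sum();
    let rest_sum: i64 = (n * cycle..terms).map(product).sum();
    ((n as i64 * cycle_sum + rest_sum) % 10) as i32
}
```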

This saves about an additional third of runtime, bringing part 2 down to ~19 ms.
For my puzzle input it happens that the two full cycles sum to zero,
so it might be possible to eliminate those altogether,
but I haven’t been able to prove this will always be true.
Anyway, some more implementation optimizations further reduce the time to ~11 ms: storing digits as `i32`, eliminating unnecessary intermediate `Vec` allocations, hard-coding the number of phases (100) and the period of Pascal’s triangle (16,000), and using `Iterator::sum` instead of `fold`.

As a final performance optimization, we can abandon good taste and hard-code the 100th diagonal of Pascal’s triangle. We can do this since the number of phases is specifically defined in the problem statement, because Pascal’s triangle is of course the same for any puzzle input, and because an array of 16,000 elements is after all quite small. This brings part 2 runtime down to ~3.2 ms, and both parts down to 8.6 ms total.