# Advent of Code 2019 in 130 ms: day 16

*This is part of a series of posts, starting with Advent of Code 2019 in 130 ms: introduction.*

## Day 16: Flawed Frequency Transmission

Challenge 16 is probably the one I’ve spent the most time on optimizing, but my solution still takes 8.6 ms to run. There’s quite a lot to unpack here, so this one gets a whole post of its own!

Let `d(p, i)` be the `i`th signal digit after `p` phases, where `p = 0, 1, ..., 100` and `i = 0, 1, ..., (L - 1)` with `L` as the number of digits. `d(0, ...)` is the puzzle input, and `d(p + 1, ...)` is computed from `d(p, ...)`:

```
d(p + 1, i) = sum(d(p, j) * k(i, j) for j = 0, 1, ..., (L - 1))
```

`k(i, j)` is the repeating pattern of multipliers. To generate `k(i, ...)`, we start with the base sequence `0, 1, 0, -1` repeating infinitely. `k(0, ...)` takes this sequence unmodified, `k(1, ...)` stretches it so each element appears twice in a row, `k(2, ...)` stretches it to three times in a row, and so on; finally, each `k(i, ...)` ignores the very first element. We can visualize this as a matrix as shown in figure 4.
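Spelled out directly from these definitions, one phase can be sketched in Rust like this (a naive reference implementation for clarity, not the optimized solution; `fft_phase` is a name of my own):

```rust
/// One naive FFT phase: the next digit i is the last digit of
/// |sum over j of d(p, j) * k(i, j)|, with k(i, j) generated from
/// the base pattern 0, 1, 0, -1.
fn fft_phase(digits: &[i32]) -> Vec<i32> {
    const BASE: [i32; 4] = [0, 1, 0, -1];
    (0..digits.len())
        .map(|i| {
            let sum: i32 = digits
                .iter()
                .enumerate()
                // Stretch the base pattern (i + 1)-fold and skip its very
                // first element by indexing with (j + 1) / (i + 1).
                .map(|(j, &d)| d * BASE[((j + 1) / (i + 1)) % 4])
                .sum();
            sum.abs() % 10 // keep the last digit of the absolute value
        })
        .collect()
}
```

On the example signal `12345678` from the problem statement, this yields `48226158` after one phase and `01029498` after four.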

For part 1, we need to compute `d(100, 0, 1, ..., 7)`. My solution doesn’t do anything particularly fancy, but it uses a few tricks to avoid computing multiplications. First, it computes the contributions from positive and negative terms separately and completely skips the zero terms. By inspecting figure 4 we can work out the pattern: positive terms appear in runs of `i + 1` elements every `(i + 1) * 4` terms, starting with the `i`th, and negative terms appear in runs of `i + 1` elements every `(i + 1) * 4` terms, starting with the `(i + (i + 1) * 2)`th. In Rust, this can be expressed fairly easily using the `step_by`, `flat_map` and `sum` methods of the `Range` type. This optimization approximately halves the runtime compared to computing the multiplication for each element.
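As a sketch of how that looks (the function name and exact slicing are my own; the real solution is structured similarly but not identically):

```rust
/// One output digit, skipping all zero multipliers: positive runs of
/// i + 1 elements start at index i, negative runs at 3i + 2, and both
/// recur every (i + 1) * 4 elements.
fn next_digit(digits: &[i32], i: usize) -> i32 {
    let l = digits.len();
    let step = (i + 1) * 4;
    let pos: i32 = (i..l)
        .step_by(step)
        .flat_map(|start| &digits[start..(start + i + 1).min(l)])
        .sum();
    let neg: i32 = (3 * i + 2..l)
        .step_by(step)
        .flat_map(|start| &digits[start..(start + i + 1).min(l)])
        .sum();
    (pos - neg).abs() % 10
}
```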

We can go further, though: we can see in figure 4 that if `i` is past `L / 2`, then the multipliers will be 0 for `j < i` and 1 for `j >= i`, so we can simplify the formula to

```
d(p + 1, i) = sum(d(p, j) for j = i, i + 1, ..., (L - 1)) mod 10
where i >= floor(L / 2)
```

Similarly, if `i` is past `L / 3`, then the multipliers are just a stretch of ones on the middle third:

```
d(p + 1, i) = sum(d(p, j) for j = i, i + 1, ..., 2i) mod 10
where i >= floor(L / 3) and i < floor(L / 2)
```

Finally, if `i` is past `L / 4`, then the multipliers are one stretch of ones and one stretch of negative ones:

```
d(p + 1, i) = ( sum(d(p, j) for j = i, i + 1, ..., 2i)
              - sum(d(p, j) for j = (3i + 2), (3i + 3), ..., (4i + 2))
              ) mod 10
where i >= floor(L / 4) and i < floor(L / 3)
```

These three additional optimizations approximately halve the runtime again, bringing us down to about one fourth of the runtime of the plain multiplication method.

Also interesting is that although we’re working with single-digit numbers, it turns out to be faster to store the digits as `i32` rather than `i16` or `i8`. With `i8` you need to compute a modulo operation after each addition to avoid overflow, while with `i16` and `i32` you can compute the sum over all the digits and take the modulo afterwards. I’m guessing the reason `i32` is slightly faster than `i16` is that my processor is more optimized for 32-bit than for 16-bit arithmetic. Furthermore, Rust’s `Iterator::sum` turns out to be much faster than summing with `Iterator::fold`.

### Part 2

For part 2, it gets more complicated. The digit sequence is now repeated 10,000 times, which means our part 1 solution would take 100,000,000 times longer to run since every digit of every phase depends on every other digit in the previous phase. We need to be a lot smarter about this one.

Fortunately, this time we’re not computing the first 8 digits of the 100th phase, but the 8 digits starting at an offset defined by the puzzle input. The message offset is the first 7 digits of the puzzle input, which in my case is 5,975,093. My puzzle input is 650 digits long, so the digit sequence is 6,500,000 digits long. A crucial observation here is that the offset is past the halfway point of the digit sequence. If we assume this will always be the case, we can take some huge shortcuts.

Recall that in figure 4, the lower half of the matrix
is all ones above the diagonal, and all zeroes below the diagonal -
what’s known as an *upper triangular matrix*.
If we expand the formulae for each digit in the next phase,
starting from the end, we get this:

```
d(p + 1, 10) = d(p, 10)
d(p + 1, 9) = d(p, 9) + d(p, 10)
d(p + 1, 8) = d(p, 8) + d(p, 9) + d(p, 10)
d(p + 1, 7) = d(p, 7) + d(p, 8) + d(p, 9) + d(p, 10)
```
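These expansions amount to repeated suffix sums over the tail of the sequence; iterating them in place can be sketched like this (an illustration with my own naming, not the approach the rest of this post optimizes):

```rust
/// Run `phases` FFT phases over the tail of the sequence (everything at
/// or past the message offset): each phase replaces every element with
/// the suffix sum of the previous phase, mod 10.
fn run_phases_tail(tail: &mut [i32], phases: usize) {
    for _ in 0..phases {
        let mut sum = 0;
        for d in tail.iter_mut().rev() {
            sum = (sum + *d) % 10;
            *d = sum;
        }
    }
}
```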

If we proceed a few more phases, we get figure 5:

Here we see a couple of sequences appear: `1, 2, 3, 4`, `1, 3, 6, 10`, `1, 4, 10, 20`. If those seem familiar, it’s because those are exactly the diagonals of Pascal’s triangle. This means that if we denote by `P(p, i)` the `i`th element (starting from `P(p, 0) = 1`) of the `p`th diagonal (starting from `P(1, ...) = 1, 1, 1, ...`, the coefficients after one phase) of Pascal’s triangle, we can compute `d(p, i)` as

```
d(p, i) = sum(P(p, j) * d(0, i + j) for j = 0, 1, ..., (L - 1 - i)) mod 10
where i >= floor(L / 2)
```

Since we’re working modulo 10, we also only need to compute `P(p, j)` mod 10, which means we won’t have any overflow issues despite `P(p, j)` growing very quickly for large `p` or `j`. This reduces the number of operations from `100 * (650 * 10,000)^2 = 4,225,000,000,000,000` to about `100 * 8 * (6,500,000 - 5,975,093) = 419,925,600`, a factor of 10 million.
This is enough to bring runtime down to a quite feasible ~280 ms,
unlike the naïve method which would take many, many hours.
I believe that finding this first trick, or something similar,
is necessary to solve part 2 at all.
But there are a few more tricks we can use to go even faster!

Even with the previous trick, we still need to compute 524,907 elements each of 100 diagonals of Pascal’s triangle. There are methods to compute a diagonal on its own without needing the previous ones, but they rely on multiplication to generate one element from the previous - and repeated multiplication does not play well under modulo like addition does, so you quickly run into overflow issues. I was able to work around this by representing numbers by their prime factorization, which means you can multiply by simply adding the exponentials, but this turned out to be slower than just computing each diagonal from the previous by addition.
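Computing each diagonal from the previous one by addition is a prefix-sum pass, and doing the additions mod 10 sidesteps overflow entirely. A sketch (the function name is mine):

```rust
/// Apply `passes` prefix-sum passes to the all-ones diagonal, mod 10.
/// Each pass turns one diagonal of Pascal's triangle into the next,
/// and reducing mod 10 at every step keeps all values single-digit.
fn pascal_diagonal_mod10(passes: usize, len: usize) -> Vec<i32> {
    let mut diag = vec![1; len];
    for _ in 0..passes {
        let mut sum = 0;
        for x in diag.iter_mut() {
            sum = (sum + *x) % 10;
            *x = sum;
        }
    }
    diag
}
```

With 99 passes this produces the coefficients for 100 phases.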

So let’s take a look at what `P(100, ...) mod 10` looks like:

```
1, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 0, 0, 0, 5, 4, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
```

Pretty sparse. Maybe we can find some pattern in there?
If we plot `P(100, 0 ... 100000) mod 10`, we get figure 6:

The bands at values 1 and 9 suggest that this sequence may be periodic. So I cobbled together a simple MATLAB/Octave script to check:

```
function [ P ] = find_period(seq)
  P = -1;
  for period = 1:floor((length(seq) + 1) / 2)
    repetitions = floor(length(seq) / period);
    if all(seq == [repmat(seq(1:period), 1, repetitions), seq(1:mod(length(seq), period))])
      P = period;
      break
    end
  end
end
```
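The same brute-force check is compact in Rust, too (my own port; it returns `None` when no candidate period fits at least twice):

```rust
/// Smallest period p such that every element equals the element p
/// positions earlier; the period must fit at least twice into `seq`.
fn find_period(seq: &[i32]) -> Option<usize> {
    (1..=seq.len() / 2).find(|&p| (p..seq.len()).all(|i| seq[i] == seq[i - p]))
}
```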

It turns out that `P(100, ...) mod 10` is indeed periodic, with a period of 16,000 elements! This means we only need to compute 100 * 16,000 elements instead of 100 * 524,907, which reduces the runtime of part 2 from ~280 ms to ~28 ms.

The next step is to make even more use of this periodicity: since our digit sequence is also periodic, with 650 elements, the sequence of products is periodic too, with at most `lcm(650, 16000) = 208000` elements. We have 524,907 digits to process, which is about two and a half cycles, so we only need to sum one full cycle (reusing its total for every full repetition) plus the trailing partial cycle. We can thus reduce our formula to:

```
c = 16000
C = lcm(c, 650) = 208000
N = floor((L - message_offset) / C)
d(p, i) = ( N * sum(P(p, j mod c) * d(0, i + j) for j = 0, 1, ..., (C - 1))
          + sum(P(p, j mod c) * d(0, i + j) for j = N * C, N * C + 1, ..., (L - 1 - i))
          ) mod 10
where i >= floor(L / 2)
```
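A sketch of that computation in Rust (all names are mine; the real solution fixes `c = 16000` and my 650-digit input, while this version takes the diagonal and input as parameters):

```rust
fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// One message digit: the products are periodic with period
/// lcm(c, input_len), so N full cycles contribute N times the first
/// cycle's sum and only the trailing partial cycle is summed explicitly.
fn tail_digit(diag: &[i32], input: &[i32], total_len: usize, i: usize) -> i32 {
    let c = diag.len(); // period of the Pascal diagonal mod 10
    let cycle = c / gcd(c, input.len()) * input.len(); // lcm(c, input_len)
    let terms = total_len - i;
    let n = terms / cycle; // number of full cycles
    let product = |j: usize| (diag[j % c] * input[(i + j) % input.len()]) as i64;
    let cycle_sum: i64 = (0..cycle.min(terms)).map(product).sum();
    let rest_sum: i64 = (n * cycle..terms).map(product).sum();
    ((n as i64 * cycle_sum + rest_sum) % 10) as i32
}
```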

This saves about an additional third of runtime, bringing part 2 down to ~19 ms.
For my puzzle input it happens that the two full cycles sum to zero,
so it might be possible to eliminate those altogether,
but I haven’t been able to prove this will always be true.
Anyway, some more implementation optimizations further reduce the time to ~11 ms: storing digits as `i32`, eliminating unnecessary intermediate `Vec` allocations, hard-coding the number of phases (100) and the period of Pascal’s triangle (16,000), and using `Iterator::sum` instead of `fold`.

As a final performance optimization, we can abandon good taste and hard-code the 100th diagonal of Pascal’s triangle. We can do this since the number of phases is specifically defined in the problem statement, because Pascal’s triangle is of course the same for any puzzle input, and because an array of 16,000 elements is after all quite small. This brings part 2 runtime down to ~3.2 ms, and both parts down to 8.6 ms total.