Ziv Scully

The Information Theory of Brooklyn 99

Sun, 04 Aug 2019 00:00:00 +0000

During my internship at IBM Research this summer, my office mate Renbo and I discussed the following extremely important research problem.

There are twelve men on an island. Eleven weigh exactly the same amount, but one of them is slightly lighter or heavier. You must figure out which. The island has no scales, but there is a see-saw. The exciting catch: you can only use it three times.

To clarify: the riddle is asking us to find the odd-weight islander, but we do not need to determine whether the odd-weight islander is lighter or heavier.

Instead of solving this riddle, we’re going to take the riddle-writer’s perspective: what is the maximum number of islanders such that the riddle is solvable? Along the way, we’ll uncover some hints about how to approach the original riddle.

The Problem

Suppose we have $n$ islanders and are allowed $w$ weighings on the see-saw. We ask: under what conditions on $n$ and $w$ is the riddle solvable? (In the original riddle, $n = 12$ and $w = 3$.) There are two sorts of conditions we might look for.

A sufficient condition on $n$ and $w$ is one such that if the condition holds, the riddle is solvable.
A necessary condition on $n$ and $w$ is one such that if the condition fails, the riddle is unsolvable.

The way to prove that a condition is sufficient is pretty intuitive: given the condition, find a solution! For example, solving the original riddle shows that $n = 12, w = 3$ is a sufficient condition. Of course, as those who paused the Brooklyn 99 episode to try out the riddle quickly found out, finding the solution can still be tricky. But the overall approach to proving a sufficient condition is clear.

Necessary conditions are more difficult. To prove that a condition is necessary, we have to show that if the condition fails, then there does not exist any islander-weighing strategy that solves the riddle. For example, suppose we want to prove that the puzzle is impossible for $n = 13$ islanders in $w = 3$ weighings. It’s not enough to take a solution that finds the odd-weight islander for $n = 12, w = 3$ and demonstrate that it fails for $n = 13, w = 3$. Instead, we need to somehow consider every islander-weighing strategy.

In this post, we will find a necessary condition on $n$ and $w$ for the puzzle to be solvable. Specifically, we’ll prove that the puzzle is solvable only if

\[2n - 1 < 3^w.\]

To do so, we’ll use information theory, a branch of mathematics that helps us prove necessary conditions like this without explicitly considering every possible islander-weighing strategy.

The Short Version

We’re going to start by proving a slightly weaker necessary condition: the puzzle is solvable only if $2n - 1 \leq 3^w$ (note the non-strict inequality). Proving this weaker condition introduces all the key ideas without using any information theory.

We begin by observing that if the riddle is solvable, then it is solvable with a deterministic strategy (see Appendix, Lemma 1), so we can restrict our attention to deterministic strategies.

Suppose the riddle asked us to both find the odd-weight islander and determine whether they are lighter or heavier. To solve this modified riddle, we need to distinguish between the $2n$ possible scenarios: there are $n$ possibilities for the odd-weight islander, and they are either lighter or heavier. Each see-saw weighing has three possible outcomes: the left side is heavier, the right side is heavier, or the sides are balanced. So there are $3^w$ possible outcomes of a sequence of $w$ weighings. After observing one of these $3^w$ outcomes, we should be able to unambiguously determine which of the $2n$ scenarios we are in. We can’t do this if there are fewer outcomes than scenarios. This means $2n \leq 3^w$ is a necessary condition to solve the modified riddle.

The original riddle asks us just to find the odd-weight islander, so it seems like we only have to distinguish between $n$ possible scenarios. But it turns out that, even if we’re not trying to, we almost always end up figuring out whether the odd-weight islander is lighter or heavier. To see why, suppose we figure out that Charles is the odd-weight islander. If we ever put Charles on the see-saw in the process, then we see whether Charles was on the lighter or heavier side of the see-saw. This means the only way we can find the odd-weight islander without finding out whether they are lighter or heavier is if we never weigh the odd-weight islander.

Given an islander-weighing strategy, we say an islander is “lonely” if whenever they are the odd-weight islander, our strategy never weighs them. It turns out that any successful strategy can have at most one lonely islander (see Appendix, Lemma 2). This means that to solve the riddle, we must distinguish between at least $2n - 1$ possible scenarios:

a single scenario in which the odd-weight islander is the lonely islander, and
$2n - 2$ scenarios in which the odd-weight islander is one of the $n - 1$ non-lonely islanders.

There are still $3^w$ possible outcomes of $w$ weighings, so $2n - 1 \leq 3^w$ is a necessary condition to solve the riddle.

A Stricter Condition

Let’s think about what the $2n - 1 \leq 3^w$ condition says about the original $n = 12, w = 3$ riddle. Could we solve it with fewer weighings? How about more islanders?

Given $n = 12$, we must have $23 \leq 3^w$, which means we need $w \geq 3$, so the riddle cannot be solved in fewer weighings.
Given $w = 3$, we must have $2n - 1 \leq 27$, which means $n \leq 14$. So the riddle might be solvable with more islanders.
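
As a quick sanity check on the arithmetic above, here is a small Python sketch. The function names `max_islanders` and `min_weighings` are made up for this illustration, and the code only evaluates the weak necessary condition $2n - 1 \leq 3^w$; it says nothing about whether a working strategy actually exists.

```python
def max_islanders(w):
    """Largest n satisfying 2n - 1 <= 3**w (weak necessary condition only)."""
    # 3**w is odd, so (3**w + 1) is even and the division is exact.
    return (3**w + 1) // 2


def min_weighings(n):
    """Smallest w satisfying 2n - 1 <= 3**w (weak necessary condition only)."""
    w = 0
    while 3**w < 2 * n - 1:
        w += 1
    return w


print(max_islanders(3))   # upper bound on n allowed by three weighings
print(min_weighings(12))  # lower bound on w needed for twelve islanders
```

Both functions report bounds, not guarantees: passing the check means only that the counting argument does not rule the riddle out.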

It turns out that the riddle is solvable for $n = 13$ but not $n = 14$. But we could rule out $n = 14$ using the necessary condition $2n - 1 < 3^w$ (note the strict inequality) that we promised in the introduction. Proving this stricter condition is where the information theory comes in.

How Much Information is in Each Weighing?

One of the key insights of information theory is to draw a connection between information and randomness. Information theory views anything we don’t know as a random variable. The entropy of a random variable tells us, roughly speaking, how much information we expect to gain by learning the outcome of that random variable. For brevity, I’m not going to explain in detail how to define entropy, instead explaining just the bits we need for this post. (If you’re curious, sources like Wikipedia have pretty good explanations.)

In our context, the main random variable we want to know is which of the $2n - 1$ scenarios we are in. Let’s give it a name:

\[S = \text{``the scenario we're in''.}\]

If each of the scenarios is equally likely, then $S$ has entropy $H(S) = \log_2(2n - 1)$.

Solving the riddle entails learning the outcome of $S$, but we never observe $S$ directly. Instead, we observe the results of each weighing. These weighing outcomes are also random variables, so let’s write

\[T_i = \text{``the outcome of the $i$th weighing''.}\]

Let’s call the possible outcomes of a weighing $\ell$, $r$, and $b$ for tipping left, tipping right, and staying balanced, respectively. Writing $p_i$ for the probability mass function of $T_i$ (meaning, for example, that $p_3(\ell)$ is the probability that the see-saw tips left on the third weighing), we can write the entropy of $T_i$ as

\[H(T_i) = p_i(\ell) \log_2\biggl(\frac{1}{p_i(\ell)}\biggr) + p_i(r) \log_2\biggl(\frac{1}{p_i(r)}\biggr) + p_i(b) \log_2\biggl(\frac{1}{p_i(b)}\biggr).\]

To solve the riddle, it must be that the amount of information we get from the weighings is at least the amount of information we would get by directly learning which scenario we are in. That is, a necessary condition for the riddle to be solvable is

\[H(S) \leq \sum_{i = 1}^w H(T_i). \qquad (*)\]

(To be precise, the amount of information we learn from the $i$th weighing is never more than $H(T_i)$, but might be less if later weighings tell us information we already learned from earlier weighings. Look up conditional entropy for details.)

Our question thus becomes: how do we maximize $H(T_i)$, the entropy of the $i$th weighing? It turns out that the entropy of a random variable is maximized when all of its outcomes are equally likely. In the case of $T_i$, this happens when each outcome has probability $1/3$, so $H(T_i) \leq \log_2 3$. Plugging this bound into $(*)$, we get the necessary condition

\[\log_2(2n - 1) \leq w\log_2(3)\]

… which is equivalent to the weaker necessary condition $2n - 1 \leq 3^w$ we already derived. What’s missing?

To get the stricter condition, we need one last observation: we can’t actually make the three outcomes of each weighing equally likely! Suppose that in the first weighing we put $k$ islanders on each side of the see-saw. The probability of each outcome is

\[p_1(\ell) = \frac{2k}{2n - 1} \quad p_1(r) = \frac{2k}{2n - 1} \quad p_1(b) = \frac{2n - 1 - 4k}{2n - 1}.\]

That is, of the $2n - 1$ scenarios, there are $2k$ that would make the see-saw tip left: $k$ where someone on the left is heavier and $k$ where someone on the right is lighter. (Remember that, by definition, the lonely islander is not on the see-saw.) But we cannot possibly have $2k = 2n - 1 - 4k$, because $2k$ is even and $2n - 1 - 4k$ is odd. So we can’t make the three outcomes of the first weighing equally likely, which means $H(T_1) < \log_2(3)$. Even if every later weighing achieved $H(T_i) = \log_2(3)$, the sum in $(*)$ would still be strictly less than $w\log_2(3)$. Plugging this bound into $(*)$, we get the necessary condition

\[\log_2(2n - 1) < w\log_2(3),\]

which is equivalent to $2n - 1 < 3^w$, as desired.
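
To see the strict inequality numerically, the sketch below evaluates the entropy of the first weighing for every allowed $k$ in the case $n = 14$, $w = 3$. It assumes, as above, that the $2n - 1$ scenarios are equally likely; `first_weighing_entropy` is a hypothetical helper name, not something from a library.

```python
from math import log2


def first_weighing_entropy(n, k):
    """Entropy (in bits) of the first weighing with k islanders per side."""
    total = 2 * n - 1
    probs = [2 * k / total, 2 * k / total, (total - 4 * k) / total]
    # Outcomes with probability zero contribute nothing to the entropy.
    return sum(p * log2(1 / p) for p in probs if p > 0)


n, w = 14, 3
# k islanders on each side, with at least the lonely islander left off.
best = max(first_weighing_entropy(n, k) for k in range(1, (n - 1) // 2 + 1))

print(best < log2(3))                            # first weighing falls short
print(log2(2 * n - 1) > best + (w - 1) * log2(3))  # so (*) cannot hold
```

Even granting the remaining two weighings a full $\log_2 3$ bits each, the total stays below $H(S) = \log_2 27$, which is the sense in which $n = 14$ is ruled out.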

Appendix

Here are a few of the details we skipped above. The proofs don’t use any fancy techniques, so they make good exercises if you can resist peeking.

Lemma 1.

If the riddle is solvable, then it is solvable with a deterministic islander-weighing strategy.

Proof.

Suppose we have a nondeterministic islander-weighing strategy that solves the riddle. That means when we get to a nondeterministic step, we choose between one of several options for how to proceed. But our strategy always works, which means it should work no matter which option we pick. In particular, it still works if we always arbitrarily pick the first option, which makes our strategy deterministic. $\square$

Lemma 2.

Any deterministic islander-weighing strategy that solves the riddle has at most one lonely islander.

Proof.

Recall that an islander is “lonely” for a given strategy if whenever they are the odd-weight islander, they are not placed on the see-saw. We’re going to show that if a strategy has a lonely islander, then none of the other islanders are lonely. Say that Jake is the lonely islander and consider Charles, another islander. We need to show that if Charles is the odd-weight islander, then we have put Charles on the see-saw at least once.

The key is this: because Jake is lonely, we can’t put Jake on the see-saw until we get at least one unbalanced weighing. This is because until we see an unbalanced weighing, Jake could still be the odd-weight islander.

Suppose now that Charles is the odd-weight islander. If we never put Charles on the see-saw, then all the weighings are balanced. But because all the weighings are balanced, we must never have put Jake on the see-saw. Because neither Jake nor Charles has been weighed, we cannot possibly tell the difference between them. Therefore, to solve the riddle, if Charles is the odd-weight islander, our strategy must at some point put Charles on the see-saw. $\square$

How I Draw Slides

Thu, 08 Jun 2017 00:00:00 +0000

After my MAMA talk a few days ago, many people were curious how I made my slides. The short version: I drew them with a tablet in an SVG editor, and I recommend it! Details below.

Hardware

I use an Intuos Art, medium size, which I got specifically for this purpose. Its only fancy feature is pressure sensitivity, but this is enough for my simple drawings. Drawing on the tablet while looking at the computer screen takes some getting used to. The tablet has a grid of dots that make it feasible to draw while looking at the tablet rather than the screen, which I found easier for shapes with lots of right angles. I’d consider getting a Surface or iPad Pro in the future to be able to see more directly what I’m drawing.

Software

There are two steps to making the slides: drawing pictures and assembling them with text as a presentation. I did not find any tool that was good at both, but I did find a pair that works well.

I use Autodesk Graphic for drawing and Keynote for slides. The key feature of this pair is that you can copy and paste any selected part of a drawing directly from Graphic into Keynote—no exporting or importing required! This is invaluable for iterating quickly. I would have needed maybe 50 files if I had to export and import each animated component individually, plus an extra 10 or so from slides that got cut.

Here are some other programs I tried.

PowerPoint seems to work with Graphic well, too.
I actually found the most paper-like drawing experience in raster (pixel-by-pixel) art programs like Corel Painter (a version of which comes bundled with the Intuos Art), but I’m willing to put up with a little clunkiness to produce SVGs (scalable vector graphics).
Inkscape could maybe replace Graphic on Windows or Linux, but on my Mac, I could only copy-paste raster images from it. I think this is some sort of Pasteboard-CLIPBOARD incompatibility.
Curiously, OneNote creates vector graphics while feeling as natural as the raster programs. However, I couldn’t copy-paste vector drawings from OneNote into Keynote, PowerPoint, Preview, or anything else I tried. For instance, when pasting into PowerPoint, the image gets rasterized. The workaround is clunky and involves exporting and importing.
Xournal also creates vector graphics and has a good drawing feel, but it doesn’t use smooth curves in its paths, which makes it look worse than Graphic and OneNote.

Here are some more details about how I use Graphic.

I use the brush tool in Graphic with 10% smoothing. This works pretty well, but I have to try drawing each shape a few times. When a shape comes out well except for a small part, I sometimes manually tweak it with the path tool.
I use lots of layers in Graphic to break up drawings into pieces. If a complicated drawing has a complicated animation, each stage of the animation gets its own layer at the very least.
- Graphic makes it very easy to select all objects that live in an arbitrary subset of layers, so parts of the drawing that are common to multiple animation stages each get their own layer, too.
- For example, the animation on slide 2 has 7 layers. In order: the queue, jobs, speech bubbles, one for each of the red, green, and blue distributions, and the coordinate axes.
- Some of my layers collect many small utility drawings. One of them has several arrows and curly braces, for instance.
I made a previous presentation by drawing each slide individually in Graphic. This worked okay, but each slide being a separate file made it too cumbersome to make animations.

Conclusion

I find that tablet-drawn slides give a presentation a friendly vibe while keeping a crisp look. Perhaps more importantly, it reduces the activation energy (for me, at least) of including pictures and animations. After finding the right setup, I’m now faster drawing pictures in Graphic than building similar diagrams directly in PowerPoint or Keynote (or—shudder—TikZ). If you give this approach a try, let me know how it goes!

Adjoint Functors and Computation

Tue, 02 May 2017 00:00:00 +0000

I’ve been sitting on this post not finishing it for almost a year. At this point it’s as finished as it will ever be, so I’m putting what I’ve got so far out there.

This post is about category theory. If you don’t yet see the “fun” in “functor”, it will probably be difficult to follow. If you want to try to follow along anyway, look up what categories/product objects/exponential objects/functors/natural transformations/adjunctions are, try to read this post, fail to get very far, find three or four more introductions to category theory, read all of them to gain as much intuition as possible from their slightly different perspectives, try again, fail again, become a monk at one of those isolated-in-the-mountains math temples, vow not to speak until you truly understand the Yoneda Lemma, realize this all escalated rather quickly, decide you’ve had enough, move to a small island in the Caribbean, conclude that maybe the beach is more fun than math, have an epiphany while brushing teeth revealing a tiny aspect of what adjoint functors might be all about, and write a blog post about it. That’s more or less how I got here.

Only slightly more seriously: you can probably get something out of this post only if you know what categories and functors are, and familiarity with their basic concepts and notation shall be mercilessly assumed. Before continuing, if you do not yet know what a natural transformation is, you should at least attempt to read a precise definition of the term elsewhere, because you won’t find one here. We’ll start with an attempt to grasp that formal definition intuitively, and we’ll continue that trend throughout.

Definitions

If $F$ is a functor and $X$ is an object, then $FX$ is in some sense an object with “outer structure” of type $F$ and “inner information” of type $X$. A functor “maps” a morphism by using the morphism only on the level of inner information while leaving the outer structure intact. For example, consider the list functor, $L : \mathbf{Set} \to \mathbf{Set}$. On objects, it brings each set $X$ to the set of lists of elements of $X$. On morphisms, it maps each function $f$ to “mapped $f$”, which, given a list as input, outputs the list of results of applying $f$ to each member of the list. For instance, if $f(x) = x + 3$, then $Lf([1,2,3]) = [4,5,6]$. Mapped functions deal only with inner information, applying a function to individual elements of a list, but they don’t modify outer structure by adding, removing, or rearranging elements of the list.
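
The list functor’s action on functions is easy to sketch in Python. This is only an illustration on Python lists, with `list_map` as a made-up name; it mirrors the $f(x) = x + 3$ example from the paragraph above.

```python
def list_map(f):
    """Action of the list functor on a function f: apply f to each element."""
    return lambda xs: [f(x) for x in xs]


add3 = lambda x: x + 3
shifted = list_map(add3)([1, 2, 3])  # [4, 5, 6], matching the example in the text

# The outer structure (length and order) is untouched; only the elements change.
same_length = len(shifted) == 3
```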

Natural transformations are the opposite: they are morphisms that act on outer structure only, leaving the inner information intact. A natural transformation $\alpha : F \to G$ maps outer structure $F$ to outer structure $G$. Of course, $F$ and $G$ aren’t objects (of the categories we care about right now), so $\alpha$ is represented as a collection of components, with an $\alpha_X : FX \to GX$ for each object $X$ in the domain of $F$, but in a certain sense all its components do the same thing. As an example, for any $X$, we can define a list reversal function $\rho_X : LX \to LX$. But this is kind of tedious: to reverse a list, we don’t care which set its members are from. We just change their order. We just change the outer structure. List reversal is “polymorphic” in that any choice can be made for what’s inside the list being reversed. That is, list reversal is a natural transformation $\rho : L \to L$.

The notion of operating on outer structure only is made precise by the naturality condition. Given a morphism $f : X \to Y$, which acts only on inner information, and a natural transformation $\alpha : F \to G$, which acts only on outer structure, there are two ways we can imagine building a morphism that transforms both, $FX \to GY$: either use a map of $f$ on inner information followed by $\alpha$ on outer structure, or vice versa. If $\alpha$ really does ignore inner information and maps of $f$ really do ignore outer structure, these two choices should be the same. The naturality condition captures this in an equation: $Gf \circ \alpha_X = \alpha_Y \circ Ff$.
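
For list reversal, where both functors are the list functor, the naturality condition can be spot-checked on examples. The sketch below uses Python lists and the hypothetical helper names `list_map`, `reverse`, and `natural_on`; checking a few inputs is evidence, not a proof.

```python
def list_map(f):
    """Action of the list functor on a function f."""
    return lambda xs: [f(x) for x in xs]


def reverse(xs):
    """One component of list reversal; the same code works for any element type."""
    return xs[::-1]


def natural_on(alpha, f, xs):
    """Check that mapping f after alpha equals alpha after mapping f, on input xs."""
    return list_map(f)(alpha(xs)) == alpha(list_map(f)(xs))


ok = natural_on(reverse, str, [1, 2, 3]) and natural_on(reverse, len, ["a", "bb", ""])
```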

An adjunction of two functors $F : \mathcal{C} \to \mathcal{D}$ and $G : \mathcal{D} \to \mathcal{C}$ is a pair of natural transformations:

the unit, $\eta : 1_{\mathcal{C}} \to GF$, and
the counit, $\varepsilon : FG \to 1_{\mathcal{D}}$;

satisfying a pair of natural transformation composition laws called the triangle identities (because when drawn as commuting diagrams, each equation is a triangle): for all objects $X$ in $\mathcal{C}$ and $Y$ in $\mathcal{D}$,

$\varepsilon_{FX} \circ F\eta_X = 1_{FX}$, and
$G\varepsilon_Y \circ \eta_{GY} = 1_{GY}$.

If you didn’t know what an adjunction was already, well, now you… probably still don’t. But don’t panic! If you followed most of the discussion of natural transformations, you’re all set to keep reading. The internet is full of many detailed explanations of the definition of adjunctions written by people who know it better than I do. My personal favorite is a series of videos by The Catsters, but, as mentioned in the introduction, seeing many explanations and intuitive perspectives helped me a lot. Instead of giving additional definitional detail, this post introduces another such intuitive perspective: some adjunctions can be thought of as describing evaluation of computation.

Free and Forgetful Functors

Some classic adjoint functor pair examples are “free” and “forgetful” functors for various algebraic structures over sets, such as groups, rings, and monoids. For concreteness, we consider monoids, which are quickly defined and explained below.

A monoid is a set with an associative binary operation and an element that’s the left and right identity of that operation. Monoids are like groups in which inverses might not exist. Indeed, all groups are also monoids. One monoid that isn’t a group is the set of all $n \times n$ matrices: multiplication is an associative operation with an identity, but not all matrices are invertible. A monoid homomorphism is, analogous to a group or ring homomorphism, a function that preserves the operation and its identity. In equations, using $\bullet_A$ and $1_A$ to denote the operation and identity element of a monoid $A$, we say $f : A \to B$ is a monoid homomorphism if $f(x \, \bullet_A \, y) = f(x) \, \bullet_B \, f(y)$ and $f(1_A) = 1_B$. There is a category of all monoids, which we creatively call $\mathbf{Mon}$, with all monoids as objects and all monoid homomorphisms as morphisms. As with groups, by default we call the operation “multiplication”, and we write it as juxtaposition, often without parentheses, which associativity makes unnecessary.

We have two functors to define: the free functor, $F : \mathbf{Set} \to \mathbf{Mon}$, and the forgetful functor, $G : \mathbf{Mon} \to \mathbf{Set}$. The forgetful functor is easy to describe. On objects, it maps each monoid to its underlying set of elements, “forgetting” what the operation does and which element is the identity. On morphisms, it maps each monoid homomorphism to its underlying function between two sets, “forgetting” that the function happened to satisfy any equations.

The free functor is slightly trickier to describe. The free monoid on a set $X$, written $FX$ (hint, hint), is a monoid “freely generated” by the elements of $X$. This means two things.

By “generated”, we mean that the underlying set of $FX$ has all the elements of $X$ plus anything else needed to be a monoid. For example, if we were to generate a monoid from $\{17,42\}$ using $+$ as the operation, our generated monoid would need $0$, because it’s the identity, $59$, because it’s $17+42$, and many more numbers.
By “freely”, we mean that the operation of $FX$ never assumes two things are equal if they don’t have to be. For example, if $X = \{x,y,z\}$, then $x(yz) = (xy)z$ is required by associativity, but $xy \neq yx$ because no monoid axiom says they have to be equal.

It turns out that $FX$ has a concise interpretation: the free monoid on $X$ is lists of elements of $X$, with concatenation of lists as the operation. For instance, concatenating the lists $[x,y,y]$ and $[x,z,x]$ gives

\[[x,y,y][x,z,x] = [x,y,y,x,z,x].\]

The identity of $FX$ is the empty list, $[]$. We sometimes call the free monoid the “list monoid”.

As suggested by our notation, the free functor $F : \mathbf{Set} \to \mathbf{Mon}$ maps each set to the free monoid on it. To finish the definition, we need to define how to turn a function $f : X \to Y$ into a monoid homomorphism $Ff : FX \to FY$. Ignoring the monoid homomorphism conditions, our task is this: given a function $f : X \to Y$ and a list of elements of $X$, generate a list of elements of $Y$. Recalling our discussion of the list functor $L$, we take $Ff$ to be list-mapped $f$. It’s not hard to check that this satisfies the axioms for both functors and monoid homomorphisms.
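
The monoid homomorphism conditions for $Ff$ can be spot-checked on Python lists, where concatenation is `+` and the identity is `[]`. The name `free_map` is invented for this sketch, and `str.upper` stands in for an arbitrary function $f$.

```python
def free_map(f):
    """List-mapped f, viewed as a map between list monoids."""
    return lambda xs: [f(x) for x in xs]


Ff = free_map(str.upper)
xs, ys = ["x", "y", "y"], ["x", "z", "x"]

# A homomorphism must preserve the operation (concatenation) ...
preserves_product = Ff(xs + ys) == Ff(xs) + Ff(ys)
# ... and the identity element (the empty list).
preserves_identity = Ff([]) == []
```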

A subtle distinction bears mentioning: though we call both “list-mapped $f$”, $Ff$ and $Lf$ are not the same thing. They’re not even the same type of thing! The former is a monoid homomorphism, and the latter is a function. That said, they are related: $Lf$ is the underlying function of $Ff$. (In fact, $GF = L$. More on this in a bit.)

Similar notions exist for groups and rings. We’ll focus on monoids in the next section but will mention rings as well, for which we’ll need the following result (with proof left as an exercise, of course): the free (commutative) ring on a set $X$ is the polynomial ring with integer coefficients where each element of $X$ is a variable.

The Free-Forgetful Counit is Expression Evaluation

Let’s summarize the story so far.

The free functor, $F : \mathbf{Set} \to \mathbf{Mon}$, maps each set to its list monoid and each function to its list-mapped version.
The forgetful functor, $G : \mathbf{Mon} \to \mathbf{Set}$, maps each monoid to its underlying set and each monoid homomorphism to its underlying function.

As mentioned at the beginning of the previous section, these are adjoint functors, which means there are natural transformations $\eta : 1_{\mathbf{Set}} \to GF$ and $\varepsilon : FG \to 1_{\mathbf{Mon}}$ satisfying the triangle identities. Before trying to figure out what $\eta$ and $\varepsilon$ are, let’s first understand what the relevant functor compositions are.

$GF : \mathbf{Set} \to \mathbf{Set}$ brings a set $X$ to the underlying set of the list monoid on $X$, which is the set of lists of elements of $X$. We’ve actually seen $GF$ before: it’s the list functor $L$ from the discussion of natural transformations.
$FG : \mathbf{Mon} \to \mathbf{Mon}$ brings a monoid $Y$ to the list monoid on the underlying set of $Y$. I like to think of this as the monoid of “unevaluated expressions” in $Y$ by thinking of a list of elements of $Y$ as a list of terms to be multiplied. Multiplying unevaluated expressions corresponds to list concatenation. For example, we can multiply $17 \times 42$ and $38 \times 99$ without simplifying to get $17 \times 42 \times 38 \times 99$.

This, along with the intuition of natural transformations as “polymorphic” morphisms, is enough to guess what the unit and counit are.

Let’s start with the unit, $\eta : 1_{\mathbf{Set}} \to GF$. Given a set $X$, a component $\eta_X : X \to GFX$ is a function from $X$ to lists of elements of $X$, which we called $LX$ earlier on and call $GFX$ now. That is, $\eta_X$ gets a single element of $X$ as input and has to produce a list of elements as output. A simple way to do this is to produce a singleton list, so we define

\[\eta_X(x) = [x].\]

It’s straightforward to mechanically verify that $\eta$ is a natural transformation. It certainly fits our polymorphism intuition. Each component $\eta_X$ wraps a list “outer structure” around its argument in the exact same way, without regard for the “inner information” about what the argument is or what set it’s from.

We turn to the counit, $\varepsilon : FG \to 1_{\mathbf{Mon}}$. Given a monoid $Y$, a component $\varepsilon_Y : FGY \to Y$ is a monoid homomorphism from the list monoid on the underlying set of $Y$ to $Y$ itself. That is, $\varepsilon_Y$ gets a list of elements of $Y$ as input and has to produce a single element as output. A simple way to do this is to multiply everything in the list together to produce a single result (with the empty list mapping to the identity of $Y$), so we define

\[\varepsilon_Y([y_1, y_2, \ldots, y_n]) = y_1 y_2 \cdots y_n.\]

That is, if we think of a list of elements of $Y$ as an unevaluated monoid expression, then $\varepsilon_Y$ evaluates the expression.
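
In code, the unit and counit look roughly like the sketch below. A monoid is represented here, as an assumption of this illustration, by a pair of a binary operation and an identity element passed to `counit`; the names `unit`, `counit`, and `evaluate_product` are made up.

```python
from functools import reduce
import operator


def unit(x):
    """Wrap an element in a singleton list."""
    return [x]


def counit(op, identity):
    """Evaluate a list of monoid elements by multiplying them together."""
    return lambda ys: reduce(op, ys, identity)


# The monoid of integers under multiplication.
evaluate_product = counit(operator.mul, 1)
value = evaluate_product([17, 42, 38, 99])  # same as 17 * 42 * 38 * 99
empty = evaluate_product([])                # the identity, 1
```

Because `reduce` starts from the identity, the empty list is handled without a special case.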

It’s straightforward to mechanically verify that $\varepsilon$ is a natural transformation. That said, it doesn’t clearly fit our polymorphism intuition because we use multiplication, which feels like using “inner information”. However, as we’re about to see, this feeling is wrong!

In $\mathbf{Set}$, given multiple arbitrary elements of an arbitrary set, there’s no way for the multiple elements to interact. Natural transformations can move elements around, as we saw with our earlier example of list reversal, but there’s no way to use the given elements to get a new element of the set. If this seems restrictive, it’s because it is. The morphisms in $\mathbf{Set}$ are arbitrary functions, so the inner information that morphisms can modify is basically as unrestricted as possible. This flexibility when modifying inner information is what puts such strong restrictions on modifying outer structure. Given functors $F, G : \mathbf{Set} \to \mathbf{Set}$, if a family of functions $\alpha_X : FX \to GX$ does anything too fancy, we can find some function $f : X \to Y$ such that $Gf \circ \alpha_X \neq \alpha_Y \circ Ff$ because there are so many things arbitrary functions can do.
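
Here is one concrete way a family of functions can be “too fancy”. Removing duplicates from a list peeks at inner information (whether two elements are equal), and a non-injective $f$ exposes the problem. The helper names `list_map` and `dedupe` are assumptions of this sketch, and `dedupe` further assumes hashable elements.

```python
def list_map(f):
    """Action of the list functor on a function f."""
    return lambda xs: [f(x) for x in xs]


def dedupe(xs):
    """Keep the first occurrence of each element, preserving order."""
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


f = lambda x: 0  # collapses every element to 0
xs = [1, 2]

left = list_map(f)(dedupe(xs))   # dedupe first, then map: [0, 0]
right = dedupe(list_map(f)(xs))  # map first, then dedupe: [0]
```

The two sides differ, so `dedupe` is not a natural transformation from the list functor to itself.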

The story is different for $\mathbf{Mon}$ because its morphisms are more restricted than those in $\mathbf{Set}$: they preserve multiplication and identity elements. Furthermore, given multiple arbitrary elements of an arbitrary monoid, there are two ways we can get new elements that weren’t initially given: multiplication and getting the identity. Together, these facts mean both multiplication and using the identity are fair game when modifying outer structure. This is intuitively why $\varepsilon$ is a natural transformation: it uses only identity (when given the empty list) and multiplication (when given a list with more than one element), both of which are outer structure for the purposes of monoid homomorphisms.

Showing that $\eta$ and $\varepsilon$ actually satisfy the triangle identities is an unsurprising exercise that can be left for another day.
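
For the impatient, both identities can at least be spot-checked. In words: wrapping every element of a list in a singleton and then concatenating gives back the list, and wrapping a monoid element in a singleton and then evaluating gives back the element. The sketch reuses the hypothetical helpers `list_map`, `unit`, and `counit`, and picks integers under multiplication as the example monoid $Y$.

```python
from functools import reduce


def list_map(f):
    return lambda xs: [f(x) for x in xs]


def unit(x):
    return [x]


def counit(op, identity):
    return lambda ys: reduce(op, ys, identity)


# First identity, on the free monoid FX (lists under concatenation).
counit_FX = counit(lambda a, b: a + b, [])
xs = ["x", "y", "y"]
first_ok = counit_FX(list_map(unit)(xs)) == xs

# Second identity, on the underlying set of Y = (integers, *, 1).
counit_Y = counit(lambda a, b: a * b, 1)
second_ok = all(counit_Y(unit(y)) == y for y in [0, 1, 17, 42])
```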

More than Monoids

Hmmm, this post is pretty long already. Here are the two further examples I was going to talk about before my arms fell off.

The free-forgetful counit for rings is polynomial evaluation. The reasoning is pretty similar to what we’ve seen for monoids, except we use $\mathbf{Ring}$ instead of $\mathbf{Mon}$. This means the outer structure that natural transformations can use includes addition, subtraction, and additive identity as well as the multiplication and multiplicative identity we had for monoids.
As an example that does not involve a free-forgetful adjunction: the product-exponential counit is function evaluation.
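
The second example is short enough to sketch. Fixing a set $A$, the counit takes a pair of a function $A \to Y$ and an element of $A$ and applies one to the other. The names `evaluate` and `pair_map` are invented here, and the naturality check is only run on one sample input.

```python
def evaluate(pair):
    """Counit component: apply the function in the pair to the argument."""
    f, a = pair
    return f(a)


def pair_map(g):
    """Action of Y -> (A -> Y) x A on a function g: post-compose g, keep the argument."""
    return lambda pair: ((lambda a: g(pair[0](a))), pair[1])


g = str
pair = (lambda a: a * 2, 21)

# Naturality on this input: evaluating then applying g equals mapping g then evaluating.
natural = g(evaluate(pair)) == evaluate(pair_map(g)(pair))
```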

Hopefully this helped in understanding what natural transformations, or maybe even adjunctions, are at a level that’s intuitive but not just hand-waving. If you’re curious for more material on relating adjunctions to computation, I’m pretty sure “something something free algebra” is a relevant next step.

Knowing that Everyone Knows

Sat, 09 Jan 2016 00:00:00 +0000

We consider a classic “paradox” where a simple inductive proof seems to clash with intuition. Though the proof makes clear that the naive intuition is wrong, it’s hard to pinpoint exactly where the intuition’s logical error is. After discussing the paradox at some length with my family, we came up with an angle of attack that gives an intuitive framework that both matches the math and makes the problem with the naive intuition clearer.

The situation is as follows. Dragons, as you probably already know, are a perfectly honest and rational species with color vision and either red or blue eyes. One hundred red-eyed dragons are on an island, sworn to a two-part pact:

they will not communicate with each other, look at reflections, or otherwise directly find out what color eyes they have, and
if any dragon can logically deduce some day that they have red eyes, then that dragon will leave the island the following night.

The dragons live for years on the island, each of them seeing ninety-nine red-eyed dragons but none of them able to logically deduce that they too have red eyes. One day, a perfectly honest visitor comes to the island, announces that at least one of the dragons has red eyes, and leaves.

If you haven’t heard this before, try to figure out before continuing: what happens?

On the one-hundredth night after being told that at least one of them has red eyes, all the dragons leave the island!

Here’s the argument.

If there were exactly one dragon $X$ with red eyes, they would have seen only blue eyes and deduced that they must be the one with red eyes, so $X$ would leave on the first night following the announcement.
If there were exactly two dragons $X$ and $Y$ with red eyes, they would both stay the first night. The following day, each would see that the other hadn’t already left. $X$ knows by the previous bullet point that if $Y$ were the only dragon with red eyes, then $Y$ would have left on the first night. This didn’t happen, so $X$ deduces that they must also have red eyes. Symmetrically, so does $Y$, and both leave on the second night.
More generally, if exactly $k$ dragons have red eyes, then after $k-1$ nights of no dragons leaving, each of them realizes that, if the other $k-1$ red-eyed dragons were the only dragons with red eyes, they would have left on night $k-1$. This didn’t happen, so they deduce that they must also have red eyes, and all $k$ red-eyed dragons leave on night $k$.
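The argument above can be checked by brute force on small islands. The sketch below is one possible way to model it, not a definitive one: it assumes a "possible worlds" view in which every eye-color assignment that has not been publicly ruled out is still considered possible, and a dragon knows they have red eyes exactly when the world where they are blue has been ruled out. The names `leaving` and `departure_night` are made up for this example, and because it enumerates all $2^n$ assignments it is only practical for a handful of dragons, not one hundred.

```python
from itertools import product


def leaving(world, possible):
    """Dragons who, in `world`, can deduce that their own eyes are red."""
    out = []
    for i in range(len(world)):
        # What dragon i sees fits two worlds: own eyes red, or own eyes blue.
        blue_self = world[:i] + (False,) + world[i + 1:]
        if blue_self not in possible:  # "I am blue" is ruled out, so I am red
            out.append(i)
    return out


def departure_night(eyes, announced=True):
    """Return (night, leavers) for the first departure, or None if nobody ever leaves.

    eyes: tuple of bools, True meaning red. Only the first departure is simulated.
    """
    possible = set(product([False, True], repeat=len(eyes)))
    if announced:
        # The visitor publicly rules out the all-blue world.
        possible = {w for w in possible if any(w)}
    if eyes not in possible:
        raise ValueError("the announcement would be false for these eyes")

    night = 0
    while True:
        night += 1
        observed = leaving(eyes, possible)
        if observed:
            return night, observed
        # Everyone saw that nobody left, which rules out every world
        # in which somebody would have left tonight.
        remaining = {w for w in possible if not leaving(w, possible)}
        if remaining == possible:
            return None  # nothing new was learned, so nothing ever changes
        possible = remaining
```

For example, `departure_night((True, True, True, False, False))` returns `(3, [0, 1, 2])`: the three red-eyed dragons leave on the third night. With `announced=False` the same call returns `None`, because no world is ever ruled out.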

This is a pretty simple inductive argument, but there’s an apparent paradox: the announcement made by the visitor was something all of the dragons already knew! What difference does it make? The typical (and entirely correct) answer is that without the announcement, the first bullet point doesn’t hold. That bullet point is the crucial base case of the inductive argument each dragon uses to deduce they have red eyes. But even though I know how induction works, I find it very counterintuitive that this should matter, because every dragon sees at least two other red-eyed dragons and therefore knows they aren’t in the base case!

The rough reason that the base case matters, even though all the dragons know they aren’t in it, is that we have to not just consider what each dragon knows, but also what each dragon knows about what each other dragon knows… and what each dragon knows about what each other dragon knows about what each other dragon knows, and so on. I was able to figure things out for up to three red-eyed dragons, but after that there were too many cases to keep track of.

Following a common mathematical theme, to give ourselves better intuition about a complicated situation, we’re going to define a new concept and build intuition about that new concept instead of about the situation directly. Let us call a dragon $k$-aware for positive integer $k$ under the following conditions.

A dragon is $1$-aware when they know at least one dragon has red eyes.
For $k \geq 2$, a dragon is $k$-aware when they know every dragon is $(k-1)$-aware.

For example, if only one dragon $X$ has red eyes, then every other dragon is $1$-aware. If only two dragons $X$ and $Y$ have red eyes, then they are both $1$-aware and every other dragon is $2$-aware: not only do the other dragons know that $X$ and $Y$ have red eyes, but they know that each of $X$ and $Y$ can see the other, so they know that every dragon can see a red-eyed dragon. We can generalize this.

Theorem.

Before the visitor’s announcement, if a dragon can see at least $k$ red-eyed dragons, they are $k$-aware, and if they can see at most $k$ red-eyed dragons, they are not $(k+1)$-aware.

Proof.

We prove each statement separately by induction.

If a dragon can see another red-eyed dragon, then they are $1$-aware.
If a dragon can see at least $k \geq 2$ red-eyed dragons, then they know that every other dragon can see at least $k-1$ red-eyed dragons. By the inductive hypothesis, they know every other dragon is $(k-1)$-aware, and they know the same of themselves because they too see at least $k-1$ red-eyed dragons, so they are $k$-aware.
If a dragon sees no red-eyed dragons, then they are not $1$-aware.
If a dragon $X$ can see at most $k \geq 1$ red-eyed dragons, then because $X$ must consider that they might have blue eyes, it is possible that each of those red-eyed dragons can see just $k-1$ other red-eyed dragons. By the inductive hypothesis, $X$ cannot know for sure that those red-eyed dragons are $k$-aware, so $X$ is not $(k+1)$-aware. $\square$

This theorem means that before the visitor’s announcement, the dragons are all $99$-aware but, because each of them sees only ninety-nine red-eyed dragons, not $100$-aware. After the visitor’s announcement, the dragons become $k$-aware for every $k \geq 1$ because of the public nature of the announcement: not only does everyone know that at least one dragon has red eyes, but everyone knows that everyone knows this, and everyone knows that everyone knows that everyone knows this, and so on. This makes all the difference.
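To see the before-and-after contrast concretely, here is a small Python sketch that computes $k$-awareness directly from the definition on a five-dragon island. It rests on a modelling assumption of mine: a dragon "knows" a fact when it holds in every world they cannot rule out, which means every world matching what they see that is also in the set of commonly possible worlds. The helper name `awareness_checker` is hypothetical, and the announcement is modelled as removing the all-blue world from that set.

```python
from functools import lru_cache
from itertools import product


def awareness_checker(worlds):
    """Build aware(k, i, w): is dragon i k-aware in world w, given the possible worlds?"""
    worlds = frozenset(worlds)

    def considered(i, w):
        # Worlds dragon i cannot rule out: same as w except possibly i's own eyes.
        options = (w[:i] + (c,) + w[i + 1:] for c in (False, True))
        return [v for v in options if v in worlds]

    @lru_cache(maxsize=None)
    def aware(k, i, w):
        if k == 1:
            # Knows that at least one dragon has red eyes.
            return all(any(v) for v in considered(i, w))
        # Knows that every dragon is (k-1)-aware.
        return all(aware(k - 1, j, v)
                   for v in considered(i, w)
                   for j in range(len(w)))

    return aware


n = 5
all_worlds = set(product([False, True], repeat=n))     # before the announcement
announced_worlds = {w for w in all_worlds if any(w)}   # all-blue publicly ruled out

before = awareness_checker(all_worlds)
after = awareness_checker(announced_worlds)

actual = (True,) * n  # every dragon has red eyes, so each sees n - 1 red-eyed dragons

levels_before = [k for k in range(1, 11) if before(k, 0, actual)]
levels_after = [k for k in range(1, 11) if after(k, 0, actual)]
```

Here `levels_before` is `[1, 2, 3, 4]`, matching the theorem for a dragon who sees four red-eyed dragons, while `levels_after` contains every level tested, from 1 through 10.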

Theorem.

If there are exactly $k$ red-eyed dragons and they all simultaneously become $k$-aware, they will all leave on the $k$th night afterward.

Proof.

We prove only that the dragons that are supposed to leave do so at the right time, given that the other dragons stay. It’s not too hard to add the details to rigorously show that all the other dragons do indeed stay.

Suppose a dragon sees no red-eyed dragons but becomes $1$-aware. They immediately deduce they have red eyes because nobody else does, so they leave on the first possible night.
Suppose for some $k \geq 2$ that a dragon $X$ is $k$-aware and sees exactly $k-1$ red-eyed dragons. By $k$-awareness, $X$ knows that those red-eyed dragons are all $(k-1)$-aware. $X$ reasons that if they had blue eyes, then those red-eyed dragons would have each seen exactly $k-2$ red-eyed dragons and, by the inductive hypothesis, would have left on night $k-1$. Therefore, if this doesn’t happen, $X$ can deduce that they must have red eyes and will leave on the next night, which is night $k$. $\square$

The above proof is essentially the same as the initial argument, but the explicit definition and usage of $k$-awareness helped me (and, hopefully, you!) build better intuition for it.