It's natural to think of mathematical objects as 'things'. $1$ is a natural. $\varnothing$ is a set. $(6, 7)$ is a tuple. And we can then do things to these objects. We can, for instance, add numbers, or perhaps, find the union of sets. These are the actions we pair with our objects. However, it can be useful to instead think of our objects as, themselves, actions.
How does this work? Consider the real number line. It has a bunch of numbers on it. It's got $0$. It's got $1$. It's got $1.5$. It's got $\frac{1 + \sqrt{5}}{2}$. You name it. As discussed, these numbers don't just sit there and do nothing; we get to have fun with them! We can add them! We can subtract them! We can multiply them! We can divide them! And we can do many other things as well.
When we apply one of these operations to two numbers, we get another number back out. For instance, $2 + 3 = 5$. And $5$ is a number (I'm sure of it). In this manner, numbers somehow kind of have actions associated with them: adding $2$ is an action which moves a given number two to the right on the number line. Subtracting $1$ moves a given number one to the left. (Multiplication and division represent dilations by some factor.)
Okay, but so what? Well, now we have a new way to think about numbers and a new way to read operations. Now we can read $2 + 3$ not as just plain "two plus three, the result of an algebraic operation", but rather as "the result of applying the action 'move $2$ to the right' to the number $3$". Either way, we get $5$, but the differing thought process is interesting.
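If you like code, here's a little Python sketch of this reading (the function names are mine, purely for illustration):

```python
def as_action(n):
    """Interpret the number n as the action 'move n to the right'."""
    return lambda x: x + n

move_two_right = as_action(2)
print(move_two_right(3))  # 5, the same as 2 + 3
```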
And this thought process gets especially interesting when you apply it to associativity...
Addition is an associative operation. This is a fact. We write this fact as follows:
$$ x + (y + z) = (x + y) + z $$Let's experiment and try to think of this statement in terms of actions, like we just discussed. In order to facilitate this, I will make a variable replacement, as follows:
$$ f + (g + x) = (f + g) + x $$Now let's read it. The left side says to start with $x$, and then apply the action of $g$, within the context of addition, and then, to the result, apply the action of $f$. Take $x$, and apply two actions to it, as specified by $f$ and $g$. The right side says something slightly different. Start with $x$, it says, and apply the action $f + g$ to it.
Now, hold up. What is $f + g$? If we're thinking of both $f$ and $g$ as actions, it doesn't really make sense to add them—to apply one action to another action. However, since $f$ and $g$ are just numbers, we can add them; we just don't yet know what doing so means.
Luckily, we can find out! Note the equal sign in the middle of the equation. It tells us that the two sides are just two ways of doing the same thing.
This solves our mystery. Since the two sides are equal, applying $f + g$ to $x$, which is given by the right side, must be the same as applying $g$ and then applying $f$, which is given by the left side. That is, $f + g$ somehow combines the two actions $f$ and $g$ into only one action, so that applying $f + g$ to a number applies $g$ and then $f$. In essence, $f + g$ is a kind of composition of the actions $f$ and $g$.
Thus, I claim, the deep meaning of associativity is about composition: if an operator $\star$ is associative, that means that the operator represents both application, like in $f \star x$, and composition, like in $f \star g$.
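Here's that dual reading spelled out in Python, with $\star$ as addition (again, a sketch of my own devising):

```python
def apply_action(f, x):
    """Read f + x as: the action of f, applied to the number x."""
    return f + x

def compose_actions(f, g):
    """Read f + g as: the single action that does g, then f."""
    return f + g

f, g, x = 2, 3, 10
left = apply_action(f, apply_action(g, x))      # f + (g + x)
right = apply_action(compose_actions(f, g), x)  # (f + g) + x
assert left == right == 15
```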
On this note, I'd like to juxtapose the definition of associativity with that of composition, and note that they are, indeed, quite similar. And under the "objects as actions" understanding, they essentially read the same:
$$ \begin{align*} \text{Composition: } & (f \circ g)(x) \hspace{2pt} = f(g(x)) \\ \text{Associativity: } & (f \star g) \star x = f \star (g \star x) \end{align*} $$Matrices are a great example of how this idea can be put to use. If you're unfamiliar with matrices, here's a quick primer.
Matrices are just big boxes of numbers. Here's a matrix, for instance:
$$ T = \left[ \begin{matrix} 0 & -1 \\ 1 & 0 \end{matrix} \right] $$—it's just a bunch of numbers organized into a grid. Without getting into too much detail, matrices can be used to represent $n$-dimensional points, as well as $n$-dimensional transformations, such as rotations, dilations, and shears. $T$, for instance, represents 2D rotation by 90 degrees counterclockwise.
We may also encode points as matrices. The top number encodes the $x$ value, and the bottom number encodes the $y$ value. To represent $(1, 0)$ for instance, we write:
$$ p = \left[ \begin{matrix} 1 \\ 0 \end{matrix} \right] $$Let's apply our $T$ rotation to $p$! There's an operation for this. In order to apply the transformation represented by one matrix to the point represented by another matrix, we multiply them. (We will leave out the discussion of how matrix multiplication is defined.) For instance, to apply $T$ to $p$, we find $Tp$:
$$ Tp = \left[ \begin{matrix} 0 & - 1 \\ 1 & 0 \end{matrix} \right] \left[ \begin{matrix} 1 \\ 0 \end{matrix} \right] = \left[ \begin{matrix} 0 \\ 1 \end{matrix} \right] $$The result, $\left[ \begin{matrix} 0 \\ 1 \end{matrix} \right]$, is the matrix that encodes the point $(0, 1)$, which is, indeed, the rotation by 90 degrees of $(1, 0)$.
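If you'd like to verify this yourself, here's the same computation in NumPy (my choice of library; any linear algebra package would do):

```python
import numpy as np

T = np.array([[0, -1],
              [1,  0]])  # rotation by 90 degrees
p = np.array([[1],
              [0]])      # the point (1, 0)

print(T @ p)  # [[0], [1]] -- the point (0, 1)
```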
Besides encoding single points, we can actually encode entire polygons in matrices. We simply use a matrix with more than one column. Then each column represents a point, and the matrix as a whole represents a polygon. For instance, a square going from $(0, 0)$ to $(1, 0)$ to $(1, 1)$ to $(0, 1)$ and then back to $(0, 0)$ has 4 points and can be encoded as:
$$ s = \left[ \begin{matrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \end{matrix} \right] $$
This isn't universally true; matrix representation of polygons is sometimes more complicated. In order to express translation as a matrix, for instance, you need to do something more complex. However, using this naive representation will work for our purposes here.
Matrix multiplication generalizes to polygons. That is, if we want to apply $T$ to the square $s$, we don't have to do anything special; we can just do multiplication like we did with $p$. That is, $Ts$ represents rotating the square by 90 degrees.
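In NumPy, rotating the whole square is the same one-liner as rotating a point (using the $s$ from above):

```python
import numpy as np

T = np.array([[0, -1],
              [1,  0]])
s = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1]])  # the unit square, one point per column

print(T @ s)
# [[ 0  0 -1 -1]
#  [ 0  1  1  0]] -- each column is a rotated corner
```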
Let's put ourselves in the shoes of a game designer for a second. We're making a game about an adventurous hedgehog. We represent the hedgehog as a really big polygon with a bunch of points, and call it $h$:
$$ h = \left[ \begin{matrix} 0 & 3 & 5 & 8 & -2 & 16 & \cdots \\ -1 & 12 & 22 & 56 & 19 & 82 & \cdots \end{matrix} \right] $$The actual values in $h$ aren't important, just the fact that it's really big (and represents a hedgehog).
In our game, our little hedgehog buddy has fallen off into the void and is headed towards a black hole. We'll animate the fall with a shrink and a rotation, and we'll add a shear, as well, to show that the black hole is stretching him. All three of these transformations can be represented as matrices!
Rotation by, say, a degree, is:
$$ Rot(1^\circ) = \left[ \begin{matrix} \cos(1^\circ) & -\sin(1^\circ) \\ \sin(1^\circ) & \cos(1^\circ) \end{matrix} \right] $$Shrinking will be represented as a dilation by a factor of $0.9$:
$$ Di(0.9) = \left[ \begin{matrix} 0.9 & 0 \\ 0 & 0.9 \end{matrix} \right] $$And we can shear by, say, a factor of 1.1, as follows:
$$ Sh(1.1) = \left[ \begin{matrix} 1 & 1.1 \\ 0 & 1 \end{matrix} \right] $$Now in order to apply all these transformations to our hedgehog, we simply need to do successive multiplication. If we want to rotate, then dilate, then shear, this would look like:
$$ h' = Sh(1.1) \cdot [ Di(0.9) \cdot [ Rot(1^\circ) \cdot h ]] $$And this would work! However, there's one issue...
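In NumPy, the naive version looks something like this (the random $h$ here is just a stand-in for our hedgehog):

```python
import numpy as np

deg = np.radians(1)
Rot = np.array([[np.cos(deg), -np.sin(deg)],
                [np.sin(deg),  np.cos(deg)]])  # rotate by 1 degree
Di  = np.array([[0.9, 0.0],
                [0.0, 0.9]])                   # dilate by a factor of 0.9
Sh  = np.array([[1.0, 1.1],
                [0.0, 1.0]])                   # shear by a factor of 1.1

h = np.random.rand(2, 1_000_000)  # a stand-in hedgehog with a million points

h_prime = Sh @ (Di @ (Rot @ h))   # rotate, then dilate, then shear
```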
Recall that we're in the shoes of a game developer here. We want things to be fast. However, this is not very fast. Matrix multiplication is relatively cheap, certainly, but it gets slower as the number of points in your matrices increases. And $h$, as noted previously, is a very large matrix—so multiplication with $h$ is slow, relatively speaking.
The first calculation we do here is $Rot(1^\circ) \cdot h$, which will be slow since it involves $h$. The result, just a rotation of $h$, will have the same number of points as $h$, and thus be just as large. Then we multiply this matrix by $Di(0.9)$. Again, this will be slow, since one of the matrices is so big. And, again, this will produce a large matrix. Finally, we multiply by $Sh(1.1)$, which has all the same problems: it's slow, and it produces a large matrix.
In this manner, we do a lot of work every time we apply a transformation to the hedgehog.
"But", I hear you say, "what else can we do?" Well, great question! I've neglected to mention it until now, but matrix multiplication is actually associative. As such, we can rewrite our computation for $h'$ as follows:
$$ h' = [[Sh(1.1) \cdot Di(0.9)] \cdot Rot(1^\circ)] \cdot h$$where we first multiply all the transformation matrices and then apply the result to our hedgehog. This way, we only have to multiply by a large matrix once, at the very end. All other multiplications are just between transformation matrices, which are each small—just 2-by-2. In this manner, we reduce the number of slow computations from 3 down to just 1, thus speeding up the calculation.
See what's happening here, though? We first build up a composite matrix, namely $Sh(1.1) \cdot Di(0.9) \cdot Rot(1^\circ)$, and then apply this composite to the hedgehog. Associativity guarantees that this will produce the same result as applying all three of these matrices to the hedgehog in order. Without associativity, who knows what we would have gotten?
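Here's a rough timing sketch of the payoff (exact numbers will vary by machine; the setup repeats the previous snippet):

```python
import time
import numpy as np

deg = np.radians(1)
Rot = np.array([[np.cos(deg), -np.sin(deg)],
                [np.sin(deg),  np.cos(deg)]])
Di  = np.array([[0.9, 0.0], [0.0, 0.9]])
Sh  = np.array([[1.0, 1.1], [0.0, 1.0]])
h   = np.random.rand(2, 1_000_000)

start = time.perf_counter()
naive = Sh @ (Di @ (Rot @ h))     # three multiplications involving h
t_naive = time.perf_counter() - start

start = time.perf_counter()
composed = ((Sh @ Di) @ Rot) @ h  # compose the small matrices first
t_composed = time.perf_counter() - start

assert np.allclose(naive, composed)  # associativity: same answer either way
print(f"naive: {t_naive:.4f}s, composed: {t_composed:.4f}s")
```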
Okay, so what does non-associativity look like?
First things first, and I want to emphasize this: it does not mean that you cannot compose actions. Non-associativity of an operator simply means that, unlike with an associative operator, composition of actions is not the same operation as application of actions.
Subtraction makes a great example of this. It's not associative; however, it can still be composed.
First, we need to address an ambiguity: in $a - b$, which number is the action? We could think of $b$ as the action, reading $a - b$ as "the action of subtracting $b$, applied to $a$"; or we could think of $a$ as the action, reading it as "the action of subtracting from $a$, applied to $b$".
This ambiguity didn't arise with addition because addition is commutative. Since $a+b=b+a$, the two cases are actually the same.
This ambiguity actually also exists for matrices, but I craftily avoided it.
The version that is composable is the first: think of $a - b$ as "the action of subtracting $b$, applied to $a$". This means we should read $a - b$ as "start at $a$, and move $b$ to the left". (Thinking of $a$ as the action is also interesting; we'll come back to it.)
Now, associativity reads like so:
$$ (x - f) - g = x - (f - g) $$and is plain false. For instance,
$$ \begin{align*} (0 - 1) - 1 &\stackrel{?}{=} 0 - (1 - 1) \\ -2 &\neq 0 \end{align*} $$However, subtraction still is composable. To compose $f$ and $g$, we want to find the $h$ such that:
$$ (x - g) - f = x - h $$This is given by $h = f + g$, since $(x - g) - f = x - (g + f)$. As such, we can still compose subtraction operations; however, we do so by adding the actions, not subtracting them—this is why subtraction isn't associative.
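A quick numeric check of this (in Python):

```python
def subtract(x, n):
    """The action 'subtract n', applied to x."""
    return x - n

f, g, x = 5, 3, 100
stepwise = subtract(subtract(x, g), f)  # (x - g) - f
composed = subtract(x, f + g)           # x - (f + g): we compose by ADDING
assert stepwise == composed == 92
```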
Earlier, I teased you a little bit. I said that in $a-b$, we would think of $b$ as the action, but that thinking of $a$ as the action is also interesting. Let's do that now.
In $a - b$, our understanding is now to read this as "the action of subtracting from $a$, applied to $b$". Subtraction is non-associative; we already know that. The question now is: how do we compose two such actions?
Given an $f$ and $g$, we want to find $h$ such that:
$$ f - (g - x) = h - x $$This is given by $h = f - g + 2x$, since $f - (g - x) = f - g + x = (f - g + 2x) - x$. What's interesting about this is that $h$, the supposed composition of $f$ and $g$, depends on $x$. As such, there is no single action $h$ that works for every $x$: this version of subtraction isn't really composable at all.
Technically, I have no deductive basis for this claim: I never stipulated that a composition may not depend on the input it's applied to. However, independence from the input is the more canonical notion of composition.
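To see that dependence concretely, here's one last Python sketch:

```python
def subtract_from(a, x):
    """The action 'subtract from a', applied to x."""
    return a - x

f, g = 10, 4
for x in [0, 1, 7]:
    h = f - g + 2 * x  # the 'composite' we derived
    assert subtract_from(f, subtract_from(g, x)) == subtract_from(h, x)
    print(x, h)  # h changes as x changes: 6, 8, 20
```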