
In this article, I'll show you the first feasibility problem I explored: nm queens! I'll cover how to formalise it as a feasibility problem, how the two algorithms performed, and how I poked and prodded them to learn more about how they work.

If you haven’t read the first part, do it now, otherwise things will simply not make sense.

Introducing NM queens

Firstly, this problem has a hella dumb name. Is it nm queens? \(nm\) queens? n-m queens? \(n\)-\(m\) queens? N-M Queens? I don't know, I don't care; for me, reality will be whatever I want.

The nm queens problem is the first of the two problems studied, and it generalises the \(n\) queens problem. It asks us to place \(nm\) queens on an \(n \times n\) chessboard such that exactly \(m\) queens are placed in each row and column, and at most \(m\) queens are placed in each diagonal.

Pictured: an invalid solution and a valid solution to the (8-2) queens problem.

\[\def\reals{\mathbb{R}} \newcommand{\bx}{\mathbf x} \newcommand{\bw}{\mathbf w} \newcommand{\by}{\mathbf y} \newcommand{\bz}{\mathbf z} \newcommand{\ba}{\mathbf a} \newcommand{\bv}{\mathbf v} \DeclareMathOperator{\Fix}{Fix} \newcommand{\integ}[2]{[[{#1},{#2}]]}\]

As a feasibility problem

To formulate this as a feasibility problem, we will choose \(E\) to be \(\reals^{n\times n}\), and for any \(x \in E\), \(x_{ij}\) corresponds to whether or not position \((i,j)\) is a queen. More specifically, \(x_{ij} = 0\) means there is no queen, \(x_{ij} = 1\) means there is a queen, and any other value can mean whatever we want it to mean. Personally, I see \(x_{ij}\) as measuring how confident the algorithm is that there is a queen on that cell.
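Concretely, in code an \(x \in E\) is just an \(n \times n\) grid of doubles. Here's a minimal sketch of the representation I'll refer to in later snippets (the names are mine, not from any library):

```cpp
#include <vector>

// A point x in E = R^(n x n), stored row-major.
// at(i, j) is how confident the algorithm is that cell (i, j) holds a queen.
struct Board {
    int n;
    std::vector<double> v;
    explicit Board(int n) : n(n), v(n * n, 0.0) {}
    double& at(int i, int j) { return v[i * n + j]; }
    double at(int i, int j) const { return v[i * n + j]; }
};
```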

The constraint sets

Firstly, define

\(\{0,1\}^{n\times n} = \{ x \in E \mid x_{ij} \in \{0,1\} \text{ for all } i,j\}\)

Now define the following 4 constraint sets:

- \(C_1 = \{x \in E \mid \text{each row of } x \text{ sums to } m\}\)
- \(C_2 = \{x \in E \mid \text{each column of } x \text{ sums to } m\}\)
- \(C_3 = \{x \in E \mid \text{each diagonal of } x \text{ sums to at most } m\}\)
- \(C_4 = \{x \in E \mid \text{each anti-diagonal of } x \text{ sums to at most } m\}\)

Note that the values of any \(x \in C_i\) are not necessarily only 0's and 1's. To mathematically capture the binary requirement as well, define the discretised sets

\[\hat{C_i} = C_i \cap \{0,1\}^{n\times n}\]

for \(i \in \{1,2,3,4\}\).

To actually solve the problem at hand, we will need to find an \(x\) such that

\[\begin{align*} && x &\in C_1 \cap C_2 \cap C_3 \cap C_4 \cap \{0,1\}^{n\times n} &&\\ && &= C_1 \cap C_2 \cap \hat{C_3} \cap \hat{C_4} && \text{(Formulation 1)}\\ && &= \hat{C_1} \cap \hat{C_2} \cap \hat{C_3} \cap \hat{C_4} && \text{(Formulation 2)} \end{align*}\]

As you can see, there are quite a few equivalent choices of constraint sets. While repeating the \(\{0,1\}^{n\times n}\) constraint may seem redundant, it affects performance: formulation 2 converges much faster than formulation 1, and we will focus mainly on formulation 2 in this blog. However, I have introduced formulation 1 because it will look really cool in visualisations.

Also, note that \(\{0,1\}^{n\times n}\) and all of the discretised sets are not convex. Therefore, no matter the formulation, there is no mathematical guarantee that these algorithms will find a solution at all, so keep that in mind.

Projections

To see the full details and formulas for calculating the projections, I cannot cover it all better than this reference (starts at page 19). The rough overview of how it's done: projecting onto \(\{0,1\}^{n\times n}\) rounds each entry to whichever of 0 and 1 is nearer; projecting onto \(C_1\) shifts each row uniformly so that it sums to \(m\); and projecting onto \(\hat{C_1}\) sets the \(m\) largest entries of each row to 1 and everything else to 0. The column and diagonal sets are handled analogously (the diagonals are fiddlier, since they have varying lengths and an inequality constraint instead of an equality).
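As a taste, here are sketches of the two projections that come up most in the rest of this post, written against the Board struct from earlier. This is my own paraphrase of the formulas, so double-check the reference before trusting it:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Projection onto {0,1}^(n x n): round each entry to the nearer of 0 and 1.
Board projectBinary(Board x) {
    for (double& v : x.v) v = (v < 0.5) ? 0.0 : 1.0;
    return x;
}

// Projection onto C1-hat (each row contains exactly m ones): the nearest
// such 0/1 board keeps the m largest entries of each row as the ones.
Board projectRowsExactlyM(const Board& x, int m) {
    Board out(x.n);
    std::vector<int> idx(x.n);
    for (int i = 0; i < x.n; i++) {
        std::iota(idx.begin(), idx.end(), 0);
        // Sort the column indices of row i by decreasing value.
        std::sort(idx.begin(), idx.end(),
                  [&](int a, int b) { return x.at(i, a) > x.at(i, b); });
        for (int k = 0; k < m; k++) out.at(i, idx[k]) = 1.0;
    }
    return out;
}
```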

Defining a “trial”

How do we measure the performance of a completely deterministic algorithm?

Simple! We can just run trials on different values of \(x_0\), the starting value. Roughly speaking, each trial looks like this:

Initialise x_0 to a random point in E
For k from 1 to 5000:
    Calculate the value of x_k
    If x_k is a solution:
        Record the number of iterations, and stop
If no solution was found within 5000 iterations, record the trial as a failure

More pedantry

“Initialise x_0 to a random point in E” is a bit of a lie: for Douglas-Rachford we're really initialising \(\bx_0 = (x_0, \dots, x_0) \in E^n\) for some \(x_0 \in E\), and for Malitsky-Tam we're initialising a random \(\bz^0 = (z^0, \dots, z^0) \in E^{n-1}\). The choice to initialise all components to the same value is fairly arbitrary, but it makes more physical sense and it probably doesn't change the results anyway.

If you stop and think about it, “If \(x_k\) is a solution” is also quite ambiguous, since we're working in \(E^n\) instead of \(E\). Additionally, each cell's value isn't necessarily a 0 or a 1; what does 0.5 mean in terms of an nm queens solution?

For now, we check \(\bx_k\) by projecting one of the components of \(P_\mathbf{D}(\bx_k)\) onto \(\{0,1\}^{n\times n}\), and then checking each of the nm queens conditions (which is now well-defined, as that operation results in a single \(n\times n\) board of 0's and 1's).
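In code, the check is simple once the board has been rounded to 0's and 1's (with projectBinary from earlier, say): count the queens along every row, column, and diagonal. A sketch:

```cpp
// Check the nm queens conditions on a 0/1 board: every row and column
// sums to exactly m, and every diagonal (in both directions) to at most m.
bool isSolution(const Board& b, int m) {
    int n = b.n;
    for (int i = 0; i < n; i++) {
        int row = 0, col = 0;
        for (int j = 0; j < n; j++) {
            row += (int)b.at(i, j);
            col += (int)b.at(j, i);
        }
        if (row != m || col != m) return false;
    }
    // "\" diagonals have constant i - j; "/" diagonals have constant i + j.
    for (int d = -(n - 1); d <= n - 1; d++) {
        int diag = 0, anti = 0;
        for (int i = 0; i < n; i++) {
            int j1 = i - d;            // cell (i, j1) lies on the "\" diagonal
            int j2 = (n - 1 + d) - i;  // cell (i, j2) lies on the "/" diagonal
            if (j1 >= 0 && j1 < n) diag += (int)b.at(i, j1);
            if (j2 >= 0 && j2 < n) anti += (int)b.at(i, j2);
        }
        if (diag > m || anti > m) return false;
    }
    return true;
}
```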


Coding Douglas-Rachford

It didn't take very long to code, and when I finished, I ran 100 trials for \((n,m) = (20,2)\), using formulation 2:

| Success rate | Average iterations | Median iterations | Average time (ms) |
| --- | --- | --- | --- |
| 87/100 | 205.74 | 187 | 122.3 |

Checking my reference, it looks like I successfully reproduced the results, save for a couple differences due to slightly different testing methodology.

Except for the fact that they were using Python, and I was using C++. Isn't Python meant to be awfully slow? What was wrong with my code? I frantically searched for answers. I was about a day into researching C++ libraries to micro-optimise my code to the bone, and then I realised.

I didn't compile my C++ code with the -O2 or -O3 flags.

:facepalm:

Either way, here are some initial results for 20-2 queens (compiled with -O3; I also ran 200 trials from now on):

| Success rate | Average iterations | Median iterations | Time (ms) |
| --- | --- | --- | --- |
| 175/200 | 210.54 | 189 | 12.8 |

That's a 10x speedup, just by remembering to compile with optimisation flags. And we're also 10x faster than Python. Order has been restored.
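In case you're curious what the loop body actually looks like: below is a sketch of one product-space Douglas-Rachford step, written in the standard form \(\bx_{k+1} = \bx_k + P_\mathbf{D}(2P_\mathbf{C}(\bx_k) - \bx_k) - P_\mathbf{C}(\bx_k)\), where \(\mathbf{C} = C_1 \times \dots \times C_N\) and \(\mathbf{D}\) is the diagonal. This is my paraphrase of the setup from the first part; if your reading of part 1 has the roles of the two sets swapped, swap the projections accordingly:

```cpp
#include <functional>
#include <vector>

using ProductPoint = std::vector<Board>;                // an element of E^N
using Projection = std::function<Board(const Board&)>;  // a projector on E

// P_D: project onto the diagonal D = {(x, ..., x)} by averaging components.
ProductPoint projectDiagonal(const ProductPoint& x) {
    Board avg(x[0].n);
    for (const Board& comp : x)
        for (size_t i = 0; i < avg.v.size(); i++) avg.v[i] += comp.v[i];
    for (double& v : avg.v) v /= x.size();
    return ProductPoint(x.size(), avg);
}

// One DR step: x <- x + P_D(2 P_C(x) - x) - P_C(x), where P_C applies
// each constraint set's projection to its own component.
ProductPoint drStep(const ProductPoint& x, const std::vector<Projection>& proj) {
    size_t N = x.size();
    ProductPoint pc = x, refl = x;
    for (size_t i = 0; i < N; i++) {
        pc[i] = proj[i](x[i]);  // componentwise P_C(x)
        for (size_t j = 0; j < refl[i].v.size(); j++)
            refl[i].v[j] = 2 * pc[i].v[j] - x[i].v[j];  // 2 P_C(x) - x
    }
    ProductPoint pd = projectDiagonal(refl);  // P_D(2 P_C(x) - x)
    ProductPoint out = x;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < out[i].v.size(); j++)
            out[i].v[j] += pd[i].v[j] - pc[i].v[j];
    return out;
}
```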

Visualisation

Cool, right? Well, now it's time to see what this algorithm is doing. Here is a visualiser (which didn't take nearly as much time as I thought it would; thank you, Processing).

This video is a visualisation of \(P_\mathbf{D}(\bx_k)\), using formulation 2. The red lines around the border indicate violations of any of the nm queens rules.

I think it looks cool. Don’t you?

Yeah, it's quite hard to gain any insight just by looking at it. It's very chaotic; people have commented that it looks like a cellular automaton, and I agree. In my opinion it is quite reminiscent of simulated annealing, wandering aimlessly, roughly towards a solution, trying to minimise violations. That's quite an interesting thought, since it means DR is emergently minimising the number of violations.

How does it fail?

That aside, we see that DR fails 12.5% of the time. Why?

Well I put some of the failing cases through the visualiser, and lo and behold…

It seems to get stuck in the most unfortunate positions: cells sit at a value of 0.5, perpetually indecisive about whether they should be a queen or empty. These fixed points are truly unfortunate, since if some of those values were, say, 0.5001 or 0.4999 instead of 0.5, that would likely have nudged the algorithm out of its rut and let it keep going.

But wait, more pedantry!

If you were outrageously eagle-eyed, you may be wondering about this theorem about Douglas-Rachford I mentioned earlier:

Even when \(A\) and \(B\) are non-convex, if \(x_{k+1} = x_k\), then \(P_A(x_k) \in A \cap B\)

But clearly, by the ruts observed above, we've found an \(x_k\) such that \(x_{k+1} = x_k\) but \(P_A(x_k) \notin A \cap B\). Is (the person that proved the theorem)'s math wrong?

Fortunately, (the person that proved the theorem)'s math is correct after all. The theorem is about \(\bx_k\) itself, not the shadow sequence we've been watching: the visualisation shows \(P_\mathbf{D}(\bx_k)\) stuck at a fixed point, but in reality \(\bx_k\) may be oscillating behind the scenes, or continuously increasing in magnitude.

How often does it fail this way?

Within the 200 trials I did, every successful run found a solution within 600 iterations, even though the iteration limit is 5000.

In fact, have a histogram (courtesy of plotly) showing the distribution of iteration counts over the successful trials.

The significance? We can conclude that the algorithm either gets stuck, or finds a solution relatively quickly. After about 1000 iterations (at least for \((n,m) = (20,2)\)) we can rest assured that Douglas-Rachford is making no meaningful progress; this is not the case with other algorithms or problems.

The noise optimisation

Motivated by Douglas-Rachford getting stuck in fixed points, I came up with a simple new optimisation. Every single iteration, we add a small amount of noise (I added uniform noise with max magnitude 0.01) to \(\bx_k\). The aim was to hopefully kick Douglas-Rachford out of the ruts it gets stuck in.

The details of how to add the noise are probably not super important; I just added it at an intermediate step that minimised the number of extra calculations needed.
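For concreteness, the noise step itself is only a few lines. Here's a sketch that applies it directly to each component of \(\bx_k\) (the 0.01 magnitude is the one I used; exactly where in the iteration you slot it in matters less than doing it every tick):

```cpp
#include <random>

// Add independent uniform noise in [-eps, eps] to every entry.
void addNoise(ProductPoint& x, double eps, std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(-eps, eps);
    for (Board& comp : x)
        for (double& v : comp.v) v += dist(rng);
}

// In the trial loop: addNoise(x, 0.01, rng); x = drStep(x, proj);
```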

The results were simply awesome (taken over 200 trials, formulation 2).

| \((n,m)\) | Successful trials, no noise | Successful trials, with noise |
| --- | --- | --- |
| (20,2) | 175 | 200 |
| (20,10) | 182 | 200 |
| (50,2) | 164 | 200 |
| (50,25) | 158 | 200 |

Yeah… adding a bit of noise every single tick does that. Incredible. Well, surely this comes at the cost of speed, right? You would think that while the noise kicks the algorithm out of ruts, it also kicks the algorithm around whether it likes it or not, slowing it down?

| \((n,m)\) | Median iterations, no noise | Median iterations, with noise |
| --- | --- | --- |
| (20,2) | 189 | 112 |
| (20,10) | 48 | 53.5 |
| (50,2) | 422.5 | 288.5 |
| (50,25) | 74.5 | 112.5 |

Not necessarily. In fact, in many cases we see a distinct improvement. I suppose whether there are performance benefits depends on the structure of the problem, but honestly, it is very surprising that there are even problems where adding noise helps at all.

Also note that we do not include unsuccessful trials in the median, so any improvement seen is even more astounding.

(TODO: Maybe I should add the histogram for this with noise.)

Formula 1

Remember formulation 1 of the nm queens problem? The one where we are looking for an \(x\) such that

\[x \in C_1 \cap C_2 \cap \hat{C_3} \cap \hat{C_4}\]

Well, it doesn't perform as well, so I never investigated it, but please enjoy this moderately satisfying video of a board being solved. Hang in there, though: it takes 1200 iterations to solve!

TODO: I kinda want to see if adding noise helps it at all, but I also want to finish this blog post.

Optimising the termination condition

Before I figured out I should add noise, I tried a different micro-optimisation: instead of terminating by checking the projection of one of the components of \(P_\mathbf{D}(\bx_k)\) onto \(\{0,1\}^{n\times n}\), I checked the projection onto \(\hat{C_1}\).

For (20,2), here is a rough comparison.

| Termination check | Success rate | Average iterations | Average time per trial (ms) |
| --- | --- | --- | --- |
| Projecting onto \(\{0,1\}^{n\times n}\) | 175/200 | 210.54 | 12.8 |
| Projecting onto \(\hat{C_1}\) | 184/200 | 193.57 | 13.7 |

The other values of \((n,m)\) I tested also yielded roughly the same improvements.

Since we're only changing the termination condition, the improvement in success rate seems to indicate that when the algorithm gets stuck, it is usually stuck close to a solution, which would also explain the amazing efficacy of adding noise.

However, despite the average number of iterations going down, the time needed per trial goes up, showing that projecting onto \(\hat{C_1}\) is a relatively expensive operation (at least the way I coded it).

Coding Malitsky-Tam

Despite the rather nasty-looking recurrence, it was quite easy to code. Here it is being run on 20-2 queens, with the parameter \(\gamma = 0.5\).

It's interesting how the algorithm doesn't seem to noticeably decrease the number of violations over time; it just jiggles the board a bit according to its special rules until it finds a solution. Also interesting are the emergent patterns of cells “bubbling” up the board.

I did play around with the value of \(\gamma\) a bit, and it turns out a \(\gamma \in (0.5,0.6)\) works reasonably well. According to my supervisor this is quite unusual, since he’s found that \(\gamma = 0.9\) typically works best.
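For completeness, here's a sketch of one step of the recurrence as I transcribed it, for \(N\) constraint sets and a governing sequence \(\bz = (z_1, \dots, z_{N-1}) \in E^{N-1}\), with each resolvent replaced by the corresponding projection. Trust part 1 (and the original paper) over this snippet if they disagree:

```cpp
// One Malitsky-Tam step, with gamma the step-size parameter:
//   x_1 = P_1(z_1)
//   x_i = P_i(z_i + x_{i-1} - z_{i-1})     for i = 2, ..., N-1
//   x_N = P_N(x_1 + x_{N-1} - z_{N-1})
//   z_i <- z_i + gamma * (x_{i+1} - x_i)   for i = 1, ..., N-1
void mtStep(std::vector<Board>& z, const std::vector<Projection>& proj,
            double gamma) {
    size_t N = proj.size();  // z has N - 1 components
    std::vector<Board> x;
    x.push_back(proj[0](z[0]));
    for (size_t i = 1; i + 1 < N; i++) {
        Board arg = z[i];
        for (size_t j = 0; j < arg.v.size(); j++)
            arg.v[j] += x[i - 1].v[j] - z[i - 1].v[j];
        x.push_back(proj[i](arg));
    }
    Board last = x[0];
    for (size_t j = 0; j < last.v.size(); j++)
        last.v[j] += x[N - 2].v[j] - z[N - 2].v[j];
    x.push_back(proj[N - 1](last));
    for (size_t i = 0; i + 1 < N; i++)
        for (size_t j = 0; j < z[i].v.size(); j++)
            z[i].v[j] += gamma * (x[i + 1].v[j] - x[i].v[j]);
}
```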

Over 200 trials, here's how it benchmarks (compared against noiseless Douglas-Rachford):

| Algorithm | Success rate | Average iterations | Median iterations | Time (ms) |
| --- | --- | --- | --- | --- |
| MT | 199/200 | 1124.80 | 1069 | 59.4 |
| DR | 175/200 | 210.54 | 189 | 12.8 |

And here is the distribution of iterations (note the larger bins, and different scale compared to the previous histogram).

MT is, unremarkably, slower than DR. What is remarkable, however, is its success rate! Which raises the question: what happened in the one trial where it failed to find a solution?

Turns out it got stuck in a cycle. Lol.

What's truly fascinating is that in these 200 trials I never saw Malitsky-Tam get stuck in a fixed point; it only failed by getting stuck in cycles. @Matthew Tam if you're reading this, I think there's a theorem here to be proved… do it because I want to co-author a paper!

Adding noise to Malitsky-Tam

It's actually quite confusing where to add noise to Malitsky-Tam; for now, I added noise to \(\bz^k\) each tick. Here's what happened:

| Algorithm | Success rate | Median iterations | Average time (ms) |
| --- | --- | --- | --- |
| MT, noiseless | 199/200 | 1069 | 59.4 |
| MT, with noise | 31/200 | 3165 | 202.9 |

Noise completely and utterly disrupted Malitsky-Tam, and it's hard to see why. I ran it through the visualiser, but I didn't really see anything notable. It just kept searching, searching, searching, indistinguishable from noiseless Malitsky-Tam, except that it seems unable to find a solution.

Perhaps, given the high median iteration count, I would see more successes if I raised the 5000-iteration limit, but the conclusion would almost certainly remain the same: noise really messes with Malitsky-Tam.

Formula 1, again

While writing this blog I ran MT on formulation 1 of this problem, and I found something equally fascinating. Running the benchmarks…

| Algorithm | Success rate | Median iterations | Average time (ms) |
| --- | --- | --- | --- |
| MT, noiseless | 35/200 | 429 | 19.9 |
| MT, with noise | 200/200 | 242.5 | 19.3 |

This trend is reversed! Now MT needs noise to function, apparently. And it’s actually fairly competitive with Douglas-Rachford! If we have a look at how the algorithm runs when there’s no noise…

Yeah, there is some weird oscillatory behaviour going on there, and honestly it is quite hard to watch. Perhaps the value of \(\gamma\) should be tweaked. Either way, let's see how it fails…

It also gets stuck in a cycle. And apparently MT can escape these cycles with noise. I haven't investigated too many failures, so maybe there's more to learn here. Either way, the main takeaway is that the effects of noise are not purely a property of the algorithm: they also depend on the problem it's applied to.

Final comparison

TODO: Clean up the final spreadsheet, and link it.


The next article: Sudoku!

That's all my findings for NM-Queens; join me in the next part for a similar exploration of Sudoku, where I poke around these algorithms to see what makes them tick!

Click here for the next part!