Tom Lord's blog

Missy Elliott's Reciprocal Cipher, And Perfect Oscillating Sequences
Sun, 29 Mar 2015 10:13:33 GMT
http://tom-lord.weebly.com/blog/missy-elliotts-reciprocal-cipher-and-perfect-oscillating-sequences

Missy Elliott's encoding algorithm
This is an excerpt from the README of a ruby gem I published, aptly named: missy_elliott.
It is obvious, at a glance, that this encoding algorithm is easily reversible: You simply repeat the same 3 steps ("shift, flip, reverse") in reverse order ("reverse, flip, un-shift").

However, after further investigation, I noticed something strange: MissyElliott.encode, and MissyElliott.decode - despite having different implementations - are actually doing exactly the same thing!!
This completely threw me off when I first saw it, but it's true! If you'd like to have a go at proving this for yourself, stop reading now. I'll show my answer after the following message from my sponsor:

Reciprocal Ciphers

A reciprocal cipher is an encoding algorithm that is the inverse of itself. Perhaps the simplest, best-known example of such a cipher is ROT-13, a special case of the Caesar cipher.

The reason why ROT-13 is reciprocal is quite obvious: We shift each letter of the input down (or up!) 13 places in the alphabet. And, since there are 26 letters in the alphabet, repeating the process gets us back to where we started. For example:

"flip me" --> "syvc zr" --> "flip me"

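As a quick check, ROT-13 can be sketched in Ruby with String#tr. The method name rot13 is made up for this illustration:

```ruby
# Hypothetical helper: rotate every ASCII letter 13 places, leave everything else alone.
def rot13(text)
  text.tr('A-Za-z', 'N-ZA-Mn-za-m')
end

encoded = rot13('flip me')
decoded = rot13(encoded)
puts "#{encoded} --> #{decoded}"
```

Applying the same method twice gets back to the input, which is all that "reciprocal" means here.
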
But how on Earth does Missy Elliott manage this, with her much more complicated encoding?! Here's one example, to show the reciprocal encoding in action:

ORIGINAL: 10011101
shift:           00111011
flip:             11000100
reverse:      00100011 (encoded)
shift:           01000110
flip:             10111001
reverse:      10011101 (twice encoded == ORIGINAL!!)
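
The steps in that example can be sketched in a few lines of Ruby. This is only an illustration: missy_encode is a made-up name (not necessarily how the gem does it), and "shift" is read as a rotate-left by one bit, which is what the example above shows.

```ruby
# Hypothetical helper: encode one byte (an Integer from 0 to 255).
def missy_encode(byte)
  bits     = byte.to_s(2).rjust(8, '0')  # e.g. "10011101"
  shifted  = bits[1..-1] + bits[0]       # shift: first bit moves to the back
  flipped  = shifted.tr('01', '10')      # flip: swap every 0 and 1
  reversed = flipped.reverse             # reverse: read the bits backwards
  reversed.to_i(2)
end

original = 0b10011101
encoded  = missy_encode(original)
twice    = missy_encode(encoded)
puts format('%08b -> %08b -> %08b', original, encoded, twice)
```
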

Is this always true? Can we prove it? Yes - here's what I came up with:

We only need to consider what happens to an individual bit, when applying the Missy Elliott algorithm (twice). Each bit can be precisely described by two things: its value (1 or 0), and its position (how many bits are to the left/right).

Without loss of generality, let's consider what happens to a single bit, of value B (the opposite of which is b), which has x bits to its left (and for the sake of clarity, y bits to its right). The following syntax should be fairly self-explanatory:
* There is a slight edge case here, which I have omitted: What happens when x=0, i.e. the bit is "shifted" onto the back of the list? However, this is a fairly trivial edge case to cover; I leave this as an exercise for the reader. (I've always wanted to use that phrase, after having it drilled into me at university!)
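
One way to write out that trace (a reconstruction, not necessarily the original notation), as value / bits to the left / bits to the right: (B, x, y) -> shift (B, x-1, y+1) -> flip (b, x-1, y+1) -> reverse (b, y+1, x-1) -> shift (b, y, x) -> flip (B, y, x) -> reverse (B, x, y). The same bookkeeping can be checked mechanically, including the x=0 edge case. The method names below are invented for this sketch:

```ruby
WIDTH = 8

# A bit is modelled as [value, x], where x is the number of bits to its left.
def shift_bit(value, x)
  [value, (x - 1) % WIDTH]  # x = 0 wraps round to the back of the list
end

def flip_bit(value, x)
  [1 - value, x]
end

def reverse_bit(value, x)
  [value, WIDTH - 1 - x]    # the y bits on the right end up on the left
end

def encode_bit(value, x)
  reverse_bit(*flip_bit(*shift_bit(value, x)))
end

# Encode twice, for both bit values and all eight positions.
all_restored = [0, 1].product((0...WIDTH).to_a).all? do |value, x|
  encode_bit(*encode_bit(value, x)) == [value, x]
end
puts all_restored
```
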

Missy Elliott's Graph

Missy Elliott's algorithm essentially works by mapping each character's ASCII code to an "encoded" number, then converting it back to the corresponding character in the ASCII table. Since the algorithm is reciprocal, this creates a 1-to-1 pairing between ASCII codes when repeating the encoding.

For example:

0 = 00000000 <--> 11111111 = 255
1 = 00000001 <--> 10111111 = 191
2 = 00000010 <--> 11011111 = 223

Do you wonder what it would look like to plot all 256 points on a scatter graph? Well, I did:
The obvious property that this graph shows is: If x<128, then Encoding(x) >=128 (and vice versa). Proving this is quite easy:

Using a similar technique to the above proof, only in this case B represents 7 digits, rather than just 1 (and b represents its "flipped" value), consider what happens when we apply the Missy Elliott encoding to any number < 128:
In other words, if the original value starts with a zero (i.e. is < 128) then its encoded value must start with a 1 (i.e. is >= 128). However, there is a far more interesting (less obvious) feature of this graph: it always oscillates in value...
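
Written out (again as a reconstruction), a byte of the form 0B becomes B0 after the shift, b1 after the flip, and 1 followed by b reversed after the reverse. A brute-force check over all 256 values, using a made-up missy_encode helper that reads "shift" as a rotate-left:

```ruby
# Hypothetical helper: shift (rotate left), flip, reverse on an 8-bit value.
def missy_encode(byte)
  bits = byte.to_s(2).rjust(8, '0')
  (bits[1..-1] + bits[0]).tr('01', '10').reverse.to_i(2)
end

low_half_maps_high = (0...128).all? { |n| missy_encode(n) >= 128 }
high_half_maps_low = (128...256).all? { |n| missy_encode(n) < 128 }
puts low_half_maps_high && high_half_maps_low
```
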

Encoding(0) = 255 > Encoding(1) = 191 < Encoding(2) = 223 > Encoding(3) = 159 < ...

... Except for one point, in the middle:

Encoding(126) = 192 > Encoding(127) = 128 > Encoding(128) = 127 > Encoding(129) = 63
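
To check that this really is the only place where the zig-zag breaks, here is a small sketch (missy_encode is a hypothetical helper, as is everything else named here):

```ruby
# Hypothetical helper: shift (rotate left), flip, reverse on an 8-bit value.
def missy_encode(byte)
  bits = byte.to_s(2).rjust(8, '0')
  (bits[1..-1] + bits[0]).tr('01', '10').reverse.to_i(2)
end

values = (0..255).map { |n| missy_encode(n) }

# +1 where the graph rises, -1 where it falls.
directions = values.each_cons(2).map { |left, right| right <=> left }

# Every n where two consecutive steps go the same way, i.e. where
# Encoding(n), Encoding(n + 1), Encoding(n + 2) do not zig-zag.
breaks = directions.each_cons(2).each_with_index
                   .select { |(first, second), _n| first == second }
                   .map { |_pair, n| n }

puts breaks.inspect
puts values[126..129].inspect
```

The two overlapping runs that it reports, starting at 126 and 127, are the single descending stretch from Encoding(126) to Encoding(129).
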

This can be visualised by taking a subsection of the above graph:
In fact, this sequence has an even crazier property hiding beneath the surface. Let's only look at the second half of the mappings, i.e. Encoding(128), Encoding(129), ..., Encoding(255). This sub-sequence is equal to:

(a_n) = 127, 63, 95, 31, 111, 47, 79, 15, 119, 55, ...

Perfect Oscillating Sequences

* Disclaimer: I have no idea if sequences with this property have ever been named before; I certainly cannot find one! I invented the name "perfect oscillating", but if you feel an alternative name is more fitting, or are aware of a pre-existing name, please let me know in the comments!

Let's take a look at a few sub-sequences of the above:

(a_2n) = 63, 31, 47, 15, 55, 23, 39, 7, 59, 27, ...
(a_3n) = 95, 47, 119, 23, 71, 59, 107, 11, 83, ...
(a_4n) = 31, 15, 23, 7, 27, 11, 19, 3, 29, 13, ...
(a_2n-1) = 127, 95, 111, 79, 119, 87, 103, 71, 123, ...
(a_5n-3) = 63, 79, 23, 123, 43, 83, 3, 109, 53, 69, ...

Any sub-sequence of the form (a_xn+y) also oscillates!!
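
The sub-sequences listed above can be regenerated and tested with a short sketch. All names here are invented for the illustration, and it only checks the five (x, y) pairs shown, not every possible one:

```ruby
# Hypothetical helper: shift (rotate left), flip, reverse on an 8-bit value.
def missy_encode(byte)
  bits = byte.to_s(2).rjust(8, '0')
  (bits[1..-1] + bits[0]).tr('01', '10').reverse.to_i(2)
end

# a_1 .. a_128, i.e. Encoding(128) .. Encoding(255)
SEQUENCE = (128..255).map { |n| missy_encode(n) }

# Terms a_(x*n + y) for n = 1, 2, 3, ... that fall inside the sequence.
def subsequence(x, y)
  (1..SEQUENCE.length).map { |n| x * n + y }
                      .select { |index| index.between?(1, SEQUENCE.length) }
                      .map { |index| SEQUENCE[index - 1] }
end

# True when the terms strictly alternate between rising and falling.
def oscillates?(terms)
  directions = terms.each_cons(2).map { |left, right| right <=> left }
  directions.none?(&:zero?) &&
    directions.each_cons(2).all? { |first, second| first != second }
end

LISTED = [[2, 0], [3, 0], [4, 0], [2, -1], [5, -3]].freeze

LISTED.each do |x, y|
  terms = subsequence(x, y)
  puts "x=#{x}, y=#{y}: #{terms.first(5).inspect} ... oscillates? #{oscillates?(terms)}"
end
```
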

As it turns out, Missy Elliott's song is all about a novel way to generate the sequence: A030109[7]. Who would have guessed?!

In fact, there's one very useful application for sequences like these - the answer might come as a surprise!
It's a knock-out!
To keep things simple from here on, let's use a somewhat shorter sequence. The following perfect oscillating sequence can also be generated using a 4-bit version of the Missy Elliott algorithm (on the numbers 8 - 15), and adding 1 to each term:

8, 4, 6, 2, 7, 3, 5, 1
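
One reading that reproduces these terms (an assumption on my part) is to run the same three steps on 4-bit numbers, take the inputs 8 to 15, and add 1. The width parameter below is made up for this sketch:

```ruby
# Hypothetical helper: shift (rotate left), flip, reverse on a number of the given bit width.
def missy_encode(number, width)
  bits = number.to_s(2).rjust(width, '0')
  (bits[1..-1] + bits[0]).tr('01', '10').reverse.to_i(2)
end

line_up = (8..15).map { |n| missy_encode(n, 4) + 1 }
puts line_up.inspect
```
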

I found a clue for how this sequence is used, in the comments for A049773. To summarise, this is the "optimally fair" starting line-up for ranked players in a knock-out tournament! Assuming the favourite always wins, the result of such a tournament would look like this:
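
Such a tournament is easy to simulate. The sketch below assumes that neighbours in the line-up play each other, and that the lower-numbered (better-ranked) player always wins; play_round is a made-up name:

```ruby
# Assumption: players are ranked 1 (favourite) to 8, and the favourite always wins.
def play_round(line_up)
  line_up.each_slice(2).map(&:min)
end

rounds = [[8, 4, 6, 2, 7, 3, 5, 1]]
rounds << play_round(rounds.last) while rounds.last.length > 1

rounds.each { |players| puts players.inspect }
```

After every round the surviving players are exactly the top half of those who entered it, so ranks 1 and 2 can only meet in the final.
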
Missy Elliott truly is a lyrical genius, after all.

Phew, well that got a little side-tracked from the original purpose of this blog post!

Was it worth it?

Reverse Engineering Regular Expressions
Thu, 19 Mar 2015 23:02:20 GMT
http://tom-lord.weebly.com/blog/reverse-engineering-regular-expressions

I recently published a powerful ruby gem on GitHub: regexp-examples. This library allows you to generate all (or one random) strings that match any regular expression! (With just a few limitations on what's possible.) To install it yourself and have a quick play is dead easy:
In this post, I will try to explain some of the core logic and techniques that I used to build this library. What is a regular expression, really? How/why is it always possible to deconstruct a regex, to list all possible strings that match it? How on Earth did I get back-references to work?!

All shall be revealed...

What Is A "Regular" Expression?

Perhaps the most confusing aspect of regular expressions comes from their formal definition, and the fact that several features in the regex language are not really "regular" at all! These "irregular" pieces of syntax are, in short (and by no coincidence!), the "illegal syntax" in my regexp-examples gem.

However, rather than mysteriously telling you what a regular expression isn't, let's try to explain what it is:

There are only really four (yes, that's right, four!) pieces of syntax allowed in a "true" regular expression:
  1. The "empty string", usually denoted by: ε
  2. Literal characters, e.g. /abc123/
  3. The * repeater, e.g. /a*b*c*/
  4. The | ("Or") operator, e.g. /a|b|c/
Oh, and there's also brackets - so maybe five pieces of syntax, if you want to count those as well!

Every other piece of syntax is really just a nice way to simplify writing out horrendously long, complicated combinations of the above. Let's try a few examples:
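
A few standard expansions of this kind, sketched in Ruby (these are illustrations chosen for this sketch). Each shorthand is paired with a pattern that uses only literals, "|" and brackets, and the two are compared on a handful of sample strings. That is a spot check, not a proof, and full_match? is a made-up helper:

```ruby
# Each pair: [shorthand, the same pattern written out the long way]
EQUIVALENTS = [
  [/[abc]/,  /(a|b|c)/],
  [/[a-d]x/, /(a|b|c|d)x/],
  [/\d/,     /(0|1|2|3|4|5|6|7|8|9)/]
].freeze

SAMPLES = ['', 'a', 'b', 'c', 'd', 'e', 'ax', 'dx', 'ex', '0', '7', '10', 'x'].freeze

# True when the pattern matches the whole string, not just part of it.
def full_match?(pattern, string)
  Regexp.new("\\A(?:#{pattern.source})\\z").match?(string)
end

EQUIVALENTS.each do |shorthand, longhand|
  agree = SAMPLES.all? { |s| full_match?(shorthand, s) == full_match?(longhand, s) }
  puts "#{shorthand.inspect} vs #{longhand.inspect}: #{agree}"
end
```
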

...Hopefully, you get the idea. Or, to put it another way, any regex that can't be expressed in this way is not really regular!
An easy way to see whether or not this is the case is: Does (part of) the regex need to know its surrounding context, in order to determine a match? For example:

These all need to know "what came before", or "what comes next", and are therefore not True Regular Expressions. Hopefully this makes the common claim that "back-references are not regular" a little more obvious to understand: You need to know what the capture group matches before you can know what the back-reference matches (i.e. knowledge of context). So of course you cannot express such patterns using only those four symbols!

One final point to make, before we move on: There is only really one type of repeater in regex; the others are all nothing more than shorthand:
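
For instance, the usual repeaters can all be written with "*", "|" and the empty string (shown here as an empty alternative). The pairs below are standard identities picked for this sketch, compared on a few sample strings with a made-up full_match? helper:

```ruby
# Each pair: [repeater shorthand, equivalent using only *, | and the empty string]
REPEATERS = [
  [/a+/,     /aa*/],
  [/a?/,     /(|a)/],
  [/a{3}/,   /aaa/],
  [/a{1,3}/, /a(|a)(|a)/],
  [/a{2,}/,  /aaa*/]
].freeze

SAMPLES = (0..6).map { |count| 'a' * count } + %w[b ab]

# True when the pattern matches the whole string, not just part of it.
def full_match?(pattern, string)
  Regexp.new("\\A(?:#{pattern.source})\\z").match?(string)
end

REPEATERS.each do |shorthand, longhand|
  agree = SAMPLES.all? { |s| full_match?(shorthand, s) == full_match?(longhand, s) }
  puts "#{shorthand.inspect} vs #{longhand.inspect}: #{agree}"
end
```
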

The Fundamental Structure Of All Regex Patterns

Understanding this structure is at the very heart of my ruby gem; the whole library architecture depends on (and, for some occasional edge cases, is restricted by!) it.

All True Regular Expressions are built using this structure:


Where every group can, itself, be built using that same structure.
What?! Show me some examples!
I'm glad you asked. Consider the following:
(Yuck! Thankfully we don't normally need to write them out like this!...)
But this lays the foundations for the main purpose of this blog post:

How To Parse A Regular Expression

Without getting bogged down in the nitty-gritty implementation details of parsing, let's dive straight in and look at the internal objects generated by RegexpExamples::Parser:
This may look complicated, but it's essentially not much different to what I described above. There is only one key additional thing to consider: In order to avoid problems with infinity, we must restrict repeaters like * and + to have an upper limit. Taken straight from the gem's README:
Or, in other words, the above regex has been interpreted as equivalent to the following:


Like I said above:

How To Generate Examples From A Regular Expression

So, we have our parsed regex. All that remains is to transform this into its possible strings. The trick to this is that all groups and repeaters are given a special method: #result. These results are then built up, piece by piece, to form the full strings that match the regex. Let's take the above example, one step at a time:

  • The SingleCharGroup ("a") has one possible result: ["a"]
  • Therefore the StarRepeater has three possible results: ["", "a", "aa"]
  • Similarly, SingleCharGroup ("b") has one possible result: ["b"]
  • Therefore, PlusRepeater has three possible results: ["b", "bb", "bbb"]
  • Next, the OrGroup simply concatenates these arrays of possible results, i.e. it has six possible results: ["", "a", "aa", "b", "bb", "bbb"]
  • And finally, the top level OneTimeRepeater just returns these same values.
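
Those steps can be mimicked with a few toy classes. The class names are the ones used above, but the bodies are simplified stand-ins written for this post, not the gem's real code. The pattern is assumed to be something like /a*|b+/, and the repeat limits (0 to 2 for "*", 1 to 3 for "+") are inferred from the results listed:

```ruby
class SingleCharGroup
  def initialize(char)
    @char = char
  end

  def result
    [@char]
  end
end

# Repeats every result of the wrapped group between min and max times.
class Repeater
  def initialize(group, min, max)
    @group = group
    @min = min
    @max = max
  end

  def result
    (@min..@max).flat_map do |count|
      Array.new(count, @group.result)
           .reduce(['']) { |acc, strings| acc.product(strings).map(&:join) }
    end.uniq
  end
end

class StarRepeater < Repeater
  def initialize(group)
    super(group, 0, 2)
  end
end

class PlusRepeater < Repeater
  def initialize(group)
    super(group, 1, 3)
  end
end

class OneTimeRepeater < Repeater
  def initialize(group)
    super(group, 1, 1)
  end
end

# Concatenates the result arrays of its alternatives.
class OrGroup
  def initialize(*alternatives)
    @alternatives = alternatives
  end

  def result
    @alternatives.flat_map(&:result)
  end
end

tree = OneTimeRepeater.new(
  OrGroup.new(
    StarRepeater.new(SingleCharGroup.new('a')),
    PlusRepeater.new(SingleCharGroup.new('b'))
  )
)
puts tree.result.inspect
```
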
And there you have it, for a fairly simple example! Let's look at one more, to demonstrate perhaps the most important method in the whole gem:
Once again, we make use of PlusRepeater#result and SingleCharGroup#result to build up the final answer from each "partial result".
However, in this case we end up with the following:

[["a", "aa", "aaa"], ["b", "bb", "bbb"], ["c", "cc", "ccc"]]

Where each of those inner arrays is the result of each PlusRepeater. We need to make one more step: Find all possible results, from joining one element from each array, to form a "final result" string. Enter the magic glue that holds this whole thing together:
*I've actually simplified this method slightly, to avoid confusion. The real deal can be found here.
And so, after applying this method to the above array, we end up with:

["abc", "abcc", "abccc", "abbc", "abbcc", .....]
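
A minimal stand-in for such a method could look like the following. The name permutations_of_strings is the one used in this post, but the body is a sketch written for illustration and may well differ from the real thing:

```ruby
# Joins one string from each inner array, in every possible combination.
def permutations_of_strings(arrays_of_strings)
  arrays_of_strings.reduce(['']) do |partial_results, strings|
    partial_results.product(strings).map(&:join)
  end
end

partial_results = [%w[a aa aaa], %w[b bb bbb], %w[c cc ccc]]
examples = permutations_of_strings(partial_results)

puts examples.length
puts examples.first(5).inspect
```

Because it only ever combines arrays of strings, it does not care how deeply nested the groups that produced them were.
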

This method gets used a lot, when dealing with more complicated regexes! It is the magic function that allows patterns to be made arbitrarily complicated, with unlimited nesting of groups and so on.

So, there you have it! Now you understand all about how to generate examples from regular expressions, right?...
[Picture: A subtle metaphor]

Oh, but...
How do you deal with escaped characters, like \d, \W, etc?
What about regexp options (ignorecase, multiline, extended form)?
What about unicode characters, control codes, named properties, and so on?
How on earth do you correctly parse all of the possible syntax in character sets, such as:
  • /[abc]/.examples
  • /[a-z]/.examples
  • /[^\d\ba-c]/.examples
  • /[[:alpha:]&&[a-c]]/.examples
...And I'm barely getting started here! There is a huge range of syntax to consider!

To cut a long story short: parsing is complicated!! However, all the basic principles discussed above still apply.

There is just one final piece of the puzzle that I have mostly avoided up until this point: back-references.
As discussed earlier, back-references are not regular, as they require knowledge of context. They are not strictly possible to fully support with this gem! (And indeed, there are some rare edge cases where my solution does not work.)

But, as promised, all shall be revealed...

How To Generate Examples With Back-References

The important thing to recognise here is that you cannot know what the back-references need to match, until after the rest of the regex example has been generated. For example, consider the following:


You cannot possibly know whether the "\1" is meant to be an "a" or a "b", until after the capture group's "partial example" is chosen!
The solution? We cheat, and use a place-holder - then substitute the correct pattern back in later!

The pattern I chose is: __X__, where X is the name of your back-reference (in this case, "1").
There is a lot of intricate logic involved in actually keeping track of the results of these capture groups (perhaps the topic for a follow-up blog post?), so let's gloss over this detail for now. So in summary, examples for the above regex are calculated as follows:

  • The SingleCharGroup ("a") has one possible result: ["a"]
  • The SingleCharGroup ("b") has one possible result: ["b"]
  • The OrGroup has two possible results: ["a", "b"]
  • The MultiGroup with group_id=1 has two possible results: ["a", "b"]
  • The BackReferenceGroup has one possible result: ["__1__"]
  • This gives us a final array of possible results: [["a", "b"], ["__1__"]]
  • After applying the permutations_of_strings method, this gives us two "final results": ["a__1__", "b__1__"]
  • We now do one final step: Apply a #substitute_backreferences method on each string, to reveal the true strings that match the original regex: ["aa", "bb"]
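
A heavily simplified sketch of the place-holder idea, for a pattern like /(a|b)\1/. The real bookkeeping in the gem is more intricate. PartialResult and join_results are invented for this illustration, and substitute_backreferences only borrows the name mentioned above:

```ruby
# A partial example, plus the text captured so far by each group name.
PartialResult = Struct.new(:text, :captures)

# Joins one partial result from each array, merging their captures.
def join_results(arrays_of_results)
  arrays_of_results.reduce([PartialResult.new('', {})]) do |joined, options|
    joined.product(options).map do |left, right|
      PartialResult.new(left.text + right.text, left.captures.merge(right.captures))
    end
  end
end

# Swaps every __X__ place-holder for whatever group X captured.
def substitute_backreferences(result)
  result.text.gsub(/__([[:alnum:]]+)__/) do
    result.captures.fetch(Regexp.last_match(1))
  end
end

capture_group  = %w[a b].map { |text| PartialResult.new(text, { '1' => text }) }
back_reference = [PartialResult.new('__1__', {})]

with_placeholders = join_results([capture_group, back_reference])
examples = with_placeholders.map { |result| substitute_backreferences(result) }

puts with_placeholders.map(&:text).inspect
puts examples.inspect
```
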
And now finally, young Padawan, you are ready to see the actual implementation of Regexp#examples:
*Once again, I've been naughty and shown you a slightly simplified version, to avoid confusion. See the real thing over here.

I leave you with one final example, showing the true power of this gem:

Question: What the hell does this ridiculous regex match?! (Note: Don't ever use a regex like that to validate an email address!!)

Answer: (On my average machine, it takes ~0.01 seconds to generate an example string!!)

Interactive Eurovision Map
Mon, 02 Jun 2014 12:30:45 GMT
http://tom-lord.weebly.com/blog/interactive-eurovision-map

This year, I was (un?)fortunate enough to visit Copenhagen and watch the Eurovision Song Contest, performed live.
However, like every recent year, the voting system was clearly skewed by countries voting for neighbours / political reasons.
I really wanted to see some sort of map of the Eurovision results, to see what these "political" votes really looked like - but unfortunately, nothing I found even came close to what I was hoping for. Well, except for maybe this map, which is at least interactive, but with fairly limited scope...
Everyone else seems to have either already done the analysis for you and presented their results, or displayed the raw data in an ugly way!

So, I decided to make one myself. This is, by far, the most interactive Eurovision map I've found on the internet. Compared to this map, for example, everything seen there can be achieved by clicking "To country", "2014" and Austria. Here's what I made:
In case the colour scheme is confusing anyone:
Yellow = Selected country (if applicable. You cannot select non-participating countries, or countries that voted but did not sing, when in "To Country" mode.)
Grey = The country did not participate.
Pink = The country voted, but did not compete (in the final). So of course, they did not receive any votes - hence I've given them a special colour when in "From Country" or "Final Score" mode.
Red --> Green = Lowest --> Highest ranked. In the case of "Spearman Ranked" mode, this is supposed to mean "worst --> best at voting fairly", although this statistical analysis doesn't work as well as I'd hoped, yet...

Rather than simply telling you what I think, I'll let you click around and come to your own conclusions.
San Marino
San Marino is there, I promise ;)
Performing a good statistical analysis on this data is difficult, to say the least. In my "beta" version above, I have performed a Spearman's rank on the data, which - in a nutshell - displays how accurately each country's votes align to the final score.
For the moment, this is only comparing like-for-like, e.g. a country's televote with the overall televote.
However, there are a few big problems with this, when it comes to showing how "unfairly" a country voted, such as:
  • Since so many countries were "block voting" for their neighbours, this had such a big impact that Spearman's rank correlation shows these offending countries as voting more "accurately"!
  • If a country only gives unfair votes to a few neighbours, but votes fairly elsewhere, their Spearman's rank score will still be quite high.
  • If a country's neighbour happened to do well overall (maybe even because their song was good?!?!), then this otherwise-block voting is seen as a fair vote. For example, Norway and Finland are well known for giving Sweden an "unfairly" high score*... But in 2014, Sweden came 3rd, so does this mean they were "right" to vote for their neighbours???
*Sorry to pick on you!
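
For reference, Spearman's rank correlation itself is simple to compute. The sketch below uses the textbook formula for rankings without ties, 1 - 6 * sum(d^2) / (n * (n^2 - 1)). The rankings are made up, and this is not the code behind the map. Real Eurovision votes contain plenty of ties, so treat it purely as an illustration of the measure:

```ruby
# Spearman's rank correlation for two rankings of the same n >= 2 items,
# assuming no tied ranks.
def spearman_rho(ranks_a, ranks_b)
  raise ArgumentError, 'rankings must have equal length' unless ranks_a.length == ranks_b.length

  n = ranks_a.length
  sum_d_squared = ranks_a.zip(ranks_b).sum { |a, b| (a - b)**2 }
  1.0 - (6.0 * sum_d_squared) / (n * (n**2 - 1))
end

final_ranking = [1, 2, 3, 4, 5]  # made-up overall result
aligned_vote  = [1, 2, 3, 4, 5]  # a country that agrees completely
opposite_vote = [5, 4, 3, 2, 1]  # a country that disagrees completely
mixed_vote    = [2, 1, 3, 5, 4]  # a country that swaps two neighbouring pairs

[aligned_vote, opposite_vote, mixed_vote].each do |vote|
  puts spearman_rho(final_ranking, vote)
end
```

A score near 1 means the country's ranking lines up with the final result, which is exactly why widespread block voting can look "accurate" by this measure.
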
...If I can be bothered, I'll have a go at improving, or at least extending, the statistical analysis methods used to give some more meaningful information.
For example, I could potentially rank countries by how often they vote (or don't vote!) for the same people year after year!

Gathering data for this project is also particularly challenging. Results are published in a variety of formats (or even not published at all!), on different websites. Some include all the data, but most don't. I was quite lucky, in fact, that the 2014 data is so complete.

The source code (including the raw data I used) is all available on my GitHub page, here.
(Disclaimer: This is my first ever JavaScript/jQuery/web app project, and I just threw this together in a few days without unit testing etc... Please be forgiving if you look at the code!)