Thursday, December 8, 2011

Cryptography 101

About four years ago, I was given a book for Christmas called "The Cracking Codebook," by Simon Singh. Before I read the book, I had a passing interest in codes, occasionally making one every now and again (because I've pretty much always been a nerd to some degree). But after I read that book, well, I really felt I understood how codes work, and how to make the best codes. With that in mind, Wayward Letters is proud to present Cryptography 101, starting with the simplest codes, and heading to a more difficult Vigenère Square cipher. At the end, I'll include my own little code, that attempts to emulating the power of a one-time pad without resorting to an incredibly large key.

Perhaps I should back up here, and explain some of the terms I'll use before I dive into the world of cryptography. The plaintext is the text to be encoded, the key is the way you encode and decode it, and the ciphertext is the resulting text. An example might be simpler. Say you encode "hello" into "uryyb." The plaintext here is "hello," and "uryyb" is the ciphertext. The key is the way you encoded that "hello" into "uryyb."* So that's that.

There's one thing to understand, before we jump into coding: every code has a weakness of some sort. Some weaknesses might be bigger or smaller than others, but every code has a weakness. One problem is (usually*) the fact that you have to get the key to the person. Another problem might be inherent problems in the code, such as a guessable key or weaknesses in the key used.

One last note: according to Wikipedia, in cryptography, the term "code" is not strictly synonymous with the term "cipher" (think of the contrast between the regular use of the word "theory" and the scientific use). If you feel like being technical, then mentally replace every usage of the terms "code" and "encode" with "cipher" and "encipher" or "encrypt." But I'm trying to cut down on jargon as much as I can to try and make things simpler.

There are also two different types of codes: monoalphabetic ciphers and polyalphabetic ciphers. And because I can't be bothered typing those words out again and again, I'll be calling them MACs and PACs and I will be explaining them very shortly.

The easiest code to start off with is a Caesar shift code, which is a MAC. You can test out a Caesar shift code here, but I can try to explain it here. Say you have a Caesar shift of ROT3 (you don't need to know what that means yet). What this means is that, to encode, you shift every letter down three, so "a" becomes "d," "b" becomes "e," and so on. To decode, you shift the letters back up three. It's a simple enough code, which obviously is pretty simple to decode. If you think somebody has used a Caesar shift, then it's a simple enough matter to test out the 25 possible keys* and decode the ciphertext.

The next step is a regular MAC, called a monoalphabetic substitution cipher (which I'm calling a MASC because that is way too long to write out). What this means is that every letter in the alphabet is replaced with another letter, number or symbol. So "a" might become "r" or "4," while "b" might become "w" or "&." This is the code a lot of people stop at, the code a lot of people go to. I've seen people replace every letter with a symbol they just made up, and claim the code is invincible, because hey, how can anybody decode the mess of letters and symbols?

This code is not invincible. In fact, this code was beaten in the 9th century by an Islamic philosopher named Al-Kindi, using something called frequency analysis. The trick to beating a MASC is this: in any language, some letters are more common than others. Thus, in the code, some symbols will be more common than others. By tallying up the number of all the symbols and figuring out the frequency of each symbol used, and attempting to match them up with the frequency of the letters used in the English language (or any language, really, but I'm assuming English here), then the code can be deciphered.

I'm guessing the reaction here is either 'Oh, of course!' or 'I still don't get it.' Allow me to use an example. The two most common letters used in English are e and t. If the most common symbols used in the ciphertext are j and 4, then it is reasonable to guess that the MASC used encodes "e" as "j" and "t" as "4." That's two letters you've got already. It might involve a fair amount of guesswork, but it's definitely not an uncrackable code. Frequency analysis (as this is called) also requires a large amount of plaintext to get more accurate letter frequencies.

The second problem with a MASC is a more simple one, and easily remedied when producing a code: the existence of spaces. If you see a single letter separated by spaces, it's pretty much either "a" or "I." If you see two-letter words, then chances are they'll have a vowel. This sorta thing is a decipherer's godsend (note, I do not know if decipherer is a word. But it probably should be.), and I'm going to call them hints because I can't find a proper word for it (cribs is close, but it's technically not right. Ah well.).

Now we've gotten past that, we can move onto the much more powerful polyalphabetic cipher, or PAC. What this means is that instead of having one key for every letter, as in a MASC, there'll be more than one key. In other words, say you have three keys. The first letter would be encoded using the first key, the second letter the second key, the third letter the third key, and the fourth letter would be encoded the using the first key again.

The PAC I'll be going into, before I reveal my own PAC I've been working on, is called the Vigenère Square cipher. This is a simple, but very powerful code that can be used arguably more simply than a MASC, so this is the first code that you should go to if you're going for a non-professional cipher (because it is breakable, and I'll be explaining how).

What you'll need, first of all, is this square here, which is easy enough to draw up if you don't have a computer. It's essential to the encoding and decoding. You'll also need a keyword or a keyphrase, which is another fancy term for a password.

OK, time to get to the actual encoding. Say you want to encode the phrase "Hello there" using the password "codes." For the first letter, you look up the intersection of "h" and "c," which in this case is "j." Then, you look up the intersection of "e" and "o," and you get "s." When you get to the first letter in "there," then you go back to the "c" in codes, and do the intersection of "t" and "c."

This sounds more complicated than it is, and you eventually get the hang of it. So eventually, you find that "Hello there" is encoded as "Jsopg vvhvw." You will notice that this bears no real resemblance to our original phrase. As you can see, the Vigenère Square cipher is indeed a powerful one.

However, it is not uncrackable. Its weakness lies in the repetition of the password, which is less of a problem in shorter plaintexts but becomes more evident in longer ones. Say, you wanted to encode "the cat and the dog" using the password "fox." You will get "yvb hoq fba yvb icd" (ideally you'd take out the spaces, but I'm leaving them here for this example). You might notice here that the combination "yvb" repeats itself. The distance between the two y's is nine letters. So it can be assumed that the password is some factor of nine, so either three or nine (we're crossing out one because that would just be a Caesar shift). If we try three, then we use frequency analysis for every third letter.

Obviously, I'm simplifying a little, but that is pretty much it (there are other ways of decoding it, but that's the main one). If you want to read more about it, then it's technically called the Kasiski examination,* and it can be read about on Wikipedia here.

There's one thing about the Vigenère Square cipher I haven't mentioned (except for a passing reference in the first paragraph), and that is the one-time pad. As I mentioned, one of the weaknesses of the cipher is the repetition of certain combinations of letters. So what if you could have a password as long as the ciphertext.

Well, as it turns out, if the password is completely random, only used once, and kept completely secret, then the code is literally uncrackable. Of course, then the problem leads to "How do I get them the password?" then to "If I can get them the password secretly, why don't I just give them the actual thing I want to tell them secretly?" and finally "What is the point of a one-time pad then?" Well, there is a point one-time pads aren't used.

So now we've covered MACs and PACs, let's go onto my code. I just mentioned one-time pads, and one day I was thinking if it was possible to have a seemingly random password which doesn't require having a completely random password that is the whole problem with one-time pads. And then it struck me: irrational numbers. In other words, numbers like pi and the square root of 2. They have essentially random strings of numbers (as far as I'm aware) and they're fairly easy to calculate.

The immediate problem I ran into was that they are in different bases, which would lead to a ciphertext letter representing 10 possible other letters rather than 26, which would be a large problem. So I eventually figured out the two best possible options: base 3 with three-digit numbers, where there is 27 options (so the alphabet and something else); and base 6 with two-digit numbers, where there are 36 options (the alphabet and numbers 0-9). The problem with option 1 is the something else, and the problem with option 2 is the fact that numbers aren't used as often. For the purposes of this, I'm using the second option, as I foresee option 1 will run into some problems whether the something else is a space, a full stop/exclamation mark/etc, or just a junk letter. So base 6 it is.*

I considered explaining it in words, but quite frankly, it's easier with a picture. Eventually I might make it on Excel, if it turns out to be possible/practical, but for the minute, admire this pretty picture, which can be clicked on for easier reading. I used pi as the password just because it's that much easier, but if you do end up using my code, go for something a little less predictable, say, for example, the cube root of the natural logarithm of 1386. Also, in case it's not clear, you only look at the numbers after the decimal point. I think that's more or less my code. It makes sense to me at least, but if you're confused, comment or something. Or don't use my code. To each their own.

And that, as they say, is that. I have the urge to apologise for the post being late, but then I realised there is no schedule and so it cannot be late. So allow me to acknowledge the fact that this post took several days, but point out that there were several things going on such as parties, sickness, and a job interview. I'm not going to recap my boring life, suffice to say, those were the reasons. Well, that and procrastination. For the future, I'm going to try to alternate this type of long-winded stuff with simple stuff, like uploading comics or something. But, we'll see. I hope you had fun dipping your toe into the water of cryptography, and I'll see you next time in Wayward Letters.

AB

No comments:

Post a Comment