Decoding Sit & Solve Cryptograms

Proposal, Background, Review

The Problem

The problem solved by this project is the automatic decoding of monoalphabetic substitution ciphers. There are a number of short (about 90 characters) ciphertext passages and the project will attempt to decode them to plaintext English using statistical analysis and certain properties of English.

Since the system does not attempt to understand English by using natural lanaguage parsing techniques, it cannot decide whether a given decoding is actually correct. That task is left to the user of the system. The system will provide a number of measures and all relevant detail it used in providing its "best guess."

Studying the methods used in monoalphabetic substitution ciphers allows us to lay a groundwork for a formal specification process of more powerful ciphers.

Methods

Methods for analyzing mono and polyalpabetic subsitutions are presented below. They include:

Frequency Disitributions
Visual Inspection for Simple Rotation
Contraction Rules
Short Words

Monoalphabetic Analsysis

The primary method of decoding the ciphers is to create two frequency distributions, one for a selection of English plaintext, and the other for the ciphertext under analysis. The frequency of each letter is the number of times that letter appears in the text compared to the total amount of text. This yields a number representing what percent of the text the letter makes up. Generally, English letter frequencies are around five or size percent and not much more than 14 percent. However, depending on the exact text, certain letters may be over-represented. For example, a book on zebras will have an abnormally high frequency for the letter 'z'.

The display of the frequency distributions allows for a visual inspection that may reveal a rotation pattern. The naked eye is very good at detecting such patterns, and is given the chance to do so, thereby saving future work at matching frequencies.

English punctuation rules, especially contraction rules, are also very useful. Ideally, a good ciphertext should not include this sort of "markup." However, an equally devastating markup is to include spaces in the ciphertext, thereby betraying short common words. In English, 'LL, N'T, 'S, 'D, 'M, 'RE are common contraction patterns. Since D, T, S, M, L, R and E have distinct frequencies, any contractions in the ciphertext can quickly be resolved. For example, say a ciphertext letter G can map to either a plaintext T or plaintext A. If 'G appears in the ciphertext, then the system can easily decide that G maps to T.

Preserving spaces in the plaintext to ciphertext transformation is another weakness that is easily exploited. These two weaknesses lead us to conclude that a good cipher should preserve as little semantic markup as possible in the plaintext; more importantly, the cipher should not reflect this markup in the ciphertext.

Polyalphabetic Analsysis

One of the project goals was to create a system to assist in decoding polyalphabetic substitution ciphers. No software system was created to perform this task. It should be sufficient to perform the tasks below by hand and then use the regular system to do the monoalphabetic analsysis.

Polyalphabetic substitution ciphers seek to frustrate the type of statistical analysis of ciphertext that can easily be performed on monoalphabetic substitution ciphers. Polyalphabetic substitution ciphers will flatten the frequency distribution of the ciphertext, making it more difficult to match letters. The main idea in decoding polyalphabetic substitution ciphers is to determine the number of alphabets used in the ciphertext and then split the ciphertext into pieces corresponding to those alphabets. From there, each piece is solved as a simple monoalphabetic substitution cipher. The Kasiski method and the index of coincidence are two tools used in this process.

The Kasiski Method

The Kasiski method takes advantage of the natural redundancy in many languages. Many words and letter groups are disproportionantely repeated in any given plaintext. Even though the plaintext may be encoded with many alphabets, unless the cipher is the one-time pad, a word or letter group will be repeated under the same alphabet.

Index of Coincidence

The index of coincidence is a measure used to compare frequency distributions. English has a predictable frequency distribution. Thus, any ciphertext's frequency distribution can be analyzed to see if it closely matches the frequency distribution of English. In addition, if the ciphertext is encoded with a polyalphabetic cipher, the index of coincidence can predict how many alphabets the ciphertext is encoded with. However, the index of coincidence is of little help if more than four alphabets were used.

The System

The software system consists of a number of filters that act on the input and pass it down the chain. The main driver is responsible for managing the chain of filters and displaying the output. The system should be given a file name on the command line that represents the given frequencies of plaintext in the following format:

	mc.A = 12
	mc.B = 2
	mc.x = xx
	...
	mc.Z = 1

The system then constructs the following filters:


		private static Normalize normalize = new Normalize();
		private static AposRule aposRule = new AposRule();
		private static CharCounter charCounter = new CharCounter();
		private static CipherChart cipherChart = new CipherChart();
		private static PlainChart plainChart = new PlainChart();
		private static AlphabetLoader alphabetLoader = new AlphabetLoader(); 
		private static CommonWords commonWords = new CommonWords();

Using the database and the simple filter actions, the system produces metadata that assists in decoding the ciphertext. An example is provided at the end of this paper.

The components constructed include the basic system, a web application for training the system and for entering ciphertext to be analyzed via the web, and a small GUI that visually assists the cryptoanalyst.

Results

The results of this project are a system for aiding in the decoding of simple monoalphabetic substitution codes, and a formalized approach to solving such problems. One particular cipher that is illustrative of both the difficulties and advantages of using the frequency distribution and formalized approach is illustrated below.

A major result of the project was the software system Monocrack built to decipher the subsitution codes. Another result was the realization that frequency distributions require a certain amount of interpretation because frequencies for any given plaintext database and ciphertext selection will not exactly match. Instead, several groups of letters form, and ciphertext letters map into the group rather than a single letter.

Based on the plaintext database frequency distribution shown below, there are nine groups of letters in the English alphabet. They are listed here in roughly descending order by frequency:

{ E }
{ A, T }
{ H, I, S }
{ N, O }
{ R }
{ D, L }
{ B, C, K, P }
{ F, G, M, U, W, Y }
{ J, Q, V, X, Z }

It's also interesting to note that the frequency mappings pare down the tree of 26! possible plaintext translations for a given ciphertext. For example, ciphertext letter X, without any given knownledge of the plaintext language, has a 1/26 chance of mapping to any plaintext character. X can map to any one of 26 characters! Then, Xi can map to any 25. Each mapping has an equal but increasing chance of being correct and leads to worse than exponential space and time analysis. However, after a frequency distribution analysis, X maps into a small target group and has at worst a 1/5 chance of being correct. While the general problem is still worse than exponential, this specific instance is now vulnerable to a brute force attack. Worse, it is almost trivial to use some heuristics to decode the ciphertext, avoiding the need for a greatly reduced brute force attack.

The formalized approach includes the following steps, which are closely aligned with the filters described in the software system. First, the input is "normalized"; that is, the input is converted to uppercase and stripped of punctuation. More sophisticated systems may derive information from this punctuation markup, but the current system does not. Second, the input is scanned for the occurance of a contraction and candidate letters are saved. Third, the frequency distribution is calculated. Fourth, single words are pre-mapped to either A or I. Then, two and three letter words are saved for later comparison to known two and three letter words. Last, the plaintext database is loaded in and compared with the frequency distribution of the ciphertext. The output of these steps provides enough detail and analysis to quickly crack the particular substitution.

Lessons Learned From Monoalphabetic Substitutions

We can formulate some general guidelines for designing a "good" cipher by observing the strengths and weaknesses of the simple monoalphabetic substitution. First, MS ciphers do not have enough confusion; that is, they predictably map a plaintext bit to a ciphertext bit. Second, MS ciphers do not have good diffusion; that is, they do not map a plaintext bit to a number of ciphertext bits.

Example

The frequency distribution below is based on about 47,000 characters from the first chapter of the fifth book in Stephen King's Dark Tower series. This is a fairly characteristic distribution for English, and closely matches the other collected databases of plaintext (from both published sources and collected by the author).

     8% 1% 1% 4% 12% 2% 2% 6% 6% 0% 1% 4% 2% 7% 7% 1% 0% 5% 6% 8% 2% 0% 2% 0% 2% 0%
     ----------------------------------------------------------------------------
 20 |
    |
    |
    |
    |
 15 |
    |
    |
    |             *
    |             *
 10 |             *
    |             *
    | *           *                                            *
    | *           *                          *  *              *
    | *           *        *  *              *  *           *  *
 5  | *           *        *  *              *  *        *  *  *
    | *        *  *        *  *        *     *  *        *  *  *
    | *        *  *        *  *        *     *  *        *  *  *
    | *        *  *  *  *  *  *        *  *  *  *        *  *  *  *     *     *
    | *  *  *  *  *  *  *  *  *     *  *  *  *  *  *     *  *  *  *     *     *
 0  | *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *
     ----------------------------------------------------------------------------
      A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z

The frequency distribution below is for the following 71 character ciphertext:

KBT UBI IB WM LVQMXTY NX KBT ABHI CHBG GRMQM KBTQM UBNHU WMLVTFM KBT SNURI HBI UMI IRMQM

This ciphertext includes two apostrophes: one before 'I' and one before a 'QM'.

     1% 14% 1% 0% 0% 1% 2% 5% 9% 0% 5% 2% 14% 4% 0% 0% 5% 4% 1% 8% 7% 2% 2% 2% 1% 0%
     ----------------------------------------------------------------------------
 20 |
    |
    |
    |
    |
 15 |
    |    *                                *
    |    *                                *
    |    *                                *
    |    *                                *
 10 |    *                                *
    |    *                    *           *
    |    *                    *           *                    *
    |    *                    *           *                    *  *
    |    *                    *           *                    *  *
 5  |    *                 *  *     *     *           *        *  *
    |    *                 *  *     *     *  *        *  *     *  *
    |    *                 *  *     *     *  *        *  *     *  *
    |    *              *  *  *     *  *  *  *        *  *     *  *  *  *  *
    | *  *  *        *  *  *  *     *  *  *  *        *  *  *  *  *  *  *  *  *
 0  | *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *
     ----------------------------------------------------------------------------
      A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z

Given that B or M are likely candidates for a plaintext 'E' and that I could be M, D, S, or T (from the contraction rule), and QM could only be RE or VE, M is probably E. There also doesn't appear to be any rotation in effect here.

If M is a plaintext E, then we can start to fill in a translation table:

KBT UBI IB WM LVQMXTY NX KBT ABHI CHBG GRMQM KBTQM UBNHU WMLVTFM KBT SNURI HBI UMI IRMQM
            E    E                       E E     E        E    E                E    E E

We also suppose that the ciphertext Q can map to R or V. Q appears about 5% in the ciphertext, and this closely matches the frequency of a plaintext R (the frequency of a plaintext V is near zero). Thus, our translation quickly becomes:

KBT UBI IB WM LVQMXTY NX KBT ABHI CHBG GRMQM KBTQM UBNHU WMLVTFM KBT SNURI HBI UMI IRMQM
            E   RE                       ERE    RE        E    E                E    ERE

Now, we remember that B was an extremely common letter in the ciphertext. We also remember that I likely maps to some other common letters, M, D, S, or T. The ciphertext B could map to plaintext A, T, N, or O. The ciphertext I's frequency suggests that it is probably a T or S. This could make the ciphertext B map to a plaintext A or T. So, our decoded text could be either:

KBT UBI IB WM LVQMXTY NX KBT ABHI CHBG GRMQM KBTQM UBNHU WMLVTFM KBT SNURI HBI UMI IRMQM
 A   AT TA  E   RE        A   A T   A    ERE  A RE  A     E    E  A      T  AT  ET T ERE

KBT UBI IB WM LVQMXTY NX KBT ABHI CHBG GRMQM KBTQM UBNHU WMLVTFM KBT SNURI HBI UMI IRMQM
 T   TS ST  E   RE        T   T S   T    ERE  T RE  T     E    E  T      S  TS  ES S ERE

Neither seems particularly promising. Here also the number of short words helps us out. Since we have said that ciphertext M maps to plaintext E, none of the three letter words will be a plaintext "THE". More likely, the three letter words may be "ARE", "AND", or some other three letter word. We see 'KBT' and 'BI' repeated as patterns. Also, the two letter word 'IB' cannot map to ST or TA. It's likely that I is T and that B is O. The ciphertext and decoded text then become:

KBT UBI IB WM LVQMXTY NX KBT ABHI CHBG GRMQM KBTQM UBNHU WMLVTFM KBT SNURI HBI UMI IRMQM
 O   OT TO  E   RE        O   O T   O    ERE  O RE  O     E    E  O      T  OT  ET T ERE

This result seems particularly promising. 'IB' has resolved to 'TO' and some words are starting to suggest themselves. Now, we notice that KBTQM was a contraction, and that KBT appears a number of times. If B is O, then KBT probably represents the common three letter word 'YOU'. It's highly likely that ABH'I is DON'T. Now, ciphertext U seems to match plaintext G, even though the frequency distribution is misleading in this regard. We can further conjecture that W is B, giving us:

KBT UBI IB WM LVQMXTY NX KBT ABH'I CHBG GRMQM KBTQM UBNHU WMLVTFM KBT SNURI HBI UMI IRMQM
YOU GOT TO BE   RE U     YOU DON'T  NO    ERE YOURE GO NG BE  U E YOU   G T NOT GET T ERE

At this point, a fair amount of the ciphertext has been decoded. R is probably H, which makes the RMQM translate to HERE, and plaintext W for ciphertext G along with the I/T subsitution gives 'WHERE' and 'THERE'. Also, GO_NG suggests GOING, BE__U_E suggests BECAUSE and _IGHT suggests MIGHT. Our familiarity with English allows us to complete the decoding from this stage on. The final message is:

KBT UBI IB WM LVQMXTY NX KBT ABH'I CHBG GRMQM KBT'QM UBNHU WMLVTFM KBT SNURI HBI UMI IRMQM
YOU GOT TO BE CAREFUL IF YOU DON'T KNOW WHERE YOU'RE GOING BECAUSE YOU MIGHT NOT GET THERE

The decoding of the author's name is left as an exercise or memory recall:

	by: - KBUN WMQQV

Conclusion

While strong encryption techniques and the analytic techniques used by this project have rendered simple monoalphabetic substitution ciphers virtually useless for modern cryptography, examining the principles behind these simple ciphers provides movitvation for modern work and lays the groundwork for the driving principles in the design of a modern cryptographic algorithm.

Furthermore, simple subsititutions provide countless hours of fun and frustration as games and puzzles, and can be used for non-critical "secret writing." The subsitutions are easy to compute and provide secrecy from the eye of the casual observer, or even a few hours of protection from a determined adversary.

We have also seen that the study and analysis of these simple ciphers provides a framework for constructing ciphers that are more resiliant to statistical attack. The two most important properties for a better cipher are confusion (destroying the relationship between the plaintext and the ciphertext) and diffusion (spreading the effect of encrypting one plaintext bit over several ciphertext bits).