I've Gone Encrypted

28 March 2011

Courtesy of a few lines in a .htaccess file, I've converted my entire web site so that it is accessed via https only. All old URLs will still work; however, they'll be rewritten to https form.
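For the curious, a redirect like this takes only a few mod_rewrite directives. This is a minimal sketch of the standard Apache idiom, not necessarily the exact rules I used:

```apache
# Redirect every plain-http request to the same URL over https.
RewriteEngine On
RewriteCond %{HTTPS} off
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
```

The 301 status tells clients (and search engines) that the move is permanent.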

Here's what a manual connection to the web server looks like now:

$ telnet www.cs.columbia.edu 80
Trying 128.59.11.206...
Connected to webcluster.cs.columbia.edu.
Escape character is '^]'.
GET /~smb/ HTTP/1.0

HTTP/1.1 302 Found
Date: Tue, 29 Mar 2011 00:58:03 GMT
Server: Apache
Location: https://www.cs.columbia.edu/~smb/
Content-Length: 318
Connection: close
Content-Type: text/html; charset=iso-8859-1
...

It's taken me a while, but I've finally fulfilled the promise I made 14 months ago...

Let me know if there are any problems.

The RSA SecurID Problem

18 March 2011

As has been widely reported, RSA suffered a serious security breach aimed at its SecurID product. The SecurID is a major product in its space ("token authentication", rather than the commonly reported "two factor authentication"; see below); a security problem with it would be a major issue indeed. But what, precisely, is the problem RSA experienced? They haven't said, which of course has led to a lot of speculation. There are many possible scenarios here; some are serious and some are not. I'm going to lay out a few possibilities.

Note carefully: I am not saying that RSA made any of the bad decisions I'm outlining here. I hope they didn't. I outline them solely to provide a list of questions people can ask their vendors. Again, I am not accusing RSA of having done anything wrong.

Authentication is conventionally divided into three types: something you know (i.e., passwords), something you are (biometrics), and something you have, such as some form of token. Two-factor authentication means using two of these three, to avoid problems from things like stolen tokens or worse. The normal way to use a SecurID these days is to combine its output with a PIN, i.e., a short password.

Fundamentally, a SecurID is a display, a clock T, a secret key K, and a keyed cryptographic hash function H, all in a tamper-resistant package. Every 30-60 seconds, the token calculates H(K, T) and displays the result on the screen. When you log in, you supply your userid, the displayed value, and a PIN. The system consults one database to map your userid to the serial number of your token; it consults another to find the secret key for your token. It then does the same H(K, T) calculation to make sure the results match what you sent; it also checks your PIN. If everything is ok, the login is successful. (Note: I've oversimplified things; in particular, I've left out some details that are quite essential to making the product successful in the real world. But none of those items are important to what I'm going to discuss here.)
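To make the H(K, T) scheme concrete, here is a minimal sketch in Python. This is emphatically not RSA's actual algorithm; it uses HMAC-SHA256 as the keyed hash (the kind of modern construction suggested below), a hypothetical 60-second time step, and 6-digit display codes:

```python
import hashlib
import hmac
import struct

TIME_STEP = 60  # seconds per displayed code; SecurID uses 30-60 (assumed here)

def token_code(key: bytes, t: float, digits: int = 6) -> str:
    """Token side: compute H(K, T) over the current time step and
    truncate the result to a short decimal code for the display."""
    step = int(t // TIME_STEP)
    mac = hmac.new(key, struct.pack(">Q", step), hashlib.sha256).digest()
    # Reduce the low-order bytes modulo 10^digits for a displayable code.
    value = int.from_bytes(mac[-4:], "big") % (10 ** digits)
    return f"{value:0{digits}d}"

def verify(key: bytes, submitted: str, t: float) -> bool:
    """Server side: look up K for this token, recompute H(K, T),
    and compare in constant time. (PIN checking omitted.)"""
    return hmac.compare_digest(token_code(key, t), submitted)
```

Real systems also accept a window of adjacent time steps to tolerate clock drift; that is one of the real-world details elided above.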

The two most crucial security items here are the strength of the hash function and protection of the key database. Let's take them one at a time.

If an enemy who observes H(K, T) can learn K, the whole scheme falls apart. (People shouldn't even send that except over an encrypted connection, but we know that that happens.) That said, at least older (alleged) versions of H have been cryptanalyzed. (I do not know if that is the current algorithm; the paper speaks of a 64-bit key, but assorted web pages and Wikipedia speak of a 128-bit key.) The easiest way to build a new, secure algorithm would be to use something like HMAC. Assorted web pages mention an AES-based solution; that's another good choice.

Is the risk that the attackers have now learned H? If that's a problem, H was a bad choice to start with. In 1883, Kerckhoffs wrote "Il faut qu'il n'exige pas le secret, et qu'il puisse sans inconvénient tomber entre les mains de l'ennemi." (Roughly, "the system must not require secrecy, and it must be able to fall into the enemy's hands without inconvenience.") But that principle has been known for a long time, so it doesn't seem like the most likely choice. The original 1985 version of H is, as noted, too weak, but that was a reflection of the algorithms and technology of the time. Any newer algorithm would certainly be stronger; I am not seriously worried about this possibility.

The second big risk is compromise of the confidentiality of the key database. If RSA stored copies of customer databases (see below), these could be at risk. That would be nothing short of a disaster for any affected company. (Note, of course, that every customer has to have its own database; if that version is poorly protected, it's game over for that customer.) However — where does K come from, and how does it get into both the token and the key database? There are many possibilities, and they interact deeply with the manufacturing process: one has to ensure that knowledge of the key somehow stays with the device and its serial number. It is tempting to calculate K=G(K',serial_number), where K' is a closely-held RSA secret and G is some other hash function. The risk here is discovery of K'; any new devices would be at risk of compromise if the attacker knew the serial numbers associated with some target customer. This is the really big question: how are the K values generated? Is there some master secret with RSA that allows recalculation of them? If so, any devices manufactured between when the compromise occurred and when K' was changed are much less secure than they should be. It would be great if there were a new K' every day, and perhaps there is; again, though, if the machine holding it was hacked, there's a problem.
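The derived-key scheme described above is easy to illustrate. This is a hypothetical sketch, not RSA's documented manufacturing process; G here is HMAC-SHA256, and the serial-number format is invented:

```python
import hashlib
import hmac

def derive_token_key(master_secret: bytes, serial_number: str) -> bytes:
    """G: derive the per-token key K as a keyed hash of the serial
    number under the closely-held master secret K'."""
    return hmac.new(master_secret, serial_number.encode(), hashlib.sha256).digest()

# The risk in one line: an attacker who learns K' runs exactly this
# same computation, so every token whose serial number is known (or
# guessable) has its K exposed.
```

This is why the question "is there a master secret that allows recalculation of the K values?" matters so much: the derivation is deterministic, so K' plus a serial number is the key.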

Note, of course, that no matter how K is generated, even by a true-random number generator, that value somehow has to be kept with its token. That information could also be a target.

The last obvious issue is the availability of the key database. Customers care about this; if they lose their database, none of their users can log in. Careful sites keep good backups, but as we all know not everyone is that careful. Does RSA (or any other token vendor) offer a managed service, where they hold backup copies? It would be tempting for all concerned, but it also poses a serious confidentiality risk. Such databases should be stored offline and encrypted, with really strong physical and procedural protections. It's hard to imagine that a company as sophisticated as RSA doesn't understand that. However, there's a scarier possibility: is there a database of K' values? That's a lot harder to keep offline, because you'll always need it for some customer. There's a difficult tradeoff here of confidentiality and availability.

We can also look at less obvious possibilities. Perhaps what was compromised is a database of customer authentication data. That is, suppose you want RSA to send you your backed-up keys. How do you prove you're you? A SecurID token? Why not — but that means there's a key database.

I suspect that the real risk is none of these. They're too obvious to any defender; I strongly suspect that these issues have been handled properly. RSA is a sophisticated company that does understand security. Instead, I'm worried about the source code to any of the myriad back-end administrative products necessary to use SecurID. Are there flaws? That's a lot easier to believe than (simple) flaws in an AES-based hash function. There's a lot of code needed for maintaining databases, adding and deleting users, making backups, synchronizing master and secondary copies of databases, and more. An attacker who could penetrate these administrative systems doesn't have to worry about key generation or cryptanalysis; they could simply steal existing keys or insert new ones of their own. To quote myself, "you don't go through strong security, you go around it". The crypto may be strong, but what about the software?

Doing History

2 March 2011

I know I've been silent lately — among other things, I've been frantically working on a cryptologic history paper. If you're interested, I stumbled on a telegraph codebook that showed that the one-time pad — the only provably secure algorithmic cryptosystem — was actually invented 35 years earlier than had been thought. I think that that's a neat story, but there's another story: even 5 years ago, I don't think I could have done the necessary research; 10 years ago, it would have been impossible. The difference is not just the Internet, but also the older documents that have been digitized and made available. There are, however, a couple of caveats and warnings for the future.

I started with what seemed like an extremely hard problem: how could I learn about someone with a common name — Frank Miller — who published his book in 1882? From the book itself, I had exactly two hard facts: he lived in Sacramento, and he had 16 years of banking experience. To this, I added some background knowledge (Sacramento wasn't a large city then), some luck (he was in Sacramento and not, say, New York), some educated guesswork (16 years of experience in 1882 sounds like a post-(U.S.) Civil War career) — and the Internet. I was able to find an old, out-of-copyright book on the financial history of California. I was able to find another old, out-of-copyright book on the history of Sacramento. These, in turn, gave me the names of a few banks that I could put into my search queries. I was able to consult the online 1880 census database at FamilySearch.org, which told me that there were only two Frank Millers in Sacramento then, and only one was a banker. I was able to engage in rapid communication with assorted librarians and historians. I consulted a relatively recent book on the family's genealogy; though it was still in copyright, the author gave Google Books permission to scan it and make it available. All of this was extremely useful, but was in some sense "just" an accelerator; in principle, I could have done all of that manually, albeit much more slowly. It's also questionable if I'd have expended the effort; I'm primarily a computer scientist, not a historian, and probably couldn't have justified the time and travel it would have taken.

More interestingly, though, I was able to do things that would have been effectively impossible without search engines. For example, Miller used the phrase "test word" for what I would call an "authenticator". Where did this come from? Via Google Books, I was able to establish that it once referred to a "password" people used to prove their membership in secretive social organizations like the Freemasons. I would never have thought of consulting an 1824 anti-Masonic screed. I also learned, the same way, that a rare 1876 codebook used it the way Miller did. I might have wanted to check that book, but as far as I can tell the only copy is at Oxford University; neither the Library of Congress nor the NSA museum's library has one. By World War I, there were special provisions in the censorship regulations for test words, and banks had special departments for generating and verifying them; again, I would not have found those references. (I should also note that the Oxford English Dictionary does not have either meaning listed, despite both meanings being well-attested in 19th and early 20th century books. I've emailed them.)

Indexing doesn't have to be free to be useful, though it helps. I was looking for connections between Miller and local military officers; I conjectured that they might have met at some social occasion. A military historian I contacted agreed that generally speaking, such contacts occurred in that era, and that consulting the Society pages of a San Francisco newspaper might answer the question. Fine — but there were literally hundreds and possibly thousands of archived papers I'd have to consult, probably on blurry microfiches or microfilms. Fortunately, the papers I wanted had been scanned and indexed, and the Columbia University library was able to get a trial subscription from the vendor. The ability to do that search in finite time was crucial to one point my paper makes: that Miller virtually certainly met and chatted with Parker Hitt, an important American cryptologist of the era.

There's a warning here, though: information that isn't indexed is much harder to find and use, and will tend to be ignored. I was in Boston recently and thought to check what codebooks the Boston Public Library had. Unfortunately, their electronic catalog only includes works acquired since 1974. The older card catalog had everything, of course, and they did microfilm it — but that catalog now appears to reside in two dusty (and as best I can tell, little used) volumes.

Information that one has to pay for is often only slightly more accessible. I checked the IEEE price for a recent short paper of mine: $30. If you need many such references, the cost is prohibitive. I'm lucky; I'm at a university with a superb library system and institutional subscriptions to very many journals. If I need other resources, the New York Public Library is just a short subway ride away. Independent researchers or those at smaller schools would have much more difficulty doing such research, and therein lies the danger: information is often becoming less accessible, not more. What's worse, professional societies such as the ACM and IEEE prohibit authors from posting papers on their own web sites, because their traditional business model can't flourish that way.

It's also worth noting that most of the publications I consulted electronically were from before 1923. Partly, that's because of the era I was researching; however, it's also because of copyright restrictions: in general, works published before then are in the public domain. Google is scanning some later works, but the issue is tremendously controversial, especially for academic works.

In any field, people with more resources will, on the whole, do better; that's more or less by definition. However, we are stumbling towards a future where the difference is increasing, not decreasing. As libraries put more of their money into electronic resources, they'll have less to spend on traditional archival material. The information will be inaccessible to all but a privileged few, and we'll all be poorer.