SMBlog -- 17 November 2019

The Early History of Usenet, Part IV: File Format

When we set out to design the over-the-wire file format, we were certain of one thing: we wouldn’t get it perfectly right. That led to our first decision: the very first character of the transmitted file would be the letter "A", for the version. Why not a number on the first line, including perhaps a decimal point? If we ever considered that, I have no recollection of it.

A more interesting question is why we didn’t use email-style headers, a style later adopted for HTTP. The answer, I think, is that few, if any, of us had any experience with those protocols at that time. My own personal awareness of them started when I requested and received a copy of the Internet Protocol Transition Workbook a couple of years later—but I was only aware of it because of Usenet. (A few years earlier, I gained a fair amount of knowledge of the ARPANET from the user level, but I concentrated more on learning Multics.)

Instead, we opted for the minimalist style epitomized by 7th Edition Unix. In fact, even if we had known of the Internet (in those days, ARPANET) style, we may have eschewed it anyway. Per a later discussion of implementation, the very first version of our code was a shell script. Dealing with entire lines as single units, and not trying to parse headers that allowed arbitrary case, optional white space, and continuation lines was certainly simpler!

The next question was what to do about duplicate articles. One obvious necessity is an article ID, since that would allow duplicate detection. In our design, the article ID was the rest of the first line, after the A. (Note: it’s been 40 years and I no longer remember exactly what we decided at that meeting. Per the implementation discussion, there was experimentation and change. The details I’m giving here are taken from the final format as documented in RFC 850, but there is no doubt that there were changes during development.)
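
To make the idea of ID-based duplicate detection concrete, here is a minimal Python sketch. The SeenArticles class and its in-memory set are hypothetical, chosen only for illustration; nothing here describes how the original code actually recorded which IDs it had seen.

```python
class SeenArticles:
    """Hypothetical in-memory record of article IDs already received."""

    def __init__(self):
        self._ids = set()

    def is_new(self, first_line):
        """Return True the first time an article ID is seen, False afterwards.

        first_line is the first line of a transmitted article: the version
        letter "A" followed by the article ID.
        """
        if not first_line.startswith("A"):
            raise ValueError("unknown format version: %r" % first_line[:1])
        article_id = first_line[1:].strip()
        if article_id in self._ids:
            return False  # duplicate: this ID has been recorded before
        self._ids.add(article_id)
        return True
```

A receiving site would call is_new() on the first line of each incoming article and discard the article when it returns False.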

We also wanted to minimize transfer costs. As I noted in the previous post, article transmission was by expensive, dial-up connections; sending something that wasn’t needed would cost real money. Accordingly, articles had to include a list of systems known to have already seen the article. This consisted of a series of hostnames separated by exclamation points, with the last element being the login name of the user who posted it. Thus, an article created by me at UNC Chapel Hill and relayed through Duke and alice, a computer at Bell Labs Research, would contain "alice!duke!unc!smb". If a possible next hop appeared in the path, the duplicate copy would not be sent. (Yes, that meant that it was easy to ensure that some sites would never see some articles. To my recollection, we did not worry about that issue and perhaps didn’t even notice it.)
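
The path check described above can be sketched in a few lines of Python. The function names are my own, and the prepend step is an assumption inferred from the example path (each relaying system appears to add its name at the front); treat this as an illustration, not the original implementation.

```python
def hosts_in_path(path):
    """Split a path such as "alice!duke!unc!smb" into its host names.

    The last element is the poster's login name, not a host, so it is dropped.
    """
    return path.split("!")[:-1]


def should_send(path, next_hop):
    """Return True if next_hop has not already seen the article."""
    return next_hop not in hosts_in_path(path)


def prepend_host(path, this_host):
    """Assumed relay step: a system adds its own name at the front."""
    return this_host + "!" + path
```

With the path "alice!duke!unc!smb", should_send(path, "duke") is False, so alice would not send the article back to Duke, while a hypothetical neighbor named "research" would still receive it.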

Why did we pick that format, instead of something like commas or blanks as separators? The format we chose was that used by uucp for email relaying; someone at some computer that alice talked to could type

mail alice!duke!unc!smb

and it would be relayed through alice and duke before reaching my department’s computer and then me. (That sort of email relaying was to prove problematic; again, more on that later.)

Today, with full connectivity over the Internet, we wouldn’t do things the same way. Instead, one party would send the next a list of article IDs; that party would then request the ones it had not yet seen. We did consider something like that, but rejected it. Why? Because we were using infrequent, dial-up connections to relay articles, and the number of loops (and hence duplicate articles received) seemed unlikely to be high.

Consider: in our original scheme, many sites would be polled once per night by Duke. If, during that call, Duke sent them a list of articles, they couldn’t request them until the next night, and wouldn’t receive them until the following night. That amount of delay was unacceptable. Instead, we accepted the chance of sending unnecessary text. While there certainly would be extra transmissions some of the time, we felt that the amount would not be prohibitive. This was before JPG and before MP3, so articles were entirely text and hence would be relatively small and thus cheap.

Sending a date and an article title were obvious enough that these didn’t even merit much discussion. The date and time line used the format generated by the ctime() or asctime() library routines. I do not recall if we normalized the date and time to UTC or just ignored the question; clearly, the former would have been the proper choice. (There is an interesting discrepancy here. A reproduction of the original announcement clearly shows a time zone. Neither the RFC nor the ctime() routine had one. I suspect that announcement was correct.) The most interesting question, though, was about what came to be called newsgroups.

We decided, from the beginning, that we needed multiple categories of articles—newsgroups. For local use, there might be one for academic matters ("Doctoral orals start two weeks from tomorrow"), social activities ("Reminder: the spring picnic is Sunday!"), and more. But what about remote sites? The original design had one relayed newsgroup: NET. That is, there would be no distinction between different categories of non-local articles.

This approach was hotly debated. Was it really the case that there would be so little traffic of interest beyond the local machine that no further categorization was needed? (Our estimates of traffic volume were very, very wrong, and this error affected several implementation decisions.) The objection that carried the day: "What if someone wants to sell their car? They want it to reach other computers in the geographical area, but not beyond." We instead decided that anything in newsgroups beginning "NET." would be relayed. This, though, created a problem that is still not resolved: we conflated the notion of interest with the scope of relaying. That is, suppose that instead of duke and unc being directly connected, both sites spoke to alice. Material of regional interest (the two schools were only about 16 km apart) should be seen on both sites, but there would be no reason to send such items as used car ads to a Bell Labs machine in New Jersey. (Aside: years later, when Usenet was already reasonably widespread, someone posted a used car ad to a group with world-wide distribution. The author was rather confused when several of the original designers sent him congratulatory notes…)

There was one more interesting point. From the very beginning, we knew that some articles belonged in more than one category. We therefore supported cross-posting to multiple newsgroups from the start. Cross-posting later came to be seen as impolite, but it was an intentional feature of the original design.
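
As a rough illustration of how the "NET." rule and cross-posting might interact, here is a small Python sketch. Several details are assumptions on my part and not taken from the design discussion: that cross-posted newsgroups are separated by commas, that the prefix is matched without regard to case, and that one relayed newsgroup is enough to relay the whole article.

```python
RELAYED_PREFIX = "net."  # assumption: matched case-insensitively


def split_newsgroups(line):
    """Split a newsgroups line into names (comma separator is an assumption)."""
    return [name.strip() for name in line.split(",") if name.strip()]


def is_relayed(newsgroups_line):
    """Return True if any listed newsgroup falls under the relayed prefix."""
    return any(
        name.lower().startswith(RELAYED_PREFIX)
        for name in split_newsgroups(newsgroups_line)
    )
```

Under these assumptions, an article posted only to a hypothetical local group such as "general" stays on the local machine, while one cross-posted to "general,net.general" is relayed.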

Using the example from RFC 850, the final format of a news article looked like this:

Aeagle.642
net.general
cbosgd!mhuxj!mhuxt!eagle!jerry
Fri Nov 19 16:14:55 1982
Usenet Etiquette - Please Read
The body of the article comes here, with no blank line.
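
For readers who want to experiment with this layout, here is a minimal Python sketch of a parser. It assumes exactly the five header lines shown above, in that order, followed by the body; the Article tuple and the parse_article name are hypothetical. The date is parsed with the ctime()-style pattern, which relies on English day and month names.

```python
import time
from collections import namedtuple

# Hypothetical container for the fields of one article.
Article = namedtuple("Article", "article_id newsgroups path date title body")

CTIME_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Fri Nov 19 16:14:55 1982"


def parse_article(text):
    """Parse one article: five fixed header lines, then the body."""
    lines = text.split("\n")
    if len(lines) < 5 or not lines[0].startswith("A"):
        raise ValueError("not an A-format article")
    return Article(
        article_id=lines[0][1:],       # rest of the first line, after the "A"
        newsgroups=lines[1],
        path=lines[2],
        date=time.strptime(lines[3], CTIME_FORMAT),
        title=lines[4],
        body="\n".join(lines[5:]),     # no blank line separates header and body
    )
```

Because every field sits on its own line at a fixed position, the parser needs no handling for case, optional white space, or continuation lines.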

We decided on one last issue at the meeting: the name of our system. We called the technology "Netnews" (Network News), and the particular instantiation we hoped for was "Usenet". Why Usenet? The Wikipedia article (as of the 17 November 2019 version) has it almost right: "The name ‘Usenet’ emphasizes its creators’ hope that the USENIX organization would take an active role in its operation." However, there was a bit more. Until some time in 1979, the organization now known as Usenix was called the Unix User’s Group. But Bell Labs’ lawyers took exception to this use of their trademark, so a new name was chosen: Usenix. The technical folks, being innocent in the ways of lawyers, were bemused by this. Part of our reason for the name "Usenet" was as a gentle tease about this forced renaming.

Here is the table of contents, actual and projected, for this series.