Please Embed Bibliographic Data in Online Documents

7 March 2018

When I teach, I assign a lot of primary sources: technical papers, but also (especially in courses like Computers and Society) news stories. And when I assign something, I have to do laborious copying and pasting: I ask my students to use complete bibliography entries, rather than just URLs, so I do the same. Why? Among other things, "link rot": URLs are rarely good for more than a few years, save at places that have seriously thought through their naming scheme and made a commitment to stick to it.

Being the sort of person I am, I use scripts to generate my class syllabus pages. Since I already have copious BibTeX files, I use bibtex2html to generate (most of) the readings for each class. And therein lies the rub: I want all "archival" files (journal or conference paper PDFs, articles from major newspapers such as the New York Times, etc.) to include machine-readable metadata. The HTML file should, by itself, be self-identifying to scholars (or at least to scholars with the right tools…). I don’t care about the format chosen; I just want a single one that I can parse with a rational Python script.

This isn’t a new concept. Most books published in recent years in the US contain Library of Congress cataloging information. Web pages and academic papers should, too. And there are plenty of standards to choose from; ideally, pick one.

For now, I’ve written scripts for two of the sites I cite the most, the New York Times and Ars Technica. They have most of the right information, but it’s not in the same format. Ars Technica, for example, puts the interesting stuff in a single HTML tag, but in JSON format inside the tag. The New York Times uses a bunch of separate tags. I tried writing a variant for the Washington Post; as best I can tell, most of the information I need is there, but the reporters’ names are in some JavaScript assignment statements.
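To make the two layouts concrete, here is a minimal sketch of a parser that handles both: metadata held as JSON inside a single tag, and metadata spread across separate tags. It is not the script described above. The tag and key names (`application/ld+json`, `headline`, `datePublished`, `og:title`, `article:published_time`, and so on) are assumptions borrowed from common web conventions; any particular site may use different ones.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collect <meta> name/content pairs and JSON held in <script> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}         # values found in separate <meta> tags
        self.json_blobs = []   # decoded JSON found inside <script> tags
        self._in_json = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Assumption: the key is in "name" or "property", the value in "content".
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key] = attrs["content"]
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json = True
            self._buf = []

    def handle_data(self, data):
        if self._in_json:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json:
            self._in_json = False
            try:
                self.json_blobs.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                pass  # malformed JSON: skip it rather than fail the whole page


def _author_names(value):
    """Flatten an author field (string, object, or list) into a list of names."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, dict):
        return [value["name"]] if value.get("name") else []
    if isinstance(value, list):
        return [name for item in value for name in _author_names(item)]
    return []


def extract_metadata(html):
    """Return title, authors, and date, preferring JSON over <meta> tags."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()

    record = {"title": None, "authors": [], "date": None}
    for blob in parser.json_blobs:
        if not isinstance(blob, dict):
            continue
        record["title"] = record["title"] or blob.get("headline")
        record["authors"] = record["authors"] or _author_names(blob.get("author"))
        record["date"] = record["date"] or blob.get("datePublished")

    # Fall back to the separate-tag layout for anything still missing.
    record["title"] = record["title"] or parser.meta.get("og:title")
    if not record["authors"] and parser.meta.get("author"):
        record["authors"] = [parser.meta["author"]]
    record["date"] = record["date"] or parser.meta.get("article:published_time")
    return record
```

Even in this toy version, most of the code is bookkeeping for the fact that there are two layouts rather than one. Bylines that live in JavaScript assignment statements would need a third, more fragile code path.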

I’m trying to do my part. My own web site has .bib entries for all of my papers, and I’m rewriting my blog software to generate similar files for each blog post. (Not, I think, that anyone but me has ever formally cited my blog…)
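As a rough illustration of what generating a .bib entry for a blog post might involve, here is a small sketch. It is not the blog software mentioned above: the function name `bibtex_entry`, the choice of `@misc`, and the set of fields are all assumptions made for the example.

```python
from datetime import date

# Standard three-letter BibTeX month macros, written without braces.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def _field(name, value, braces=True):
    """Format one 'name = {value},' line of a BibTeX entry."""
    if braces:
        value = "{" + value + "}"
    return f"  {name} = {value},"


def bibtex_entry(key, title, authors, published, url):
    """Build an @misc entry for a web document (hypothetical helper).

    Special characters such as & or % are not escaped here; a real
    generator would have to handle them.
    """
    lines = ["@misc{" + key + ","]
    lines.append(_field("author", " and ".join(authors)))
    # The extra pair of braces keeps BibTeX from changing the title's case.
    lines.append(_field("title", "{" + title + "}"))
    lines.append(_field("year", str(published.year)))
    lines.append(_field("month", MONTHS[published.month - 1], braces=False))
    lines.append(_field("url", url))
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # "Author Name" and the citation key are placeholders.
    print(bibtex_entry(
        "blog:2018-03-07",
        "Please Embed Bibliographic Data in Online Documents",
        ["Author Name"],
        date(2018, 3, 7),
        "https://www.cs.columbia.edu/~smb/blog/2018-03/2018-03-07.html",
    ))
```

Writing the entry is the easy half. The hard half is that the reader's tools need to know where on the page to find it, which is the argument for settling on one format.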

I’m not a librarian or archivist, but if I’m seeing this problem, I suspect that the pros are seeing it even more. And maybe I’m wrong, and there are standards that the New York Times is following—but in that case, can others please follow suit? The future will thank you.

https://www.cs.columbia.edu/~smb/blog/2018-03/2018-03-07.html