FontCode: Hiding information in plain text, unobtrusively and across file types

Like an invisible QR code, FontCode can be used to embed a URL, as in this example where each of four paragraphs is embedded with a different URL. Retrieving one of the URLs (here for a YouTube video) is done by taking a snapshot with a smart device (right) and decoding the hidden message using the FontCode application.

By imperceptibly changing, or perturbing, the shapes of fonts, Columbia researchers have invented a way to embed hidden information in ordinary text without the existence of the secret message being perceived. The method, called FontCode, both creates font perturbations, mapping them to a bit string, and later decodes them to recover the message. To ensure robust decoding when font perturbations are obscured, the researchers introduced redundancy using the 1700-year-old Chinese Remainder Theorem and demonstrated that a message can be fully recovered even with a recognition failure rate of 25% (and theoretically even higher). FontCode works with all fonts and, unlike other text and document methods that hide embedded information, works with all document types, even maintaining the hidden information when the document is printed on paper or converted to another file type. While having obvious advantages for spies, FontCode has perhaps more practical application for companies wanting to prevent document tampering or protect copyrights, and for retailers and artists wanting to embed QR codes and other metadata without altering the look or layout of a document.

Compare these two lines of text. Can you see the difference? One carries a hidden message.

Each character in the second line differs slightly from its counterpart in the top line. (Compare the “d” in the two “looked” instances to see how the bottom instance has slightly thicker strokes.) And in these subtle differences, or perturbations, can be hidden a secret, encoded message without its existence being detected; the secret message is retained when printing a document or photograph or converting it to another file type.

These font perturbations are created and later recognized by FontCode, a text steganographic method created by three Columbia researchers—Chang Xiao, Cheng Zhang, and Changxi Zheng. Described in FontCode: Embedding Information in Text Documents using Glyph Perturbation, the FontCode method embeds text, metadata, a URL, or a digital signature into regular text or a photograph. It works with all common font families (Times Roman, Helvetica, Calibri) and is compatible with most word processing programs (Word, FrameMaker) as well as image-editing and drawing programs (Photoshop, Illustrator). Since each letter can be perturbed, the amount of information conveyed secretly is limited only by the length of the regular text.

“Changing any letter, punctuation mark, or symbol into a slightly different form allows you to change the meaning of the document,” says Chang Xiao, the paper’s lead author. “This hidden information, though not visible to humans, is machine-readable just as bar and QR codes are instantly readable by computers. However, unlike bar and QR codes, FontCode won’t mar the visual aesthetics of the printed material, and its presence can remain secret.”

FontCode is part of a broader research effort by Changxi Zheng, the paper’s senior author, to directly link the physical and digital worlds in a way that is both unobtrusive and not external to the object itself (unlike the highly visible QR and bar codes). Working with others in Columbia’s Computer Graphics Group, which he co-directs, Zheng is instead looking to use an aspect of the physical object to uniquely tag an object and encode information for digital devices to read. Recent projects from the lab include acoustic voxels and aircodes, which use an object’s acoustic properties for tagging and information embedding.

Rather than acoustic properties, FontCode encodes information using minute font perturbations—changing the stroke width, adjusting the height of ascenders and descenders, or tightening or loosening the curves in serifs and the bowls of letters like o, p, and b. Perturbations judged visually similar to the original letter are stored in a codebook. The numbered location in the codebook can be changed, providing FontCode with its own optional built-in encryption scheme.

Five perturbations. Perturbations are stored in a numbered location in a codebook, a fragment of which is shown at right. The locations are not fixed, allowing for an encryption scheme where a private key specifies the particular ordering of the perturbations.

FontCode is not the first technology to hide a message in text—programs exist to hide messages in PDF and Word files or to resize whitespace to denote a 0 or 1—but it is the first to be document-independent and to retain the secret information even when a document or an image with text (PNG, JPG) is printed or converted to another file type. This means a FrameMaker or Word file can be converted to PDF, or a JPEG can be converted to PNG, all without losing the secret information.

The embedding process

Someone using FontCode would supply a secret message and a carrier text document. FontCode converts the secret message to a bit string (ASCII or Unicode) and then into a sequence of integers. Each integer is assigned to a five-letter block in the regular text where the numbered locations of each letter sum to the integer.
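
The steps in this paragraph can be sketched in a few lines of Python. This is only an illustration of the idea as described here, not the researchers' implementation: the number of perturbations per letter (five), the 4-bit chunk size, the greedy way of choosing indices, and all function names are assumptions made for the example.

```python
# Illustrative sketch only; the constants and names below are assumptions.
GLYPHS_PER_LETTER = 5   # assumed: each letter has 5 perturbations, indexed 0..4
BLOCK_SIZE = 5          # five-letter blocks, as described in the article
BITS_PER_BLOCK = 4      # assumed: values 0..15 fit under the maximum sum 5 * 4 = 20


def message_to_bits(message):
    """Convert an ASCII message to a string of '0' and '1' characters."""
    return "".join(f"{byte:08b}" for byte in message.encode("ascii"))


def bits_to_integers(bits):
    """Split the bit string into fixed-size chunks and read each as an integer."""
    return [int(bits[i:i + BITS_PER_BLOCK], 2)
            for i in range(0, len(bits), BITS_PER_BLOCK)]


def integer_to_block(value):
    """Pick one codebook index per letter so that the five indices sum to value."""
    max_index = GLYPHS_PER_LETTER - 1
    if not 0 <= value <= BLOCK_SIZE * max_index:
        raise ValueError("value does not fit in one block")
    indices = []
    remaining = value
    for _ in range(BLOCK_SIZE):
        step = min(max_index, remaining)  # greedy: use as much as one letter allows
        indices.append(step)
        remaining -= step
    return indices


def embed(message):
    """Return, for each five-letter block, the perturbation index of each letter."""
    return [integer_to_block(v) for v in bits_to_integers(message_to_bits(message))]


def extract(blocks):
    """Reverse the process: sum each block, rebuild the bits, decode to text."""
    bits = "".join(f"{sum(block):0{BITS_PER_BLOCK}b}" for block in blocks)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("ascii")


if __name__ == "__main__":
    blocks = embed("Hi")
    print(blocks)
    print(extract(blocks))
```

In this sketch each block of five letters carries four bits, so a two-character message needs four blocks, or twenty letters of carrier text. A real system would render each index as the corresponding perturbed glyph rather than keep it as a number.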

 

Accurately recovering the message even with recognition errors

Recovering hidden messages is the reverse process. From a digital file or from a photograph taken with a smartphone, FontCode matches each perturbed letter to the original perturbation in the codebook to reconstruct the original message.

Matching is done using convolutional neural networks (CNNs). Recognizing vector-drawn fonts (such as those stored as PDFs or created with programs like Illustrator) is straightforward since shape and path definitions are computer-readable. However, it’s a different story for PNG, JPG, and other rasterized (or pixel) fonts, where lighting changes, differing camera perspectives, or noise or blurriness may mask a part of the letter and prevent easy recognition.

While CNNs are trained to take into account such distortions, recognition errors will still occur, and a key challenge was ensuring a message could always be recovered in the face of such errors. Redundancy is one obvious way to recover lost information, but it doesn’t work well with text since redundant letters and symbols are easy to spot.

Instead, the researchers turned to the 1700-year-old Chinese Remainder Theorem, which identifies an unknown number from its remainders after it has been divided by several different divisors. The theorem has been used to reconstruct missing information in other domains; in FontCode, researchers use it to recover the original message even when not all letters are correctly recognized.
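
A small worked example may help show how the theorem pins down a number from its remainders. The divisors 3, 5, and 7 and the remainders below are a standard textbook instance chosen for illustration; they are not values used by FontCode.

```latex
\begin{aligned}
x &\equiv 2 \pmod{3}, \qquad x \equiv 3 \pmod{5}, \qquad x \equiv 2 \pmod{7},\\
M &= 3 \cdot 5 \cdot 7 = 105, \qquad M_1 = 35, \quad M_2 = 21, \quad M_3 = 15,\\
y_1 &= M_1^{-1} \bmod 3 = 2, \qquad y_2 = M_2^{-1} \bmod 5 = 1, \qquad y_3 = M_3^{-1} \bmod 7 = 1,\\
x &\equiv 2 \cdot 35 \cdot 2 + 3 \cdot 21 \cdot 1 + 2 \cdot 15 \cdot 1 = 233 \equiv 23 \pmod{105}.
\end{aligned}
```

The only number between 0 and 104 that leaves these three remainders is 23, so the remainders alone are enough to recover it.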

“Imagine having three unknown variables,” says Zheng. “With three linear equations, you should be able to solve for all three. If you increase the number of equations from three to five, you can solve the three unknowns as long as you know any three out of the five equations.”

Using the Chinese Remainder Theorem, the researchers demonstrated they could recover messages even when 25% of the letter perturbations were not recognized. Theoretically the error rate could go higher than 25%.
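
The "any three out of five" idea in Zheng's analogy can be sketched in Python. This is a simplified illustration, not the scheme from the paper: the five divisors, the function names, and the assumption that a failed recognition is known to be missing (marked `None`) rather than silently wrong are all choices made for the example.

```python
from math import prod

# Assumed for illustration: five pairwise coprime divisors, any three of which suffice.
MODULI = (7, 11, 13, 17, 19)
NEEDED = 3
# Values must stay below the product of the three smallest divisors (7 * 11 * 13 = 1001)
# so that any three remainders identify the value uniquely.
CAPACITY = prod(sorted(MODULI)[:NEEDED])


def encode(value):
    """Store a value redundantly as its remainder for each divisor."""
    if not 0 <= value < CAPACITY:
        raise ValueError("value too large for this set of divisors")
    return [value % m for m in MODULI]


def crt(residues, moduli):
    """Chinese Remainder Theorem: rebuild the number from remainders and divisors."""
    total = prod(moduli)
    result = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        result += r * partial * pow(partial, -1, m)
    return result % total


def decode(residues):
    """Recover the value, ignoring remainders marked None (unrecognized letters)."""
    known = [(r, m) for r, m in zip(residues, MODULI) if r is not None]
    if len(known) < NEEDED:
        raise ValueError("too many remainders are missing")
    rs, ms = zip(*known)
    return crt(rs, ms)


if __name__ == "__main__":
    stored = encode(500)
    damaged = [stored[0], None, stored[2], None, stored[4]]  # two of five lost
    print(decode(damaged))
```

Here two of the five remainders can be lost and the value is still recovered exactly. Handling remainders that are wrong rather than missing takes additional machinery that this sketch does not attempt.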

Obscurity not the only means of security

Data hidden using FontCode can be extremely difficult to detect. Even if an attacker detects font changes between two texts (highly unlikely given the subtlety of the perturbations), it simply isn’t practical to scan every file entering and leaving a company.

Furthermore, FontCode not only embeds but also optionally encrypts messages. This encryption is based on the order in which the perturbations occur in the codebook. Two people wanting to communicate through embedded documents would agree on a private key that specifies locations of perturbations in the codebook.
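
One way to picture a key-dependent codebook ordering is the Python sketch below. It is a hypothetical illustration rather than FontCode's actual scheme: deriving the ordering by hashing a shared passphrase and shuffling, and every function name here, are assumptions. Python's `random` module is also not a cryptographically secure generator, so this shows the idea only.

```python
import hashlib
import random


def keyed_ordering(private_key, num_glyphs):
    """Derive a repeatable ordering of codebook locations from a shared key."""
    digest = hashlib.sha256(private_key.encode("utf-8")).digest()
    seed = int.from_bytes(digest, "big")
    order = list(range(num_glyphs))
    random.Random(seed).shuffle(order)  # same key always yields the same ordering
    return order


def apply_key(indices, order):
    """Sender: map each logical codebook location to the glyph actually used."""
    return [order[i] for i in indices]


def remove_key(glyph_ids, order):
    """Receiver: map each recognized glyph back to its logical location."""
    inverse = {glyph: i for i, glyph in enumerate(order)}
    return [inverse[g] for g in glyph_ids]


if __name__ == "__main__":
    order = keyed_ordering("shared passphrase", 5)
    sent = apply_key([4, 0, 2, 1, 3], order)
    print(remove_key(sent, order))
```

Someone who recognizes the perturbed glyphs but lacks the key sees only glyph identities, not the locations the sender intended, so the integers cannot be read off directly.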

Encryption, however, is just a backup level of protection in case an attacker is able to detect the use of font changes to convey secret information. Given how hard it is to perceive those changes, detection is very difficult, making FontCode a very powerful technique to get data past existing defenses.

About the Researchers

Chang Xiao

Chang Xiao is a second-year PhD student at Columbia University working in the Computer Graphics Group and advised by Changxi Zheng. His research interests focus on the principles and applications of computer graphics, with a particular emphasis on computational design. He received his bachelor's degree from the College of Computer Science at Zhejiang University.

Cheng Zhang

Cheng Zhang is a first-year PhD student at UC Irvine working in the Interactive Graphics and Visualization Lab and advised by Shuang Zhao. His research interests focus on the principles and applications of computer graphics, with a particular emphasis on physically based rendering. He received his master's degree in computer science from Columbia University.

Changxi Zheng

Changxi Zheng is an Associate Professor in Columbia’s Computer Science Department where he co-directs Columbia’s Computer Graphics Group, working on computational fabrication, computer graphics, acoustic and optical engineering, and scientific computing. He is also a member of the Data Science Institute and collaborates with other researchers across the university to better understand and process audiovisual information. Zheng received his Ph.D. from Cornell University with the Best Dissertation Award and his B.S. from Shanghai Jiaotong University. He currently serves as an associate editor of ACM Transactions on Graphics. He was a Conference Chair for SCA in 2017, has won an NSF CAREER Award, and was named one of Forbes’ “30 under 30” in science and healthcare in 2013.

FontCode in the News

Helvetica Is Now An Encryption Device, via FastCompany

This algorithm can hide secret messages in regular-looking text, via Digital Trends

Clever Technique Can Hide Secret Messages in the Most Unassuming Text, via Popular Mechanics

Posted 04/20/2018
Linda Crane
