Here’s Why We Need To Democratize Artificial Intelligence

With millions of AI professionals needed, we need to reach beyond elite schools, says Sameer Maskey (PhD’08, advisor Julia Hirschberg), writing in Forbes. His company Fusemachines trains AI engineers in Nepal and other developing countries.

FontCode: Hiding information in plain text, unobtrusively and across file types

Like an invisible QR code, FontCode can be used to embed a URL, as in this example where each of four paragraphs is embedded with a different URL. Retrieving one of the URLs (here for a YouTube video) is done by taking a snapshot with a smart device (right) and decoding the hidden message using the FontCode application.

By imperceptibly changing, or perturbing, the shapes of fonts, Columbia researchers have invented a way to embed hidden information in ordinary text, without the existence of the secret message being perceived. The method, called FontCode, both creates font perturbations, mapping them to a bit string, and later decodes them to recover the message. To ensure robust decoding when font perturbations are obscured, researchers introduced redundancy using the 1700-year-old Chinese Remainder Theorem, and were able to demonstrate that a messages can be fully recovered even with a recognition failure rate of 25% (and theoretically even higher). FontCode works with all fonts and, unlike other text and document methods that hide embedded information, works with all document types, even maintaining the hidden information when the document is printed on paper or converted to another file type. While having obvious advantages for spies, FontCode has perhaps more practical application for companies wanting to prevent document tampering or protect copyrights, and for retailers and artists wanting to embed QR codes and other metadata without altering the look or layout of a document.

Compare these two lines of text. Can you see the difference? One carries a hidden message.

Each character in the second line differs slightly from its counterpart in the top line. (Compare the “d” in the two “looked” instances to see how the bottom instance has slightly thicker strokes.) And in these subtle differences, or perturbations, can be hidden a secret, encoded message without its existence being detected; the secret message is retained when printing a document or photograph or converting it to another file type.

These font perturbations are created and later recognized by FontCode, a text steganographic method created by three Columbia researchers—Chang Xiao, Cheng Zhang, and Changxi Zheng. Described in FontCode: Embedding Information in Text Documents using Glyph Perturbation, the FontCode method embeds text, metadata, a URL, or a digital signature into regular text or a photograph. It works with all common font families (Times Roman, Helvetica, Calibri) and is compatible with most word processing programs (Word, FrameMaker) as well as image-editing and drawing programs (Photoshop, Illustrator). Since each letter can be perturbed, the amount of information conveyed secretly is limited only by the length of the regular text.

“Changing any letter, punctuation mark, or symbol into a slightly different form allows you to change the meaning of the document,” says Chang Xiao, the paper’s lead author. “This hidden information, though not visible to humans, is machine-readable just as bar and QR codes are instantly readable by computers. However, unlike bar and QR codes, FontCode won’t mar the visual aesthetics of the printed material, and its presence can remain secret.”

FontCode is part of a broader research effort by Changxi Zheng, the paper’s senior author, to directly link the physical and digital worlds in a way that is both unobtrusive and not external to the object itself (unlike the highly visible QR and bar codes). Working with others in Columbia’s Computer Graphics Group, which he co-directs, Zheng is instead looking to use an aspect of the physical object to uniquely tag an object and encode information for digital devices to read. Recent projects from the lab include acoustic voxels and aircodes, which use an object’s acoustic properties for tagging and information embedding.

Rather than acoustic properties, FontCode encodes information using minute font perturbations—changing the stroke width, adjusting the height of ascenders and descenders, or tightening or loosening the curves in serifs and the bowls of letters like o, p, and b. Perturbations judged visually similar to the original letter are stored in a codebook. The numbered location in the codebook can be changed, providing FontCode with its own optional built-in encryption scheme.

Five perturbations. Perturbations are stored in a numbered location in a codebook, a fragment of which is shown at right. The locations are not fixed, allowing for an encryption scheme where a private key specifies the particular ordering of the perturbations.

FontCode is not the first technology to hide a message in text—programs exist to hide messages in PDF and Word files or to resize whitespace to denote a 0 or 1—but it is the first to be document-independent and to retain the secret information even when a document or an image with text (PNG, JPG) is printed or converted to another file type. This means a FrameMaker or Word file can be converted to PDF, or a JPEG can be converted to PNG, all without losing the secret information.

The embedding process

Someone using FontCode would supply a secret message and a carrier text document. FontCode converts the secret message to a bit string (ASCII or Unicode) and then into a sequence of integers. Each integer is assigned to a five-letter block in the regular text where the numbered locations of each letter sum to the integer.

 

Accurately recovering the message even with recognition errors

Recovering hidden messages is the reverse process. From a digital file or from a photograph taken with a smartphone, FontCode matches each perturbed letter to the original perturbation in the codebook to reconstruct the original message.

Matching is done using convolutional neural networks (CNNs). Recognizing vector-drawn fonts (such as those stored as PDFs or created with programs like Illustrator) is straightforward since shape and path definitions are computer-readable. However, it’s a different story for PNG, IMG, and other rasterized (or pixel) fonts, where lighting changes, differing camera perspectives, or noise or blurriness may mask a part of the letter and prevent an easy recognition.

While CNNs are trained to take into account such distortions, recognition errors will still occur, and a key challenge was ensuring a message could always be recovered in the face of such errors. Redundancy is one obvious way to recover lost information, but it doesn’t work well with text since redundant letters and symbols are easy to spot.

Instead, the researchers turned to the 1700-year-old Chinese Remainder Theorem, which identifies an unknown number from its remainder after it has been divided by several different divisors. The theorem has been used to reconstruct missing information in other domains; in FontCode, researchers use it to recover the original message even when not all letters are correctly recognized.

“Imagine having three unknown variables,” says Zheng. “With three linear equations, you should be able to solve for all three. If you increase the number of equations from three to five, you can solve the three unknowns as long as you know any three out of the five equations.”

Using the Chinese Remainder theory, the researchers demonstrated they could recover messages even when 25% of the letter perturbations were not recognized. Theoretically the error rate could go higher than 25%.

Obscurity not only means of security

Data hidden using FontCode can be extremely difficult to detect. Even if an attacker detects font changes between two texts—highly unlikely given the subtlety of the perturbations—it simply isn’t practical to scan every file going and coming within a company.

Furthermore, FontCode not only embeds but also optionally encrypts messages. This encryption is based on the order in which the perturbations occur in the codebook. Two people wanting to communicate through embedded documents would agree on a private key that specifies locations of perturbations in the codebook.

Encryption however is just a backup level of protection in case an attacker was able to detect the use of font changes to convey secret information. Given how hard it is to perceive those changes, detection is very difficult to do, making FontCode a very powerful technique to get data past existing defenses.

About the Researchers

Chang Xiao

Chang Xiao is a second-year PhD student at Columbia University working in the Computer Graphics Group and advised by Changxi Zheng. His research interests focus on the principles and applications of computer graphics, with a particular emphasis on computational design. He received his bachelor degree in the College of Computer Science of Zhejiang University.

Cheng Zhang

Cheng Zhang is a first-year PhD student at UC Irvine working in the Interactive Graphics and Visualization Lab advised by Shuang Zhao. His research interests focus on the principles and applications of computer graphics, with a particular emphasis on physically based rendering. He received his master degree in Computer Science in Columbia University.

Changxi Zheng

Changxi Zheng is an Associate Professor in Columbia’s Computer Science Department where he co-directs Columbia’s Computer Graphics Group, working on computational fabrication, computer graphics, acoustic and optical engineering, and scientific computing. He is also a member of the Data Science Institute and collaborates with other researchers across the university to better understand and processing audiovisual information. Zheng received his Ph.D. from Cornell University with the Best Dissertation Award and his B.S. from Shanghai Jiaotong University. He currently serves as an associated editor of ACM Transactions on Graphics. He was a Conference Chair for SCA in 2017, has won a NSF CAREER Award, and was named one of Forbes’ “30 under 30” in science and healthcare in 2013.

FontCode in the News

Helvetica Is Now An Encryption Device, via FastCompany

This algorithm can hide secret messages in regular-looking text, via Digital Trends

Clever Technique Can Hide Secret Messages in the Most Unassuming Text, via Popular Mechanics

Posted 04/20/2018
Linda Crane

Julia Hirschberg elected to the American Academy of Arts and Sciences

Julia Hirschberg

Julia Hirschberg has been elected to the American Academy of Arts and Sciences for her contributions to computer science. Established in 1780, the Academy is one of the oldest learned societies in the United States, and each year honors leaders from the academic, business, and government sectors who address the critical challenges facing the global society.

“Julia Hirschberg has been a pioneer and leader in the field of computational linguistics as well as the field of computer science more broadly,” says Mary Boyce, Dean of Columbia Engineering. “We are proud to see her recognition by the Academy in the Class of 2018!”

“I’m delighted and honored to receive this award,” said Hirschberg, who is the Percy K. and Vida L.W. Hudson Professor of Computer Science, chair of the Computer Science Department, as well as a member of the Data Science Institute.

Hirschberg’s area of research is computational linguistics, where she focuses on prosody, the relationship between intonation and discourse. Her current projects include research into emotional and deceptive speech, spoken dialogue systems, entrainment in dialogue, speech synthesis, text-to-speech synthesis in low-resource languages, and hedging behaviors.

Hirschberg, who joined Columbia Engineering in 2002 as a professor in the Department of Computer Science and has served as department chair since 2012, earned her PhD in computer and information science from the University of Pennsylvania. She worked at AT&T Bell Laboratories, where in the 1980s and 1990s she pioneered techniques in text analysis for prosody assignment in text-to-speech synthesis, developing corpus-based statistical models that incorporate syntactic and discourse information, models that are in general use today.

Among her many honors, Hirschberg is a member of the National Academy of Engineering (2017) and a fellow of the IEEE (2017), the Association for Computing Machinery (2016), the Association for Computational Linguistics (2011), the International Speech Communication Association (2008), and the Association for the Advancement of Artificial Intelligence (1994); she is a recipient of the IEEE James L. Flanagan Speech and Audio Processing Award (2011) and the ISCA Medal for Scientific Achievement (2011). In 2007, she received an Honorary Doctorate from the Royal Institute of Technology, Stockholm, and in 2014 was elected to the American Philosophical Society.

Hirschberg serves on numerous technical boards and editorial committees, including the IEEE Speech and Language Processing Technical Committee and the board of the Computing Research Association’s Committee on the Status of Women in Computing Research (CRA-W). Previously she served as editor-in-chief of Computational Linguistics and co-editor-in-chief of Speech Communication and was on the Executive Board of the Association for Computational Linguistics (ACL), the Executive Board of the North American ACL, the CRA Board of Directors, the AAAI Council, the Permanent Council of International Conference on Spoken Language Processing (ICSLP), and the board of the International Speech Communication Association (ISCA). She also is noted for her leadership in promoting diversity, both at AT&T Bell Laboratories and Columbia, and for broadening participation in computing.

Hirschberg is the fifth professor within the department to be elected to the Academy, joining Alfred Aho (2003), Shree Nayar (2011), Christos Papadimitriou (2001), and Jeannette Wing (2010). Across Columbia Engineering, four other faculty members were also previously elected to the Academy. They are Mary Boyce (2004), Mark Cane (2002), Aron Pinczuk (2009), and Renata Wentzcovitch (2013).

Posted 04/18/2017

Luca Carloni and Georgia Karagiorgi (Physics) are among 2018 RISE awardees

Luca Carloni

For their collaborative project “Acceleration of Deep Neural Networks via Heterogeneous Computing for Real-Time Processing of Neutrino and Particle-Trace Imagery,” Luca Carloni of the Computer Science Department and Georgia Karagiorgi of the Department of Physics were among the five teams awarded funding through the Research Initiatives in Science and Engineering (RISE) competition. Created in 2004, RISE is one of the largest internal research grant competitions within the University, and each year provides funds for interdisciplinary faculty teams from the basic sciences, engineering, and medicine to explore paradigm-shifting and high-risk ideas. This year 29 teams presented pre-proposals; of the nine teams asked to submit full proposals, five were selected to receiving funding in the amount of $80,000 per year for up to two years.

Georgia Karagiorgi

Carloni and Karagiorgi were awarded funding to develop machine learning techniques and to explore deep neural network implementations for “listening” to only the most important and rare physics signals while disregarding environmental noise and other accidental background signals. To do so, they will build a data processing system that can facilitate real-time processing and accurate classification of images streamed at rates on the order of terabytes per second. The primary target application is the future Deep Underground Neutrino Experiment (DUNE). This is a major international particle physics experiment that will be operational in the US for more than a decade, beginning in 2024, and will be continually streaming high-resolution 3D images of the active detector region at a total data rate exceeding 5 terabytes per second. The ability to process this data in real time and to efficiently identify and accurately classify interesting activity in the detector would enable the discovery of rare particle interactions that have never been observed before. This ability however requires the development of an advanced data processing system. This RISE project aims to develop a scalable heterogeneous computing system that employs machine learning for identification and classification of interesting activity in the data. The ultimate goal is to leverage recent advancements in computer science to render the DUNE experiment a powerfully sensitive instrument for fundamental particle physics discovery.

“We are in the midst of a Big Data revolution, wherein our capacity to generate data greatly exceeds our ability to analyze and make sense of everything,” says Karagiorgi. “In my own discipline of particle physics, we think that there is a great deal of information hidden in astroparticle data sets, such as those that will be recorded by the DUNE experiment. But we will need advanced methods and powerful systems to scan through those data sets and efficiently and accurately fish out the information we care about. This project requires interdisciplinary collaboration between physics and computer science, and time – afforded by RISE – to explore our early-stage ideas. If we can develop a system for only capturing the most important data, this is a project that can feasibly capture the attention of the National Science Foundation or the Department of Energy.”

The seed money provided by RISE allows teams to more fully develop high-risk, high-reward ideas to better pursue other avenues of funding. Since 2004, RISE has awarded $9.62 million to 72 projects. These 72 teams later secured more than $55.4 million from governments and private foundations: a 600% return on Columbia’s initial investment. These projects have additionally garnered more than 130 peer-reviewed publications and educated more than 130 postdoctoral scholars and graduate, undergraduate, and high school students.

Applications for the 2019 competition run from September to early-October 2018, with five to six awarded teams announced by spring 2019. Beginning in September, visit the RISE website to apply.

Posted 04/10/2018