Multilingual Technologies and Language Diversity

 

Instructors:  Prof. Smaranda Muresan and Dr. Isabelle Zaugg

Course Meeting Time: Fridays 10:10 AM – 12:40 PM

Office Hours: Smaranda Muresan, by appointment on Monday, 4-5pm

Isabelle Zaugg, by appointment on Monday and Thursday, 10-11am

 

Short Description

Innovations in digital technologies have shown their potential to be at times breathtakingly beneficial, and at others divisive or troubling. With regard to digital technologies’ impact on the ecosystem of language diversity, evidence suggests that new technologies are one contributor to the decline and predicted extinction of 50-90% of the world's languages this century. Yet digital innovations supporting a growing number of languages also have the potential to bolster language diversity in ways unimaginable a few years ago. Will innovations in multilingual natural language processing bring about a renaissance of language diversity, as users no longer need to rely on English and other dominant languages? To address this question, this course will introduce a dual view on language diversity: 1) a typology of language vitality and endangerment and 2) a resource-centric typology (low-resource vs. high-resource) regarding the availability of data resources to develop computational models for language analysis. This course will address the challenge of scaling natural language processing technologies developed mostly for English to the rich diversity of human languages. The resource-centric typology will also contribute to the dialogue about what “Data Science” is. Much research has been dedicated to the “Big Data” scenario; however, “Small Data” poses equally challenging problems, which this course will highlight. This course brings data and computational literacy about multilingual technologies to humanities students, while also exposing computer science and data science students to ethical, cultural, business, and policy issues within the context of multilingual technologies.

This 4000-level course is cross-listed in the Department of Computer Science and the Institute for Comparative Literature and Society (ICLS), and is open to both upper-level undergraduate and graduate students. The class will provide relevant readings as well as open-source, state-of-the-art NLP tools and datasets for students to use. The final project will pair students from CS and ICLS.

 

Academic Integrity

Columbia's intellectual community relies on academic integrity and responsibility as the cornerstone of its work. Graduate students are expected to exhibit the highest level of personal and academic honesty as they engage in scholarly discourse and research. In practical terms, you must be responsible for the full and accurate attribution of the ideas of others in all of your research papers and projects; you must be honest when taking your examinations; you must always submit your own work and not that of another student, scholar, or internet source. Graduate students are responsible for knowing and correctly utilizing referencing and bibliographical guidelines. When in doubt, consult your professor. Citation and plagiarism-prevention resources can be found at the GSAS page on Academic Integrity and Responsible Conduct of Research (http://gsas.columbia.edu/academic-integrity).  Failure to observe these rules of conduct will have serious academic consequences, up to and including dismissal from the university. If a faculty member suspects a breach of academic honesty, appropriate investigative and disciplinary action will be taken following Dean's Discipline procedures (http://gsas.columbia.edu/content/disciplinary-procedures).

 

Disabilities Accommodations

If you have been certified by Disability Services (DS) to receive accommodations, please either bring your accommodation letter from DS to your professor’s office hours to confirm your accommodation needs, or ask your liaison in GSAS to consult with your professor.  If you believe that you may have a disability that requires accommodation, please contact Disability Services at 212-854-2388 or disability@columbia.edu.

Important: To request and receive an accommodation you must be certified by DS.

 

Percentage Breakdown of Grading:

 

Participation/Class activities (20%)

As this is a seminar, attendance and engagement in class activities are expected.  Attendance will be taken at class sessions.

Brief Presentations (15%)

1)    Language Quick-Fire (10%; assigned Week 2, due Week 3): Pairs of students, ideally one CS student and one ICLS student, pick a language that neither of them speaks and give a 5-minute presentation covering:

_      Language facts (endangered/low-resource/high-resource; demographics, location, script(s))

_      Linguistic characteristics (e.g., orthography, morphology, syntax)

_      Computational efforts such as resources, tools, scholarly articles

2)    Stakeholder Investigation (5%; assigned Week 10, due Week 11)

_      What is the role of different stakeholders in the multilingual digital sphere?

Assignments (20%)

3)    Essay on the first 6 weeks of the course: What is your normative stance on Digital Language Justice?  Assigned Week 5, due Week 6.

Final Project (45%)

4)    Combines both computational and writing portions (team-based: pairs of CS and Humanities students).  Steps include:

  1. Proposal ideas (non-graded; used to check suitability and form teams): Due week 4
  2. Proposal (one-pager):  Due week 5
  3. Lit review: Due week 8
  4. Final project presentations: last day of class, week 12
  5. Final project slides and written report:  Due 1 week after final presentations
    1. Computational project
    2. Essay tying the project together with themes in the course

 

 

 

Class Sessions (the reading list is tentative and may be updated)

 

1.  Jan 15. ISABELLE & SMARA:  What is the relationship between language diversity and digital technologies?  Introduce syllabus.  Introduction to the historical trajectory of multilingual computing.  Introduction to rapidly diminishing language diversity.  Instructor and student introductions.

_      Introduce all the dimensions we will be looking at:

_      Language Vitality Scales: Ethnologue and UNESCO typologies with language examples

_      Digital resource-centric typologies: impoverished, low-resource, high-resource languages, including examples

_      Linguistic Typology:

_      Morphology (agglutinative, fusional, polysynthetic)

_      Syntax (SOV, SVO, OVS)

_      Script & Writing system

_      Readings:

_      Zaugg, I. “Digitizing Ethiopic:  Coding for Linguistic Continuity in the Face of Digital Extinction,” 2017, pp. 13-16

_      Loomis, Pandey, & Zaugg, “Full Stack Language Enablement.” Steven R. Loomis Blog, June 6, 2017.

_      Benjamin, M. (2016) “Digital Language Diversity:  Seeking the Value Proposition.” In Collaboration and Computing for Under-Resourced Languages:  Towards an Alliance for Digital Language Diversity, pp. 52–58.

 

2.  Jan 22. ISABELLE (& SMARA):  Language Diversity: Vitality, Resource-centric, and Linguistic Typologies. High-level discussion of the impact on language communities if multilingual technologies are or are not developed (machine translation (MT), morphological analyzers, learning lexicons, part-of-speech (POS) taggers). Are there downsides to digital inclusion?  Introduce the Final Project.  Students share languages and NLP areas of interest to facilitate group-building.  Assign the Language Quick-fire exercise.

_      Readings:

_      Kornai’s compilation of language resources (useful for Language Quick-fire)

_      Kornai, A. “Digital Language Death.” PLoS ONE 8, no. 10 (Oct. 22, 2013): e77056.

_      Gibson, M. (2016). Assessing Digital Vitality:  Analytical and Activist Approaches. In C. Soria et al. (Eds.), CCURL 2016: Towards an Alliance for Digital Language Diversity (pp. 46-51).

_      Bird, S. (2020). Decolonising Speech and Language Technology. Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020), 3504–3519.

_      Zaugg, I. “Theorizing Global Language Justice in the Digital Sphere,” Global Language Justice Anthology, forthcoming 2021.

 

3.  Jan 29. SMARA:  A critical look at resource-centric language typologies & linguistic typologies with an eye on their use in developing NLP technologies. Students present the quick-fire exercise. Introduction to how resource-centric typology influences the approach to developing language technologies, and how to port resources and computational models across languages (based on similarities in morphology, syntax, etc.).  What type and size of data are needed (e.g., parallel data, comparable corpora, code-switched and mixed-language documents)?

_      Learning and projecting morphological information

_      Learning and projecting syntactic information

_      Machine translation

 

4.  Feb 5.  ISABELLE:  Diminishing Language Diversity:  What do we lose when a language dies?  Discussion of readings.  Possible screening of the first portion of “Language Matters with Bob Holman,” focused on Indigenous languages and the normalcy of multilingualism in Northern Australia, and/or possible guest speakers:  Daniel Kaufman or Ross Perlin, Directors of the NYC Endangered Language Alliance.  Final Project proposal ideas (non-graded) due.

_      Readings:

o   Thurman, J. “Annals of Conservation:  A Loss for Words.” The New Yorker, March 30, 2015.

o   Romaine, S. (2015). The Global Extinction of Languages and Its Consequences for Cultural Diversity. In H.F. Marten et al. (Eds.), Cultural and Linguistic Minorities in the Russian Federation and the European Union (pp. 31–46). Switzerland: Springer International Publishing.

o   How to Prevent Language Extinction. (2010, June 16). MIT Technology Review.

o   McWhorter, J.  “The Cosmopolitan Tongue: The Universality of English,” 2009.  

o   Optional:

_      Temperton, J. (2015, Sept. 26). Languages are dying, but is the internet to blame? Wired UK.

_      Harrison, D.K. (2007).  Chapter 1: “A World of Many (Fewer) Voices.” When Languages Die:  The Extinction of the World’s Languages and the Erosion of Human Knowledge. Oxford University Press (pp. 3-21).

_      Grubin, D. (2015, January 25). Language Matters with Bob Holman. David Grubin Productions Inc. and Pacific Islanders in Communications. (Interview with poet-filmmaker Bob Holman)

 

5.  Feb 12. ISABELLE:  Scripts.  Possible guest speaker Anshuman Pandey:  Lecture/discussion on script diversity in the digital sphere and Unicode’s role in promoting digital support for scripts, as well as his own experience identifying digitally-disadvantaged communities online and doing primary research and community connection work in South Asia and Southeast Asia to incorporate their scripts into Unicode. Assign Essay on the first 6 weeks of the course: normative stance on “Digital Language Justice” (due next week, week 6).  Final Project proposal (one-pager) due.

_      Reading:

_      Anderson, D. “Global Linguistic Diversity for the Internet.” Communications of the ACM 48, no. 1. Jan 2005: 27–28.

_      Lester, T. “New-Alphabet Disease?” The Atlantic, July ’97. 

_      Liu, L. H. (2015). Scripts in Motion:  Writing as Imperial Technology, Past and Present. Theories and Methodologies, 130(2), 375–383.

_      Erard, M.  “How the Appetite for Emojis Complicates the Effort to Standardize the World’s Alphabets.” New York Times Magazine. Oct. 18, 2017.

_      Optional:

_      Rousseau, J.-J. (1966). Chapter 5:  On Script. In J. G. Herder, On the Origin of Language (pp. 16–22, particularly pages 16-17). Chicago and London: The University of Chicago Press.

_      Zaugg, I. (2017). Chapter 2:  Digital Governance Institutions and Digital Language Diversity, “Digitizing Ethiopic:  Coding for Linguistic Continuity in the Face of Digital Extinction.”

_      Wikimedia. (2018, March 12). Confound it! — Supporting languages with multiple writing systems.

 

6.  Feb 19. SMARA: What’s a Word?  Relation to morphological typology (agglutinative, polysynthetic). Segmentation issues, and how to perform unsupervised segmentation across different languages. Word classes: part of speech and cross-lingual approaches to POS tagging.  Guest Lecturer: Ramy Eskander.  Essay on first 6 weeks due (normative stance on “Digital Language Justice”).  Introduction of Final Project Lit Review expectations.

_      Readings:

_      Mathias Creutz and Krista Lagus. Unsupervised discovery of morphemes. In Proceedings of the Workshop on Morphological and Phonological Learning of ACL-02, pages 21-30, Philadelphia, Pennsylvania, 11 July, 2002.

_      To be added: 1 or 2 of Ramy’s papers

_      Resources:

_      Segmentation:

_      Morfessor: https://morfessor.readthedocs.io/en/latest/

_      MorphAGram: https://github.com/rnd2110/MorphAGram

_      POS tagging: TBA
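As a rough illustration of the segmentation topic in this session, the sketch below splits a word into morphs by choosing the lowest-cost path through a lexicon of morphs with counts. It is a minimal example under stated assumptions, not the Morfessor or MorphAGram API: the function name `segment` and the toy Turkish-like lexicon are hypothetical, and the lexicon is given by hand, whereas unsupervised methods must learn it from raw text.

```python
import math


def segment(word, morph_counts):
    """Split `word` into known morphs, minimizing total negative log-probability.

    `morph_counts` maps each morph to a count (a hypothetical, hand-made lexicon).
    Returns the word unsplit if no full segmentation into known morphs exists.
    """
    total = sum(morph_counts.values())
    # best[i] holds (cost, start index of the last morph) for the prefix word[:i].
    best = [(0.0, 0)] + [(math.inf, 0)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in morph_counts and best[start][0] < math.inf:
                cost = best[start][0] - math.log(morph_counts[piece] / total)
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[-1][0] == math.inf:
        return [word]
    # Follow the back-pointers from the end of the word to recover the morphs.
    morphs = []
    end = len(word)
    while end > 0:
        start = best[end][1]
        morphs.append(word[start:end])
        end = start
    return morphs[::-1]


# Toy lexicon (assumed counts, for illustration only).
TOY_LEXICON = {"ev": 5, "ler": 8, "de": 6, "evler": 1}
print(segment("evlerde", TOY_LEXICON))
```

With these assumed counts, the three-morph analysis is cheaper than the analysis that uses the rarer whole form "evler", so frequent short morphs are preferred.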

 

 

7.  Feb 26. SMARA: Learning Word Meanings (Representations).  Monolingual and multilingual dictionaries; monolingual word representations (word embeddings); contextualized word embeddings; multilingual word embeddings. How should they be evaluated?  Students provide updates on Final Projects and ask questions.

_      Readings:

_      Liu, Lydia H. (2021). Wittgenstein in the Machine. Critical Inquiry.

_      TBA


_      Resources and Additional Readings:

_      https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92

_      Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html

_      Word2Vec: https://github.com/dav/word2vec

_      GloVe: https://github.com/stanfordnlp/GloVe

_      FastText: https://github.com/facebookresearch/fastText

_      Bivec: https://nlp.stanford.edu/~lmthang/bivec/

_      Multilingual FastText: https://github.com/Babylonpartners/fastText_multilingual

_      Survey on Cross-lingual Word Embeddings: https://arxiv.org/abs/1706.04902

_      Samuel L. Smith, David H. P. Turban, Steven Hamblin and Nils Y. Hammerla (2017). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. ICLR 2017 (Multilingual fastText models)

_      P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov (2017). Enriching Word Vectors with Subword Information. Transactions of ACL. (Monolingual fastText models)

_      Thang Luong, Hieu Pham, and Christopher D Manning (2015). Bilingual word representations with monolingual quality in mind. In Proc. of NAACL
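One common way to inspect word embeddings, relevant to the "how to evaluate" question in this session, is to compare vectors by cosine similarity and look at a word's nearest neighbors. The sketch below is a minimal illustration using only the Python standard library. The three-dimensional vectors are made up for the example and do not come from any trained model, and the helper names are hypothetical rather than part of gensim or fastText.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, vectors, k=2):
    """Return the k words most similar to `word`, as (word, similarity) pairs."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors in a shared English/Spanish space (assumption, for illustration only).
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "gato": [0.85, 0.15, 0.05],
    "car": [0.1, 0.9, 0.2],
}

for neighbor, score in nearest_neighbors("cat", TOY_VECTORS):
    print(neighbor, round(score, 3))
```

In a multilingual embedding space, a translation pair such as "cat" and "gato" would ideally rank among each other's nearest neighbors, which is the intuition behind bilingual lexicon induction as an evaluation task.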

 

 

8.  March 12. SMARA: Machine Translation (Technologies). History of Machine Translation. Brief intro to Statistical Machine Translation and Neural Machine Translation. Final Project Lit Review due.

_      Reading:

_      Ramati, I., & Pinchevski, A. (2017). Uniform multilingualism: A media genealogy of Google Translate. New Media & Society, 1461444817726951.

_      Benjamin, M. (2019, April 1). The Astounding Mathematics of Machine Translation. Teach You Backwards. https://www.teachyoubackwards.com/mt-mathematics/

_      Adam Lopez. 2008. Statistical machine translation. ACM Comput. Surv. 40, 3, Article 8 (August 2008).

_      Optional:

_      Survey in Neural Machine Translation: https://arxiv.org/abs/1905.05395

_      Lewis-Kraus, G. (2015, June 4). Is Translation an Art or a Math Problem? The New York Times.

_      Gibbs, R. (2012). Machine Translation in the Imperfect World.  (15-min video about how to develop MT for small data scenarios). 

 

 

9.  March 19. SMARA:  Machine Translation #2.  High Resource vs. Low Resource & Evaluation. Tentative guest speaker:  Mona Diab, “Faithfulness in natural language generation in an era of heightened ethical AI awareness: opportunities for MT.”  Students provide Final Project updates and ask questions.

_      Readings:

_      Bar-Hillel, Y. (1960). A Demonstration of the Nonfeasibility of Fully Automatic High Quality Translation (Appendix III). In The present status of automatic translation of languages, (Vol. 1, pp. 158–163). 

_      Liu, L. H. (2018). The Battleground of Translation: Making Equal in a Global Structure of Inequality. Alif: Journal of Comparative Poetics, No. 38, Translation and the Production of Knowledge(s), 368–387.

_      TBA.

o   Optional:

_      Bonnie Dorr, Matt Snover, Nitin Madnani (2011). Machine Translation Evaluation.  Handbook of Natural Language Processing and Machine Translation. Joseph Olive, John McCary, and Caitlin Christianson (eds.)

_      Zuckerman, E. Chapter “Found in Translation.” Rewire: Digital Cosmopolitans in the Age of Connection. W.W. Norton & Co, 2013.

_      Lehman-Wilzig, S. (2017). Babel and babble: Autonomous, algorithmic, simultaneous translation systems in the glocal village — consequences & paradoxical outcomes. In S. Brunn & R. Kehrein (Eds.), The Changing World Language Map. New York: Springer Publishing.
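To make the evaluation topic in this session concrete, the sketch below computes clipped (modified) n-gram precision between a system translation and a single reference. This is one ingredient of BLEU-style metrics, not a full BLEU implementation: it omits the brevity penalty, the combination across n-gram orders, and multiple references. The function names and example sentences are hypothetical, and tokenization is assumed to be simple whitespace splitting.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n=1):
    """Fraction of hypothesis n-grams found in the reference.

    Each n-gram is credited at most as many times as it occurs in the
    reference, so repeating a correct word does not inflate the score.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = "the cat is on the mat"
print(clipped_precision("the cat sat on the mat", reference, n=1))
print(clipped_precision("the cat sat on the mat", reference, n=2))
print(clipped_precision("the the the the the the the", reference, n=1))
```

The last call shows why clipping matters: a degenerate output that repeats a frequent word is credited only for the two occurrences of "the" that the reference actually contains.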

 

 

10.  March 26. ISABELLE:   Who are the stakeholders in creating a multilingual digital sphere?  Discussion of readings and role of international tech companies, local tech companies, governments, military/surveillance, users/volunteers, international governance institutions, language advocates, etc. The readings are assigned as a “jigsaw,” i.e. students are assigned to read one of the readings and make a 1-2 min. presentation on it.  Introduction of Stakeholder Investigation:  for the next class, students will examine one stakeholder in more depth, including mission statement, “true” interests (if they diverge from mission statement), and how it is working towards or against a multilingual digital sphere.

 

_      Readings (each student selects one to read and present - this list may be updated):

_      Winner, L. “Do Artifacts Have Politics?” In The Whale and the Reactor, Univ. of Chicago Press, 1980, pp. 19-39.

_      Castells, M.  Chapter 2: “The Rise of the Fourth World:  Informational Capitalism, Poverty, and Social Exclusion.” In End of Millennium, 2nd Edition. Blackwell Publ, 2000, (p. 68-82 and 165-168, or full chapter).

_      McGill University, & Benjamin, R. (2020, October 28). Mossman Lecture: Ruha Benjamin—Race to the Future? Reimagining the Default Settings of Technology & Society.

_      Innis, H. A., & Watson, A. J. (2008). Introduction.  The Bias of Communication (2nd edition). Toronto; Buffalo, NY: University of Toronto Press.

_      Carey, J. (2009). A Cultural Approach to Communication. In Communication as Culture:  Essays on Media and Society (Revised Edition, pp. 11–28). Taylor & Francis.  

_      Hao, K. (2020, December 4). We read the paper that forced Timnit Gebru out of Google. Here’s what it says. MIT Technology Review.

_      Vannini, L., & Le Crosnier, H. (Eds.). (2016). Net.Lang:  Towards the Multilingual Cyberspace. Maaya Network, C&F édition.

_      Braffort, A. & Dalle, P., “Accessibility in Cyberspace: Sign Languages” (pp. 249-268).

_      Pierangeli Borletti, I., “Describing the World: Multilingualism, the Internet, and Human Rights” (pp. 351-372).

_      Bortzmeyer, S., “Multilingualism and Internet Governance” (pp. 373-386).

_      Diki-Kidiri, M., “Ethical Principles Required for an Equitable Language Presence in the Information Society” (pp. 387-400).

_      Grumbach, S., “The Internet in China” (pp. 401-406).

_      Oustinoff, M., “The Economy of Languages” (pp. 407-422).

_      Prado, D. & Pimienta, D., “Public Policies for Languages in Cyberspace” (pp. 423-436).

_      Abdul Rahim, R. “Linguistic Diversity in the Internet Root: The Case of the Arabic Script and Jawi.” ICANN, Sept 8, 2015.

_      “Discover SIL.” SIL International, May 1, 2012.

_      UNESCO Information for All Programme. “Keynote” and “Multilingualism in Russia” (p. 18-47).  Linguistic and Cultural Diversity in Cyberspace. Proceedings of the 3rd Intl. Conference (Yakutsk, Russian Federation): Moscow: ILCC, 2014.

_      “Interview w/Joseph Olive, Program Manager, DARPA.” Human Language Technologies for Europe. European Com, April ‘06.

_      “Adlam: ‘The Alphabet That Will Save a People from Disappearing’,” The Atlantic. Accessed January 28, 2017.

_       “Apple Helps Preserve Native American Language.” Foxnews.com. Associated Press, December 23, 2010.

_      “Mapuche Indians to Bill Gates: Hands off Our Language.” Reuters, The Sydney Morning Herald, Nov 24, 2006.

_      Whiteley, P. “Do ‘Language Rights’ Serve Indigenous Interests? Some Hopi and Other Queries.” American Anthropologist 105, no. 4 (2003): 712–22.

_      Wadhwa, K., & Fung, H. (2014). Converting Western Internet to Indigenous Internet: Lessons from Wikipedia. Innovations: Technology, Governance, Globalization, 9(3-4), 127–135.

_      Saleh, N. (2010).  “A Human Rights Approach to Globalization.” Third World Citizens and the Information Technology Revolution. Information Technology and Global Governance. Palgrave Macmillan, (p. 1-23).

_      Malcomson, S. (2016). Splinternet: How Geopolitics and Commerce Are Fragmenting the World Wide Web. OR Books. pp. 7-12. 

_      Interview blog posts with digital pioneer minority language speakers from Kevin Scannell’s “Indigenous Tweets Blog,” indigenoustweets.com:

_      Scannell, K. (2011a, May 1). Indigenous Tweets: Not dead yet: John Gillingham on the Cornish Language. Indigenous Tweets.

_      Scannell, K. (2011c, May 23). Indigenous Tweets: Why Haitian Creole? Indigenous Tweets.  AND Scannell, K. (2011b, May 18). Indigenous Tweets: After the Quake: Jean Came Poulard on Haitian Creole. Indigenous Tweets.

_      Scannell, K. (2011d, June 7). Indigenous Tweets: Tír gan teanga, tír gan anam: Keola Donaghy on the Hawaiian language. Indigenous Tweets.

_      Scannell, K. (2011e, June 21). Indigenous Tweets: Meeting the Challenge: Edmond Kachale on Chichewa. Indigenous Tweets.

_      Scannell, K. (2011f, August 22). Indigenous Tweets: “We’re here, we’re using this language”: Michael Bauer on Scottish Gaelic. Indigenous Tweets.

_      Scannell, K. (2011g, September 7). Indigenous Tweets: In the shadow of Pinatubo: José Navarro on Kapampangan. Indigenous Tweets.

_      Scannell, K. (2011h, November 11). Indigenous Tweets: “Murdered on its native territory”: Jordan Kutzik on Yiddish. Indigenous Tweets.

_      Scannell, K. (2011i, December 6). Indigenous Tweets: Language revitalization through free software: the case of Aragonese. Indigenous Tweets.

_      Scannell, K. (2012, October 15). Facebook in your language. Indigenous Tweets.

_      Scannell, K. (2013, December 29). Mapping the Celtic Twittersphere. Indigenous Tweets.

 

11.   April 2. ISABELLE:  What is the “clash of values” when it comes to “global language justice?”   Students each present 2-3 minutes on the digital stakeholder they chose last week, their mission statement or agenda, and how that intersects with support for language diversity.  If extra time remains, students are divided into teams and compete in a class debate about the ideal relationship between multilingual technologies and language diversity.  Check-in on final projects.

 

12.  April 9. ISABELLE & SMARA:  Final Project Presentations (last day of class):  Students present their final projects in a festive setting (organized as a poster session).  The final paper, including the computational project and an essay tying the project together with themes in the course, is due a week after the end of class, April 16.