Multilingual Technologies and Language Diversity

 

Instructors:  Prof. Smaranda Muresan and Dr. Isabelle Zaugg

 

 

Course Meeting Time: Fridays 1:10-3:40

 

Office Hours Instructors:

Smaranda Muresan Fridays 4:00-5:00pm (320F, Data Science Institute, Interchurch Building)

Isabelle Zaugg Wednesdays 1:00-2:00pm (320, Data Science Institute, Interchurch Building)

 

Office Hours TA:

Sujay Khandagale: TBD

 

 

Short Description

Innovations in digital technologies have shown their potential to be at times breathtakingly beneficial, and at others divisive or troubling. With regard to digital technologies’ impact on the ecosystem of language diversity, evidence suggests that new technologies are one contributor to the decline and predicted extinction of 50-90% of the world's languages this century. Yet digital innovations supporting a growing number of languages also have the potential to bolster language diversity in ways unimaginable a few years ago. Will innovations in multilingual natural language processing bring about a renaissance of language diversity, as users no longer need to rely on English and other dominant languages? To address this question, this course will introduce a dual view on language diversity: 1) a typology of language vitality and endangerment and 2) a resource-centric typology (low-resource vs. high-resource) regarding the availability of data resources to develop computational models for language analysis. This course will address the challenge of scaling natural language processing technologies developed mostly for English to the rich diversity of human languages. The resource-centric typology will also contribute to the dialogue about what “Data Science” is. Much research has been dedicated to the “Big Data” scenario; however, “Small Data” poses equally challenging problems, which this course will highlight. This course brings data and computational literacy about multilingual technologies to humanities students, while also exposing computer science and data science students to ethical, cultural, business, and policy issues within the context of multilingual technologies.

 

This 4000-level course is cross-listed in the Department of Computer Science and the Institute for Comparative Literature and Society (ICLS), and is open to both upper-level undergraduate and graduate students. The class may also organize programming bootcamps for ICLS students, led by the class assistant in the first 4-5 weeks of the semester. The class will also provide relevant readings and open-source, state-of-the-art NLP tools and datasets for students to use. The final project will team up students in CS and ICLS.

 

Academic Integrity

Columbia's intellectual community relies on academic integrity and responsibility as the cornerstone of its work. Graduate students are expected to exhibit the highest level of personal and academic honesty as they engage in scholarly discourse and research. In practical terms, you must be responsible for the full and accurate attribution of the ideas of others in all of your research papers and projects; you must be honest when taking your examinations; you must always submit your own work and not that of another student, scholar, or internet source. Graduate students are responsible for knowing and correctly utilizing referencing and bibliographical guidelines. When in doubt, consult your professor. Citation and plagiarism-prevention resources can be found at the GSAS page on Academic Integrity and Responsible Conduct of Research (http://gsas.columbia.edu/academic-integrity).  Failure to observe these rules of conduct will have serious academic consequences, up to and including dismissal from the university. If a faculty member suspects a breach of academic honesty, appropriate investigative and disciplinary action will be taken following Dean's Discipline procedures (http://gsas.columbia.edu/content/disciplinary-procedures).

 

Disabilities Accommodations

If you have been certified by Disability Services (DS) to receive accommodations, please either bring your accommodation letter from DS to your professor’s office hours to confirm your accommodation needs, or ask your liaison in GSAS to consult with your professor.  If you believe that you may have a disability that requires accommodation, please contact Disability Services at 212-854-2388 or disability@columbia.edu.

Important: To request and receive an accommodation you must be certified by DS.

 

Percentage Breakdown of Grading:

 

Participation/Class activities (20%)

As a seminar, attendance and engagement in class activities are expected.

Brief Presentations (10%)

Ideally, pairs of CS and ICLS students.

1)    Language Quick-Fire: pick a language that neither student speaks and give a 5-minute presentation on it (assigned Week 2, due Week 3):

_      Language facts (endangered/low-resource/high-resource; demographics, location, script(s))

_      Linguistic characteristics (e.g., orthography, morphology, syntax)

_      Computational efforts such as resources, tools, scholarly articles

2)    Stakeholder Investigation (assigned Week 12, due Week 13)

_      What is the role of different stakeholders in the multilingual digital sphere?

Assignments (30%)

2 computational assignments

1)    Corpora Collection/Language ID/Code-switching (this will help with the project too): out Week 6, due Week 8

2)    Building lexicons: out Week 8, due Week 10

1 essay-based assignment

1)    Essay on the first 6 weeks of the course: out Week 6, due Week 7

Final Project (40%)

Combines a computational assignment and a writing assignment (team-based: pairs of CS and ICLS students).  Steps include:

  1. Proposal ideas (ungraded; used to check suitability and form teams): due Week 4
  2. Proposal (one-pager): due Week 6
  3. Lit review (after midterm): due Week 12
  4. Final project presentations: last day of class, Week 14
  5. Final project and written report: due 1 week after final presentations
    1. Computational project
    2. Essay tying the project together with themes in the course

 

 

 

 

 

Class Sessions (the reading list is tentative and may be updated)

 

1.  Jan 24. ISABELLE & SMARA:  What is the relationship between language diversity and digital technologies?  Introduce syllabus.  Introduction to the historical trajectory of multilingual computing.  Introduction to rapidly diminishing language diversity.

_      Introduce all the dimensions we will be looking at:

_      Language Vitality Scales: Ethnologue and UNESCO typologies with language examples

_      Digital resource-centric typologies: impoverished, low-resource, high-resource languages, including examples

_      Linguistic Typology:

_      Morphology (agglutinative, fusional, polysynthetic)

_      Syntax (SOV, SVO, OVS)

_      Script & Writing system

_      Readings:

_      Zaugg, I. “Digitizing Ethiopic:  Coding for Linguistic Continuity in the Face of Digital Extinction,” 2017, pp. 13-16

_      Benjamin, M. (2016) “Digital Language Diversity:  Seeking the Value Proposition.” In Collaboration and Computing for Under-Resourced Languages:  Towards an Alliance for Digital Language Diversity, pp. 52–58.

 

2.  Jan 31. ISABELLE & SMARA:  Language Diversity: Vitality, Resource-centric, Linguistic Typology. High-level discussion of the impact on language communities if multilingual technologies are or are not developed (machine translation (MT), morphological analyzers, learning lexicons, part-of-speech (POS) taggers). Assign the Language Quick-Fire exercise. Introduce the final project.

_      Readings:

_      Kornai’s compilation of language resources (useful for Language Quick-fire)

_      Kornai, A. “Digital Language Death.” PLoS ONE 8, no. 10 (Oct. 22, 2013): e77056.

_      Gibson, M. (2016). Assessing Digital Vitality:  Analytical and Activist Approaches. In C. Soria et al. (Eds.), CCURL 2016: Towards an Alliance for Digital Language Diversity (pp. 46–61; see the page numbers printed on the page, not the auto-generated PDF page count).

_      Rosenberg, T. “Everyone Speaks Text Message.” New York Times, Dec. 9, 2011.

_      Optional:

_      Young, H. “The Digital Language Divide.” The Guardian, n.d.

_      Morais, L.  “Why I believe Almost All African Languages are Endangered.” Medium.com. July 1, 2016.

_      Prado, D. (2016). “Language Presence in the Real World and Cyberspace.” In Vannini, L., & Le Crosnier, H. (Eds.), Net.Lang:  Towards the Multilingual Cyberspace. Maaya Network, C&F édition, pp. 34-51.

 

3.  Feb 7. SMARA:  A critical look at resource-centric language typologies & linguistic typologies with an eye on their use in developing NLP technologies. Students present the Quick-Fire exercise. Introduction to how resource-centric typology influences the approach to developing language technologies, and how to port resources and computational models across languages (based on similarities in morphology, syntax, etc.).  What type and size of data is needed (e.g., parallel data, comparable corpora, code-switched and mixed-language documents)?

_      Learning and projecting morphological information

_      Learning and projecting syntactic information

_      Machine translation

_      Readings: 

o   Emerging Technology from the arXiv. 2019. “Machine Learning Has Been Used to Automatically Translate Long-Lost Languages.” MIT Technology Review, July 1, 2019.

o   Loomis, Pandey, & Zaugg, “Full Stack Language Enablement.” Steven R. Loomis Blog, June 6, 2017.

o   Benjamin, M., & Radetzky, P. (2014). Small Languages, Big Data: Multilingual Computational Tools and Techniques for the Lexicography of Endangered Languages. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages (pp. 15–23). Baltimore, Maryland, USA: Association for Computational Linguistics.

 

4.  Feb 14.  ISABELLE:  Diminishing Language Diversity:  What do we lose when a language dies?  Discussion of readings.  Either screening of first portion of “Language Matters with Bob Holman,” focused on Indigenous languages and normalcy of multilingualism in Northern Australia, or possible guest speakers:  Daniel Kaufman or Ross Perlin, Directors of NYC Endangered Language Alliance, or Skype-in Nicholas Evans or K. David Harrison.  Final Project proposal ideas (non-graded) due.

_      Readings:

o   Thurman, J. “Annals of Conservation:  A Loss for Words.” The New Yorker, March 30, 2015.

o   Romaine, S. (2015). The Global Extinction of Languages and Its Consequences for Cultural Diversity. In H.F. Marten et al. (Eds.), Cultural and Linguistic Minorities in the Russian Federation and the European Union (pp. 31–46). Switzerland: Springer International Publishing.

o   How to Prevent Language Extinction. (2010, June 16). MIT Technology Review.

o   McWhorter, J.  “The Cosmopolitan Tongue: The Universality of English,” 2009.  

o   Optional:

_      Temperton, J. (2015, Sept. 26). Languages are dying, but is the internet to blame? Wired UK.

_      Harrison, D.K. (2007).  Chapter 1: “A World of Many (Fewer) Voices.” When Languages Die:  The Extinction of the World’s Languages and the Erosion of Human Knowledge. Oxford University Press (pp. 3-21).

 

5.  Feb 21. SMARA: Corpora Collection; Language ID; Code-Switching.  Discuss web crawling (parallel corpora, Wikipedia as source material, dictionaries).  Discuss computational methods for language identification and code-switching.

_      Readings:

o   Scannell, K. P. (2007). The Crúbadán Project: Corpus building for under-resourced languages. Presented at the WAC3 Conference, Louvain-la-Neuve, Belgium.

o   Mendels, G., Soto, V., Jaech, A., and Hirschberg, J. (May 2018).  “Collecting Code-Switched Data from Social Media,” LREC, Miyazaki, Japan.

o   Zhang, Y., Riesa, J., Gillick, D., Bakalov, A., Baldridge, J., & Weiss, D.  (2018). A Fast, Compact, Accurate model for Language Identification of Codemixed Text. Proceedings of EMNLP 2018. 

o   Marco Lui and Timothy Baldwin. Accurate language identification of Twitter messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), pages 17–25, Gothenburg, Sweden, 2014.

o   Optional: Soto, V., & Hirschberg, J. (July 2018). “Joint Part-of-Speech and Language ID Tagging for Code-Switched Data,” Third Workshop on Computational Approaches to Linguistic Code-Switching at ACL, Melbourne, Australia.

 

_      Resources:

o   https://github.com/googlei18n/corpuscrawler/  

o   https://kamusi.org/#sitemap

o   Check out Google’s open-source language classifier:  https://github.com/google/cld3

o   Look at the Indigenous Tweets website and consider how tweets were ID’d as belonging to a particular language

o   Check Thamar Solorio’s work on codeswitching:

o   Check ACL workshops on Code-switching (3 editions)
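As a rough illustration of the character n-gram idea that underlies many language identification approaches, the following is a toy sketch only. It is not the method used by any tool or paper listed above. The training snippets in SAMPLES, the function names, and the choice of trigram counts with cosine similarity are all assumptions made for this example; real systems train on far more data and use stronger models.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased text, padded with one space on each side."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical, hand-written training snippets (an assumption for illustration).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language code whose trigram profile is most similar to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


def tag_tokens(sentence):
    """Label each whitespace-separated token on its own: a crude view of code-switched text."""
    return [(token, identify(token)) for token in sentence.split()]
```

Calling identify on a whole sentence is far more reliable than tag_tokens on single short words, which hints at why word-level language ID in code-switched text is treated as a harder problem than document-level language ID.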

 

6.  Feb 28. ISABELLE:  Scripts.  Lecture/discussion on script diversity in the digital sphere and Unicode’s role in promoting digital support for scripts. Assign Essay on the first 6 weeks of the course (due Week 7). Assign Language ID/Code-switching assignment (due Week 8). Final Project proposal (one-pager) due.

_      Reading:

_      Anderson, D. “Global Linguistic Diversity for the Internet.” Communications of the ACM 48, no. 1. Jan 2005: 27–28.

_      Lester, T. “New-Alphabet Disease?” The Atlantic, July ’97. 

_      Liu, L. H. (2015). Scripts in Motion:  Writing as Imperial Technology, Past and Present. Theories and Methodologies, 130(2), 375–383.

_      Erard, M.  “How the Appetite for Emojis Complicates the Effort to Standardize the World’s Alphabets.” New York Times Magazine. Oct. 18, 2017.

_      Optional:

_      Rousseau, J.-J. (1966). Chapter 5:  On Script. In J. G. Herder, On the Origin of Language (pp. 16–22, particularly pp. 16–17). Chicago and London: The University of Chicago Press.

_      Zaugg, I. (2017). Chapter 2:  Digital Governance Institutions and Digital Language Diversity, “Digitizing Ethiopic:  Coding for Linguistic Continuity in the Face of Digital Extinction.”

_      Wikimedia. (2018, March 12). Confound it! — Supporting languages with multiple writing systems.

 

 

7.  March 6. SMARA: What’s a Word?  Relate to the morphological typology (agglutinative, polysynthetic). Discuss segmentation issues and how to perform unsupervised segmentation across different languages. Discuss learning monolingual dictionaries and monolingual word representations (word embeddings).

_      Readings:

_      Mathias Creutz and Krista Lagus. Unsupervised discovery of morphemes. In Proceedings of the Workshop on Morphological and Phonological Learning of ACL-02, pages 21-30, Philadelphia, Pennsylvania, 11 July, 2002.

_      M. Baroni, G. Dinu and G. Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Proceedings of ACL 2014 (52nd Annual Meeting of the Association for Computational Linguistics), East Stroudsburg PA: ACL, 238-247. (Supplementary materials from this study: an archive of results with further models and parameter settings on the same benchmarks, and the best count and predict semantic vectors.)

_      Resources

_      Segmentation: Morfessor: https://morfessor.readthedocs.io/en/latest/

_      Learning Word Representations: https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92

_      Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html and https://github.com/dav/word2vec

_      GloVe: https://github.com/stanfordnlp/GloVe

_      FastText: https://github.com/facebookresearch/fastText

 

 

8.  March 13. SMARA: Learning Multilingual Word Representations.  Discuss bilingual or multilingual dictionaries and bilingual or multilingual word representations.  Evaluation: word-level translation. Corpora Collection/Language ID/Code-switching assignment due.  Assign Building dictionaries assignment (due Week 10).

_      Readings:

_      Thang Luong, Hieu Pham, and Christopher D. Manning (2015). Bilingual word representations with monolingual quality in mind. In Proc. of NAACL.

_      P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov. Enriching Word Vectors with Subword Information. Transactions of ACL. (Monolingual fastText models)

_      Resources and Additional Readings:

_       Bivec: https://nlp.stanford.edu/~lmthang/bivec/

_       Multilingual FastText: https://github.com/Babylonpartners/fastText_multilingual

_       Survey on Cross-lingual Word Embeddings: https://arxiv.org/abs/1706.04902

_      Samuel L. Smith, David H. P. Turban, Steven Hamblin and Nils Y. Hammerla (2017). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. ICLR 2017. (Multilingual fastText models)

 

 

9.  March 27. SMARA: Machine Translation (Technologies). History of machine translation. Brief intro to Statistical Machine Translation and Neural Machine Translation.

 

_      Reading:

_      Ramati, I., & Pinchevski, A. (2017). Uniform multilingualism: A media genealogy of Google Translate. New Media & Society, 1461444817726951.

_      Adam Lopez. 2008. Statistical machine translation. ACM Comput. Surv. 40, 3, Article 8 (August 2008).

_      Survey on Neural Machine Translation: https://arxiv.org/abs/1905.05395

 

_      Optional:

_      Lewis-Kraus, G. (2015, June 4). Is Translation an Art or a Math Problem? The New York Times.

_      Gibbs, R. (2012). Machine Translation in the Imperfect World.  (15-min video about how to develop MT for small data scenarios). 

 

 

10.  April 3. SMARA:  Machine Translation #2.  High Resource vs. Low Resource & Evaluation. What needs to change when working in a low-resource scenario?  How should the quality of translations be evaluated?  (Building dictionaries assignment due)

_      Readings:

_      Bar-Hillel, Y. (1960). A Demonstration of the Nonfeasibility of Fully Automatic High Quality Translation (Appendix III). In The present status of automatic translation of languages, (Vol. 1, pp. 158–163). 

_      Bonnie Dorr, Matt Snover, Nitin Madnani (2011). Machine Translation Evaluation.  Handbook of Natural Language Processing and Machine Translation. Joseph Olive, John McCary, and Caitlin Christianson (eds.)

_      Liu, L. H. (2018). The Battleground of Translation: Making Equal in a Global Structure of Inequality. Alif: Journal of Comparative Poetics, No. 38, Translation and the Production of Knowledge(s), 368–387.

o   Optional:

_      Zuckerman, E. Chapter “Found in Translation.” Rewire: Digital Cosmopolitans in the Age of Connection. W.W. Norton & Co, 2013.

_      Lehman-Wilzig, S. (2017). Babel and babble: Autonomous, algorithmic, simultaneous translation systems in the glocal village — consequences & paradoxical outcomes. In S. Brunn & R. Kehrein (Eds.), The Changing World Language Map. New York: Springer Publishing.
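As one concrete illustration of automatic translation evaluation, the sketch below computes clipped (modified) n-gram precision, the core ingredient of the BLEU metric. It is a minimal sketch and not a full BLEU implementation: it omits the brevity penalty and the combination of several n-gram orders, and it assumes simple whitespace tokenization. The function names are chosen for this example only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that appear in a reference translation.

    Each candidate n-gram is credited at most as many times as it occurs
    in any single reference, so repeating a common word does not help.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # For every n-gram, the largest count observed in any one reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


references = ["the cat is on the mat", "there is a cat on the mat"]
# A degenerate candidate: every word is "in a reference", yet the score is low.
score = clipped_precision("the the the the the the the", references, n=1)
```

Here score is 2/7, because "the" occurs at most twice in any single reference. In a low-resource setting there is often only one reference translation, which makes scores of this kind noisier and is one reason human evaluation remains important.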

 

 

11.  April 10. ISABELLE:  A Theoretical Framework:  How can we understand the relationship between ICTs and Language Diversity?  Discussion of design.  Lecture and discussion on readings.

_      Reading:

o   Winner, L. “Do Artifacts Have Politics?” In The Whale and the Reactor, Univ. of Chicago Press, 1980, pp. 19-39.

o   Zaugg, I. “Digitizing Ethiopic:  Coding for Linguistic Continuity in the Face of Digital Extinction,” 2017, pp. 16-27.

o   Olohan, M. (2017). Technology, translation and society: A constructivist, critical theory approach. Target. International Journal of Translation Studies, 29(2), 264–283.  

o   Innis, H. A., & Watson, A. J. (2008). Introduction. The Bias of Communication (2nd edition). Toronto; Buffalo, NY: University of Toronto Press.

o   Optional:

_       Carey, J. (2009). A Cultural Approach to Communication. In Communication as Culture:  Essays on Media and Society (Revised Edition, pp. 11–28). Taylor & Francis. 

_       Castells, M.  Chapter 2: “The Rise of the Fourth World:  Informational Capitalism, Poverty, and Social Exclusion.” In End of Millennium, 2nd Edition. Blackwell Publ, 2000, (p. 68-82 and 165-168, or full chapter). 

 

12.  April 17. ISABELLE:   Who are the stakeholders in creating a multilingual digital sphere?  Discussion of readings and role of international tech companies, local tech companies, governments, military/surveillance, users/volunteers, international governance institutions, language advocates, etc. The readings are assigned as a “jigsaw,” i.e. students are assigned to read one of the readings and make a 1-2 min. presentation on it.  Each student also responds to one interview blog post with a digital pioneer minority language speaker from:  Scannell, K. “Indigenous Tweets Blog,” indigenoustweets.com (hint: most of these posts are from 2011).  For the next class, students will examine one stakeholder in more depth, including mission statement, “true” interests (if they diverge from mission statement), and how it is working towards or against a multilingual digital sphere.  (Lit Review for Final Project due)

 

_      Readings:

_      Vannini, L., & Le Crosnier, H. (Eds.). (2016). Net.Lang:  Towards the Multilingual Cyberspace. Maaya Network, C&F édition.

_      Braffort, A. & Dalle, P., “Accessibility in Cyberspace: Sign Languages” (pp. 249-268).

_      Pierangeli Borletti, I., “Describing the World: Multilingualism, the Internet, and Human Rights” (pp. 351-372).

_      Bortzmeyer, S., “Multilingualism and Internet Governance” (pp. 373-386).

_      Diki-Kidiri, M., “Ethical Principles Required for an Equitable Language Presence in the Information Society” (pp. 387-400).

_      Grumbach, S., “The Internet in China” (pp. 401-406).

_      Oustinoff, M., “The Economy of Languages” (pp. 407-422).

_      Prado, D. & Pimienta, D., “Public Policies for Languages in Cyberspace” (pp. 423-436).

_      Abdul Rahim, R. “Linguistic Diversity in the Internet Root: The Case of the Arabic Script and Jawi.” ICANN, Sept 8, 2015.

_      “Discover SIL.” SIL International, May 1, 2012.

_      UNESCO Information for All Programme. “Keynote” and “Multilingualism in Russia” (p. 18-47).  Linguistic and Cultural Diversity in Cyberspace. Proceedings of the 3rd Intl. Conference (Yakutsk, Russian Federation): Moscow: ILCC, 2014.

_      “Interview w/Joseph Olive, Program Manager, DARPA.” Human Language Technologies for Europe. European Commission, April ’06.

_      “Adlam: ‘The Alphabet That Will Save a People from Disappearing’,” The Atlantic. Accessed January 28, 2017.

_       “Apple Helps Preserve Native American Language.” Foxnews.com. Associated Press, December 23, 2010.

_      “Mapuche Indians to Bill Gates: Hands off Our Language.” Reuters, The Sydney Morning Herald, Nov 24, 2006.

_      Whiteley, P. “Do ‘Language Rights’ Serve Indigenous Interests? Some Hopi and Other Queries.” American Anthropologist 105, no. 4 (2003): 712–22.

_      Wadhwa, K., & Fung, H. (2014). Converting Western Internet to Indigenous Internet: Lessons from Wikipedia. Innovations: Technology, Governance, Globalization, 9(3-4), 127–135.

_      Saleh, N. (2010).  “A Human Rights Approach to Globalization.” Third World Citizens and the Information Technology Revolution. Information Technology and Global Governance. Palgrave Macmillan, (p. 1-23).

_      Malcomson, S. (2016). Splinternet: How Geopolitics and Commerce Are Fragmenting the World Wide Web. OR Books. pp. 7-12. 

 

13.   April 24. ISABELLE:  What is the “clash of values” when it comes to “global language justice”?   Students each present for 2-3 minutes on the digital stakeholder they chose last week, its mission statement, and how that intersects with support for language diversity.  If time remains, students are divided into teams and compete in a class debate about the ideal relationship between multilingual technologies and language diversity.

 

14.  May 1. ISABELLE & SMARA:  Final Project Presentations (last day of class):  Students present their final projects in a festive setting (organized as a poster session).  The final paper, including the computational project and an essay tying the project together with themes in the course, is due a week after the end of class.