Sonya Allin and Melissa Holcombe
CS4721 Advanced Intelligent Systems
Final Project

Theme Identification in Alice in Wonderland


The literary scholar, when interpreting a fictional work, is currently limited to tools such as automatic concordance generators, various lexical databases, and search programs. These methods may help one find the frequency of a specific word in the text and all of the places where it is mentioned, but they are limited in the information they can provide: each word is treated as just a string of letters, devoid of semantics. Our effort, then, has been to offer literary scholars and sensitive readers alike a tool that helps them identify conceptual categories, or themes, that are prevalent within a large body of text, such as a novel.

The Problem

Although most of us may not be conscious of the fact, when we read, we assemble a massive amount of written information into very intricately formed conceptual hierarchies. These conceptual hierarchies have taken us a lifetime to build and refine, are specific to our language, and are subject to constant modification.

The nature of these hierarchies is particularly complicated in the domain of fiction. Authors frequently enjoy wordplay in their written work, and may use words that are rich in multiple meanings, specifically with the intention of widening the scope of their work's associations. Yeats, for example, was particularly fond of using the rose as a metaphor for Ireland. Yet the average computer would most certainly be inclined to resolve such metaphors to single meanings in the search for computational efficiency.

Moreover, fictional works may be particularly inclined to let a single meaning resonate over many different, extended passages. In Nabokov's Lolita, for instance, there are many passages in which a particular person, Quilty, appears. Yet he is never named in them and is always peripheral to events (in a car behind Humbert Humbert, at a gas station speaking with Lolita). We as readers come to understand that Quilty is in fact the same individual only after several meetings with him across hundreds of pages. The average computer, again, may be inclined to focus on local meanings in the quest for speed, and may therefore completely overlook the repetitions and patterns that reveal a fellow like Quilty.

What is wanting is the ability to trace the usage of words with similar meanings -- i.e., words that fall within certain themes. Such a tool would need to be able to generalize from specific words to the categories that contain them.

The Approach

In this section we describe a domain-independent method of identifying themes across a large body of text. The method entails constructing a tree that contains all of the nouns in the text, organized according to WordNet's hyponym/hypernym relations for nouns. Our sample domain was the book Alice in Wonderland. Our system succeeded in identifying concepts prevalent in the text, such as "playing card", "queen", and "rodent". Results have been made accessible through an interactive web interface.

Building the Hypernym Tree

To build the hypernym tree we first use the Church tagger to find nouns in Alice. Each noun is then located within the WordNet hyponym/hypernym hierarchy and added to our tree according to this location. Our tree, then, reflects a distinct subset of the WordNet hierarchy, tailored specifically to the Alice domain.

Each node of our tree is either a noun that actually appears in the text, or a concept along the hyponym path from a noun in the text to the top of the tree. Every noun node contains the number of times the word appears in the text. And every concept node contains a number representing its frequency, which is defined as follows:

the sum, over all nouns that connect to this concept via the hyponym relation, of the number of times each such noun occurs in the text.
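The accumulation step can be sketched as follows. The miniature hypernym map and the noun counts below are hypothetical stand-ins for the real WordNet hierarchy and the tagger's output; they are only meant to illustrate how every sense of a noun propagates its full count to each ancestor concept.

```python
from collections import defaultdict

# Toy hypernym map standing in for WordNet's noun hierarchy.  Multiple
# parents model multiple senses (e.g. 'queen' as both a face card and
# a female aristocrat; 'rabbit' as both an animal and rabbit fur).
HYPERNYMS = {
    "queen": ["face card", "female aristocrat"],
    "rabbit": ["leporid", "rabbit fur"],
    "leporid": ["lagomorph"],
    "lagomorph": ["mammal"],
    "face card": ["playing card"],
}

def concept_frequencies(noun_counts, hypernyms):
    """Add each noun's text count to every ancestor concept reachable
    through any of its senses (all senses weighted equally, as in our
    current implementation)."""
    freq = defaultdict(int)
    for noun, count in noun_counts.items():
        stack, seen = [noun], set()
        while stack:
            concept = stack.pop()
            if concept in seen:
                continue
            seen.add(concept)
            freq[concept] += count
            stack.extend(hypernyms.get(concept, []))
    return dict(freq)

# Hypothetical counts for two nouns tagged in the text:
freqs = concept_frequencies({"queen": 68, "rabbit": 51}, HYPERNYMS)
# freqs["mammal"] == 51, freqs["playing card"] == 68
```

The real system walks WordNet's hypernym paths instead of a hand-built map, but the bookkeeping is the same.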

We then assign each node a percentile ranking, on the assumption that a concept would weigh 100% if each and every sense of every word it covers were connected to it.

One complication of building the hypernym tree is that a given noun can have multiple senses, hence multiple paths up the tree. As our approach holds that each and every sense may in fact bear relevance to the text and cannot be ignored, we have added all senses of all nouns to our tree. As it currently stands in our implementation, each sense of each word is assigned equal weight.

We acknowledge that giving all senses equal weight does, on average, place undue weight on the more obscure senses. For example, our current implementation holds that, if the word 'cat' appears 18 times in the text, both the concept of 'x-rays' (via the CAT-scan sense of 'cat') and the concept of 'animals' will be assigned an additional frequency of 18.

We propose, however, a method by which one may adjust these weights according to a sliding scale, giving more weight to the more common senses. The most common sense is always assigned a weight of 1, while the weights for the other senses are reduced in increments of 1 / (numSenses + 1). Thus, if a word has four senses, its related concept weights will be 1, 4/5, 3/5, and 2/5, in decreasing order of commonality.
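A minimal sketch of this proposed weighting scheme, reading the decrement as 1 / (numSenses + 1) so that a four-sense word yields weights 1, 4/5, 3/5, 2/5:

```python
from fractions import Fraction

def sense_weights(num_senses):
    """Sliding-scale sense weights: the most common sense gets 1, and
    each subsequent sense is reduced by 1 / (num_senses + 1).
    Fractions are used so the weights stay exact."""
    step = Fraction(1, num_senses + 1)
    return [1 - i * step for i in range(num_senses)]

weights = sense_weights(4)
# [Fraction(1, 1), Fraction(4, 5), Fraction(3, 5), Fraction(2, 5)]
```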

The sample output below shows part of the hypernym tree generated for Alice in Wonderland. The number on the left of each entry is the level in the tree. The number in parentheses is the category weight, a measure of how many words in Alice appear within that category. The asterisked number indicates how many times the word appears within Alice (words without asterisks do not appear in Alice at all). For sample code, click here.

-- >>1: abstraction (8131) 
-- -- >>2: relation (3278) 
-- -- -- >>3: part, portion, component part, component (173) 
-- -- -- -- >>4: language unit, linguistic unit (133) 
-- -- -- -- -- >>5: word (69) 
-- -- -- -- -- -- >>6: head (47) 47*
-- -- -- -- -- -- >>6: heads (10) 10*
-- -- -- -- -- -- >>6: form, word form (1) 
-- -- -- -- -- -- -- >>7: roots (1) 1*
-- -- -- -- -- -- >>6: terms (1) 1*
-- -- -- -- -- >>5: name (11) 
-- -- -- -- -- -- >>6: label (2) 2*
-- -- -- -- -- >>5: names (2) 2*
-- -- -- -- -- >>5: sound (4) 4*
-- -- -- -- -- >>5: sounds (2) 2*
-- -- -- -- -- >>5: phone, speech sound, sound (1) 
-- -- -- -- -- -- >>6: consonant (1) 
-- -- -- -- -- -- -- >>7: stop (1) 1*
-- -- -- -- -- >>5: words (21) 21*
-- -- -- -- >>4: particular (1) 1*
-- -- -- -- >>4: item, point (12) 
-- -- -- -- -- >>5: place (8) 8*
-- -- -- -- -- >>5: places (2) 2*
-- -- -- -- -- >>5: position (2) 2*
-- -- -- -- >>4: rest (8) 8*

As we built the hypernym tree, we took care to record not only the number of times each word occurs in the text, but also where it occurs. These line numbers were stored in a flat file and were used to generate proximate concepts, as described in the following section.
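This bookkeeping can be sketched as follows. The sample lines are illustrative, and real code would want proper tokenization (punctuation stripping, etc.) rather than a whitespace split:

```python
from collections import defaultdict

def index_noun_lines(lines, nouns):
    """Map each noun to the list of (1-based) line numbers on which it
    occurs.  This index is what the proximity computation consults."""
    positions = defaultdict(list)
    for lineno, line in enumerate(lines, start=1):
        words = line.lower().split()
        for noun in nouns:
            if noun in words:
                positions[noun].append(lineno)
    return dict(positions)

idx = index_noun_lines(
    ["the white rabbit ran", "down the rabbit hole"],
    ["rabbit", "hole"],
)
# idx == {"rabbit": [1, 2], "hole": [2]}
```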

Extracting Information from the Tree

Once the tree has been constructed, we use several methods to extract information about themes. To this end, we built a web-based interface that allows the user to:

  • see the conceptual ordering of the text at all levels in the WordNet hierarchy.
  • select a concept and follow hyperlinks to every occurrence of words contained by that concept.
  • view the other concepts on the same level that are closest in proximity (i.e., which co-occur frequently in the same lines) to a chosen concept.

Concept proximity is a measure of how frequently two concepts appear within the same line. The concepts to be compared must be at the same level. The program looks up all of the descendants of each concept and searches the text to determine how many times a descendant of one concept appears in the same line as a descendant of the other concept. Concept proximity is calculated for every concept on the same level as the selected concept and then the results are sorted in decreasing order.
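The pairwise matching at the heart of this computation can be sketched as follows (the word sets and line numbers are hypothetical; in the real system each concept's word set comes from its WordNet descendants, and the line numbers from the flat-file index):

```python
def concept_proximity(words_a, words_b, positions):
    """Count same-line co-occurrences between any descendant word of
    concept A and any descendant word of concept B.  `positions` maps
    each word to the set of line numbers on which it occurs."""
    count = 0
    for wa in words_a:
        for wb in words_b:
            # Shared line numbers = co-occurrences of this word pair.
            count += len(positions.get(wa, set()) & positions.get(wb, set()))
    return count

positions = {"rabbit": {1, 7}, "queen": {7, 12}, "king": {12}}
concept_proximity({"rabbit"}, {"queen", "king"}, positions)  # 1
```

With l and m descendant words under the two concepts, this inner loop is the l*m factor in the complexity estimate given below; the set intersections are what the hash-table index makes fast.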

It should be noted that one word in a line may relate two concepts. For example, the word 'queen' may relate 'face card' to 'female aristocrat' at a higher level in the WordNet tree.

The complexity of these computations is on the order of l*m*n, where l is the number of words that appear under the first concept, m is the number of words under the second concept, and n is the number of concepts at a given level. We made use of hash tables to index all of the line numbers, so this lookup is relatively fast (see the source code for this and related routines). If the concepts are fairly low in the WordNet hierarchy, the computation may proceed quickly; otherwise it may be fairly slow.

We have concluded that the information on the levels of the tree with the greatest breadth (levels 6 - 9) would be of most interest to the user. The concepts at the upper levels, such as "abstraction" and "relation", are too general to provide much insight into the text's prevalent themes (although they may serve a purpose for some readers). Furthermore, because the program doesn't perform sense disambiguation, a certain amount of error percolates upward through the tree and infects the proximity relation. For example, 'carnivore' is seen to relate to 'king', as the words 'king' and 'queen' appear next to one another quite often in the text, and a 'queen' is sometimes considered a female lion. The nodes at the upper levels are most susceptible to inaccurate weightings. On the other hand, the very lowest levels are so specific that they do not provide much information about themes at all.


Results

This section describes the results obtained when the above procedure was applied to the entire text of Alice in Wonderland. Here we show the top ten most frequent concepts for levels 6 through 9, in order of increasing generality. The percentages refer to the number of times each concept appeared in the text compared to all of the concepts on the same level.

Level 9

  • leporid, leporid mammal (0.35%)
  • queen (0.33%)
  • king (0.28%)
  • turtle (0.27%)
  • front, front end, forepart (0.25%)
  • foam, froth (0.25%)
  • gryphon (0.24%)
  • head (0.21%)
  • rabbit (0.21%)
  • written record, written account (0.19%)
Level 8

  • face card, picture card, court card (0.67%)
  • chessman, chess piece (0.63%)
  • chelonian, chelonian reptile (0.54%)
  • musical notation (0.46%)
  • rodent, gnawer, gnawing animal (0.44%)
  • side, face (0.38%)
  • lagomorph, gnawing mammal (0.35%)
  • naked mole rat (0.34%)
  • record (0.32%)
  • mark (0.27%)
Level 7

  • placental, placental mammal, eutherian, eutherian mammal (2.58%)
  • anapsid, anapsid reptile (0.80%)
  • playing card (0.73%)
  • man, piece (0.72%)
  • surface (0.70%)
  • queen (0.67%)
  • notation, notational system (0.46%)
  • evidence (0.43%)
  • written symbol, printed symbol (0.42%)
  • movable barrier (0.41%)
Level 6

  • mammal (3.15%)
  • female aristocrat (1.22%)
  • writing, written material (1.15%)
  • reptile, reptilian (1.15%)
  • statement (1.05%)
  • boundary, bound, bounds (0.96%)
  • room (0.93%)
  • game equipment (0.84%)
  • meat (0.81%)
  • information, info (0.73%)

Note that at level 9 the most frequent concept is "leporid, leporid mammal", which is sense 1 of "rabbit". However, the word "rabbit" also appears further down the list. This is because of sense 2 of "rabbit", which pertains to rabbit fur and is not likely the appropriate meaning in this context. Since sense 2 is weighted lower than sense 1, it appears lower in the list, which is what we would prefer, at least in this instance. As we move upward through the levels of abstraction, "leporid" is absorbed into "rodent, gnawer, gnawing animal" at level 8, "placental, placental mammal, eutherian, eutherian mammal" at level 7, and "mammal" at level 6.

Moreover, we are able to conclude, via the proximity script, that concepts like communication and chordate are fairly closely linked at level 4. One line where these concepts occur together follows:

"There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear [concept] the Rabbit [chordate] say to itself, `Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural);..."

We also find that, at level 6, 'mammal' relates to 'game equipment', as the word "white" is parsed as a game piece and frequently coincides with "Rabbit". Another example of co-occurrence:

"the puppy [a mammal] made another rush at the stick, and tumbled head over heels in its hurry to get hold of it; then Alice, thinking it was very like having a game [game equipment] of play with a cart-horse [mammal]"

Relationships like these yield significant insight into the content of the text, and into themes that may bear reference to one another across great distances.

Related Work

The concept proximity aspect of our project was partly inspired by the algorithm presented in "A Proposal for Word Sense Disambiguation Using Conceptual Distance" (Agirre and Rigau 1995). Agirre and Rigau provide a formula that disambiguates a noun by examining the other nouns within the same window of text and measuring the density of the corresponding nodes within WordNet. The sense that yields the greatest density is the one that is selected. Although we have not attempted to disambiguate word senses, we have applied a similar approach toward the problem of determining concept proximity. Like their approach, ours relies on context, focusing on all of the nouns that appear in a line at a time. Rather than node density, we measure frequency of pairwise matches between words within two categories.


Conclusion

In this paper we have presented a domain-independent method for identifying themes within a large body of text. The method requires a tagger to identify all of the nouns within the text and uses word sense tags from WordNet, a lexical database, to build a hypernym hierarchy from which inferences can be drawn based on frequency of concept occurrences.

Go to the Theme Identifier