Summarization content units (SCUs)

The goal of SCU annotation is to identify sub-sentential content units that allow the information in several summaries to be compared. It is well known that, when summarizing, people make different choices about what information to include in their summary. SCU annotation aims to highlight what people agreed on. After the annotation is completed, some SCUs might appear in only one summary, but annotating them still allows a person to read a brand-new summary and look for those SCUs in it.

An SCU consists of a label and contributors. The label is a concise English sentence that states the semantic meaning of the content unit. The contributors are snippets of text from the summaries that show the wording a specific summary uses to express the label. An SCU can have a single contributor when only one of the analyzed summaries expresses its label.
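The label-plus-contributors structure described above can be sketched as a small data type. This is only an illustration: the class and field names (SCU, Contributor, summary_id) are assumptions and do not come from any annotation tool. The label and snippets are borrowed from Example 1 further down.

```python
from dataclasses import dataclass, field


@dataclass
class Contributor:
    summary_id: str  # which summary the snippet comes from, e.g. "A"
    text: str        # the snippet's exact wording in that summary


@dataclass
class SCU:
    label: str  # concise sentence stating the meaning of the content unit
    contributors: list[Contributor] = field(default_factory=list)

    def summaries(self) -> set[str]:
        """Return the ids of the summaries that express this SCU."""
        return {c.summary_id for c in self.contributors}


scu = SCU(label="Libya was under U.N. sanctions")
scu.contributors.append(Contributor("A", "the U. N. voted sanctions against Libya"))
scu.contributors.append(Contributor("D", "Libya has been under U.N. sanctions"))
print(sorted(scu.summaries()))
```

An SCU with a single contributor is simply an instance whose contributors list has one entry.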

The definition of a content unit is somewhat fluid -- it can sometimes be a single word, but it is never bigger than a sentence clause. Any event realized by a verb or a nominalized verb (e.g., "blow up" and "bombing" in the examples below) is a candidate SCU.

The three questions that will help you identify an SCU contributor are:

  1. Is the information expressed by it repeated in some other summary? Note, the wording need not be the same for the expressed meaning to be the same; we are looking for the same meaning. When an information unit is expressed in two or more summaries, the amount of information overlap will serve as a main indication of which parts of the corresponding sentences will become contributors.
  2. Does a span of words indicate location or time, or otherwise provide more specific information about another SCU? Such spans are also SCUs. Usually they are expressed in adjuncts such as prepositional phrases and are not an obligatory argument of any verb. Noun phrases containing premodification can also be split into more than one SCU when the premodifiers include additional information. The need to split such additional information will arise in two cases. 1) When more than one summary expresses some information, but one of the summaries has an adjunct, e.g. several summaries mention that there was a bombing and one summary mentions the exact location of the bombing. In this situation one would identify two SCUs, one with the main event and one with the additional detail.
  3. Is the difference important for the story? Occasionally there will be minor differences in wording that, if put under scrutiny, could be construed to have different nuances. We are not interested in the finest-grained distinctions; there would be too many of them to describe in a reasonable way.

Example 1: The four sentences below come from four different summaries A, B, C and D.

A: In 1992 the U. N. voted sanctions against Libya for its refusal to
turn over the suspects. 

B: The United Nations imposed sanctions on Libya in 1992 because of
their refusal to surrender the suspects. 

C: The U.N. imposed international air travel sanctions on Libya to
force their extradition. 

D: Since 1992 Libya has been under U.N. sanctions in effect until the suspects are turned over to United States or Britain.

Among other information, all four sentences express the fact that "Libya was under U.N. sanctions" and this is the label for the SCU. The contributors are marked in brackets below (ignore SCU 2 for now).

A: In 1992 [the U. N. voted sanctions against Libya]1 [for its refusal to
turn over the suspects.]2 

B: [The United Nations imposed sanctions on Libya]1 in 1992 [because of
their refusal to surrender the suspects.]2 

C: [The U.N. imposed]1 international air travel sanctions on Libya [to
force their extradition.]2 

D: Since 1992 [Libya has been under U.N. sanctions]1 [in effect until
the suspects are turned over]2 to United States or Britain. 

Other information, such as when the sanctions were imposed, what specific sanctions were imposed, why they were imposed, etc., will form its own SCUs. Identifying a main topic event in the summaries and asking yourself questions like these about the specifics will help you formulate labels and identify the SCU contributors. The contributors of an SCU need not share identical wording. For example, in the sentences above, the SCU with label "The goal behind the sanctions is to make Libya surrender the suspects" is expressed by the text coindexed with "2". Sentence B differs in wording from the rest of the sentences, but the meaning is the same as that of the other contributors, expressing the fact that Libya does not want to surrender the suspects and the other nations involved want to force their extradition. (Note that this is an example of only two of the SCUs that will be derived from the sentences; the full analysis will identify more SCUs and lead to complete bracketing of the sentences.)
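The bracket-and-index notation used in the example above can also be read mechanically. The following is a minimal sketch, assuming that every contributor is written as "[snippet]N" with no nested brackets; the function name extract_contributors is hypothetical, and WordFreak's own storage format is not shown here.

```python
import re
from collections import defaultdict

# Matches "[snippet]N": a bracketed snippet followed directly by its SCU index.
BRACKET = re.compile(r"\[([^\[\]]+)\](\d+)")


def extract_contributors(sentences):
    """Map SCU index -> list of (summary id, snippet) pairs."""
    by_index = defaultdict(list)
    for summary_id, sentence in sentences.items():
        for snippet, index in BRACKET.findall(sentence):
            # Collapse line breaks and runs of spaces inside a snippet.
            by_index[int(index)].append((summary_id, " ".join(snippet.split())))
    return dict(by_index)


sentences = {
    "A": "In 1992 [the U. N. voted sanctions against Libya]1 [for its refusal to\n"
         "turn over the suspects.]2",
    "C": "[The U.N. imposed]1 international air travel sanctions on Libya [to\n"
         "force their extradition.]2",
}
contributors = extract_contributors(sentences)
for index in sorted(contributors):
    print(index, contributors[index])
```

Text outside the brackets, such as "In 1992" in sentence A, is left for the SCUs that a full analysis would add.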

Let's look at one more example of sentences from the different summaries that share some common information.

A. In 1998 [two Libyans indicted]1 [in 1991]2 for the Lockerbie [bombing]3 were
still in Libya. 

B. [Two Libyans were indicted]1 [in 1991]2 [for blowing up]3 [a Pan Am]5
[jumbo jet]4 over Lockerbie, Scotland in 1988. 

C. [Two Libyans, accused]1 by the United States and Britain [of bombing]3 [a
New York bound]6 [Pan Am]5 [jet]4 over Lockerbie, Scotland in 1988, killing
270 people, for 10 years were harbored by Libya who claimed the
suspects could not get a fair trial in America or Britain. 

D. [Two Libyan suspects were indicted]1 [in 1991]2. 

All four sentences share the information that (1) "Two Libyans are held responsible for a crime". The contributors are surrounded by brackets and coindexed by 1. Note that C differs in its wording from the other sentences -- "accused" is not the same as "indicted". But because the goal of the annotation is to find as much shared information as possible, and the sense of "accused" is so close to that of "indicted", the contributors are grouped together, and the label expresses the general meaning of both "accused" and "indicted".

The prepositional time expression "in 1991" forms a separate SCU because it can be omitted, for example from sentence D, without making the sentence ungrammatical or incomprehensible. There will be a loss of information, and this is why the phrase can indicate a new *content* unit! The contributors of the SCU with label "The Libyans were accused in 1991" are coindexed with "2".

Now we have to proceed and find what other information is repeated. For example, what was the crime committed? The different sentences give different amounts of detail. When deciding where to start, remember that the main goal is identifying the same information! Sentences A, B and C agree on the fact that "the crime in question is a bombing" -- the contributors are coindexed with 3.

What was bombed? "An airplane was bombed" is another SCU, with index 4. This information is expressed in two bigger noun phrases, "a Pan Am jumbo jet" and "a New York bound Pan Am jet", but "New York bound" and "Pan Am" can be omitted and the sentences will still be acceptable, so this information is marked in separate content units (indexed 6 and 5 in the bracketing above).

The contributors are simply a part of the sentence -- not all grammatical arguments necessary to reconstruct the label will be included in the contributor. This is OK, because the label will "bring in" any argument needed.

It is best if the SCU contributor can be a complete grammatical phrase. But this is sometimes not possible, so use your best judgment in assigning the specific token boundaries of the contributor.

Guidelines for document-level annotation

When annotating entire documents instead of only summaries, we have a few more guidelines. These examples should also be helpful for annotating summaries. As with annotating summaries, you should first skim through the documents to get an idea of what sorts of information are common between the documents.

Annotating documents is time consuming, so be sure to take a break every once in a while!

Dealing with attribution

When marking SCUs from documents, it can be difficult to deal with attribution. In general, we find that attribution should be marked with the text of the SCU, and not treated separately. For example, for the following two sentences:

  1. Swedish rescue authorities announced that at least 60 people died in a blaze that broke out the night of Thursday-Friday in a theater in Gutenberg (south).
  2. Police announced this evening, Friday, that 60 people and not 65 as announced earlier, died in the fire that engulfed a disco in Gutenberg (southwestern Sweden), according to a new official toll.

We mark up one SCU about the number of deaths, which contains the attribution from both sources. Even though the SCU is only about the number of deaths, it makes more sense to keep the attribution with the contributor than to create SCUs about which entities make announcements, which is not very interesting information to include in a summary.

SCU 1 At least 60 people died
  Swedish rescue authorities announced that at least 60 people died
  Police announced this evening, Friday, that 60 people

Appositions specifying jobs and positions

We also found that it was useful to mark up sentences like:

Police Commissioner, Hans Carlssen, said during a press conference in Gutenberg that "some information collected by police was not correct."

into an SCU such as:

SCU 2 Hans Carlssen is the police commissioner
  Police Commissioner, Hans Carlssen,

Documents often use such appositions or premodifiers to describe the source of an attribution, and in such cases the information can be split off into its own SCU. The remainder of the sentence might break into two SCUs ([said during a press conference in Gutenberg] and [that "some information collected by police was not correct."]) or only a single SCU ([said during a press conference in Gutenberg that "some information collected by police was not correct."]). In this case, it was only one SCU, because no other documents had information about a press conference in Gutenberg.

Keeping similar information together

When creating SCUs, it is important to try to keep information that is generally the same in the same SCU. As explained above, some of the contributors might have more detail than others, but they should still be included in the same SCU. If the information is conceptually different, then a new SCU should be started.

In the following example, all sentences have information on the time the fire started, which should go into the same SCU. Some contributors give the time only at the granularity of a day, while others quote a specific hour, but since they are all about when the blaze started, they should all go in the same SCU.

  1. Local police explained that the fire, which may have been deliberate, started at 00:30 (at 23:30 Greenwich Meantime yesterday, Thursday) in a backside hall of one of the theaters where 400 young men and women, including people younger than 14 years of age, were celebrating "Halloween."
  2. Stockholm 10-30 (AFP) - The head of the Swedish rescue authorities, Linart Ohlin, told the Swedish "TT" Agency that the fire that broke out in a disco in Gutenberg (south) and which killed 60 people and wounded about one hundred others the night of Thursday-Friday, may have been deliberate.
  3. He added that the reasons behind the blaze are not known yet, explaining that the fire broke out at about 01:00 at dawn today, Friday (00:00 GMT) in a disco behind the theater.

The example SCU is shown below:

SCU 3 The fire started at 1:00am Friday morning
  started at 00:30 (at 23:30 Greenwich Meantime yesterday, Thursday)
  the night of Thursday-Friday
  the fire broke out at about 01:00 at dawn today, Friday (00:00 GMT)

Repetition in documents

When annotating summaries for SCUs, there should not be repetition of information, assuming the summary is a good one. When annotating documents, some information is repeated multiple times. If information is repeated in a document, it should be added to the already existing SCU for that information. Typically a lead sentence will give rise to one or more SCUs, while later sentences might elaborate on the dense information imparted in the lead sentence. The intuition behind marking multiple instances of the same information in a document is that, while the repetition might be fulfilling some linguistic reference role, most edited news does not repeat unimportant information. Information that is repeated on a consistent basis should be important, and this importance should be reflected in the weight of the SCU it contributes to.
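The text above does not say how an SCU's weight is computed. As a rough sketch, the snippet below contrasts two assumed weightings: one that counts every marked contributor, so that repetition inside a document raises the weight as the paragraph suggests, and one that counts each document at most once. The document ids and the marked list are made-up illustration data.

```python
from collections import Counter

# (document id, SCU id) for every contributor marked during annotation.
# Hypothetical data: doc1 repeats the information of SCU 1 twice.
marked = [
    ("doc1", "SCU 1"),
    ("doc1", "SCU 1"),
    ("doc2", "SCU 1"),
    ("doc1", "SCU 3"),
]

# Assumed weighting 1: every contributor counts, so repetition raises the weight.
weight_by_mentions = Counter(scu_id for _, scu_id in marked)

# Assumed weighting 2: each document counts at most once per SCU.
weight_by_documents = Counter(scu_id for _, scu_id in set(marked))

print(weight_by_mentions["SCU 1"], weight_by_documents["SCU 1"])
```

Under the first weighting, adding a repeated mention to the existing SCU is what makes the repetition visible in the final weight.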

SCUs should not be longer than a clause

For the most part, SCUs should not be longer than a clause. If you include attribution, they might become a bit longer, but they should normally not be too long. If there are two separate ideas in an SCU, consider breaking it down into two SCUs.

Using WordFreak to annotate

This guide shows how to use Tom Morton's WordFreak to annotate SCUs.
Dave Evans
Last modified: Wed Jan 26 17:56:52 EST 2005