The goal of SCU annotation is to identify sub-sentential content units that allow the information in several summaries to be compared. It is well known that when summarizing, people make different choices about what information to include in their summary. SCU annotation aims to highlight what people agreed on. After the annotation is completed, some SCUs might appear in only one summary, but annotating them still allows a person to read a brand new summary and look for those SCUs in it.
An SCU consists of a label and contributors. The label is a concise English sentence that states the semantic meaning of the content unit. The contributors are snippet(s) of text coming from the summaries that show the wording used in a specific summary to express the label. An SCU can have a single contributor when only one of the analyzed summaries expresses its label.
The definition of content unit is somewhat fluid -- it can sometimes be a single word but it is never bigger than a sentence clause. Any event realized by a verb or a nominalized verb (e.g., "blow up" and "bombing" in the examples below) is a candidate SCU.
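The label-plus-contributors structure described above can be pictured as a small record type. The following is only a minimal sketch: the class names `SCU` and `Contributor`, their fields, and the `summary_ids` helper are hypothetical names chosen for illustration and are not part of any annotation tool. The sample contributors are taken from the Libya sanctions example in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class Contributor:
    summary_id: str  # which summary the snippet comes from, e.g. "A"
    text: str        # the wording used in that summary


@dataclass
class SCU:
    label: str  # concise sentence stating the meaning of the unit
    contributors: list = field(default_factory=list)

    def add(self, summary_id, text):
        # Record one snippet of text that expresses the label.
        self.contributors.append(Contributor(summary_id, text))

    def summary_ids(self):
        # Distinct summaries that express this SCU, in first-seen order.
        seen = []
        for contributor in self.contributors:
            if contributor.summary_id not in seen:
                seen.append(contributor.summary_id)
        return seen


# Hypothetical usage with two of the contributors from Example 1.
scu = SCU("Libya was under U.N. sanctions")
scu.add("A", "the U. N. voted sanctions against Libya")
scu.add("D", "Libya has been under U.N. sanctions")
```

An SCU with a single contributor is simply one whose `contributors` list has length one.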
The three questions that will help you identify an SCU contributor are
Example 1: The four sentences below come from four different summaries A, B, C and D.
A: In 1992 the U. N. voted sanctions against Libya for its refusal to turn over the suspects.
B: The United Nations imposed sanctions on Libya in 1992 because of their refusal to surrender the suspects.
C: The U.N. imposed international air travel sanctions on Libya to force their extradition.
D: Since 1992 Libya has been under U.N. sanctions in effect until the suspects are turned over to United States or Britain.
Among other information, all four sentences express the fact that "Libya was under U.N. sanctions" and this is the label for the SCU. The contributors are marked in brackets below (ignore SCU 2 for now).
A: In 1992 [the U. N. voted sanctions against Libya]1 [for its refusal to turn over the suspects.]2
B: [The United Nations imposed sanctions on Libya]1 in 1992 [because of their refusal to surrender the suspects.]2
C: [The U.N. imposed]1 international air travel sanctions on Libya [to force their extradition.]2
D: Since 1992 [Libya has been under U.N. sanctions]1 [in effect until the suspects are turned over]2 to United States or Britain.
Other information, such as when the sanctions were imposed, what specific sanctions were imposed, why they were imposed, etc., will form their own SCUs. Identifying a main topic event in the summaries and asking yourself such questions about specifics will help you formulate labels and identify the SCU contributors. The contributors of an SCU need not share identical wording. For example, in the sentences above, the SCU with label "The goal behind the sanctions is to make Libya surrender the suspects" is expressed by the text coindexed with "2". Sentence B differs in wording from the rest of the sentences, but the meaning is the same as that of the other contributors, expressing the fact that Libya does not want to surrender the suspects and the other nations involved want to force their extradition. (Note that this is an example of only two SCUs that will be derived from the sentences. The full analysis will identify more SCUs and will lead to complete bracketing of the sentences.)
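The coindexed bracket notation used in these examples (a bracketed snippet followed by its SCU index) is regular enough to be read mechanically. The sketch below is only an illustration of the notation, assuming brackets are never nested and every closing bracket is immediately followed by the index. The function name `collect_contributors` and the pattern `BRACKET` are hypothetical.

```python
import re
from collections import defaultdict

# A bracketed snippet immediately followed by its SCU index, e.g. "[text]2".
BRACKET = re.compile(r"\[([^\]]+)\](\d+)")


def collect_contributors(sentences):
    """Map each SCU index to a list of (summary_id, snippet) pairs."""
    by_index = defaultdict(list)
    for summary_id, sentence in sentences.items():
        for snippet, index in BRACKET.findall(sentence):
            by_index[int(index)].append((summary_id, snippet.strip()))
    return dict(by_index)


# Two of the bracketed sentences from Example 1.
sentences = {
    "A": "In 1992 [the U. N. voted sanctions against Libya]1 "
         "[for its refusal to turn over the suspects.]2",
    "C": "[The U.N. imposed]1 international air travel sanctions on Libya "
         "[to force their extradition.]2",
}
grouped = collect_contributors(sentences)
```

Here `grouped[1]` holds the contributors of the sanctions SCU and `grouped[2]` those of the SCU about the goal behind the sanctions. Text outside any bracket, such as "In 1992", is left for SCUs that have not been marked yet.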
Let's look at one more example of sentences from the different summaries that share some common information.
A. In 1998 [two Libyans indicted]1 [in 1991]2 for the Lockerbie [bombing]3 were still in Libya.
B. [Two Libyans were indicted]1 [in 1991]2 [for blowing up]3 [a Pan Am]5 [jumbo jet]4 over Lockerbie, Scotland in 1988.
C. [Two Libyans, accused]1 by the United States and Britain [of bombing]3 [a New York bound]6 [Pan Am]5 [jet]4 over Lockerbie, Scotland in 1988, killing 270 people, for 10 years were harbored by Libya who claimed the suspects could not get a fair trial in America or Britain.
D. [Two Libyan suspects were indicted]1 [in 1991]2.
All share the information that (1) "Two Libyans are held responsible for a crime". The contributors are surrounded by brackets and coindexed by 1. Note that C differs in its wording from the other sentences--"accused" is not the same as "indicted". But because the goal of the annotation is to find as much shared information as possible, and the sense of "accused" is so close to that of "indicted", the contributors will be grouped together, and the label expresses the general meaning of both "accused" and "indicted".
The time expression prepositional phrase "in 1991" forms a separate SCU because the phrase can be omitted, for example from sentence D, without making the sentence ungrammatical or incomprehensible. There will be loss of information, and this is why the phrase can indicate a new *content* unit! The contributors of the SCU with label "The Libyans were accused in 1991" are coindexed with "2".
Now we have to proceed and find what other information is repeated. For example, what was the crime committed? The different sentences give different amounts of detail. When deciding where to start, remember that the main goal is identifying the same information! All sentences agree on the fact that "the crime in question is a bombing" -- the contributors are coindexed with 3.
What was bombed? "An airplane was bombed" is another SCU with index 4. This information is expressed in two bigger noun phrases, "a Pan Am jumbo jet" and "a New York bound Pan Am jet", but "New York bound" and "Pan Am" can be omitted and the sentences will still be acceptable, so this information will be marked in separate content units.
The contributors are simply a part of the sentence--not all grammatical arguments necessary to reconstruct the label will be included in the contributor. This is ok, because the label will "bring in" any argument needed.
It is best if the SCU contributor can be a complete grammatical phrase. But this is sometimes not possible, so use your best judgment in assigning the specific token boundaries of the contributor.
When annotating documents for SCUs, be sure to take a break every once in a while! Annotating documents is time consuming.
We mark up one SCU about the number of deaths which contains the attribution from both sources. Even though the SCU is only about the number of deaths, it makes more sense to keep the attribution with the contributor rather than creating SCUs about which entities make announcements, which isn't really very interesting information to include in a summary.
- Swedish rescue authorities announced that at least 60 people died in a blaze that broke out the night of Thursday-Friday in a theater in Gutenberg (south).
 - Police announced this evening, Friday, that 60 people and not 65 as announced earlier, died in the fire that engulfed a disco in Gutenberg (southwestern Sweden), according to a new official toll.
 
SCU 1
Label: At least 60 people died
Contributors:
 - Swedish rescue authorities announced that at least 60 people died
 - Police announced this evening, Friday, that 60 people
The apposition in a sentence such as
 - Police Commissioner, Hans Carlssen, said during a press conference in Gutenberg that "some information collected by police was not correct."
can be split off into an SCU such as:
SCU 2
Label: Hans Carlssen is the police commissioner
Contributors:
 - Police Commissioner, Hans Carlssen,
It often happens that there are such appositions or premodifiers in documents used to describe attribution, and in such cases the information can be split off into its own SCU. The remainder of the sentence might break into two SCUs ([said during a press conference in Gutenberg] and [that "some information collected by police was not correct."]) or only a single SCU ([said during a press conference in Gutenberg that "some information collected by police was not correct."]). In this case, it was only one SCU, because no other documents had information about a press conference in Gutenberg.
In the following example, all sentences have information on the time the fire started. Some contributors give the time only at the granularity of a day and others quote a specific hour, but since they are all about when the blaze started, they should all go in the same SCU.
The example SCU is shown below:
- Local police explained that the fire, which may have been deliberate, started at 00:30 (at 23:30 Greenwich Meantime yesterday, Thursday) in a backside hall of one of the theaters where 400 young men and women, including people younger than 14 years of age, were celebrating "Halloween."
 - Stockholm 10-30 (AFP) - The head of the Swedish rescue authorities, Linart Ohlin, told the Swedish "TT" Agency that the fire that broke out in a disco in Gutenberg (south) and which killed 60 people and wounded about one hundred others the night of Thursday-Friday, may have been deliberate.
 - He added that the reasons behind the blaze are not known yet, explaining that the fire broke out at about 01:00 at dawn today, Friday (00:00 GMT) in a disco behind the theater.
 
SCU 3
Label: The fire started at 1:00am Friday morning
Contributors:
 - started at 00:30 (at 23:30 Greenwich Meantime yesterday, Thursday)
 - the night of Thursday-Friday
 - the fire broke out at about 01:00 at dawn today, Friday (00:00 GMT)