Database

The Enron email database is broken into three collections: entities, emails, and threads.

Entities

class EnronEntity

Base model for representing an entity in the Enron organization, ranging in specificity from a division like "Government Affairs" to individual employees.

The EnronEntity class describes the following attribute common to many EnronUnits and EnronEmployees:

position_nodes

An array of the PositionNode objects describing this entity's position in the organizational hierarchy at Enron.

class EnronUnit

Model for representing organizational units within Enron like "Global Corp Affairs". This model contains all the attributes of an EnronEntity, but is defined in order to allow one to quickly distinguish between an organizational unit and an employee in the entities collection.

This class inherits from: EnronEntity

class EnronEmployee

Model for representing employees at Enron.

This model contains the following attributes of an Emailer

email_address
email_names
email_addresses

This model also contains the following attribute:

position

A string describing this employee's highest position in the org chart

As well as:

position_id

This is an alphanumeric string associated with every position in the Enron org chart file.

Since some employees in the org chart have multiple PositionNodes associated with them, EnronEmployees also have the following field:

position_nodes

A list of all the PositionNodes associated with this entity in the org chart.

class Emailer

Model for representing a participant in an email. It contains the following attributes:

email_address

A single, canonical email address belonging to this emailer (if one exists). If an emailer has multiple email addresses and it is not clear which single address should be chosen, then this field will not exist. The email_addresses field, on the other hand, will always exist, even if the emailer has just a single email address.

email_names

Names by which an email participant has been referred to in the corpus. There is a many-to-one relationship of names to email participant. "[W]e made some effort to recognize different names that referred to the same person. Chiefly this was by the different names that appeared in the "From" field in a single person's sent mail items. Sometimes this led to, e.g. an executive and an assistant being assigned the same uid".

For example, the string 'From: "William Williams" <bwillia5@enron.com>' may occur in one or more emails, associating the name "William Williams" with the owner of the email address bwillia5@enron.com.
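As a rough illustration of how such a name/address association can be read out of a header line, the sketch below uses Python's standard `email.utils.parseaddr`. It is not necessarily how the names in this database were extracted; the variable names are illustrative only.

```python
from email.utils import parseaddr

header_line = 'From: "William Williams" <bwillia5@enron.com>'

# Drop the "From" field name, then split the display name from the address.
_, _, value = header_line.partition(":")
name, address = parseaddr(value.strip())

print(name)     # William Williams
print(address)  # bwillia5@enron.com
```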

email_addresses

An array of email_addresses associated with this emailer
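Because email_address may be absent while email_addresses is always present, code reading an Emailer has to handle both cases. The following is a minimal sketch, assuming the record has been loaded as a plain Python dict; the helper name `preferred_address` and the sample values are hypothetical.

```python
def preferred_address(emailer):
    """Return the canonical address if stored, else the sole known address, else None."""
    if "email_address" in emailer:
        return emailer["email_address"]
    addresses = emailer.get("email_addresses", [])
    # With several candidates and no canonical choice, do not guess.
    return addresses[0] if len(addresses) == 1 else None


# Hypothetical records for illustration only.
canonical = {
    "email_address": "a@enron.com",
    "email_addresses": ["a@enron.com", "a2@enron.com"],
}
ambiguous = {"email_addresses": ["b@enron.com", "b2@enron.com"]}

print(preferred_address(canonical))  # a@enron.com
print(preferred_address(ambiguous))  # None
```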

class PositionNode

A model associated with an EnronEntity describing its position in the organizational hierarchy at Enron (as charted in the orgcharts file) as a node in a graph, where an edge from this node to another indicates that this node is in a higher position, and vice versa.

A PositionNode has a unique id, and a list of edges (specifically, of type PositionEdge) connecting this node to other nodes. Each PositionNode is associated with (and stored as a property of) an EnronEntity, representing that entity's position within the organization. For example, a supervisor will have a PositionNode with a position string of "Supervisor", and edges from his position node to the position nodes of those he supervises to indicate that he is at a higher level than these employees in the organizational hierarchy.

Also, an organizational unit like "US Natural Gas" has a PositionNode associated with it, representing this unit's position in Enron. This position node's incident_edges field contains PositionEdges from it to all the employees it contains, and has a PositionEdge pointing to it from the PositionNode associated with Mr. Steffes, the director of the unit.

A PositionNode has the following attributes:

position

A string giving the title of the position.

position_id

This is an alphanumeric string associated with every position in the Enron org chart file.

uid

A string which serves as a unique identifier of this PositionNode. PositionEdges pointing to or from this node will reference this value.

incident_edges

An array containing the PositionEdges originating from, or pointing to this position in the organizational hierarchy.

class PositionEdge

This class stores a "relationship" property specific to directed edges between PositionNodes in our representation of the Enron hierarchy. The to and ffrom fields in this edge store the uids of the PositionNodes this edge is connecting.

A PositionEdge has the following attributes:

to

A field storing the uid of the "child" of the PositionNode from which this edge is pointing. This indicates that the position being pointed to is below the position this edge comes from in the organizational hierarchy.

ffrom

A field storing the uid of the "parent" of the PositionNode to which this edge is pointing. This indicates that the position this edge comes from is above the position this edge points to in the organizational hierarchy.

relationship

A string describing the nature of the relationship between the positions being connected. For example, a position may "manage" another employee or entity.
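The node and edge structure described above can be sketched with plain Python dataclasses. This is only an illustration of the data layout, not the project's actual model classes: the helper functions `connect` and `positions_below`, the uids, the position_id values, and the relationship string "manages" are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PositionEdge:
    ffrom: str         # uid of the higher ("parent") position
    to: str            # uid of the lower ("child") position
    relationship: str  # nature of the relationship between the two positions


@dataclass
class PositionNode:
    uid: str
    position: str
    position_id: str
    incident_edges: List[PositionEdge] = field(default_factory=list)


def connect(parent, child, relationship):
    """Record one directed edge in the incident_edges of both endpoints."""
    edge = PositionEdge(ffrom=parent.uid, to=child.uid, relationship=relationship)
    parent.incident_edges.append(edge)
    child.incident_edges.append(edge)


def positions_below(node, nodes_by_uid):
    """Follow only the edges that originate from this node."""
    return [nodes_by_uid[e.to] for e in node.incident_edges if e.ffrom == node.uid]


supervisor = PositionNode(uid="n1", position="Supervisor", position_id="P001")
analyst = PositionNode(uid="n2", position="Analyst", position_id="P002")
clerk = PositionNode(uid="n3", position="Clerk", position_id="P003")
nodes_by_uid = {n.uid: n for n in (supervisor, analyst, clerk)}

connect(supervisor, analyst, "manages")
connect(supervisor, clerk, "manages")

print([n.position for n in positions_below(supervisor, nodes_by_uid)])
```

Filtering on `ffrom` matters because incident_edges holds both outgoing and incoming edges, so a node's list alone does not say which direction each edge runs.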

class EnronEmailer

Class for describing entities which have an email address that ends in @enron.com, but which are not in our orgchart and cannot be confirmed to be employees (e.g. there are many mailing lists that send from an @enron.com address).

This class inherits from EnronEntity and Emailer

Emails

class EnronEmail

Model for representing a unique message appearing in the Enron email data set.

This model contains the following attributes:

uid

An integer uniquely identifying the message.

subject

The subject line (if available) of a message.

body

The text constituting the body of the message.

sender

A field containing the uid of the sender of the message. For the sake of generality, this is assumed to be a string.

from

A field containing the uid of the sender of the message. The difference between this field and sender is that for emails in the MySQL database that did not indicate a sender, the "sender" field does not exist, while the "from" field exists with a null value.
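The distinction between a missing sender field and a null from field can be checked as follows. This is a minimal sketch that assumes email records have been loaded as plain Python dicts; the sample documents and the helper name `sender_is_unknown` are hypothetical.

```python
# Hypothetical records: one with a known sender, one without.
with_sender = {"uid": 1, "sender": "42", "from": "42"}
without_sender = {"uid": 2, "from": None}


def sender_is_unknown(email_doc):
    """True when "sender" is absent and "from" is present but null."""
    return "sender" not in email_doc and email_doc.get("from") is None


print(sender_is_unknown(with_sender))     # False
print(sender_is_unknown(without_sender))  # True
```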

recipients

A field containing the uids of the intended recipients of the message. For the sake of generality, this field is assumed to be a list, since a message may have multiple intended recipients (as is often the case with emails). In the case of there being only one intended recipient, it is encouraged to still create a list with one entry for this field, in order to keep a consistent format.

to

A list containing the uids of the people to whom this email was addressed. Even when the header info describing who the email was sent to is missing, an empty array is stored for this field.

cc

A list containing the uids of those who have been cc'd/bcc'd on this email. In the case that there were no such people, this field will simply contain an empty list.

date_time

A Python datetime object representing the time the message was sent.

annotations

A list of objects containing additional information about a message (such as indicating a MessageQuote)

In addition, the following fields unique to this dataset are defined:

is_bubble

This is a boolean field where a value of true indicates that this message was missing in the corpus, but was reconstructed from quotations in other messages.

header_info

This is a list of HeaderInfo objects, each of which has the uid of a sender or recipient of the message, and the corresponding information about this person that appeared in the header of the email (ex. the email address that was used)

attachments

This is a list of filenames of attachments to this message

corresponding_files

This is a list of FileInfo objects describing instances of the email appearing in the Enron dataset (ex. an email may appear both in one person's "sent" box and another's inbox)

text_chunks

A list of the TextChunk objects which comprise this email. For example, a quoted message within the email will be represented as its own TextChunk object. Note that a more natural representation of the information provided by splitting a message into "text chunks" is to store the message as one contiguous string, and to indicate which ranges are quotes or other content using MessageQuoteAnnotation and SplitterTextAnnotation objects.

class HeaderInfo

An object describing the appearance of an Emailer in an email's header, it contains the following attributes:

uid

The uid of the participating entity

email_name

The name by which the entity is referred to in the header field of the email, e.g. M Scott Kuehn is referred to both as M Scott and as Kuehn in separate emails.

email_address

The email_address with which this entity participates in this email (ex. the email was sent to or from this email address)

role

The role of this entity's participation in the email (ex. this entity was cc'd or, the email was sent from this entity). This field can take any one of the following values: to, from, cc, bcc, sender, reply-to, to-box, from-box, bubble-to, bubble-from

header_source

Describes the source of the header information. The most frequent sources were headers stored in an email using the Microsoft Exchange protocol (field has value "Exchange v1.0") and headers stored following the general RFC format ("RFCv1.0"). An additional possible value for this field is "mailbox-tagger". This value indicates that no suitable headers were found for the email, and that an attempt was made to recover the header information from the folder the message was found in.
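Since each HeaderInfo entry carries a uid together with a role, a common first step is to group participants by role. The sketch below assumes header_info has been loaded as a list of dicts with the attribute names listed above; the function name `uids_by_role` and the sample uids and addresses are invented for illustration.

```python
from collections import defaultdict


def uids_by_role(header_info):
    """Group participant uids by the role they play in the email header."""
    grouped = defaultdict(list)
    for entry in header_info:
        grouped[entry["role"]].append(entry["uid"])
    return dict(grouped)


# Hypothetical header_info list for a single email.
header_info = [
    {"uid": "7", "role": "from", "email_address": "x@enron.com"},
    {"uid": "8", "role": "to", "email_address": "y@enron.com"},
    {"uid": "9", "role": "to", "email_address": "z@enron.com"},
    {"uid": "10", "role": "cc", "email_address": "w@enron.com"},
]

print(uids_by_role(header_info))
```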

class Mentions

This class describes an instance of a person named within the body of the email. These "mentions" were extracted as described in this paper.

It contains the following attributes of the named entity:

mention

A string indicating the name of the person mentioned in the email.

uid

A field containing the uid of the person whom the mention was resolved to. This is exactly the same as the Entities.uid field.

class FileInfo

This is a class for describing an instance of an email appearing in the Enron email data set. For example, the same message may be found in both the "sent" items folder of the sender and "inbox" folder of the receiver

It describes the following properties of this instance of the email:

file_path

A string indicating where in the archive of emails the file containing this instance of the message is located.

file_name

A string indicating the name of the file in which this instance of the message is located

mailbox_name

A string giving the name of the "mailbox" in which this message was found (one of the 158 released mailboxes)

sdoc_no

This is a value given to each document in the collection in the FERC dataset provided online (from which this dataset was scraped)

class OriginalMessageContentAnnotation

This class indicates a region of content original to this message (not quoted). It contains the following attributes:

start_index

An integer indicating the character index of the Message body at which the content starts

end_index

An integer indicating the character index of the Message body at which the content ends

annotation_type

A string indicating the type/format of this annotation. This will typically be a “class constant”.

The string "OriginalMessageContent" is stored here, to make the purpose of this annotation immediately apparent to someone browsing the database.

class MessageQuoteAnnotation

This class is used to indicate a portion of a Message that is quoted from another message. It indicates the original author and uid (if available) of the quoted message.

start_index

An integer indicating the character index of the Message body at which the content starts

end_index

An integer indicating the character index of the Message body at which the content ends

author_id

The uid of the author of the quoted message. Defined as a String value for this general class, can be subclassed to include the type of unique identifier applicable to the specific corpus. For example, email authors may be assigned a unique integer id, while a username String uniquely identifies a poster on an online discussion board.

message_id

uid of the quoted Message

relative_depth

An integer that describes the position of this text chunk in the email. Relative depth of 0 means that this is the original text of the email. A directly quoted message in the email will have a relative depth of -1, and a message quoted by that message will have a value of -2, and so on.
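To make the index-based annotations concrete, the sketch below recovers a quoted region from a message body. The documentation does not state whether end_index is inclusive or exclusive; this example assumes it is exclusive, matching Python slicing, and that should be checked against the real data. The body text, uids, and the helper name `annotated_text` are hypothetical.

```python
original = "Sounds good."
quoted = "> Can we meet at 3?"
body = original + "\n" + quoted

# Hypothetical annotation; indices are computed rather than hard-coded.
quote_start = body.index(quoted)
quote_annotation = {
    "start_index": quote_start,
    "end_index": quote_start + len(quoted),  # ASSUMPTION: exclusive end
    "author_id": "42",
    "message_id": "1001",
    "relative_depth": -1,  # directly quoted in this email
}


def annotated_text(body, annotation):
    """Return the span of the body covered by an annotation."""
    return body[annotation["start_index"]:annotation["end_index"]]


print(annotated_text(body, quote_annotation))  # > Can we meet at 3?
```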

class Message

Model for a general representation of a message as its own entity, storing references to senders and recipients, as well as the message’s text content.

The Message class defines the following attributes:

uid

An integer uniquely identifying the message. For example, in our corpus, the generation of such an integer is useful for uniquely identifying a message.

subject

The subject line (if available) of a message.

body

The text constituting the body of the message.

sender

A field containing the uid of the sender of the message. For the sake of generality, this is assumed to be a string.

recipients

A field containing the uids (assumed to be strings) of the intended recipients of the message. For the sake of generality, this field is assumed to be a list, since a message may have multiple intended recipients (as is often the case with emails). In the case of there being only one intended recipient, it is encouraged to still create a list with one entry for this field, in order to keep a consistent format. When a recipients list is used in this way, the presence of a to field indicates that one recipient was intended as the principal recipient.

to

A field containing the uid (assumed to be a string) of the principal or sole recipient of the message, if available.

date_time

A Python datetime object representing the time the message was sent.

annotations

A field containing a list of instances of the Annotation class. This provides a uniform format for providing additional information about a message (such as indicating a MessageQuote)

class SplitterTextAnnotation

An instance of an Annotation used to indicate a portion of a Message body that is "splitter text", separating a quote from the remainder of the message.

start_index

An integer indicating the character index of the Message body at which the splitter text starts

end_index

An integer indicating the character index of the Message body at which the splitter text ends

author_id

The uid of the author of the quoted message.

message_id

uid of the quoted message (if available)

This class inherits from Annotation

class TextChunk

Container object for information describing a "chunk of text" (typically a single message) present in an email. The information represented by these objects is more naturally represented as a combination of a single contiguous body string and corresponding MessageQuoteAnnotations and SplitterTextAnnotations.

This class contains the following attributes:

content

A string giving the content of this TextChunk

splitter_text

The text (if any) that immediately precedes this chunk of text in the message, typically used to indicate quote.

splitter_type

A code describing the "type" of splitter text that is used.

relative_depth

An integer that describes the position of this text chunk in the email. Relative depth of 0 means that this is the original text of the email. A directly quoted message in the email will have a relative depth of -1, and a message quoted by that message will have a value of -2, and so on.
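The relationship between TextChunks and the annotation-based representation mentioned above can be sketched as a conversion from a list of chunks into one contiguous body plus index ranges. This is an illustrative sketch only: it assumes chunks are dicts in message order, that splitter_text precedes its chunk's content, and that end indices are exclusive. The "kind" key and the function name `flatten_chunks` are hypothetical and not part of the database schema.

```python
def flatten_chunks(text_chunks):
    """Join chunks into one body string and record the range of each piece."""
    parts, annotations, cursor = [], [], 0

    def add(text, kind, depth):
        nonlocal cursor
        annotations.append({
            "kind": kind,  # hypothetical label, not a schema field
            "start_index": cursor,
            "end_index": cursor + len(text),  # ASSUMPTION: exclusive end
            "relative_depth": depth,
        })
        parts.append(text)
        cursor += len(text)

    for chunk in text_chunks:
        depth = chunk["relative_depth"]
        if chunk.get("splitter_text"):
            add(chunk["splitter_text"], "splitter", depth)
        add(chunk["content"], "original" if depth == 0 else "quote", depth)
    return "".join(parts), annotations


# Hypothetical chunks for a reply that quotes one earlier message.
chunks = [
    {"content": "Sounds good.\n", "splitter_text": None, "relative_depth": 0},
    {"content": "Can we meet at 3?\n",
     "splitter_text": "-----Original Message-----\n", "relative_depth": -1},
]

body, annotations = flatten_chunks(chunks)
for a in annotations:
    print(a["kind"], repr(body[a["start_index"]:a["end_index"]]))
```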

Threads

In this database, Threads are stored in their own collection, titled "threads"

class Thread

Threads are intended to be stored in their own collection, and reference a collection of Messages in a database. This model can be inherited from and extended for a given data set.

A Thread is decomposed into ThreadNodes, which correspond to Messages in the message collection. These nodes contain edges to other nodes in the thread (e.g. the node corresponding to a reply will have an edge to the node corresponding to the replied-to message in the thread). A “thread” is the graph represented by these sets of nodes and edges. The graph is stored using an adjacency list representation. See ThreadNode for more details on how this graph is represented.

There are several advantages for storing these “thread nodes” within a Thread object instead of storing the messages themselves inside the thread object:

For those uninterested in the concept of threads, messages can be accessed directly in the Messages collection instead of having to access them through Thread objects.

Only the information relevant to a message’s participation in a given Thread is stored in the corresponding ThreadNode.

A message can be considered part of multiple threads without redundant storage of information. For example, even within the same “thread” in a message board, some posts are direct responses to one another, or discuss a somewhat different subject than other posts, and these can be grouped into their own Thread.

Some ThreadNodes may correspond to a message which is unavailable. For example, an email may have been removed from a data set.

The class attributes are:

title

Title of the thread, corresponds to the subject line of an email thread or thread title in a message board

uid

A unique identifier of the thread

root_node_id

uid of the ThreadNode that is the root of this thread

time_created

thread_nodes

A list of the ThreadNode objects that constitute this thread

dataset

A field which takes one of three values: train, test or dev

annotation_group_id

A number grouping this thread into a smaller subgroup for hand annotation.

class ThreadNode

Model for representing a message’s participation in a Thread.

ThreadNodes are embedded objects stored in Threads to describe a message's participation in the thread. ThreadNode objects are intended to be part of a set which constitutes a graph called a Thread. This graph is stored in an adjacency list representation, with each node having an incident_edges array attribute, storing a list of the (directed) edges corresponding to this node.

The class attributes are:

message_id

The uid of the Message in the corresponding Messages collection to which this ThreadNode corresponds.

uid

A unique identifier of this thread_node.

node_depth

Depth of this node in the thread. e.g. a node corresponding to a direct reply to the original message in the thread will have a node_depth value of 1, while the node corresponding to the original message will have a node_depth value of 0.

incident_edges

An array containing entries for each message in the thread.

to

uid of destination thread node

ffrom

uid of originating thread node

uid

uid of this thread
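As an illustration of the adjacency list representation, the sketch below walks a thread from its root and derives each node's depth from the edges. It assumes a Thread has been loaded as a plain dict, that each edge is a dict with to and ffrom keys, and that a reply's edge points from the reply (ffrom) to the replied-to node (to), as described for Thread above. The function name `compute_depths` and the sample uids are hypothetical.

```python
from collections import deque


def compute_depths(thread):
    """Return {thread node uid: depth}, with the root node at depth 0."""
    # Map each replied-to node uid to the uids of its direct replies.
    replies = {}
    for node in thread["thread_nodes"]:
        for edge in node["incident_edges"]:
            if edge["ffrom"] == node["uid"]:  # only edges leaving this node
                replies.setdefault(edge["to"], []).append(node["uid"])

    # Breadth-first walk outward from the root.
    depths = {thread["root_node_id"]: 0}
    queue = deque([thread["root_node_id"]])
    while queue:
        current = queue.popleft()
        for reply_uid in replies.get(current, []):
            if reply_uid not in depths:
                depths[reply_uid] = depths[current] + 1
                queue.append(reply_uid)
    return depths


# Hypothetical thread: b replies to a, c replies to b.
thread = {
    "root_node_id": "a",
    "thread_nodes": [
        {"uid": "a", "message_id": 1, "incident_edges": []},
        {"uid": "b", "message_id": 2, "incident_edges": [{"ffrom": "b", "to": "a"}]},
        {"uid": "c", "message_id": 3, "incident_edges": [{"ffrom": "c", "to": "b"}]},
    ],
}

print(compute_depths(thread))  # {'a': 0, 'b': 1, 'c': 2}
```

The values computed this way should agree with the stored node_depth attribute, which makes the function usable as a consistency check under the stated assumptions.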
