The Enron email database is broken into three collections: entities, emails, and threads.
Base model for representing an entity in the Enron organization, ranging in specificity from a division like "Government Affairs" to individual employees.
The EnronEntity class describes the following attribute common to many EnronUnits and EnronEmployees:
An array of the PositionNode objects describing this entity's position in the organizational hierarchy at Enron.
Model for representing organizational units within Enron like "Global Corp Affairs". This model contains all the attributes of an EnronEntity, but is defined in order to allow one to quickly distinguish between an organizational unit and an employee in the entities collection.
This class inherits from: EnronUnit
Model for representing employees at Enron.
This model contains the following attributes of an Emailer:
This model also contains the following attribute:
A string describing this employee's highest position in the org chart
As well as:
This is an alphanumeric string associated with every position in the Enron org chart file.
Since some employees in the org chart have multiple PositionNodes associated with them, EnronEmployees also have the following field:
A list of all the PositionNodes associated with this entity in the org chart.
This model contains the following attributes of an Emailer:
A single, canonical email_address belonging to this emailer (if one exists). If an emailer has multiple email addresses and it is not clear which single address should be chosen, then this field will not exist. On the other hand, the email_addresses field will always exist, even if the emailer has just a single email address.
Names by which an email participant has been referred to in the corpus. There is a many-to-one relationship between names and an email participant. "[W]e made some effort to recognize different names that referred to the same person. Chiefly this was by the different names that appeared in the 'From' field in a single person's sent mail items. Sometimes this led to, e.g. an executive and an assistant being assigned the same uid."
For example, the string 'From: "William Williams" <bwillia5@enron.com>' may occur in one or more emails, associating the name "William Williams" with the owner of the email address bwillia5@enron.com.
An array of email_addresses associated with this emailer
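As a rough illustration of how the optional email_address field and the always-present email_addresses field relate, the following sketch picks a single address for an emailer that has been read into a plain dictionary. The dictionary layout and the helper name preferred_address are assumptions made for this example, not part of the database itself.

```python
from typing import Optional


def preferred_address(emailer: dict) -> Optional[str]:
    """Return one address for an emailer, or None if the choice is ambiguous.

    Assumes `emailer` is a dict shaped like a document from the entities
    collection (hypothetical layout for this sketch).
    """
    # The canonical field is only present when a single address could be chosen.
    canonical = emailer.get("email_address")
    if canonical:
        return canonical
    # email_addresses is documented as always existing; fall back to it.
    addresses = emailer.get("email_addresses", [])
    if len(addresses) == 1:
        return addresses[0]
    # Several addresses and no canonical one: leave the decision to the caller.
    return None
```

Returning None rather than guessing mirrors the documented behaviour that email_address is simply absent when no single address is clearly the right one.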
A model associated with an EnronEntity describing its position in the organizational hierarchy at Enron (as charted in the orgcharts file) as a node in a graph, where an edge from this node to another indicates this node is in a higher position, and vice-versa.
A PositionNode has a unique id, and a list of edges (specifically, of type PositionEdge) connecting this node to other nodes. Each PositionNode is associated with (and stored as a property of) an EnronEntity, representing that entity's position within the organization. For example, a supervisor will have a PositionNode with a position string of "Supervisor", and edges from his position node to the position nodes of those he supervises to indicate that he is at a higher level than these employees in the organizational hierarchy.
Also, an organizational unit like "US Natural Gas" has a PositionNode associated with it, representing this unit's position in Enron. This position node's incident_edges field contains PositionEdges from it to all the employees it contains, and has a PositionEdge pointing to it from the PositionNode associated with Mr. Steffes, the director of the unit.
A PositionNode has the following attributes:
A string giving the title of the position.
This is an alphanumeric string associated with every position in the Enron org chart file.
A string which serves as a unique identifier of this PositionNode. PositionEdges pointing to or from this node will reference this value.
An array containing the PositionEdges originating from, or pointing to this position in the organizational hierarchy.
This class stores a "relationship" property specific to directed edges between PositionNodes in our representation of the Enron hierarchy. The to and ffrom fields in this edge store the uids of the PositionNodes this edge is connecting.
A PositionEdge has the following attributes:
A field storing the uid of the "child" of the PositionNode from which this edge is pointing. This indicates that the position being pointed to is below the position this edge comes from in the organizational hierarchy.
A field storing the uid of the "parent" of the PositionNode to which this edge is pointing. This indicates that the position this edge comes from is above the position this edge points to in the organizational hierarchy.
A string describing the nature of the relationship between the positions being connected. For example, a position may "manage" another employee or entity.
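The node-and-edge layout described above can be sketched with plain Python dataclasses. This is only an illustration: the field names uid, incident_edges, to, ffrom, and relationship follow the text, while the attribute name position and the helper subordinate_uids are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PositionEdge:
    to: str            # uid of the lower ("child") position
    ffrom: str         # uid of the higher ("parent") position
    relationship: str  # nature of the link, e.g. "manage"


@dataclass
class PositionNode:
    uid: str
    position: str  # title of the position (attribute name assumed)
    incident_edges: List[PositionEdge] = field(default_factory=list)


def subordinate_uids(node: PositionNode) -> List[str]:
    """Return uids of the positions directly below `node`.

    incident_edges holds edges in both directions, so only the edges that
    originate from this node are followed.
    """
    return [edge.to for edge in node.incident_edges if edge.ffrom == node.uid]
```

Because each edge is stored on both of the nodes it touches, filtering on ffrom is what separates "positions I am above" from "positions above me".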
Class for describing entities which have an email address that ends in @enron.com, but who are not in our orgchart and cannot be confirmed to be employees (ex. there are many mailing lists that come from an @enron.com address).
This class inherits from EnronEntity and Emailer
Model for representing a unique message appearing in the Enron email data set.
This model contains the following attributes:
An integer uniquely identifying the message.
The subject line (if available) of a message.
The text constituting the body of the message.
A field containing the uid of the sender of the message. For the sake of generality, this is assumed to be a string.
A field containing the uid of the sender of the message. The difference between this field and sender is that for emails in the MySQL database that did not indicate a sender, the "sender" field does not exist, while the "from" field exists with a null value.
A field containing the uids of the intended recipients of the message. For the sake of generality, this field is assumed to be a list, since a message may have multiple intended recipients (as is often the case with emails). In the case of there being only one intended recipient, it is encouraged to still create a list with one entry for this field, in order to keep a consistent format.
A list containing the uids of the people to whom this email was addressed. Even when the header info describing who the email was sent to is missing, an empty array is stored for this field.
A list containing the uids of those who have been cc'd/bcc'd on this email. In the case that there were no such people, this field will simply contain an empty list.
A Python datetime object representing the time the message was sent.
A list of objects containing additional information about a message (such as indicating a MessageQuote)
In addition, the following fields unique to this dataset are defined:
This is a boolean field where a value of true indicates that this message was missing in the corpus, but was reconstructed from quotations in other messages.
This is a list of HeaderInfo objects, each of which has the uid of a sender or recipient of the message, and the corresponding information about this person that appeared in the header of the email (ex. the email address that was used)
This is a list of filenames of attachments to this message
This is a list of FileInfo objects describing instances of the email appearing in the Enron dataset (ex. an email may appear both in one person's "sent" box and another's inbox)
A list of the TextChunk objects which comprise this email. For example, a quoted message within the email will be represented as its own TextChunk object. I think a more natural representation of the information provided by splitting a message into "text chunks" is to store the message as one contiguous string, and to indicate which ranges are quotes and other information using MessageQuoteAnnotation and SplitterTextAnnotation objects.
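The difference between the "sender" and "from" fields matters when reading messages defensively. The sketch below assumes a message document has been loaded into a plain dictionary; the helper name sender_uid is hypothetical.

```python
from typing import Optional


def sender_uid(message: dict) -> Optional[str]:
    """Return the sender's uid, or None when the email did not indicate a sender."""
    # "sender" is absent when no sender was recorded, so use .get() rather than [].
    if message.get("sender") is not None:
        return message["sender"]
    # "from" is documented as present with a null value in that case.
    return message.get("from")
```

Code that indexes message["sender"] directly would raise a KeyError on exactly the emails that lacked a sender, which is why the lookup goes through .get().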
An object describing the appearance of an Emailer in an email's header, it contains the following attributes:
The uid of the participating entity
The name by which the entity is referred to in the header field of the email, ex. M Scott Kuehn is referred to both as M Scott and as Kuehn in separate emails.
The email_address with which this entity participates in this email (ex. the email was sent to or from this email address)
The role of this entity's participation in the email (ex. this entity was cc'd or, the email was sent from this entity). This field can take any one of the following values: to, from, cc, bcc, sender, reply-to, to-box, from-box, bubble-to, bubble-from
Describes the source of the header information. The most frequent sources of header information were that stored in an email using the Microsoft Exchange protocol (field has value "Exchange v1.0"), and that stored following the general RFC format ("RFCv1.0"). An additional possible value for this field is "mailbox-tagger". This value indicates that no suitable headers were found for the email, and that an attempt was made to recover the header information for the email from the folder the message was found in.
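To show how the role values listed above might be used, here is a small sketch that groups the participants of one email by role. It assumes each HeaderInfo has been read into a dictionary with "uid" and "role" keys; those key names and the helper uids_by_role are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Role values taken from the description of the role field.
HEADER_ROLES = {
    "to", "from", "cc", "bcc", "sender", "reply-to",
    "to-box", "from-box", "bubble-to", "bubble-from",
}


def uids_by_role(header_infos: Iterable[dict]) -> Dict[str, List[str]]:
    """Group participant uids by the role they play in one email's headers."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for info in header_infos:
        role = info["role"]
        if role not in HEADER_ROLES:
            # Fail loudly on values outside the documented set.
            raise ValueError(f"unexpected role: {role!r}")
        grouped[role].append(info["uid"])
    return dict(grouped)
```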
This class describes an instance of a person named within the body of the email. These "mentions" were extracted as described in this paper.
It contains the following attributes of the named entity:
A string indicating the name of the person mentioned in the email.
A field containing the uid of the person whom the mention was resolved to. This is exactly the same as the Entities.uid field.
This is a class for describing an instance of an email appearing in the Enron email data set. For example, the same message may be found in both the "sent" items folder of the sender and "inbox" folder of the receiver
It describes the following properties of this instance of the email:
A string indicating where in the archive of emails the file containing this instance of the message is located.
A string indicating the name of the file in which this instance of the message is located
A string giving the name of the "mailbox" in which this message was found (one of the 158 released mailboxes)
This is a value given to each document in the collection in the FERC dataset provided online (from which this dataset was scraped)
This class indicates a region of content original to this message (not quoted). It contains the following attributes:
An integer indicating the character index of the Message body at which the content starts
An integer indicating the character index of the Message body at which the content ends
A string indicating the type/format of this annotation. This will typically be a “class constant”.
The string "OriginalMessageContent" is stored here, to make the purpose of this annotation immediately apparent to someone browsing the database.
This class is used to indicate a portion of a Message that is quoted from another message. It indicates the original author and uid (if available) of the quoted message.
An integer indicating the character index of the Message body at which the content starts
An integer indicating the character index of the Message body at which the content ends
The uid of the author of the quoted message. Defined as a String value for this general class, can be subclassed to include the type of unique identifier applicable to the specific corpus. For example, email authors may be assigned a unique integer id, while a username String uniquely identifies a poster on an online discussion board.
uid of the quoted Message
An integer that describes the position of this text chunk in the email. Relative depth of 0 means that this is the original text of the email. A directly quoted message in the email will have a relative depth of -1, and a message quoted by that message will have a value of -2, and so on.
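Since both annotation classes describe a character range of the Message body, the annotated text can be recovered by slicing. The following is a minimal sketch, assuming annotations are dictionaries with "type", "start", and "end" keys and that "end" is exclusive; the source does not state whether the end index is inclusive, and the type string "MessageQuote" is a placeholder.

```python
from typing import List, Tuple


def annotated_spans(body: str, annotations: List[dict]) -> List[Tuple[str, str]]:
    """Return (annotation type, covered text) pairs in order of appearance.

    Assumes keys "type", "start", "end", with "end" treated as exclusive.
    """
    ordered = sorted(annotations, key=lambda a: a["start"])
    return [(a["type"], body[a["start"]:a["end"]]) for a in ordered]
```

If the stored end index turns out to be inclusive, the slice would need to end at a["end"] + 1 instead.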
Model for a general representation of a message as its own entity, storing references to senders and recipients, as well as the message’s text content.
The Message class defines the following attributes:
An integer uniquely identifying the message. For example, in our corpus, the generation of such an integer is useful for uniquely identifying a message.
The subject line (if available) of a message.
The text constituting the body of the message.
A field containing the uid of the sender of the message. For the sake of generality, this is assumed to be a string.
A field containing the uids (assumed to be strings) of the intended recipients of the message. For the sake of generality, this field is assumed to be a list, since a message may have multiple intended recipients (as is often the case with emails). In the case of there being only one intended recipient, it is encouraged to still create a list with one entry for this field, in order to keep a consistent format. If this is the case, the presence of a to field indicates the intention for a principal recipient.
A field containing the uid (assumed to be a string) of the principal or sole recipient of the message, if available.
A Python datetime object representing the time the message was sent.
A field containing a list of instances of the Annotation class. This provides a uniform format for providing additional information about a message (such as indicating a MessageQuote)
An instance of an Annotation used to indicate a portion of a Message body that is "splitter text", separating a quote from the remainder of the message.
An integer indicating the character index of the Message body at which the splitter text starts
An integer indicating the character index of the Message body at which the splitter text ends
The uid of the author of the quoted message.
uid of the quoted message (if available)
This class inherits from Annotation
Container object for information describing a "chunk of text" (typically a single message) present in an email. The information represented by these objects is more naturally represented as a combination of a single contiguous body string and corresponding MessageQuoteAnnotations and SplitterTextAnnotations.
This class contains the following attributes:
The text (if any) that immediately precedes this chunk of text in the message, typically used to indicate quote.
A code describing the "type" of splitter text that is used.
An integer that describes the position of this text chunk in the email. Relative depth of 0 means that this is the original text of the email. A directly quoted message in the email will have a relative depth of -1, and a message quoted by that message will have a value of -2, and so on.
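The conversion from text chunks to a single contiguous body with character ranges could look roughly like the sketch below. It assumes each chunk is a dictionary with "text" and "relative_depth" keys and that chunks arrive in the order they appear in the email; these key names, the output layout, and the helper flatten_chunks are all assumptions for illustration.

```python
from typing import List, Tuple


def flatten_chunks(chunks: List[dict]) -> Tuple[str, List[dict]]:
    """Join chunk texts into one body and record a character range per chunk."""
    parts: List[str] = []
    ranges: List[dict] = []
    offset = 0
    for chunk in chunks:
        text = chunk["text"]
        ranges.append({
            "start": offset,
            "end": offset + len(text),  # exclusive end (assumption)
            # Depth 0 is original text; negative depths are quoted messages.
            "is_quote": chunk["relative_depth"] < 0,
        })
        parts.append(text)
        offset += len(text)
    return "".join(parts), ranges
```

Each recorded range can then be turned into the appropriate annotation object, depending on whether it marks original content or a quote.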
In this database, Threads are stored in their own collection, titled "threads".
Threads are intended to be stored in their own collection, and reference a collection of Messages in a database. This model can be inherited from and extended for a given data set.
A Thread is decomposed into ThreadNodes, which correspond to Messages in the message collection. These nodes contain edges to other nodes in the thread (e.g. the node corresponding to a reply will have an edge to the node corresponding to the replied-to message in the thread). A “thread” is the graph represented by these sets of nodes and edges. The graph is stored using an adjacency list representation. See ThreadNode for more details on how this graph is represented.
There are several advantages to storing these “thread nodes” within a Thread object instead of storing the messages themselves inside the thread object:
For those uninterested in the concept of threads, messages can be accessed directly in the Messages collection instead of having to access them through Thread objects.
Only the information relevant to a message’s participation in a given Thread is stored in a corresponding ThreadNode. A message can be considered part of multiple threads without redundant storage of information. For example, even within the same “thread” in a message board, some posts are direct responses to one another, or discuss a somewhat different subject than other posts, and these can be grouped into their own Thread.
Some ThreadNodes may correspond to a message which is unavailable. For example, an email may have been removed from a data set.
The class attributes are:
Title of the thread, corresponds to the subject line of an email thread or thread title in a message board
A unique identifier of the thread
uid of the ThreadNode that is the root of this thread
A list of the ThreadNode objects that constitute this thread
A field which takes one of three values: train, test or dev
A number grouping this thread into a smaller subgroup for hand annotation.
Model for representing a message’s participation in a Thread.
ThreadNodes are embedded objects stored in Threads to describe a message's participation in the thread. ThreadNode objects are intended to be part of a set which constitutes a graph called a Thread. This graph is stored in an adjacency list representation, with each node having an incident_edges array attribute, storing a list of the (directed) edges corresponding to this Node.
The class attributes are:
The uid of the Message in the corresponding Messages collection to which this ThreadNode corresponds.
A unique identifier of this thread_node.
Depth of this node in the thread. e.g. a node corresponding to a direct reply to the original message in the thread will have a node_depth value of 1, while the node corresponding to the original message will have a node_depth value of 0.
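To make the relationship between the adjacency list and node_depth concrete, the sketch below recomputes depths with a breadth-first walk from the root. The format of a thread edge is not spelled out here, so the example assumes each ThreadNode is a dictionary with "uid" and "incident_edges" keys, and each edge is a dictionary with "from" and "to" uids where a reply points to the message it replies to. The helper node_depths is hypothetical.

```python
from collections import deque
from typing import Dict, List


def node_depths(root_uid: str, thread_nodes: List[dict]) -> Dict[str, int]:
    """Compute the depth of every node reachable from the root of a thread."""
    # Map each replied-to node to the nodes that reply to it.
    replies: Dict[str, List[str]] = {}
    for node in thread_nodes:
        for edge in node.get("incident_edges", []):
            # Only follow edges leaving this node (reply -> replied-to).
            if edge["from"] == node["uid"]:
                replies.setdefault(edge["to"], []).append(node["uid"])

    depths = {root_uid: 0}
    queue = deque([root_uid])
    while queue:
        current = queue.popleft()
        for reply_uid in replies.get(current, []):
            if reply_uid not in depths:
                depths[reply_uid] = depths[current] + 1
                queue.append(reply_uid)
    return depths
```

Nodes that are not connected to the root, for example because an intermediate email is missing, simply do not appear in the result.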