.. _enron:

********************
Database
********************

The Enron email database is broken into three collections: `entities`, `emails`, and `threads`.

.. _entities:

========
Entities
========

.. class:: EnronEntity()

   Base model for representing an entity in the Enron organization, ranging in specificity from a division like "Government Affairs" to individual employees.

   The :class:`EnronEntity` class describes the following attribute common to many :class:`EnronUnit`\ s and :class:`EnronEmployee`\ s:

   .. attribute:: position_nodes

      An array of the :class:`PositionNode` objects describing this entity's position in the organizational hierarchy at Enron.

.. class:: EnronUnit()

   Model for representing organizational units within Enron like "Global Corp Affairs". This model contains all the attributes of an :class:`EnronEntity`, but is defined in order to allow one to quickly distinguish between an organizational unit and an employee in the `entities` collection.

   This class inherits from: :class:`EnronEntity`

.. class:: EnronEmployee

   Model for representing employees at Enron. This model contains the following attributes of an :class:`Emailer`:

   .. attribute:: email_address

   .. attribute:: email_names

   .. attribute:: email_addresses

   This model also contains the following attribute:

   .. attribute:: position

      A string describing this employee's highest position in the org chart.

   As well as:

   .. attribute:: position_id

      An alphanumeric string associated with every position in the Enron org chart file.

   Since some employees in the org chart have multiple :class:`PositionNode`\ s associated with them, :class:`EnronEmployee`\ s also have the following field:

   .. attribute:: position_nodes

      A list of all the :class:`PositionNode`\ s associated with this entity in the org chart.

.. class:: Emailer()

   This model contains the following attributes:

   .. attribute:: email_address

      A single, canonical email_address belonging to this emailer (if one exists).
      If an emailer has multiple email addresses and it is not clear which single address should be chosen, then this field will not exist. On the other hand, the :attr:`email_addresses` field will *always* exist, even if the emailer has just a single email address.

   .. attribute:: email_names

      Names by which an email participant has been referred to in the corpus. There is a many-to-one relationship of names to email participant.

      "[W]e made some effort to recognize different names that referred to the same person. Chiefly this was by the different names that appeared in the "From" field in a single person's sent mail items. Sometimes this led to, e.g. an executive and an assistant being assigned the same uid".

      For example, the string ``From: "William Williams"`` may occur in one or more emails, associating the name "William Williams" with the owner of the email address bwillia5@enron.com.

   .. attribute:: email_addresses

      An array of email_addresses associated with this emailer.

.. class:: PositionNode

   A model associated with an :class:`EnronEntity` describing its position in the organizational hierarchy at Enron (as charted in the orgcharts file) as a node in a graph. An edge from this node to another indicates that this node is in a higher position, and an edge pointing to this node indicates that the other node is in a higher position.

   A :class:`PositionNode` has a unique id, and a list of edges (specifically, of type :class:`PositionEdge`) connecting this node to other nodes. Each :class:`PositionNode` is associated with (and stored as a property of) an :class:`EnronEntity`, representing that entity's position within the organization.

   For example, a supervisor will have a :class:`PositionNode` with a position string of "Supervisor", and edges from his position node to the position nodes of those he supervises to indicate that he is at a higher level than these employees in the organizational hierarchy.

   Also, an organizational unit like "US Natural Gas" has a :class:`PositionNode` associated with it, representing this unit's position in Enron.
   This position node's :attr:`incident_edges` field contains :class:`PositionEdge`\ s from it to all the employees it contains, and has a :class:`PositionEdge` pointing to it from the :class:`PositionNode` associated with Mr. Steffes, the director of the unit.

   A :class:`PositionNode` has the following attributes:

   .. attribute:: position

      A string giving the title of the position.

   .. attribute:: position_id

      An alphanumeric string associated with every position in the Enron org chart file.

   .. attribute:: uid

      A string which serves as a unique identifier of this :class:`PositionNode`. :class:`PositionEdge`\ s pointing to or from this node will reference this value.

   .. attribute:: incident_edges

      An array containing the :class:`PositionEdge`\ s originating from, or pointing to, this position in the organizational hierarchy.

.. class:: PositionEdge

   This class stores a "relationship" property specific to directed edges between :class:`PositionNode`\ s in our representation of the Enron hierarchy. The :attr:`to` and :attr:`ffrom` fields in this edge store the :attr:`uid`\ s of the :class:`PositionNode`\ s this edge is connecting.

   A :class:`PositionEdge` has the following attributes:

   .. attribute:: to

      A field storing the :attr:`uid` of the "child" :class:`PositionNode`, the node to which this edge points. This indicates that the position being pointed to is below the position this edge comes from in the organizational hierarchy.

   .. attribute:: ffrom

      A field storing the :attr:`uid` of the "parent" :class:`PositionNode`, the node from which this edge originates. This indicates that the position this edge comes from is above the position this edge points to in the organizational hierarchy.

   .. attribute:: relationship

      A string describing the nature of the relationship between the positions being connected. For example, a position may "manage" another employee or entity.

.. class:: EnronEmailer

   Class for describing entities which have an email address that ends in @enron.com, but who are not in our orgchart and cannot be confirmed to be employees (ex. there are many mailing lists that come from an @enron.com address).

   This class inherits from :class:`EnronEntity` and :class:`Emailer`

.. _emails:

======
Emails
======

.. class:: EnronEmail

   Model for representing a unique message appearing in the Enron email data set.

   This model contains the following attributes:

   .. attribute:: uid

      An integer uniquely identifying the message.

   .. attribute:: subject

      The subject line (if available) of a message.

   .. attribute:: body

      The text constituting the body of the message.

   .. attribute:: sender

      A field containing the :attr:`uid` of the sender of the message. For the sake of generality, this is assumed to be a string.

   .. attribute:: from

      A field containing the :attr:`uid` of the sender of the message. The difference between this field and :attr:`sender` is that for emails in the MySQL database that did not indicate a sender, the "sender" field does not exist, while the "from" field exists with a null value.

   .. attribute:: recipients

      A field containing the :attr:`uid`\ s of the intended recipients of the message. For the sake of generality, this field is assumed to be a list, since a message may have multiple intended recipients (as is often the case with emails). When there is only one intended recipient, it is still encouraged to create a list with one entry for this field, in order to keep a consistent format.

   .. attribute:: to

      A list containing the :attr:`uid`\ s of the people to whom this email was addressed. Even when the header info describing who the email was sent to is missing, an empty array is stored for this field.

   .. attribute:: cc

      A list containing the :attr:`uid`\ s of those who have been cc'd/bcc'd on this email. If there were no such people, this field will simply contain an empty list.

   .. attribute:: date_time

      A Python :class:`datetime` object representing the time the message was sent.

   .. attribute:: annotations

      A list of objects containing additional information about a message (such as indicating a :class:`MessageQuote`).

   In addition, the following fields unique to this dataset are defined:

   .. attribute:: is_bubble

      A boolean field where a value of *true* indicates that this message was missing in the corpus, but was reconstructed from quotations in other messages.

   .. attribute:: header_info

      A list of :class:`HeaderInfo` objects, each of which has the :attr:`uid` of a sender or recipient of the message, and the corresponding information about this person that appeared in the header of the email (ex. the email address that was used).

   .. attribute:: attachments

      A list of filenames of attachments to this message.

   .. attribute:: corresponding_files

      A list of :class:`FileInfo` objects describing instances of the email appearing in the Enron dataset (ex. an email may appear both in one person's "sent" box and another's inbox).

   .. attribute:: text_chunks

      A list of the :class:`TextChunk` objects which comprise this email. For example, a quoted message within the email will be represented as its own :class:`TextChunk` object.

      A more natural representation of the information provided by splitting a message into "text chunks" may be to store the message as one contiguous string, and indicate which ranges are quotes and other information using :class:`MessageQuoteAnnotation` and :class:`SplitterTextAnnotation` objects.

.. class:: HeaderInfo

   An object describing the appearance of an :class:`Emailer` in an email's header. It contains the following attributes:

   .. attribute:: uid

      The uid of the participating entity.

   .. attribute:: email_name

      The name by which the entity is referred to in the header field of the email, ex. M Scott Kuehn is referred to both as M Scott and as Kuehn in separate emails.

   .. attribute:: email_address

      The :attr:`email_address` with which this entity participates in this email (ex. the email was sent to or from this email address).

   .. attribute:: role

      The role of this entity's participation in the email (ex. this entity was cc'd, or the email was sent from this entity). This field can take any one of the following values: `to`, `from`, `cc`, `bcc`, `sender`, `reply-to`, `to-box`, `from-box`, `bubble-to`, `bubble-from`

   .. attribute:: header_source

      Describes the source of the header information. The most frequent sources of header information were that stored in an email using the Microsoft Exchange protocol (field has value "Exchange v1.0"), and that stored following the general RFC format ("RFCv1.0"). An additional possible value for this field is "mailbox-tagger". This value indicates that no suitable headers were found for the email, and that an attempt was made to recover the header information for the email from the folder the message was found in.

.. class:: Mentions

   This class describes an instance of a person named within the body of the email. These "mentions" were extracted as described in this `paper `_. It contains the following attributes of the named entity:

   .. attribute:: mention

      A string indicating the name of the person mentioned in the email.

   .. attribute:: uid

      A field containing the uid of the person whom the mention was resolved to. This is exactly the same as the :class:`Entities.uid` field.

.. class:: FileInfo

   This is a class for describing an instance of an email appearing in the Enron email data set. For example, the same message may be found in both the "sent" items folder of the sender and the "inbox" folder of the receiver.

   It describes the following properties of this instance of the email:

   .. attribute:: file_path

      A string indicating where in the archive of emails the file containing this instance of the message is located.

   .. attribute:: file_name

      A string indicating the name of the file in which this instance of the message is located.

   .. attribute:: mailbox_name

      A string giving the name of the "mailbox" in which this message was found (one of the 158 released mailboxes).

   .. attribute:: sdoc_no

      A value given to each document in the collection in the FERC dataset provided online (from which this dataset was scraped).

.. class:: OriginalMessageContentAnnotation

   This class indicates a region of content original to this message (not quoted). It contains the following attributes:

   .. attribute:: start_index

      An integer indicating the character index of the `Message` body at which the content starts.

   .. attribute:: end_index

      An integer indicating the character index of the `Message` body at which the content ends.

   .. attribute:: annotation_type

      A string indicating the type/format of this annotation. This will typically be a “class constant”. The string "OriginalMessageContent" is stored here, to make the purpose of this annotation immediately apparent to someone browsing the database.

.. class:: MessageQuoteAnnotation

   This class is used to indicate a portion of a Message that is quoted from another message. It indicates the original author and uid (if available) of the quoted message.

   .. attribute:: start_index

      An integer indicating the character index of the `Message` body at which the content starts.

   .. attribute:: end_index

      An integer indicating the character index of the `Message` body at which the content ends.

   .. attribute:: author_id

      The uid of the author of the quoted message. Defined as a String value for this general class; it can be subclassed to include the type of unique identifier applicable to the specific corpus. For example, email authors may be assigned a unique integer id, while a username String uniquely identifies a poster on an online discussion board.

   .. attribute:: message_id

      The uid of the quoted Message.

   .. attribute:: relative_depth

      An integer that describes the position of this text chunk in the email. A relative depth of 0 means that this is the original text of the email. A directly quoted message in the email will have a relative depth of -1, a message quoted by that message will have a value of -2, and so on.

.. class:: Message

   Model for a general representation of a message as its own entity, storing references to senders and recipients, as well as the message’s text content.

   The Message class defines the following attributes:

   .. attribute:: uid

      An integer uniquely identifying the message. For example, in our corpus, such an integer is generated to identify each message uniquely.

   .. attribute:: subject

      The subject line (if available) of a message.

   .. attribute:: body

      The text constituting the body of the message.

   .. attribute:: sender

      A field containing the uid of the sender of the message. For the sake of generality, this is assumed to be a string.

   .. attribute:: recipients

      A field containing the uids (assumed to be strings) of the intended recipients of the message. For the sake of generality, this field is assumed to be a list, since a message may have multiple intended recipients (as is often the case with emails). When there is only one intended recipient, it is still encouraged to create a list with one entry for this field, in order to keep a consistent format. If this is the case, the presence of a to field indicates the intention for a principal recipient.

   .. attribute:: to

      A field containing the uid (assumed to be a string) of the principal or sole recipient of the message, if available.

   .. attribute:: date_time

      A Python datetime object representing the time the message was sent.

   .. attribute:: annotations

      A field containing a list of instances of the Annotation class. This provides a uniform format for providing additional information about a message (such as indicating a MessageQuote).

.. class:: SplitterTextAnnotation

   An instance of an :class:`Annotation` used to indicate a portion of a :class:`Message` body that is "splitter text", separating a quote from the remainder of the message.

   .. attribute:: start_index

      An integer indicating the character index of the :class:`Message` :attr:`body` at which the splitter text starts.

   .. attribute:: end_index

      An integer indicating the character index of the :class:`Message` :attr:`body` at which the splitter text ends.

   .. attribute:: author_id

      The :attr:`uid` of the author of the quoted message.

   .. attribute:: message_id

      The :attr:`uid` of the quoted message (if available).

   This class inherits from :class:`Annotation`

.. class:: TextChunk

   Container object for information describing a "chunk of text" (typically a single message) present in an email. The information represented by these objects is more naturally represented as a combination of a single contiguous :attr:`body` string and corresponding :class:`MessageQuoteAnnotation`\ s and :class:`SplitterTextAnnotation`\ s.

   This class contains the following attributes:

   .. attribute:: content

      A string giving the content of this :class:`TextChunk`.

   .. attribute:: splitter_text

      The text (if any) that immediately precedes this chunk of text in the message, typically used to indicate a quote.

   .. attribute:: splitter_type

      A code describing the "type" of splitter text that is used.

   .. attribute:: relative_depth

      An integer that describes the position of this text chunk in the email. A relative depth of 0 means that this is the original text of the email. A directly quoted message in the email will have a relative depth of -1, a message quoted by that message will have a value of -2, and so on.

.. _threads:

=======
Threads
=======

In this database, :class:`Thread`\ s are stored in their own collection, titled "*threads*".

.. class:: Thread

   Threads are intended to be stored in their own collection, and reference a collection of :class:`Message`\ s in a database.
   This model can be inherited from and extended for a given data set.

   A Thread is decomposed into :class:`ThreadNode`\ s, which correspond to :class:`Message`\ s in the message collection. These nodes contain edges to other nodes in the thread (e.g. the node corresponding to a reply will have an edge to the node corresponding to the replied-to message in the thread). A “thread” is the graph represented by these sets of nodes and edges. The graph is stored using an adjacency list representation. See :class:`ThreadNode` for more details on how this graph is represented.

   There are several advantages to storing these “thread nodes” within a :class:`Thread` object instead of storing the messages themselves inside the thread object:

   * For those uninterested in the concept of threads, messages can be accessed directly in the :class:`Message`\ s collection instead of having to access them through :class:`Thread` objects.
   * Only the information relevant to a message’s participation in a given :class:`Thread` is stored in a corresponding :class:`ThreadNode`.
   * A message can be considered part of multiple threads without redundant storage of information. For example, even within the same “thread” in a message board, some posts are direct responses to one another, or discuss a somewhat different subject than other posts, and these can be grouped into their own :class:`Thread`.
   * Some :class:`ThreadNode`\ s may correspond to a message which is unavailable. For example, an email may have been removed from a data set.

   The class attributes are:

   .. attribute:: title

      Title of the thread; corresponds to the subject line of an email thread or the thread title in a message board.

   .. attribute:: uid

      A unique identifier of the thread.

   .. attribute:: root_node_id

      The :attr:`uid` of the :class:`ThreadNode` that is the root of this thread.

   .. attribute:: time_created

   .. attribute:: thread_nodes

      A list of the :class:`ThreadNode` objects that constitute this thread.

   .. attribute:: dataset

      A field which takes one of three values: *train*, *test* or *dev*.

   .. attribute:: annotation_group_id

      A number grouping this thread into a smaller subgroup for hand annotation.

.. class:: ThreadNode

   Model for representing a message’s participation in a Thread. :class:`ThreadNode`\ s are embedded objects stored in :class:`Thread`\ s to describe a message's participation in the thread.

   ThreadNode objects are intended to be part of a set which constitutes a graph called a Thread. This graph is stored in an adjacency list representation, with each node having an :attr:`incident_edges` array attribute, storing a list of the (directed) edges corresponding to this node.

   The class attributes are:

   .. attribute:: message_id

      The :attr:`uid` of the :class:`Message` in the corresponding `Messages` collection to which this :class:`ThreadNode` corresponds.

   .. attribute:: uid

      A unique identifier of this thread_node.

   .. attribute:: node_depth

      Depth of this node in the thread, e.g. a node corresponding to a direct reply to the original message in the thread will have a :attr:`node_depth` value of 1, while the node corresponding to the original message will have a :attr:`node_depth` value of 0.

   .. attribute:: incident_edges

      An array containing entries for each message in the thread. Each entry has the following fields:

      .. attribute:: to

         The :attr:`uid` of the destination thread node.

      .. attribute:: ffrom

         The :attr:`uid` of the originating thread node.

      .. attribute:: uid

         The :attr:`uid` of this thread.
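As a rough illustration of the adjacency-list layout described for :class:`Thread` and :class:`ThreadNode`, the following Python sketch walks a thread held as a plain dictionary. The sample document, its uid values, and the helper names ``replies_by_parent`` and ``depths`` are hypothetical and are not part of the schema above. The sketch also assumes that each edge is stored on both of the nodes it touches, and relies on the stated convention that a reply node has an edge (``ffrom``) to the replied-to node (``to``).

```python
from collections import defaultdict

# Hypothetical thread document shaped like the Thread/ThreadNode models.
# All uids and message ids below are made up for illustration.
thread = {
    "uid": "t1",
    "title": "example thread",
    "root_node_id": "n0",
    "thread_nodes": [
        {"uid": "n0", "message_id": 101, "node_depth": 0, "incident_edges": [
            {"ffrom": "n1", "to": "n0", "uid": "t1"},
            {"ffrom": "n2", "to": "n0", "uid": "t1"},
        ]},
        {"uid": "n1", "message_id": 102, "node_depth": 1, "incident_edges": [
            {"ffrom": "n1", "to": "n0", "uid": "t1"},
            {"ffrom": "n3", "to": "n1", "uid": "t1"},
        ]},
        {"uid": "n2", "message_id": 103, "node_depth": 1, "incident_edges": [
            {"ffrom": "n2", "to": "n0", "uid": "t1"},
        ]},
        {"uid": "n3", "message_id": 104, "node_depth": 2, "incident_edges": [
            {"ffrom": "n3", "to": "n1", "uid": "t1"},
        ]},
    ],
}


def replies_by_parent(thread):
    """Map each node uid to the sorted uids of the nodes that reply to it."""
    children = defaultdict(set)
    for node in thread["thread_nodes"]:
        for edge in node["incident_edges"]:
            # An edge runs from the reply (ffrom) to the replied-to node (to).
            # Using a set removes edges that are listed on both endpoints.
            children[edge["to"]].add(edge["ffrom"])
    return {parent: sorted(kids) for parent, kids in children.items()}


def depths(thread):
    """Recompute node depths by walking reply edges outward from the root."""
    children = replies_by_parent(thread)
    root = thread["root_node_id"]
    depth = {root: 0}
    stack = [root]
    while stack:
        current = stack.pop()
        for child in children.get(current, []):
            if child not in depth:
                depth[child] = depth[current] + 1
                stack.append(child)
    return depth
```

Comparing the result of ``depths(thread)`` with the stored :attr:`node_depth` values is one way to sanity-check a thread document under these assumptions.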