Devise both a storage access protocol and a transport protocol for all source code. Source code is defined as plain text that has some semantic meaning and a representation as a parsed tree of terms according to some grammar. The purpose of this protocol is not to address the transport or storage of plain text formats; FTP, HTTP, and filesystems already handle this duty. Build a first version in Java that is very easy to understand, code, and use (objects, vectors, a big memory hog, no algorithmic optimization), then clean it up into high-performance Java, and then perhaps port it to other languages (Perl?, C/C++).
Can we come up with a heavy first protocol that depends on knowing both the storage and display formats? Then can we come up with a really light protocol (SOAP envelopes) that just passes events back and forth? Note that this second protocol is not a request/response model, and therefore not SOAP/HTTP compliant. It would be analogous to LDAP*, which is not request/response based, but still implemented on TCP/IP.
*Actually, I'm not sure if LDAP is request/response based or not, but the new protocol would be at the layer of LDAP or HTTP (not on top of HTTP) regardless.
Make next-generation source code storage and manipulation feasible. Plus, if you write the protocol, everyone knows your name, and you've definitely had an impact on the industry. This topic has huge implications and areas of study, and ties in with some of what Clay was doing. It also has major security implications, as well as software engineering in-the-large considerations.
By "next-generation", I mean we will be allowing the evolution of storing source code to continue. Just as source code used to be stored on punch cards or pushpins, and is now stored book-like in pages of flat text, so now we can move source code to a data format that takes advantage of the power of modern relational database structures, centralized and access-controlled electronic storage, and the ability to view and search the source code in many different formats and vectors.
The most recent copy of this paper can be found at http://www.cs.columbia.edu/~locasto/projects/scp/protocol.proposal.html
This idea should be turned into one or more RFCs. It's a good way to get feedback and the "official" way to publish something of this sort.
This protocol actually consists of two different pieces:
Both protocols must make "getting and writing code" very easy, as easy as firing up your favorite text editor. Clients should be able to be light (or at least pluggable via a standard client API, the SCTP).
What should we call these two new protocols? Need to discuss and research possible conflicts in naming. The author of the SCID paper should also be contacted for help, acknowledgement, or just good will.
The phrase source web will refer to the graph (parse tree or parse graph) that is the realized internal conceptual and physically stored model of the source code.
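As a rough illustration of the definition above, a source web can be pictured as nodes joined by directed edges, where cycles are allowed. This is only a minimal sketch under assumptions: the names `SourceWeb`, `Node`, `link`, and `reachable` are hypothetical and are not part of the proposal.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a "source web": parse nodes joined by directed edges. */
final class SourceWeb {

    /** One node of the parse graph, e.g. a class, a method, or an expression. */
    static final class Node {
        final String kind;  // grammar term, e.g. "class" or "method" (assumed labels)
        final String name;  // identifier; may be empty for anonymous terms
        final List<Node> edges = new ArrayList<>();

        Node(String kind, String name) {
            this.kind = kind;
            this.name = name;
        }

        /** Adds a directed edge from this node to target and returns this node. */
        Node link(Node target) {
            edges.add(target);
            return this;
        }
    }

    /** Counts the nodes reachable from root; the visited set keeps cycles from looping. */
    static int reachable(Node root) {
        Set<Node> seen = new HashSet<>();
        List<Node> stack = new ArrayList<>();
        stack.add(root);
        while (!stack.isEmpty()) {
            Node current = stack.remove(stack.size() - 1);
            if (seen.add(current)) {
                stack.addAll(current.edges);
            }
        }
        return seen.size();
    }
}
```

Because a graph rather than a strict tree is assumed, any traversal has to track visited nodes, which is why `reachable` keeps a `seen` set.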
A component container, such as Jakarta's Avalon Framework and its Phoenix container, is a piece of server-side logic that manages the lifecycle of components. Components are groups of objects that have a standard interface and a hidden implementation, require resource and lifecycle management, and follow the SoC (Separation of Concerns) and IoC (Inversion of Control) patterns.
The idea of storing source code in a complex structure rather than flat files is presented in this publication: http://mindprod.com/scid.html
Below are some very relevant quotes from the body of the above mentioned paper. Italicized comments follow each quote.
"We make the error of thinking computer programs are primarily for communicating with computers. On a project that requires more than one person, the source code is primarily for communicating between people. The SCID gives you a mechanism to record information only of interest to people and to help you manage that information overload."
This is especially relevant in large-scale distributed software engineering projects like Clay. In reality, the source code is secondarily for communicating between people. Documentation and design specifications should be the primary communication medium. What the source code actually should accomplish is to communicate from the machine/compiled code to the people. Source code represents machine code, even though it has become a way to communicate between developers. What this protocol should do is merge source code and documentation so that the source code does become part of the primary communications medium.
"There is no source code, just the parse tree. You are thus free to display it in many different possible formats, or to export traditional Java source. The parse tree always represents a syntatically valid Java program."
The author makes a bit of a strong statement, but only to share his viewpoint in a clear manner. It is a vital point to assure the user that the source code does in fact exist and that we provide a viable migration path. You can always get/extract your "source code" back in a plain text, traditional format. That said, in effect all that exists in the physical data storage is the parse tree structure that represents a valid parse tree in the schema for that language. Storage format depends on the language XML schema document that describes the grammar. The actual physical format of the data storage can vary from backend implementation to backend implementation. However, the objects are always available to the client (transport protocol) as artifacts described by the XML schema document.
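The migration path described above (storing only the tree while still being able to hand back plain text) can be sketched as a walk over the tree that writes out the leaf tokens in order. This is a toy under assumptions: `TreeExport`, `Term`, and `toSource` are hypothetical names, and tokens are simply joined with single spaces rather than formatted by a real pretty-printer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: regenerate plain-text source from a stored parse tree. */
final class TreeExport {

    /** A term of the tree; leaves carry token text, inner terms only group children. */
    static final class Term {
        final String text;  // token text for a leaf, empty for an inner term
        final List<Term> children = new ArrayList<>();

        Term(String text) {
            this.text = text;
        }

        Term add(Term child) {
            children.add(child);
            return this;
        }
    }

    /** Walks the tree depth-first and joins the leaf tokens with single spaces. */
    static String toSource(Term root) {
        StringBuilder out = new StringBuilder();
        write(root, out);
        return out.toString();
    }

    private static void write(Term term, StringBuilder out) {
        if (term.children.isEmpty()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(term.text);
            return;
        }
        for (Term child : term.children) {
            write(child, out);
        }
    }
}
```

A real exporter would apply the display format chosen by the client; the point here is only that the text can always be derived from the tree.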
"The parse tree contains much more data than the equivalent source code, e.g. history of change, who changed each token and why"
Logging functions and comments are already present in current CVS systems, but the logical conclusion of this approach is a potentially very sensitive operation, even though it is a very good selling point for managers. Privacy and security concerns come into play here, as well as at the lower level of the actual code components. For example, Abby may own four "objects" in the system and Ben may own three others. However, Abby and Ben may both be allowed to edit a common object. To complicate matters further, Carol may own a library that is not available as source code (closed source) that any of Abby's or Ben's objects must use, in either a secure (authenticated) or insecure manner. Abby or Ben might be able to read Carol's library but not change it, or they may be able to execute it but not read it.
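The Abby, Ben, and Carol scenario amounts to per-object, per-user rights to read, write, or execute. A minimal sketch of such a check follows; the class `CodeAccess`, the `Right` enum, and the object names used with it are assumptions made for illustration, not something the proposal specifies.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of per-object access rights in the code store. */
final class CodeAccess {

    enum Right { READ, WRITE, EXECUTE }

    // object name -> (user name -> rights granted on that object)
    private final Map<String, Map<String, Set<Right>>> grants = new HashMap<>();

    /** Grants one or more rights on an object to a user. */
    void grant(String object, String user, Right first, Right... rest) {
        grants.computeIfAbsent(object, k -> new HashMap<>())
              .computeIfAbsent(user, k -> EnumSet.noneOf(Right.class))
              .addAll(EnumSet.of(first, rest));
    }

    /** Returns true only if the right was explicitly granted; the default is deny. */
    boolean allowed(String object, String user, Right right) {
        Map<String, Set<Right>> byUser = grants.get(object);
        if (byUser == null) {
            return false;
        }
        Set<Right> rights = byUser.get(user);
        return rights != null && rights.contains(right);
    }
}
```

With this shape, "execute but not read" is expressed by granting only `EXECUTE` on Carol's library, and a shared object is simply one on which both Abby and Ben hold `WRITE`.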
The comments in this section are taken from another context (e-mail).
It seems as if the stuff Neff is doing with his mini-OO compiler and his "tree" of compiled nodes would be a good pattern for implementing a little parser for the SCID. That tree can then be translated to really any data storage mechanism via XML (JAXB, Java-XML binding).
The essential component of this guy's argument is that source code shouldn't only be stored in a flat file, but in some kind of robust structure of "lego bins"...so a CVS is no longer just a directory structure of text files, but a complex tree - no, it's a graph, maybe cyclical - (and this is where the intractable part comes in, eh?). We have parsers and lexical analyzers to transform code into some tree structure already, so why not make this a live process and take advantage of it? A file/source code represents a chunk of this tree.
<more-interesting-problems> <ip> write a standard storage and translation mechanism with XML for src </ip> <ip> write a protocol for people to exchange source code in different languages using this storage idea. (basically extending HTTP/SOAP, or perhaps using or modeling LDAP). Version 1 is basically just the specification for an access (request/response) type mechanism. It is actually really easy to roll out version 1, and if it takes off, version 2 can be better. </ip> </more-interesting-problems> <selling-points intro="Besides just prettying up the code, the organization of it should have nice side effects like the following."> <side-effect> perhaps make formal verification of code possible </side-effect> <side-effect> allow for higher-level languages than Java (e.g., programming in near-american) </side-effect> <side-effect> can be coupled with component managers to keep your code "live" - since it is ostensibly in a structure that can be compiled automatically, your "object" can always be part of the system...you can make live changes in the production system by just editing source code...building, compiling, linking, and configuration can be automated. </side-effect> </selling-points> <snip> this would definitely be a way to create a "common" base of code. Unfortunately, most organizations don't share code and effort is duplicated over and over. Even organizations that are sharing code internally have big problems doing so. I look at Apache, who seems to have worked it out pretty well, but they still have *human level* fights over "I want to use my code b/c I wrote it and I know it." So that is a problem to be solved. I don't think any protocol can do that. Only peacemaking. Need some kind of verification of the quality of the code too. Plus, it may not be a good idea to form this single huge code base, for many security type reasons. 
What the protocol can do is offer the ability to associate attributes with code objects (entries in an LDAP-type directory?). The attributes would be a mixture of language specific and general attributes. So a collection of ten objects (a segment of the "source web") would represent a particular AI or sorting algorithm, or Component like a SessionManager or something. I guess what this kind of does is make code more like a resource or utility like running water and electricity, which I like </snip> <snip> Almost, but not quite. Just translate the webpage into a CVS interface (any interface you want, actually, a browser that speaks the protocol is fine too.) and the code library of plain text/html into a database of XML descriptions of the source web and the objects in it. Yes, there would have to be some sort of naming convention established (but I think fully qualified class and package names following the reverse-DNS system is pretty good) in order to "publish" your public source web and make it available for general "JINI/JNDI" lookup. So, the LDAP-type naming, following java package conventions, should be a pretty solid start. Of course, "hints" and descriptions can be additional attributes so that you can do a sort of SQL "select" or LDAP-style filter when you search the codebase to return "all form and data scripts". This query language would be built into the transport layer, and the storage layer would provide the hooks for implementation over the data. Sorry for all the doubt-quotes sprinkled above, it's just that most of this stuff is tough to put a name to, and I don't want to commit to saying "you must use JNDI and only JNDI", because of course the protocol is language independent, and you can choose to implement it in Java or .NET or whatever else supports lookups and whatnot. </snip>
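The idea in the e-mail above of attaching attributes to code objects and searching them with an LDAP-style filter can be sketched with an in-memory directory. Everything here is an assumption for illustration: `CodeDirectory`, `Entry`, `publish`, and `search` are hypothetical names, and the "filter" is reduced to exact matches on attribute pairs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of an attribute-based lookup over published code objects. */
final class CodeDirectory {

    /** A published code object, named in reverse-DNS style, with free-form attributes. */
    static final class Entry {
        final String name;
        final Map<String, String> attributes = new HashMap<>();

        Entry(String name) {
            this.name = name;
        }

        Entry with(String key, String value) {
            attributes.put(key, value);
            return this;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void publish(Entry entry) {
        entries.add(entry);
    }

    /** Returns the names of entries whose attributes contain every pair in the filter. */
    List<String> search(Map<String, String> filter) {
        List<String> hits = new ArrayList<>();
        for (Entry entry : entries) {
            if (entry.attributes.entrySet().containsAll(filter.entrySet())) {
                hits.add(entry.name);
            }
        }
        return hits;
    }
}
```

A real query language would need wildcards and boolean operators; the sketch only shows how general attributes (such as a kind) and language-specific ones can share one lookup path.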
The storage protocol should define a mechanism for storing different kinds of source code (images, sounds, semantic text [source code as defined above]). The mechanism should be platform independent. The mechanism is not concerned with the underlying physical storage or the length of bit fields, endianness, etc.
The storage protocol MUST define how to create and maintain a framework or container for hosting (and plugging in) different language schema sets. The format of a schema set WILL be specified by the storage protocol. The details of a schema set are free to the implementor of that schema.
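One way to picture the container for pluggable schema sets is a registry keyed by language. The sketch below is only an assumption about shape: the proposal does not define `SchemaSetContainer`, the `SchemaSet` interface, or its `accepts` method, which stands in here for real validation against a grammar.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a container that hosts pluggable language schema sets. */
final class SchemaSetContainer {

    /** Assumed shape of a schema set; the real format would come from the storage protocol. */
    interface SchemaSet {
        String language();

        boolean accepts(String source);  // stand-in for validation against the grammar
    }

    private final Map<String, SchemaSet> plugins = new HashMap<>();

    /** Registers a schema set; a second registration for the same language is rejected. */
    void plugIn(SchemaSet schemaSet) {
        String key = schemaSet.language();
        if (plugins.containsKey(key)) {
            throw new IllegalStateException("schema set already plugged in: " + key);
        }
        plugins.put(key, schemaSet);
    }

    /** Looks up the schema set for a language, if one has been plugged in. */
    Optional<SchemaSet> forLanguage(String language) {
        return Optional.ofNullable(plugins.get(language));
    }
}
```

The container fixes only the plug-in contract, while each implementor remains free to decide what happens inside a schema set, which matches the split of responsibilities stated above.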
The underlying storage mechanism MAY be:
However, the underlying mechanism must support the ability to use the SCSP. This will probably be done by the use of a standard extension or library that implements the SCSP.
It would be a nice feature of the storage protocol and mechanism if an LDAP-type lookup (JNDI) could be performed on a stored (compiled) object directly from the new CVS. Thus, the CVS acts as both a source code storage device and a component or object container*.
*For an example of what I mean by a component container, see Apache-Jakarta's Avalon project. However, a directory that can store serialized Java objects (like OpenLDAP) is just fine for beginners, as long as there is some overlaying logic that implements the requirements of the SCSP presented below.
For example, in LDAP, it is possible to look up a serialized Java object belonging to Mary (via JNDI) like so:
DirContext directory = ....
Printer p = (Printer) directory.lookup( "dc=com,ou=mycompany,cn=Mary,sn=Jenks,ou=objects,cn=HP4050Printer" );
// use printer...
An implementation of the SCSP will consist of and define mechanisms for handling:
Provides a mapping from the parse tree to the actual physical storage (or storage API) format. For example, the document has two parts (or is two documents): one that specifies the valid grammar and one that maps each "object" (note that functions are objects) to a physical data representation. The XGS Container will basically construct a parser and lexical analyzer (if not a compiler!) for the language. This construct will then scan (perhaps multi-threaded) all input to place it into the correct part of the DFA (source web/graph of nodes). Even procedural-oriented languages can be represented by DFAs or better.
Valid Java syntax --> [some object segment] --> [serialized object graph, these bytes here, accessible by this unique identifier]
Valid Perl syntax --> [some object segment] --> [4 byte array, int]
Valid Pascal syntax --> [segment of parse tree] --> [int]
The "point" of most parsers and compilers is to turn the nicely laid out stream of English-seeming characters (with nice visible pseudo-physical logic loops and flow of execution) into a long thin stream of hex, then binary numbers. We need to use Neff's representation of a Tree of objects to store the source web. The tough part is going to be zeroing in on exactly what a procedural language looks like as a source web, and seeing if the parsed source web looks anything like a DFA, or if it is non-deterministic or an actual Turing machine. What about EventListener code that waits for user input? That is inherently non-deterministic. We need to invent some sort of special handler or signal for that case, a kind of "Habermann" cut that short-circuits the XGSD Container from sending the GenParLexAnalyzer down that path.
The SCTP is the actual protocol for communicating between SCSP implementations. It specifies the format of the data "on the wire."
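The proposal does not yet define the wire format, so the following is purely an illustrative guess at what a single event frame could look like, not the SCTP itself. The class `EventFrame`, its two fields, and the choice of length-prefixed strings (via the standard `DataOutputStream.writeUTF`) are all assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of one event as it might be framed "on the wire". */
final class EventFrame {
    final String type;    // assumed event type, e.g. "node-changed"
    final String nodeId;  // assumed identifier of the affected source web node

    EventFrame(String type, String nodeId) {
        this.type = type;
        this.nodeId = nodeId;
    }

    /** Encodes both fields as length-prefixed strings in a fixed order. */
    byte[] encode() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(type);
            out.writeUTF(nodeId);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes a frame produced by encode(), reading the fields in the same order. */
    static EventFrame decode(byte[] frame) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame));
            return new EventFrame(in.readUTF(), in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A binary frame is used here only because it is short to show; an XML or SOAP-envelope encoding, as discussed earlier in the proposal, would carry the same two fields.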
It may be important to establish a body (public, government or otherwise) to oversee how companies protect and store your data. Access control rights to "works of art" must comply with various copyrights and public licenses (Apache, BSD, GNU GPL). Freely available code over the network must be protected but never patented.