As I have written in earlier blogs, our challenge is large-volume, highly flexible, pipeline-style processing in which we handle quite different sets of information, fix them and calculate a fee. We are trying to meet it by mixing Domain Driven Design, SOA, Tuple Space, BASE (and others) with a coarse-grained document store that contains Aggregates (see previous discussions, where we seek to store Aggregates as XML documents).
Known good areas of usage
We know that a document approach is applicable to certain types of challenges: Content Management Systems, Search Engines, Cookie Crunchers, Trading etc. We also know that documents handle transactions (messages) very nicely. But how applicable is it to an Enterprise Application type of system? We want loose coupling between sets of data because it lets us scale out, we want functional loose coupling, and there are other reasons discussed in earlier blogs here.
Why two data structures?
We want systems that are easier to develop and maintain. Today most Java systems have one structure in the business layer, where we successfully develop code at a good pace, using unit tests and mock data to enable fast development. Everything seems fine until we reach the object-relational mapping (ORM). Here we must model all the same data again, but now in a different structure. At the storage level we add tables, constraints and indexes so that we are sure the data is consistent. But that has already been done in the business layer. Why do we continue to do this twice?
The relational model is highly flexible, sound and robust. A good reason to use it is that we want to store data in a bigger context than the business logic handles. But wouldn't it be great to relax this layer and trust the business logic instead?
Relational vs. Document
We know that the document approach scales linearly very well, and that the relational database does not have the same properties because of ACID and other stuff, but why is that so?
The main reason a relational database cannot scale out is that data is spread over many tables (and that is the main structure of most object databases too). Data for all contexts is spread across all tables. Data belonging to Party S is all over the place, mixed with that of Party T. During an insert (or update), concurrency challenges arise at tables A, B and C. The concurrency mechanism must handle continuous resource usage on all tables. No wonder referential integrity is important.
In the document model the objects A, B and C are stored within the document. This means that all data for Party S is in one document and all data for Party T is in another. There is no common resource and no concurrency problem between the two.
The document model is less suitable if there are many usage scenarios that handle all objects C, regardless of which entity they belong to.
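To make the contrast concrete, here is a minimal sketch of the idea, assuming an in-memory map as a stand-in for a real document store. The class `PartyDocumentStore` and its methods are hypothetical names chosen for illustration, not part of any product we use. Each party's aggregate is one XML document under its own key, so an update for Party S only touches the entry for S.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

// Hypothetical in-memory stand-in for a document store: one XML document per party.
final class PartyDocumentStore {
    private final Map<String, String> documents = new ConcurrentHashMap<>();

    // Stores (or replaces) the whole aggregate for one party.
    void put(String partyId, String xml) {
        documents.put(partyId, xml);
    }

    // Returns the party's document, or null if the party is unknown.
    String get(String partyId) {
        return documents.get(partyId);
    }

    // Atomic read-modify-write of a single document.
    // Only the entry for partyId is affected; documents of other parties are not touched.
    // Returns the new document, or null if the party is unknown.
    String update(String partyId, UnaryOperator<String> change) {
        return documents.computeIfPresent(partyId, (id, xml) -> change.apply(xml));
    }
}
```

Updating S with `store.update("S", xml -> xml.replace("<a/>", "<a/><b/>"))` leaves the document for T exactly as it was. A real store would of course add durability and versioning, which this sketch leaves out.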
The Enterprise Application challenge
So how do we solve the typical Enterprise Application challenge with a document store approach? (Shouldn't we be twice as agile and productive if we do not need to maintain a separate storage model?) Finding the right granularity is important, and it should most probably follow the main usage scenarios. To be able to compose aggregates, there should be some strong keys on which the business logic must ensure referential integrity. Even though we may not have integrity checks in the storage layer, I am not sure that is so bad. We do validate the documents (XSD and business logic) before we store them. And I have lost count of how much bad data I have debugged in databases, even though they had a lot of schema enforcement.
A lot of the information that we handle in our systems is not part of the domain. There is also intermediate information, historical states and audit, for instance. Remember that in a document approach we reverse the concept and store everything about Party S by itself, in its own document. To be able to cope with a document approach, the document itself must be placed within a structure (a super-document) that has more metadata about common concerns such as: keys, process information (just something simple like: new, under construction, accepted), rationale (which decisions the system made in order to produce this result), anomalies (what errors there are in the aggregate), and audit (who did what, when). The <head> is the Root Object, and it is generic so that all documents are referenced in a uniform way. The super-document is structured like this:
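As an illustration of validating before storing, the following is a minimal sketch using the standard `javax.xml.validation` API from the JDK. The schema itself, the `party` and `fee` element names and the class `DocumentGate` are assumptions made up for this example. Business-logic checks would follow the schema check and are not shown.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical gate that a document must pass before it is handed to the store.
final class DocumentGate {
    // Assumed minimal schema: a party needs an id attribute and at least one decimal fee.
    static final String PARTY_XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"party\">"
        + "<xs:complexType>"
        + "<xs:sequence>"
        + "<xs:element name=\"fee\" type=\"xs:decimal\" maxOccurs=\"unbounded\"/>"
        + "</xs:sequence>"
        + "<xs:attribute name=\"id\" type=\"xs:string\" use=\"required\"/>"
        + "</xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    // Returns true only if the XML is well-formed and valid against PARTY_XSD.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(PARTY_XSD)));
            // With no error handler set, validate() throws SAXException on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }
}
```

With this sketch, `<party id="S"><fee>12.50</fee></party>` passes, while a party without an id, or with a non-numeric fee, is rejected before it reaches storage.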
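One way such a super-document could be assembled is sketched below. Only `<head>` comes from the description above. The other element names (`document`, `key`, `process`, `rationale`, `anomalies`, `audit`, `body`), the nesting and the class `SuperDocument` are assumptions for illustration, and a real structure would be richer.

```java
import java.util.List;

// Hypothetical wrapper that places an aggregate inside a super-document.
// Element names other than <head> are assumptions, not a fixed format.
final class SuperDocument {

    // Builds the super-document as XML text.
    // Values are assumed to be already XML-escaped; this sketch does no escaping.
    static String wrap(String key, String processState, List<String> rationale,
                       List<String> anomalies, List<String> audit, String aggregateXml) {
        StringBuilder sb = new StringBuilder();
        sb.append("<document>\n");
        sb.append("  <head>\n");
        sb.append("    <key>").append(key).append("</key>\n");
        // Process information, e.g. new, under construction, accepted.
        sb.append("    <process>").append(processState).append("</process>\n");
        appendGroup(sb, "rationale", "decision", rationale); // decisions the system made
        appendGroup(sb, "anomalies", "error", anomalies);    // errors found in the aggregate
        appendGroup(sb, "audit", "entry", audit);            // who did what, when
        sb.append("  </head>\n");
        sb.append("  <body>\n");
        sb.append(aggregateXml).append("\n");
        sb.append("  </body>\n");
        sb.append("</document>\n");
        return sb.toString();
    }

    // Writes one group element with one child element per value.
    private static void appendGroup(StringBuilder sb, String group, String item, List<String> values) {
        sb.append("    <").append(group).append(">\n");
        for (String value : values) {
            sb.append("      <").append(item).append(">").append(value)
              .append("</").append(item).append(">\n");
        }
        sb.append("    </").append(group).append(">\n");
    }
}
```

Because the head has the same shape for every document type, generic code can read keys, process state and audit without knowing anything about the aggregate in the body.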
The IO challenge
We succeed in this only if we also manage to make it perform. The main pitfall for any software project is the time and space dimensions. Your model may look great and your code super, but if it does not perform and does not scale, you lose. The document storage model is only successful if you manage to reduce IO, both the number of calls and their size. If you end up transporting too much information, or if you have too many calls (compared to ORM), then the document model may not be optimal. An Enterprise Application may have hundreds of tables, of which probably 30% are many-to-many relations. I have seen applications with more than 4000 tables... Only a genius or a demigod may manage that. Most probably it will just be unstable for the rest of its lifetime (see comment on the Silo). My structure example above is way too simple compared to the real world. But surely for many of these applications there is a granularity that fits the usage scenarios better. I have seen documents with 100,000 nodes getting serialized in less than a second. Don't 20 document types, small and large, seem like a more manageable situation than 200 tables?
In our upcoming proofs of concept we will be investigating these ideas.
Aggregate persistence for the Enterprise Application by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.