Tuesday, June 18, 2013

BIG DATA advisory - Definite Content

I urge you to start taking data quality seriously. Aggregate design (as defined in Domain-Driven Design) and the technology supporting BIG DATA and NoSQL give new possibilities, also for your core business. So be warned: pure, definite business data at your fingertips.

Central to our new architecture here at Tax Norway is the Continual Aggregate Hub (CAH). One important feature of the CAH is keeping legislated versions of business data unchanged and available for “eternity”. Business data is our most important asset, so the content must be definite. We must keep it as proof of procedure for more than 10 years, and its integrity must be protected from the wear and tear of both functional and technical upgrades to the software handling it.


Your current situation
I claim that the relational schema is too volatile to keep the integrity of the business data stored in it over time. The data is too fragmented, and functional enhancements to the schema will make the data deteriorate. Ask yourself: “Will a join today give the same result as it did 5 years ago?” Another major threat to the integrity of the data is the relations and other structures that are added to the schema (DDL) to support reporting or analytical concerns. This makes the definite (explicit) content hard to get to, because the real business data is only vaguely expressed in the relational schema.

The Continual Aggregate Hub
Here, business data is stored as XML documents, one version for every legislated change, and categorized by meta-data. See my talk at QCon 2013 in London, where I present how we organize business data by a classification system not unlike what libraries use for books. Basically, we use the header to describe the content, and the document itself contains an Aggregate. I also show that we can compose complex domains from these Aggregates, and that applications running these domains fit nicely in the deployment model of “the cloud” and in-memory architectures. (See the discussion on software design in the CAH.)
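To make the header-plus-Aggregate idea concrete, here is a minimal sketch in Python. It is not the CAH implementation: the header fields (`party_id`, `category`, `legislation_year`) and the class names are assumptions chosen for illustration, since the real classification system is only described in the talk. The point it tries to show is that a change never overwrites anything; it only adds a new version next to the old one.

```python
from dataclasses import dataclass
from typing import List, Optional
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Header:
    # Hypothetical classification fields, not the real CAH header.
    party_id: str          # who the data is about
    category: str          # what kind of Aggregate the document holds
    legislation_year: int  # which legislated version the content follows
    version: int           # running version number within party/category


@dataclass(frozen=True)
class Document:
    header: Header
    body_xml: str  # the Aggregate, kept exactly as it was received


class DocumentStore:
    """Append-only store: a change adds a new version, old versions stay."""

    def __init__(self) -> None:
        self._docs: List[Document] = []

    def put(self, party_id: str, category: str,
            legislation_year: int, body_xml: str) -> Document:
        # Only check that the XML is well-formed; the content is not interpreted.
        ET.fromstring(body_xml)
        previous = self.find(party_id, category)
        header = Header(party_id, category, legislation_year,
                        version=len(previous) + 1)
        doc = Document(header, body_xml)
        self._docs.append(doc)
        return doc

    def find(self, party_id: str, category: str,
             legislation_year: Optional[int] = None) -> List[Document]:
        # Lookup uses the header only, never the document body.
        return [
            d for d in self._docs
            if d.header.party_id == party_id
            and d.header.category == category
            and (legislation_year is None
                 or d.header.legislation_year == legislation_year)
        ]
```

Storing a 2012 document and then a 2013 document for the same party and category leaves both retrievable, each with its own body, even if the 2013 schema has gained new elements.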

The Implementation
The excellent team that implemented the data-store of the CAH constructed it as two parts: one called BOX and the other IRIS. BOX's sole purpose is to store aggregates (as versioned documents), enforce a common header (meta-data for information classification), support information retrieval (lookup based on meta-data), and provide feeds (ATOM) of these documents to consumers. BOX does not care what is in the document. IRIS's sole purpose is to provide search, reporting, insight and (basic) analytics based on all document content. IRIS utilizes a search engine for this. We use Java, REST, ATOM feeds, XML, and Elasticsearch. We still use the Oracle database, but will migrate to a document database in a couple of years. (See the blog for a discussion on deployment models.)
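The split between the two parts can be sketched roughly as follows, again in Python and purely as an illustration. The real BOX exposes ATOM feeds over REST and the real IRIS sits on a search engine; here the `Box` and `Iris` classes, their method names, and the in-memory word index are all assumptions standing in for that machinery. What the sketch tries to show is the direction of the dependency: `Box` never opens the document body, and `Iris` only reads what the feed hands it.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class Box:
    """Stores documents and serves them as a feed. Never reads the body."""

    def __init__(self) -> None:
        self._entries: List[Tuple[dict, str]] = []

    def store(self, header: dict, body_xml: str) -> int:
        self._entries.append((dict(header), body_xml))
        return len(self._entries) - 1  # position doubles as feed sequence number

    def lookup(self, **criteria) -> List[int]:
        # Retrieval by meta-data only.
        return [seq for seq, (header, _) in enumerate(self._entries)
                if all(header.get(k) == v for k, v in criteria.items())]

    def feed(self, since: int = 0) -> List[Tuple[int, dict, str]]:
        # Stand-in for the ATOM feed: every entry from `since` onwards.
        return [(seq, header, body)
                for seq, (header, body) in enumerate(self._entries)
                if seq >= since]


class Iris:
    """Consumes the feed and indexes document content for search."""

    def __init__(self) -> None:
        self._index: Dict[str, Set[int]] = defaultdict(set)
        self._next_seq = 0

    def consume(self, box: Box) -> None:
        # Only read entries that have not been indexed yet.
        for seq, _header, body in box.feed(self._next_seq):
            for element in ET.fromstring(body).iter():
                for token in re.findall(r"\w+", (element.text or "").lower()):
                    self._index[token].add(seq)
            self._next_seq = seq + 1

    def search(self, term: str) -> List[int]:
        return sorted(self._index.get(term.lower(), set()))
```

A consumer written this way can fall behind, catch up, or be rebuilt from sequence number zero without BOX noticing.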

Separation of Concerns
This is now in production, and we see the great effect of having separated the concerns of information storage and information usage. We can at any time exchange the search engine and use sexy new tools (for example Neo4J, SkyTree, and many others) without touching the schema of the business data or the technology supporting BOX. A true component-based approach to functionality. We can also change the schema of the business data over time without altering the old data, or altering the analytics and search capabilities. The original and definite business content is untouched. The lifetime requirements of our data have never been on such good footing. Also, the performance of these searches is awesome. Expect IRIS to use the same amount of space as is used in BOX.
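As a small illustration of why the search side is replaceable, the Python sketch below builds two different indexes from the same stored documents. Both `WordIndex` and `FieldIndex` are hypothetical stand-ins for "one search tool" and "another search tool"; neither reflects how Elasticsearch or any other product works internally. The second document uses a newer schema with an extra element, and neither index requires the older document to be changed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Stand-in for documents held by BOX; a tuple, so it cannot be modified here.
DOCUMENTS: Tuple[str, ...] = (
    "<income><employer>Acme</employer><salary>100</salary></income>",
    "<income><employer>Globex</employer><salary>120</salary><bonus>5</bonus></income>",
)


def leaf_values(body_xml: str) -> List[Tuple[str, str]]:
    """Return (element name, lowercased text) for every element with text."""
    return [(e.tag, e.text.strip().lower())
            for e in ET.fromstring(body_xml).iter()
            if e.text and e.text.strip()]


class WordIndex:
    """First hypothetical engine: find documents by any value."""

    def __init__(self, documents: Tuple[str, ...]) -> None:
        self._index: Dict[str, Set[int]] = defaultdict(set)
        for position, body in enumerate(documents):
            for _tag, value in leaf_values(body):
                self._index[value].add(position)

    def search(self, query: str) -> List[int]:
        return sorted(self._index.get(query.lower(), set()))


class FieldIndex:
    """Second hypothetical engine: find documents by 'element=value'."""

    def __init__(self, documents: Tuple[str, ...]) -> None:
        self._index: Dict[str, Set[int]] = defaultdict(set)
        for position, body in enumerate(documents):
            for tag, value in leaf_values(body):
                self._index[f"{tag}={value}"].add(position)

    def search(self, query: str) -> List[int]:
        return sorted(self._index.get(query.lower(), set()))
```

Swapping `WordIndex` for `FieldIndex` is a rebuild from the stored documents, not a migration of them, which is the property the separation is meant to buy.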

Insight into our business data has never been better. BIG DATA and NoSQL tools are giving us a fantastic opportunity. You should consider it.

Creative Commons License
BIG DATA advisory - Definite Content by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.