Sunday, September 15, 2013

Cloud prototyping and Kolb's Learning Cycle


There is a lot of attention on Cloud platforms (IaaS, PaaS and SaaS), but few present how they will actually move to one. This blog post is about prototyping core business on such a platform (e.g. Amazon Web Services' Elastic Beanstalk). It is also about utilizing Kolb's Learning Cycle in organizational development. An even harder problem, until you realize that your organization has to experience a common vision.

I have earlier presented results from our Proof of Concept (Tax Norway's PoC Results). That prototype was about designing software for the "cloud" (in-memory, Big Data) to make it easier to maintain, cost less and scale linearly. (I gave a talk about this at QCon London 2013.)
This time it's about how we can simplify our business and still make sure we comply with legislation. It's about using Domain Driven Design to establish the ubiquitous language and the aggregates. It's about engaging business and stakeholders.
Innovation emerges from the collaboration between disciplines.
This is what Enterprise Architecture is all about: improving business through IT.
What business?
We need to clean up a very complex set of forms representing mandatory reporting from the population to the Tax Administration. Our focus is simplification for small businesses, addressing 13 forms (the total is 50 forms, but larger businesses and other tax domains are not addressed now). The main strategy for this simplification is to collect timely facts through a responsive and "smart" wizard made available to the public. There was already a business project of 20-30 people working on this. They had a concept ready, but were unsure how to proceed. And I had blueprints of our future systems.
Is the Public Sector able to innovate?
Yes We Can!
Why Prototype?
Kolb's learning cycle
Let's experience something together! Ever noticed how people understand the same text or the same specification in different ways? That is one of the major challenges in Computer Science: expectations and requirements are most often not aligned. We used Kolb's Learning Cycle to achieve a collective understanding of our vision in this area, combining the individual skills of our organization. With this support we managed to combine deep knowledge to create something new and highly useful.

We prototyped to make this cycle come alive, making very different disciplines come together over a running application: people from business (lawyers, economists and accountants), user interaction (UI designers), IT, planning and sponsors. We have run demos every 14 days (Active Experimentation and Concrete Experience) and discussed calculations, concepts and terms (which will be part of the new and improved ubiquitous business language). After each demo the participants used their Reflective Observation and Abstract Conceptualization to write new requirements to the backlog. This is a "silver bullet" for a team fighting a complex domain. No Excel sheet, presentation or report would have made this possible.

In technical terms it is more than just a prototype. It is a full-blown information model, an implementation, and a full-fledged stack (although without a database). And we know it will perform at massive scale. We have now defined the Java classes and the XML for the document store (Continual Aggregate Store).
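As a hint of what those Java classes look like, here is a minimal sketch of an aggregate. The class and field names are hypothetical, and the real model is far richer; JAXB annotations are one way to map such an aggregate to the XML kept in the document store:

    import java.math.BigDecimal;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.bind.annotation.XmlAccessType;
    import javax.xml.bind.annotation.XmlAccessorType;
    import javax.xml.bind.annotation.XmlRootElement;

    // Hypothetical aggregate: the income facts a small business reports.
    @XmlRootElement(name = "incomeStatement")
    @XmlAccessorType(XmlAccessType.FIELD)
    public class IncomeStatement {

        private List<BigDecimal> incomeLines = new ArrayList<BigDecimal>();

        public void addIncomeLine(BigDecimal amount) {
            incomeLines.add(amount);
        }

        // Business logic lives inside the aggregate: plain, testable Java.
        public BigDecimal totalIncome() {
            BigDecimal sum = BigDecimal.ZERO;
            for (BigDecimal line : incomeLines) {
                sum = sum.add(line);
            }
            return sum;
        }
    }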
But for business it is a prototype, as we have not covered all areas or all details, but we have shown how the majority of complex issues must be tackled. The prototype will be used as a study within new areas. Also remember that for business, IT is just one piece; there are also timing, financing, customer scoping, migration, integration, processes and organisational issues.

The beautiful prototype
The domain has 13 forms with some 1500+ fields. The solution now consists of 6-7 aggregates and some 600+ fields altogether, but any user will only fill in a subset of these.
(Screenshot: navigation at left, income facts at right)
We have utilised the latest in responsive design to give as much as possible on a web platform. Everything is saved as you work, and "backend" business logic is run for calculations and validations. Logic is not duplicated; the user is working against the same backend as our own case handlers. (There are no forms anymore, as the user works synchronously with the backend. Old legacy systems need an asynchronous front end, and get a costly duplication of the business logic.)
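To make the "no duplication" point concrete, here is a sketch (the service and the rate are made up for illustration): a single calculation service that both the public wizard's REST layer and internal case handling call, so the logic exists exactly once:

    import java.math.BigDecimal;

    // Hypothetical single home for a calculation. The wizard's REST layer
    // and the case handlers' applications both call this same class; the
    // 28% rate is only an illustration.
    public class CalculationService {

        public BigDecimal calculateTax(IncomeStatement statement) {
            return statement.totalIncome().multiply(new BigDecimal("0.28"));
        }
    }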

  • Previously there was a "back-and-forth" work process through 11 hard-to-comprehend steps; now the structure is sequential through 5 simple steps.
  • Previously the user had to do a lot of manual calculations and book-keeping; now the prototype collects facts and calculates for them.
  • Previously the user had to know which forms to use; now the prototype guides the user through it.
(Screenshot: Expert in your hands)

Making the impossible possible
At the end there is an adjustment GUI with sliders that solves what was previously hard to comprehend; only experts with many years of experience could do this. A goal for us is that ordinary people can report their own tax form in an optimal way. We have done user testing and got very positive feedback. It is much easier for a user to report on real-world things (assets, income, costs, debt) and let the application do the calculations.
Now that we see the result, it looks so simple, sublime and beautiful.

The key artefact is the new set of fully unit-testable Java classes (information, logic and aggregates) of the core domain, ready to be deployed on some PaaS. This core is now much simpler, simplifying all aspects of systems development and integration. This is how we increase the ability to change and decrease maintenance cost.
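A minimal JUnit test against the hypothetical aggregate sketched above shows why this matters: no container, no database, no mocks, just domain logic:

    import static org.junit.Assert.assertEquals;

    import java.math.BigDecimal;
    import org.junit.Test;

    public class IncomeStatementTest {

        @Test
        public void sumsAllIncomeLines() {
            IncomeStatement statement = new IncomeStatement();
            statement.addIncomeLine(new BigDecimal("1000.00"));
            statement.addIncomeLine(new BigDecimal("250.50"));

            // Pure in-memory domain logic: fast, deterministic, no setup.
            assertEquals(new BigDecimal("1250.50"), statement.totalIncome());
        }
    }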

The platform
We are using Amazon Web Services (AWS) and have deployed the prototype in Apache and PHP Elastic Beanstalk containers. This prototype continues where the last prototype stopped, and porting that code has proven to be simple. We are also using plain Java and Hazelcast on the backend. The backend contains all business logic and information. There is very little duplication of code, and the backend is used all the time as the user works his way through the wizard.
The front end uses HTML5/CSS3, JavaScript, JSON and REST.
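As a sketch of how these pieces could fit together (the resource path and map name are made up), a JAX-RS resource can put the user's facts into a Hazelcast map on every change, which is one way "everything is saved as you work" can be realized:

    import java.util.Map;
    import javax.ws.rs.Consumes;
    import javax.ws.rs.GET;
    import javax.ws.rs.PUT;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    @Path("/wizard/{userId}/income")
    public class IncomeResource {

        // Sketch only: in a real deployment the HazelcastInstance is created
        // once and injected, not built per class. The stored aggregate would
        // also need to be Serializable.
        private static final HazelcastInstance HZ = Hazelcast.newHazelcastInstance();
        private final Map<String, IncomeStatement> drafts = HZ.getMap("income-drafts");

        @PUT
        @Consumes(MediaType.APPLICATION_JSON)
        public void save(@PathParam("userId") String userId, IncomeStatement statement) {
            drafts.put(userId, statement); // saved every time the user changes a field
        }

        @GET
        @Produces(MediaType.APPLICATION_JSON)
        public IncomeStatement load(@PathParam("userId") String userId) {
            return drafts.get(userId);
        }
    }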
AWS has been really simple (time-saving) and cheap; the bill is about $100 for the test environments after 5 months! :) We have also proven (again) that if you do your design right, you can quite easily move to another "cloud" platform (see Target architecture looking good).
Testing is now a dream compared to before, mainly because the aggregates are great units to test, and because business provides calculations as spreadsheets, which are subsequently used as tests.
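These tests can be as simple as a parameterized JUnit test over rows exported from the business's spreadsheet (the file name, column layout and CalculationService are made up for this sketch):

    import static org.junit.Assert.assertEquals;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.math.BigDecimal;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.junit.runners.Parameterized;
    import org.junit.runners.Parameterized.Parameters;

    @RunWith(Parameterized.class)
    public class SpreadsheetCalculationTest {

        private final BigDecimal income;
        private final BigDecimal expectedTax;

        public SpreadsheetCalculationTest(BigDecimal income, BigDecimal expectedTax) {
            this.income = income;
            this.expectedTax = expectedTax;
        }

        // Each CSV row ("income;expectedTax"), exported from the business
        // spreadsheet, becomes one test case.
        @Parameters
        public static Collection<Object[]> rows() throws Exception {
            List<Object[]> rows = new ArrayList<Object[]>();
            BufferedReader in = new BufferedReader(new FileReader("calculations.csv"));
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(";");
                rows.add(new Object[] { new BigDecimal(cols[0]), new BigDecimal(cols[1]) });
            }
            in.close();
            return rows;
        }

        @Test
        public void matchesSpreadsheet() {
            IncomeStatement statement = new IncomeStatement();
            statement.addIncomeLine(income);
            BigDecimal actual = new CalculationService().calculateTax(statement);
            assertEquals(0, expectedTax.compareTo(actual)); // ignore scale differences
        }
    }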
(We are deploying our private cloud, and the production stack will be different.)

The experience
We teamed up in late March 2013 with a team of 5 from Inmeta.no. They were experienced in Java, Infinispan, GUI design and web development, but had no experience with tax administration or accounting. The business had a concept and a plan: start with simple cases and add complexity demo by demo. And I had a rough design for the aggregates replacing the forms. We finished in late August. By then we had covered a lot more than we anticipated, and had also worked out new concepts (which would have been impossible to foresee without the prototype). What we learned:
  • The theory and practice of Kolb's Learning Cycle are helpful in Computer Science.
  • Prototyping is a silver bullet in many respects.
  • Use the prototype on all other parts of your organization as well.
  • Our modernised business can run with many tens of thousands of concurrent users.
  • EA perspective and organizational development: business is engaged, drives change, and stakeholders are behind the modernisation.
  • Our business processes can be run in new ways, e.g. being much more efficient or providing transparency for the public.
  • A prototype will result in a mutual understanding of the information model and business logic.
  • Do not implement paper forms as XML schemas, but re-structure them into aggregates.
  • Your legacy systems should be moved to a "cloud" platform by using Domain Driven Design, Java, and the systems design approach I talked about at QCon London 2013.
  • If you understand HTML5/CSS3, JavaScript, JSON and REST, it is not that important which framework you use on the client side.
  • Java can be really verbose; you don't need a rule engine.
  • Aggregate design (from Domain Driven Design) really rocks.
  • PaaS really saves time and cost.
This shows the innovation power of small multi-disciplinary teams with the right competence and ambition.

Creative Commons License
Cloud prototyping by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Tuesday, June 18, 2013

BIG DATA advisory - Definite Content

I urge you to start taking data quality seriously. Aggregate design (as defined in Domain Driven Design) and the technology supporting BIG DATA and NoSQL give new possibilities, also for your core business. So be warned: pure, definite business data at your fingertips.

Central to our new architecture here at Tax Norway is the Continual Aggregate Hub. One important feature of the CAH is keeping legislated versions of business data unchanged and available for "eternity". Business data is our most important asset, and its content must be definite. We must keep it as proof of procedure for more than 10 years, and its integrity must be protected from wear and tear from both functional and technical upgrades in the software handling it.


Your current situation
I claim that the relational schema is too volatile to keep the integrity of the business data stored in it over time. The data is too fragmented, and functional enhancements to the schema will make the data deteriorate over time. Will a join today give the same result as it did 5 years ago? Another major threat against the integrity of the data is the relations and other structures that are added to the schema (DDL) to support reporting or analytical concerns. All of this makes the definite (explicit) content hard to get to, because the real business data is vague in the relational schema.

The Continual Aggregate Hub
Here, business data is stored as XML documents, one version for every legislated change, categorized by meta-data. See my talk at QCon London 2013, where I present how we organize business data with a classification system not unlike what libraries use for books. Basically we use the header to describe the content, and the document itself contains an aggregate. I also show that we can compose complex domains from these aggregates, and that applications running these domains fit nicely into the deployment model of "the cloud" and in-memory architectures. (See the discussion on software design in the CAH.)

The Implementation
The excellent team that implemented the data store of the CAH constructed it as two parts, one called BOX and the other IRIS. BOX's sole purpose is to store aggregates (as versioned documents), enforce a common header (meta-data for information classification), provide information retrieval (lookup based on meta-data), and provide feeds (ATOM) of these documents to consumers. BOX does not care what is in the document. IRIS' sole purpose is to provide search, reporting, insight and (basic) analytics based on all document content; IRIS utilizes a search engine for this. We use Java, REST, ATOM feeds, XML, and Elasticsearch. We still use the Oracle database, but will migrate to a document database in a couple of years. (See the blog for a discussion on deployment models.)
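To illustrate the division of labour, here is a sketch of the storage contract (the names are illustrative, not BOX's actual API). BOX only knows about the header and an opaque XML payload; IRIS consumes the ATOM feed and indexes the content:

    import java.util.List;

    // Sketch of a BOX-like contract: store and retrieve immutable,
    // versioned documents, never interpret the payload.
    public interface DocumentStore {

        /** Store a new, immutable version. The payload is never parsed here. */
        void store(String key, String type, String legitimatePeriod, String aggregateXml);

        /** Lookup is by header meta-data only. */
        List<String> findByKeyAndType(String key, String type);

        /** Consumers (IRIS among them) follow an ATOM feed of new documents. */
        String feedUrl();
    }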

Separation of Concern
This is now in production and we see the great effect of having separated the concerns of information storage and information usage. We can at any time exchange the search engine and use sexy new tools (for example Neo4j, SkyTree, and many others) without touching the schema of the business data or the technology supporting BOX. A truly component-based approach to functionality. We can also change the schema of the business data over time without altering the old data, or the analytics and search capabilities. The original and definite business content is untouched. The lifetime requirements of our data have never had such good standing. The performance of these searches is also awesome. Expect to use about the same amount of space for IRIS as the space used in BOX.

Insight into our business data has never been better. BIG DATA and NoSQL tools are giving us a fantastic opportunity. You should consider it.

Creative Commons License
BIG DATA advisory - Definite Content by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Thursday, January 10, 2013

Target architecture looking good

It has now been 3 years since the target architecture, and more specifically the CAH (Continual Aggregate Hub), was established. The main goal is a processing architecture where ease of maintenance, robustness and scale are combined. We want it highly verbose, but at the same time flexible. In the meantime we have started our modernization program; at present we have 50-70 people on projects, and many more supporting on different levels. We have gathered more detailed insight (we ran a successful PoC) and the IT landscape has matured. Right now we are live with our first in-memory application; it collects bank account information for all persons and businesses in Norway. I am quite content that our presumptions and design are also seeing real traction in the IT landscape. Domain Driven Design, event-driven federated systems, Eventual Consistency, CQRS, ODS, HTTP, REST, HTML/CSS/JS, Java (container standardization), and XML for long-term data are still good choices.
My talk at QCon London 2013 is about how this highly complex domain is designed.

It is time for a little retrospective.

The Continual Aggregate Hub contains a repository (data storage) consisting of immutable documents containing aggregates. The repository is aggregate-agnostic: it does not impose any schema on the documents; it is up to the producer and consumers to understand the content (and for them it must be verbose). The only thing the repository mandates is a common header for all documents. The header contains the key(s), type, legitimate period and a protocol. Also part of the CAH is a processing layer. This is where the business logic lives, and all components here reside in-memory and are transient. Valid state only exists in the repository; all processing is eventually consistent and must be idempotent. Components are fed by queues of documents, the aggregates in the documents are composed into a business model (things are very verbose here), and new documents are produced (and put into the repository). Furthermore, all usage of information is retrieved from the repository.
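A sketch of the mandated header and of the idempotency rule could look like this (class and method names are mine, for illustration only):

    import java.util.List;

    // The only thing the repository mandates: a common header.
    public class DocumentHeader {
        private List<String> keys;       // identifies the subject(s)
        private String type;             // what kind of aggregate the document holds
        private String legitimatePeriod; // the period the data is legislated for
        private String protocol;         // how the document came to be
        // getters and setters elided in this sketch
    }

    // Processing components are transient and must be idempotent: handling
    // the same document twice must give the same end result.
    public abstract class DocumentProcessor {

        public final void onDocument(DocumentHeader header, String aggregateXml) {
            if (alreadyProcessed(header)) {
                return; // a redelivered document is simply skipped
            }
            process(header, aggregateXml); // compose aggregates, produce new documents
        }

        protected abstract boolean alreadyProcessed(DocumentHeader header);

        protected abstract void process(DocumentHeader header, String aggregateXml);
    }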

Realization
With this way of structuring our systems, we can utilize in-memory or BIG-data architectures. The key to utilizing these lies in understanding how your business domain may be implemented (module and aggregate design). The IT landscape within NoSQL is quickly expanding, in-memory products are pretty mature, PaaS looks more promising than ever, and BIG-data certainly has a few well-proven candidates. I will not go into detail on these, but use some as examples of how we can utilize them.
This is in no way an exhaustive list. Products in this blog are used as examples of different implementations or deployments of the CAH. This is not a product recommendation, nor does it represent what Skatteetaten might acquire.
NoSQL: It's all about storing data as it is best structured. Data is our most valuable asset. This brings out the best of algorithms and data structures (as you were taught in school). For us a document store is feasible, also because legislation defines formal information sets that should last for a long time. Example candidates in this domain are: CouchDB because of its document handling, Datomic because of immutability and the timeline, or maybe MarkLogic because of its XML support.

Scalable processing, where many candidates are possible. It depends on what is important.
In-memory: I would divide these into "Processing Grid" and "Data Grid": either you have the data inside the processing Java VM, or you have the data outside the Java VM (but on the same machine).
PaaS: An example is Heroku, because I think the maturity of the container is important. The container is where you put your business logic (our second most valuable asset), and we want it to run for a long time (10 years +). Maybe development and test should run at Heroku, while we run the production environment at our own site. Developers could BYOD. Anyway, Heroku is important because it tells a lot about how we should build systems that have the properties we are discussing here. And the CAH could be implemented there (I will talk about that at SW2013).
BIG-data: We will handle large amounts of data live from “society”. Our current data storage can’t cope with the amounts of data that will be pouring in. This may be solved with Hadoop and its “flock” of supporting systems.

Deployment models
The deployment models reflect our dilemma of balancing "Total Cost of Ownership": ease of maintenance, robustness, ability to scale, and cost.
In-memory – Processing Grid (~GemFire)
  • Pro. Very low latency. Elastic (scale and re-balance).
  • Con. Cost (the Open Source alternatives are not stable enough). Heap limitation leads to many VMs. Business code and data are close together, which leads to deployment issues.
In-memory – Data Grid (~Terracotta)
  • Pro. Elastic (scale and re-balance). Number of VMs determined solely by the processing modules. Business code and data are separate, a better deployment situation. Low latency (serialisation, but on the same machine).
  • Con. Cost.
Distributed database – Big Data (~Hadoop)
  • Pro. Super simple VM (Jetty) that only handles local data. Cost (Open Source and stable). Business code and data are separate, a better deployment situation. Number of VMs determined solely by the processing modules.
  • Con. Slow elasticity (scale and re-balance). Disk-to-disk. Latency (map-reduce).

Conclusion
Our application and systems overhaul seems to fit many scalable deployment models, and that is good. Lifetime requirements are strict, and we need flexible sourcing.
We are doing Processing Grid now (using Hazelcast), but will acquire some "in-memory" product during 2013 (either Processing or Data Grid). Oracle is the document database: extremely simple, just a table with the header as relational columns and the aggregate as a CLOB. The database is "aggregate-agnostic".
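That table is simple enough that a sketch fits in a few lines of JDBC (table and column names are illustrative):

    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Sketch: one row per immutable document version. The header fields are
    // relational columns; the aggregate itself is an opaque CLOB.
    public class OracleDocumentStore {

        private final Connection connection;

        public OracleDocumentStore(Connection connection) {
            this.connection = connection;
        }

        public void store(String key, String type, String legitimatePeriod,
                          String aggregateXml) throws SQLException {
            PreparedStatement stmt = connection.prepareStatement(
                    "INSERT INTO document (doc_key, doc_type, legitimate_period, aggregate) "
                    + "VALUES (?, ?, ?, ?)");
            stmt.setString(1, key);
            stmt.setString(2, type);
            stmt.setString(3, legitimatePeriod);
            // Stream the XML into the CLOB column.
            stmt.setCharacterStream(4, new StringReader(aggregateXml), aggregateXml.length());
            stmt.executeUpdate();
            stmt.close();
        }
    }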
Somewhere around 2015, when the large volumes really show up, you will probably see us with something like Hadoop, perhaps in addition to the ones mentioned above. Since sub-second latency is OK, and we will have a long tail of historic data, maybe just Hadoop? Who knows?

We are navigating into a system landscape where we may choose deployment models in a more tactical cost/benefit evaluation. We are not stuck with a database or a single machine anymore.
Creative Commons License
Target architecture looking good by Tormod Varhaugvik is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.