multiple architectures - many ghosts in the machine

I got an interesting comment on my post about a persistent data grid... that the idea is interesting when considered in the context of an ESB (I assume that particular TLA stands for Enterprise Service Bus).  I don't know if the person leaving the comment meant that the two are essentially the same, or just complementary.  If he thought that I meant the same thing, then I failed to be clear.

The thing about the ESB is that it places the messages "into the cloud."  The persistent data grid places cached data "into the cloud."  Different, but complementary.

When I was describing this idea to two other architects the other day, one asked "what happens on update?  Does the cache update from the message?"  The answer is no.  The message may intend for data to be updated.  It may even command that data be updated, but until the data is actually updated in the source system, it has no place in the cache.
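
To make that concrete, here is a rough sketch in Python of that rule.  The class, field, and message names are invented for illustration, not any particular product's API: the message handler asks the source system to make the change, and only the committed result ever lands in the cache.

```python
class SourceSystem:
    """Stand-in for the system of record that actually owns the data."""
    def __init__(self):
        self.rows = {42: {"customer_id": 42, "name": "Acme Ltd"}}

    def apply_update(self, customer_id, changes):
        # The source system validates and commits the change...
        self.rows[customer_id].update(changes)
        return self.rows[customer_id]   # ...and hands back the committed row


class CustomerCache:
    """Stand-in for the distributed cache; it only ever holds committed data."""
    def __init__(self):
        self.entries = {}

    def refresh(self, row):
        self.entries[row["customer_id"]] = row


def handle_update_message(message, source, cache):
    # The message expresses intent; it never writes to the cache directly.
    committed = source.apply_update(message["customer_id"], message["changes"])
    cache.refresh(committed)


source, cache = SourceSystem(), CustomerCache()
handle_update_message({"customer_id": 42, "changes": {"name": "Acme Corp"}},
                      source, cache)
print(cache.entries[42])   # {'customer_id': 42, 'name': 'Acme Corp'}
```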

In a very real sense, while the data grid may leverage an ESB as a portion of its architecture, it is separate from it.  The distributed data, which should allow very fast access even at great distances, is not a message.  Intelligent and seamless routing and distribution is essential, but it does not deliver large datasets at great distances.

While I cannot know for certain whether my idea would, I can tell you that an ESB, in and of itself, does not.  So, in this situation, at least two architectures are needed.

Add to that the need for business intelligence.  In a BI world, the data needs to be delivered "as of a particular date" in order to be useful to the recipient analytic systems.  That 'date relevance' is needed to get proper roll-ups of the data and create a truly valid snapshot of the business.

For example, if you have one system recording inventory levels, another recording shipments in transit, and another showing sales in stores, you need to know that your data in the analytics represents "truth" as of a particular time (say Midnight GMT).  Otherwise, you may end up counting an item in inventory at 1am, in a shipment at 7am, and sold by 10am.  Count it thrice... go ahead.  Hope you don't value your job, or your company's future. 

That requires data pulls that represent the data as of a particular time, even if the pull happens a considerable time later.  For example, we may only be able to get our data from the inventory system at midnight local time, let's say Pacific Standard Time, when the server is not too busy.  That's eight hours off of GMT.  The query still has to pull the data as of midnight GMT.
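
Here is a quick Python sketch of that cutoff arithmetic, ignoring daylight saving for brevity: the job kicks off at midnight Pacific, but the boundary it filters on is midnight GMT.

```python
from datetime import datetime, timezone, timedelta

PACIFIC = timezone(timedelta(hours=-8))   # PST is eight hours behind GMT

def snapshot_cutoff_gmt(run_time_local: datetime) -> datetime:
    """Given the local run time, return the midnight-GMT instant the pull must honor."""
    run_time_gmt = run_time_local.astimezone(timezone.utc)
    return run_time_gmt.replace(hour=0, minute=0, second=0, microsecond=0)

# The extract kicks off at midnight Pacific, which is 08:00 GMT...
run_time = datetime(2006, 3, 1, 0, 0, tzinfo=PACIFIC)
print(snapshot_cutoff_gmt(run_time))   # 2006-03-01 00:00:00+00:00
# ...so the query filters on rows recorded at or before that midnight-GMT cutoff,
# not on "whatever the source holds right now".
```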

This type of query is not well suited to a data-grid-style cache, and while the message can travel through the ESB, the actual movement of the data is probably best handled by an ETL (Extract, Transform, Load) process using an advanced tool like SQL Server Integration Services (the replacement for SQL DTS).
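
Not SSIS, obviously, but here is a bare-bones Python sketch of the extract-transform-load shape such a package implements, with the snapshot cutoff passed in as a parameter.  The table and column names are invented, and sqlite3 stands in for the real source and warehouse.

```python
import sqlite3

def extract(source_conn, cutoff):
    """Pull only rows recorded at or before the snapshot cutoff."""
    return source_conn.execute(
        "SELECT item_id, quantity FROM inventory_transactions WHERE recorded_at <= ?",
        (cutoff,),
    ).fetchall()

def transform(rows):
    """Roll the detail up to the grain the analytic model expects."""
    totals = {}
    for item_id, quantity in rows:
        totals[item_id] = totals.get(item_id, 0) + quantity
    return totals

def load(warehouse_conn, cutoff, totals):
    """Write the snapshot into the warehouse, tagged with its 'as of' time."""
    warehouse_conn.executemany(
        "INSERT INTO inventory_snapshot (as_of, item_id, on_hand) VALUES (?, ?, ?)",
        [(cutoff, item_id, on_hand) for item_id, on_hand in totals.items()],
    )
    warehouse_conn.commit()

# Toy source and warehouse, in memory, just to show the flow end to end.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE inventory_transactions (item_id, quantity, recorded_at)")
source.executemany("INSERT INTO inventory_transactions VALUES (?, ?, ?)", [
    ("widget", 10, "2006-02-28T22:00:00Z"),
    ("widget", -3, "2006-03-01T07:00:00Z"),   # after midnight GMT: not in this snapshot
])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE inventory_snapshot (as_of, item_id, on_hand)")

cutoff = "2006-03-01T00:00:00Z"
load(warehouse, cutoff, transform(extract(source, cutoff)))
print(warehouse.execute("SELECT * FROM inventory_snapshot").fetchall())
# [('2006-03-01T00:00:00Z', 'widget', 10)]
```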

Alas, in our data architecture, I've described no fewer than three different data movement mechanisms.  Yet I still have not mentioned the local creation of mastered data.  If the enterprise architecture indicates that a centralized CRM system is the actual 'master' system for customer data, then the CRM will use local data access to read and write that data.  That is a fourth architecture.

OK... so where do reports get their data?  That's a fun one.  Do they pull directly from the source system?  If so, that's a direct connection.  What if the source system is 10,000 miles away?  Can we configure the cache system to automatically refresh a set of datasets for the timely pull of operational reporting data?  That would be a variation on my persistent data cache: the pre-scheduled data cache refresh.  It would require a data store separate from the active cache itself.  This amounts to data architecture number five.
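
For what it's worth, here is a rough sketch of what I mean by a pre-scheduled refresh: a scheduler re-pulls a named set of reporting datasets on a timetable into a store kept separate from the live cache.  The names and the thread-based scheduling are illustrative assumptions, not a specific product.

```python
import threading, time

class ReportingCache:
    """A store for operational-reporting datasets, kept separate from the live cache."""
    def __init__(self):
        self.datasets = {}   # dataset name -> (refreshed_at, rows)

    def refresh(self, name, puller):
        self.datasets[name] = (time.time(), puller())

def schedule_refresh(cache, name, puller, interval_seconds):
    """Refresh the dataset now, then again on every interval.  A real system would
    use a proper scheduler; a daemon thread keeps the sketch self-contained."""
    def run():
        while True:
            cache.refresh(name, puller)
            time.sleep(interval_seconds)
    threading.Thread(target=run, daemon=True).start()

# Example: keep an "open orders" extract warm for the regional operational reports.
cache = ReportingCache()
schedule_refresh(cache, "open_orders",
                 lambda: [("order-1", "open"), ("order-2", "open")],
                 interval_seconds=15 * 60)
time.sleep(0.1)   # give the first refresh a moment to land
print(cache.datasets["open_orders"])
```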

Recap... how many data architectures do we need, all running at once?

  • Message-based data movement
  • Cached data 'in the cloud'
  • Business Intelligence data through large ETL loads
  • Direct data connections for locally mastered data
  • Pre-scheduled data cache refresh for operational reporting

That's a lot.  But not unreasonably so.  Heck, my Toyota Prius has a bunch of different electric motors in it, in addition to the one engaged in the powertrain.  Sophisticated systems are complex.  That is their nature.  

So when I go off on 'simplification' as a way to reduce costs, I'm not talking about an overly simplistic infrastructure.  I'm talking about reducing unneeded redundancy, not useful sophistication.  It is just fine to have more than one way to move data.