Hadoop for a Data-Driven Organisation

Using Big Data: Interoperability Issues

Deriving meaning quickly from huge quantities of data has been an objective of enterprises since the inception of databases.  Data visualisation makes large and complex data sets understandable and enables domain experts to spot underlying trends and patterns. Companies want signs and patterns in their data not only to warn them of hazards such as fraud or production delays, but also to provide insight into opportunities such as the treatment of diseases or new logistics approaches.  Traditionally, business intelligence relied on static reports, created from an underlying data warehouse comprising conspicuously formatted data in specialised data tables. Now, with a proper visualisation tool on top of big data[1], the business analytics process can be shortcut: stakeholders model their data requirements directly in the visualisation tool to see relationships and spot patterns that analytics professionals might be missing (Linthicum, 2013).

Hadoop[2] has become the preferred platform for processing large quantities of data[3].  These large quantities may also refer to structured data that cannot be analysed in a reasonable length of time.  The better the information is, and the more routinely it is put in the hands of those who need it, the greater the intelligence derived from it.  Despite this, Hadoop is a batch-oriented[4] system and cannot (as yet) answer simple queries with a good turnaround time (Hortonworks, 2013).  Yet Hadoop handles data that traditional relational databases, data warehouses, and other analytic platforms have been unable to manage effectively, including user-generated data from social media and machine-generated data from sensors, appliances, and applications.  That is, all forms of data.  Hadoop does this through a distributed file system, the Hadoop Distributed File System (HDFS)[5], which provides portability, scalability (over commodity hardware[6]) and reliability, and through MapReduce processing for parallel distributed systems (Srivastava, 2011).  “Once data exceeds the capabilities of a single computer, we face the problem of distributing its storage and processing.” (Downey, 2011).

Big Data – Hortonworks

The Open Group articulates the need for Boundaryless Information Flow in its Integrated Information Infrastructure Reference Model (III-RM).  Boundaryless Information Flow is essentially about getting information to the right people at the right time, in a secure, reliable manner, in order to support the operations that are core to the extended enterprise.  The Boundaryless Organisation does not imply that there are no boundaries in the flow of information, only that boundaries should be made permeable or penetrable.  The information systems in place today, however, do not allow information to flow in support of the boundaryless organisation.  The Open Group published the Interoperable[7] Enterprise Business Scenario in 2001, which crystallises this need for boundaryless information flow and describes how it drives information customers’ deployment of their information systems.  A key driver is to “create a worldwide market for interoperable information products supporting access to integrated information, in which all stakeholders’ needs are addressed.” (The Open Group, 2002).

The absence of the right information for the right person at the right time is equally a lack of interoperability, preventing organisations from achieving their business objectives.  Information as an asset, in principle, passes the point of departure “that information technology merely supports or enables the business – increasingly, information is the business.” (The Open Group, 2002).  To address the lack of interoperability between systems that provide access to data is to attend to the boundaryless information flow.  Technology infrastructure facilitates interoperability either through commercial off-the-shelf (COTS) appliances or through a collection of best-of-breed integration[8] components that reduce interoperability issues.

The Open Group (The Open Group, 2002) identifies high-level technical requirements for a boundaryless information flow architecture:

  1. Openness – standards-based open interfaces need to exist for critical interfaces to enable interoperability more easily and support integration of components, in the areas of:
    1. Protocols
    2. Web portals
    3. Collaboration
    4. Database
    5. Storage connections
  2. Data Integrity – the same data stored in multiple systems must be conformed to mean the same thing throughout
  3. Availability – must be guaranteed, and where maintenance requires downtime, a fall-back must be available
  4. Security – support different security levels and protection based on the sensitivity and use of the data, as well as the normal requirements for data integrity, confidentiality, logging and tracking.  Must provide flexible policy management and strong authentication, providing users with single sign-on and user administration.
  5. Accessibility – solutions must provide global accessibility to the information without compromising security
  6. Manageability – solutions must be manageable
  7. Internationalisation and Localisation – accommodate multiple languages and adapt to culture where deployed internationally
Technology Architecture III-RM – The Open Group

Standardisation of an organisation’s information technology, to achieve a boundaryless information flow, must be measurable in terms of business value, improving business operations, effectiveness, and the Information Technology capability of the organisation, for example to:

  • Drive revenue growth
  • Lower Information Technology expenditure
  • Increase % of procurements against standards
  • Reduce spend on customisations
  • Improve cycle time for rolling out upgrades

In the Business Intelligence sphere, a few key trends are identified with the emerging use of big data (Linthicum, 2013):

  • The ability to leverage both structured and unstructured data, and visualise that data
  • Ability to apply structure at the time of analysis, decoupling the analysis from the underlying structure of the data (be it structured or unstructured)
  • Leverage current or near-real-time data for critical information
  • Allowing BI analysts to mash up data from outside the organisation (such as the cloud) to enhance the analysis process
  • Ability to bind data analytics to business processes and applications to allow issue resolution without human intervention, known as embedded analytics
Data and Hadoop Architecture

An organisation should seek to support the idea of boundaryless information flow in any strategy it evaluates for information integration.  Hadoop, a framework for Big Data that provisions results over large heterogeneous data sources to analytics tools or data stores, is therefore a technology to consider in support of the boundaryless information flow.

Business Intelligence and Big Data

Big data analytics has a potential that is at odds with the traditional Business Intelligence approach.  To force the existing paradigm onto big data analytics is to miss the full potential of the technology, or to fail altogether (Linthicum, 2013).

In the previous two decades, calculating summaries, sums, and averages was sufficient, but large, complex data require new techniques.  Recognising customer preferences requires the analysis of purchase history, browsing behaviour, products viewed, comments, suggestions, and complaints in order to predict future behaviour (Cloudera, 2013).

Hadoop[9], able to work on Big Data from structured and/or unstructured (heterogeneous) sources, is ideally positioned for provisioning data for Business Intelligence.  Problems such as advertisement targeting, recommendation engines (predicting customer preferences), customer churn (attrition) analysis, risk modelling, point-of-sale transaction analysis, predicting network data failures, threat analysis, and trade surveillance[10] can all be solved with Hadoop (Cloudera, 2013).

The challenge with Big Data is transforming it into intelligent data.  Because Hadoop is primarily a batch-orientated processing system, it is very suitable for data analytics (Payandeh, 2013).  It is the crow-bar (no pun intended) that wrests open the capacious data lake to parse, cleanse, apply structure to and transform the data, arranging it for consumption by analytics applications.

HADOOP Specifically

The most popular spearhead of Big Data, right now, appears to be Hadoop (Kernochan, 2011).  Hadoop is open-source software for reliable, scalable, distributed computing over large data sets across clusters of computers, using simple programming models and designed to scale from a single machine to thousands (Welcome to Apache™ Hadoop®!, 2013).  It is a framework that supports data-intensive distributed applications running in parallel on large clusters of commodity hardware, providing both reliability and data motion to applications.  By using MapReduce[11][12] (breaking the workload into smaller fragments and parallelising the effort over multiple servers) and a distributed file system that stores data on the compute nodes, it provides huge aggregated bandwidth across the cluster, as well as automatic handling of failures by the framework.  “It enables applications to work with thousands of computational-independent computers and petabytes of data.” (WikiPedia, 2013).
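The map–shuffle–reduce flow described above can be illustrated with a toy, single-process word count.  This is a sketch of the programming model only, not Hadoop’s actual API: in a real cluster the map tasks run on separate nodes against HDFS splits and the framework performs the shuffle.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) pairs for every word in an input split
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the grouped values per key
    return {key: sum(values) for key, values in groups.items()}

# Two "input splits", as if stored on two different data nodes
splits = ["big data needs big clusters", "big insight from data"]
intermediate = [pair for split in splits for pair in map_phase(split)]
result = reduce_phase(shuffle(intermediate))
print(result["big"])  # 3
```

Because each map call and each per-key reduction is independent, the same logic parallelises naturally across many machines, which is precisely the property Hadoop exploits.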

Hadoop is also more cost-effective than traditional analytics platforms such as data warehouses, and it moves analytics closer to where the data is.  It may commonly be used for optimising internal data management operations, but it is rapidly gaining ground as a strategic big data analytics platform (Big Data Analytics: Unleashing the Power of Hadoop, 2013).

The Apache Hadoop platform comprises:

  1. Common – common utilities that support the Hadoop modules
  2. MapReduce – YARN-based system for parallel processing of large data sets
  3. Distributed File System (HDFS) – distributed file system for high throughput access to application data
  4. YARN – framework for job scheduling and cluster resource management

Other related projects are:

  • Ambari – web-based tool for provisioning, managing, and monitoring Hadoop clusters, with a dashboard
  • Avro – data serialisation system
  • Cassandra – scalable multi-master database with no single points of failure
  • Chukwa – data collection system for managing distributed systems
  • HBase – scalable, distributed database that supports structured data storage for large tables
  • Hive – a data warehouse infrastructure that provides data summarisation and ad hoc querying
  • Pig – high level data-flow language and execution framework for parallel computation
  • Zookeeper – high-performance coordination service for distributed applications

The Hadoop Distributed File System (HDFS) can store a great many very large files, scaling across commodity nodes and breaking down enterprise data silos[13]; this common pool of data is referred to as a “data lake” (Gualtieri, 2013).  These “data lakes” are queried by MapReduce, a distributed data processing model, in a highly parallel fashion.

Dealing only with files on disk essentially means the content lacks structure, so data maps are applied over the unstructured content to define its core metadata.  For example, Hive provides a mechanism to project structure onto the data, allowing it to be queried with a SQL-like language called HiveQL; Pig offers a high-level data-flow alternative that compiles to MapReduce.  A BI tool thus obtains file metadata from the cluster (serving both structured and unstructured sources) and retrieves the data from the data nodes, using MapReduce to get to the required data in parallel, returning it either to a BI visualisation tool, e.g. Tableau[14], or to other types of processing in specific data structures, e.g. a dimensional model for complex analytical processing (Linthicum, 2013).
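This “schema-on-read” idea, projecting structure onto raw files only at query time, can be sketched in a few lines.  The field names and sample records below are hypothetical; in Hive the same projection would be declared once as an external table and queried with HiveQL.

```python
import csv
import io

# Raw delimited log data as it might sit in HDFS: no schema attached
RAW = """2013-10-01,alice,/products/42,200
2013-10-01,bob,/cart,500
2013-10-02,alice,/checkout,200"""

# Hypothetical "data map": field names applied only when reading
SCHEMA = ["date", "user", "path", "status"]

def read_with_schema(raw, schema):
    # Project the schema onto each raw row at read time
    for row in csv.reader(io.StringIO(raw)):
        yield dict(zip(schema, row))

# A HiveQL-style filter expressed over the projected records
errors = [r for r in read_with_schema(RAW, SCHEMA) if r["status"] != "200"]
print(errors[0]["user"])  # bob
```

The raw file never changes; only the projection does, which is why the same data lake can serve several differently structured views.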

Big data is usually processed only periodically, e.g. daily or weekly.  During processing, the load is dispersed over multiple processing nodes to speed it up, but after the query has executed those nodes are no longer required until the next run.  This is indicative of a perfect expansion-contraction pattern.  To reduce cost and maximise hardware utilisation, it would be a boon to have the fabric dynamically provision nodes in the grid to the jobs that are executing, and so provide an agile infrastructure.  Hadoop should be thought of as a distributed operating system, with many computers storing and processing data as if on a single machine (Ken Krugler[15]).  Adding the ability to host Hadoop in the Cloud, to achieve an agile infrastructure, sees efficiency maximised and costs reduced for a number of use-cases (Downey, 2011).

Hadoop is essentially used for (Julio, 2009):

  • Searching
  • Log processing
  • Recommendation systems
  • Analytics
  • Video and Image analysis
  • Data Retention
General BI Scenario with Hadoop

Visualisation of Data and Hadoop

Hadoop is supported by data virtualisation technologies such as Informatica, Composite Server, and Denodo.  Visualisation technologies such as Tableau[16] can connect to multiple flavours of Hadoop to bring data in-memory and perform fast ad hoc visualisations, finding patterns and outliers across all the data in the Hadoop cluster.  Tableau is said to be a very elegant solution, obviating the need to move huge data sets into a relational store before analysing them (Tableau, 2013).  Tableau uses ODBC with the Cloudera connector for Tableau, which uses HiveQL to connect to Hive and MapReduce over HDFS and HBase (Popescu, 2012).
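The ODBC path from a client to Hive can be sketched as below.  The driver name, hostname, and table are assumptions to adapt to your own installation; the actual connection is commented out because it needs a live cluster, but the shape of the connection string and the HiveQL handed to the driver are the point.

```python
# Hypothetical driver name and host: adjust to your ODBC installation.
DRIVER = "Cloudera ODBC Driver for Apache Hive"
conn_str = (
    f"DRIVER={{{DRIVER}}};"
    "HOST=hadoop-master.example.com;"  # assumed hostname
    "PORT=10000;"                      # default HiveServer port
)

# The ODBC layer hands HiveQL to Hive, which compiles it to MapReduce jobs
hiveql = "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page"

# Uncomment against a live cluster:
# import pyodbc
# with pyodbc.connect(conn_str, autocommit=True) as conn:
#     for page, hits in conn.cursor().execute(hiveql):
#         print(page, hits)

print(conn_str)
```

A BI tool such as Tableau does essentially the same thing behind its connection dialog: it builds an ODBC connection and issues HiveQL on the user’s behalf.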

A visualisation tool (in this example Tableau) is connected to Hadoop, from which visualisations are built just as when connected to a traditional database.  Tableau achieves “on-the-fly ETL” by connecting to Hive with Custom SQL, allowing the batch-orientated[17] nature of Hadoop to handle layers of analytical queries on top of complex Custom SQL with only an incremental increase in time.

Although Hadoop is essentially a batch-orientated system, there are a number of techniques for improving the performance of visualisations and dashboards built from data stored on a Hadoop cluster.  Custom SQL can limit the data-set size, speeding up the exploration of a new data set and the building of initial visualisations.  Data can also be extracted for Tableau[18]: although Tableau is capable of handling large sets of data, it is nowhere near as well positioned as Hadoop to handle really large, or Big Data, loads.  An extract can be pre-formatted for Tableau by using Hadoop to roll up dates, hide unused fields, and aggregate visible dimensions, creating a broader view of the data.  Hive can, for example, partition records on disk and allow Tableau to direct a query aligned to these partitions, greatly reducing the time spent getting to the right data (Hortonworks, 2013).
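The “roll up dates” pre-aggregation mentioned above is simple to picture.  The sketch below, with made-up rows, collapses day-level detail to (month, region) totals; in practice this aggregation would run as a Hive or MapReduce job so that only the small summary is extracted into Tableau.

```python
from collections import defaultdict

# Hypothetical day-level fact rows: (date, region, amount)
daily_rows = [
    ("2013-10-01", "EMEA", 120.0),
    ("2013-10-15", "EMEA", 80.0),
    ("2013-11-02", "APAC", 200.0),
]

def roll_up_to_month(rows):
    # Aggregate daily detail to (month, region) before extracting,
    # shrinking the data handed to the visualisation tool
    totals = defaultdict(float)
    for date, region, amount in rows:
        totals[(date[:7], region)] += amount  # "YYYY-MM" month key
    return totals

summary = roll_up_to_month(daily_rows)
print(summary[("2013-10", "EMEA")])  # 200.0
```

The extract then holds one row per month and region instead of one per day, which is usually a reduction of one to two orders of magnitude.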

HADOOP Use-case

Why is Hadoop most typically the technology underpinning “Big Data”?  Data comes from a set of data sources, e.g. CRM, ERP, and custom applications.  This data has to be queried somehow: via an elaborate point-to-point integration, a virtual data store, a data warehouse, or even a Massively Parallel Processing (MPP) system, e.g. Microsoft’s Parallel Data Warehouse.  The application doing the querying is probably a high-end business intelligence tool, e.g. SAS.  As other unstructured data of machine-generated or social origin are added, the data integration becomes more complex.

Hadoop is not replacing traditional data systems, but is complementary to them (Hortonworks, Apache Hadoop Patterns of Use, 2013).

Enterprise Data Architecture

The traditional Enterprise Data Architecture is indicated inside the red-marked area, with the Hadoop capability added alongside it.  The specific use of Hadoop is directed by a few use-cases or patterns.


Observing analytic challenges from an infrastructure level, three clear patterns emerge (Hortonworks, Apache Hadoop Patterns of Use, 2013):

  1. Hadoop as a Data Refinery
  2. Data Exploration with Hadoop
  3. Application Enrichment

Hadoop as a Data Refinery

Hadoop refines data sources for the commonly used analytics applications, e.g. providing a central view of the Customer from all the data held about them in ERP, CRM, or bespoke systems.  This pattern is especially powerful where Customer data are incorporated from web sessions on the enterprise’s web site, revealing what customers have accessed and what they are interested in.

This pattern uses Hadoop[19] to parse, cleanse, apply structure to and transform such data, and then push it into an analytics store for use with existing analytics tools.  Hadoop serves as a type of staging environment between the source and the analytics tools, acting as the transformation agent.
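The parse–cleanse–transform stages of the refinery pattern can be sketched as a tiny pipeline.  The record layout and tier rule here are invented for illustration; in a real refinery each stage would be a MapReduce or Hive step over files in HDFS, with the output loaded into the analytics store.

```python
def parse(line):
    # Parse: split a raw delimited record into named fields
    customer_id, email, spend = line.split("|")
    return {"id": customer_id, "email": email, "spend": spend}

def cleanse(record):
    # Cleanse: normalise casing/whitespace, drop records missing a key field
    record["email"] = record["email"].strip().lower()
    return record if record["email"] else None

def transform(record):
    # Transform: cast types and derive a field for the analytics store
    record["spend"] = float(record["spend"])
    record["tier"] = "gold" if record["spend"] > 1000 else "standard"
    return record

raw_lines = ["C001| Alice@Example.COM |1500.00", "C002|bob@example.com|90.50"]
refined = [transform(r) for r in (cleanse(parse(l)) for l in raw_lines) if r]
print(refined[0]["tier"])  # gold
```

Each stage is stateless per record, which is what lets Hadoop run the same logic over billions of rows in parallel before the result ever reaches a BI tool.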

Data Exploration with Hadoop

Hadoop is used directly on the data to explore it.  It still plays the role of parser, cleanser, structurer and transformer, but it provisions the data to visualisation tools, e.g. Tableau (Tableau, 2013).  Financial institutions can use this pattern to discover identity fraud, i.e. to perform a kind of surveillance activity.

Application Enrichment

Data stored in Hadoop is used to influence an application’s behaviour, e.g. a web application using a user’s access patterns to tailor its presentation to the returning user’s usage characteristics.  Hadoop parses, cleanses, applies structure to and transforms the data, and passes it directly into the application.  This ability is used by the large web companies of the world, such as Yahoo! and Facebook.  The right data enriches the experience of the right customer at the right time.  Massive numbers of patterns and repeatable behaviours are identified in order to customise the user’s web-application experience, serving the right content to the right person at the right time and increasing conversion rates for purchases.
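The enrichment decision itself is often very simple once Hadoop has distilled the access patterns.  The sketch below, with a made-up click history, picks the content category a returning user visits most and falls back to a default for new users.

```python
from collections import Counter

# Hypothetical click history for a returning user, as distilled by Hadoop
access_log = ["sport", "sport", "finance", "sport", "travel"]

def preferred_content(history, default="homepage"):
    # Serve the category the user visits most often; fall back for new users
    if not history:
        return default
    return Counter(history).most_common(1)[0][0]

print(preferred_content(access_log))  # sport
print(preferred_content([]))          # homepage
```

The heavy lifting is upstream: Hadoop reduces terabytes of raw clickstream to a small per-user profile that the application can query at page-render time.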

End Notes

[1] Too big for Excel is not Big Data! (Chris, 2013).  The New York Stock Exchange generates about 1 TB of new trading data per day, Facebook hosts about 10 billion photos (1 petabyte), Ancestry.com stores 2.5 petabytes of genealogy data, and the Internet Archive stores around 2 petabytes, growing at a rate of 20 TB per month (Seminar Report Downloads, 2012).

[2] It was first conceived as a web search engine for Yahoo! (Big Data Analytics: Unleashing the Power of Hadoop, 2013).

[3] A (subjective) figure of twenty terabytes is cited as a tipping point for considering Hadoop (Chris, 2013) “Rama Ramasamy”.

[4] There are projects that provide real-time capabilities to applications for results from Hadoop, but it is primarily a batch-orientated processing solution (HandsonERP & Mir, 2013).

[5] LZO or Snappy Compression can be applied to HDFS files to reduce the bytes read or written thus improving the efficiency of network bandwidth and disk space (Julio, 2009).

[6] Easy to obtain, available and affordable hardware that is not unique in any function.

[7] The ability of two or more entities or components to exchange information and to use the information that has been exchanged to meet a defined mission or objective; in this scenario this term specifically relates to access to information and the infrastructure that supports it (The Open Group, 2002).

[8] The process of combining components into an effective overall system. In this scenario the phrase “access to integrated information” is used repeatedly. The term “integrated information” means an overall system in which a single view of information is presented, and is achieved by combining various information sources (The Open Group, 2002).

[9] By 2015, 50% of Enterprise data will be processed by Hadoop (HandsonERP & Mir, 2013).

[10] A Hadoop cluster can run alongside a trading system, obtaining copies of the trading data; coupled with the reference data of the parties to a trade, the system can continually monitor trade activity, building relationships between people and organisations trading with each other and watching for patterns that reflect rogue trading (Cloudera, 2013).

[11] MapReduce is a programming model for processing large data sets with a parallel and distributed algorithm on a cluster, doing filtering and sorting on the distributed servers, running various tasks in parallel, managing all communications and data transfers between the various parts of the system, providing for redundancy and fault tolerance and overall management of the whole process (WikiPedia, 2013).

[12] The progress estimator for MapReduce in Hadoop is simplistic, and often produces inaccurate estimates.

[13] Collapsing data silos creates a pro-BI database on which to position BI and visualisation tools.

[14] To use Hortonworks’ online Hadoop data source for Tableau, connect via Tableau’s Hortonworks Hadoop Hive connection to the (Sandbox VM) to start visualising data from Hadoop (Hortonworks, HOW TO: Connect Tableau to Hortonworks Sandbox, 2013).  Get the driver from here: http://www.tableausoftware.com/drivers.

[15] http://www.linkedin.com/in/kenkrugler

[16] Through a partnership with Cloudera, Tableau Software has built a Hadoop connector for big data visualisation (Williams, 2012).  Also see footnote 14 on how to use Hortonworks’ online Hadoop data source for Tableau.

[17] Hadoop is not yet capable of answering simple queries with very quick turnaround. (Hortonworks, 2013).

[18] Hadoop query cancellation is not straightforward.  Queries issued by Tableau can only be abandoned; the query continues on the Hadoop cluster, consuming resources (Hortonworks, 2013).

[19] Large quantities of data are distilled to something more manageable (Hortonworks, Apache Hadoop Patterns of Use, 2013).


Big Data Analytics: Unleashing the Power of Hadoop. (2013, August 21). Retrieved from Database Trends and Applications: http://www.dbta.com/Articles/Editorial/Trends-and-Applications/Big-Data-Analytics-Unleashing-the-Power-of-Hadoop-91261.aspx

Chris, S. (2013, September 16). Don’t use Hadoop – your data isn’t that big. Retrieved from Chris Stucchio: http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

Cloudera. (2013). Ten Common Hadoopable Problems: Real-World Use Cases. Chicago: Cloudera. Retrieved from http://www.cloudera.com/content/dam/cloudera/Resources/PDF/cloudera_White_Paper_Ten_Hadoopable_Problems_Real_World_Use_Cases.pdf

Downey, J. (2011, July 1). Big Data and Hadoop: A Cloud Use Case. Retrieved from Jim Downey: http://jimdowney.net/2011/07/01/big-data-and-hadoop-a-cloud-use-case/

Gualtieri, M. (2013, October 23). 5 Reasons Hadoop Is Kicking Can and Taking Names. Retrieved from Information Management: http://www.information-management.com/blogs/5-reasons-hadoop-is-kicking-can-and-taking-names-10024986-1.html

HandsonERP, & Mir, H. (2013, April 5). Hadoop Tutorial 1 – What is Hadoop? Retrieved from YouTube: http://www.youtube.com/watch?v=xWgdny19yQ4

Hortonworks. (2013). Apache Hadoop Patterns of Use. Chicago: Hortonworks. Retrieved from http://hortonworks.com/wp-content/uploads/downloads/2013/04/Hortonworks.ApacheHadoopPatternsOfUse.v1.0.pdf

Hortonworks. (2013, August 26). Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Platform. Retrieved from SlideShare: http://www.slideshare.net/hortonworks/best-practices-forhadoopdataanalysiswithtableauv10-25612601

Hortonworks. (2013, November 07). HOW TO: Connect Tableau to Hortonworks Sandbox. Retrieved from Hortonworks KB: http://hortonworks.com/kb/how-to-connect-tableau-to-hortonworks-sandbox/

Julio, P. (2009, October 19). What is Hadoop used for? Retrieved from SlideShare: http://www.slideshare.net/PhilippeJulio/hadoop-architecture

Kernochan, W. (2011, October 20). Big Data, MapReduce, Hadoop, NoSQL: The Relational Technology Behind the Curtain. Retrieved from enterpriseappstoday: http://www.enterpriseappstoday.com/business-intelligence/big-data-mapreduce-hadoop-nosql-2.html

Linthicum, D. (2013). Making Sense of Big Data. Chicago: IBM.

Payandeh, F. (2013, September 8). Hadoop vs. NoSql vs. Sql vs. NewSql By Example. Retrieved from Big Data Studio: http://bigdatastudio.com/2013/09/08/hadoop-vs-nosql-vs-sql-vs-newsql-by-example/

Popescu, A. (2012, February 8). Visualizing Hadoop data with Tableau Software and Cloudera Connector for Tableau. Retrieved from myNoSQL: http://nosql.mypopescu.com/post/17262685876/visualizing-hadoop-data-with-tableau-software-and

Seminar Report Downloads. (2012, February 06). Hadoop. Retrieved from Seminar Report Downloads: http://seminarreportdownloads.wordpress.com/tag/new-york-stock-exchange/

Srivastava, A. (2011, April 26). A Hadoop Use Case . Retrieved from Anjur-sa.blogspot.com: http://ankur-sa.blogspot.com/2011/04/hadoop-use-case.html

Tableau. (2013, October 25). Hadoop. Retrieved from Tableau Software: http://www.tableausoftware.com/solutions/hadoop-analysis

The Open Group. (2002). Interoperable Enterprise Business Scenario. Chicago: The Open Group.

Welcome to Apache™ Hadoop®! (2013, October 24). Retrieved from hadoop.apache.org: http://hadoop.apache.org/

WikiPedia. (2013, October 23). Apache Hadoop. Retrieved from WikiPedia: http://en.wikipedia.org/wiki/Apache_Hadoop

WikiPedia. (2013, October 19). MapReduce. Retrieved from WikiPedia: http://en.wikipedia.org/wiki/MapReduce

Williams, A. (2012, November 9). Hadoop. Retrieved from Siliconangle: http://siliconangle.com/blog/2011/11/09/tableau-brings-data-visualization-to-hadoop/


Microsoft® SQL Server™ AlwaysOn™ for Performance Mitigation?

Executive Summary

Microsoft® SQL Server™ AlwaysOn™ is a database clustering technology intended for high availability (HA) and Disaster Recovery (DR) with the ability to accommodate other beneficial workloads, such as read-only reporting query load sharing and database backups.

This article investigates using AlwaysOn™ as a performance optimisation (load-sharing) mechanism for read-only reporting purposes.  The conclusion, in keeping with the intended purpose of AlwaysOn™, is that this function is not a preferred performance mitigation strategy for database access in an enterprise.  This does not imply, however, that performance enhancements are not achievable; quite the contrary.

Firstly, AlwaysOn™ is not intrinsically designed for performance mitigation (between OLTP and reporting workloads) or for the mitigation of concurrency, although it does provide secondary benefits that alleviate such contention.  SQL Server 2012 can be positioned as a less performant database system than SQL Server 2014 simply because of technological advancements in the latter, specifically SQL Server 2014’s in-memory (Hekaton) capability and the columnstore indices for the data warehouse, amongst many other enhancements.

Hekaton, however, is an enhancement that benefits OLTP, and the significant performance gains it achieves may even remove the need for a secondary database over which to defray concurrency by sharing load.  AlwaysOn™ is a superb solution for geographically dispersed databases, brokering regionalised access to data and spreading connections and load across the database system.  But for a single enterprise architecture, assuming AlwaysOn™ to be the perfect replacement for mitigating the performance of read-only reporting queries, whether by federation or a replication architecture, is erroneous.

Great performance gains are achievable by indexing secondary databases, via the primary, but this requires the indices to be the same on both, despite the different workloads.  Moreover, despite the additional performance benefit of temporary query statistics in tempdb and the compounded performance gains of SQL Server 2014, this does not make AlwaysOn™ the de facto load-balancing mechanism for offloading reporting workloads.  It is therefore not preferable to spawn active secondary databases in lieu of existing replicated databases just because we can, thinking this is how to upgrade the replicated model.  It is more likely that by using SQL Server 2014 with Hekaton and increasing the hardware specification, the replicated databases can be abolished altogether, because of the much-increased native ability of SQL Server 2014; with its in-memory abilities, almost any[1] amount of transactional or reporting queries should be tolerated by the database server.  Rationalising a proper enterprise data access strategy, so as to avoid the unnecessary implementation of business logic (modelling) on the database, is achievable through other technologies that augment SQL Server 2014’s new performance abilities, e.g. data virtualisation, which is only as good as the underlying database performance.  AlwaysOn™ should not be seen as the panacea for these problems, given the technology enhancements of the database server itself.

This article concludes that AlwaysOn™ should be used for its intended purpose, i.e. High Availability and Disaster Recovery, and only in a limited capacity to defray concurrency and performance, except in the case of geographically dispersed databases.  It should not be applied as the preferred method of mitigating performance issues between OLTP and reporting queries; rather, a more apt enterprise data access strategy should be architected to address enterprise database performance issues.

[1] Many factors more than with SQL Server 2012, so much so that we could almost (by comparison) say any amount of queries.


SQL Server® is a Microsoft™ database technology for enterprise relational database management systems.  It offers a feature called AlwaysOn™, essentially a High Availability capability for planned and unplanned disaster recovery purposes, akin to other high-availability options such as replication, log shipping, SQL clustering, and database mirroring.  AlwaysOn™ addresses the functional needs accumulated from the aforementioned architectural patterns, in an attempt to augment and simplify their respective limitations, by providing the following abilities:

  • Database replicas, up to five: one primary and four[1] secondaries
  • High Availability and Disaster Recovery, as an enhanced mirroring with read-only access
  • Synchronous and asynchronous transactions between primary (R/W) and replicas (R), with replicas using snapshots to avoid blocking
  • Use replicas for reporting access
  • Automatic failover

The new HA/DR feature, an Availability Group[2], combines the best of failover clustering and database mirroring.  An Availability Group is a group of databases that fails over together (as a unit) from one SQL Server replica to another within the same availability group (Ali, 2012).  Each availability replica resides on a different node of a single Windows Server Failover Cluster (WSFC).

This article considers only the viability of using AlwaysOn™ for the purpose of offsetting reporting workloads[3].

Realising how to leverage AlwaysOn™ for performance optimisation means understanding how the primary and replica databases can be used for this purpose.

Active Secondary Capabilities

Using active secondary[4] capabilities provides better resource utilisation (lowering cost) by offloading read-intent applications[5] and backup jobs to secondary replicas, which reduces the load on the primary[6] database (AlwaysOn Availability Groups (SQL Server), 2012).  A readable secondary, although providing read-only access, is not set to read-only but is dynamic.  On a typical secondary, the data is near real-time, i.e. lagging by only a few seconds (Active Secondaries: Readable Secondary Replicas (AlwaysOn Availability Groups), 2012).

Workloads are deflected from the primary database by directing read-only connections to a readable secondary replica, though such connections should tolerate latency.  To remove blocking, read-only workloads use row versioning[7] and are automatically mapped to the snapshot isolation transaction level, regardless of the actual explicit setting, and ignoring locking hints.  Read-only routing is achieved by employing an availability group listener, to which the client directs a connection declared with read-only intent.  There is no load balancing[8]: the connection is routed to the first available readable secondary on the primary’s routing list.  A secondary database can influence performance on the primary because of the snapshot isolation transaction mode.  Ghost-record clean-up on the primary can be blocked by transactions on a secondary.  For example (in an extreme case), if a secondary database has a long-running query, gets disconnected, or has its data movement suspended, log truncation is prevented, and the secondary may have to be removed from the availability group to release the primary database from this holding pattern[9]. (Active Secondaries: Readable Secondary Replicas (AlwaysOn Availability Groups), 2012).
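As a minimal illustration of declaring read intent, the sketch below builds ODBC-style connection strings in Python.  The listener name AGListener and database Sales are hypothetical; ApplicationIntent=ReadOnly is the connection keyword that allows the listener to route the session to a readable secondary rather than the primary.

```python
def build_connection_string(listener, database, read_only=False):
    """Build an ODBC-style SQL Server connection string.

    Pointing the connection at the availability group listener and adding
    ApplicationIntent=ReadOnly lets read-only routing send the session to
    a readable secondary instead of the primary.
    """
    parts = [
        "Driver={SQL Server Native Client 11.0}",
        f"Server=tcp:{listener},1433",
        f"Database={database}",
        "Trusted_Connection=yes",
    ]
    if read_only:
        # Read-intent declaration; the listener routes the session to the
        # first available readable secondary on the primary's routing list.
        parts.append("ApplicationIntent=ReadOnly")
        # Commonly paired with read intent when the group spans subnets.
        parts.append("MultiSubnetFailover=yes")
    return ";".join(parts)


# Reporting connections declare read intent; OLTP connections do not.
reporting = build_connection_string("AGListener", "Sales", read_only=True)
oltp = build_connection_string("AGListener", "Sales")
```

Both connections target the same listener; only the read-intent flag differs, which is what keeps reporting traffic off the primary.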

Indexing & Statistics

Deflecting workloads to a replica in the availability group, for reporting purposes, yields a query pattern that is significantly different from that of a transactional database.  Presenting a more optimal performance yield requires that the replica be able to tailor its responses to reporting-style queries.  Schema or data changes on a secondary database must come from the primary, because of the read-only attribute of the replica.  Creating the indices on the primary allows the redo process to transfer them to the replicas.

To optimise query plans on the read-only replicas, differently from (but in addition to) those on the primary, the statistics on the secondary are created in tempdb[10].  These statistics are temporary, read-only statistics that can only be created by SQL Server, but may be dropped by the user.

SQL Server 2014 a Better Option?

Almost always, the standard line from a Microsoft consultant is, “upgrade to the next version” for every existing version’s weaknesses.  So, what does SQL Server 2014 offer for performance mitigation between OLTP and reporting workloads, beyond what can be achieved in SQL Server 2012 with AlwaysOn™?  SQL Server 2014’s in-memory database ability[11] (Hekaton) is integrated with AlwaysOn™ by having the in-memory tables also readable on the secondary[12] (Diaconu, et al., 2013).  Hekaton, however, is an OLTP[13] optimisation, using direct in-memory pointers to address data, in place of disk-stored indices and indirection such as page/row IDs.  Hekaton is integrated into SQL Server 2014. (Larson, Zwilling, & Farlee, 2013).

But, does Hekaton benefit an active secondary for the purpose of offloading query pressure from the primary?  There are a few major Hekaton improvements:

  • In-memory tables[14]
  • Indices, hash and range with direct pointers to data, all in memory[15]
  • Stored Procedures compiled to native machine code
  • High level of concurrency[16], not relying on partitioning, using latch-free data structures to avoid physical interference amongst threads, with a new optimistic, multi-version concurrency control technique to reduce interference among transactions. (Larson, Zwilling, & Farlee, 2013).
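The multi-version idea in the last bullet can be illustrated with a toy sketch (not Hekaton’s actual implementation): every write creates a new timestamped version instead of overwriting in place, and a reader selects the version valid at its read timestamp, so readers never block writers.

```python
import itertools


class MVCCStore:
    """Toy multi-version store: writers append new versions stamped with
    a logical timestamp; readers pick the newest version whose timestamp
    is not later than their read timestamp, so reads never block writes."""

    def __init__(self):
        self.clock = itertools.count(1)       # monotonically increasing logical clock
        self.versions = {}                    # key -> list of (begin_ts, value)

    def write(self, key, value):
        ts = next(self.clock)
        self.versions.setdefault(key, []).append((ts, value))
        return ts

    def read(self, key, read_ts):
        # Latest version whose begin timestamp is <= the read timestamp.
        valid = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= read_ts]
        return max(valid)[1] if valid else None


store = MVCCStore()
t1 = store.write("price", 100)
t2 = store.write("price", 110)
# A reader positioned at t1 still sees 100, even after the later write at t2.
```

The price of this optimism is version clean-up (garbage collection of old versions), which is also why, as noted earlier, long-running reads on a secondary can hold up ghost-record clean-up on the primary.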

From a published AlwaysOn™ features perspective, SQL Server 2014 provides the following enhancements over SQL Server 2012:

  • Enhanced Availability for read-only Replicas
  • Increased Number of Replicas
  • Integration with Windows Azure
  • Enhanced Diagnostics

On face value, these don’t translate to any material benefit over SQL Server 2012, judging just by what’s new in AlwaysOn™.  But if the advances of Hekaton are considered, having all of those features available in the secondary databases provides a significant performance yield, albeit by virtue of the database engine and not because of AlwaysOn™.

The Column Store Index option[17] (data warehousing) is part of the in-memory ability of SQL Server 2014, providing greater data compression for more real-time analytics support.  The resource governor of SQL Server 2014 also provides the ability to perform I/O management (Team, SQL Server 2014: A Closer Look, 2013).  It also includes buffer-pool-extension support for solid-state drives, enabling faster paging (Foley, 2013).  All of these, yet again, are database-engine optimisations and enhancements, not AlwaysOn™.  So, any performance gains are because of the database technology and not a result of the AlwaysOn™ feature, and any metrics thus achieved should be attributed to the database technology only.

[1] This is a SQL Server 2012 constraint that is doubled with SQL Server 2014 (Otey, 2013).

[2] Availability groups cannot be created across domains, nor against multiple clusters in the same domain; two replicas on the same node cannot be part of the same group; system databases cannot be replicated; and cross-database transactions within a group are not supported (Golla, 2012).

[3] Scaling read-only workloads to replicas is available only in the SQL Server Enterprise Edition and attracts a full new server licence for every active replica. “When the secondary server does any work at all, like taking requests or running backups off of it, or any of the readable features of AlwaysOn are enabled then it must be fully licensed” (Hughes, 2012).

[4] Change tracking and change data capture are not available on secondary databases that belong to a readable secondary replica.

[5] This is very advantageous, specifically for geographically distributed instances of a database.

[6] In SQL Server 2014, the relationship between the primary and the replica is more stable, in that if the primary goes down, the replica is still usable (Ozar, 2013).

[7] Row versioning increases data storage in the primary and secondary, although version data is not generated by the primary database.  A read-committed snapshot isolation (RCSI) level setting is turned on for the primary database.

[8] In SQL Server 2014, load balancing can be implemented using simple DNS round-robin or specialised load balancing solutions, either hard or software based. (Team, AlwaysOn in SQL Server 2014 CTP1, 2013).

[9] However, if ghost records (on the primary) are still needed by the secondary, a DBCC SHRINKFILE on the primary database log may fail.

[10] Any server restart flushes all read-only replica statistics customised for the read queries.

[11] In-memory performance is achievable across OLTP, DW and BI (Underwood, 2013).

[12] Failover time is not dependent on size of durable memory optimised tables (Underwood, 2013).

[13] OLTP-like, i.e. short lived transactions, high degree of concurrency, e.g. stock trading, order processing, or travel reservations.  Long running transactions should be avoided.

[14] Max size for SCHEMA_AND_DATA = 512 GB (Underwood, 2013)

[15] In the event of a failure, all tables and indices are rebuilt entirely from the latest checkpoint and logs.

[16] Concurrency is obtained through multi-versioning per update per user.

[17] Column Store Indices can be updated for real-time data without dropping and recreating them (Belnekar, 2013).  In-memory ColumnStores provide fast execution for data warehouse queries and eliminate the need for other indices.  Microsoft’s PDW makes use of the in-memory ColumnStore in its Delta Store, converting data into a columnar format once a segment (1M rows) is full; it is the preferred storage engine of the PDW. (Underwood, 2013).

Works Cited

Active Secondaries: Readable Secondary Replicas (AlwaysOn Availability Groups). (2012). Retrieved from Microsoft Technet: http://technet.microsoft.com/en-us/library/ff878253.aspx

Ali, A. (2012, January 19). Using The AlwaysOn Feature of SQL Server 2012. Retrieved from SQL Server Performance: http://www.sql-server-performance.com/2013/alwayson-clustering-failover/

AlwaysOn Availability Groups (SQL Server). (2012). Retrieved from Microsoft Technet: http://technet.microsoft.com/en-us/library/hh510230.aspx

Belnekar, P. (2013, June 18). Priti Belnekar’s Blogs. Retrieved from Blogspot: http://pritibelnekar.blogspot.com/2013/06/sql-server-2014-new-features.html

Diaconu, C., Freedman, C., Ismert, E., Larson, P.-A., Mittal, P., Stonecipher, R., . . . Zwilling, M. (2013). Hekaton: SQL Server’s Memory-Optimized OLTP Engine. Redmond: Microsoft. Retrieved from http://research.microsoft.com/pubs/193594/Hekaton%20-%20Sigmod2013%20final.pdf

Foley, M. (2013, June 11). Microsoft’s SQL Server 2014: More than just in-memory OLTP. Retrieved from ZDNet: http://www.zdnet.com/microsofts-sql-server-2014-more-than-just-in-memory-oltp-7000016672/

Golla, K. (2012, September 17). SQL Server 2012: Always On FAQs. Retrieved from Blogs MSDN: http://blogs.msdn.com/b/srgolla/archive/2012/09/17/sql-server-2012-always-on-faqs.aspx

Hughes, J. (2012, October 17). Does Sql Server 2012 Always On support Active/Passive? Retrieved from Stack Exchange: http://dba.stackexchange.com/questions/27088/does-sql-server-2012-always-on-support-active-passive

Larson, P.-A., Zwilling, M., & Farlee, K. (2013). The Hekaton Memory-Optimized OLTP Engine. Redmond: Microsoft. Retrieved from http://sites.computer.org/debull/A13june/Hekaton1.pdf

Otey, M. (2013, September 20). Changes to SQL Server 2014 AlwaysOn Availability Groups. Retrieved from SQL Server Mag: http://sqlmag.com/sql-server-2014/changes-sql-server-2014-alwayson-availability-groups

Ozar, B. (2013, June 03). (Almost) Everything You Need to Know About SQL Server 2014. Retrieved from Brent Ozar: http://www.brentozar.com/archive/2013/06/almost-everything-you-need-to-know-about-the-next-version-of-sql-server/

Team, S. (2013, June 27). AlwaysOn in SQL Server 2014 CTP1. Retrieved from SQL Server Blog: http://blogs.technet.com/b/dataplatforminsider/archive/2013/06/27/alwayson-in-sql-server-2014-ctp1.aspx

Team, S. (2013, June 05). SQL Server 2014: A Closer Look. Retrieved from SQL Server Blog: http://blogs.technet.com/b/dataplatforminsider/archive/2013/06/05/sql-server-2014-a-closer-look.aspx

Underwood, J. (2013, September 22). SQL Server 2014 In-Memory, High Performance and Mission Critical Overview. Retrieved from Slideshare: http://fr.slideshare.net/idigdata/sql-server-2014-inmemory-and-mission-critical

Irritating Corporate Clichés

I read the article “Stop Using These 30 Phrases At Work!” on LinkedIn, and felt compelled to think about these annoying expressions that get thrown around meetings and the boardroom in our contemporary corporate culture.

Here are some really gnarly sayings that we “oh so love” to throw around to sound kewl:

• What’s good for the goose is also good for the gander
• The Lion’s share
• Off the back of that
• Picking low-hanging fruit
• We’ve all got to tighten our belts
• We’ve got to put a pin in that one
• Until we get traction on that project
• Look into the crystal ball. . . crystal gazing
• Back to square one
• Big ticket items
• At this point in time
• Crashing and burning
• Step up to the plate
• Get it on your radar
• Not in a good space right now
• From the (word) [get] go
• Get your ducks in a row
• Thousand foot (birds-eye/ helicopter) view
• Got too much on our plate
• Blue skies, green fields
• Behave yourself into a new attitude
• Like-for-like (forget comparative)
• Push the envelope
• Check all the boxes. . . cross the Ts and dot the Is
• On an ongoing basis
• Let’s chat about it off-line. . . guys take that off-line please

• Let’s only have one meeting please
• At the end of the day
• Game-on. . . play ball. . . if you don’t want to play ball
• The bottom line is
• At this moment in time
• Flush out
• Grass-roots level
• Face the music
• It’s arguably a good solution
• Let’s face it
• Pedal to the metal
• My bad!
• Man up and move out
• Expanding our horizon
• Let’s rewind
• Drink the Kool-Aid
• Keep it real. . . keep it high-level. . . the real deal
• Game changer
• Boil the ocean. . . water cooler conversation
• Ramp-up. . . push back. . . run it up the flag pole and see how it flies . . . let’s flesh it out
• Flying under the radar
• 24/7/365
• The train hasn’t (too many have) left the station, too many thrown under the bus
• Hit the ground running. . . live one day at a time
• “air quotes”
• best-of-breed
• Seamlessly. . . dove-tailing
• future-proof
• we feel your pain. . . the pain-points are
• Been there, done that, got the T-shirt
• Organically grow a line of business
• We’re gonna take shots for that
• It is what it is. . . like a fish out of water
• Guru. . . Nothing to fear, but fear itself
• Where the rubber meets the road. . . at ground zero. . . at the flash-point
• Loop-back. . . claw-back clause
• Let’s not reinvent the wheel here
• I could run this list by you, but to make a long story short. . . the list could be endless, a smorgasbord, whatever. . . you get the picture!

KDB+ and qStudio Kick-Start


KDB+ is a time-series analysis data solution for market data at large financial institutions.  It supports real-time and historical data analysis and monitoring on a single platform.  Consequently, it provides access to huge amounts of historical and real-time tick data for analysis, using its query language ‘q’.  KDB+ combines complex event processing and a database, and handles feed handlers such as Reuters, Tibco, and Bloomberg. (Power. Flexibility. Access., 2013).

This is a kick-start, getting KDB+ and q started and working with KDB+ in qStudio.

Installing & Connecting to KDB via UI

Running a KDB+ database requires the database and a user interface (UI).  There are several UIs available; some of them are listed on the Kx Wiki (Startingkdbplus/introduction, 2013).  This article selected qStudio (qStudio, 2013).  The operating system is Microsoft Windows 7.

Installing KDB+

The installation is straightforward.  Download the system, named “Free Trial”, from Kx (kdb+: The world’s most powerful number cruncher, 2013).

  • Extract to c:\q
  • Update Environment Variable[1]:
    • System Properties >> Environment Variable >> New
    • Variable name: QHOME
    • Variable value: c:\q
    • Open the console:
      • Windows+R >> CMD
      • cd\
      • cd c:\q\w32
      • q
      • Download and load the console help from code.kx.com/wsvn/kx/kdb+/d/help.q ((root)/kx/kdb+/d/help.q – Rev 1258, 2013)
      • Copy the file into c:\q
      • From the console, type q)\l help.q, then help`, to display help (Cookbook/GettingStarted, 2013).

To test a simple query, using the trade.q example database that comes with the installation, type the command select sym, time, price, size from trade at the q) command prompt and hit Enter.

The results display the one and only record in the trade.q file.

Table 1 q results from trade.q file

sym time         price size


a   09:30:00.000 10.75 100

Run another test query from the file sp.q: type into the q command line: q)\l sp.q.

Issue a query as follows: q)select qty by s from sp.

Table 2 q results from sp.q query

s | qty

--| -----------------------

s1| 300 200 400 200 100 400

s2| 300 400

s3| ,200

s4| 100 200 300
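For readers unfamiliar with q, the by clause groups the selected column’s values per key (in Table 2, the leading comma in ,200 is q’s notation for a one-item list).  A Python sketch of the same grouping, with rows mirroring the quantities shown in Table 2:

```python
from collections import defaultdict

# Rows mirroring the supplier/quantity pairs implied by Table 2 above.
sp = [
    ("s1", 300), ("s1", 200), ("s1", 400), ("s1", 200), ("s1", 100), ("s1", 400),
    ("s2", 300), ("s2", 400),
    ("s3", 200),
    ("s4", 100), ("s4", 200), ("s4", 300),
]

# Equivalent of q's `select qty by s from sp`: collect qty values per s.
by_s = defaultdict(list)
for s, qty in sp:
    by_s[s].append(qty)
```

Each key of by_s corresponds to a row of Table 2, with the list of quantities playing the role of q’s nested qty column.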


[1] If installed in c:q, it is deemed the default and an Environment variable is not required.

Connecting to KDB via UI qStudio

To connect to KDB, download the UI called qStudio for Windows from TimeStored (qStudio, 2013).

Install and associate the UI installation with the KDB files, all part of the installation and intuitive enough to not warrant further explanation.

The first thing to do, is to connect to a running KDB+ server (qStudio kdb+ IDE Help, 2013).

Create the KDB+ server by opening a console window and starting q, as described in the Installing KDB+ section above.  Provide a port to listen on (Startingkdbplus/ipc, 2013), e.g. -p 5010 => c:\q> start w32\q trade.q -p 5010.

The command (above) starts q, loads trade.q (q)\l trade.q) and assigns a port, i.e. 5010.  This renders a running KDB+ server to which qStudio is able to connect.

Open qStudio and right-click Servers >> Add Server, see Figure 1 below.  Specify the port, i.e. 5010, and click the [test] button, to which the pop-up window answers: “Connection works”.  If, for some reason, the window instead reports that the connection failed or is not working, the KDB+ server is probably not running.  The Server Tree is expanded and should yield trade as a table.  Load other tables by issuing the \l file.q command from the q prompt.  Refresh the Server Tree to show these tables.

Figure 1: Adding a Server to qStudio


Works Cited

(root)/kx/kdb+/d/help.q – Rev 1258. (2013, July 17). Retrieved from Subversion Repositoriescode : http://code.kx.com/wsvn/code/kx/kdb+/d/help.q

Cookbook/GettingStarted. (2013, 07 17). Retrieved from Kx: http://code.kx.com/wiki/Cookbook/GettingStarted

kdb+: The world’s most powerful number cruncher. (2013, July 17). Retrieved from Kx: http://kx.com/kdb-plus.php

Power. Flexibility. Access. (2013, 07 17). Retrieved from Kx: http://kx.com/kdb-plus-taq.php

qStudio. (2013, July 17). Retrieved from TimeStored: http://www.timestored.com/qstudio/

qStudio kdb+ IDE Help. (2013, July 17). Retrieved from TimeStored: http://www.timestored.com/qstudio/help/#gettingStarted

Startingkdbplus/introduction. (2013, February 08). Retrieved from Kx: http://code.kx.com/wiki/Startingkdbplus/introduction#1.1_Overview

Startingkdbplus/ipc. (2013, July 18). Retrieved from Kx: http://code.kx.com/wiki/Startingkdbplus/ipc



NoSQL data stores are an emerging technology, spurred by the increase in web usage and the need to track user interactions, logs and web clicks, and by demands around big data, low latency, lower cost, simplified development etc. (Fowleruk, 2013).

An organisation constantly has to evaluate what is best for its enterprise architecture and match technologies to business or information-technology problems.  Evaluating NoSQL technologies reveals two main advantages to take advantage of, viz. a reduction in development drag and data-access performance.  Viewing these advantages in isolation makes it seem that the choice should be natural.  Nevertheless, understanding NoSQL means ratifying and corroborating these factors and attributing them to a problem solution.  NoSQL must also be understood as a tool for a purpose, not only in isolation, but potentially in association with other existing, even complementary, tools.

While NoSQL yields fast data-retrieval performance with great horizontal scalability, something a traditional relational database technology like SQL Server is incapable of without much trouble and cost, these advantages come at a cost of their own.  This cost is generally the factor to consider, and it becomes a determinant in a decision on the technology.

A big challenge vests in the handling of OLTP data, specifically where this data has to be presentable within a near-real-time frame for reporting, fringed by historical records, combined by a SQL query, and presented to a user in good time.  Reporting causes locking on a relational data store that impacts the data-insert process, which pushes back on the provisioning system, e.g. a trading engine, and causes service-level-agreement (SLA) infringements.  It is a tremendous draw-card to provision a highly scalable, fast database technology, like NoSQL, to avert this.  But the irresistible attraction of NoSQL, speeding up development without a full view of what is sacrificed, could lead to major problems down the line.

The general ruling within the user community on when it is best to use a NoSQL database is “it depends” (Eng, 2013), with some saying that in many cases the drawbacks of NoSQL outweigh the advantages.  Implementing a NoSQL data store requires a specific use-case, relative to its requirements.  NoSQL databases lend a lot of autonomy to developers by means of dynamic schemas, meaning that the application holds control over what is stored and how, making the code play well with the database.  This overcomes the requirement for elaborate ORM mappings in an attempt to relate data to code.  Despite this boon, the bane is that all data-manipulation logic vests in the application, making later data-technology changes very cumbersome and transferring the integration effort right back into the application.  Furthermore, data manipulation happens through the application code, taking SQL out of the mix.  Construing this as an advantage has to admit the downside that the application has to acquire all the data that it queries, because the work does not happen on the database.  Horizontally scaling the database forestalls performance bottlenecks, but sacrifices consistency, in that only two of the three guarantees of the CAP theorem may be realised.

No situation is definitive.  Evaluate each case on its merits, considering which database type and technology best suits the particular solution.  Do not use NoSQL when the requirement is:

  • OLTP
  • Consistent Data at all times to all viewers
  • Complex relationships in the data
  • Mature and Stable product

However, a few proficient technologies addressing very new and cutting-edge requirements, such as in-memory databases, caching of data and polyglot persistence, require attention to see if they solve an ongoing and particular problem in the enterprise architecture.  These are technologies like MongoDB, VoltDB, Microsoft Hekaton, and RavenDB, where not only in-memory capabilities are realised, but also consistency, and even horizontally scalable relational models.

Deciding definitive use-cases for NoSQL, relational databases, NewSQL or polyglot computing is a complex arbitration that cannot be expressed succinctly, but has to be elaborated and expressed categorically, per requirement.

This paper illustrates this complexity, by explaining the database types and use-cases, drawing attention to potential requirements.


The easiest way to think of a NoSQL database is that it (as a database) does not adhere to a traditional relational database management system structure.  In business solutions, the reality is that neither relational nor NoSQL databases may be the best solution for all situations. (Menegaz, 2012).

The appearance of NoSQL[1] databases has prompted a re-look at traditional relational data models, to see how this movement in technology positions itself in the enterprise architecture.  The NoSQL user community punts this technology as lightweight, horizontally scalable, simple to support, schema-free, easy to replicate, with simple APIs, storing huge amounts of data, and even (in some cases) having ACID[2], BASE[3] or REST[4] qualities. (LIST OF NOSQL DATABASES, 2013).  However, a full guarantee of ACID is often actually just eventual consistency for transactions limited to single data items, where performance and real-time processing are more important than consistency. (NoSQL, 2013).  The CAP theorem (Brewer’s theorem) states that it is impossible for a distributed computer system to simultaneously provide all three guarantees, viz. Consistency, Availability and Partition tolerance. (CAP Theorem, 2013) (Nurse, 2010).  The trade-off for NoSQL is a “pick two” choice between speed, consistency and data volume. (Ayushman Jain, 2013).

The explosion of events, products and objects associated with the web increased the data volumes stored for the purpose of tracking user data.  Frequent access to, and processing of, this data, e.g. in social networks, where real-time activity feeds for users are processed, increases the performance requirements for systems and their data.  Scaling, being the core requirement, is an aspect that a relational system does not cope well with; coupled with cheap storage and increased processing power, this warrants that these quantities of data be stored in a non-relational fashion, i.e. NoSQL databases.

The question of NoSQL vs. relational databases requires an answer about what is good for the organisation in which application.  The answer may suggest that an organisation use many data technologies and apply them where they fit the requirement.  The ability to match requirements to technology requires knowledge about the different approaches and technologies.  NoSQL is an emerging technology with far fewer tools than those available to its relational counterpart, e.g. graphics, reporting, or some open-source tools.  With about twenty different NoSQL choices, each with its own philosophy on working with the data, toolmakers have a challenge in creating tools.  In essence, NoSQL tosses some functionality for speed. (Wayner, 2012).  However, NoSQL has a definitive advantage for application-development productivity and large-scale data. (Pramodkumar J. Sadalage, 2012)

An organisation’s data lasts much longer than its programs, which requires a stable data-storage platform that is well understood and accessible from many application-programming platforms.  The problem of making code play well with relational models gave rise to technologies such as ORM[5] and NoSQL.  According to Martin Fowler, co-author of NoSQL Distilled (Pramodkumar J. Sadalage, 2012), NoSQL does not sound the death knell of relational databases, nor even signal the beginning of their end.  He says, however, that relational databases will not be the only databases in use.  As stated by Fowler: “many projects (strategic), probably a majority, are better off sticking with the relational orthodoxy.  But, the minority that shouldn’t is a significant one.” (Polyglot Persistence, 2011).

Improved read performance, through highly optimised key-value pairs for simple retrieval and append operations, provides huge benefits in terms of latency and throughput.  Real-time web applications and big-data industry use are where industry growth vests.

NoSQL benefits agile-style development through its dynamic (Spartan[6]) schema, allowing data inserts without a pre-defined schema.  No advance knowledge is required about a schema for the NoSQL database, and programmers can make the decision whenever the need arises to store something.  This freedom may lead to increasing inconsistencies, and NoSQL offers no support to limit this problem, which rests on development to mitigate.  In contrast, a relational database change may require DBA approval, yet it may be easier to sustain than having to deal with half a dozen different keys created on the fly by programmers. (Wayner, 2012).  NoSQL achieves development agility, for application changes, through the abstraction established between application and data.  The NoSQL data store encapsulates data as a structure decoupled from the application, allowing easy database changes without interference with the application, by having the data logic as an implementation detail of the code.  The application’s data code enables data distribution, distributed queries, and aggregated results across all database instances.  Writing such additional code can also accommodate resource failures, data re-balancing and replication, and perform joins across different databases.
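A minimal sketch of the dynamic-schema idea (names and fields are illustrative only): documents with differing fields coexist in the same store, and any consistency discipline rests entirely with the application code, exactly the freedom and the risk described above.

```python
class DocumentStore:
    """Minimal in-memory document store: no schema is declared up front,
    so each document may carry its own set of fields. Any validation is
    the application's job -- the store itself enforces nothing."""

    def __init__(self):
        self._docs = {}
        self._next_id = 0

    def insert(self, doc):
        self._next_id += 1
        self._docs[self._next_id] = dict(doc)
        return self._next_id

    def find(self, **criteria):
        # Match documents on whichever fields they happen to have.
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]


users = DocumentStore()
users.insert({"name": "Ada", "email": "ada@example.com"})
# A later insert can carry fields no earlier document had -- no migration needed.
users.insert({"name": "Grace", "twitter": "@grace", "active": True})
```

Nothing stopped the second document from introducing new fields; equally, nothing would stop a programmer from misspelling a field name, which is the inconsistency risk the paragraph above warns about.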

Integrated caching and auto-partitioning of the database, over a distributed architecture, scale horizontally (in contrast to an RDBMS’ vertical scaling) and yield a pro-performance data architecture.  Coupled with the maintenance features of a cloud solution, such a deployment eases the need for capacity planning, maintenance, and the intellectual-capital costs of having a NoSQL data capability.  Integration with a caching technology[7] keeps most-used data in memory[8] for quick access, balancing load demands through the horizontally scaled, distributed architecture of the database.  The Agile development methodology brings with it a few constrictions for the relational database.  The well-known problematic relationship between code and the relational model does not benefit Agile[9], and Agile does not benefit relational models.  Changes to a data model in Agile development are evolutionary, made without knowing the full set of requirements upfront.  Frequent changing of a relational model, compounded by the presence of data, creates a lot of pain for developers, in both application and database.  A common utterance is: “It was the mapping and migrations I wanted to get rid of” (Hariri, 2010).  NoSQL provides a good solution for this through Event Sourcing.

The implementation of a NoSQL database spans a few types (What is NoSQL, 2013):

  • Document database
  • Graph stores
  • Key-value stores
  • Wide-column stores

Implementing a data-storage model varies between a relational and a NoSQL data store.  A NoSQL database implementation varies by database type.  Relational databases use individual records, stored in rows and tables in an often-normalised[10] structure, which is variable for NoSQL.

The presentation of queries against a relational structure, conjoining tables, materialises through SQL join commands, retrieving meaningful information as a set of records.  NoSQL, depending on the database implementation type, stores data as key-value pairs, value columns, or JSON/XML documents.  A join implies consistency, and a join is a relatively costly data operation.  NoSQL makes it difficult to keep various data entities consistent and is best suited to simpler tasks. (Wayner, 2012).
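The contrast between joining at query time and storing a pre-joined aggregate can be sketched as follows (customer and order values are illustrative only):

```python
# Relational style: normalised rows, stitched together at query time.
customers = {1: {"name": "Acme"}}
orders = [
    {"id": 10, "customer_id": 1, "total": 250},
    {"id": 11, "customer_id": 1, "total": 75},
]

# The "join": resolve each order's foreign key against the customer table.
joined = [{"order": o["id"], "customer": customers[o["customer_id"]]["name"]}
          for o in orders]

# Document style: the aggregate is stored pre-joined, so one key lookup
# retrieves everything -- at the cost of duplicating customer data when it
# appears inside many aggregates, and of keeping those copies consistent.
customer_doc = {
    "name": "Acme",
    "orders": [{"id": 10, "total": 250}, {"id": 11, "total": 75}],
}
```

The document form avoids the join entirely, which is exactly why it is fast for simple retrieval and why keeping the duplicated pieces consistent becomes the application’s problem.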

Inflexible relational-database schemas, compared to the dynamic nature of NoSQL schemas, make the former seem a forlorn option.  Nevertheless, as with all things, the NoSQL database’s dynamic-schema property comes with its own challenges.  Although adding new information on the fly, as well as storing dissimilar data together, are main features, wide-column stores find this challenging.  It is also relatively difficult to change between NoSQL data-store vendors, because the data logic vests in the application code; a lot of integration code is required to bridge the different technical philosophies of the different databases. (Wayner, 2012). “It’s easy to end up with a solution which resembles a relational database to the extent that you begin to wonder – if you should not just use a relational database.” (Saunders, 2010).

Manipulating data in a relational database is done using a structured query language like SQL, building queries to present to the data model for data retrieval.  The NoSQL data store requires that queries come through object-orientated APIs.  A main point of difference is that relational stores can resolve aggregates on the database, while (many) NoSQL databases ship all the data required for the aggregate to the application, pulling it all over the network. (Wayner, 2012).

Query consistency is strong for relational databases, but is very dependent on the particular NoSQL product: some provide strong and others eventual consistency.  Joining data requires consistency, and joins enable more complex database operations.  There are three reasonable approaches to consistency in a replicated-data scenario:

  1. Traditional/ near-perfect consistency: processing stops until an update has propagated to all replicas, typically using two-phased commits.  A single failure can bring part of the system to a halt.
  2. Eventual consistency: inaccurate reads are permissible, as long as the data are eventually synchronised; data accuracy is less than ideal.
  3. Read-your-writes (RYW) consistency: data from any single write is guaranteed to be read accurately, despite minor network outages and node failures.  However, a sequence of errors can produce inaccuracies, something that perfect consistency will not.
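The eventual-consistency case (approach 2) can be made concrete with a toy sketch: the primary acknowledges a write immediately, and a read from a replica is stale until replication catches up.

```python
class Replicas:
    """Toy asynchronous replication: writes land on the primary at once,
    and the replica converges only when sync() runs, so a read from the
    replica in between may be stale -- the eventual-consistency case."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def write(self, key, value):
        self.primary[key] = value          # acknowledged without waiting

    def sync(self):
        self.replica.update(self.primary)  # replication catches up


db = Replicas()
db.write("status", "shipped")
stale = db.replica.get("status")   # replica has not yet synchronised
db.sync()
fresh = db.replica.get("status")   # replicas have now converged
```

Approach 1 would instead block the write call until the replica was updated, and approach 3 would route a client’s reads back to a copy guaranteed to contain its own writes.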

Comparing relational and NoSQL database options pitches both functionality and deployment models against one another.  Most formidable is the deployment ability of NoSQL as a horizontally distributed and partitioned database, something achievable for relational databases only with great difficulty.

Google (Bigtable) and Amazon (Dynamo) were early adopters of large clusters in avoidance of relational databases.  The cloud and the rise of web services make it easier for applications to choose their own data storage.  Large-scale data storage is easily achieved through NoSQL’s clustering ability, which can store data into the petabyte range.  The alternative data models of NoSQL allow efficiency over these large data sets, something not easily achieved in a relational model.

Developing applications on relational models ties the code to the database and makes them interdependent, creating development drag.  Applying an ORM eases the dependency, but still attracts a significant integration effort.  Matching the database technology to the problem domain yields a more optimised solution.

One approach to the problem is to consider best-of-breed databases, arrayed behind a presentation layer for the applications using them.  These applications issue polyglot[11] database scripts against the database array, an approach that can even be introduced on an existing code base.  Simple as this sounds, it comes with the complexity of learning these new mechanisms.

A formidable pain-point for relational databases is schema changes.  The fact that a NoSQL database is schema-less does not make the problem disappear: applications must still manage how the data is structured.

[1] Not only SQL; there is no standard definition of what it really means, but there are a few common characteristics: non-relational, no SQL, designed to run on a cluster, tending to be open-source, and having no fixed schema, allowing data storage in any record (Martin Fowler, Pramod Sadalage, 2012).

[2] Atomicity, Consistency, Isolation and Durability

[3] BASE; the alternative to ACID, Basically Available, Soft State and Eventually Consistent. (Pritchett, 2007)

[4] Representational State Transfer for distributed systems with state management

[5] Object Relational Mapping technologies such as NHibernate and Entity Framework.

[6] Spartan, i.e. the free and deliberate nature of the dynamic schema; a phrase coined by Rob Conery. (Conery, 2010).

[7] SQL Server 2014 (Hekaton), although not a NoSQL database, is also able to provide huge performance optimisations through its in-memory caching of the most-used tables.  Hekaton does not have the capability to scale horizontally, i.e. to partition/ shard.

[8] One of the main motivations to integrate in-memory-based solutions with a NoSQL DB is to reduce the cost per GB of data, as opposed to putting all the data in memory, which can be too costly.

[9] Agile development methodology

[10] Normalisation according to Boyce-Codd is simply splitting data out into various tables, according to a classification, to avoid duplication, i.e. to minimise redundancy and dependency.  Relationships between these new tables establish the links between the data, according to a taxonomy.

[11] Program script written in a valid form of multiple programming languages that performs the same operations or output independent of the programming language used to compile or interpret it. (Polyglot (computing), 2013).  Neal Ford coined the term in 2006. (Selmer, 2012).

NoSQL Database Use-Cases

The question is when, or when not, to use NoSQL, and the answer is: it depends. (Miner, 2012).  It is a truism that we should choose the right tool for the job.  Dividing use-cases into categories groups related situations and discriminates between good and poor cases. (What The Heck Are You Actually Using NoSQL For?, 2012).  No situation is definitive: evaluate each case on its merits, asking which database type and technology best suits the particular solution.

NoSQL achieves two main advantages:

  • Programming Productivity
  • Data Access Performance

Generally, NoSQL databases can effectively accomplish data solutions for:

  • Big data
  • High volume writes
  • Fast key-value access
  • Flexible schema and data-types
  • Schema migration
  • Write availability
  • Ease of maintenance and administrative operations
  • High availability
  • Parallel computing & Distributed support, i.e. horizontal scaling
  • Splitting data into two segments, viz. a high-availability environment and historical reporting data.  The data users need to access, i.e. the high-availability data, is stored in a NoSQL database, e.g. MongoDB[1], while the historical decision-support data is stored in a relational store.  An ORM, e.g. DataMapper, bridges the NoSQL and relational stores. (Conery, 2010).

In a more particular sense, use NoSQL for:

  • Large non-transactional data streams such as application logs or clickstreams, i.e. visitor mouse clicks
  • Synchronising on- and off-line data, where web applications are made available off-line and synchronise with the database only when online again; HTML5, for example, already supports offline storage and SQL. (Finley, 2010).
  • Fast response time despite load
  • Avoid heavy joins
  • Gaming support for real-time systems where low latency is critical
  • Accommodate heterogeneous write, read, query and consistency patterns
  • Load balance data usage concentrations to utilise hardware more efficiently
  • Real-time data processing, insert, update and query
  • Hierarchical data support
  • Dynamic table creation
  • High performance web site caching tiers
  • Real-time page-view counters
  • Document orientated databases with flexible schema and schema changes
  • MapReduce[2] analytics with scaled systems (essentially saying SELECT then GROUP BY)
  • Embedded databases for simple storage
  • In-memory databases requiring high write ability, e.g. chat programs to display user active time values
  • Balance high-frequency stream data processing with simultaneous lower-frequency multi partition queries using materialised views
  • Calculations on cached data without an ORM
  • Intersection computing of large data sets where a join would be too slow
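The MapReduce point above (“essentially saying SELECT then GROUP BY”) can be sketched in a few lines; the page-view data here is invented for illustration. The map step emits key-value pairs and the reduce step folds each group, exactly as `SELECT page, SUM(n) GROUP BY page` would:

```python
from collections import defaultdict

# Hypothetical clickstream events: (page, count) pairs.
events = [("page_a", 1), ("page_b", 1), ("page_a", 1)]

# Map: emit (key, value) pairs - here the input is already in that shape.
mapped = [(page, count) for page, count in events]

# Shuffle/group: collect the values per key, as GROUP BY would.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: fold each group - here a SUM().
page_views = {key: sum(values) for key, values in groups.items()}
```

In a real cluster the map and reduce steps run in parallel across nodes; the logic is the same.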

Most of the time, do not use NoSQL databases, and specifically not in cases where the following apply:

  • Online Transactional Processing (OLTP): complex multi-object transactions are generally (bar VoltDB[3]) not supported
  • Transactional consistency is required over performance.  NoSQL relies on applications for data integrity, and relational database systems (being declarative) are still the best at this
  • Many different applications (with different owners) access data
  • High level of security required at database level to protect data
  • Data independence because the application drives everything about the data and applications retire but not the data
  • SQL is a requirement; few NoSQL systems supply a SQL interface
  • Complex Relationships, are much better in relational systems or Graph databases
  • Maturity and Stability.  NoSQL is emergent, and people know relational systems better and there are more tools available for them

1.1. Programming Productivity

NoSQL databases perform well when displaying or collecting data in terms of aggregates, where data includes complex, nested or hierarchical structures with many relationships (graphs), or where the data is non-uniform.

1.2. Data Access Performance

NoSQL suits spreading (partitioning) data across many database servers to leverage performance, e.g. fragmented (partitioned) and replicated; likewise where aggregate data would need joining from multiple tables in a relational store, or where complex relational data must be queried.  “It is true that joins and complex queries have disappeared in Document databases, but relationships between objects in the real world have not, and based on how we need to work with this information, different techniques can and should be applied” (Hariri, 2010).

Two database technologies in the NoSQL repertoire buck the norm, viz. VoltDB (which, although not strictly NoSQL, deviates much from the traditional design pattern) and Redis, sponsored by VMware. (What The Heck Are You Actually Using NoSQL For?, 2012).

VoltDB[4] is specifically suited for application in:

  • Financial trade monitoring
  • Package tracking
  • Ticket reservations
  • Telephone exchange call detail records

Redis (VMware), being a data structure server, is suited for:

  • Circular log buffers
  • Real-time statistics having fast atomically incremented counters
  • Partitioning repositories with the ability to identify which partition is used for a specific user’s data by polling the database every few seconds, e.g. GitHub implementation
  • Transient data
  • Easy setup and fast performance
  • Sharing state between processes, e.g. loading and reading data at the same time, on the same set
  • Capped log implementation, adding data to the log while trimming older data off
  • Dynamic schemas
  • Reduce impedance mismatch, e.g. the data model in the database can more closely match the data model in the application
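The capped-log use-case above, which Redis implements with an LPUSH followed by an LTRIM, can be approximated in plain Python with a bounded deque (the log entries here are invented):

```python
from collections import deque

# A capped log: new entries are appended, the oldest fall off automatically.
# Redis achieves the same with LPUSH mylog <entry> followed by LTRIM mylog 0 2.
log = deque(maxlen=3)

for entry in ["boot", "login", "query", "logout"]:
    log.append(entry)

# Only the three most recent entries survive; "boot" has been trimmed off.
```

The appeal of the Redis version is that the append-and-trim pair is cheap and the trimming is amortised, so the log never grows unboundedly.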

[1] Using MongoDB for the purpose of speed requires that the driver between the application and the database also serve this purpose.  For example, translating between the application and database by encapsulating data in easy-to-use dictionaries, although very useful in presentation, wastes time.  Using a connector such as Monary, which relies on NumPy arrays instead of dictionaries, speeds up the transfer by a factor of 7; this applies even to relational databases. (Farrugia, 2012).

[2] MapReduce = a mapping function that maps input data into its final form, executable in parallel on each system; the reduce function operates on the results of the mapping, executed repeatedly until results are obtained. (Nurse, 2010).  Pioneered by Google, it is a design pattern. (Pozzani, 2013).

[3] VoltDB is considered part of the NewSQL (as opposed to NoSQL) family because it represents a modern version of the relational database that provides scalable performance like that of NoSQL, but for OLTP workloads while maintaining ACID guarantees. (NewSQL, 2013).

[4] If you can get a relational database with all the scalable ACIDy goodness, why would you ever stoop to using a NoSQL database that might only ever be eventually reliable? VoltDB is not just competing against NoSQL, it is aiming squarely at existing relational database vendors by using the patented technology leap play. (Hoff, 2010).  VoltDB is the primary database in the category In-memory databases under the heading Distribute query to fragments to data nodes. (NewSQL, 2013).

Types of NoSQL Databases

NoSQL databases are categorised according to their data model, as listed in Table 1 below.

Table 1 NoSQL Database Categories (Pramodkumar J. Sadalage, 2012)

Data Model      Example Databases
Key-value       BerkeleyDB, Project Voldemort
Document        CouchDB
Column-Family   Amazon SimpleDB
Graph           FlockDB, Infinite Graph
Note that the lines between key-value pair and document databases are often blurry.

1.1. Key-Value Databases

These databases use a simple hash table to store key-value pairs.  Access to records is by ID (key), akin to the primary key of a relational data store, and the key can be of any data type.  The operations are to get the value associated with a key, to set (append/ overwrite) a value for a given key, or to delete a key and its associated value.  Use these databases only for transient data, caching, or data accessed via a unique key, such as session data, user profiles, or shopping carts.
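The get/set/delete operations described above can be sketched as a minimal in-process key-value store (a plain hash table; the class and session key below are hypothetical):

```python
class KeyValueStore:
    """Minimal key-value store: a hash table accessed only by key."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Append/overwrite the value for the given key.
        self._data[key] = value

    def get(self, key):
        # Retrieve the value associated with the key (None if absent).
        return self._data.get(key)

    def delete(self, key):
        # Remove the key and its associated value, if present.
        self._data.pop(key, None)

store = KeyValueStore()
store.set("session:42", {"user": "alice", "cart": ["book"]})
```

Note that the value is opaque to the store: it can be a nested structure, but it can only ever be fetched whole, by its key.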

Consistency applies in the context of a single key-value pair; distributed key-value pairs need a specific strategy, for example letting the newest write win, or reporting all conflicting values to the client for conflict resolution.

Full ACID is not available because of the distributed cluster; consistency is asserted by a quorum of nodes allowing a write.  The Riak solution (eventually consistent) allows control over aspects of the CAP[2] theorem (Figure 1 on page 15). (RNDr. Irena Holubova, 2012).  Consistency is the C in ACID and CAP, or the concept of atomicity.  Availability indicates that the service is available and operates fully.  Partition tolerance means the service acts as an atomic processor operating across nodes of data, with communication consistent between nodes.  A drop in communication potentially creates a partition (wall or division) between nodes, preventing communication.  The system needs to tolerate this, i.e. not cause a problem, by having data replicas available to serve in place of the failed node.  The definition of partition tolerance is as follows: “No set of failures less than total network failure is allowed to cause the systems to respond incorrectly” (Gilbert & Lynch, 2002).
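The quorum mechanism mentioned above reduces to simple arithmetic: with N replicas, a write quorum W and a read quorum R, every read set must overlap every write set whenever W + R > N, so reads cannot miss the latest acknowledged write. The helper below is a hypothetical illustration, not any particular product's API:

```python
def quorum_consistent(n_replicas, write_quorum, read_quorum):
    """True when every read set must overlap every write set,
    i.e. a read cannot miss the most recent acknowledged write."""
    return write_quorum + read_quorum > n_replicas

# With N=3, W=2 and R=2 the sets must overlap, so reads see the latest write;
# W=1 and R=1 on three replicas do not overlap, giving only eventual consistency.
```

Stores in the Dynamo/Riak family expose N, W and R as tunables, letting the application pick its own point on this trade-off per request.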


Figure 1 CAP Theorem

Fragmentation (partitioning[3]) and replication techniques are used to scale: use the keys to partition the database, and keep replicas available to take over in case of failure.  Applications query only by key, not by value, so the key’s design must be unique across the database.

The application level manages how and where data gets stored.  The value of the key determines on which node the key is stored.
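In the simplest form, the key-to-node placement described above is a hash of the key modulo the node count. The sketch below uses invented node names; real stores typically use consistent hashing instead, so that adding a node does not remap every key:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for_key(key):
    # Hash the key deterministically and map it onto one of the nodes.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# The same key always lands on the same node, so lookups need no directory.
```

Because placement is a pure function of the key, any client can route a request directly to the right node.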

  • Single key for related data structures
      ◦ Key holds identification data, for example a user’s session ID
      ◦ Can include nested data structures
      ◦ Data is set and retrieved once
  • Multiple keys for related data structures
      ◦ Key incorporates the name of the stored object
      ◦ Multiple targeted fetches needed to retrieve related data
      ◦ Decreased chance of key conflicts
  • Bucket stores different kinds of aggregates
      ◦ Increases chance of key conflicts

Do not use a key-value database for managing relationships between sets of data or to correlate data between sets of keys; some solutions provide link walking, but that is unusual.  Avoid situations that employ multi-operation transactions, e.g. saving multiple keys, because should one fail, a rollback is not possible.  Searching keys based on an element in the value part of the records is also not a feasible use-case for this database type, because operations are limited to one key at a time, with no possibility of operating on multiple keys. (RNDr. Irena Holubova, 2012).

According to DB-Engines.com, Redis (VMware) is the most popular in-memory key-value store, yielding no notable speed difference between write and read operations (Redis, 2013).

1.2. Document Databases

These NoSQL database types store documents, with keys to access them.  They are similar to key-value databases in many respects, but the values are hierarchical tree data structures consisting of maps, collections, scalar values, or nested documents, and the structure of a document can be manipulated dynamically and inspected.  Documents are often stored in JSON/ XML/ YAML/ BSON/ PDF/ MS-Word, Excel etc. format, and each document can have its own structure, i.e. the collection is non-uniform.  The dataset assigns an automatic ID value to each document.  Some databases offer MapReduce techniques to retrieve information based on the document’s content. (NoSQL, 2013).  MapReduce is much the same as saying SELECT and then GROUP BY, in a way that is entirely confusing to SQL users. (Peschka, 2010).
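The non-uniform structure described above can be sketched with plain dictionaries standing in for JSON documents (the articles and fields below are invented): two documents in the same collection carry different fields, and a query simply filters on whatever fields are present:

```python
# Two documents in the same collection with different structures -
# no shared schema is required.
articles = [
    {"_id": 1, "title": "NoSQL", "tags": ["db", "nosql"]},
    {"_id": 2, "title": "CAP", "author": {"name": "Eric"}, "draft": True},
]

def find(collection, field, value):
    # Match only documents that actually carry the field.
    return [doc for doc in collection if doc.get(field) == value]

drafts = find(articles, "draft", True)
```

A document store such as MongoDB or CouchDB behaves the same way conceptually: documents lacking the queried field simply do not match.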

Consistency and transactions apply to single documents, with replication and partitioning done per document.  Partial documents can be returned by using JSON queries on the document store.

Event logging with various attributes, content management, blogging platforms, web analytics and ecommerce applications are all use-cases for document databases.  However, transactional management across multiple documents (records) and ad-hoc cross-document queries are not advisable use-cases.

In column-family stores, records exist by indexed key, with column groups or families akin to the tables of a relational store.  Super columns, and super columns grouped into super-column families, group related columns.  A key-space (top-level container) plays the role of a schema (in the relational sense), by which configuration parameters and operations, e.g. replicas and data-repair operations, are applied.  Creating the key-space involves specifying columns, and new columns can be added throughout, to the rows to which they are applicable.
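The layout just described (key-space → row key → column family → columns) can be pictured as nested maps; the names below are illustrative, not any vendor's API:

```python
# key-space        row key        column family   columns
keyspace = {
    "user:1001": {
        "profile":  {"name": "Alice", "city": "Boulder"},
        "activity": {"last_login": "2013-07-01"},
    },
    "user:1002": {
        # Rows need not share columns: new columns are added per row as needed.
        "profile": {"name": "Bob", "twitter": "@bob"},
    },
}

name = keyspace["user:1001"]["profile"]["name"]
```

The important property is visible in the second row: columns are defined per row, so adding "twitter" to one user never touches the others.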

Data access is achieved through get, set and delete operations, potentially using the database’s query language.  Transactions are atomic at the single-document level; a transaction involving more than one operation is generally not possible, although RavenDB[1] supports transactions across multiple operations.

Use a document database for event logging, content management and blogging forums, counters, web or real-time analytics[4], expiring data, or ecommerce applications.  Do not use one where ACID is required, for ad-hoc aggregate queries, where complex transactions require atomic cross-document operations (except where the database technology explicitly supports this), or where queries run against a varying aggregate structure.


[1] RavenDB is built in .Net with LINQ, uses JSON, Full-text search (Lucene) with two versions, viz. Server and Embedded.

[2] CAP (Brewer’s Theorem): Consistency, Availability, and Partition Tolerance. (Browne, 2009)

[3] Known colloquially as sharding (a term coined by Google engineers), this is partitioning a database over a cluster of servers, also known as horizontal partitioning.  Standard industry terminology (losing the perceived “cool factor” adopted to sound relevant and hip) speaks about a partition or partitioning.  “Somehow, somewhere somebody decided that what they were doing was so cool that they had to make up a new term for what people have been doing for many many years. It is partitioning” (Schlossnagle, 2007).

[4] Streaming analytics taking advantage of time and time aggregations. (Craig, 2012).

Works Cited

Analyst, M. (2010, 02 15). Modern Analyst. Retrieved 02 15, 2010, from Modern Analyst: http://www.modernanalyst.com/

Ayushman Jain. (2013, March 27). Using Cassandra for Real-time Analytics: Part 1. Retrieved from Planet Cassandra: http://www.planetcassandra.org/blog/category/brewer’s%20theorem

Browne, J. (2009, January 11). Brewer’s CAP Theorem. Retrieved from JulianBrowne.com: http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

CAP Theorem. (2013, June 28). Retrieved from WikiPedia: http://en.wikipedia.org/wiki/CAP_theorem

Conery, R. (2010, May 19). NoSQL in the Wild. Retrieved from WekeRoad: http://blog.wekeroad.com/2010/05/19/no-sql-in-the-wild

Craig, T. (2012, December 12). AnalyticsPBI for Azure: Turning Real-Time Signals into Real-Time Analytics. Retrieved from Big Data Big Analytics: http://blog.patternbuilders.com/tag/nosql/

Eng, I. (2013, May 15). SQL versus NoSQL – which is better for your business data? Retrieved from Izenda Blog: http://blog.izenda.com/sql-versus-nosql-which-is-better-for-your-business-data/

Farrugia, C. (2012, November 05). MongoDB Performance Pitfalls – Behind The Scenes. Retrieved from Trackerbird: http://blog.trackerbird.com/content/mongodb-performance-pitfalls-behind-the-scenes/

Finley, K. (2010, July 13). NoSQL Database CouchDB Turns 1.0. Retrieved from ReadWrite: http://readwrite.com/2010/07/13/nosql-database-couchdb#awesm=~obbI8hQ9w7piJm

Fowler, M. (2013, July 05). Application Database. Retrieved from Martin Fowler: http://martinfowler.com/bliki/ApplicationDatabase.html

Fowler, M. (2013, July 05). Integration Database. Retrieved from Martin Fowler: http://martinfowler.com/bliki/IntegrationDatabase.html

Fowleruk, A. (2013, January 11). NoSQL, huh, what is it good for?…. Retrieved from Adam’s Big Data Discoveries: http://adamfowlerml.wordpress.com/2013/01/11/nosql-huh-what-is-it-good-for/

Gilbert, S., & Lynch, N. (2002). Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-tolerant Web Services. Boston: http://lpd.epfl.ch/. Retrieved from Public Library Boulder IBM: http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf

Hariri, H. (2010, November 11). Is NoSQL for me? I’m just a small fish. Retrieved from Devlicio.us: http://devlicio.us/blogs/hadi_hariri/archive/2010/11/24/is-nosql-for-me-i-m-just-a-small-fish.aspx

Hoff, T. (2010, June 28). VoltDB Decapitates Six SQL Urban Myths And Delivers Internet Scale OLTP In The Process. Retrieved from High Scalability: http://highscalability.com/blog/2010/6/28/voltdb-decapitates-six-sql-urban-myths-and-delivers-internet.html

LIST OF NOSQL DATABASES. (2013, 06 04). Retrieved from NoSQL: http://nosql-database.org/

Martin Fowler, Pramod Sadalage. (2012). NoSQL Introduction. Boston: Thoughtworks.

Menegaz, G. (2012, October 01). What is NoSQL, and why do you need it? Retrieved from ZDNET: http://www.zdnet.com/what-is-nosql-and-why-do-you-need-it-7000004989/

Miner, S. (2012). Using NoSQL. Boston: Gordon College.

NewSQL. (2013, July 01). Retrieved from WikiPedia: http://en.wikipedia.org/wiki/NewSQL

NoSQL. (2013, July 03). Retrieved from WikiPedia: https://en.wikipedia.org/wiki/NoSQL

Nurse, C. (2010, January 02). Look Mom – NoSQL. Retrieved from Slideshare: http://es.slideshare.net/charlesnurse/look-mom-nosql

Peschka, J. (2010, September 16). Five Reasons to Use NoSQL. Retrieved from Facility9: http://facility9.com/2010/09/16/five-reasons-to-use-nosql/

Polyglot (computing). (2013, March 12). Retrieved from WikiPedia: https://en.wikipedia.org/wiki/Polyglot_(computing)

Polyglot Persistence. (2011, November 16). Retrieved from Martin Fowler: http://martinfowler.com/bliki/PolyglotPersistence.html

Pozzani, G. (2013). Introduction to NoSQL. Rome: http://profs.sci.univr.it/. Retrieved from http://profs.sci.univr.it/~pozzani/Materiale/nosql/01%20-%20introduction.pdf

Pramodkumar J. Sadalage, M. J. (2012). NoSQL Distilled. Boston: Addison Wesley Professional. Retrieved July 05, 2013, from http://books.google.co.za/books/about/NoSQL_Distilled.html?id=nG32ygAACAAJ&redir_esc=y

Pritchett, D. (2007, May 02). The Challenges of Latency. Retrieved from InfoQ: http://www.infoq.com/articles/pritchett-latency

Redis. (2013, July 01). Retrieved from WikiPedia: http://en.wikipedia.org/wiki/Redis

RNDr. Irena Holubova, P. (2012). NoSQL Databases Classification. Boston: http://www.ksi.mff.cuni.cz/.

Saunders, N. (2010, August 05). The “NoSQL” approach: struggling to see the benefits. Retrieved from What You’re Doing Is Rather Desperate: http://nsaunders.wordpress.com/2010/08/05/the-nosql-approach-struggling-to-see-the-benefits/

Schlossnagle, T. (2007, September 08). Partitioning vs. Federation vs. Sharding. Retrieved from Esoteric Curio: http://lethargy.org/~jesus/writes/partitioning-vs.-federation-vs.-sharding

Selmer, P. (2012). NoSQL Stores and Data Analytics Tools. London: BirkbeckCollege: UCL Institute of Neurology. Retrieved from http://www.dcs.bbk.ac.uk/~ap/teaching/ADM2012/NOSQL.pdf

Wayner, P. (2012, July 07). 7 Hard Truths about the NoSQL Revolution. Retrieved from Infoworld: http://www.infoworld.com/d/data-management/7-hard-truths-about-the-nosql-revolution-197493?page=0,1

What is NoSQL. (2013, July 05). Retrieved from 10gen The MongoDB Company: http://www.10gen.com/nosql

What The Heck Are You Actually Using NoSQL For? (2012, December 06). Retrieved from High Scalability: http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html

Data/ Application Integration: Data Virtualization


Where an application enterprise landscape, viewed from a Data Integration perspective, comprises a classic point-to-point integration model, it “does not represent an effective application integration solution.” (Linthicum, 2004).  An alternative, referred to as a Centralised Architecture and known as Data Virtualisation, aims to supplant the point-to-point integration architecture and remedy its woes by providing a single access point to data for applications.

Business agility within an Information Technology-enabled enterprise is entwined with its expediter, the Information Technology department.  Where Information Technology cannot perform, business suffers; where Information Technology is non-optimal, business loses revenue.  The goal to meet is the total quality management concept of W. Edwards Deming, i.e. to reduce waste through the optimisation of production.

The reduction of waste is achieved by implementing and running lean systems.  These systems should use enterprise and industry-standard practices and technologies to optimise the enterprise architecture.  For Data Integration, this concept is known as Data Virtualisation.  The expressed enterprise solution for Data Integration is the Data Delivery Platform, as described by Rick L. van der Lans in his book “Data Virtualization for Business Intelligence Systems”.  The main goal of the Data Delivery Platform is to offer a single point of data access for all applications.  Being constrained to the single-access concept, all services, technologies, modelling and profiling of data are controlled and made available through one edifice.  This consolidates the business logic and yields many beneficial features, such as Data Quality, Master Data Management, Metadata Management, and logical enterprise data consolidation, over and above the benefits of the single access point.  Implementing Data Virtualisation does not imply that all systems must be moved away from the current integration architecture at once; the organisation can do this over a period that allows for comfortable and sustainable change.  The organisation may take this opportunity to define its data landscape more thoroughly, learning along with the change.

Through Data Virtualisation, Application Integration is facilitated by its Data Integration as a logical subset.  The Enterprise Architecture is optimised by reducing integration points, potentially consolidating infrastructure to cut its footprint and save on spend.  As the enterprise identifies business-optimisation opportunities from the insights the Data Virtualisation benefits yield, physical data consolidation may follow.  Further optimisations may thus produce extra savings, not only from an analysis-impact perspective but again in infrastructure: specific data characteristics can be more adequately matched to hardware and disk formats, providing further performance enhancements for reporting requirements.

The core performance benefit of Data Virtualisation lies in the technology.  It can be augmented with an in-memory data store that serves as a data cache.  Along with the intrinsic ability of Data Virtualisation software, i.e. its own query engine by which to optimise queries over heterogeneous data sources, the in-memory structure may serve the query engine with reference data, cached transactional data, or query-enhancing cached results to aid a specific report or query.
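The caching idea can be sketched as a memoising wrapper in front of the virtualisation layer's query engine (all names below are hypothetical): repeated queries are served from the in-memory store instead of hitting the underlying sources again:

```python
cache = {}
source_hits = {"count": 0}

def run_query(sql):
    # Stand-in for the virtualisation layer dispatching to the real data sources.
    source_hits["count"] += 1
    return f"results for: {sql}"

def cached_query(sql):
    # Serve repeated queries from the in-memory cache.
    if sql not in cache:
        cache[sql] = run_query(sql)
    return cache[sql]

cached_query("SELECT * FROM sales")
cached_query("SELECT * FROM sales")   # the second call never touches the sources
```

A production cache would additionally need invalidation or a time-to-live, since the underlying sources keep changing; the sketch shows only the routing decision.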

In summary, without a Data Virtualisation technology, a company’s systems are left to deal with ever-increasing point-to-point integration complexity as they grow, making change difficult to assess and hard to sustain for lack of agility.  Information Technology is left with protracted assessments when change is required and complex implementations of that change, with the ever-present chance of interfering with the functionality of other, non-change-related systems.  Large point-to-point integrated systems also need a huge human-resources contingent and infrastructure footprint, increasing expenditure and dependency on non-essential technologies.

This article looks at Data Virtualisation, more specifically the Data Delivery Platform, as an Enterprise Architecture.

An Overview of the Data Delivery Platform

An Enterprise Data Delivery Platform (DDP) is to be understood as defined by Rick L. van der Lans in his book “Data Virtualization for Business Intelligence Systems”, i.e. facilitating a single access point to data for real-time, ad-hoc, and Service-Orientated (message-based) architectures.

He defines the Data Delivery Platform as follows: “The Data Delivery Platform is a business intelligence architecture that delivers data and metadata to data consumers in support of decision-making, reporting, and data retrieval, whereby data and metadata stores are decoupled from the data consumers through a metadata-driven layer to increase flexibility, and whereby data and metadata are presented in a subject-orientated, integrated, time-variant, and reproducible style.” (van der Lans R. L., 2012).

“A Data Delivery Platform must give data to a broad range of business applications, business intelligence tools, middleware such as an Enterprise Service Bus, or Business Process Management (BPM), using the access methods, formats and protocols required by these diverse consumers.” (Eve, Robert, 2010). 

We should therefore embrace Data Virtualisation and extend the context to include real-time transactions and message-based data entities, such as inter-system message-formatted data or SOA Enterprise Service Bus functions.  This architecture, call it the Data Delivery Platform, attempts to be a total solution for data delivery in the enterprise, for decision support (data marts or an enterprise data warehouse) and historical transactional records.

Data Virtualisation is the process of abstracting the data sources of disparate systems through a unified data access layer.  Access to data is achieved by connecting to a single source, facilitating data abstraction, i.e. presenting only relevant data (“Rule 9: Logical Data Independence”), and encapsulation, i.e. hiding the access implementation. (Harrington, 2009).  A great advantage that Data Virtualisation provides over hub-and-spoke message-bus systems is that it is much more resilient in disaster, i.e. point-failure, situations.  Data Integration, viewed in this context, is a subset of Application Integration, i.e. allowing applications, contemporary and legacy, to communicate with each other.  Data Integration is, however, larger in scope than Application Integration, since it also addresses the integration of data into structures such as data warehouses for the purpose of business intelligence.

In a current point-to-point Information Technology topology, the DDP (as a single source) replaces the “spaghetti” integration routes between applications and data stores.  To a degree, the business logic of modelling data via queries (for extraction into applications), as well as the construction of message-based datagrams to legacy, e.g. mainframe, systems, can be facilitated by the DDP.  In a point-to-point architecture this logic is spread over application and database domains, with business logic vested in application code, database objects, and integration servers, e.g. BizTalk.  When using BizTalk for SOA message exchanges, messages are sent between systems through the integrating BizTalk server, which acts as an Enterprise Service Bus (ESB) service.  The ESB idea, taken up together with Data Virtualisation, collectively becomes a Data Delivery Platform for the enterprise.

Data Virtualisation


Data Virtualisation, as a concept, is commonly used with Data Integration, Business Intelligence, Service-Orientated Architecture Data Services, Cloud Computing, Enterprise Search, and Master Data Management.  “It is both a Data Integration approach and a technology.  It is used as middleware that leverages high-performance software and advanced computing architecture to integrate and deliver data from multiple, disparate sources in a loosely coupled, logically federated way.” (Eve, 2011).

Data Virtualisation has its roots in Business Intelligence.  The idea comes from the mid-1990s, when it was known as “virtual data warehousing and marketed as an alternative to traditional data warehouses.” (Brunelli, 2013).  Data Virtualisation is foundational to Data Integration; it enables fast and direct access to the critical data and reports required and trusted by the business.  It is not to be confused with simple, traditional data federation.  Instead, it is a superset, which must complement existing data architectures to support Business Intelligence Agility (BIA), Master Data Management (MDM) and a Service-Orientated Architecture (SOA).  It handles the complexity involved in integrating diverse and distributed data.  As Forrester Research states in its report: “Data provided through the data services layer is updated, transformed, and/or cleansed when (or before) applications access it. . . Data virtualization solutions provide a virtualized data services layer that integrates data from heterogeneous data sources and content in real-time, near real-time, or batch as needed to support a wide range of applications and processes.” (Yuhanna & Gilpen, 2012).

Unfortunately, any reference to Business Intelligence conjures up images of costly and mostly failed data warehouse implementations.  To avoid any undue associations, it is important to state distinctly that this article refers to Business Intelligence as any form of reporting created from raw business or transactional data.  This paper aims to explain the concepts and benefits of Data Virtualisation by describing it within a current point-to-point architecture, and in relation to the concept of a Data Delivery Platform.


A fundamental principle of computer science lies at the root of data virtualization: “Applications should be independent of the complexity of accessing data.  They should not have to bother about where data is located, how it is accessed, or what its native format is. . . The broader context of data virtualization emphasises the abstraction and decoupling principle of E.F. Codd between all data sources, all consumers, all data delivery styles, and all data management functions.” (van der Lans R. L., 2012).

The main goal of Data Virtualisation is to achieve Business Agility. (Davis & Eve, 2011). According to The Data Warehouse Institute (TDWI), 34% of business users are dissatisfied with IT delivery of Business Intelligence capabilities.  Van der Lans lists the TDWI’s five factors driving self-service BI as follows:

  • Constantly changing business needs: 65%
  • IT’s inability to satisfy new requests on time: 57%
  • The need to be a more analytics-driven organization: 54%
  • Slow and untimely access to information: 47%
  • Business user dissatisfaction with IT-driven BI capabilities: 34%

According to the Aberdeen Group’s report, “Three Steps to Analytic Heaven (2011)”, 43% of respondents indicate that their “time window for decisions [is] increasingly compressed.” (Aberdeen Group, 2011).  Achieving agility in business, i.e. enabling the business through its information technology, is a major hurdle to overcome, and a common, enterprise data integration layer realised through Data Virtualisation “can become the heart of most systems.” (White, 2011).  Achieving business agility, against the backdrop of a growing volume of data sources and source data, is the foremost pressure demanding a more efficient approach to data utilisation for business intelligence purposes; the Aberdeen Group reports this as a significant pressure, cited by 49% of respondents. (Aberdeen Group, 2011).

Data Virtualisation provides the ability to standardise data, through layers of abstraction, into a unified logical representation: a semantically structured, consistent view offered to all consumers throughout the enterprise. (Loshin, 2010).  For this reason, it makes real-time or near-real-time analytics possible, because the Data Virtualisation layer provides access to data without creating new copies of the information, eliminating the need to replicate it and thereby introduce data flaws.  One of the differences between data federation and data virtualisation is the latter’s ability to write data back to source.  Unlike some Enterprise Data Integration technologies, which rely on read-only optimisations, a data virtualisation layer provides write-backs.

“One of the less obvious advantages of database virtualization is a reduction in the read I/O issued against the underlying physical storage (Storage Area Networks) that ultimately stores the data for virtual databases.” (Hayward, 2013).  Also, because the original data is not moved, any data quality corrections can be made in the Data Virtualisation layer without disturbing the source.  Such a layer also frees the architecture from the legacy mainframe concept of overnight batch processing, by mapping through this layer to the original data artefacts.  The idea of Data Virtualisation, building on the definition of Big Data, is to bring data querying more into step with the characteristics of Internet search engines.  By optimising the search algorithms, the source data’s format becomes less relevant, and so, therefore, does the structure of a relational database management system.  That structure is primarily required to allow Structured Query Language (SQL) to extract specific data records; but as authority passes to intelligent search engines, the databases (data sources) are relieved of needing any particular structure at all.  The Data Virtualisation technology becomes the single point of structure and, through abstraction, hides whatever the data structures are, and even, through encapsulation, the implementation details of the objects.  The query-ability of the data repositories is consequently no longer vested in their structure and in SQL conformance for extracting the data, but rests in the single point of query, i.e. the Data Virtualisation layer, which uses the data model and the search algorithms to get data from relational and non-relational stores seamlessly.
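As a minimal illustration of this single point of query, the sketch below (in Python, with an in-memory SQLite table standing in for a relational store and a list of documents standing in for a non-relational one; all names are invented for illustration) serves both sources through one virtual interface, so consumers never see the underlying structures:

```python
import sqlite3

# Relational source: customer records held in a SQL table.
sql_db = sqlite3.connect(":memory:")
sql_db.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
sql_db.execute("INSERT INTO customer VALUES (1, 'Acme Ltd')")

# Non-relational source: customer records held as plain documents.
doc_store = [{"id": 2, "name": "Zenith plc"}]

class VirtualLayer:
    """Single point of query; consumers never see the source structures."""
    def customers(self):
        rows = [{"id": i, "name": n}
                for i, n in sql_db.execute("SELECT id, name FROM customer")]
        return rows + doc_store

layer = VirtualLayer()
print(sorted(c["name"] for c in layer.customers()))
```

A consumer calling `layer.customers()` receives one uniform list of entities, regardless of whether a record originated in the SQL table or the document store.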

Optimisation does not have to end with the Data Virtualisation layer, but can also spill over to the physical data stores.  Because Data Virtualisation also provides the ability to write back, that is, to change the original data, optimisation should not stop within the logical data model of the Virtual Database.  Data may also be consolidated physically.  Consolidation is a boon flowing from Data Virtualisation’s ability to give insight into enterprise data, teaching the business and IT where optimisations may be effected.

Data consolidation is a major undertaking at the physical data store level.  Data Virtualisation provides an advantage for analysing data because of its consolidation ability, and consequently supports strategic motions and decisions.  Data cleansing and enrichment are achieved first through the Data Virtualisation layer, in that one consistent version is made available to all consumers.  The underlying data could be semantically inconsistent, as a result of different business units’ differing interpretations of the same business data concepts, such as a customer.  Resolving such inconsistencies is a by-product of the Data Virtualisation layer.  The canonical result is a conformed representation derived from the many attributes of the concept, out of all the data sources, grouped into semantically similar entities.
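As an illustration of such conformance, the sketch below (all attribute names invented) renames each source’s attributes into a canonical vocabulary and merges them into one conformed customer entity:

```python
# Two business units describe the same customer concept differently.
crm_record = {"cust_id": 7, "cust_name": "N. Mandela", "tel": "021-555-0101"}
billing_record = {"id": 7, "fullName": "N. Mandela", "creditLimit": 50000}

# Mapping from each source's attribute names to the canonical model.
MAPPINGS = {
    "crm":     {"cust_id": "id", "cust_name": "name", "tel": "phone"},
    "billing": {"id": "id", "fullName": "name", "creditLimit": "credit_limit"},
}

def conform(source, record):
    """Rename a source's attributes into the canonical vocabulary."""
    return {MAPPINGS[source][k]: v for k, v in record.items()}

def canonical_customer(*conformed):
    """Merge semantically similar entities into one conformed view."""
    merged = {}
    for rec in conformed:
        merged.update(rec)
    return merged

customer = canonical_customer(conform("crm", crm_record),
                              conform("billing", billing_record))
print(customer)
```

All consumers then see a single conformed record, while the inconsistent source spellings of the concept remain untouched at source.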

“Enterprise data quality can be viewed as a Gordian knot and data virtualization provides a fundamental approach to slice through this barrier to consistency and finesse the organizational challenges associated with data quality improvement.  Any data quality team faced with the challenges discussed in this paper would benefit from considering data virtualization as an approach to implementing data quality control and consistency.”  (Loshin, 2010).  But, “To get data virtualization right, it requires a deep and thorough understanding of very complex problems that exist in the data integration domain.” (van der Lans R. L., 2012).  Consider the architectural diagram: Enterprise Application Integration with Data Delivery Platform, depicting a physical data consolidation for transactional, historical, and summarised (market and enterprise) data, with data marts and enterprise data warehouse.

In summary, and in accordance with a broad view of Data Virtualisation, it combines heterogeneous sources of data, through a logical model, into a single point of access.  This layer has a universal view of the data, which gives it the capability to consolidate objective business views of data entities.  Mechanisms such as meta- and master data management allow enterprise data consolidation practices to profile enterprise data to the extent that feasible changes to the enterprise business taxonomy become possible.  Where data is frequently exported to other systems, and source data quality problems require a retro-fix, the re-introduction of the repaired data introduces new issues for the downstream systems, such as changes made independently of meta- and master data definitions.  Such problems are manifold in silo architectures and increase operational costs, because each time data is extracted and copied, a new opportunity for data flaws is introduced.

Business Drivers and Success Factors

The Enterprise Data Delivery Platform is a move towards agile enterprise business analytics.  To achieve this, all enterprise data must be provisioned to the business as information from which to derive decisions, in the quickest and most cost-efficient way possible.

“The long [Enterprise Data Warehouse] EDW timescale is a concern for many organizations.  Businesses want to improve their decision-making today, not in two years’ time.  Often, too, individual decision-makers want to see their reports and dashboards immediately, without waiting for the EDW process to complete, and so interim solutions are created.  However, the underlying business intelligence requirements change frequently—66% of them at least monthly, according to Forrester.  The likelihood is therefore that by the time the EDW is complete, the business’s requirements will have moved on, and any analytics that have already been developed in early iterations will need to be reworked.  The traditional way to report rapidly on data within transaction systems is either to download it to Microsoft Excel (with all the inherent problems of uncontrolled data and calculation errors), or else to build the reports directly on the source applications.” (Capgemini, 2013).

The challenges in Information Management, in provisioning data for decision analytics, are twofold: first, the frequently changing nature of requirements; and secondly, the sluggish response of information technology in offering computerised solutions for those requirements in reasonable time.  This is the case mainly because of the impact on data and application integration.  The problem, therefore, is vested in the interrelated integration points between applications and their data.  Data Virtualisation, in support of real-time and message-based data exchanges, supplants the traditional point-to-point integration with a single data access façade, called the Data Delivery Platform.  This platform is the concept and technology to vet and prove for implementation, so as to resolve those aspects that impede the business in achieving timeous technology solutions to changing business requirements.

Beyond business decision-making agility, there is also the cost-based nature of an Information Technology department.  The only way to optimise it is to make sure that it runs lean, which can only be achieved through performance optimisation and the reduction of infrastructure.

Achieving economies of scale through a Data Virtualisation optimisation requires that it be justified, not only on cost, but on the business need and on its ability to deliver what it is punted as being capable of.

Business Drivers for Data Virtualisation in the Organisation

The business must derive some benefit from Data Virtualisation to justify its cost and effort over the alternative.  A few drivers for data integration through Data Virtualisation are:

  • Pressure from business for quick, cost-effective reporting solutions
  • Need to build real-time reporting and analytics and self-service Business Intelligence
  • Overcome departmental politics to allow applications to share crucial information
  • Require an Enterprise Service Bus (ESB) to exchange data between applications
  • Integration and combination of multiple data types, e.g. external social and/or unstructured data with internal BI
  • Multiple BI tools in use, each with their own presentation layer

Success Factors for a Virtual Database

Once the concept is implemented and the money spent, how is the efficiency of the Data Virtualisation measured?  A few factors that could signify success:

  • Data Quality: Very poor data quality can hamper the performance of the virtual database by forcing it to apply complex cleansing rules
  • Complexity of source systems: requires the modelling of complex queries, or making data more granular for extraction, both of which impact performance
  • Stability of underlying systems: the virtual database is only as good as the underlying databases
  • Metadata: shared metadata is essential for ensuring consistency
  • Real-time updates: the virtual database should not be employed to make real-time updates to source systems
  • Integration Scope: a realistic integration scope must be set; for example, mashing real-time and historically granular data for reporting solutions, using extensive business rules and transformations, can cause performance degradation
  • Give yourself plenty of time to implement the technology (Pella)
  • Reduce facility costs: Where physical data stores are consolidated, the hardware infrastructure costs are reduced, hence the total facility cost
  • Error Reduction: An integrated data interface minimises complexity and improves operational efficiency.  This directly contributes to reducing the chance of error

Benefits of Data Virtualisation

Money spent must procure benefit; consequently, an organisation must attain benefits from the Data Virtualisation implementation, to the effect of:

  • Early and iterative business (analyst) involvement
  • Makes data available in a unified format, a single environment for data integration and data federation
  • The ability to create virtual views with no data movement – but also easy reuse for batch
  • Hides complexities from applications (abstraction) and provides seamless access
  • Reduces the data duplication phenomenon by retaining data at source
  • Provides a single version of the truth
  • Integrates real-time data into the decision (historic) data paradigm
  • Provides stepping stone to agile BI (self-service)
  • Provides a viable alternative to EDB (enterprise data bus) technologies such as TIBCO and JBossMQ where integration is primarily for reporting or read access
  • Can also provide SOA solution with applications accessing the virtual database through web style services
  • Can be deployed in a phased manner to gradually build the enterprise model
  • A pre-built library of rich ETL-like advanced data transformations
  • Carry out functions (on the fly):
    • Data quality
    • Transformation
    • Masking
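These on-the-fly functions can be sketched as a pipeline the virtual layer applies as data flows through; the field names and the fixed exchange rate below are purely illustrative:

```python
def cleanse(row):
    """Data quality: trim and standardise values on the fly."""
    row["name"] = row["name"].strip().title()
    return row

def transform(row):
    """Transformation: derive a reporting field (illustrative rate)."""
    row["balance_eur"] = round(row["balance_zar"] / 20.0, 2)
    return row

def mask(row):
    """Masking: hide sensitive identifiers from consumers."""
    row["id_number"] = "***" + row["id_number"][-3:]
    return row

def serve(row):
    """The virtual layer applies the functions as the row passes through."""
    for fn in (cleanse, transform, mask):
        row = fn(row)
    return row

print(serve({"name": "  acme ltd ", "balance_zar": 400.0,
             "id_number": "7001015009087"}))
```

The source record is never changed; consumers only ever see the cleansed, transformed, and masked view.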

Specific Benefits for Changing a Point-to-Point Architecture

The question that begs answering is what, specifically, a point-to-point organisation would gain from implementing Data Virtualisation as a mechanism for Data Integration.  The most outstanding goal to be achieved, for any cost-centred business service, is Operational Efficiency.  An Information Technology department is a business enabler that costs the business the money used to provision business services, i.e. the spend required to bring products and services to market.  Managing this spend, to make it as efficient as possible, requires that its operational processes be as lean as possible, reducing waste as far as possible. (Deming, 1982).

If the current application integration landscape of the organisation’s enterprise architecture, with respect to data integration, is a heterogeneous, disparate, point-to-point architecture, Data Virtualisation can offer significant improvements.  Enterprise application integration (EAI) is the term for the process of combining separate applications into a cooperating federation of applications.  There are two logical architectures by which to achieve this, viz. point-to-point connections (the current architecture) and middleware-based integration, e.g. a Data Delivery Platform.

Data Virtualisation, matured to its final state as an Enterprise Data Delivery Platform, is a type of Centralised Application Integration Architecture.  This design, over the traditional point-to-point, offers the opportunity to optimise the information technology within the business architecture, not only on performance but more specifically on cost, by reducing complexity and infrastructure demands.

Point-to-Point Application Integration

Point-to-Point integration on the middle tier allows one application to link to another by a connecting pipe through which communication is facilitated, and generally relies on message queues to transport messages.  It is limited in that it becomes tedious and ineffective to bind more than two applications, especially because it does not facilitate middle-tier processing, i.e. application logic applied to information flowing through the pipe.  It certainly is possible to link more than two applications with traditional point-to-point middleware, but “doing so is generally not a good idea. . . [and] does not represent an effective application integration solution.” (Linthicum, 2004).

A point-to-point integration’s infrastructure proves brittle because applications are tightly coupled; changes to this architecture may break any function of the applications involved.  Each integration point forms an entity to support, and the number of integration points grows quadratically with the number of applications being integrated: in the worst case, n applications require n(n-1)/2 connections.
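The arithmetic behind this brittleness can be sketched as follows; the function names are illustrative only:

```python
def point_to_point_links(n):
    """Worst-case direct connections among n applications: n(n-1)/2."""
    return n * (n - 1) // 2

def middleware_links(n):
    """Hub-based integration: one connection per application."""
    return n

# The gap widens quickly as the application estate grows.
for n in (5, 10, 20):
    print(n, point_to_point_links(n), middleware_links(n))
```

Ten applications already imply up to 45 point-to-point connections to support, against 10 when every application connects only to the hub.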

Middleware-based Application Architecture

This middleware aims to reduce the interdependence of integrated applications.  Integration points exist in a one-to-one ratio with applications, which reduces the risk of errors and the integration maintenance burden.  Application logic may be added to the middleware to facilitate complex operations on the travelling data, such as transforming, aggregating, routing, separating, and converting messages.  The only impediment is to set up the middleware and transplant the point-to-point applications onto it.  But Forrester Research touts this as a boon rather than an impediment, for organisations can govern the transition, using the time to transform the organisation and thus reduce change management costs.

Centralised Architecture

By virtue of the word centralised, a centralised architecture is located at the centre of the enterprise architecture, allowing applications to interact with data sources via this logical concept.  Heterogeneous data sources, spanning relational databases, NoSQL stores, files, and services, are accessed without any knowledge of the implementation or underlying data structures.  Security, from a user’s perspective of an application, is implemented in this centralised architecture, providing overall security.  This becomes the single location through which applications interact with each other and with the enterprise’s data.

Such an architecture improves maintenance and management, and holds the business rules for routing and transformations, confining the debugging of applications to one particular set of rules.  A Data Delivery Platform’s aim is to achieve all the punted advantages of a centralised architecture, but specifically to:

  • Improve Data Quality: Meta- and Master- Data Services
  • Loosen inter-application dependencies
  • Split out transactional, historical, and reference data.  This is to allow the physical hardware to be consolidated and specified for specific data characteristics, viz. transactional and decision support data
  • Provision volatile data cache for real-time processing, allowing for reference and historic transactional data to be cached for performance
  • Eliminate the intra-day to history batch processing to do transactional data roll-overs
  • Provision data for development and testing, through data masking
  • Consolidate Security
  • Optimise reporting queries comprising intra-day and historic transaction combinations, by utilising a common data cache and query engine
  • Provide an enterprise view of data, through the centralised architecture’s ability to model data by making use of abstraction, including various data sources
  • Provide aggregated (summarised) data store(s) for historical record keeping requirements
  • Establish a horizontally infinitely scalable architecture, to meet enterprise performance demands for reporting and data services
  • Provide a single-point enterprise application architecture for the organisation’s business systems

Data Integration Architecture

Achieving an optimal and efficient enterprise application architecture requires integrating applications with respect to their data.  The Enterprise Application Integration with Data Delivery Platform picture illustrates such an architecture.  Data Integration is achieved through a Centralised Architecture by creating a logical grouping of integration technologies, called the Enterprise Data Delivery Platform, that combines practices such as Data Virtualisation, Enterprise Service Bus, and Object Relational Mapping (ORM).  The latter is a programming technique in which a metadata descriptor is used to connect object code to a relational database, but it is also a type of Data Virtualisation implementation. (van der Lans R. L., 2012).
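The metadata-descriptor idea behind ORM can be sketched with a hand-rolled illustration (not any particular product’s API; the `__table__` and `__columns__` attributes are invented for this sketch) that connects a plain class to a relational table:

```python
import sqlite3

class Customer:
    # Metadata descriptor: maps object attributes to table columns.
    __table__ = "customer"
    __columns__ = {"id": "id", "name": "cust_name"}

    def __init__(self, id, name):
        self.id, self.name = id, name

def save(db, obj):
    """Persist an object using only its mapping metadata."""
    cols = obj.__columns__
    db.execute(
        f"INSERT INTO {obj.__table__} ({', '.join(cols.values())}) "
        f"VALUES ({', '.join('?' for _ in cols)})",
        [getattr(obj, attr) for attr in cols],
    )

def load(db, cls, id):
    """Materialise a row back into an object via the same metadata."""
    cols = cls.__columns__
    row = db.execute(
        f"SELECT {', '.join(cols.values())} FROM {cls.__table__} "
        f"WHERE {cols['id']} = ?", (id,)
    ).fetchone()
    return cls(*row)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER, cust_name TEXT)")
save(db, Customer(1, "Acme Ltd"))
print(load(db, Customer, 1).name)
```

The application code works purely with `Customer` objects; the descriptor alone knows that `name` lives in the `cust_name` column.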

The drive of this architecture is to merge data access by providing a single point of access to all applications.  It also has consolidation in mind, especially for the physical data stores.  Transactional and transaction history stores are consolidated into a single store, the historic repository, by injecting an intermediary in-memory intra-day transaction repository cache as part of the Data Delivery Platform.  The historic transaction stores are trickle-fed (asynchronously) from the in-memory cache, which is the primary receptacle of the real-time transactional feeds.  The Data Virtualisation technology facilitates queries from applications for data spanning intraday and historical records, serving each from the appropriate store.
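The cache-and-trickle-feed arrangement might be sketched as below; the structures and function names are hypothetical stand-ins for the in-memory cache, the historic repository, and the asynchronous feed:

```python
from collections import deque

intraday_cache = {}   # in-memory receptacle for real-time transactional feeds
history_store = {}    # consolidated historic repository (persisted store)
pending = deque()     # asynchronous trickle-feed channel

def receive_transaction(txn_id, txn):
    """A real-time feed lands in the in-memory cache first."""
    intraday_cache[txn_id] = txn
    pending.append(txn_id)

def trickle_feed():
    """Background step: drain the queue into the historic store."""
    while pending:
        txn_id = pending.popleft()
        history_store[txn_id] = intraday_cache[txn_id]

def end_of_day_rollover():
    """Persist anything outstanding, then clear the intraday cache."""
    trickle_feed()
    intraday_cache.clear()

def query(txn_id):
    """Consumers get one answer, whether the record is hot or historic."""
    return intraday_cache.get(txn_id, history_store.get(txn_id))

receive_transaction("T1", {"amount": 100})
print(query("T1"))          # served from the intraday cache
end_of_day_rollover()
print(query("T1"))          # now served from the historic store
```

The consuming application calls `query` either way; which physical store answers is an implementation detail of the platform.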

In addition to transactional (intra-day) and transaction history data, a second class of data exists, which may be termed Market Data, or data to sell.  This is summarised data, i.e. data in an aggregated form: information, as it is classified for Business Intelligence.  Physical data separation, between transactional and analytical data, achieves the goal of separating data characteristics so that the hardware and storage types may be differentiated in accordance with the requirements of such data.  For example, transactional data stores require different hardware to decision-support data, in that the input-output characteristics determine that one is written in short bursts, while the other is read in long contiguous extents.  One workload cannot exploit processor parallelism, while the other can.  Having both on the same hardware is counter-productive.  The same applies to disk formats.

Reference data synchronisation is important in terms of the master-to-slave relationship of the data, between the master copy and other consuming systems, such as Complex Event Processing.  The Data Delivery Platform, being the single point of data integration, uses the cached (in-memory) data store from which to despatch reference data updates to synchronise other systems.  The enterprise synchronisation ensures that the master copy of persisted reference data is up to date at the end and start of each business day.  The in-memory cache is refreshed at start of day, and synchronisation is perpetuated throughout the enterprise architecture.

The Data Virtualisation technology is extended by the in-memory database, using it as a cache for all sorts of reference, intraday transactional, and query-beneficial data.  In addition to caching and data access, the Data Delivery Platform must also facilitate the functions of an Enterprise Service Bus, by catering for Service Orientated Architecture (SOA) message-based information exchanges, such as those done by a Message Bus.  In this architecture, Object Relational Mapping (ORM) is to be deemed a type of Data Virtualisation implementation.  Its function is to map a single application to a single data source.  This is also achievable via the Data Virtualisation layer, and whether the Data Virtualisation technology could provide the Object Relational Mapping functions more economically remains a moot point.

Enterprise Application Integration with Data Delivery Platform



Restrictions & Limitations

The term Business Intelligence (see Definitions below) is restricted to mean the practices and technologies (applications, tools, infrastructure) by which raw transactional data is analysed into meaningful information and delivered to the business (for business purposes), for strategic and tactical insights, by means of reports, queries, and other concepts such as data mining and data warehousing analytics.

The scope of this article, in discussing the concept of a Data Delivery Platform, restricts the meaning of Business Intelligence to all concepts other than data mining and data warehousing.  In summary, Business Intelligence is taken to be the facilitation of raw (transactional) data, via tools and technologies, using best practices, as information accessible to the business on which to do analytics, by way of reports and queries, in order to make decisions in support of strategy, tactics, operations and business insight.


Business Intelligence

Various definitions exist, but it suffices to quote a few authoritative sources:

  • Gartner: Business intelligence (BI) is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance. (Gartner, 2013).
  • Forrester:
  1. Business intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making[1].
  2. A set of methodologies, processes, architectures, and technologies that leverage the output of information management processes for analysis, reporting, performance management, and information delivery[2].
  • Olivia Parr Rud: Business intelligence (BI) is a set of theories, methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information for business purposes. BI can handle large amounts of information to help show and develop new opportunities. Making use of new opportunities and implementing an effective strategy can give a competitive market advantage and long-term stability. (Rud, 2009).
  • CIO: Business intelligence, or BI, is an umbrella term that refers to a variety of software applications used to analyze an organization’s raw data. BI as a discipline is made up of several related activities, including data mining, online analytical processing, querying and reporting. (Mulcahy, 2010).

[1] When using this definition, BI also has to include technologies such as data integration, data quality, data warehousing, master data management, text, and content analytics, and many others that the market sometimes lumps into the information management segment. (Forrester, 2010).

[2] We also refer to data preparation and data usage as two separate, but closely linked, segments of the BI architectural stack. (Forrester, 2010).


Encapsulation is a form of information hiding, and consists of separating the external aspects of an object, which are accessible to other objects, from the internal implementation details of the object, which are hidden from other objects.
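A minimal Python illustration of the principle: the ledger is an internal detail, hidden behind a small, stable public interface:

```python
class Account:
    """External aspect: deposit() and balance.  The ledger is hidden."""

    def __init__(self):
        self.__ledger = []          # internal implementation detail

    def deposit(self, amount):      # public, stable interface
        self.__ledger.append(amount)

    @property
    def balance(self):
        return sum(self.__ledger)

acct = Account()
acct.deposit(100)
acct.deposit(50)
print(acct.balance)
```

The internal ledger could later be replaced by, say, a database table without any change to the code that calls `deposit` and reads `balance`.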


In abstraction, non-relevant tables, columns, and rows are hidden; this is synonymous with Codd’s concept of logical data independence.  In SQL databases it is typically achieved via views.
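A small sketch using SQLite (table and column names are invented for illustration) shows how a view hides non-relevant and sensitive columns and rows from consumers:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE customer
              (id INTEGER, name TEXT, id_number TEXT, internal_flag INTEGER)""")
db.execute("INSERT INTO customer VALUES (1, 'Acme Ltd', '7001015009087', 0)")

# The view exposes only the relevant columns and rows; consumers
# query it as if it were the table itself.
db.execute("""CREATE VIEW customer_public AS
              SELECT id, name FROM customer WHERE internal_flag = 0""")

print(db.execute("SELECT * FROM customer_public").fetchall())
```

The base table can gain or reorder columns without breaking consumers of `customer_public`, which is precisely the logical data independence described above.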

Data Virtualization

Webopedia: Data virtualization is a methodology that takes a layered approach to dealing with huge volumes of data from disparate sources.  The phrase virtualization is used because data virtualization is the aggregation of the data from disparate sources, including databases, systems and storage, to create a single virtual view from within front-end systems, such as applications or dashboards.  Data virtualization is commonly associated with business or enterprise applications including sales force automation, customer relationship management, enterprise resource planning, and business intelligence. (Webopedia, 2013).

Rick L. van der Lans: Data virtualization is the technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores. (van der Lans R. L., 2012).

Techopedia: Data virtualization is the process of aggregating data from different sources of information to develop a single, logical and virtual view of information so that it can be accessed by front-end solutions such as applications, dashboards and portals without having to know the data’s exact storage location. (Janssen, 2013).

Virtual Database

A virtual database, also called a federated database, is a mechanism to query several databases as if they were a single entity.  It should not be confused with sharding, which partitions a single database horizontally across servers.  Federated databases are a subset of Data Virtualisation.

Application Integration

Java World (Kang, 2002): Enterprise Application Integration combines separate applications into a cooperating federation of applications.  Two logical integration architectures for integrating applications exist: direct point-to-point and middleware-based integration.

PC Mag Encyclopaedia (PC Mag, 2013):

  • Translating data and commands from the format of one application into the format of another. It is essentially data and command conversion on an ongoing basis between two or more incompatible systems.  Implementing application integration has traditionally been done by tedious programming, or occasionally one package might support the interfaces of one or two other packages. However, the trend today is to use message brokers, application servers and other specialized integration products that offer a common connecting point. Since the advent of the Web, these pre-packaged “middleware” solutions have become widely used to Web-enable the enterprise.
  • Redesigning disparate information systems into one system that uses a common set of data structures and rules.

Investopedia (Definition of ‘Enterprise Application Integration’, 2013): The translation of data and other commands from one application format into another. Enterprise application integration is an ongoing process between two incompatible systems. This can allow for differing financial applications to interface effectively and process data or transactions.

SearchSOA (EAI (enterprise application integration), 2013): EAI (enterprise application integration) is a business computing term for the plans, methods, and tools for modernizing, consolidating, and coordinating the computer applications in an enterprise.

Typically, an enterprise has existing legacy applications and databases and wants to continue to use them while adding or migrating to a new set of applications that exploit the Internet, e-commerce, extranet, and other new technologies.

EAI may involve developing a new total view of an enterprise’s business and its applications, seeing how existing applications fit into the new view, and then devising ways to efficiently reuse what already exists while adding new applications and data.

Enterprise Service Bus

SearchSOA (enterprise service bus (ESB), 2013): An enterprise service bus (ESB) is a software architecture for middleware that provides fundamental services for more complex architectures. For example, an ESB incorporates the features required to add a service-oriented architecture (SOA). In a general sense, an ESB can be thought of as a mechanism that manages access to applications and services (especially legacy versions) to present a single, simple, and consistent interface to end-users via Web- or forms-based client-side front ends.

Techopedia (Enterprise Service Bus (ESB), 2013): An enterprise service bus (ESB) is an integrated platform that provides fundamental interaction and communication services for complex software applications via an event-driven and standards-based messaging engine, or bus, built with middleware infrastructure product technologies. The ESB platform is geared toward isolating the link between a service and transport channel and is used to fulfil service-oriented architecture (SOA) requirements.

Data Integration

Gartner (Gartner, 2013): The discipline of data integration comprises the practices, architectural techniques and tools for achieving the consistent access and delivery of data across the spectrum of data subject areas and data structure types in the enterprise to meet the data consumption requirements of all applications and business processes.

Techopedia (Data Integration, 2013): Data integration is a process in which heterogeneous data is retrieved and combined as an incorporated form and structure. Data integration allows different data types (such as data sets, documents and tables) to be merged by users, organizations and applications, for use as personal or business processes and/or functions.

Object Relational Mapping (ORM)

Digplanet (Object-relational mapping, 2013): Object-relational mapping (ORM, O/RM, and O/R mapping) in computer software is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a “virtual object database” that can be used from within the programming language. There are both free and commercial packages available that do object-relational mapping, although some programmers opt to create their own ORM tools.

Techopedia (Object-Relational Mapping (ORM), 2013): Object-relational mapping (ORM) is a programming technique in which a metadata descriptor is used to connect object code to a relational database. Object code is written in object-oriented programming (OOP) languages such as Java or C#. ORM converts data between type systems that are unable to coexist within relational databases and OOP languages.

Douglas K Barry (Barry, Douglas K.;, 2013): Object-relational mapping (OR mapping) products integrate object programming language capabilities with relational databases managed by Oracle, DB2, Sybase, and other RDBMSs. Database objects appear as programming language objects in one or more existing object programming languages. Often, the interface for object-relational mapping products is the same as the interface for object databases.

Works Cited

Data Integration. (2013). Retrieved April 22, 2013, from Techopedia: http://www.techopedia.com/definition/28290/data-integration

Definition of ‘Enterprise Application Integration’. (2013). Retrieved April 22, 2013, from Investopedia: http://www.investopedia.com/terms/e/enterprise-application-integration.asp

EAI (enterprise application integration). (2013). Retrieved April 22, 2013, from searchsoa.techtarget.com: http://searchsoa.techtarget.com/definition/EAI

enterprise service bus (ESB). (2013). Retrieved April 22, 2013, from Search SOA: http://searchsoa.techtarget.com/definition/enterprise-service-bus

Enterprise Service Bus (ESB). (2013). Retrieved April 22, 2013, from Techopedia: http://www.techopedia.com/definition/5229/enterprise-service-bus-esb

Object-relational mapping. (2013). Retrieved April 22, 2013, from Digplanet: http://www.digplanet.com/wiki/Object-relational_mapping

Object-Relational Mapping (ORM). (2013). Retrieved April 22, 2013, from Techopedia: http://www.techopedia.com/definition/24200/object-relational-mapping-orm

PC Mag. (2013). Retrieved April 22, 2013, from Definition of:application integration: http://www.pcmag.com/encyclopedia/term/37910/application-integration

W. Edwards Deming. (2013). Retrieved April 22, 2013, from WikiPedia: http://en.wikipedia.org/wiki/W._Edwards_Deming

Aberdeen Group. (2011, March). Three Steps to Analytic Heaven. Aberdeen Group, 28.

Analyst, M. (2010, 02 15). Modern Analyst. Retrieved 02 15, 2010, from Modern Analyst: http://www.modernanalyst.com/

Barry, Douglas K.;. (2013). Object-relational mapping (OR mapping) definition. Retrieved April 22, 2013, from Service Architecture: http://www.service-architecture.com/object-relational-mapping/articles/object-relational_mapping_or_mapping_definition.html

Brunelli, M. (2013, February). In an information downpour, data virtualization products bloom. Retrieved April 09, 2013, from Search Data Management: http://searchdatamanagement.techtarget.com/feature/In-an-information-downpour-data-virtualization-products-bloom

Capgemini. (2013). Data Virtualization. Retrieved April 04, 2013, from Capgemini.com: http://www.capgemini.com/sites/default/files/resource/pdf/data_virtualization._how_to_get_your_business_intelligence_answers_today.pdf

Davis, J. R., & Eve, R. (2011). Data Virtualization: Going Beyond Traditional Data Integration to Achieve Business Agility (1st ed.). Ashland, OR, USA: Composite Software, 2011.

Deming, E. W. (1982). Out of Crisis. Cambridge, MA, U.S.A: Massachusetts Institute of Technology.

Eve, B. (2011). Achieving Business Agility with Data Virtualization. Composite Software, 2.

Eve, Robert. (2010, April 13). Data Services Platforms – Bringing Order to Chaos. (Composite Software) Retrieved April 15, 2013, from Ebiz.net: http://www.ebizq.net/topics/bi/features/12478.html?page=3

Forrester. (2010). Enterprise Business Intelligence Platforms. Forrester Wave.

Gartner. (2013, April 15). Business Intelligence (BI). Retrieved April 15, 2013, from Gartner: http://www.gartner.com/it-glossary/business-intelligence-bi/

Gartner. (2013). Data Integration. Retrieved April 22, 2013, from Gartner IT Glossary: http://www.gartner.com/it-glossary/data-integration-tools/

Harrington, J. L. (2009). Relational Database Design and Implementation: Clearly Explained (3rd ed.). Burlington, MA, USA: Morgan Kaufmann.

Hayward, M. (2013, February 15). Alleviate load on SANs with Data Virtualization. Retrieved April 15, 2013, from http://dboptimizer.com/2013/02/15/alleviate-load-on-sans-with-data-virtualization/

Janssen, C. (2013). Data Virtualization. Retrieved April 19, 2013, from Techopedia: http://www.techopedia.com/definition/1007/data-virtualization

Kang, A. (2002, August 09). Enterprise application integration using J2EE. Retrieved April 22, 2013, from Java World: http://www.javaworld.com/javaworld/jw-08-2002/jw-0809-eai.html?page=1

Linthicum, D. (2004). Next generation application integration: from simple information to Web services (2nd ed.). Boston, MA, U.S.A: Addison-Wesley Professional.

Loshin, D. (2010, June). Effecting Data Quality Improvement. Knowledge Integrity Incorporated, 11.

Mulcahy, R. (2010). Business Intelligence Definition and Solutions. Retrieved April 15, 2013, from CIO: http://www.cio.com/article/40296/Business_Intelligence_Definition_and_Solutions

Rud, O. P. (2009). Business Intelligence Success Factors. Hoboken, NJ, USA: John Wiley & Sons, Inc.

van der Lans, R. L. (2012). Data Virtualization for Business Intelligence Systems (1st ed.). (A. Dierna, & R. Day, Eds.) Waltham, MA, USA: Morgan Kaufmann Publishers as an import of Elsevier.

Webopedia. (2013). Data virtualization. Retrieved 2013, from Webopedia: http://www.webopedia.com/TERM/D/data_virtualization.html

White, D. (2011, April 01). Agile BI: Three Steps to Analytical Heaven. Retrieved April 11, 2013, from Mxisoft.com: http://www.mxisoft.com/Portals/53068/docs/3%20Steps%20to%20Analytic%20Heaven.pdf

Yuhanna, N., & Gilpin, M. (2012, January 05). The Forrester Wave: Data Virtualization Q1 2012. Retrieved April 11, 2013, from Informatica.com: 1888_forrester-wave-data-virtualization_ar.pdf

SQL Server Data via Enterprise Library 5.0


This is a very simple Enterprise Library 5.0 example.

Getting started with reading data from SQL Server via Enterprise Library 5.0 requires that the framework be installed on the development machine and added to the project references, and the ‘using’ directives, of the development environment, e.g. Visual Studio 2012.

The software is available from MSDN [1]: Enterprise Library 5.0 — May 2011 (Build 5.0.414.0), but it is recommended that you use NuGet [5] to add the Enterprise Library to the solution in Visual Studio.  This way, the latest version of the Enterprise Library is downloaded and installed, e.g. 5.0.505.0.  I am not sure why the MSDN link points to an older stand-alone .msi download of Enterprise Library (5.0.414.0) while NuGet has the latest, i.e. 5.0.505.0.

NOTE: To get the Enterprise Library with NuGet, go to: Tools >> Library Package Manager >> Manage NuGet Packages for Solution >> Online >> All, then type ‘entlib’ into the Search Online (Ctrl+E) box, and press Enter.

Install the following packages from NuGet:

  • Microsoft.Practices.EnterpriseLibrary.Common
  • Microsoft.Practices.EnterpriseLibrary.Data
  • Microsoft.Practices.EnterpriseLibrary.ServiceLocation
  • Microsoft.Practices.EnterpriseLibrary.Unity
  • Microsoft.Practices.EnterpriseLibrary.Unity.Configuration
  • Microsoft.Practices.EnterpriseLibrary.Unity.Interception
  • Microsoft.Practices.EnterpriseLibrary.Unity.Interception.Configuration
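Alternatively, the same packages can be installed from the Package Manager Console (Tools >> Library Package Manager >> Package Manager Console); the package ids below are assumptions based on the NuGet gallery naming for Enterprise Library 5.0, so verify them against the search results in your own gallery:

```shell
PM> Install-Package EnterpriseLibrary.Common -Version 5.0.505.0
PM> Install-Package EnterpriseLibrary.Data -Version 5.0.505.0
```

Installing EnterpriseLibrary.Data pulls in its dependencies (Common, Unity, ServiceLocation) automatically.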

In Visual Studio, using the menu path: Tools >> Library Package Manager >> Package Visualizer, you can get a listing of all the packages installed (see image below) for this solution.

Package Manager Visualiser

Package Manager Visualiser

Using Enterprise Library leverages functionality from a collection of reusable application blocks that assist developers with common development tasks.   Amongst many, the common blocks are Data Access, Exception Handling, Logging, Security and Validation.  To get to the NuGet packages (online), as well as see which ones have been installed, in addition to the above method, use the Visual Studio menu path: Tools >> Manage NuGet Packages for Solution… The image (below) lists all packages (you can type entlib into the Search Online field) with a green-tick indication of those already installed.

NuGet Package Manager for Solution

NuGet Package Manager for Solution

This article focuses on the Data Access Application Block.


Rewriting common code for every specialised implementation of application code is suboptimal: it is error prone, because it does not lend itself to reproducing exactly the same implementation each time.  This creates a maintenance burden, as errors may vary between the different code blocks that essentially do the same thing.  Instead, a standardised layer of optimised code should substitute this common boiler-plate plumbing.  That layer is the Microsoft Enterprise Library.

This article looks at standardising database access through the enterprise library.


Enterprise Library Installation

Download and install Microsoft Enterprise Library, or use NuGet (recommended).  Create a Visual Studio 2012 console application project, to test a data retrieval.  Add references of the Enterprise Library to the project, or if you have used NuGet, these are added automatically to the solution.

NOTE: NuGet adds a new config file, i.e. packages.config, to track installed packages and versions, but, this file is not part of the project and should not go into source control.

Configuration & Code

Once a reference has been added to the project, have a look at the Codeplex [2] examples for download.  From this example: [EntLib Dev Guide RC – Code Samples.zip], look in the DataAccess > DataAccess folder for the App.config file.

Two sections are important, viz. configSections and connectionStrings.  These have to be added, with the former configured to point at the installed dataConfiguration, i.e. Enterprise Library, and the latter holding the connection string to the actual database.  See the next section on how to use the Enterprise Library Config. tool to create/change the app.config file.  This is a whole lot less error prone than rolling your own.

Code (from the Microsoft example):

<configuration>
  <configSections>
    <section name="dataConfiguration" type="Microsoft.Practices.EnterpriseLibrary.Data.Configuration.DatabaseSettings, Microsoft.Practices.EnterpriseLibrary.Data, Version=5.0.505.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35" requirePermission="true" />
  </configSections>
  <dataConfiguration defaultDatabase="TradeInformation" />
  <connectionStrings>
    <add name="TradeInformation" connectionString="Database=TradeInformation;Server=TradeDatabaseServer;Integrated Security=SSPI" providerName="System.Data.SqlClient" />
  </connectionStrings>
  <startup><supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.0"/></startup>
</configuration>

NOTE: Instead of editing the configuration file manually, it can also be done via a tool that provides a user-screen.  Simply right-click on the configuration file in the project (the one referring to Enterprise Library) and select Edit Configuration File from the menu list.  For more detail, see the MSDN example [3]: Using the Configuration Tools.  If this option does not show up on right-click, make sure you have the version that targets Visual Studio 2012 installed: http://visualstudiogallery.msdn.microsoft.com/029292f0-6e66-424f-8381-3454c8222f9a [4], then restart Visual Studio.

EnterpriseLibrary.Config Right-Click

EnterpriseLibrary.Config Right-Click

When using the EnterpriseLibrary.Config application (right-click the app.config) to configure the data access settings, the user interface requires that a block (Add Data Settings) be added.  In the screen-grab (below) the app.config file can be seen, alongside the EnterpriseLibrary.Config user interface used to create it.  Doing it this way eliminates human configuration errors.

Enterprise Library.config

Enterprise Library.config

The defaultDatabase setting names the default connection that the GetInstance() method uses when no database key is specified; this is the connection listed first, below.   The second example illustrates a connection made by specifying the database connection key/name.

_db = EnterpriseLibraryContainer.Current.GetInstance<Database>();
_db = EnterpriseLibraryContainer.Current.GetInstance<Database>("TradeInformation");

NOTE: If you get the error “The OutputPath property is not set for this project” when building the solution, make sure that the Configuration Manager (Debug, Release), as well as the Properties of every project in the solution are targeting the same CPU, e.g. x64.  Also refer to [http://gabrielmagana.com/2010/04/solution-to-the-outputpath-property-is-not-set-for-this-project/] and [http://connect.microsoft.com/VisualStudio/feedback/details/518181/msbuild-gets-confused-with-build-configuration-and-starts-generating-output-path-errors], if this is not the problem.

Code Example

Using the Codeplex example, I changed it a bit to make a simple Enterprise Library database call to defaultDB, over which an ExecuteReader loops to print each record (in this case: a trade) to the console window.

I had a few issues with the .dll file references of Enterprise Library files.  The short of it is that when these come into the project via NuGet, the packages are copied into C:\Users\<your profile>\Documents\Visual Studio 2012\Projects\<project name>\packages\EnterpriseLibrary.Common.5.0.505.0\lib\NET35\<.dll name>.   A copy also resides in the bin folder:

  • Microsoft.Practices.EnterpriseLibrary.Common.dll
  • Microsoft.Practices.EnterpriseLibrary.Data.dll
  • Microsoft.Practices.Unity.Configuration.dll
  • Microsoft.Practices.Unity.dll
  • Microsoft.Practices.Unity.Interception.Configuration.dll
  • Microsoft.Practices.Unity.Interception.dll

The project references point to the “packages” folder (build), but the application references the bin folder at run-time.  Make sure that these references point to the correct .dll files for the Enterprise Library.  You could find yourself with endless problems when defaultDB = EnterpriseLibraryContainer.Current.GetInstance<Database>(); attempts to get the database instance using the Enterprise Library. It will either complain about not being able to load the correct .dll files, or that the database key (referring to the database name of either your named or default database in the app.config file) could not be loaded.

Code (Program.cs):


using System;
using System.Data;
using System.Data.Common;
using System.Data.SqlClient;
using System.ComponentModel;
using Microsoft.Practices.EnterpriseLibrary.Common.Configuration;
using Microsoft.Practices.EnterpriseLibrary.Data;

namespace za.co.acme.DataAccess
{
    class Program
    {
        static Database defaultDB = null;

        static void Main(string[] args)
        {
            defaultDB = EnterpriseLibraryContainer.Current.GetInstance<Database>();
            ReadTradesToConsole();
        }

        private static void ReadTradesToConsole()
        {
            String tSQL = "SELECT TradeID, ReceivedSequenceNumber, TradeDateTime, TradePrice, TradeSize FROM [dbo].[TradeInformation]";
            using (IDataReader reader = defaultDB.ExecuteReader(CommandType.Text, tSQL))
            {
                while (reader.Read())
                {
                    for (int i = 0; i < reader.FieldCount; i++)
                    {
                        Console.WriteLine("{0} = {1}", reader.GetName(i), reader[i].ToString());
                    }
                }
            }
        }
    }
}

The Codeplex example provides a list of different usage patterns for the data, e.g. reading data from a stored procedure (with named/unnamed parameters), reading data as objects/XML/scalar value, even doing all of this asynchronously. The asynchronous methods check the Enterprise Library’s public virtual bool SupportsAsync { get; } property, to see if the operation is possible, before attempting it.
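As a hedged sketch of the stored-procedure pattern (the procedure name GetTradesByDate and its TradeDate parameter are hypothetical, not part of the CodePlex sample; the GetStoredProcCommand/AddInParameter methods are the Data Access Application Block's API):

```csharp
using System;
using System.Data;
using System.Data.Common;
using Microsoft.Practices.EnterpriseLibrary.Data;

// Hypothetical stored procedure "GetTradesByDate" taking a @TradeDate parameter.
DbCommand cmd = defaultDB.GetStoredProcCommand("GetTradesByDate");
defaultDB.AddInParameter(cmd, "TradeDate", DbType.DateTime, DateTime.Today);

using (IDataReader reader = defaultDB.ExecuteReader(cmd))
{
    while (reader.Read())
    {
        // Print one column per trade; column name is an assumption.
        Console.WriteLine(reader["TradeID"]);
    }
}
```

The same Database object also exposes ExecuteScalar and ExecuteDataSet for the other read patterns the example covers.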


No.  Author      Web Site Name                                           URL                                                                                  Date Authored      Date Cited
1.   Microsoft   MSDN                                                    http://msdn.microsoft.com/en-us/library/ff632023.aspx                                May 2011           28 January 2013
2.   Microsoft   CodePlex                                                http://entlib.codeplex.com/                                                          14 December 2012   28 January 2013
3.   Microsoft   MSDN: Using the Configuration Tools                     http://msdn.microsoft.com/en-us/library/ff664633(v=pandp.50).aspx                    n/a                28 January 2013
4.   Microsoft   MSDN: Visual Studio Gallery - EnterpriseLibrary.Config  http://visualstudiogallery.msdn.microsoft.com/029292f0-6e66-424f-8381-3454c8222f9a   n/a                14 March 2013
5.   NuGet Docs  NuGet Frequently Asked Questions                        http://docs.nuget.org/docs/start-here/nuget-faq                                      n/a                15 March 2013

Last Updated: 19 March 2013 09h19

Complex Event Processing (CEP): Microsoft StreamInsight – The Basics


Complex Event Processing (CEP) is an in-memory method for analysing continuous streams of data, which could come from multiple sources, for events or patterns, based on a standing declarative query, in order to derive some conclusion with near-zero latency at high event volumes.

The main purpose is to detect meaningful events in the incoming data, such as market opportunities or business threats, and respond to them as fast as possible.   Input streams could be very diverse, for example social media, stock market feeds, traffic reports, or orders and sales.  A CEP engine must be able to detect defined patterns in the data stream(s).  These patterns are used in Operational Intelligence, providing real-time insight into the business paradigm catered for by the CEP query.  Most CEP engines perform two broad functions, viz. computation and detection.  The former is calculative, such as stock trade statistics; the latter detects patterns, in a surveillance-type function, providing real-time (per-second) business intelligence and decision-support abilities.

Well known technologies in this market are: StreamBase, Sybase ESP, WebSphere Business Events, Esper, Informatica RulePoint, Progress Apama, Microsoft StreamInsight, and Tibco.

This article deals with Microsoft StreamInsight, within a financial services company, creating stock market feed trade statistics.


A real-time requirement exists for calculating stock market feed trade statistics, to produce trade volume, trade value, and the number of trades, per stock.  The results are published on a stock-ticker dashboard.


Use Complex Event Processing (CEP) to parse the stock market feed against a declarative computational query, to produce the required statistics from the streaming feed [1].  For this solution, Microsoft StreamInsight is the CEP engine employed to calculate and correlate in-flight events, on the stock market feed.

StreamInsight is packaged together with SQL Server 2012, and a user has to comply with the relevant SQL Server licensing to use it.  It can also be downloaded separately from the Download Center [2].

Jump-start information (understanding what the product is, getting started, planning the architecture, the developer’s guide, glossary, and class library) is available on MSDN [3].  From the MSDN blogs for StreamInsight (The StreamInsight Team), a few valuable resources are obtainable, viz. A Hitchhiker’s Guide to StreamInsight 2.1 Queries, with its accompanying Visual Studio solution, as well as the StreamInsight 2.1 samples (CodePlex), together with the queries discussed, as part of the LinqPad samples.  The StreamInsight Team, on SQLblog.com (Roman Schindlauer), has also created a resource listing for StreamInsight 2.1 [5].

Installing StreamInsight

If you have the SQL Server 2012 installation DVD (en_sql_server_2012_developer_edition_with_sp1_x86_x64_dvd_1173438.iso), it may contain an older version of StreamInsight, e.g. StreamInsight 2.0.

For the latest version, it’s best to download it from the Download Center [2].  Obtain the two files cited below, corresponding to the chip-set of your machine.

  • StreamInsight.msi
  • StreamInsightClient.msi

The Installation page on MSDN [7], lists all the system requirements and prerequisites, viz.

  1. Administrative rights on machine
  2. Installed .Net framework 4
  3. SQL Server 2012 product key

The installation dialogue requests an instance name for this installation.  NOTE: There can be multiple installations running side-by-side; the instance name is the differentiator.  Call it something descriptive, for example (in this case) STOCKSTATISTICS.

The licensing of StreamInsight is prompted at installation by a choice between Evaluation and Product Key, where Product Key is the obvious choice.  If the company has a volume license agreement with Microsoft for SQL Server 2012, the product key might not be available from the MSDN subscription to enter in the Enter the product key: field for the StreamInsight installation.  Don’t lament, it can be obtained from the installation DVD, by navigating to the file <Drive>:\x64\DefaultSetup.ini and reading the PID=”<Product Key>”, as indicated in the picture below.

SQL Server 2012 Product Key from Installation DVD

SQL Server 2012 Product Key from Installation DVD

There are a few security (user) permissions, via the group StreamInsightUsers$MyInstance, for the StreamInsight Windows service, under whose context execution takes place, that are needed to use things like the debugger.

NOTE: The SQL Server Product Key, corresponding to the edition that you are licensed for, determines the StreamInsight edition that is enabled.

The second file of the installation, StreamInsightClient.msi, is used by client applications to connect to a StreamInsight service residing on another machine.  It installs client-side libraries and does not require a product key.

StreamInsight Architecture

The StreamInsight server is the run-time component.  It comprises event sources (data generators), a standing query and sinks (data consumer) to process results.  Interfaces to the event generators, e.g. stock ticker, database, or web-dashboard, are created to display results, as the data is continuously streamed into the StreamInsight platform [8].

Data sequences are different ways in which data is consumed, viz. pull (IEnumerable), push (IObservable) and temporal (IStreamable).

The temporal stream is a potentially endless stream over a time interval, with a CTI (current time increment: time sequence) time stamp.  Events, i.e. the most basic unit of processable data, comprising a header (metadata) and a payload (a .Net data structure holding the data associated with the header), are passed through the CEP.  The CTI allows processing out of the sequence of arrival, in relation to the application timestamps.  This type implements IQueryable, IObservable, and IStreamable over source types, and a user can switch across the source types by using bridge methods [11].  StreamInsight implements its own flavour of LINQ, which is used to express queries over the CepStream type.  There are a few LINQ providers:

  • LINQ to Objects – IEnumerable is implemented to pull an ordered list of objects of the same type from a data sequence, or IStreamable over a temporal stream of events with unique characteristics
  • LINQ to SQL
  • Reactive Extensions (Rx) for .Net – IObservable is implemented for an ordered list of objects of the same type, pushed to consumers to be accessed with Rx.  IObservable is the corner-stone of Rx, which is declarative code to orchestrate and coordinate asynchronous and event-based programs [12].
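The pull versus push distinction above can be sketched in plain C#, using only the BCL’s IEnumerable/IObservable interfaces; the PriceFeed and PriceObserver classes here are hypothetical illustrations, not StreamInsight or Rx types:

```csharp
using System;
using System.Collections.Generic;

class PriceObserver : IObserver<int>
{
    public void OnNext(int price) { Console.WriteLine("pushed: " + price); }
    public void OnError(Exception e) { }
    public void OnCompleted() { }
}

class PriceFeed : IObservable<int>
{
    private readonly List<IObserver<int>> observers = new List<IObserver<int>>();

    public IDisposable Subscribe(IObserver<int> observer)
    {
        observers.Add(observer);
        return null; // a real implementation returns an unsubscriber object
    }

    public void Publish(int price)
    {
        // Push: the producer drives, calling each consumer as data arrives.
        foreach (var o in observers) o.OnNext(price);
    }
}

class Demo
{
    static void Main()
    {
        // Pull: the consumer drives iteration over an IEnumerable.
        IEnumerable<int> history = new[] { 100, 101, 99 };
        foreach (var p in history) Console.WriteLine("pulled: " + p);

        // Push: the consumer only reacts.
        var feed = new PriceFeed();
        feed.Subscribe(new PriceObserver());
        feed.Publish(102);
    }
}
```

Rx builds its entire operator library on exactly this IObservable/IObserver pair.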

The temporal stream data-sequence is associated with something like a stock-ticker stream that provides stock prices.  Because the temporal stream data-sequence implements IObservable, an event-based interface, it enables LINQ to Events for composing event streams.  The IQueryable interface (using LINQ to SQL) enables a filtered database query, to return data relative to the source stream, for processing.  This is advantageous where stock prices change, generating events for high or low price fluctuations.

When envisaging the solution, in this case creating trade statistics from a stock ticker, the result and how to get it are relevant in considering what to implement.  The three statistics, viz. Volume, Value and Number of Trades, are all calculative in nature, so values are aggregated and deals are counted.  The events comprise a fixed payload (a .Net structure) of known types and fields, meaning that a typed adapter can be defined.  Switching between data sequences is facilitated through bridge methods, by adding “To” in front of the respective method, e.g. ToObservable().

StreamInsight supports three event models, viz. interval, point and edge, with interval being the most generic.  Interval, by virtue of the name, is time bound, requiring a start and end time in the metadata of the event.  The point event model represents a single point in time, or the start time of the event.  The end time is added by StreamInsight as a tick.  Both these event models can be used for a stock ticker, depending on the nature of the event to react to.  For example, if a bid price is valid over a duration, the Interval event model can be used, whereas if stock statistics, such as Volume, are calculated, the Point model can be used.
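As a hedged sketch of the two event models (PointEvent and IntervalEvent are StreamInsight’s event factory types; the Trade and Bid payload classes are hypothetical):

```csharp
using System;
using Microsoft.ComplexEventProcessing;

// Hypothetical payload classes.
class Trade { public string Code; public int Size; }
class Bid { public string Code; public decimal Price; }

class EventModels
{
    static void Sketch()
    {
        // Point event: a single application timestamp; StreamInsight adds one tick as the end time.
        var tradePoint = PointEvent<Trade>.CreateInsert(
            DateTimeOffset.UtcNow,
            new Trade { Code = "ACME", Size = 100 });

        // Interval event: an explicit start and end time, e.g. the validity window of a bid price.
        var bidInterval = IntervalEvent<Bid>.CreateInsert(
            DateTimeOffset.UtcNow,
            DateTimeOffset.UtcNow.AddSeconds(30),
            new Bid { Code = "ACME", Price = 10.5m });
    }
}
```

The choice of model determines only the temporal metadata on the event header; the payload structure is the same.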

StreamInsight can be deployed as an embedded (into the solution) or separate server (shared metadata across applications) installation.

The last consideration is resiliency, i.e. recovery from outages in as little time as possible [9], using checkpoints.

Unique Details of this Solution



The solution comprises a feed, a calculation, and a published result.  It simply consumes a stream of messages from a WCF service, which in turn queries a database that is populated by an exchange trading system.  These messages, sent in standard XML and conforming to the Financial Information Exchange (FIX) protocol [13], are received by StreamInsight and converted into the relevant trade statistics, viz. Volume, Value and the Number of Trades.

The results are published to a stock-ticker, per stock code, indicating the running values of the stock statistics, in real time.

The solution must connect to the WCF service and stream all trades, as they come in from the trading engine, down to the standing query, where they are converted into the statistics.

The challenge is to decide on the implementation architecture of StreamInsight.  For a simple start, a few contributors, notably Microsoft: The StreamInsight Team, and JAhlen have provided many examples on CodePlex, in addition to the Hitchhiker’s Guide to StreamInsight and MSDN.

This solution uses the HelloWorld approach, as a first iteration.

Trade Statistics Architecture

Trade Statistics Architecture

The Visual Studio 2012 Project

StreamInsight development encapsulates the following actions:

  • Create a server instance, based on the StreamInsight installed instance name
  • Create the application
  • Define source
  • Create a query for the stream
  • Define a Sink
  • Bind query to Sink

The development uses the respective libraries, for complex event processing, to implement the functionality required, viz.

  • System.ServiceModel
  • Microsoft.ComplexEventProcessing
  • Microsoft.ComplexEventProcessing.Linq
  • Microsoft.ComplexEventProcessing.ManagementServices
  • System.Reactive
  • System.Reactive.Linq

NOTE: To help with the development of LINQ against StreamInsight, the application LinqPad should be used.  The current version (18 January 2013) is: 4.43.06 and is a free download from http://www.linqpad.net/ [15].  The StreamInsight team has a kick-start article (SQLBlog.com [14]) describing the installation of the StreamInsight driver for LinqPad.

Server & Application

The StreamInsight server exposes an endpoint [10] for client programs to connect and use the server, using an http (WCF) reference.  The server is either created in the same process (in-process, or embedded) of the application, or can be created on a remote machine (separate process), to work on the application there.  A remote server could be the installed StreamInsight Host Windows Service, configured at installation, or a new service host can be created in the code, and exposed as an end-point by the application.
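A minimal embedded-server sketch, reusing the installed instance name from earlier (the application name "TradeStatistics" is a hypothetical choice for this article):

```csharp
using Microsoft.ComplexEventProcessing;

class EmbeddedServerSketch
{
    static void Sketch()
    {
        // In-process (embedded) server, bound to the installed StreamInsight instance name.
        using (Server server = Server.Create("STOCKSTATISTICS"))
        {
            // The application is the container for sources, queries, and sinks.
            Application application = server.CreateApplication("TradeStatistics");

            // ... define observables, the standing query, and sinks against this application ...
        }
    }
}
```

For the remote case, Server.Connect is used instead, pointing at the WCF endpoint the StreamInsight Host service exposes.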


This is the input stream, implemented by using an input adapter factory to create a CEPStream object.  This is an event source that supports the IObservable, IEnumerable, IQueryable, and IStreamable interfaces.  IEnumerable is a pull, and IObservable (.Net 4) a push, data function.  Classes that implement these interfaces are created and bound to the query.

Stream Query

The query is written using LINQ on the Source, a filter operation that drops every event that is not matched.  Use LinqPad to create the query.
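Such a standing query could be sketched as a grouped, tumbling-window aggregation producing the three statistics per stock; this is an assumption-laden fragment (tradeStream is the IQStreamable source defined elsewhere, and the Trade payload fields Code, Size and Price are illustrative):

```csharp
using System;
using System.Linq;
using Microsoft.ComplexEventProcessing.Linq;

// Assumes: IQStreamable<Trade> tradeStream, with Trade exposing Code, Size, Price.
var statistics = from t in tradeStream
                 group t by t.Code into g
                 from w in g.TumblingWindow(TimeSpan.FromSeconds(1))
                 select new
                 {
                     Code = g.Key,
                     Volume = w.Sum(e => e.Size),           // total size traded
                     Value = w.Sum(e => e.Size * e.Price),  // total value traded
                     Trades = w.Count()                     // number of trades
                 };
```

Each one-second window emits one result event per stock code, which is what the stock-ticker dashboard consumes.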

Sink & Binding the Query

This is an output, bound to the query, that writes the results to a delivery platform such as the console window.  The Sink gets deployed to the Server with an allocated Name.  The Query has to be bound to the Sink and executed.  The observable query is bound to the observer output Sink and run in a process of the server.

Solution Components

Solution Components

The Solution Components comprise the WCF stock trade feed, which is converted from IEnumerable to a CEP stream (IQStreamable) point event.  An Observable is defined on the application, for the type of source, i.e. the Trade entity class (FIX-compliant Trade Capture Report), per time span, using the boiler-plate code of the HelloWorld Codeplex example.

The code below defines the observable, calling LiveTradeCaptureReports() that creates the IObservable<T> from the source, i.e. the input data from the WCF service.


        static IQStreamable<trade.TradeCaptureReportEnrichment> CreateStream(Application application)
        {
            uint cti = 1;

            DateTime startTime = DateTime.Parse("2013-01-03");
            var ats = new AdvanceTimeSettings(new AdvanceTimeGenerationSettings((uint)cti, TimeSpan.FromTicks(-1)), null, AdvanceTimePolicy.Drop);

            return application.DefineObservable(() => LiveTradeCaptureReports()).ToPointStreamable(
                r => PointEvent<trade.TradeCaptureReportEnrichment>.CreateInsert(DateTime.UtcNow, r), ats).Deploy("wcfTCRSource");
        }


The LiveTradeCaptureReports() method, with the ToObservableInterval method, are listed below, and are called by the CreateStream method that produces the IQStreamable source for the CEP LINQ query.


        private static IObservable<trade.TradeCaptureReportEnrichment> LiveTradeCaptureReports()
        {
            // Reference to the class that connects to WCF to get the IEnumerable source
            TradeCaptureReportSource tcrSource = new TradeCaptureReportSource();

            // Create the CEP stream source
            _tcrEnumerableSource = tcrSource.GetTradeCaptureReportEvents();

            return ToObservableInterval(_tcrEnumerableSource, TimeSpan.FromMilliseconds(100), Scheduler.ThreadPool);
        }

        private static IObservable<T> ToObservableInterval<T>(IEnumerable<T> source, TimeSpan period, IScheduler scheduler)
        {
            return Observable.Using(
                () => source.GetEnumerator(),
                it => Observable.Generate(
                    default(object),     // initial state (unused)
                    _ => it.MoveNext(),  // condition: advance the enumerator
                    _ => _,              // iterate: state unchanged
                    _ => it.Current,     // result selector: the current element
                    _ => period,         // time selector: one element per period
                    scheduler));
        }

I followed the style of JAhlen's stock ticker example (CodePlex), defining an output method that holds the LINQ query. It calls a display method where the Sink is defined, binding the output to the Sink, in this case the console window.

The DisplayPointResults method:


       static void DisplayPointResults<TPayload>(this Application app, IQStreamable<TPayload> resultStream)
        {
            var consoleObserver = app.DefineObserver(() => Observer.Create<PointEvent<TPayload>>(ConsoleWritePoint));

            var binding = resultStream.Bind(consoleObserver);
            using (binding.Run("ServerProcess"))
            {
                Console.WriteLine("***Hit Return to exit after viewing query output***");
                Console.ReadLine();
            }
        }

The ConsoleWritePoint method:


        static void ConsoleWritePoint<TPayload>(PointEvent<TPayload> e)
        {
            if (e.EventKind == EventKind.Insert)
                Console.WriteLine(string.Format(CultureInfo.InvariantCulture, "INSERT <{0}> {1}", e.StartTime.DateTime, e.Payload.ToString()));
            else
                Console.WriteLine(string.Format(CultureInfo.InvariantCulture, "CTI    <{0}>", e.StartTime.DateTime));
        }

The Program.cs file, with all the code together:


namespace HelloWorld
{
    using System;
    using System.ServiceModel;
    using System.Collections.Generic;
    using System.Reactive;
    using System.Reactive.Concurrency;
    using System.Reactive.Linq;
    using Microsoft.ComplexEventProcessing;
    using Microsoft.ComplexEventProcessing.Linq;
    using System.Globalization;
    using trade = Acme.Trade.Business.Entities;

    static class Program
    {
        // Global variables
        private static IEnumerable<trade.TradeCaptureReportEnrichment> _tcrEnumerableSource;
        private static IQStreamable<trade.TradeCaptureReportEnrichment> inputStream;

        static void Main(string[] args)
        {
            // The StreamInsight engine is a server that can be embedded (in-memory) or remote (e.g. the Azure Service).
            // We first use Server.Create to create a server instance and return a handle to that instance.
            using (Server server = Server.Create("STATISTICS"))
            {
                Application application = server.CreateApplication("RealTimeStatistics");

                inputStream = CreateStream(application);

                // First query output
                ConfirmedTradeCaptureReports(application, inputStream);
            }
        }

        private static IObservable<trade.TradeCaptureReportEnrichment> LiveTradeCaptureReports()
        {
            // Reference to the class that connects to the WCF service to get the IEnumerable source
            TradeCaptureReportSource tcrSource = new TradeCaptureReportSource();

            // Create the CEP stream
            _tcrEnumerableSource = tcrSource.GetTradeCaptureReportEvents();

            return ToObservableInterval(_tcrEnumerableSource, TimeSpan.FromMilliseconds(100), Scheduler.ThreadPool);
        }

        private static IObservable<T> ToObservableInterval<T>(IEnumerable<T> source, TimeSpan period, IScheduler scheduler)
        {
            // Pull one item from the enumerator per period, on the given scheduler
            return Observable.Using(
                () => source.GetEnumerator(),
                it => Observable.Generate(
                    default(object),
                    _ => it.MoveNext(),
                    _ => _,
                    _ => it.Current,
                    _ => period,
                    scheduler));
        }

        static IQStreamable<trade.TradeCaptureReportEnrichment> CreateStream(Application application)
        {
            uint cti = 1;

            DateTime startTime = DateTime.Parse("2013-01-03");
            var ats = new AdvanceTimeSettings(new AdvanceTimeGenerationSettings(cti, TimeSpan.FromTicks(-1)), null, AdvanceTimePolicy.Drop);

            return application.DefineObservable(() => LiveTradeCaptureReports()).ToPointStreamable(
                r => PointEvent<trade.TradeCaptureReportEnrichment>.CreateInsert(DateTime.UtcNow, r), ats).Deploy("wcfTCRSource");
        }

        static void ConsoleWritePoint<TPayload>(PointEvent<TPayload> e)
        {
            if (e.EventKind == EventKind.Insert)
                Console.WriteLine(string.Format(CultureInfo.InvariantCulture, "INSERT <{0}> {1}", e.StartTime.DateTime, e.Payload.ToString()));
            else
                Console.WriteLine(string.Format(CultureInfo.InvariantCulture, "CTI    <{0}>", e.StartTime.DateTime));
        }

        static void ConfirmedTradeCaptureReports(Application app, IQStreamable<trade.TradeCaptureReportEnrichment> inputStream)
        {
            var confirmedTCR = from tcr in inputStream
                               group tcr by new { tcr.Symbol, tcr.OBTradeSubTypeName } into TradeTypeByInstrument
                               from window in TradeTypeByInstrument.HoppingWindow(TimeSpan.FromSeconds(20), // Window Size
                                                                                 TimeSpan.FromSeconds(10)) // Hop Size
                               select new
                               {
                                   Symbol = TradeTypeByInstrument.Key,
                                   Volume = window.Sum(e => e.ExecutedSize),
                                   Value = window.Sum(e => e.TradeValue),
                                   Trades = window.Count()
                               };

            // Bind the query output to the console sink and run it
            app.DisplayPointResults(confirmedTCR);
        }

        static void DisplayPointResults<TPayload>(this Application app, IQStreamable<TPayload> resultStream)
        {
            var consoleObserver = app.DefineObserver(() => Observer.Create<PointEvent<TPayload>>(ConsoleWritePoint));

            var binding = resultStream.Bind(consoleObserver);
            using (binding.Run("ServerProcess"))
            {
                Console.WriteLine("***Hit Return to exit after viewing query output***");
                Console.ReadLine();
            }
        }
    }
}

No. — Author — Web Site Name — URL — Date Authored — Date Cited
1. Microsoft — Microsoft.com — http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/streaming-data.aspx — 2012 — 08 January 2013
2. Microsoft — Download Center — http://www.microsoft.com/en-us/download/details.aspx?id=30149 — 2012 — 08 January 2013
3. Microsoft — MSDN: Microsoft StreamInsight — http://msdn.microsoft.com/en-us/library/ee362541(v=sql.111).aspx — 2012 — 08 January 2013
4. The StreamInsight Team — Microsoft StreamInsight blog — http://blogs.msdn.com/b/streaminsight/archive/2012/08/01/a-hitchhiker-s-guide-to-streaminsight-2-1-queries.aspx — 1 August 2012 — 08 January 2013
5. Roman Schindlhauer — StreamInsight Team — http://sqlblog.com/blogs/stream_insight/default.aspx — n/a — 08 January 2013
6. SlySoft.com — Virtual Clone Drive — http://www.slysoft.com/en/virtual-clonedrive.html — n/a — 08 January 2013
7. Microsoft — MSDN: Microsoft StreamInsight – Installation — http://msdn.microsoft.com/en-us/library/ee378749(SQL.111).aspx — 2012 — 08 January 2013
8. Microsoft — StreamInsight Resiliency — http://msdn.microsoft.com/en-us/library/hh290501(v=sql.111).aspx — 2012 — 08 January 2013
9. Microsoft — StreamInsight Server Architecture — http://msdn.microsoft.com/en-us/library/ee391536(v=sql.111).aspx — 2012 — 08 January 2013
10. Microsoft — Publish and Connect to StreamInsight Server — http://msdn.microsoft.com/en-us/library/ff518487(v=sql.111).aspx — 2012 — 08 January 2013
11. Microsoft — Using Event Sources and Event Sinks (StreamInsight) — http://msdn.microsoft.com/en-us/library/gg144958(SQL.111).aspx — 2012 — 09 January 2013
12. Jeffery van Gogh — Writing your first Rx application — http://channel9.msdn.com/blogs/j.van.gogh/writing-your-first-rx-application — 17 November 2009 — 09 January 2013
13. FIX Protocol — What is FIX — http://www.fixprotocol.org/what-is-fix.shtml — n/a — 09 January 2013
14. StreamInsight Team — Introduction to LINQPad driver for StreamInsight 2.1 — http://sqlblog.com/blogs/stream_insight/archive/2012/06/26/introduction-to-linqpad-driver-for-streaminsight-2-1.aspx — 26 June 2012 — 18 January 2013
15. LinqPad.net — LinqPad.net — http://www.linqpad.net/ — n/a — 18 January 2013

Last Updated: 21 January 2013 17h18