Using Big Data: Interoperability Issues
Deriving meaning quickly from huge quantities of data has been the objective of enterprises since the inception of databases. Data visualisation makes large and complex data sets understandable and enables domain experts to spot underlying trends and patterns. Companies want to learn of signs and patterns in their data that not only warn them of hazards, such as fraud or production delays, but also provide insight into opportunities, such as the treatment of diseases or the use of different logistics approaches. Traditionally, business intelligence required static reports created from an underlying data warehouse, which comprised specifically formatted data in specialised data tables. Now, with a proper visualisation tool on top of big data, the business analytics process may be shortcut: stakeholders use a visualisation tool to model their data requirements, see relationships, and spot patterns that analytics professionals might be missing (Linthicum, 2013).
Hadoop has become the preferred platform for processing large quantities of data. These large quantities may also include structured data that cannot otherwise be analysed in a reasonable length of time. The better information is, and the more routinely it is put in the hands of those who need it, the higher the intelligence derived from it. Hadoop is, however, a batch-oriented system and cannot (as yet) answer simple queries with a good turnaround time (Hortonworks, 2013). What Hadoop does handle is data that traditional relational databases, data warehouses, and other analytic platforms have been unable to manage effectively, including user-generated data from social media and machine-generated data from sensors, appliances, and applications; that is, all forms of data. Hadoop does this through a distributed file system, called the Hadoop Distributed File System (HDFS), which provides portability, scalability (over commodity hardware) and reliability, and through MapReduce processing for parallel distributed systems (Srivastava, 2011). “Once data exceeds the capabilities of a single computer, we face the problem of distributing its storage and processing.” (Downey, 2011).
The Open Group describes the need for Boundaryless Information Flow (III-RM). Boundaryless Information Flow is essentially about getting information to the right people at the right time in a secure, reliable manner, in order to support the operations that are core to the extended enterprise. The boundaryless organisation does not imply that there are no boundaries in the flow of information, only that they should be made permeable or penetrable. However, the information systems in place today do not allow information to flow in support of the boundaryless organisation; when they do, the result will be a boundaryless information flow. In 2001 The Open Group published the Interoperable Enterprise Business Scenario, which crystallises this need for boundaryless information flow and describes how the need drives information customers’ deployment of their information systems. A key driver is to “create a worldwide market for interoperable information products supporting access to integrated information, in which all stakeholders’ needs are addressed.” (The Open Group, 2002).
The lack of the right information for the right person at the right time is equally a lack of interoperability, and it prevents organisations from achieving their business objectives. Information as an asset has, in principle, moved past the point of departure “that information technology merely supports or enables the business – increasingly, information is the business.” (The Open Group, 2002). To address the lack of interoperability between systems that provide access to data is to attend to the boundaryless information flow. Technology infrastructure is used to facilitate interoperability, either through commercial off-the-shelf (COTS) appliances or through a collection of best-of-breed integrations, to reduce interoperability issues.
The Open Group (The Open Group, 2002) identifies high-level technical requirements for a boundaryless information flow architecture:
- Openness – standards-based open interfaces need to exist for critical interfaces to enable interoperability more easily and support integration of components, for areas:
  - Web portals
  - Storage connections
- Data Integrity – the same data stored in multiple systems must be conformed to mean the same thing throughout
- Availability – must be guaranteed and where maintenance requires downtime, a fall-back must be available
- Security – support different security levels and protection based on the sensitivity and use of the data, as well as the normal requirements for data integrity, confidentiality, logging and tracking. Must provide flexible policy management and strong authentication providing users with a single sign-on and user administration.
- Accessibility – solutions must provide global accessibility to the information without compromising security
- Manageability – solutions must be manageable
- Internationalisation and Localisation – accommodate multiple languages and adapt to culture where deployed internationally
Standardisation of an organisation’s information technology to achieve a boundaryless information flow must be measurable in terms of business value, improved business operations and effectiveness, and an improved Information Technology capability in the organisation. Such standardisation should:
- Drive revenue growth
- Lower Information Technology expenditure
- Increase % of procurements against standards
- Reduce spend on customisations
- Improve cycle time for rolling out upgrades
In the Business Intelligence sphere, a few key trends are identified with the emerging use of big data (Linthicum, 2013):
- The ability to leverage both structured and unstructured data, and to visualise that data
- The ability to imply structure at the time of analysis, decoupling the analysis from the underlying structure of the data (be it structured or unstructured) to provide flexibility
- The ability to leverage current or near-real-time data for critical information
- The ability for BI analysts to mash up data from outside the organisation (such as the cloud) to enhance the analysis process
- The ability to bind data analytics to business processes and applications to allow issue resolution without human intervention, known as embedded analytics
An organisation should seek to support the boundaryless information flow idea in any strategy it evaluates for information integration. Hadoop, being a framework in support of Big Data, i.e. one that provisions results and data from large heterogeneous data sources to analytics tools or data stores, is a technology to be considered in support of the boundaryless information flow.
Business Intelligence and Big Data
Big data analytics has a potential that is at odds with the traditional Business Intelligence approach. To force the existing paradigm on big data analytics is to miss the full potential of this technology, or to fail altogether (Linthicum, 2013).
In the previous two decades, calculating summaries, sums and averages was sufficient, but large, complex data requires new techniques. Recognising customer preferences requires the analysis of purchase history, browsing behaviour, products viewed, comments, suggestions and complaints in order to predict future behaviour (Cloudera, 2013).
Hadoop, being able to work on Big Data, structured and/or unstructured (heterogeneous sources), is ideally positioned for provisioning data for Business Intelligence. Problems such as advertisement targeting, recommendation engines (predicting customer preferences), customer churn (attrition) analysis, risk modelling, point-of-sale transaction analysis, analysing network data to predict failures, threat analysis, and trade surveillance can be solved with Hadoop (Cloudera, 2013).
The challenge with Big Data is transforming it into intelligent data. Because Hadoop is primarily a batch-orientated processing system, it is very suitable for data analytics (Payandeh, 2013). It is the crowbar (no pun intended) that wrests open the capacious data lake to parse, cleanse, apply structure to and transform the data, and arrange it for consumption by analytics applications.
The most popular spearhead of Big Data, right now, appears to be Hadoop (Kernochan, 2011). Hadoop is open-source software for reliable, scalable, distributed computing on large data sets across clusters of computers using programming models, designed to scale from a single machine to thousands of machines (Welcome to Apache™ Hadoop®!, 2013). It is a framework that supports data-intensive distributed applications running in parallel on large clusters of commodity hardware, and it provides both reliability and data motion to applications. By using MapReduce (dividing the workload into smaller fragments and parallelising the effort over multiple servers) and a distributed file system that stores data on the compute nodes, it provides a very high aggregate bandwidth across the cluster, as well as automatic handling of failures by the framework. “It enables applications to work with thousands of computational-independent computers and petabytes of data.” (WikiPedia, 2013).
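The division of work that MapReduce performs can be illustrated with a minimal, single-process Python sketch. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample fragments are assumptions chosen for illustration, and the fragments are processed one after another rather than on separate servers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(fragment):
    """Map: emit a (word, 1) pair for every word in one input fragment."""
    return [(word.lower(), 1) for word in fragment.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(fragments):
    # On a cluster each fragment would be mapped on a different node;
    # in this sketch the fragments are simply processed in sequence.
    mapped = chain.from_iterable(map_phase(f) for f in fragments)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input, already split into three fragments.
    print(word_count(["big data big clusters", "data lake", "big lake"]))
```

Because each call to map_phase depends only on its own fragment, the map step is the part that a framework can spread over many machines; only the shuffle needs to bring related keys together.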
Hadoop is also more cost-effective than traditional analytics platforms such as data warehouses, and it moves analytics closer to where the data is. It may commonly be used for optimising internal data management operations, but it is rapidly gaining ground as a strategic big data analytics platform (Big Data Analytics: Unleashing the Power of Hadoop, 2013).
The Apache Hadoop platform comprises:
- Common – common utilities that support the Hadoop modules
- MapReduce – YARN-based system for parallel processing of large data sets
- Distributed File System (HDFS) – distributed file system for high throughput access to application data
- YARN – framework for job scheduling and cluster resource management
Other related projects are:
- Ambari – web-based tool for provisioning, managing, and monitoring Hadoop clusters, with dashboard
- Avro – data serialisation system
- Cassandra – scalable multi-master database with no single points of failure
- Chukwa – data collection system for managing distributed systems
- HBase – scalable, distributed database that supports structured data storage for large tables
- Hive – a data warehouse infrastructure that provides data summarisation and ad hoc querying
- Pig – high level data-flow language and execution framework for parallel computation
- ZooKeeper – high-performance coordination service for distributed applications
The Hadoop Distributed File System (HDFS) can store very large files, and very many of them, scaling across commodity nodes and breaking down enterprise data silos. This common data pool is referred to as a “data lake” (Gualtieri, 2013). These “data lakes” are queried by MapReduce, a distributed data processing framework, in a highly parallel fashion.
Dealing only with files on disk essentially means that the content lacks structure, so data maps are applied over the unstructured content to define its core metadata. For example, Hive provides a mechanism to project structure onto the data, allowing the data to be queried with a SQL-like language called HiveQL, while Pig offers a data-flow language that is executed as MapReduce jobs. A BI tool will thus obtain file metadata from the cluster (serving both structured and unstructured sources), use MapReduce to retrieve the required data from the data nodes into that structure, in parallel, and return it to a BI visualisation tool, e.g. Tableau, or pass it into other types of processing and specific data structures, e.g. a dimensional model for complex analytical processing (Linthicum, 2013).
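The idea of projecting structure onto raw files at the time of reading can be sketched in a few lines of Python. This is only an illustration of the principle, not Hive or HiveQL: the raw content, the column names and types in SCHEMA, and the helper functions are all hypothetical.

```python
import csv
import io

# Hypothetical raw file content as it might sit on disk: no header, no types.
RAW = "2013-10-01,alice,19.99\n2013-10-01,bob,5.00\n2013-10-02,alice,7.50\n"

# Hypothetical "data map": column names and types projected at read time.
SCHEMA = [("day", str), ("customer", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Yield one typed dict per raw line, applying the schema while reading."""
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}


def total_by(records, key, measure):
    """Aggregate a measure per key, roughly a GROUP BY with SUM."""
    totals = {}
    for rec in records:
        totals[rec[key]] = totals.get(rec[key], 0.0) + rec[measure]
    return totals


if __name__ == "__main__":
    records = read_with_schema(RAW, SCHEMA)
    print(total_by(records, "customer", "amount"))
```

The stored text never changes; a different SCHEMA could be applied to the same bytes for a different analysis, which is the flexibility that decoupling structure from storage provides.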
Big data is usually only processed periodically, e.g. daily or weekly. During processing, the load is dispersed over multiple processing nodes to speed it up, but after the query has executed those nodes are no longer required until the next run. This is a perfect expansion-contraction pattern. To reduce cost and maximise hardware utilisation, it would be a boon to have the fabric dynamically provision the nodes in the grid to the jobs that are executing, and so provide an agile infrastructure. Hadoop should be thought of as a distributed operating system, with many computers storing and processing data as if they formed a single machine (Ken Krugler). Adding to this the ability to host Hadoop in the cloud, to achieve an agile infrastructure, sees efficiency maximised and costs reduced for a number of use cases (Downey, 2011).
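The effect of the expansion-contraction pattern on hardware utilisation can be made concrete with some simple arithmetic. The figures below (20 nodes, one 3-hour job per day, a 7-day week) are purely hypothetical and are not taken from any of the cited sources; the function names are likewise illustrative.

```python
def node_hours_dedicated(nodes, hours_in_period):
    """Node-hours consumed when nodes stay allocated for the whole period."""
    return nodes * hours_in_period


def node_hours_on_demand(nodes, job_hours, runs_in_period):
    """Node-hours consumed when nodes are provisioned only while jobs run."""
    return nodes * job_hours * runs_in_period


def utilisation(used, allocated):
    """Fraction of the allocated node-hours that did useful work."""
    return used / allocated


if __name__ == "__main__":
    # Hypothetical workload: 20 nodes, one 3-hour job per day, for one week.
    dedicated = node_hours_dedicated(20, 7 * 24)
    on_demand = node_hours_on_demand(20, 3, 7)
    print(dedicated, on_demand, utilisation(on_demand, dedicated))
```

Under these assumed figures a permanently allocated cluster would be busy for only one eighth of the time it is paid for, which is the gap that dynamic provisioning aims to close.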
Hadoop is essentially used for (Julio, 2009):
- Log processing
- Recommendation systems
- Video and Image analysis
- Data Retention
Visualisation of Data and Hadoop
Hadoop is supported by data virtualisation technologies such as Informatica, Composite Server, and Denodo. Visualisation technologies such as Tableau can connect to multiple flavours of Hadoop, bring data in-memory, and perform fast ad hoc visualisations to find patterns and outliers in all the data in the Hadoop cluster. Tableau is described as a very elegant solution that obviates the need to move huge data into a relational store before analysing it (Tableau, 2013). Tableau uses ODBC to connect to the Cloudera Connector for Tableau, which uses HiveQL to connect to Hive and MapReduce over HDFS and HBase (Popescu, 2012).
A visualisation tool (in this example Tableau) is connected to Hadoop and used to build visualisations just as when connected to a traditional database. Tableau achieves “on-the-fly ETL” by connecting to Hive with Custom SQL, allowing the batch-orientated nature of Hadoop to handle layers of analytical queries on top of complex Custom SQL with only an incremental increase in time.
Although Hadoop is essentially a batch-orientated system, there are a number of techniques for improving the performance of visualisations and dashboards built from data stored on a Hadoop cluster. Custom SQL can limit the size of the data set, to speed up exploring a new data set and building initial visualisations. Data can also be extracted for Tableau: although Tableau is capable of handling large sets of data, it is nowhere near as well positioned as Hadoop to handle really large, or Big Data, loads. An extract can be pre-formatted for Tableau by using Hadoop to roll up dates, hide unused fields, and aggregate visible dimensions, creating a broader view of the data. Hive can, for example, partition records on disk and allow Tableau to direct a query aligned to these partitions, greatly reducing the time spent getting to the right data (Hortonworks, 2013).
Why is Hadoop most typically the technology underpinning “Big Data”? Data comes from a set of data sources, e.g. CRM, ERP, and custom applications. This data has to be queried somehow, whether via an elaborate point-to-point integration, a virtual data store, a data warehouse, or even a Massively Parallel Processing (MPP) system, e.g. Microsoft’s Parallel Data Warehouse. The application doing the query is probably some high-end business intelligence tool, e.g. SAS. As other unstructured, machine-generated or social-origin data are added, the data integration becomes more complex.
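Why a query aligned to partitions saves time can be shown with a small Python sketch. It models the idea only and does not use Hive: the records, the day and sales field names, and both helper functions are assumptions made for illustration.

```python
from collections import defaultdict


def partition_by(records, key):
    """Group records into partitions keyed by one field, e.g. the day."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[rec[key]].append(rec)
    return partitions


def query(partitions, wanted_keys, measure):
    """Sum a measure while reading only the partitions the query names.

    Returns the total and the number of records that had to be scanned.
    """
    total = 0
    scanned = 0
    for k in wanted_keys:
        for rec in partitions.get(k, []):
            scanned += 1
            total += rec[measure]
    return total, scanned


if __name__ == "__main__":
    # Hypothetical sales records spread over three days.
    records = [
        {"day": "2013-10-01", "sales": 10},
        {"day": "2013-10-01", "sales": 5},
        {"day": "2013-10-02", "sales": 7},
        {"day": "2013-10-03", "sales": 3},
    ]
    parts = partition_by(records, "day")
    print(query(parts, ["2013-10-02"], "sales"))
```

A query for a single day touches one record instead of four in this toy data set; on a cluster the same alignment means whole directories of files are never read.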
Hadoop is not replacing traditional data systems, but is complementary to them (Hortonworks, Apache Hadoop Patterns of Use, 2013).
The traditional Enterprise Data Architecture is indicated inside the red-marked area, with the Hadoop capability added alongside it. The specific use of Hadoop is directed by a few use-cases or patterns.
When analytic challenges are observed at the infrastructure level, clear patterns of use emerge. They fall into the following three patterns (Hortonworks, Apache Hadoop Patterns of Use, 2013):
- Hadoop as a Data Refinery
- Data Exploration with Hadoop
- Application Enrichment
Hadoop as a Data Refinery
This pattern refines data from multiple sources for use in commonly used analytics applications, e.g. providing a central view of the Customer from all the data about them in the ERP, CRM or bespoke systems. The pattern is especially powerful where Customer data from web sessions on the enterprise’s web site are incorporated, to see what customers have accessed and what they are interested in.
This pattern uses Hadoop to parse, cleanse, apply structure to and transform such data, and then push it into an analytics store for use with existing analytics tools. Hadoop acts as a type of staging environment between the source and the analytics tools, with Hadoop being the transformation agent.
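The four refinery steps named above (parse, cleanse, apply structure, transform) can be sketched as a small Python pipeline over web-session log lines. It runs on a single machine and is not Hadoop code; the log layout, the use of "-" for anonymous visitors, and all function names are assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical log layout: "<timestamp> <customer id> <page path>"
LINE = re.compile(r"^(\S+)\s+(\S+)\s+(/\S*)$")


def parse(lines):
    """Parse: keep only lines that match the expected layout."""
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            yield match.groups()


def cleanse(rows):
    """Cleanse: drop anonymous visitors ("-") and normalise customer ids."""
    for ts, customer, page in rows:
        if customer != "-":
            yield ts, customer.lower(), page


def structure(rows):
    """Apply structure: turn positional tuples into named fields."""
    for ts, customer, page in rows:
        yield {"ts": ts, "customer": customer, "page": page}


def transform(records):
    """Transform: one entry per customer with page-view counts."""
    views = {}
    for rec in records:
        views.setdefault(rec["customer"], Counter())[rec["page"]] += 1
    return views


def refine(lines):
    """Run the four steps in order and return the refined view."""
    return transform(structure(cleanse(parse(lines))))


if __name__ == "__main__":
    sample = [
        "2013-10-01T10:00 C42 /products/tv",
        "not a log line",
        "2013-10-01T10:05 - /home",
        "2013-10-01T10:07 c42 /products/tv",
        "2013-10-01T10:09 C7 /support",
    ]
    print(refine(sample))
```

The refined per-customer view is what would be pushed into the analytics store; the raw lines stay where they are.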
Data Exploration with Hadoop
Hadoop is used directly on the data to explore it. It still plays the role of parser, cleanser, structurer and transformer, but it provisions the data directly to visualisation tools, e.g. Tableau (Tableau, 2013). Financial institutions can use this pattern to discover identity fraud, i.e. to perform some kind of surveillance activity.
Application Enrichment
Data stored in Hadoop is used to influence an application’s behaviour, e.g. a web application using a user’s access patterns to tailor its presentation to the returning user’s usage characteristics. Hadoop parses, cleanses, applies structure to and transforms the data and passes it directly into the application. This is an ability used by the large web companies of the world, such as Yahoo and Facebook. The right data is delivered at the right time to the right customer. Massive numbers of patterns and repeatable behaviours are identified in order to customise the user’s web-application experience by serving the right content to the right person at the right time, increasing the conversion rates for purchases.
 Too big for Excel is not Big Data! (Chris, 2013). The New York Stock Exchange generates about 1 TB of new trading data per day, Facebook hosts about 10 billion photos (1 petabyte), Ancestry.com stores 2.5 petabytes of genealogy data, and the Internet Archive stores around 2 petabytes, growing at a rate of 20 TB per month (Seminar Report Downloads, 2012).
 Hadoop was first conceived as a web search engine for Yahoo (Big Data Analytics: Unleashing the Power of Hadoop, 2013).
 A (subjective) figure of twenty terabytes is cited as a tipping point for considering Hadoop (Chris, 2013, citing “Rama Ramasamy”).
 There are projects that provide real-time capabilities to applications for results from Hadoop, but it is primarily a batch-orientated processing solution (HandsonERP & Mir, 2013).
 LZO or Snappy compression can be applied to HDFS files to reduce the bytes read or written, thus improving the efficiency of network bandwidth and disk space (Julio, 2009).
 Commodity hardware: easy to obtain, available and affordable hardware that is not unique in any function.
 Interoperability: the ability of two or more entities or components to exchange information and to use the information that has been exchanged “to meet a defined mission or objective”; in this scenario this term specifically relates to access to information and the infrastructure that supports it (The Open Group, 2002).
 Integration: the process of combining components into an effective overall system. In this scenario the phrase “access to integrated information” is used repeatedly. The term “integrated information” means an overall system in which a single view of information is presented, and is achieved by combining various information sources (The Open Group, 2002).
 By 2015, 50% of Enterprise data will be processed by Hadoop (HandsonERP & Mir, 2013).
 A Hadoop cluster can run alongside a trading system, obtaining copies of the trading data. Coupled with the reference data of the parties to the trade, the system can continually monitor trade activity to build relationships between people and organisations trading with each other, watching for patterns that reflect rogue trading (Cloudera, 2013).
 MapReduce is a programming model for processing large data sets with a parallel and distributed algorithm on a cluster, doing filtering and sorting on the distributed servers, running various tasks in parallel, managing all communications and data transfers between the various parts of the system, providing for redundancy and fault tolerance and overall management of the whole process (WikiPedia, 2013).
 The progress estimator for MapReduce in Hadoop is simplistic, and often produces inaccurate estimates.
 Collapsing data silos creates a pro-BI database on which to position BI and visualisation tools.
 To use Hortonworks’ online Hadoop data source for Tableau, use Tableau’s Hortonworks Hadoop Hive connection to connect to 192.168.56.101 (the Sandbox VM) and start visualising data from Hadoop (Hortonworks, HOW TO: Connect Tableau to Hortonworks Sandbox, 2013). Get the driver from here: http://www.tableausoftware.com/drivers.
 Through a partnership with Cloudera, Tableau Software has built a Hadoop connector for doing big data visualisation (Williams, 2012). Also see the preceding note (footnote 14) on how to use the Hortonworks online Hadoop data source for Tableau.
 Hadoop is not yet capable of answering simple queries with very quick turnaround (Hortonworks, 2013).
 Hadoop query cancellation is not straightforward. Queries issued by Tableau can only be abandoned; the query continues on the Hadoop cluster, consuming resources (Hortonworks, 2013).
 Large quantities of data are distilled to something more manageable (Hortonworks, Apache Hadoop Patterns of Use, 2013).
Big Data Analytics: Unleashing the Power of Hadoop. (2013, August 21). Retrieved from Database Trends and Applications: http://www.dbta.com/Articles/Editorial/Trends-and-Applications/Big-Data-Analytics-Unleashing-the-Power-of-Hadoop-91261.aspx
Chris, S. (2013, September 16). Don’t use Hadoop – your data isn’t that big. Retrieved from Chris Stucchio: http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html
Cloudera. (2013). TEN COMMON . Chicago: Cloudera. Retrieved from http://www.cloudera.com/content/dam/cloudera/Resources/PDF/cloudera_White_Paper_Ten_Hadoopable_Problems_Real_World_Use_Cases.pdf
Downey, J. (2011, July 1). Big Data and Hadoop: A Cloud Use Case. Retrieved from Jim Downey: http://jimdowney.net/2011/07/01/big-data-and-hadoop-a-cloud-use-case/
Gualtieri, M. (2013, October 23). 5 Reasons Hadoop Is Kicking Can and Taking Names. Retrieved from Information Management: http://www.information-management.com/blogs/5-reasons-hadoop-is-kicking-can-and-taking-names-10024986-1.html
HandsonERP, & Mir, H. (2013, April 5). Hadoop Tutorial 1 – What is Hadoop? Retrieved from YouTube: http://www.youtube.com/watch?v=xWgdny19yQ4
Hortonworks. (2013). Apache Hadoop Patterns of Use. Chicago: Hortonworks. Retrieved from http://hortonworks.com/wp-content/uploads/downloads/2013/04/Hortonworks.ApacheHadoopPatternsOfUse.v1.0.pdf
Hortonworks. (2013, August 26). Best Practices for Hadoop Data Analysis with Tableau and Hortonworks Data Platform. Retrieved from SlideShare: http://www.slideshare.net/hortonworks/best-practices-forhadoopdataanalysiswithtableauv10-25612601
Hortonworks. (2013, November 07). HOW TO: Connect Tableau to Hortonworks Sandbox. Retrieved from Hortonworks KB: http://hortonworks.com/kb/how-to-connect-tableau-to-hortonworks-sandbox/
Julio, P. (2009, October 19). What is Hadoop used for? Retrieved from SlideShare: http://www.slideshare.net/PhilippeJulio/hadoop-architecture
Kernochan, W. (2011, October 20). Big Data, MapReduce, Hadoop, NoSQL: The Relational Technology Behind the Curtain. Retrieved from enterpriseappstoday: http://www.enterpriseappstoday.com/business-intelligence/big-data-mapreduce-hadoop-nosql-2.html
Linthicum, D. (2013). Making Sense of Big Data. Chicago: IBM.
Payandeh, F. (2013, September 8). Hadoop vs. NoSql vs. Sql vs. NewSql By Example. Retrieved from Big Data Studio: http://bigdatastudio.com/2013/09/08/hadoop-vs-nosql-vs-sql-vs-newsql-by-example/
Popescu, A. (2012, February 8). Visualizing Hadoop data with Tableau Software and Cloudera Connector for Tableau. Retrieved from myNoSQL: http://nosql.mypopescu.com/post/17262685876/visualizing-hadoop-data-with-tableau-software-and
Seminar Report Downloads. (2012, February 06). Hadoop. Retrieved from Seminar Report Downloads: http://seminarreportdownloads.wordpress.com/tag/new-york-stock-exchange/
Srivastava, A. (2011, April 26). A Hadoop Use Case. Retrieved from Ankur-sa.blogspot.com: http://ankur-sa.blogspot.com/2011/04/hadoop-use-case.html
Tableau. (2013, October 25). Hadoop. Retrieved from Tableausoftware: http://www.tableausoftware.com/solutions/hadoop-analysis
The Open Group. (2002). Interoperable Enterprise Business Scenario. Chicago: The Open Group.
Welcome to Apache™ Hadoop®! (2013, October 24). Retrieved from hadoop.apache.org: http://hadoop.apache.org/
WikiPedia. (2013, October 23). Apache Hadoop. Retrieved from WikiPedia: http://en.wikipedia.org/wiki/Apache_Hadoop
WikiPedia. (2013, October 19). MapReduce. Retrieved from WikiPedia: http://en.wikipedia.org/wiki/MapReduce
Williams, A. (2012, November 9). Hadoop. Retrieved from Siliconangle: http://siliconangle.com/blog/2011/11/09/tableau-brings-data-visualization-to-hadoop/