1 Introduction
There is now an enormous quantity of data available in a wide variety of forms, and more is continuously being generated. Collectively known as “Big Data”, this includes the vast amount of data that organizations have accumulated internally as well as data from external sources. External sources include social media, publicly available data from government databases, and other data shared between organizations.
Big Data technologies were invented to store this vast amount of data and process it into usable “Smart” Information. The most commonly mentioned tool is Hadoop[^1], an open source framework written in Java that was developed largely at Yahoo and is maintained as an Apache Software Foundation project. It provides a way of storing and processing large data sets in parallel across clusters of commodity computing hardware.
Big Data requires more flexibility than is provided by conventional relational databases. This has led to the growth of the so-called “NoSQL” type of database. These databases use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. They are often optimized for append and retrieval operations.
What is common across these technologies is that their initial aims focused on data processing capabilities rather than security and compliance. One particular concern has been the lack of control over identity and access, especially in the area of administration. This was acceptable when the tools were confined to experimental or small-scale usage. Now that they are being widely deployed for commercial applications, it is no longer satisfactory.
Organizations using Big Data need to remain compliant with a wide range of laws and regulations. Big Data can be misused through abuse of privilege; curiosity may lead to unauthorized access and information may be deliberately leaked. Mistakes can also lead to disclosure of sensitive information and incorrect analysis can lead to incorrect or inappropriate conclusions.
As described in KuppingerCole Advisory Note: “Big Data Security, Governance, Stewardship” - 71017, an information-centric approach to Big Data is needed to ensure:
- Availability: individuals are able to access the Big Data and Smart Information they need to perform their business functions when and where they need it, and without delay.
- Integrity: individuals are only able to manipulate Big Data (create, change or delete) in ways that are authorized.
- Confidentiality: Big Data and Smart Information can only be accessed by authorized individuals, and those individuals are not able to pass data to which they have legitimate access on to others who are not authorized.
This can only be achieved with appropriate identity and access management for the Big Data technologies.