Database technologies have never been as varied as they are today. After the rise of relational SQL-based databases, new needs emerged, fostering a plethora of new database models, e.g. the NoSQL ones. Today, researchers are proposing yet newer models, such as the emerging NewSQL databases. In this article, we recap the talk delivered by Miguel Ángel Fajardo, VP of Technology at Geoblink, during Codemotion Rome 2019. Fajardo provided an interesting and detailed overview of the evolution of database technologies from the ‘60s to today, finally introducing NewSQL databases.
The beginning of database technologies
The first significant class of technologies explicitly designed to support data management is traditionally considered to be the Information Management System (IMS). It was originally developed by IBM to support the Apollo program and the Saturn V moon rocket, and it consisted of a very simple way of structuring data, based on a tree-shaped hierarchy.
The main problem with this idea lay in the limitations of the structure itself: trees do not allow cycles, and consequently it was not possible (or at least pretty difficult) to express more complex connections among the different pieces of data stored inside such a database.
To overcome such limitations, in 1970 Ted Codd published a paper to present his original idea of a relational model of data, which laid the foundations of many of the modern databases.
In the relational model, data is divided into tables, which in turn can be connected by means of relations (which gave the name to the whole model).
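As a minimal sketch of the relational model in action, the example below uses Python's built-in sqlite3 module; the table and column names are invented for illustration only:

```python
import sqlite3

# In-memory database with two tables connected by a foreign-key relation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id)
    );
    INSERT INTO authors VALUES (1, 'E. F. Codd');
    INSERT INTO books VALUES (1, 'A Relational Model of Data', 1);
""")

# A JOIN follows the relation between the two tables.
row = conn.execute("""
    SELECT books.title, authors.name
    FROM books JOIN authors ON books.author_id = authors.id
""").fetchone()
print(row)  # ('A Relational Model of Data', 'E. F. Codd')
```

The foreign key `author_id` is what ties the two tables together: queries can traverse it in either direction without the data itself being duplicated.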
On top of this model, between the ‘80s and ‘90s a lot of technologies were built around the relational database. In particular, we witnessed an impressive number of relational database management systems (RDBMS), including the well-known MySQL, Oracle, PostgreSQL, DB2, SQL Server and many others. All these technologies implemented the original relational database model, improving it in several notable ways:
- relational databases allow for easy data modelling
- a high number of clients have been developed for interacting with such databases, which in turn contributed to the wide adoption of the model - and means a big community of users
- RDBMS implement ACID transactions. This means that every transaction is: atomic (transactions are either success or failure), consistent (only valid data is saved), isolated (concurrent transactions do not affect each other), durable (committed data is persisted on disk)
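Atomicity in particular can be demonstrated with sqlite3, whose connection context manager commits a transaction on success and rolls it back on any exception (the account names and the simulated failure below are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'"
        )
        # Simulated crash before the matching credit to bob is applied:
        raise RuntimeError("crash mid-transaction")
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0} — the partial debit was undone
```

Because the transaction is atomic, the database never exposes a state in which money has left one account without arriving in the other.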
Web 2.0 and the emergence of NoSQL
These features were immediately perceived as tremendously useful in the ‘90s. The 2000s, however, brought the era of Web 2.0, with several new needs: very large volumes of data (in the order of petabytes) need to be managed, and systems must scale. The availability of faster networks and devices is also an important lever to exploit.
In this context, the problem of scaling ACID transactions came up as one of the main difficulties to overcome. In particular, the need for horizontal scaling, with the consequent distribution of data across different physical machines, makes it hard to implement such transactions reliably and effectively. Moreover, there is also the need to perform analytics over petabytes of data and to provide high availability across different regions.
To address these needs of big data, two different classes of approaches have been put in place: distributed processing (represented by technologies such as Hadoop, Flink, Apache Spark and many others) and NoSQL databases. The latter is a set of quite varied classes of databases, with the common feature of being non-relational - indeed NoSQL has often been read as an acronym for Non-SQL, or Not Only SQL.
NoSQL databases can be divided into the following classes:
- key-value stores: in such databases, data is represented as part of a big “dictionary”, stored in the form of key-value entries. These databases are commonly used for user session data, component configuration, or to provide fast access by caching data. However, queries can be very complex to build in some cases, due to the specific structure of the DB. Examples of key-value stores are Redis, Amazon DynamoDB and Riak
- column-oriented DBs: this class of databases is particularly suited for real-time analytics, since it allows us to aggregate a lot of data easily and effectively. However, column-oriented DBs do not perform well for queries against a few rows. Moreover, they do not allow for flexible data schemas. Cassandra, Hypertable and Apache HBase are probably the most representative instances of column-oriented databases
- document-oriented DBs: these databases store data in the form of documents, which are often grouped into sets known as collections. Documents allow for storing unstructured data and for representing models with many layers. The drawback in this case is the difficulty of joining such heterogeneous data; for similar reasons, queries are generally not very flexible. The most representative database of this class is of course MongoDB, although there are other examples, such as Couchbase or RavenDB
- graph DBs: these databases are particularly useful when the natural connections among data can be mapped onto a graph structure. This is the case for social network data, or for cases where routing algorithms or fraud/disease-spreading detection need to be run. The main drawbacks of this approach are the difficulties in aggregating data, as well as in performing analytics. Examples of graph databases are Neo4j, ArangoDB and OrientDB
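The key-value idea in particular is simple enough to sketch in a few lines. Below is a toy in-memory store with optional expiry, loosely in the spirit of Redis-style session caching; the class and its API are invented for illustration, not a real client:

```python
import time

class KVStore:
    """Toy in-memory key-value store with optional per-key TTL."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expires)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires = entry
        if expires is not None and time.monotonic() > expires:
            del self._data[key]  # lazily evict expired keys on access
            return default
        return value

store = KVStore()
store.set("session:42", {"user": "ada"}, ttl=1800)
print(store.get("session:42"))  # {'user': 'ada'}
```

Lookups are a single dictionary access, which is why this model is so fast for session and cache workloads - and also why arbitrary queries over values are awkward: the store only ever sees opaque keys.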
In general, the advantages of the NoSQL model are easy scalability, high availability across different data centers, and support for different data types and analytics. However, NoSQL databases also have drawbacks: they are generally more complex and expensive to maintain, require a specific setup for the given set of queries, often rely on specific query languages and are fundamentally not transactional.
In addition to the above, there is another issue that applies to any distributed database: the CAP theorem. Formulated by Eric Brewer, it states that it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees:
- consistency: every read receives the most recent write or an error
- availability: every request receives a (non-error) response – without the guarantee that it contains the most recent write
- partition tolerance: the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
To overcome this limitation, in 2012 a research team from Google proposed a new class of relational databases which is currently known as NewSQL. These databases seek to provide the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system. NewSQL databases are also horizontally scalable and allow for high availability.
The mechanisms that provide all these features in a single database are quite complicated, and each implementation differs from the others. Basically, such systems employ a distributed cluster of shared-nothing nodes, in which each node manages a subset of the data. This requires components responsible for concurrency control, flow control and distributed query processing. Of course, the distributed nature of NewSQL databases also requires the adoption of proper consensus protocols, as well as time synchronisation (e.g. for preserving transaction order).
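The idea of each node managing a subset of the data can be sketched with simple hash-based sharding: every key is deterministically routed to one node of the cluster. This is an illustrative toy (the node names are invented); real NewSQL systems layer range-based sharding, rebalancing and consensus protocols on top of this basic idea:

```python
import hashlib

# A fixed set of shared-nothing nodes; each will own a subset of the keys.
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str) -> str:
    """Deterministically map a key to the node responsible for it."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# The same key always routes to the same node, so any coordinator
# can locate data without a central lookup table.
assert shard_for("user:1001") == shard_for("user:1001")

placement = {k: shard_for(k) for k in ("user:1", "user:2", "order:7")}
print(placement)
```

One known weakness of this naive modulo scheme is that adding or removing a node remaps almost every key, which is why production systems prefer consistent hashing or range partitioning with explicit rebalancing.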
There are many implementations of this class of databases. Among them, it is worth mentioning Google Spanner, Amazon Aurora, CockroachDB, Azure Cosmos DB, MemSQL and VoltDB.