Abstraction vs. Performance

Programming is hard. There is a lot of complexity that needs to be handled. To hide at least some of this complexity, people come up with abstractions. SQL is an example of an abstraction – as the example after this list shows, it provides a way to process data without worrying about all sorts of details like:

  • How to find the relevant data on disk or in memory?
  • In what order to process it?
  • Which join algorithm to use?
  • Which join order to use?
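
For instance, a query like the following (table and column names made up for illustration) states only what result is wanted:

    SELECT c.name, SUM(o.amount)
    FROM   customers c
    JOIN   orders o ON o.customer_id = c.id
    WHERE  o.order_date >= DATE '2015-01-01'
    GROUP  BY c.name;

Whether the database scans an index or the whole table, which join algorithm it runs (hash, merge, nested loops), and in what order – all of the questions above – is decided by the query optimizer, not by you.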

In distributed systems, the problems get even more complex:

  • How to partition/replicate the data across the machines?
  • How to make sure that the data is consistent?
  • How to split the computation across the machines?
  • How and when to move the data around the machines?

SQL shields you from this complexity. There is still a lot of other complexity left (and that’s why you should use Rax), but that’s another topic.
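
To give a feeling for how little of that complexity reaches the user: in Amazon Redshift, for example, spreading a table across the machines of a cluster comes down to a declaration like the sketch below (the table is, again, made up), after which every query against it stays plain SQL:

    CREATE TABLE orders (
        id          BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(10,2),
        order_date  DATE
    )
    DISTKEY (customer_id)  -- partition rows across nodes by customer
    SORTKEY (order_date);  -- keep rows sorted on disk by date

Where the data ends up, how it is replicated, and how a join gets shipped between nodes remains the planner’s problem, not yours.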

The problem with abstractions, however, is that they can incur a performance penalty. Pretending that a bunch of computers connected with network cables is a single database can cost a lot in communication and synchronization overhead. And what if the user doesn’t even care about some of these carefully crafted illusions? For example, not every user cares about strong data consistency. Not everybody needs serializability. It is therefore very tempting to give up the abstraction for the sake of performance. The whole NoSQL movement was born out of this temptation. The same holds for the rise of Hadoop and MapReduce in data-analysis applications. The good old SQL-based databases were not fast enough anymore.
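
As an aside, SQL databases do let you relax some of these guarantees without giving up the abstraction entirely. A minimal sketch using standard transaction isolation levels (PostgreSQL supports these; the accounts table is hypothetical):

    -- Full serializability: the safest illusion, but the database may
    -- need extra locking and validation work to maintain it
    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    COMMIT;

    -- Weaker isolation: fewer guarantees, less synchronization overhead
    BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED;
    SELECT SUM(balance) FROM accounts;
    COMMIT;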

However, the price for dumping the abstraction is that the analyst/programmer has to deal with a lot more complexity. That quickly starts to dampen the performance euphoria. Recently, the internet-of-things startup Revolv dumped a NoSQL system for good old PostgreSQL: they simply got overwhelmed by the complexity. At the same time, the good old SQL databases keep improving their performance and scalability. Analyzing data on a Hadoop cluster using MapReduce is therefore getting less and less popular. Now everybody wants to talk SQL again, and it’s becoming more and more attractive with the rise of SQL-on-Hadoop products such as Impala, Hawq, Hadapt, Drill, and Vectorwise (currently known as Actian Vector), or of cloud-based parallel databases such as Amazon’s Redshift or Snowflake. Soon everything will be back to normal: we will quit low-level hacking on Hadoop and go back to high-level languages like SQL, and even to high-level and easy-to-use languages like Rax 🙂