Welcome Snowflake

Snowflake Computing has recently emerged from stealth with a bold claim of having reinvented the data warehouse. Ease of use is their main motto, and I dare say they live up to this promise. Since one of Snowflake’s co-founders, Marcin Żukowski, is a good friend of mine, I got a chance to play with Snowflake even during their stealth period. As a result, we can now proudly present a new version of Rax: Rax/Snowflake.

To celebrate, I’ve decided to repeat my experiments with behavioral analysis on various SQL backends, this time also testing Snowflake. One of our customers kindly provided me with a dataset containing TV viewing moments. Unlike in my previous experiment, this time we are dealing with a real dataset, not a demo sample. The main table (TvExposuresWeighted) contains approximately 200,000,000 records. The customer currently uses SQL Server to analyze this data and is reasonably happy with its performance. In the near future, however, they will have to analyze a dataset 10 times larger, and they are worried whether their SQL Server will be able to handle it. They are curious about the performance of their analyses on different SQL engines, but they don’t have the time and resources to port their SQL queries.
This is a scenario where Rax comes in particularly handy. Writing a TV reach analysis in Rax is a matter of 30 minutes, thanks to Rax’s expressive power and convenient temporal operators. After that, I can run the analysis on all SQL backends supported by Rax without any changes to the script. This is the Rax script that computes the reach of all TV channels:

// Set the period of interest to the two days starting 2014-01-01
|: PeriodOfInterest := (|)[(@)"2014-01-01",(^)"P2D"];
// Only take into account respondents who watched at least 5 minutes
^: TvReachThreshold := (^)"PT5M";
// First import the data from the regular SQL tables.
{[#:RespondentId, |:Timeslot, #:ChannelId, &:Weight]}:
  TvExposuresWeighted :=
    attach [
        (#)"RespondentId",
        (|)[(@)"Timeslot_begin", (@)"Timeslot_end"],
        (#)"ChannelId",
        (&)"Weight"
      ]
    "TvExposuresWeighted";
{[#:RespondentId, |:Period, &:Weight]}:
  RespondentWeights :=
    attach [
        (#)"RespondentId",
        (|)[(@)"Period_begin", (@)"Period_end"],
        (&)"Weight"
      ]
    "RespondentWeights";
// Select only TV exposures within the period of interest. Note that TV exposures
// on the boundaries of the period of interest will be cut to fit within it.
// Use the temporal-and '@&@' operator to do it.
TvExposuresWeighted := TvExposuresWeighted @&@ {[PeriodOfInterest]};
// Time to start aggregating. Compute the total viewing time (in minutes)
// per respondent, per channel
{[#:totalDuration, #:RespondentId, #:ChannelId, &:Weight]} : Durations :=
  fold [/sum(.#1), .#2, .#3, .#4]
  project [.Timeslot.absolute.minutes, .RespondentId, .ChannelId, .Weight]
  TvExposuresWeighted;
// Filter out respondents who watched less than 5 minutes of a given channel.
Durations := select [.totalDuration >= TvReachThreshold.minutes] Durations;
// Compute the reach for each channel: sum of weights of respondents
// watching this channel divided by the sum of weights of all respondents.
& : TotalWeight :=
  fold [/sum(.#1)]
  fold [/mean(.Weight), .RespondentId]
  (RespondentWeights @&@ {[PeriodOfInterest]});
{[&:Reach, &:SampleSize, #:ChannelId]} : ReachPerChannel :=
  project [.#1/TotalWeight, .#1, .#2]
  fold [/sum(.#1), .#2]
  fold [/mean(.Weight), .ChannelId, .RespondentId]
  Durations;
`print ReachPerChannel;

The above script can be run without modifications on all backends supported by Rax.
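
For readers who think in SQL, here is a rough, hand-written equivalent of the reach computation. This is my own sketch, not the SQL that Rax actually generates; it is T-SQL-flavoured, assumes the two-day period of interest starting 2014-01-01, and glosses over per-backend dialect differences (date arithmetic, implicit casts).

-- Hand-written SQL sketch of the reach computation (not the SQL Rax generates).
-- Dialect details such as DATEDIFF and implicit date casts vary per backend.
WITH exposures AS (          -- clip TV exposures to the period of interest
  SELECT RespondentId, ChannelId, Weight,
         DATEDIFF(minute,
                  CASE WHEN Timeslot_begin < '2014-01-01' THEN '2014-01-01' ELSE Timeslot_begin END,
                  CASE WHEN Timeslot_end   > '2014-01-03' THEN '2014-01-03' ELSE Timeslot_end   END) AS minutes
  FROM TvExposuresWeighted
  WHERE Timeslot_end > '2014-01-01' AND Timeslot_begin < '2014-01-03'
),
durations AS (               -- viewing time per respondent, per channel; apply the 5-minute threshold
  SELECT RespondentId, ChannelId, AVG(Weight) AS Weight
  FROM exposures
  GROUP BY RespondentId, ChannelId
  HAVING SUM(minutes) >= 5
),
total AS (                   -- total weight of all respondents in the period
  SELECT SUM(w) AS TotalWeight
  FROM (SELECT RespondentId, AVG(Weight) AS w
        FROM RespondentWeights
        WHERE Period_end > '2014-01-01' AND Period_begin < '2014-01-03'
        GROUP BY RespondentId) r
)
SELECT SUM(d.Weight) / t.TotalWeight AS Reach,
       SUM(d.Weight)                 AS SampleSize,
       d.ChannelId
FROM durations d CROSS JOIN total t
GROUP BY d.ChannelId, t.TotalWeight;

Even this simplified version shows why porting hand-written SQL across backends is considerably more work than rerunning a Rax script.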

SETUP

I’ve run my experiments on SQL Server, Azure SQL Database, AWS Redshift, and the new kid on the block – Snowflake. More specifically, I used the following configurations:

  • SQL Server 2014 running on an AWS c3.8xlarge instance (32 CPUs, 60 GB RAM, 2x 320 GB SSD drives)
  • Azure SQL Database, performance levels P2 and P3 (I’m not sure what these levels mean exactly, but they are the two highest performance levels)
  • AWS Redshift clusters consisting of 1, 2, 4, 8 and 16 dw2.large nodes (2 virtual cores, 15 GB RAM, moderate I/O performance each)
  • AWS Redshift clusters consisting of 2, 4 and 8 dw1.8xlarge nodes (16 virtual cores, 120 GB RAM, very high I/O performance each)
  • Snowflake warehouses of the following sizes: x-small, small, medium, large and x-large

In all scenarios, I first created the TvExposuresWeighted and RespondentWeights tables and loaded the data into them. I did not create any indexes on the tables, since the average Rax user doesn’t know and doesn’t even want to know about database indexes. I did not tweak any database parameters either.
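
For completeness, the table layout was along these lines. This is a sketch: the exact column types are my assumption, inferred from the attach statements in the Rax script, and the type names vary per backend (e.g. DATETIME2 instead of TIMESTAMP on SQL Server).

-- Sketch of the table layout, inferred from the attach statements above.
-- Exact types (INT vs BIGINT, FLOAT vs DOUBLE PRECISION, TIMESTAMP vs DATETIME2) are assumptions.
CREATE TABLE TvExposuresWeighted (
  RespondentId   BIGINT,
  Timeslot_begin TIMESTAMP,
  Timeslot_end   TIMESTAMP,
  ChannelId      INT,
  Weight         DOUBLE PRECISION
);
CREATE TABLE RespondentWeights (
  RespondentId   BIGINT,
  Period_begin   TIMESTAMP,
  Period_end     TIMESTAMP,
  Weight         DOUBLE PRECISION
);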

PERFORMANCE

The bar chart below shows the runtimes (in seconds) of the TV reach script on each of the above SQL backends and configurations.

As you can see, for this dataset and this script, SQL Server is the best performing backend. This is not surprising, since the size of the data in this scenario is pretty small (the uncompressed size of the TvExposuresWeighted table is approximately 10 GB). Therefore, the power of MPP databases such as Redshift and Snowflake cannot be fully utilized. In fact, using an SQL engine in the cloud only slows things down, as you have to deal with network latency. For such a small data size, SQL Server wins simply because it is the only SQL backend that runs on the same machine as my Rax interpreter and therefore doesn’t suffer from network latency.
Azure SQL Database performs surprisingly badly. I do not know what its performance levels mean, so I can’t say anything about the reason for this poor performance.
In the following experiment, I replicated the rows of both tables 10 times, modifying the respondent IDs so that the resulting dataset has 10 times as many respondents. The resulting TvExposuresLarge table has approximately 2 billion rows and an uncompressed size of roughly 100 GB. The runtimes of the TV reach script are shown in the bar chart below. Note that I skipped Azure this time, since its performance on the smaller dataset was already quite bad.

As you can see, this time SQL Server is not doing so well: the runtime of the script exceeds two minutes. Even though the data size is only 10x larger, the runtime is almost 100x longer. The reason for this is very simple: the data no longer fits in memory, and SQL Server has to perform disk I/O to process the query. The uncompressed data size here is ca. 100 GB, while the machine running SQL Server has only 60 GB of RAM. This also means that SQL Server does not apply any compression by default.
The runtimes on the small Redshift clusters are also high, probably for the same reason. Redshift reports that the size of the data on disk is ca. 100 GB, which means that it doesn’t apply any compression by default. Note the large improvement in performance when increasing the cluster size from 8 to 16 nodes (the query runtime drops from 29 to 8 seconds). This is probably the point at which the data starts to fit in memory.
The performance of the Snowflake configurations is surprisingly good. Even the Snowflake x-small warehouse (a single-node cluster) performs well. Snowflake compresses data by default; in this scenario, it reports that the data size on disk is approximately 20 GB. It also reports that, during query execution, it scanned only 10 GB. Remember that I didn’t create any indexes, so something smart is going on here. Also worth noting is that the Snowflake small warehouse (a 2-node cluster) beats the 2-node Redshift cluster consisting of the huge dw1.8xlarge machines, which are significantly more powerful than the machines Snowflake is using.
The optimal Snowflake warehouse size for this benchmark seems to be medium; further increasing the size of the warehouse doesn’t make a difference. Marcin tells me that Snowflake currently has a high startup overhead per query (the time needed for query compilation and optimization), especially for the type of queries that Rax generates – with lots of small expressions. That’s why the runtime on Snowflake never goes below 6-8 seconds. They are planning to optimize the startup time in the future; currently they’re working on other aspects of the system needed for crunching hundreds of terabytes of data fast. In fact, a dataset of 100 GB is still quite small for an engine like Snowflake.

USABILITY

I can’t stress enough what a pleasure it is to work with Snowflake. Loading the data is a breeze – the documentation is clear, and Snowflake can load (in parallel, of course) from AWS S3 buckets, which is very convenient when you want to migrate from Redshift. Smart move, folks. Also, Snowflake automatically compresses the data; the user doesn’t even need to know what ‘compression’ means. Creating a larger dataset by replicating a smaller one also went smoothly on Snowflake: the INSERT ... SELECT query I used to achieve that ran in under 2 minutes. Finally, and most importantly, resizing a warehouse in Snowflake is easy and super fast – for small and medium warehouses it takes literally a couple of seconds; only for large warehouses do you have to wait up to 5 minutes.
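For reference, that replication query was essentially an INSERT ... SELECT of the following shape. This is a sketch: the respondent-ID offset scheme and the constant 10,000,000 are my own choices for illustration.

-- Sketch of the 10x replication: each copy i shifts RespondentId by a large
-- constant so that every copy gets a disjoint set of respondents.
INSERT INTO TvExposuresLarge
SELECT RespondentId + i.n * 10000000, Timeslot_begin, Timeslot_end, ChannelId, Weight
FROM TvExposuresWeighted
CROSS JOIN (SELECT 0 AS n UNION ALL SELECT 1 UNION ALL SELECT 2
            UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
            UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8
            UNION ALL SELECT 9) AS i;
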
Redshift was a bit more work. It took me a while to figure out how to load the data – the documentation lacks some details, and the error messages are not always clear. Also, by default Redshift does not compress the data. After some googling, I found out that the user can and should turn on compression while loading; however, most data analysts will probably not know this. Replicating the data on Redshift was easy and fast, just like on Snowflake. Resizing a Redshift cluster, though, is rather slow – it takes 15 to 30 minutes, which gets annoying when you pay $70/hour. All in all, the Redshift experiment cost me a lot of time and money, even though the individual runtimes are short.
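For those who hit the same compression issue: Redshift’s COPY command can analyze and apply column compression during the initial load into an empty table. A minimal sketch (the bucket, credentials and delimiter are placeholders):

-- Sketch of a Redshift load with automatic compression analysis.
-- COMPUPDATE ON lets COPY pick a compression encoding per column during the load.
COPY TvExposuresWeighted
FROM 's3://my-bucket/tv-exposures/'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER ','
COMPUPDATE ON;
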
SQL Server is an entirely different story. The initial data load went well, as SQL Server’s Import and Export Data helper application is quite pleasant. Replicating the data, though, was hell on wheels. The INSERT ... SELECT query that works so well on Snowflake and Redshift completely chokes SQL Server: the transaction log grows very fast and the query gets extremely slow. I gave up after half an hour and tried a different approach: I exported the large dataset from Snowflake (again, fast and easy), downloaded it to the SQL Server machine and imported it using the Import and Export Data application. Unfortunately, since the data size was now 100 GB, it took many hours to download and import it. It also cost ca. $150 to rent the machine on AWS. And all this just to run a couple of queries, 2 minutes each…
Finally, Azure SQL Database does not seem to allow loading data from text files. The only way to get the data in was to install SQL Server (in my Windows VM) and use a context menu in SQL Server Management Studio to perform a rather esoteric action called ‘Export data-tier application’. It took many hours to complete this process, even though I was only exporting the small dataset; I didn’t try this with the large (100 GB) dataset. Changing the database performance level was very easy. It didn’t result in good performance, though.