Vertica to Snowflake Migration: Lessons Learned

David Ruthven
6 min read · Nov 15, 2023

Better than imagined

Migrating from one database system to another is generally a non-trivial exercise. There has to be strong motivation to justify the effort, cost and risk of migrating. Migrating a highly optimised, high intensity compute workload operating against high volume data is the most challenging of all. Just over a year ago I was fortunate enough to work with a customer with just such a use case. We learned a lot along the way.

Use Case

The customer is one of the world’s largest Advertising Agencies and the primary use case was Data Engineering. Their application assimilates and analyses online user activity and models user journeys for behavioural targeting using look-alike machine learning models. Every day, tens of terabytes of data, equating to hundreds of billions of rows, are ingested, modelled, analysed and shared. The majority of the workload is handled by Apache Spark, but Vertica was used to perform hundreds of large-scale (>1 trillion row) aggregations per day. The final version of the data was also published via Vertica so that non-Spark users could easily access the results.

Reservations

When we first discussed the possibility of migrating the Vertica workload to Snowflake, the customer was sceptical that we could be successful. The main reason for their scepticism was that the Vertica component had been in place for 7 years and was heavily optimised for this workload. The other reason was that they had already attempted to migrate to Google BigQuery and gave up: they had spent over 6 months on that migration before deciding it would involve too much code re-factoring to make it work. Apparently, in the customer’s words, “Google BigQuery is not a real database”. Migrating from one true relational database to another held more promise, but there was still the unknown quantity of that heavily optimised, high-volume workload.

Why Snowflake?

Putting aside the reservations, there were several reasons why Snowflake would benefit this use case.

  • Data Share. Many of our customer’s agencies and their customers (brands) would be able to access this data directly via Snowflake data sharing, which presents a more economical and efficient access path, particularly for such large data volumes (see the sketch after this list).
  • Data Growth. Ingest volumes were likely to expand, so a simple and economical way to scale capacity would be of huge benefit.
  • Cost Segregation. Some consumers of this data have custom workloads they want to run against this dataset. Allowing them to run those workloads either in their own Snowflake account or on their own dedicated Snowflake compute clusters would make it easy to segregate the cost.
  • Modernise Skills. There was a definite appetite in the team to learn Snowflake. It may not be reason enough, but the value of keeping a highly competent technical team energised should not be underestimated.
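
As a rough illustration of the Data Share point, a share can expose published tables directly to a consumer account. The database, table and account names below are purely illustrative, not the customer’s:

    -- Create a share and expose the published results to a partner account
    CREATE SHARE published_results;

    GRANT USAGE  ON DATABASE adtech                         TO SHARE published_results;
    GRANT USAGE  ON SCHEMA   adtech.published               TO SHARE published_results;
    GRANT SELECT ON TABLE    adtech.published.user_segments TO SHARE published_results;

    -- The consuming agency then mounts the share in their own Snowflake account
    ALTER SHARE published_results ADD ACCOUNTS = partner_org.partner_account;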

Vertica

I have worked with several database systems but not previously with Vertica. I always find that learning a related or similar technology also teaches you more about those you are already familiar with. There are always differences both positive and negative.

Vertica keeps all data sorted, including temporary tables. It relies heavily on sort-merge joins, which are very memory-efficient and very fast. It uses a concept called Projections, which are similar to materialised views, to provide access to permanently sorted versions of the data. It also shards and replicates data across nodes to run queries in parallel and to support resilience. It is clear that Vertica has an advanced query optimiser, and it produces readable execution plans (if you can consider any execution plan readable). The Vertica licence model is based on data volume only; you bring your own storage and compute, unlike Snowflake, where storage and compute are included. The compute used by the Vertica system deployed in North America for this application had 1,952 cores and 15.6TB of RAM.
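
To make the Projections comparison concrete, here is a minimal sketch of a Vertica projection next to the closest Snowflake analogue, a clustering key. The table and column names are illustrative, not taken from the customer’s schema:

    -- Vertica: a projection maintains a permanently sorted, segmented copy of the table
    CREATE PROJECTION events_by_user AS
    SELECT user_key, event_ts, event_type
    FROM events
    ORDER BY user_key, event_ts
    SEGMENTED BY HASH(user_key) ALL NODES;

    -- Snowflake: a clustering key co-locates rows in micro-partitions rather than
    -- maintaining a fully sorted copy of the data
    ALTER TABLE events CLUSTER BY (user_key, event_ts);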

Evaluation

The PoC had to involve full-volume datasets since, aside from code migration feasibility, performance and sizing (cost) were critical success factors. The customer was hoping for on-par costs. The performance would have to be as good as or better than Vertica’s, and the system would have to scale easily in anticipation of growing incoming data volumes and downstream analysis.

The PoC immediately diverged from production when the customer chose to ingest data in Parquet format, even though in production they would use compressed CSVs; it was easier for them to make Parquet files available for the test. The standard Snowflake COPY command was too slow to load 1.2 trillion rows of Parquet data. Using concurrent COPY commands gave better aggregate performance. However, Snowpipe was the best option: approximately 14 times faster at 55% of the cost. Although our vectorised Parquet scanner was available for external tables at the time, we did not test it.
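
For reference, a minimal sketch of the Snowpipe approach for Parquet is shown below. The stage, table and format names are illustrative, and the real pipeline details (credentials, cloud event notifications) are omitted:

    -- Illustrative names; credentials and event notifications omitted
    CREATE FILE FORMAT parquet_ff TYPE = PARQUET;

    CREATE STAGE activity_stage
      URL = 's3://example-bucket/activity/'
      FILE_FORMAT = parquet_ff;

    -- Snowpipe loads new files as they land, rather than one monolithic COPY
    CREATE PIPE activity_pipe AUTO_INGEST = TRUE AS
      COPY INTO activity
      FROM @activity_stage
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;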

Having loaded the data, we got off to a poor start when we discovered the application performed JOINs on multi-billion row tables using a BINARY column. The BINARY data type had been chosen for its high in-memory data density. Snowflake was able to perform JOINs on BINARY columns but (at the time) did not estimate the cardinality accurately, which resulted in poor execution plans. Fortunately the workaround of using VARCHAR instead of BINARY was easy to apply and largely addressed the cardinality issue.
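
A minimal sketch of that workaround, with made-up table and column names, is to join on a hex VARCHAR rendering of the key rather than the raw BINARY value:

    -- Hypothetical tables; user_key was BINARY in the original schema
    CREATE TABLE journeys_v AS
    SELECT TO_VARCHAR(user_key, 'HEX') AS user_key, event_ts, event_type
    FROM journeys;

    CREATE TABLE profiles_v AS
    SELECT TO_VARCHAR(user_key, 'HEX') AS user_key, segment
    FROM profiles;

    -- The multi-billion row join now sees VARCHAR keys, which (at the time)
    -- produced much better cardinality estimates and execution plans
    SELECT j.user_key, p.segment, COUNT(*) AS events
    FROM journeys_v j
    JOIN profiles_v p ON p.user_key = j.user_key
    GROUP BY j.user_key, p.segment;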

Snowflake has few published tuning options aside from re-coding SQL to be more sympathetic to the optimiser and query execution engine. As we all know, tuning SQL has the biggest impact on performance; however, to tune SQL you sometimes need to know how queries are executed, which can vary a lot depending on the database engine. We did rely on our internal Snowflake applied performance team on more than one occasion.
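
In practice that meant leaning on Snowflake’s standard introspection tools rather than hints. The sketch below shows the kind of thing that is available; the query itself is illustrative:

    -- Inspect the plan the optimiser will choose for a heavy aggregation
    EXPLAIN USING TEXT
    SELECT user_key, COUNT(*) AS events
    FROM journeys
    GROUP BY user_key;

    -- After running a query, break it down by operator to find the expensive step
    SELECT operator_type, operator_statistics
    FROM TABLE(GET_QUERY_OPERATOR_STATS(LAST_QUERY_ID()));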

Pleasant Surprises

The customer expected, but was nonetheless impressed by, the complete isolation of compute clusters, which allows multiple workloads to run in parallel, and the ability to scale and suspend compute near-instantaneously.

In addition, they found quite a few reassuring and unexpected positive outcomes.

  • Loading Data was faster. Vertica has the overhead of maintaining sort order.
  • Very little SQL re-factoring. Snowflake is a full RDBMS with a rich SQL dialect so very few changes to Vertica SQL had to be made to run on Snowflake.
  • On-par or better Performance. In spite of Vertica having been optimised for this workload over 7 years, Snowflake could equal or improve on the performance, often with fewer compute resources than the Vertica cluster. This was despite Snowflake not having any SQL hints, which were heavily used in the Vertica implementation.
  • Reduced Compute. The customer expected they would need a 5XL or 6XL Snowflake warehouse for some of the jobs, however we never needed more than a 4XL and many jobs were run on smaller warehouses.
  • Auto-statistics gathering is great. In many other database systems, Vertica included, statistics gathering has to be run separately from loads or data modifications. Snowflake performs this on the fly.
  • Table Partitioning was not required. Snowflake does not have explicit table partitions, but data can be grouped together using table cluster keys. The concern was that in Vertica table partitions make table maintenance more efficient. However, deleting billions of rows was more efficient in Snowflake than traditional partition drop and swap operations in Vertica.
  • No Overprovisioning. Snowflake may or may not be more expensive per compute hour, but one of the key reasons Snowflake is generally more economical than alternative systems is that you don’t have to over-provision: the speed of scaling means you only need to provision capacity for the duration it is required (see the sketch below).
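
To illustrate the scaling and overprovisioning points above, compute in Snowflake is provisioned as warehouses that can be resized and suspended on demand. A minimal sketch, with illustrative names and sizes:

    -- Dedicated warehouse for one consumer workload; it suspends itself when idle
    CREATE WAREHOUSE agency_reporting_wh
      WAREHOUSE_SIZE = 'XLARGE'
      AUTO_SUSPEND = 60          -- seconds of inactivity before suspending
      AUTO_RESUME = TRUE
      INITIALLY_SUSPENDED = TRUE;

    -- Scale up only for the heavy daily aggregation window, then back down
    ALTER WAREHOUSE agency_reporting_wh SET WAREHOUSE_SIZE = 'X4LARGE';
    ALTER WAREHOUSE agency_reporting_wh SET WAREHOUSE_SIZE = 'XLARGE';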

Final Outcome

The PoC was run over a period of 3 months with production level data volumes and full volume query tests. This also allowed us to accurately calculate the Snowflake costs for their three production deployments in North America, EMEA and APAC.

The PoC was not without issues, and workarounds had to be found on occasion. However, you always know a team is keen to make a project a success when they don’t throw their toys out of the pram at the first obstacle and when they suggest workarounds themselves.

The customer did successfully migrate their application from Vertica to Snowflake on time and on budget. Equally important is that the team are now highly proficient and confident with Snowflake.

David Ruthven

Snowflake Sales Engineer — opinions expressed are solely my own and do not express the views or opinions of my employer.