Snowflake as a Query Engine Agnostic Data Distribution Network

David Ruthven
6 min read · Apr 18, 2024

Query Engine Agnostic Data Sharing

What do I mean by Query Engine Agnostic Data Distribution Network?

Snowflake has a built-in, simple, scalable, secure and economical way of delivering and sharing data across cloud vendors and regions. Sharing data between Snowflake accounts is a core capability of the Snowflake platform. With support for Iceberg tables, we can more easily extend this capability to the growing list of query engines that can directly access Iceberg data, with no need to ingest shared data.

Why is this important to your business?

Eliminating data silos, including lake and warehouse silos, and making data available to all relevant employees is core to leveraging your entire workforce to reveal data-driven insights about every aspect of your business. With Snowflake deployed as a Data Distribution Network, you connect your data to your users regardless of where that data is hosted, maintained or consumed. As a side effect of using Snowflake as a Data Distribution Network, all your shareable data is also accessible within your Snowflake network, allowing you to leverage all the scalability, governance and analytics use cases supported by the Snowflake platform and our vast array of partner solutions.

What is Snowflake Data Sharing?

Snowflake data sharing is a method for publishing and subscribing to shared data. Snowflake simplifies the data sharing process for all parties: the complexity, limited security and additional costs involved in sharing data with traditional methods are completely avoided. Since data never leaves the owner’s Snowflake account, Snowflake data sharing is also a foundational component of Snowflake data clean room solutions.

Snowflake data sharing for Subscribers

Snowflake data shares are live, meaning subscribers to shared data always see the latest version as determined by the publisher. The data appears in their Snowflake account as if they had loaded it themselves. For subscribers, accessing shared data is instantaneous and there are no transfer or ingest costs; the only costs are the Snowflake compute credits used to query the shared data.
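For example, mounting and querying an inbound share takes only a couple of statements. This is a minimal sketch; the share and object names (provider_account, sales_share, orders) are placeholders rather than anything prescribed here.

```sql
-- List the inbound shares this account has been granted access to
SHOW SHARES;

-- Mount the share as a local, read-only database (no data is copied)
CREATE DATABASE shared_sales FROM SHARE provider_account.sales_share;

-- Query it immediately with your own compute; only those credits are billed
SELECT COUNT(*) FROM shared_sales.public.orders;
```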

Snowflake data sharing for Publishers

The publisher has complete control over what data gets shared, even down to defining a prescriptive set of allowed queries if required. The publisher also controls which other Snowflake accounts can subscribe, and can turn off the share instantaneously. The publisher pays for the ingest and transform of the shared data, the replication and egress costs, and the storage costs of the ‘headless’ replication account created in the target cloud region and owned by the publisher. For efficiency, performance and reduced cost, replication copies only changed Snowflake micro-partitions (data blocks).
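Under the covers, a share is simply a securable object the publisher grants privileges to. Below is a hedged sketch of a direct share created in SQL, with placeholder names (sales_db, orders, consumer_account); Private Listings and Provider Studio automate the same steps through the UI.

```sql
-- Create the share and grant it access to exactly the objects to be published
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;

-- Only the accounts listed here can subscribe; removing an account revokes access instantly
ALTER SHARE sales_share ADD ACCOUNTS = consumer_account;
```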

The entire process of establishing, maintaining and monitoring data shares is fully automated and simplified through Snowflake Private Listings and the Provider Studio UI.

Sharing data between Snowflake accounts is as simple as it can be. If you want to see for yourself, just create a free Snowflake trial account and subscribe to any of the free data sets on Snowflake Marketplace, all of which are built on Snowflake data sharing.

Sharing data between non-Snowflake Query Engines

Many customers, though, also want to share data with non-Snowflake query engines. You could simply use the existing data sharing mechanism, export the data to cloud bucket storage and have the desired query engine access it via an external table or ingest it. This would still be preferable to most traditional data sharing mechanisms; however, for much higher performance and to avoid ingest costs, we recommend publishing to Iceberg tables at the subscriber end of the share.
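For reference, the export-to-bucket fallback amounts to an unload from the mounted share. A minimal sketch, assuming an external stage (my_export_stage) pointing at the consumer’s bucket already exists; the other engine would then read the Parquet files via an external table.

```sql
-- Unload the shared table to cloud storage as Parquet for a non-Snowflake engine to read
COPY INTO @my_export_stage/orders/
  FROM shared_sales.public.orders
  FILE_FORMAT = (TYPE = PARQUET);
```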

At the time of publishing, Snowflake end-to-end data sharing for Iceberg tables from publisher to subscriber is still on the roadmap. For the time being, the method illustrated below, using Snowflake native tables, can serve as an interim or even longer-term alternative approach.

In the above diagram, data to be shared exists in Snowflake native tables. In this example we have a base table and a delta table containing a historical record of all changes to the base table. These two tables are shared with the authorised Snowflake consumer account using Private Listings and Provider Studio to configure and automate building, replicating and monitoring the data share.
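The article does not prescribe how the delta table is populated, but one natural mechanism, sketched below under that assumption, is a Snowflake stream on the base table plus a scheduled task; all object and column names (orders, orders_delta, share_wh, order_id, amount) are placeholders.

```sql
-- Track row-level changes on the base table
CREATE OR REPLACE STREAM orders_stream ON TABLE sales_db.public.orders;

-- Periodically append the captured changes to the delta (change history) table
CREATE OR REPLACE TASK capture_order_changes
  WAREHOUSE = share_wh
  SCHEDULE = '15 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('orders_stream')
AS
  INSERT INTO sales_db.public.orders_delta (order_id, amount, action, is_update, captured_at)
  SELECT order_id, amount, METADATA$ACTION, METADATA$ISUPDATE, CURRENT_TIMESTAMP()
  FROM orders_stream;

-- Tasks are created suspended; resume to start the schedule
ALTER TASK capture_order_changes RESUME;
```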

The consumer of a share has a number of options.

1. Snowflake Native

If the consumer account wants to use Snowflake for their analysis they can directly reference the up-to-date shared native base table.

2. Iceberg

If the consumer wants to access the native base table data via Iceberg, an Iceberg copy can be maintained by Snowflake. The delta table allows incremental changes to be applied to the published Iceberg table, minimising synchronisation delays and costs. If the provider runs the process to maintain the Iceberg table, the consumer does not need to access Snowflake at all; they just see a locally provisioned Iceberg table.
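A hedged sketch of seeding that Iceberg copy as a Snowflake-managed Iceberg table, assuming an external volume (iceberg_vol) has been configured; names are placeholders. Once created, the table’s Parquet data and Iceberg metadata sit in the external volume where any Iceberg-capable engine can read them, and subsequent changes can be applied incrementally from the delta table rather than rebuilding the copy.

```sql
-- Create an Iceberg copy of the shared base table, managed by Snowflake
CREATE ICEBERG TABLE orders_iceberg
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'iceberg_vol'
  BASE_LOCATION = 'orders_iceberg/'
AS SELECT * FROM shared_sales.public.orders;
```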

3. Query Engine Native

If the consumer wants to ingest the base table data directly into their query engine, they can leverage the delta table, which contains CDC data, to minimise ingest time and costs, applying the changes to their own native table using a MERGE command.
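A hedged sketch of that MERGE, using placeholder column names (order_id, amount) and the action column from the delta table; the exact dialect depends on the consumer’s query engine, and the delta rows are assumed to have already landed there (for example via the Iceberg copy or an export).

```sql
-- Apply the CDC delta to the consumer's own native copy of the base table
MERGE INTO local_orders AS t
USING orders_delta AS d
  ON t.order_id = d.order_id
WHEN MATCHED AND d.action = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET t.amount = d.amount
WHEN NOT MATCHED AND d.action <> 'DELETE' THEN
  INSERT (order_id, amount) VALUES (d.order_id, d.amount);
```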

What if the source data to be shared is not on Snowflake?

This is a very common scenario. Most of my customers have many valuable data sets managed by a variety of query engines. Sometimes they can use query-engine-specific sharing methods if the consumer also uses the same query engine. If that is not the case, the usual method of sharing across query engines is to export to storage at the publisher end, transfer the exported data to the subscriber region and ingest it at the subscriber end. This means the data to be shared is no longer under the control of the provider query engine. It is also more expensive, due to the export and ingest costs, and introduces latency between the publisher’s and the subscriber’s versions of the data.

We can leverage the Snowflake Data Distribution Network for this scenario as well, as illustrated below.

On the provider side, data can be directly ingested into Snowflake from OLTP, IoT, event, file and API data sources. For data which is managed and maintained in a third-party query engine, an ELT or replication tool can be used to synchronise the data to be shared into Snowflake.

On the consumer side, consumers using Snowflake for their analytics can simply subscribe to the shared native tables. For consumers using third-party query engines that support Iceberg, we can synchronise copies of the shared native tables with local Iceberg tables. For consumers who cannot use Iceberg, or who want native query engine performance, we can synchronise copies of the shared native tables directly into their query engine using a reverse ETL or replication tool. There is always the option to export data in any file format Snowflake supports.

Non-Snowflake customers can also use a Snowflake Reader account, provisioned by the data provider, which gives them read-only access to a hosted Snowflake account.
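Provisioning a reader account is itself a single statement on the provider side; the sketch below uses placeholder names and credentials. The account locator returned by the command is then added to the share like any other account.

```sql
-- Provider side: create a reader account for a consumer with no Snowflake subscription
CREATE MANAGED ACCOUNT partner_reader
  ADMIN_NAME = partner_admin,
  ADMIN_PASSWORD = 'ChooseAStrongPassword123',
  TYPE = READER;

-- Add the returned account locator to the share (AB12345 is a placeholder)
ALTER SHARE sales_share ADD ACCOUNTS = AB12345;
```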

Summary

Building your own Snowflake Data Distribution Network does not involve disrupting existing data pipelines or migrating any workloads to Snowflake. It provides a valuable replacement service that simplifies data sharing across your organisation, regardless of where the data originates or which query engine the consumer requires. As a side effect, all of your shareable data is instantly accessible to authorised users within Snowflake, allowing seamless collaboration between key functions and leading to transformative changes in virtually every aspect of your business.

David Ruthven

Snowflake Sales Engineer — opinions expressed are solely my own and do not express the views or opinions of my employer.