Snowpark for Python … Code Faster

David Ruthven
5 min read · Mar 27, 2023


One of the primary reasons programming languages like Python are so popular with data engineers, data scientists and application developers is that they let you develop code much more rapidly. Python is interpreted (no waiting for compilation and linking), which makes it easier to debug and faster to iterate and test. Python developers typically use notebooks for coding, with interactive execution of code blocks and inline viewing of results. In addition, Python has an enormous global community, which makes finding example code, reusable code and support that much easier. This substantial community also produces a large and varied collection of code libraries covering almost every programming task: connectors, data manipulation, visualisations, geospatial, UIs, machine learning … you name it.

However, two hurdles still impede Python development, and Snowflake with Snowpark can help with both: fast, easy access to data and fast, easy access to compute.

Access to (Production) Data

When developing code you need access to data. In particular, when developing new releases of production code, you typically need production-quality and production-quantity data. In most database systems you would need to make a physical copy of the production data, perhaps anonymised for security. Finding storage resources to accommodate data copies is non-trivial, copying the data takes time, and you then need to worry about securing that copy. Ideally the copy should be managed by the same repository that manages the production data; otherwise you have to add an additional governance layer.

Snowflake’s Zero Copy Cloning

Snowflake solves this issue with zero copy cloning (ZCC). ZCC can clone a table, a schema or an entire database. The cloning process is a pure metadata operation, meaning that only pointers to the underlying data blocks (Snowflake micro-partitions) are copied. This makes cloning extremely fast: a single SQL command that completes in seconds or minutes. No physical data blocks are copied when the clone is initially created, hence the Zero. The cloned data is fully read-write, and changes made by developers are private to their cloned copy. The cloned data is also managed under the same governance policies as the production data, although these may be made even more stringent for developer use. Of course, this is only viable because you can spin up one or more completely independent compute clusters for your developers, isolating their development and test activity from production activity.
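You can drive cloning straight from Snowpark for Python. Here is a minimal sketch, assuming you supply your own connection parameters and that PROD_DB and DEV_DB are hypothetical database names:

```python
from snowflake.snowpark import Session

# Connection parameters are placeholders; supply your own account details.
session = Session.builder.configs({
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "SYSADMIN",
}).create()

# Clone an entire production database as a private dev copy.
# This is a metadata-only operation, so it completes in seconds
# regardless of how much data PROD_DB holds.
session.sql("CREATE DATABASE DEV_DB CLONE PROD_DB").collect()

# Point the session at the clone; changes made here are private
# to DEV_DB and never touch production micro-partitions.
session.use_database("DEV_DB")
```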

Access to Data (Enrichment and 3rd party)

Developer productivity can be severely impacted by having to find and load data. For example, if you are developing a geospatial application and need data for a particular country, you may have to find a source for the shapefiles, then load and transform those shapefiles into a consumable format before you can continue with your day job. Imagine if you could subscribe to this data just as easily as including a code library to extend your code’s functionality.

Snowflake Marketplace and Data Sharing help you piece together your required datasets

Snowflake data sharing does exactly that. You can subscribe to any data on any Snowflake account that has been made accessible to you via a data share. Subscribing is near instantaneous regardless of the original dataset’s size, because no copy is made; the subscription acts like a view on the original data. For security or ease of use, the publisher of the shared data may only provide access via secure views or user-defined functions. The subscribed data is read-only but live, meaning that when you query it you see the latest version. The range of datasets you can subscribe to on the Snowflake Marketplace is continually growing and covers categories including Commerce, Demographics, Economy, Energy, Environment, Financial, Geospatial, Government, Health and Life Sciences, Identity, Legal, Marketing, Media, Security, Sports, Transportation, Travel and Weather.
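Once a share is mounted as a database in your account, Snowpark queries it like any local table. A sketch, reusing the session from above; the WEATHER_SHARE database and its table and column names are illustrative placeholders:

```python
from snowflake.snowpark.functions import col

# A mounted share behaves like a read-only, always-current database.
daily = session.table("WEATHER_SHARE.PUBLIC.DAILY_OBSERVATIONS")

# Query the live shared data directly; there is no loading or copying step.
(daily
    .filter(col("COUNTRY") == "AU")
    .select("STATION_ID", "OBS_DATE", "TEMP_MAX")
    .show(10))
```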

Access to Compute

Snowflake is famous for “solving all your scalability problems forever”. Now you can do the same for your Python applications.

With Snowflake you can increase the size of a compute cluster (scale up) to improve individual query performance, add more clusters (scale out) for increased concurrency, and create separate single- or multi-cluster Virtual Warehouses to isolate workloads completely. Users can run their workloads whenever they need to without worrying about impacting other users or batch activity.
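Warehouse sizing and isolation are just SQL commands, so you can drive them from Snowpark too. A sketch, with the warehouse name DEV_WH as an illustrative placeholder:

```python
# Create an isolated warehouse for this workload.
session.sql("""
    CREATE WAREHOUSE IF NOT EXISTS DEV_WH
    WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60
""").collect()

# Scale up for a heavy query...
session.sql("ALTER WAREHOUSE DEV_WH SET WAREHOUSE_SIZE = 'LARGE'").collect()

# ...and back down when done, so you only pay for what you use.
session.sql("ALTER WAREHOUSE DEV_WH SET WAREHOUSE_SIZE = 'XSMALL'").collect()
```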

Since Snowpark translates DataFrame operations into Snowflake SQL statements, your DataFrame code now benefits from the highly efficient Snowflake SQL query engine, with its first-class query optimiser and vectorised execution engine.
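You can see the pushdown for yourself: build a DataFrame pipeline and inspect the SQL Snowpark generates for it. The table and column names below are hypothetical:

```python
from snowflake.snowpark.functions import col, sum as sum_

# No data moves while this pipeline is being defined; Snowpark is
# lazily composing a single SQL statement.
orders = session.table("SALES.PUBLIC.ORDERS")
result = (orders
    .filter(col("ORDER_DATE") >= "2023-01-01")
    .group_by("REGION")
    .agg(sum_(col("AMOUNT")).alias("TOTAL_AMOUNT")))

# Print the generated SQL and its execution plan; the whole pipeline
# runs inside Snowflake's optimiser and vectorised engine.
result.explain()

# Execution happens only now, on the server.
result.show()
```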

Summary

Snowpark for Python removes the last main hurdles to highly productive code and application development: near-instant access to data of any volume and near-instant access to compute of any scale.

Resources

There is a growing number of super cool Snowpark for Python Quickstarts to help you get up to speed quickly.

If you don’t have access to Snowflake, no problem: you can fire up a free trial environment.

Let’s Go!


David Ruthven

Snowflake Sales Engineer — opinions expressed are solely my own and do not express the views or opinions of my employer.