Snowflake – Loading JSON to table in 3 steps

We frequently come across situations where we get a data source in JSON that we need to load into Snowflake. While Snowflake supports JSON data sources, there are some nuances to how the load process works that can take a while to understand and work around. The result is that we ended up spending more time on these tasks than we initially planned. Hopefully this post helps you get there faster!

Recently, I was tasked with loading some JSON from the Open FDA API (https://open.fda.gov/data/downloads/) into a single table. The data required was broken into 9 compressed JSON files…each of which was 600+ MB unzipped. All in all, there were 23,158,988 lines of JSON that needed to be loaded to the table…that’s a lot of data. I didn’t anticipate it at the time, but it’s no wonder that the size of these files would end up being the first issue that I encountered.

I had never handled loading JSON into a Snowflake table before, but after only a quick search of the Snowflake docs, I had found my solution: a script specifically for loading JSON data into relational tables. Bingo. This solution involved the use of the PARSE_JSON function. PARSE_JSON interprets an input string as a JSON document, allowing you to access key/value pairs using dot notation and load the values directly into the table.
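For example, a quick sanity check in a worksheet looks something like this (the key names here are made up for illustration, not actual FDA fields):

SELECT PARSE_JSON('{"product": {"generic_name": "ASPIRIN"}}'):product.generic_name;
-- returns the VARIANT value "ASPIRIN"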

So, I created my staging environment, staged my files and prepared them to be loaded to the table. Only there was a problem…the file size was too large to be accepted, and I was met with an error on load. Further research led me to believe that compressed files between 100 and 250 MB were perfectly suitable for loading. So then why wasn’t this working? The compressed file size of my JSON was less than 150 MB.

No big deal. As we all know, googling and reading docs is part of the job. So, I dug a little deeper into the Snowflake docs. On further investigation, it seemed that the problem was in the staging of my JSON. JSON is stored in the staging area as a string, and in one single column. Perhaps storing all of that data in a single column was the problem. I needed another method.

After another go at digging through documentation, I found the FLATTEN function. Flattening a column produces a lateral view of the data. Because JSON is staged in a single column, I thought that this would be a great solution. It would allow me to individually load each value directly into the corresponding table columns, one row at a time. My syntax and logic were sound, and this approach would work just fine under normal circumstances. But again, the command failed as I received my second load error due to excessive file size.
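For reference, FLATTEN turns array elements into rows; here is a tiny illustration with made-up data (not the FDA feed):

SELECT f.value:name, f.value:dose
FROM TABLE(FLATTEN(input => PARSE_JSON('[{"name": "aspirin", "dose": 81}, {"name": "ibuprofen", "dose": 200}]'))) f;
-- returns one row per array element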

I proceeded to spend the next several hours scouring the Snowflake docs and community, looking for a solution. I tried combining the two previous methods, a method that involved transforming the JSON on load, and countless other functions. I even tried bypassing the CLI and loading the file directly through the Snowflake GUI. Yet here I was, getting the same file size error over and over.

What I ended up learning in the long run is that although a single file of such size can be staged for upload, it cannot actually be copied into the table. It turns out that the max file size for copying JSON to a table is capped at just over 16.5 MB.

Given the amount of data I had to upload, I was looking at having to break the files up into ~ 375 different files in order to meet the 16.5 MB limit. I now had a different problem on my hands.

I opened the first of the 9 JSON files, which took several seconds to load in my IDE due to its sheer size. I scrolled to the bottom of the file…only 1.65 million lines, no big deal. If I started now, I could manually break this thing down into appropriately sized chunks at least within a year or two, right? Clearly, I was going to need an alternate method to accomplish this.

I began looking for a JSON splitting tool online and came across a couple of different options. Some were browser-based applications, while others were open-source projects. I trialed a couple of different applications, however none of them appeared to be built with JSON in mind. They would all split the file, yes, but without any concern for where the split was happening. The integrity of my JSON was being compromised and the output was invalid JSON.

Finally, I stumbled across a Python-based JSON splitter, which ended up being the key that unlocked this entire process for me. And with that, let’s get to the stuff that you came here to read.

Step 1: Splitting Your Files

For this task, I was fortunate enough to stumble on JSON Splitter, an open-source tool for breaking up large JSON files into smaller files of a user-specified size. This program was built in Python, which I didn’t have installed.

First, if you don’t already have it, download Python from https://www.python.org/. Once installed, you can ensure that the installation was successful by running the command py --version in your terminal.

JSON Splitter can be downloaded at https://github.com/jhsu98/json-splitter. Once downloaded, navigate to the directory that the splitter was installed in. For ease, I would suggest storing the target file(s) that need to be split within the same directory, or in a new folder within that directory. Note: your JSON file must be an array of objects or a multi-dimensional array for the splitter to work!

Next, ensure you’re in the same directory in which the program is installed, and run the command py splitter.py.

If successful, you will be prompted to select the target JSON file that needs to be split. Again, the file must be an array of objects or multi-dimensional. The JSON data that I used was sampled from an API endpoint, which required me to manually alter it to fit the splitter’s criteria. Once the program accepts your file, you will be prompted for the size you want the output files to be. I chose 15 MB just to ensure that I wouldn’t exceed the Snowflake file size limit during the upload process. If successful, the new files will be created within the same directory as the splitter application.

Step 2: Stage Your Files

Once you’ve got all of your files split, it is time to stage them to be loaded into your Snowflake table. To do this, you will need to log into your Snowflake environment using the SnowSQL CLI.

First, we must create our file format. Note that I am naming my format ‘json_file_format’ in this example.

CREATE FILE FORMAT json_file_format
  TYPE = 'JSON'
  ALLOW_DUPLICATE = FALSE     -- disallow duplicate object field names
  STRIP_NULL_VALUES = FALSE   -- keep fields with null values
  STRIP_OUTER_ARRAY = TRUE    -- remove the outer [ ] so each element loads as its own row
  IGNORE_UTF8_ERRORS = FALSE; -- fail on invalid UTF-8 rather than replacing characters

Next, we need to create our staging environment using the file format that we just set up. I am naming my staging environment ‘json_stage’, and referencing the previously created file format.

CREATE STAGE IF NOT EXISTS json_stage FILE_FORMAT = json_file_format;

Now we have our staging environment set up to receive our split files. Depending on the size of your original JSON file, it is likely that you now have more than a handful of files that need to be staged. So what’s the best way to do this?

Simple. All we have to do is specify that we want to load ALL of the files. The command to put files into your staging area is a little different depending on whether you’re a Mac or PC user.

Linux/Mac

PUT file:///directory

Windows

PUT file://C:\directory

Again, we need to specify that we want to upload ALL files within the directory. This is the perfect scenario for the * wildcard character.

PUT file://C:\directory\my_split_file_* @json_stage;

This tells the program to load all files starting with ‘my_split_file_’ and any following characters, into our staging environment ‘@json_stage’.

If this works correctly, you should see a display of all the uploaded files. To check the files that were loaded, you can run the command list @json_stage; to get a complete list of files currently staged.
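If you want to sanity-check the staged contents before copying, you can also query the stage directly; each row’s raw JSON is exposed as column $1 (this assumes the json_file_format created earlier):

SELECT t.$1 FROM @json_stage (FILE_FORMAT => 'json_file_format') t LIMIT 5;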

Step 3: Load Your Data

Finally, it is time to load our data into the table. Again, this assumes that you already have a table set up and ready to go. I used fields of the VARCHAR data type in my example.
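If you don’t have a table yet, a minimal placeholder that matches the columns in the example below could look like this (the table and column names are mine, purely for illustration):

CREATE TABLE IF NOT EXISTS my_table (
  col1 VARCHAR,
  col2 VARCHAR,
  col3 VARCHAR
);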

COPY INTO my_table (col1, col2, col3)
FROM (SELECT
        PARSE_JSON($1):json_key_1,
        PARSE_JSON($1):nested.json_key_1,
        PARSE_JSON($1):json_key_3[0]
      FROM @json_stage t)
ON_ERROR = 'CONTINUE';

Let’s go over what is happening here.

First, we use the COPY INTO clause to tell Snowflake which table and columns we want to load our data into. Next, we use a sub-query to get our data. The JSON must be parsed in order to be read, so we use the PARSE_JSON function. The staging area stores our JSON in one column, which we reference using the $1 syntax. We can then reference the values we want using dot notation. Lastly, we reference where this JSON data is coming from…our JSON staging area.

That’s it! Our data can be queried using normal SQL syntax.
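For example, a quick check against the placeholder table from above:

SELECT col1, COUNT(*) AS record_count
FROM my_table
GROUP BY col1
ORDER BY record_count DESC;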

Snowflake Database Recovery

If you google the word ‘snowflake’ for the first time, you will likely get references to the insult, a Taiwanese dessert, a crystal of snow, etc. Once you find that Snowflake is the hot new data warehouse in the cloud, your subsequent searches are likely to be more productive in terms of digging up documentation. However, complex topics such as “Snowflake database recovery” have likely not made it into Stack Overflow or your favorite internet group yet.

In this article, I will tell you the story of a day when I sweated, thought I would lose my job, and how Snowflake saved it!

Cloud Big Data Architecture

The project I am working on involved building a data warehouse from scratch on AWS. It is quite complex, with myriad data sources, ETL pipelines, integration with multiple systems, etc. A high-level architecture is shown in the diagram below:


Figure 1: Cloud Big Data Architecture

Cloud Data Warehouse

We chose Snowflake to serve as the Cloud Data Warehouse. It is a scalable, performant and resilient data warehouse explicitly built for the cloud. It is available on AWS and Azure with GCP in the works.

Setting up and using Snowflake is a breeze; I won’t go into details as you can read about it in their documentation.

Cloning in Snowflake

One of the cool features of Snowflake is the ability to clone an object. The object can be a table, an entire schema or even an entire database! The best part is that this is a super fast operation as no physical copying of data is involved (until either the source or the clone starts to change).
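The syntax is the same at each level; for example (the object names here are illustrative):

CREATE TABLE orders_backup CLONE orders;
CREATE SCHEMA staging_copy CLONE staging;
CREATE DATABASE devdb_copy CLONE devdb;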

As part of our regular operations, we maintain multiple databases and schemas – for different aspects of development, testing, etc. At some specific points in time, we clone objects from one database or schema to another. 

The Sandbox Clone

As users were on-boarded, we created a sandbox database as a playground for them to start playing with the data. As part of my duties, I re-clone this database at the start of every week. This way, users have a fairly stable environment to work with while also periodically getting the latest data. 

Creating this clone database is super simple using a SQL statement like this:

CREATE OR REPLACE DATABASE sandbox CLONE devdb;

This statement will delete the old database named sandbox and create a new version of it as a clone of the current state of the devdb database. How simple is that? 

Well, too simple to cause mistakes … as I found out!

One Monday morning, this is what I typed into the Worksheet and without thinking, hit RUN:

CREATE OR REPLACE DATABASE devdb CLONE sandbox;

After each clone, I do spot checks on the sandbox to make sure that “new” tables created since the last clone show up in it. This time, not only did the new tables not show up, but worse – they disappeared from the devdb!

When I realized my mistake, I went into panic mode. We were still in development and so had no backups yet – the traditional way to recover from such errors.   

Snowflake Time Travel

I needn’t have panicked. Snowflake has this very cool feature (definitely even cooler than cloning) called Time Travel. (If you ever were a Dr. Who fan, you will be thrilled!)

Figure 2: Time Travel

Think of Time Travel as a way to access the state of an object at a particular point in time in the past … up to 90 days back for the Enterprise Edition. For example, if you want to check the status of an order from a day ago:

SELECT status FROM orders AT(offset => -24*60*60);

It is as though you have a live backup of all the data at your fingertips … sweet!
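Besides a relative offset, Time Travel also accepts an absolute timestamp or a statement ID (the query ID below is just a placeholder):

SELECT status FROM orders AT(TIMESTAMP => '2019-07-01 09:00:00'::TIMESTAMP_TZ);
SELECT status FROM orders BEFORE(STATEMENT => '<query_id>');  -- the ID of a specific query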

Note that the default data retention period is one (1) day. You must explicitly set a larger retention period at table creation time (or change the default setting for your database) to take advantage of the full 90 days.
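For example, retention can be raised on an existing object with an ALTER statement; the 90-day value assumes Enterprise Edition or higher:

ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 90;
ALTER DATABASE devdb SET DATA_RETENTION_TIME_IN_DAYS = 90;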

The UNDROP command

I bet every database admin has wished for this command more than once in their career. And yes – Snowflake has an UNDROP command. If you drop a table, schema or even an entire database – you can undrop it. 

UNDROP works in conjunction with Time Travel to help restore the state of an object as it was at a particular time. 
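To see what is eligible for restoration, you can list dropped objects and then undrop them by name (a quick sketch; the table name is illustrative):

SHOW TABLES HISTORY;   -- dropped tables appear with a DROPPED_ON timestamp
UNDROP TABLE orders;   -- restores the most recently dropped version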

Recovering the Snowflake Database

Back to my story … In our Snowflake system, we had not changed the default retention period – so it remained at the default value of 1 day. However, since I detected my mistake within the hour, it was still possible to recover.

Here are the steps that helped with Snowflake database recovery of the original devdb database:

  1. Since the devdb is now garbage, move it out of the way. DO NOT delete it – that comes later:
    • ALTER DATABASE devdb RENAME TO devdb_old;
  2. The original devdb was dropped when I did the clone. Running the UNDROP command will restore it:
    • UNDROP DATABASE devdb;
  3. We no longer need the devdb_old database, so we can drop it:
    • DROP DATABASE devdb_old;

That’s it! The database was restored in no time and I could now re-clone it. However, I learned my lesson. Instead of typing the clone command into a Snowflake worksheet by hand, I created a script to execute it.

Conclusion

Snowflake is a very user-friendly data warehouse, allowing developers to perform a lot of the functions that typically require a database administrator. However, it is still possible to make mistakes, as this article on “Snowflake Database Recovery” shows.