We frequently come across situations where we get a data source in JSON that we need to load into Snowflake. While Snowflake supports JSON data sources, there are some nuances of how the load process works that can take a while to understand and work around. The result is that we ended up spending more time than we initially planned for the tasks. Hopefully this post helps you get there faster!
Recently, I was tasked with loading some JSON from the Open FDA API (https://open.fda.gov/data/downloads/) into a single table. The data required was broken into 9 compressed JSON files…each of which being 600+ MB unzipped. All in all, there were 23,158,988 lines of JSON that needed to be loaded to the table…that’s a lot of data. I didn’t anticipate it at the time, but it’s no wonder that the size of these files would end up being the first issue that I encountered.
I had never handled loading JSON into a Snowflake table before, but after only a quick search within the Snowflake docs, I had found my solution. A script, specifically for loading JSON data into relational tables. Bingo. This solution involved the use of the PARSE_JSON function. PARSE_JSON interprets an input string as a JSON document, allowing you to access key/value pairs using dot notation, and load the values directly into the table.
So, I created my staging environment, staged my files and prepared them to be loaded to the table. Only there was a problem…the file size was too large to be accepted, and I was met with an error on load. Further research led me to believe that file sizes of between 100-250 MB of compressed size were perfectly suitable to be loaded. So then why wasn’t this working? The compressed file size of my JSON was less than 150 MB.
No big deal. As we all know, googling and reading docs is part of the job. So, I dug a little deeper into the Snowflake docs. On further investigation, it seemed that the problem was in the staging of my JSON. JSON is stored in the staging area as a string, and in one single column. Perhaps storing all of that data in a single column was the problem. I needed another method.
After another go at digging through documentation, I found the FLATTEN function. Flattening a column produces a lateral view of the data. Because JSON is staged in a single column, I thought that maybe this would be a great solution. It would allow me to individually load each value directly into the corresponding table columns, one row at a time. My syntax and logic were just sound, and this approach would work just fine under normal circumstances. But again, the command failed as I received my second load error due to excessive file size.
I proceeded to spend the next several hours scouring the Snowflake docs and community, looking for a solution. I tried combining the two previous methods, a method that involved transforming the JSON on load, and countless other functions. I even tried bypassing the CLI and loading the file directly through the Snowflake GUI. Yet here I was, getting the same file size error over and over.
What I ended up learned in the long run, is that although a single file of such size can be staged for upload, it cannot actually be copied into the table. It turns out that the max file size for copying JSON to a table is capped at just over 16.5 MB.
Given the amount of data I had to upload, I was looking at having to break the files up into ~ 375 different files in order to meet the 16.5 MB limit. I now had a different problem on my hands.
I opened the first of the 9 JSON files, which took several seconds to load in my IDE due to its sheer size. I scrolled to the bottom of the file…only 1.65 million lines, no big deal. If I started now, I could manually break this thing down into appropriately sized chunks at least within a year or two, right? Clearly, I was going to need an alternate method to accomplish this.
I began looking for a JSON splitting tool online and came across a couple of different options. Some were browser-based applications, while others were open-sourced projects. I trialed a couple of different applications, however none of them appeared to be made to handle JSON. They would all split the file, yes, but without any concern for where the split was happening. The integrity of my JSON was being compromised and the output was invalid JSON.
Finally, I stumbled across a python based JSON splitter, which ended up being the key that unlocked this entire process for me. And with that, let’s get to the stuff that you came here to read.
Step 1: Splitting Your Files
For this task, I was fortunate enough to stumble on JSON Splitter, an open source tool for breaking up large JSON files in to smaller, user specified sized files. This program was built in Python, which I didn’t have.
First, if you don’t already have it, download Python from https://www.python.org/. Once installed, you can ensure that the installation was successful by running the command py – version in your terminal.
JSON Splitter can be downloaded at https://github.com/jhsu98/json-splitter. Once installed, navigate to the directory that the splitter was installed in. For ease, I would suggest that the target file(s) that need to be split be stored within the same directory, or a new folder within that directory. Note: Your JSON file must be an array of objects, or multi-dimensional array to work!
Next, ensure you’re in the same directory in which the program is installed, and run the command py splitter.py.
If successful, you will be prompted to select the target JSON file that needs to be split. Again, the file must be an array of objects or multi-dimensional. The JSON data that I used was sampled from an API endpoint, which required me to manually alter it to fit the splitter criteria for use. Once the program accepts your file, you will be prompted for the file size that you wish your file to split into. I chose 15 MB just to ensure that I wouldn’t exceed the snowflake file size limit during the upload process. If successful, the newly created files will be created within the same directory as the splitter application.
Step 2: Stage Your Files
Once you’ve got all of your files split, it is time to stage them to be loaded into your snowflake table. To do this, you will need to log into your Snowflake environment using the SnowSQL CLI.
First, we must create our file format. Note that I am naming my format ‘json_file_format’ in this example.
Create file format json_file_format
Next, we need to create our staging environment using the file format that we just set up. I am naming my staging environment ‘json_stage’, and referencing the previously created file format.
Create stage if not exists json_stage file_format = json_file_format;
Now we have our staging environment set up to receive our split files. Depending on the size of your original JSON file, it is likely that you now have more than a handful of files that need to be staged. So what’s the best way to do this?
Simple. All we have to do is specify that we want to load ALL of files. The command to put files into your staging area is a little different depending on whether you’re a Mac or PC user.
Again, we need to specify that we want to upload ALL files within the directory. This is the perfect scenario for us to use the * character to load all the files.
This tells the program to load all files starting with ‘my_split_file_’ and any following characters, into our staging environment ‘@json_stage’.
If this works correctly, you should see a display of all the uploaded files. To check the files that were loaded, you can run the command list @json_stage; to get a complete list of files currently staged.
Stage 3: Load Your Data
Finally, it is time to load our data into the table. Again, this assumes that you already have a table set up and ready to go. I used fields of the VARCHAR data type in my example.
Copy into (col1, col2, col3)
From @json_stage t)
Let’s go over what is happening here.
First, we are using the ‘copy into’ clause to tell the program what fields from our table we want to load our data into. Next, we use a sub-query to get our data. The JSON must be parsed in order to be read, so we will use the ‘parse_json’ function. The staging area stores our JSON in one column, which we reference using the ‘$1’ syntax. We can now reference the values we want by using dot notation. Lastly, we reference where this JSON data is coming from…our JSON staging area.
That’s it! Our data can be queried using normal SQL syntax.