What We Learned From Working With 100M JSON Blobs In Redshift Spectrum

By Nick Lemoing, Engineering Intern

If you’ve visited Noom’s landing page, you may have filled out the survey. We ask potential users several questions to assess their fit for the program. All of the data from that survey is valuable because it offers a window into what the user is looking to get from Noom.

Or, rather, it could offer a window into what the user is looking for. There are a few roadblocks to using this data to its maximum potential.

  • Sluggish JSON extraction from Redshift. All survey data responses are stored in a JSON blob called `survey_data`. We use Redshift as our analysis platform, and getting data out of a JSON blob using Redshift is not ideal (more on this later). Our current view only captures a subset of the questions we ask, and it still takes *over two hours* to run.
  • No question text information. Instead of living in a database or a CMS, all survey questions are defined in code to help engineers get experiments out more quickly. From a data perspective, the downside is that we have no idea what a question was asking; that information lives in code and doesn’t make it into the `survey_data` blob. All we have is a `question_id`, which is a short string identifying the question.
  • Continuously changing questions. The list of questions changes all the time because the Growth team is continuously running experiments. That means there’s no comprehensive list of the questions that are asked at a given time.

I’ve spent the last eight months as an intern at Noom. I’ve worked on the Growth and Data teams, so I’ve seen first-hand how survey data is handled. As my internship was coming to an end, I was asked if there was anything specific I wanted to pursue with my remaining time. That question can be surprisingly tricky to answer; it’s not often you get a blank slate. However, I realized this was the perfect opportunity to tackle the survey data problem. I outlined the problem and my proposed solution, pitched it to the team, and for the next few weeks, it was my main focus.

I had heard the term “data wrangler” before, though I can’t say I knew what it meant until I started this project. To wrangle the data into a usable format, we had to employ an interesting JSON transformation workaround, make use of Redshift Spectrum’s efficient JSON querying capabilities, and hijack our build system to extract some additional, hard-to-get data. 

Transforming the JSON

Redshift provides a few functions that can be used to get data out of a JSON blob. For objects, Redshift provides a function called `json_extract_path_text`. This function takes a JSON blob and a key as its arguments. We can get data out of the blob by calling `json_extract_path_text` for each key we’re interested in.

One of the limitations of this function is that we can’t tell Redshift to extract all of the keys from the JSON blob; we have to ask for each key we want explicitly. This is why we’re limited to just querying a subset of the keys: we have to settle for using only the ones we know about because we don’t have a comprehensive list of questions.

However, that’s not the only thing preventing us from extracting all the keys. It turns out that as the number of keys we want to extract increases, the query slows down considerably. This only really becomes noticeable when running on a large number of rows (~100M in our case) and a large number of keys. Until that point, Redshift can scale up considerably and mask some of those inefficiencies. However, we started to reach a point where our Redshift cluster was starved on CPU, and this query was at the top of the list in terms of CPU cost.

We hypothesized that `json_extract_path_text` re-parses the JSON blob each time it is called, even if it is called multiple times per row. In our case, we were making 65 calls to JSON functions using the same blob.

Caption: Query times plotted against the number of keys extracted. Times are expressed relative to the time it takes to extract a single key, which normalizes across different row counts. For example, extracting 50 keys from 100K rows takes one and a half times as long as extracting one key, while extracting 50 keys from 100M rows takes 30 times as long.
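We couldn't confirm this by inspecting Redshift's internals, but the cost of the suspected access pattern is easy to reproduce in plain Python. This sketch (our own illustration, not Redshift code) compares re-parsing the blob for every key against parsing it once:

```python
import json
import time

# A blob with 65 keys, standing in for a survey_data row.
blob = json.dumps({f"question_{i}": [f"answer_{i}"] for i in range(65)})
keys = [f"question_{i}" for i in range(65)]

def extract_reparsing(blob, keys):
    # One json.loads per key: the pattern we suspect repeated
    # json_extract_path_text calls produce.
    return [json.loads(blob)[key] for key in keys]

def extract_once(blob, keys):
    # Parse the blob a single time, then read every key.
    data = json.loads(blob)
    return [data[key] for key in keys]

start = time.perf_counter()
for _ in range(200):
    extract_reparsing(blob, keys)
reparse_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(200):
    extract_once(blob, keys)
once_time = time.perf_counter() - start

# Both strategies return the same values; only the cost differs.
assert extract_reparsing(blob, keys) == extract_once(blob, keys)
print(f"re-parse every key: {reparse_time:.3f}s, parse once: {once_time:.3f}s")
```

The gap widens as the number of keys grows, which matches the shape of the curve in the figure.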

We killed both of these birds with one stone. We affectionately refer to the stone in question as “composite user-defined function,” or composite UDF. Redshift allows users to define custom functions, but those functions are limited in that they can only return primitive types (strings, numbers, Booleans) and not composite types (tuples, lists, or maps). They also can’t produce multiple rows of output from a single input row.

However, we can return text, meaning we can write a UDF which transforms the questions to a JSON string with a more consistent schema. In this case, a consistent schema means no unexpected keys because we have to be explicit about each key we use. We can use the UDF to transform our object with arbitrary keys into an array that contains each key-value pair as an element. This helps ensure we capture all the keys present in the data, not just ones we specify beforehand. Another benefit is speed: The UDF is written in Python, not SQL, so we only have to parse the blob a single time using `json.loads` to perform this transformation.

{
  "user_id": 123,
  "survey_data": {
    "first_question": ["answer"],
    "multi_select": ["answer_part_1", "answer_part_2"],
    …
  }
}

Caption: Old JSON format. Each question_id is a key; adding new questions introduces unexpected keys.

{
  "user_id": 123,
  "survey_data": [
    { 
      "question_id": "first_question",
      "answer": "answer", 
      "answer_index": 0
    },
    { 
      "question_id": "multi_select",
      "answer": "answer_part_1", 
      "answer_index": 0 
    },
    { 
      "question_id": "multi_select",
      "answer": "answer_part_2",
      "answer_index": 1 
    },
    ...
  ]
}

Caption: New JSON format. Each array element has a question_id, answer, and answer_index key; there are no unexpected keys.
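As a concrete sketch, the transformation between the two formats boils down to a few lines of Python (the appendix shows the same logic packaged as a Redshift UDF):

```python
import json

def transform_survey_data(user_id, survey_data_json):
    """Flatten a {question_id: [answers]} blob into a list of
    {question_id, answer, answer_index} records."""
    data = json.loads(survey_data_json)
    flattened = []
    for question_id, answer_list in data.items():
        for answer_index, answer in enumerate(answer_list):
            flattened.append({
                "question_id": question_id,
                "answer": answer,
                "answer_index": answer_index,
            })
    return json.dumps({"user_id": user_id, "survey_data": flattened})

old_format = json.dumps({
    "first_question": ["answer"],
    "multi_select": ["answer_part_1", "answer_part_2"],
})
print(transform_survey_data(123, old_format))
```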

It’s not immediately clear how this helps, given that we would still need to use Redshift JSON functions to extract the output. However, now that the data is in a reliable schema, there’s another tool that can help: Redshift Spectrum.

Using Redshift Spectrum To Query JSON Efficiently

Redshift Spectrum is an extension of Redshift that allows you to treat files in S3 as database tables. It’s designed to work efficiently on tabular and nested data, meaning if we have JSON in S3, we can query it with Spectrum. What’s more, Spectrum explodes arrays so that each array element gets its own row. This is useful for our purposes because it allows us to create a table with four fields: `user_id`, `question_id`, `answer`, and `answer_index`.

As with Redshift, Redshift Spectrum can only query keys it knows about; these have to be defined in the table’s schema. This is why the transformation step above was important: not only did it help with the CPU problem, but it also put things in a format Redshift Spectrum can read.

Redshift Spectrum’s downside is that it *only* works in S3, and our data is in Redshift. Adding this into the mix means having to unload all of our data to S3 and then read it back again using Spectrum. There’s significant overhead here, which is why this approach doesn’t make sense unless the CPU is a bottleneck. UDFs also come with their own set of constraints: the output size is limited to the maximum Redshift VARCHAR column size (65535 bytes), and UDF concurrency is capped at one-quarter of the cluster’s total concurrency. It’s important to profile both approaches on real-world data since there are several tradeoffs.
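The VARCHAR limit in particular is worth guarding against explicitly: a user with an unusually long survey could otherwise produce a row that fails to serialize. A minimal sketch, assuming we adopt the convention of returning NULL for oversized rows (the `safe_serialize` helper and the convention are ours, not something Redshift provides):

```python
import json

# Maximum size of a Redshift VARCHAR column, in bytes.
REDSHIFT_MAX_VARCHAR = 65535

def safe_serialize(record):
    """Serialize a record to JSON, returning None (NULL in Redshift)
    when the encoded result would exceed the VARCHAR limit. NULL rows
    can then be counted and investigated rather than failing the query."""
    encoded = json.dumps(record)
    if len(encoded.encode("utf-8")) > REDSHIFT_MAX_VARCHAR:
        return None
    return encoded

print(safe_serialize({"user_id": 1, "answers": []}) is not None)  # True
print(safe_serialize({"answers": ["x" * 70000]}) is None)         # True
```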

However, in our case, we found massive speed improvements when the number of keys we extracted from the JSON was large. Another advantage is that it can be done entirely through Redshift queries; see the appendix for more detail on how this can be accomplished.

Caption: Query times plotted against the number of keys extracted. The dataset had 100M rows. The Spectrum approach’s query time includes the composite UDF transformation, the unload to S3, and the Spectrum query.

We were able to produce a table with all of the survey data, not just a subset of the questions, and we were able to do so in a fraction of the time it took before. We were also able to perform each step within our existing Redshift cluster. However, there was still a major issue from the Data Analysis team’s perspective. They didn’t know what the questions were asking, nor did they have a list of the series of questions that were being asked at any given time.

Extracting Question Text

Extracting question text turned out to be the biggest rabbit hole of this project, if not the most technically complex. Even just distilling the problem was difficult because it sat at the intersection of the Growth and Data teams, and neither team was fully aware of the other’s constraints.

Question text was stored in the Growth codebase, making it difficult to extract. In addition, the text for a given question ID could change over time, so whatever solution we settled on also had to be automated and repeatable to capture both previous and future changes. We also did not want to saddle the Growth team with a solution that would break any time they refactored something.

With those concerns in mind, I began prototyping some solutions. This was new ground for me, and I had to go through several failed iterations before finding something that worked.

The Dead End

I had played with Babel’s Abstract Syntax Tree (AST) tooling on a related project, and when this project started, I thought it would be a good candidate for a solution. The AST tool reads a JavaScript file and produces a tree representation of the code. In theory, that tree contains information about the questions array, including each question’s ID and text. If the array looks like this:

const questions = [
    {
        id: "question1",
        text: "What is your favorite color?"
    },
    ...
]

Then, the Babel AST will look like this:

{
  "type": "VariableDeclarator",
  "id": {
    "type": "Identifier",
    "name": "questions"
  },
  "init": {
    "type": "ArrayExpression",
    "elements": [
      {
        "type": "ObjectExpression",
        "properties": [
          {
            "type": "Property",
            "key": {
              "type": "Identifier",
              "name": "id"
            },
            "value": {
              "type": "Literal",
              "value": "question1"
            }
          },
          {
            "type": "Property",
            "key": {
              "type": "Identifier",
              "name": "text"
            },
            "value": {
              "type": "Literal",
              "value": "What is your favorite color?"
            }
          }
        ]
      },
      ...
    ]
  }
}

Notice both the question ID (“question1”) and the question text ("What is your favorite color?") are properties of the tree, so we can extract them programmatically.
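Since Babel can serialize the AST to JSON, the extraction step itself is language-agnostic. Here is a sketch, in Python for brevity (the real implementation was JavaScript), of walking a tree like the one above to collect id/text pairs; it assumes the simple case where both values are literals:

```python
def find_questions(node, results=None):
    """Recursively walk a Babel-style AST (as parsed JSON) and collect
    {id, text} pairs from ObjectExpression nodes whose properties are
    string literals."""
    if results is None:
        results = []
    if isinstance(node, dict):
        if node.get("type") == "ObjectExpression":
            fields = {}
            for prop in node.get("properties", []):
                key = prop.get("key", {}).get("name")
                value = prop.get("value", {})
                if value.get("type") == "Literal":
                    fields[key] = value.get("value")
            if "id" in fields and "text" in fields:
                results.append(fields)
        for child in node.values():
            find_questions(child, results)
    elif isinstance(node, list):
        for child in node:
            find_questions(child, results)
    return results

ast = {
    "type": "VariableDeclarator",
    "id": {"type": "Identifier", "name": "questions"},
    "init": {
        "type": "ArrayExpression",
        "elements": [{
            "type": "ObjectExpression",
            "properties": [
                {"type": "Property",
                 "key": {"type": "Identifier", "name": "id"},
                 "value": {"type": "Literal", "value": "question1"}},
                {"type": "Property",
                 "key": {"type": "Identifier", "name": "text"},
                 "value": {"type": "Literal",
                           "value": "What is your favorite color?"}},
            ],
        }],
    },
}
print(find_questions(ast))
# → [{'id': 'question1', 'text': 'What is your favorite color?'}]
```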

Unfortunately, there’s a problem with this approach. Things are easy if everything is defined in the array. However, the file looked a lot more like this:

import { question1 } from "./question1"
import { question2 } from "./question2"

const questions = [
  question1,
  question2
];

Of course, we can resolve those imports, locate the definitions of `question1` and `question2`, and extract the data from there. That’s the rabbit hole I dove headfirst into. What I would later realize is that imports are only the tip of the complexity iceberg.

We had questions where the text was derived from a translation, which meant looking up the translation key.

addTranslationAtKey("question1", {
  en: "What is your favorite color?"
});

export const question1 = {
  id: "question1",
  text: i18n.t("question1")
};

We had questions where the text was defined in a variable elsewhere in the code, which meant tracing through identifiers to find the original definition.

const text = "What is your favorite color?";

export const question1 = {
  id: "question1",
  text
};

After writing code to handle each of these edge cases and more, I had a system that could read all the questions in the codebase, and it worked like a charm.

And then I tried to run it on a commit from six months ago.

There’s a problem with doing static analysis to extract specific data from a codebase; no matter what you do, it’ll always be tied in some way to the structure of the codebase. The best static analysis tools tend to work in very broad strokes. For example, static analysis can find unused code, enforce global syntax rules, and trace references.

However, using static analysis to determine a single variable’s runtime value is a bad idea, since it is sensitive to changes in the code structure. In that six-month period, we had overhauled our translation framework, updated our build system, and generally violated several assumptions I had foolishly presumed to be invariants in the system. As I ran the code, I watched those assumptions break one by one and realized how flawed and fragile the approach had been.

In retrospect, this seems like a case of me having a shiny new hammer, and believing my problem was a nail. There were better tools out there; I just needed to find them.

The Fix

While writing my brittle static analysis rules, I realized that having some way of executing the code would make finding the array’s contents trivial: at runtime, the value is just sitting there. My first thought was to run the code in Node, but transforming React code into something Node can execute proved difficult, so I abandoned that idea. At that point, I didn’t know of another way to execute the code programmatically.

That changed when, after my original attempt, one of the engineers on the Growth team used a tool called Puppeteer for an unrelated task. Puppeteer provides an API for programmatically accessing webpages from a browser, including executing JavaScript code. The question array wasn’t in the global scope, so I couldn’t use Puppeteer to read it yet, but it was a start.

I knew I had to approach the problem from a different angle to solve it. Instead of focusing on what a single commit looked like, I looked at several sample commits from the past year. One thing stood out as consistent: the dev server.

Although our build system had changed in the last year, the command for running the dev server stayed the same since it was a script defined in the `package.json` file: `npm run dev` always spun up a local server. 

My next realization was that although the project structure had changed, the program’s entry point was predictable: it was always called `index.js`, in more or less the same directory. That meant I could write a custom entry point and overwrite `index.js` with it. When `npm run dev` was called, instead of building the full application, it would build the new entry point, which made the questions array available in the global scope and therefore accessible via Puppeteer.

The new entry point:

// This path was also prone to change slightly but could be
// determined automatically and injected using a combination of
// command-line tools.
import questions from "./path/to/question/array";

window.questions = questions;

This meant I could store historical question data by running the following process:

git checkout $COMMIT
mv entry.js index.js
timeout 15 npm run dev & node puppeteerScript.js

For me, the beauty of this process was how simple it was compared to the static analysis solution. It was still tied to the code structure in that it needed to know where the entry point was, but this was far more predictable than trying to trace through several layers of functions and references statically. The result was that we were able to get all of the question data we needed in just a few lines of code.

In Sum

Once we ran the question data collection script, all that was left to do was join it with the table we built with the composite UDF. 

This project was exhilarating for a few reasons. First, and probably most importantly, was that it was something I was able to lead from start to finish. It wasn’t even on the Data team’s roadmap for this quarter, but after writing a tech doc that demonstrated the value of making this data more accessible, I received their full support to go through with it. As an intern, it can be rare to have the opportunity to lead impactful projects, and I’m grateful to the team for giving me that chance.

The second reason I enjoyed it was that it allowed me to work in multiple technical areas that are quite different. I familiarized myself with our frontend build system, used heavy data processing tools, and even though it was a dead-end, I got lots of exposure to static analysis.

Unfortunately, I won’t get to stick around to see how the data gets used; my internship wraps up soon. However, I will be returning to Noom full time once I graduate next year, and I’m looking forward to hitting the ground running when I get back.

Appendix

The Composite UDF, From Start To Finish

One of the coolest things about the composite UDF is that each step is done within Redshift. Before this effort, we had already set up an external database with Redshift Spectrum, but doing so is straightforward since Spectrum is automatically included in Redshift. 

The first step is to create the composite UDF with the CREATE OR REPLACE FUNCTION command. 

CREATE OR REPLACE FUNCTION survey_data_transformer(
    user_id BIGINT,
    survey_data VARCHAR(65535)
)
RETURNS VARCHAR(65535)
IMMUTABLE AS $$

import json

data = json.loads(survey_data)
parsed_answers = []
for question_id, answers in data.items():
    for answer_index, answer in enumerate(answers):
        parsed_answers.append({
            'question_id': question_id,
            'answer': answer,
            'answer_index': answer_index
        })
result = {
    'user_id': user_id,
    'answers': parsed_answers
}
return json.dumps(result)

$$ LANGUAGE plpythonu

Once the UDF is defined, the data can be transformed and unloaded to S3 in one command. It is written as JSON-encoded text with one entry per line. Setting a maximum file size helps keep individual request times low and avoids timeouts.

UNLOAD ('
    SELECT survey_data_transformer(user_id, survey_answers)
    FROM survey_data_raw_table
')
TO '{{ s3_uri }}'
IAM_ROLE '{{ iam_role }}'
MAXFILESIZE 128 MB
ALLOWOVERWRITE
GZIP

To read the data back, we have to define an external table and provide its schema. This requires an external schema and database to exist first; we had already set up both (we called each `s3`).

CREATE EXTERNAL TABLE s3.survey_data_parsed (
    user_id BIGINT,
    answers array<struct<
        question_id: VARCHAR,
        answer: VARCHAR,
        answer_index: INT
    >>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS
    INPUTFORMAT
        'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT
        'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '{{ s3_uri }}'

Finally, we can query the data back using Redshift Spectrum’s nested data querying support.

SELECT
    data.user_id,
    a.question_id,
    a.answer,
    a.answer_index
FROM s3.survey_data_parsed data, data.answers a
