How to Love jsonl — using JSON Lines in your Workflow

Alex Galea
3 min read · Jul 17, 2019


Why you should think about using the JSON Lines format in your data processing workflow. We’ll look at some jsonl examples and discuss how I use it day to day.

What is the JSON Lines format?

It’s a file format where each line is a JSON object. Just imagine a bunch of stacked-up dictionaries. Here’s an example with 4 records:

{"name": "Gilbert", "wins": [["straight", "7♣"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

This is an appealing format for data streaming. Since each line is a self-contained JSON object with well-defined key/value pairs, each record in the data stream can be handled in isolation. This helps ensure data integrity if something goes wrong.

The JSON Lines specification is very simple; there are only a few rules to consider:

  • UTF-8 encoding
  • Each line should be valid JSON
  • Use a newline character “\n” to separate records
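
Putting the rules above into practice, here’s a minimal sketch of a line-by-line reader (the records.jsonl filename is just a placeholder for the example data above). Because each line is parsed independently, one malformed record doesn’t spoil the rest of the stream:

import json

with open('records.jsonl', encoding='utf-8') as f:  # UTF-8, per the spec
    for line_number, line in enumerate(f, start=1):
        try:
            record = json.loads(line)  # each line should be valid JSON
        except json.JSONDecodeError:
            print(f'Skipping malformed record on line {line_number}')
            continue
        print(record['name'], len(record['wins']))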

JSON Lines as CSV

While I prefer to use key/value pairs for explicitly labelling values, the jsonl format could be used to record data in a traditional relational format. For example:

["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, false]
["Deloise", "2012A", 19, true]

Here each row is a JSON list, producing a format equivalent to CSV. Unlike CSV, though, it has a universal format specification, which makes it far less susceptible to parsing issues.
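
For instance, here’s a rough sketch of how such a row-oriented jsonl file could be read back, treating the first line as the header (the scores.jsonl filename is just a placeholder for the data above):

import json

with open('scores.jsonl', encoding='utf-8') as f:
    header = json.loads(next(f))  # first line holds the column names
    rows = [dict(zip(header, json.loads(line))) for line in f]

print(rows[0])  # {'Name': 'Gilbert', 'Session': '2013', 'Score': 24, 'Completed': True}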

JSON Lines in the wild with Python

I use jsonl for dumping raw “source of truth” data. From there it can be loaded in and processed by any part of the application and, if needed, dumped into a relational format (e.g. Postgres, MySQL, CSV).

Here are Python functions that can be used to write and read jsonl.
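
A minimal sketch of such helpers, consistent with the dump_jsonl and load_jsonl calls used later in this post, might look like this:

import json

def dump_jsonl(data, output_path, append=False):
    """Write a list of dicts to a jsonl file, one JSON object per line."""
    mode = 'a' if append else 'w'
    with open(output_path, mode, encoding='utf-8') as f:
        for record in data:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

def load_jsonl(input_path):
    """Read a jsonl file back into a list of dicts."""
    with open(input_path, encoding='utf-8') as f:
        return [json.loads(line) for line in f]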

Let’s pick up and use these functions to process some data.

Say we’re testing a website to make sure pages are returning good status codes. We’ll loop over some URLs, making requests and then saving the response data to file.

import requests
import datetime

for url in [
    'https://chordanalytics.ca/',
    'https://github.com/agalea91',
    'https://medium.com/chord-analytics',
]:
    page = requests.get(url)
    webpage_data = {
        'page_url': page.url,
        'status_code': page.status_code,
        'date': datetime.datetime.now().isoformat(),
    }
    print(webpage_data)
    dump_jsonl([webpage_data], 'out.jsonl', append=True)

Note that we are using the .jsonl file extension for the output (out.jsonl).

Now say you run this code with a daily cron job for a month, all the while appending data to out.jsonl. Then, one day you realize that you also wanted to save the page titles. With a relational database, you would need to alter the table and add a new column to store the field. With this infrastructure though, you only need to alter the records going forward:

from lxml.html import fromstring

for url in [
    ...
]:
    ...
    title = fromstring(page.content).findtext('.//title')
    if title:
        webpage_data['title'] = title
    ...

This means we can make the change (grabbing the page title) immediately and deal with the processing later.

This processing might look something like the following, where we load the data into a Pandas DataFrame for analysis:

import pandas as pd

webpage_data = load_jsonl('out.jsonl')

db_data = []
db_cols = ['page_url', 'title', 'status_code', 'date']
for d in webpage_data:
    db_data.append([])
    for col in db_cols:
        db_data[-1].append(d.get(col, float('nan')))

df = pd.DataFrame(db_data, columns=db_cols)

The resulting DataFrame df has one row per record, with the columns page_url, title, status_code and date. Notice the missing entries for title, which come from records where it wasn’t being saved (pre-implementation).
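
As an aside, pandas can also read JSON Lines files directly; a one-liner like the following should produce a similar DataFrame, with the missing title values coming through as NaN:

import pandas as pd

# lines=True tells pandas to treat each line as a separate JSON record
df = pd.read_json('out.jsonl', lines=True)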

Summary

We learned about the JSON Lines data format and saw how it can be used as a source of truth data store for your applications and analysis pipelines.

Now get out there and start using jsonl in your workflow!

https://alexgalea.ca/
