Convert CSV to JSON with jq
jq, the JSON processor, is a new tool for command-line data mining. I’ve been using jq for a couple of months and it’s super useful.
In this post I present a simple example of converting CSV to JSON using jq. Parsing CSV with jq is important because while lots of tools will read a CSV file, most of those tools (e.g. Excel, Google Docs, your text editor) are not easy to use in automation scripts.
For example purposes, I am using sdss1738478.csv — a 200MB (about 1.7 million lines) CSV data set containing stellar object coordinates from the Sloan Digital Sky Survey, downloaded from the Real-World Data Sets archive at Princeton.
“Big Data is any thing which is crash Excel.” — DevOps Borat (@DEVOPS_BORAT), January 8, 2013
Step 1: Visual examination of the raw data
Having downloaded sdss1738478.csv, the first thing I want to do is take a look at the first few lines of the file to get a sense of what I am dealing with. For one thing, I expect that the first line of the file will be a list of column headings which I can use to structure the JSON output I’m going to generate.
Use head
I use the head command to examine the column header row plus a couple of lines of data. Seeing this helps me enormously to design the command that will convert this file to JSON.
$ head -n3 sdss1738478.csv
objectid (long),right ascension (float),declination (float),ultraviolet (double),green (double),red$
758882758529188302,263.4087,6.278961,24.5967464447021,23.471773147583,21.6918754577637,21.1872177124023,20.4352779388428
758882758529188203,263.4428,6.38233,26.7489776611328,25.1870555877686,21.7422962188721,22.282844543457,22.1947193145752
head is also useful here because sdss1738478.csv is over a million lines long. Using head allows me to abstract away the complexity of working with a very large file and just concentrate on figuring out the column headings and the data format.
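If you want to confirm just how big the file is before diving in, a quick line count does the trick (the exact figure depends on the download, but it should come out around 1.7 million lines):

$ wc -l sdss1738478.csv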
Step 2: Define an output format
Now that I have a sense of what the raw data looks like I can sketch out what I’d like the finished data set to look like — once it has been converted into JSON.
The simplest thing would be to deliver an array of JSON records (one for each line) containing labeled data fields.
The column headings obviously should be used as the data labels but I’m going to edit them slightly so that they’ll make naturally convenient keys for a JSON object.
Here’s some pseudocode to illustrate the format I’ve designed:
[
  {
    "id" : <data1>,
    "ascension" : <data2>,
    "declination" : <data3>,
    "ultraviolet" : <data4>,
    "green" : <data5>,
    "red" : <data6>
  },
  ...
]
Step 3: Write a jq expression against a limited data set
Now that I have my format, I need to write a filter that takes the CSV lines as input and returns an array of formatted JSON records.
Here’s what that looks like with jq. (I’m still truncating the data file with head to make it easy to see what’s going on):
head -n3 sdss1738478.csv | \
jq --slurp --raw-input --raw-output \
'split("\n") | .[1:] | map(split(",")) |
map({"id": .[0],
"ascension": .[1],
"declination": .[2],
"ultraviolet": .[3],
"green": .[4],
"red": .[5]})'
Note I’m using the expression .[1:] to skip the first line of the data file, since that row just contains the column labels. The ability to omit lines directly in jq means less need to dirty up raw data files by “pre-processing” them before parsing.
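If the slice syntax is unfamiliar, .[1:] simply drops the first element of an array and keeps the rest:

$ jq -n '["header","row1","row2"] | .[1:]'
[
  "row1",
  "row2"
]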
And here is what my output looks like:
[
{
"id": "758882758529188302",
"ascension": "263.4087",
"declination": "6.278961",
"ultraviolet": "24.5967464447021",
"green": "23.471773147583",
"red": "21.6918754577637"
},
{
"id": "758882758529188203",
"ascension": "263.4428",
"declination": "6.38233",
"ultraviolet": "26.7489776611328",
"green": "25.1870555877686",
"red": "21.7422962188721"
}
]
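One thing to notice: every value in the output is a string, because split only ever produces strings. If real JSON numbers are needed, jq’s tonumber filter can convert individual fields. A small sketch (not part of the command above; I keep id as a string, since a very long integer is safer left untouched):

$ jq -n '["758882758529188302","263.4087"] | {"id": .[0], "ascension": (.[1] | tonumber)}'
{
  "id": "758882758529188302",
  "ascension": 263.4087
}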
Step 4: Convert CSV to JSON!
At this point I am done testing, so I no longer need to truncate the data file with head. But this will generate around 13 million lines of output, so I want to redirect that to a file!
jq --slurp --raw-input --raw-output \
'split("\n") | .[1:] | map(split(",")) |
map({"id": .[0],
"ascension": .[1],
"declination": .[2],
"ultraviolet": .[3],
"green": .[4],
"red": .[5]})' \
sdss1738478.csv > sdss1738478.json
Running the expression on the whole data file takes just under a minute on my 2013-era MacBook Air! That’s an awesome amount of power from a very short command.
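One way to sanity-check the result (assuming the conversion produced the single top-level array described above) is to ask jq for the length of that array, which should equal the number of data rows in the CSV:

$ jq 'length' sdss1738478.json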
The same pattern works for any CSV with a header row. Here is a second data set, people.csv, run through the same steps:
# head -n4 people.csv
userid,fname,lname,address,city,code,bdate,direction,tnumber,eof
100477927,Sontag,Carrie,15944 NW 121st Ln ,Gainesville,22605,02/25/1972,252,2815817,
100477946,White,Charlie,5428 NW 67th St ,Archer,22618,05/18/1976,252,2154990,
100484246,Burnsed,Betty,6522 NW 106Th Pl ,Gainesville,22606,07/17/1957,,,
map({"userid": .[0],
"fname": .[1],
"lname": .[2],
"address": .[3],
"city": .[4],
"code": .[5],
"bdate": .[6],
"direction": .[7],
"tnumber": .[8],
"eof": .[9]})'
Putting it all together in a small script:
# cat p1.sh
head -n3 people.csv | \
jq --slurp --raw-input --raw-output \
'split("\n") | .[1:] | map(split(",")) |
map({"userid": .[0],
"fname": .[1],
"lname": .[2],
"address": .[3],
"city": .[4],
"code": .[5],
"bdate": .[6],
"direction": .[7],
"tnumber": .[8],
"eof": .[9]})'
Results:
# bash ./p1.sh
[
{
"userid": "100477927",
"fname": "Sontag",
"lname": "Carrie",
"address": "15944 NW 121st Ln ",
"city": "Gainesville",
"code": "22605",
"bdate": "02/25/1972",
"direction": "252",
"tnumber": "2815817",
"eof": ""
},
{
"userid": "100477946",
"fname": "White",
"lname": "Charlie",
"address": "5428 NW 67th St ",
"city": "Archer",
"code": "22618",
"bdate": "05/18/1976",
"direction": "252",
"tnumber": "2154990",
"eof": ""
},
{
"userid": null,
"fname": null,
"lname": null,
"address": null,
"city": null,
"code": null,
"bdate": null,
"direction": null,
"tnumber": null,
"eof": null
}
]
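Notice the final record in the results is all nulls. That happens because the slurped input ends with a newline, so split("\n") leaves a trailing empty string in the array, which then turns into an empty row. A small guard (a tweak to the filter, not part of the original commands) is to drop empty lines before splitting on commas:

$ printf 'h\na,b\n' | jq --slurp --raw-input 'split("\n") | .[1:] | map(select(length > 0)) | map(split(","))'
[
  [
    "a",
    "b"
  ]
]

The same map(select(length > 0)) step can be dropped into either of the commands above to keep the empty record out of the output.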