Convert CSV to JSON with jq

jq, the command-line JSON processor, is a relatively new tool for command-line data mining. I’ve been using jq for a couple of months and it’s proved super useful.

In this post I present a simple example of converting CSV to JSON with jq. Being able to parse CSV with jq matters because, while lots of tools will read a CSV file, most of them (e.g. Excel, Google Docs, your text editor) are not easy to drive from automation scripts.

For example purposes, I am using sdss1738478.csv — a 200MB (about 1.7 million lines) CSV data set containing stellar object coordinates from the Sloan Digital Sky Survey, downloaded from the Real-World Data Sets archive at Princeton.

Download sdss1738478.csv (200MB)

Step 1: Visual examination of the raw data

Having downloaded sdss1738478.csv, the first thing I want to do is take a look at the first few lines of the file to get a sense of what I am dealing with. For one thing I expect that the first line of the file will be a list of column headings which I can use to structure the JSON output I’m going to generate.

Use head

I use the head command to examine the column header row plus a couple of lines of data. Seeing this helps enormously in designing the command that will convert the file to JSON.

$ head -n3 sdss1738478.csv

objectid (long),right ascension (float),declination (float),ultraviolet (double),green (double),red$
758882758529188302,263.4087,6.278961,24.5967464447021,23.471773147583,21.6918754577637,21.1872177124023,20.4352779388428
758882758529188203,263.4428,6.38233,26.7489776611328,25.1870555877686,21.7422962188721,22.282844543457,22.1947193145752

head is also useful here because sdss1738478.csv is over a million lines long. Truncating with head lets me set aside the awkwardness of working with a very large file and concentrate on figuring out the column headings and the data format.

Step 2: Define an output format

Now that I have a sense of what the raw data looks like I can sketch out what I’d like the finished data set to look like — once it has been converted into JSON.

The simplest thing would be to deliver an array of JSON records (one for each line) containing labeled data fields.

The column headings should obviously be used as the data labels, but I’m going to edit them slightly so that they make convenient keys for a JSON object.

Here’s some pseudocode to illustrate the format I’ve designed:

[
   {
      "id"          : <data1>,
      "ascension"   : <data2>,
      "declination" : <data3>,
      "ultraviolet" : <data4>,
      "green"       : <data5>,
      "red"         : <data6>
   },
   ...
]
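(As an aside: if the raw headings were acceptable as keys, jq could build each record by pairing the header row with each data row, with no hardcoded names at all. Here is a sketch of that idiom; the keys would come out verbatim, e.g. "objectid (long)", which is exactly why I prefer to choose my own.)

head -n3 sdss1738478.csv | \
jq --slurp --raw-input \
    'split("\n") | map(select(. != "")) | map(split(",")) |
        # the first row supplies the keys
        .[0] as $headers | .[1:] |
        # pair keys with each row: [[key, value], ...], then fold into one object
        map([$headers, .] | transpose | map({(.[0]): .[1]}) | add)'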

Step 3: Write a jq expression against a limited data set

Now that I have my format, I need to write a jq filter that takes the raw CSV as input and returns an array of formatted JSON records.

Here’s what that looks like with jq. (Still truncating the data file with head to make it easy to see what’s going on):

head -n3 sdss1738478.csv | \
jq --slurp --raw-input --raw-output \
    'split("\n") | .[1:] | map(select(. != "")) | map(split(",")) |
        map({"id": .[0],
             "ascension": .[1],
             "declination": .[2],
             "ultraviolet": .[3],
             "green": .[4],
             "red": .[5]})'

Note I’m using the expression .[1:] to skip the first line of the data file, since that row just holds the column labels, and select(. != "") to drop the empty string that split("\n") leaves behind for the file’s trailing newline (without it, the output would gain a final record with every field null). The ability to omit lines directly in jq means less need to dirty up raw data files by “pre-processing” them before parsing.
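If jq’s slice syntax is new to you, here is .[1:] in isolation (nothing here is specific to my data set):

$ echo '["skip me","keep me","keep me too"]' | jq '.[1:]'
[
  "keep me",
  "keep me too"
]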

And here is what my output looks like:

[
  {
    "id": "758882758529188302",
    "ascension": "263.4087",
    "declination": "6.278961",
    "ultraviolet": "24.5967464447021",
    "green": "23.471773147583",
    "red": "21.6918754577637"
  },
  {
    "id": "758882758529188203",
    "ascension": "263.4428",
    "declination": "6.38233",
    "ultraviolet": "26.7489776611328",
    "green": "25.1870555877686",
    "red": "21.7422962188721"
  }
]
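One thing to notice: every value in the output is a JSON string, because split(",") produces strings. If downstream consumers want real numbers, jq’s tonumber filter will convert the float columns. A sketch of that variant; note that I deliberately leave id as a string, because an 18-digit integer exceeds the 53-bit integer precision of jq’s floating-point numbers:

head -n3 sdss1738478.csv | \
jq --slurp --raw-input \
    'split("\n") | .[1:] | map(select(. != "")) | map(split(",")) |
        map({"id": .[0],            # keep as a string: 18 digits overflows a double
             "ascension": (.[1] | tonumber),
             "declination": (.[2] | tonumber),
             "ultraviolet": (.[3] | tonumber),
             "green": (.[4] | tonumber),
             "red": (.[5] | tonumber)})'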

Step 4: Convert CSV to JSON!

At this point I am done testing, so I no longer need to truncate the data file with head. But the full run will generate around 13 million lines of pretty-printed output, so I want to redirect it to a file!

jq --slurp --raw-input --raw-output \
  'split("\n") | .[1:] | map(select(. != "")) | map(split(",")) |
      map({"id": .[0],
           "ascension": .[1],
           "declination": .[2],
           "ultraviolet": .[3],
           "green": .[4],
           "red": .[5]})' \
  sdss1738478.csv > sdss1738478.json

Running the expression on the whole data file takes just under a minute on my 2013-era MacBook Air! That’s an awesome amount of power from a very short command.
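A closing aside: most of those 13 million lines come from pretty-printing. If whatever consumes this output prefers one record per line (the JSON Lines convention), streaming the records with .[] instead of collecting them into an array, combined with jq’s --compact-output flag, yields exactly one output line per CSV row. A sketch, writing to a hypothetical .jsonl file:

jq --slurp --raw-input --compact-output \
  'split("\n") | .[1:] | map(select(. != "")) |
      .[] | split(",") |
      {"id": .[0],
       "ascension": .[1],
       "declination": .[2],
       "ultraviolet": .[3],
       "green": .[4],
       "red": .[5]}' \
  sdss1738478.csv > sdss1738478.jsonl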

 

Another example: people.csv

The same recipe works on any simple CSV file. Here is a small personnel data set, people.csv, with ten columns (note the empty trailing fields in some rows):

# head -n4 people.csv
userid,fname,lname,address,city,code,bdate,direction,tnumber,eof
100477927,Sontag,Carrie,15944 NW 121st Ln ,Gainesville,22605,02/25/1972,252,2815817,
100477946,White,Charlie,5428 NW 67th St ,Archer,22618,05/18/1976,252,2154990,
100484246,Burnsed,Betty,6522 NW 106Th Pl ,Gainesville,22606,07/17/1957,,,

map({"userid": .[0],
"fname": .[1],
"lname": .[2],
"address": .[3],
"city": .[4],
"code": .[5],
"bdate": .[6],
"direction": .[7],
"tnumber": .[8],
"eof": .[9]})'

Note I’m using the expression .[1:] to skip the first line of the data file since that row just has the names of the column labels. The ability to omit lines directly from jq means less need to dirty up raw data files by “pre-processing” them before parsing.

 

The jq program is the same shape as before; only the field names change. Here it is wrapped in a script. Note that it omits the empty-line guard used above; watch what that does to the results:

# cat p1.sh
head -n3 people.csv | \
jq --slurp --raw-input --raw-output \
    'split("\n") | .[1:] | map(split(",")) |
        map({"userid": .[0],
             "fname": .[1],
             "lname": .[2],
             "address": .[3],
             "city": .[4],
             "code": .[5],
             "bdate": .[6],
             "direction": .[7],
             "tnumber": .[8],
             "eof": .[9]})'

 

Results:

# bash ./p1.sh
[
  {
    "userid": "100477927",
    "fname": "Sontag",
    "lname": "Carrie",
    "address": "15944 NW 121st Ln ",
    "city": "Gainesville",
    "code": "22605",
    "bdate": "02/25/1972",
    "direction": "252",
    "tnumber": "2815817",
    "eof": ""
  },
  {
    "userid": "100477946",
    "fname": "White",
    "lname": "Charlie",
    "address": "5428 NW 67th St ",
    "city": "Archer",
    "code": "22618",
    "bdate": "05/18/1976",
    "direction": "252",
    "tnumber": "2154990",
    "eof": ""
  },
  {
    "userid": null,
    "fname": null,
    "lname": null,
    "address": null,
    "city": null,
    "code": null,
    "bdate": null,
    "direction": null,
    "tnumber": null,
    "eof": null
  }
]
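Notice the third, all-null record. This script omits the empty-line guard: the file ends with a newline, so split("\n") leaves a trailing empty string, splitting "" on "," yields an empty array, and indexing an empty array returns null for every field. Restoring the select(. != "") guard from the main example is a one-line change that drops the phantom record:

head -n3 people.csv | \
jq --slurp --raw-input --raw-output \
    'split("\n") | .[1:] | map(select(. != "")) | map(split(",")) |
        map({"userid": .[0],
             "fname": .[1],
             "lname": .[2],
             "address": .[3],
             "city": .[4],
             "code": .[5],
             "bdate": .[6],
             "direction": .[7],
             "tnumber": .[8],
             "eof": .[9]})'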
