A tale of two formats: Converting JSON to CSV
So this just happened:
- “Peter, can you send me a CSV of those search results?”
- “…yeah, sure…give me a minute”
We’d been looking up what metadata about podcasts we could get from querying iTunes. I the hacker, he the hustler.
I was playing around with the iTunes Search API using curl:
curl "https://itunes.apple.com/search?term=<some search term>&media=podcast&limit=200"
This gives you a JSON-formatted result. I thought it’d be easy to transform this data into CSV. I mean, of course there’s a simple-to-use tool out there for this, right?
WRONG!
It should be noted that I expected this to take about 2 minutes and then I’d be back doing what I do.
I initiated my usual procedure: Problem -> Google -> Solution (yeah that’s how we do it these days). After not too much digging I found a couple of possible solutions. Let’s have a look at them.
jsonv
A Bash command line tool for converting JSON to CSV
Seemed promising, but it required the user to specify every single field to include in the result. This was suboptimal; I just wanted all the fields (why isn’t this the default?). Installed it anyway, but couldn’t get it to play well with the data from iTunes. My patience ran out.
jq
Here’s a popular and powerful tool for manipulating JSON data and there’s even a function in there for converting results into CSV. Score!
.. but here’s one of the first examples I found:
<input data> | jq -c '.results[] | {name, party, cash: .cash_on_hand} | select(.cash | tonumber > 1000000)'
To the trained eye, this may look straightforward except for the missing CSV conversion. To me however, at this point, it might’ve just as well been written in Wingdings.
I remembered having used jq sometime in the past, but now I was painfully reminded I’d forgotten how to use it. So how did I modify the command to fit my data? Well, my 2 minutes were long gone and I was starting to get frustrated. Still hoping to find that one simple solution I decided to move on.
json2csv
The name sure implies it can do the job and the example from the GitHub page looks simple:
cat input.json | json2csv -k user.name,remote_ip > output.csv
..but wait.. do I need to specify the fields?! Installed it, tried it, couldn’t get it to play well with my data, gave up.
Now I was seriously fed up. FINE, I’ll dig into jq.
I’ll spare you the gory details of my going through the manual. Instead, let’s look at my first attempt:
<itunes data> | jq -r '.results[] | [.[] | tostring] | @csv'
jq is built around filters. You specify which filters to use from left to right and connect them with the common pipe symbol | to create a new filter. Here are the subfilters used above:
(1) .results[]
(2) [.[] | tostring]
(3) @csv
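To see the piping in isolation, here’s a toy example (input object made up for illustration):
echo '{"a": {"b": 2}}' | jq '.a | .b'
The first filter extracts the object under a, the second extracts the value under b, and jq prints 2.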
iTunes returns a JSON object with 2 fields: resultCount and results. In results we find an array with a JSON object for each search result.
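For reference, the response has roughly this shape (heavily trimmed, values made up):
{
  "resultCount": 2,
  "results": [
    { "artistName": "Some Artist", "collectionName": "Some Podcast", "genres": ["Podcasting", "Education"] },
    { "artistName": "Another Artist", "collectionName": "Another Podcast" }
  ]
}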
A filter with a dot + a field name extracts the value of that field, so in filter (1) we extract the array using .results, then append brackets [] to unpack the array and return each object as a separate entity. This is what makes it possible to get one result per row in the final result.
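A quick made-up example shows the unpacking (using jq’s -c flag for compact, one-line output):
echo '{"results": [{"a": 1}, {"a": 2}]}' | jq -c '.results[]'
{"a":1}
{"a":2}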
Using the filter .[] we can extract the values of all fields in an object and return them on separate rows. However, doing so would be counter-productive, since we would no longer be able to tell which values belong to which object. To counteract this, we group the values in arrays by wrapping the filter in brackets: [.[]]. If we were to apply filter (3) now, we would get errors complaining about invalid CSV rows. This is due to some fields having arrays as values in this particular dataset. An example is genres: ["Podcasting", "Education", "Business"]. We can get around this by converting all values to strings, which leads us to filter (2). It’s important that we pipe these subfilters together inside the brackets for this to work.
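Here’s the same idea on a made-up object with an array-valued field:
echo '{"a": 1, "b": ["x", "y"]}' | jq -c '[.[] | tostring]'
["1","[\"x\",\"y\"]"]
Both values end up as strings grouped in one array, so @csv can handle them.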
In the last filter (3) we convert each array to a CSV row, but there is one final problem to consider. @csv quotes string values, so a field like podcast becomes "podcast" inside the row. jq then prints the whole row as a JSON string, escaping those inner quotes and adding an outer pair, so the final result is "\"podcast\"". For this reason we add the -r (raw output) flag to jq, which prints the row as-is and drops the extra set of quotes. Victory!
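The difference is easy to see on a minimal array (values made up):
echo '["podcast", 42]' | jq '@csv'
"\"podcast\",42"
echo '["podcast", 42]' | jq -r '@csv'
"podcast",42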
This story could have ended here, but after coming this far, I made a heartbreaking discovery. The search result objects didn’t all include the same set of fields. A field could be present in some objects, but missing in others. Also, in a JSON object the order of the fields is not significant.. and in the iTunes data, the order can differ from object to object, even when they have the same set of fields…
………
………
………
………
…so it turns out I need to specify which fields to use and in what order! F**K!
Now, do I go back to trying jsonv/json2csv again? No way. I know better now, let’s do this with jq. Here’s the final solution at the very end of my journey:
<itunes data> | jq -r '.results[] | [.artistName, .collectionName, .feedUrl, .releaseDate, .country, .primaryGenreName, .genres | tostring] | @csv'
I’m not even going to try explaining this command; by now I hope you’re able to interpret it and, if need be, save the data to a file.
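For completeness, here’s a sketch of the whole pipeline end to end, assuming a made-up search term and output filename (the printf line just prepends a header row matching the chosen fields):
printf '%s\n' '"artistName","collectionName","feedUrl","releaseDate","country","primaryGenreName","genres"' > podcasts.csv
curl -s "https://itunes.apple.com/search?term=technology&media=podcast&limit=200" \
  | jq -r '.results[] | [.artistName, .collectionName, .feedUrl, .releaseDate, .country, .primaryGenreName, .genres | tostring] | @csv' \
  >> podcasts.csv
One subtlety worth knowing: inside the brackets, the pipe binds looser than the commas, so tostring is applied to every field, not just .genres.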
Lesson: learn powerful tools and learn them properly. It will make life as a developer a lot easier.
Peter out.