A Guide To Command-Line Data Manipulation — Smashing Magazine
[ad_1]
Allow me to preface this article by saying that I’m not a terminal person. I don’t use Vim. I find sed
, grep
, and awk
convoluted and counter-intuitive. I prefer seeing my files in a nice UI. Despite all that, I got into the habit of reaching for command-line interfaces (CLIs) when I had small, dedicated tasks to complete. Why? I’ll explain all of that below. In this article, you’ll also learn how to use a CLI tool named Miller to manipulate data from CSV, TSV and/or JSON files.
Contents
Why Use The Command Line?
Everything that I’m showing here can be done with regular code. You can load the file, parse the CSV data, and then transform it using regular JavaScript, Python, or any other language. But there are a few reasons why I reach out for command-line interfaces (CLIs) whenever I need to transform data:
- Easier to read.
It is faster (for me) to write a script in JavaScript or Python for my usual data processing. But, a script can be confusing to come back to. In my experience, command-line manipulations are harder to write initially but easier to read afterward. - Easier to reproduce.
Thanks to package managers like Homebrew, CLIs are much easier to install than they used to be. No need to figure out the correct version of Node.js or Python, the package manager takes care of that for you. - Ages well.
Compared to modern programming languages, CLIs are old. They change a lot more slowly than languages and frameworks.
What Is Miller?
The main reason I love Miller is that it’s a standalone tool. There are many great tools for data manipulation, but every other tool I found was part of a specific ecosystem. The tools written in Python required knowing how to use pip
and virtual environments; for those written in Rust, it was cargo
, and so on.
On top of that, it’s fast. The data files are streamed, not held in memory, which means that you can perform operations on large files without freezing your computer.
As a bonus, Miller is actively maintained, John Kerl really keeps on top of PRs and issues. As a developer, I always get a satisfying feeling when I see a neat and maintained open-source project with great documentation.
Installation
- Linux:
apt-get install miller
or Homebrew. - macOS:
brew install miller
using Homebrew. - Windows:
choco install miller
using Chocolatey.
That’s it, and you should now have the mlr
command available in your terminal.
Run mlr help topics
to see if it worked. This will give you instructions to navigate the built-in documentation. You shouldn’t need it, though; that’s what this tutorial is for!
How mlr
Works
Miller commands work the following way:
mlr [input/output file formats] [verbs] [file]
Example: mlr --csv filter '$color != "red"' example.csv
Let’s deconstruct:
--csv
specifies the input file format. It’s a CSV file.filter
is what we’re doing on the file, called a “verb” in the documentation. In this case, we’re filtering every row that doesn’t have the fieldcolor
set to"red"
. There are many other verbs likesort
andcut
that we’ll explore later.example.csv
is the file that we’re manipulating.
Operations Overview
We can use those verbs to run specific operations on your data. There’s a lot we can do. Let’s explore.
Data
I’ll be using a dataset of IMDb ratings for American TV dramas created by The Economist. You can download it here or find it in the repo for this article.
Note: For the sake of brevity, I’ve renamed the file from mlr --csv head ./IMDb_Economist_tv_ratings.csv
to tv_ratings.csv
.
Above, I mentioned that every command contains a specific operation or verb. Let’s learn our first one, called head
. What it does is show you the beginning of the file (the “head”) rather than print the entire file in the console.
You can run the following command:
`mlr --csv head ./tv_ratings.csv`
And this is the output you’ll see:
titleId,seasonNumber,title,date,av_rating,share,genres
tt2879552,1,11.22.63,2016-03-10,8.489,0.51,"Drama,Mystery,Sci-Fi"
tt3148266,1,12 Monkeys,2015-02-27,8.3407,0.46,"Adventure,Drama,Mystery"
tt3148266,2,12 Monkeys,2016-05-30,8.8196,0.25,"Adventure,Drama,Mystery"
tt3148266,3,12 Monkeys,2017-05-19,9.0369,0.19,"Adventure,Drama,Mystery"
tt3148266,4,12 Monkeys,2018-06-26,9.1363,0.38,"Adventure,Drama,Mystery"
tt1837492,1,13 Reasons Why,2017-03-31,8.437,2.38,"Drama,Mystery"
tt1837492,2,13 Reasons Why,2018-05-18,7.5089,2.19,"Drama,Mystery"
tt0285331,1,24,2002-02-16,8.5641,6.67,"Action,Crime,Drama"
tt0285331,2,24,2003-02-09,8.7028,7.13,"Action,Crime,Drama"
tt0285331,3,24,2004-02-09,8.7173,5.88,"Action,Crime,Drama"
This is a bit hard to read, so let’s make it easier on the eye by adding --opprint
.
mlr --csv --opprint head ./tv_ratings.csv
The resulting output will be the following:
titleId seasonNumber title date av_rating share genres
tt2879552 1 11.22.63 2016-03-10 8.489 0.51 Drama,Mystery,Sci-Fi
tt3148266 1 12 Monkeys 2015-02-27 8.3407 0.46 Adventure,Drama,Mystery
tt3148266 2 12 Monkeys 2016-05-30 8.8196 0.25 Adventure,Drama,Mystery
tt3148266 3 12 Monkeys 2017-05-19 9.0369 0.19 Adventure,Drama,Mystery
tt3148266 4 12 Monkeys 2018-06-26 9.1363 0.38 Adventure,Drama,Mystery
tt1837492 1 13 Reasons Why 2017-03-31 8.437 2.38 Drama,Mystery
tt1837492 2 13 Reasons Why 2018-05-18 7.5089 2.19 Drama,Mystery
tt0285331 1 24 2002-02-16 8.5641 6.67 Action,Crime,Drama
tt0285331 2 24 2003-02-09 8.7028 7.13 Action,Crime,Drama
tt0285331 3 24 2004-02-09 8.7173 5.88 Action,Crime,Drama
Much better, isn’t it?
Note: Rather than typing --csv --opprint
every time, we can use the --c2p
option, which is a shortcut.
Chaining
That’s where the fun begins. Rather than run multiple commands, we can chain the verbs together by using the then
keyword.
Remove columns
You can see that there’s a titleId
column that isn’t very useful. Let’s get rid of it using the cut
verb.
mlr --c2p cut -x -f titleId then head ./tv_ratings.csv
It gives you the following output:
seasonNumber title date av_rating share genres
1 11.22.63 2016-03-10 8.489 0.51 Drama,Mystery,Sci-Fi
1 12 Monkeys 2015-02-27 8.3407 0.46 Adventure,Drama,Mystery
2 12 Monkeys 2016-05-30 8.8196 0.25 Adventure,Drama,Mystery
3 12 Monkeys 2017-05-19 9.0369 0.19 Adventure,Drama,Mystery
4 12 Monkeys 2018-06-26 9.1363 0.38 Adventure,Drama,Mystery
1 13 Reasons Why 2017-03-31 8.437 2.38 Drama,Mystery
2 13 Reasons Why 2018-05-18 7.5089 2.19 Drama,Mystery
1 24 2002-02-16 8.5641 6.67 Action,Crime,Drama
2 24 2003-02-09 8.7028 7.13 Action,Crime,Drama
3 24 2004-02-09 8.7173 5.88 Action,Crime,Drama
Fun Fact
This is how I first learned about Miller! I was playing with a CSV dataset for https://details.town/ that had a useless column, and I looked up “how to remove a column from CSV command line.” I discovered Miller, loved it, and then pitched an article to Smashing magazine. Now here we are!
Filter
This is the verb that I first showed earlier. We can remove all the rows that don’t match a specific expression, letting us clean our data with only a few characters.
If we only want the rating of the first seasons of every series in the dataset, this is how you do it:
mlr --c2p filter '$seasonNumber == 1' then head ./tv_ratings.csv
Sorting
We can sort our data based on a specific column like it would be in a UI like Excel or macOS Numbers. Here’s how you would sort your data based on the series with the highest rating:
mlr --c2p sort -nr av_rating then head ./tv_ratings.csv
The resulting output will be the following:
titleId seasonNumber title date av_rating share genres
tt0098887 1 Parenthood 1990-11-13 9.6824 1.68 Comedy,Drama
tt0106028 6 Homicide: Life on the Street 1997-12-05 9.6 0.13 Crime,Drama,Mystery
tt0108968 5 Touched by an Angel 1998-11-15 9.6 0.08 Drama,Family,Fantasy
tt0903747 5 Breaking Bad 2013-02-20 9.554 18.95 Crime,Drama,Thriller
tt0944947 6 Game of Thrones 2016-05-25 9.4943 15.18 Action,Adventure,Drama
tt3398228 5 BoJack Horseman 2018-09-14 9.4738 0.45 Animation,Comedy,Drama
tt0103352 3 Are You Afraid of the Dark? 1994-02-23 9.4349 2.6 Drama,Family,Fantasy
tt0944947 4 Game of Thrones 2014-05-09 9.4282 11.07 Action,Adventure,Drama
tt0976014 4 Greek 2011-03-07 9.4 0.01 Comedy,Drama
tt0090466 4 L.A. Law 1990-04-05 9.4 0.1 Drama
We can see that Parenthood, from 1990, has the highest rating on IMDb — who knew!
Saving Our Operations
By default, Miller only prints your processed data to the console. If we want to save it to another CSV file, we can use the >
operator.
If we wanted to save our sorted data to a new CSV file, this is what the command would look like:
mlr --csv sort -nr av_rating ./tv_ratings.csv > sorted.csv
Convert CSV To JSON
Most of the time, you don’t use CSV data directly in your application. You convert it to a format that is easier to read or doesn’t require additional dependencies, like JSON.
Miller gives you the --c2j
option to convert your data from CSV to JSON. Here’s how to do this for our sorted data:
mlr --c2j sort -nr av_rating ./tv_ratings.csv > sorted.json
Case study: Top 5 Athletes With Highest Number Of Medals In Rio 2016
Let’s apply everything we learned above to a real-world use case. Let’s say that you have a detailed dataset of every athlete who participated in the 2016 Olympic games in Rio, and you want to know who the 5 with the highest number of medals are.
First, download the athlete data as a CSV, then save it in a file named athletes.csv
.
Let’s open up the following file:
mlr --c2p head ./athletes.csv
The resulting output will be something like the following:
id name nationality sex date_of_birth height weight sport gold silver bronze info
736041664 A Jesus Garcia ESP male 1969-10-17 1.72 64 athletics 0 0 0 -
532037425 A Lam Shin KOR female 1986-09-23 1.68 56 fencing 0 0 0 -
435962603 Aaron Brown CAN male 1992-05-27 1.98 79 athletics 0 0 1 -
521041435 Aaron Cook MDA male 1991-01-02 1.83 80 taekwondo 0 0 0 -
33922579 Aaron Gate NZL male 1990-11-26 1.81 71 cycling 0 0 0 -
173071782 Aaron Royle AUS male 1990-01-26 1.80 67 triathlon 0 0 0 -
266237702 Aaron Russell USA male 1993-06-04 2.05 98 volleyball 0 0 1 -
382571888 Aaron Younger AUS male 1991-09-25 1.93 100 aquatics 0 0 0 -
87689776 Aauri Lorena Bokesa ESP female 1988-12-14 1.80 62 athletics 0 0 0 -
Optional: Clean Up The File
The CSV file has a few fields we don’t need. Let’s clean it up by removing the info
, id
, weight
, and date_of_birth
columns.
mlr --csv -I cut -x -f id,info,weight,date_of_birth athletes.csv
Now we can move to our original problem: we want to find who won the highest number of medals. We have how many of each medal (bronze, silver, and gold) the athletes won, but not the total number of medals per athlete.
Let’s compute a new value called medals
which corresponds to this total number (bronze, silver, and gold added together).
mlr --c2p put '$medals=$bronze+$silver+$gold' then head ./athletes.csv
It gives you the following output:
name nationality sex height sport gold silver bronze medals
A Jesus Garcia ESP male 1.72 athletics 0 0 0 0
A Lam Shin KOR female 1.68 fencing 0 0 0 0
Aaron Brown CAN male 1.98 athletics 0 0 1 1
Aaron Cook MDA male 1.83 taekwondo 0 0 0 0
Aaron Gate NZL male 1.81 cycling 0 0 0 0
Aaron Royle AUS male 1.80 triathlon 0 0 0 0
Aaron Russell USA male 2.05 volleyball 0 0 1 1
Aaron Younger AUS male 1.93 aquatics 0 0 0 0
Aauri Lorena Bokesa ESP female 1.80 athletics 0 0 0 0
Ababel Yeshaneh ETH female 1.65 athletics 0 0 0 0
Sort by the highest number of medals by adding a sort
.
mlr --c2p put '$medals=$bronze+$silver+$gold' \
then sort -nr medals \
then head ./athletes.csv
Respectively, the resulting output will be the following:
name nationality sex height sport gold silver bronze medals
Michael Phelps USA male 1.94 aquatics 5 1 0 6
Katie Ledecky USA female 1.83 aquatics 4 1 0 5
Simone Biles USA female 1.45 gymnastics 4 0 1 5
Emma McKeon AUS female 1.80 aquatics 1 2 1 4
Katinka Hosszu HUN female 1.75 aquatics 3 1 0 4
Madeline Dirado USA female 1.76 aquatics 2 1 1 4
Nathan Adrian USA male 1.99 aquatics 2 0 2 4
Penny Oleksiak CAN female 1.86 aquatics 1 1 2 4
Simone Manuel USA female 1.78 aquatics 2 2 0 4
Alexandra Raisman USA female 1.58 gymnastics 1 2 0 3
Restrict to the top 5 by adding -n 5
to your head
operation.
mlr --c2p put '$medals=$bronze+$silver+$gold' \
then sort -nr medals \
then head -n 5 ./athletes.csv
You will end up with the following file:
name nationality sex height sport gold silver bronze medals
Michael Phelps USA male 1.94 aquatics 5 1 0 6
Katie Ledecky USA female 1.83 aquatics 4 1 0 5
Simone Biles USA female 1.45 gymnastics 4 0 1 5
Emma McKeon AUS female 1.80 aquatics 1 2 1 4
Katinka Hosszu HUN female 1.75 aquatics 3 1 0 4
As a final step, let’s convert this into a JSON file with the --c2j
option.
Here is our final command:
mlr --c2j put '$medals=$bronze+$silver+$gold' \
then sort -nr medals \
then head -n 5 ./athletes.csv > top5.json
With a single command, we’ve computed new data, sorted the result, truncated it, and converted it to JSON.
[
{
"name": "Michael Phelps",
"nationality": "USA",
"sex": "male",
"height": 1.94,
"weight": 90,
"sport": "aquatics",
"gold": 5,
"silver": 1,
"bronze": 0,
"medals": 6
}
// Other entries omitted for brevity.
]
Bonus: If you wanted to show the top 5 women, you could add a filter
.
mlr --c2p put '$medals=$bronze+$silver+$gold' then sort -nr medals then filter '$sex == "female"' then head -n 5 ./athletes.csv
Respectively, you would end up with the following output:
name nationality sex height sport gold silver bronze medals
Katie Ledecky USA female 1.83 aquatics 4 1 0 5
Simone Biles USA female 1.45 gymnastics 4 0 1 5
Emma McKeon AUS female 1.80 aquatics 1 2 1 4
Katinka Hosszu HUN female 1.75 aquatics 3 1 0 4
Madeline Dirado USA female 1.76 aquatics 2 1 1 4
Conclusion
I hope this article showed you how versatile Miller is and gave you a taste of the power of command-line tools. Feel free to scourge the internet for the best CLI next time you find yourself writing yet another random script.
Resources
Further Reading on Smashing Magazine
(yk, il)
[ad_2]