
Shell Analytics: command line tools for data analysis

Shell analytics is the practice of using command line tools to analyse or manipulate data, whether that data lives in files or comes from other command line programs. With a handful of commands you can quickly preview the first or last few lines of a file (head/tail), extract data (cat/grep/>), transform it (sed/awk), and organize and count it (sort/uniq). Specifically, I am talking about command line tools on Linux using Bash, although they are also available on Mac OS X. I am unsure of their availability or usability on Windows, so let me know if you have any insight there.

Here are a few use cases for shell analytics:

  1. Previewing the contents of a file. Often, when I get a new data file to import into a database or use in MapReduce jobs, I will check the first few lines to verify the contents and structure, or to see if I need to remove a header line. What makes a tool like head uniquely qualified for this job is that it only reads the first few lines of a file, which matters when you have a multi-gigabyte file that would make most text editors explode. It is fast no matter the size of the file.
  2. Looking for specific data. Maybe you are searching through log files for lines containing a particular session id or error message, or you are trying to find the badly formatted record in your input data that caused an error in your MapReduce job. grep and regular expressions are your friends (although they don’t always do what you think they should be doing).
  3. Extracting subsets of data. Sometimes it is useful to run a MapReduce job on a smaller subset of the input data to verify it functions as you need or expect. Using the head command you can extract the first 1000 lines (or however many you need) from a larger data set. It is also sometimes useful to extract or suppress specific lines with grep to produce a smaller data set for further analysis or processing. And you can pull out not only specific lines, but also specific parts of a line (sed) or specific columns in a delimited file (awk).
  4. Aggregation and counting. Using the uniq command (usually in combination with other commands to format and organize the data) you can aggregate unique lines and count occurrences of values or the number of unique values (a combined example follows this list).
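
To give a feel for how these use cases combine, here is the kind of pipeline the rest of this post builds toward. It assumes a hypothetical web access log, access.log, whose 9th whitespace-separated field is an HTTP status code, and it counts how often each status code appears:

 $ cat access.log | awk '{print $9}' | sort | uniq -c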

I/O Pipes

The killer feature is that multiple commands can be chained together with the pipe character (|), forming a data processing pipeline that performs complex data transformations in a single command. It is also fast, because intermediate stages are not written to or read from disk (there are, however, exceptions to this rule). When a pipeline is performance bound, it is primarily disk bound: reading input files from disk or writing output files to disk. Some operations can also be CPU bound, such as complex regular expression extraction or data compression/decompression.
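
As a small illustration of that last point, decompression can run as a stage in the pipeline itself (a CPU bound step), so no decompressed intermediate file ever hits the disk. The file name here is just a placeholder for any gzip-compressed log; zcat decompresses it to standard out:

 $ zcat app.log.gz | grep 'ERROR' | head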

Data pipelining (chaining multiple commands using a pipe) is possible because of two special file descriptors in Linux (and other Unix-based systems): standard in and standard out. There is a third special file descriptor, standard error, but you don’t typically use it for data processing unless you are trapping and handling errors in a particular way. When a program is run, each of these special file descriptors is made available to the program as a standard file handle accessible in most programming languages. Standard in can be read from just like any file, and standard out can be written to just like any file. You can take advantage of these standard file descriptors in your own programs and use them in concert with other command line tools.
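
As a minimal sketch of that last point, here is a throwaway Bash script (the name to_upper.sh is just for illustration, and the ${line^^} expansion assumes Bash 4 or later) that reads lines from standard in and writes an upper-cased copy of each to standard out:

 #!/bin/bash
 # Read each line from standard in and write it, upper-cased, to standard out.
 while IFS= read -r line; do
     echo "${line^^}"
 done

Because it only talks to standard in and standard out, it slots into a pipeline like any other command:

 $ cat names.txt | bash to_upper.sh | head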

The command line tools discussed here make specific use of these standard file descriptors to pass data to and receive data from other commands through I/O pipes. When you name one command with its arguments, append the pipe character, and name another command with its arguments, Bash implicitly connects the standard output of the first command to the standard input of the second. Here is a simple example that reads the contents of a file and prints each line containing the characters “joe” (note: examples will show the Bash prompt $ followed by the command as you would enter and execute it):

 $ cat file.txt | grep 'joe'

The first command in a pipeline will always be a command which writes some data to standard out. The most common for shell analytics is the cat command, which reads the contents of one or more named files and writes them to standard out (it can do more, but that is the basic use). Also useful are the head and tail commands, which read the first or last N lines respectively (default 10) from a file or standard in.

Useful Commands

Below are a few of the most useful commands for shell analytics which take advantage of reading/writing data from standard in/standard out. I will focus on use cases for shell analytics, but they have many other uses. To learn more about a command, a great resource is the command’s man page. man is a command that displays the manual (i.e. documentation) for another command; to read the man page for the cat command, run “man cat”. Also helpful is Google: if you search for the command name and what you are trying to use it for, you can usually find helpful information.

Each command will show example usage. I will leave it up to the reader to experiment with real data and variations on the commands and the combinations which can be made.

Cat

The cat command is like glue. Its most common use is to read one or more files and write them to standard out. Most shell analytics pipelines will start with cat, although it is not always necessary; many of the commands below can read a file on their own. I like to use it anyway, because I can start testing a data pipeline with “head file.txt | ...” and then simply change head to cat to run it on the whole file (an example of this follows below).

Example: print the contents of a file.

 $ cat file.txt
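
And here is the testing workflow mentioned above: develop the pipeline against the first few lines with head, then swap head for cat to run it over the whole file (the file name and search string are just placeholders):

 $ head file.txt | grep 'joe'
 $ cat file.txt | grep 'joe'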

Head / Tail

The head and tail commands print N lines from either the head or tail of a file or standard in. The default number of lines is 10, but that can be overridden with the -n argument. There are a number of other useful arguments you can find in the man page.

Example: print last 100 lines from a file.

 $ tail -n 100 file.txt
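
A variant worth knowing from the man page is tail -n +K, which prints everything starting from line K. That makes it an easy way to drop a header line, as in use case 1 above (the file name is a placeholder):

 $ tail -n +2 file.csv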

Grep

The grep command is like a filter. It will read a file or standard input and print lines matching (or not matching when using the -v flag) the supplied text or regular expression.

Example: print lines containing the string “joe” from the file file.txt

 $ grep "joe" file.txt

Example: print all lines from file.txt except blank lines and comment lines (lines whose first non-space character is #)

 $ grep -v '^ *\(#.*\)\?$' file.txt

Notes: It is usually good to wrap regular expressions for grep or sed in single quotes to avoid inadvertently referencing Bash variables. Also, the single biggest annoyance I find with regular expressions, particularly from the shell, is knowing or finding what characters to escape or not escape. It sometimes takes trial and error to get the regular expressions right.
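
One more example, combining two flags I reach for constantly: -i for case-insensitive matching and -c to print the count of matching lines instead of the lines themselves (the file name and search string are placeholders):

 $ grep -ic 'error' file.log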

Sed

The sed command is a stream editor for filtering or transforming text. I almost always use it for extracting or replacing parts of a line, using the expression s/REGEX/REPLACEMENT/[flags]. Your regex can match part of a line, in which case only the matched substring is replaced and the rest of the line is left as it was, or it can match the whole line and the whole line will be replaced. See the sed manual for more detailed documentation.

Example: replace all occurrences of the string “joe” with the string “bob”.

$ sed 's/joe/bob/g' file.txt

Example: extract the number of milliseconds (a number followed by “ms”) from a log file.

$ sed 's/^.*\b\([0-9]\+\)ms.*$/\1/' file.log
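
The \b word boundary in the pattern keeps the greedy .* from swallowing all but the last digit of the number. Also note that sed passes non-matching lines through unchanged by default; if you only want output for the lines that actually matched, add -n and the p flag:

 $ sed -n 's/^.*\b\([0-9]\+\)ms.*$/\1/p' file.log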

Awk

The awk command is actually a feature-rich programming language; however, I only use it to parse and extract fields from CSV files. For more complicated processing I would move up to Python rather than a more complicated awk script. The key for extracting fields is to specify the field separator with the -F argument. Then you can print fields by index to extract a single field or a subset of fields, perhaps in a different order. See the awk manual for more detailed documentation.

Example: print the 3rd field from a CSV.

$ cat file.csv | awk -F ',' '{print $3}'

Example: output a CSV of fields 6 then 3.

 $ cat file.csv | awk -F ',' '{print $6","$3}'
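
awk can also filter rows on a condition before printing. For example, assuming the 3rd column holds a number, print the 1st field of every row where that number exceeds 100:

 $ cat file.csv | awk -F ',' '$3 > 100 {print $1}'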

Sort

The sort command simply sorts lines from standard in or one or more files. This can be helpful for viewing or using sorted data, but I primarily use it to sort lines for the uniq command (described in the next section).

Example: sort a list of names

 $ sort names.txt
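
Two flags I use often (file names here are placeholders): -n sorts numerically instead of alphabetically, and -t with -k choose the delimiter and key field, so you can sort a delimited file by a particular column:

 $ sort -n numbers.txt
 $ sort -t ',' -k 2,2 -n file.csv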

Uniq

The uniq command will collapse matching adjacent lines from the input file or standard input (which is why the input is usually sorted first). This is useful if you want to count the number of unique lines in a file (by piping the output to the wc command, which counts lines, words, and/or characters). Also very useful is the -c flag, which prefixes each unique line with the number of times it occurs.

Example: count number of unique names.

 $ sort names.txt | uniq | wc -l

Example: print each unique name and the number of times it occurs.

 $ sort names.txt | uniq -c
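
A combination I use constantly is to sort those counts numerically in reverse to see the most frequent values, then take the top of the list with head:

 $ sort names.txt | uniq -c | sort -rn | head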

I/O Redirection

Normally you don’t want to just view the printed output of a series of piped commands; you want to save the data so you can share it, view it later, or use it in another program. Maybe you want to import the data into a relational database like PostgreSQL or MySQL, use it as part of a MapReduce job, or create graphs from it using gnuplot or LibreOffice.

In order to capture the output of a command or series of piped commands, you use the > or >> I/O redirection symbols. The > symbol redirects the standard output of the preceding command to the file named after it, overwriting the file if it already exists. The >> symbol also redirects the standard output of the preceding command to the file named after it, but if the file already exists the data will be appended to it.

Example: write sorted, unique, names to the file unique_names.txt.

 $ sort names.txt | uniq > unique_names.txt
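
And with >>, you can accumulate results from several runs into one file; each command appends to the same output (file names are placeholders):

 $ grep 'ERROR' monday.log >> all_errors.txt
 $ grep 'ERROR' tuesday.log >> all_errors.txt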

There are a number of other useful I/O redirection uses and syntaxes, however this is most of what you need for shell analytics.
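
One of those other uses, tying back to the standard error descriptor mentioned earlier: 2> redirects standard error to a file, which keeps error messages out of your data output (file names here are placeholders):

 $ grep 'joe' *.log > matches.txt 2> errors.txt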

Summary

The command line tools available on Linux and other systems can be powerful, flexible, and fast tools for performing data analysis or preparing data for further analysis. Through the use of I/O pipes and I/O redirection, one can perform complex data operations more quickly and easily than through other methods.
