Poor Man’s Data Munging in the Linux Shell

This is a brief cheat-sheet-style post. While there are many sophisticated tools for processing data, it is sometimes more convenient or quicker to just use the tools the Linux shell provides. (Please note that I am working in bash and don’t know how much of this applies to other shells.) The following is a brief overview of some commands I have found useful for quick and dirty data munging.


# Filtering/Selecting Data
########################

# Filter by row/line:
grep EXPRESSION FILE

# Filter by column:
awk '{print $3, $5}' FILE

# Filter by column, with custom separator, e.g., ',':
awk -F, '{print $3, $5}' FILE

# Filter by column, with custom output separator, e.g., ',':
awk 'BEGIN {OFS=","} {print $3, $5}' FILE
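
Row and column filters combine naturally in a pipeline. The following is a minimal sketch on a made-up CSV file (the file contents and the column meanings are assumptions for illustration only): grep keeps the rows of interest, then awk picks columns 3 and 5 and writes them comma-separated.

```shell
# Create a small sample file (hypothetical data, for illustration only).
tmp=$(mktemp)
printf '%s\n' \
  'id,name,score,unit,group' \
  '1,alpha,10,ms,a' \
  '2,beta,20,ms,b' \
  '3,gamma,30,ms,a' > "$tmp"

# Keep only rows ending in ",a", then print columns 3 and 5, comma-separated.
filtered=$(grep ',a$' "$tmp" | awk -F, 'BEGIN {OFS=","} {print $3, $5}')
echo "$filtered"
#> 10,a
#> 30,a

rm -f "$tmp"
```

Note that awk takes the program first and the input file last; when reading from a pipe, the file argument is simply omitted.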

# Transformation
##############

# Add line numbers:
nl -w1 FILE
# Note this is 1 (one) not a lowercase 'L'.

# Add line numbers with custom separator, e.g., ',':
nl -s, -w1 FILE

# Sort lexicographically:
sort FILE

# Sort numerically:
sort -n FILE

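# "Transpose" a file, more precisely output the content as a single row (newlines are deleted, so values are joined without a separator; use tr '\n' ' ' to keep one):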
tr -d '\n' < FILE

# Get first n (e.g. 10) rows:
head -n 10 FILE

# Get last n (e.g. 10) rows:
tail -n 10 FILE

# Omit the first n (e.g. 10) rows (Get all but the first n rows; note that tail -n +K starts output at line K, hence n+1.):
tail -n +11 FILE

# Omit the last n (e.g. 10) rows (Get all but the last n rows.):
head -n -10 FILE

# head and tail also work character-wise (Beware of the trailing '\n'.):
echo "ABCDE" | tail -c 2
#> E
echo "ABCDE" | tail -c +2
#> BCDE
echo "ABCDE" | head -c 2
#> AB
echo "ABCDE" | head -c -2
#> ABCD
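
head and tail can also be chained to cut a slice out of the middle of a file. A small sketch, using seq to generate six numbered lines as stand-in input (an assumption for illustration):

```shell
# Print lines 3 to 4 of the input:
# tail -n +3 starts output at line 3 (skipping the first 2 lines),
# head -n 2 then keeps only the next 2 lines.
slice=$(seq 1 6 | tail -n +3 | head -n 2)
echo "$slice"
#> 3
#> 4
```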

# Add leading zeros:
printf "%03d" 7

# Remove leading zeros (bash; 10# forces base 10, since printf "%d" 010 would be read as octal and print 8):
printf "%d" $((10#007))
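
Zero padding is mostly useful for generating names that sort correctly. As a small sketch (the "run_NNN.dat" naming scheme is purely hypothetical), printf can pad a whole list of numbers in one call, because it reuses its format string for every remaining argument:

```shell
# Pad each number to three digits and embed it in a (hypothetical) file name.
padded=$(printf 'run_%03d.dat\n' 1 2 10)
echo "$padded"
#> run_001.dat
#> run_002.dat
#> run_010.dat
```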

# Output
######

# Repeat string (e.g. "foo\n") n times (e.g., 10):
printf 'foo\n%.0s' $(seq 1 1 10)

# Concatenation
#############

# Concatenate two files (add FILE2 to the end of FILE1):
cat FILE1 FILE2

# Concatenate two files column-wise:
paste FILE1 FILE2

# Concatenate two files column-wise, with custom separator, e.g., ',':
paste -d, FILE1 FILE2
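
To make the column-wise concatenation concrete, here is a minimal sketch with two made-up single-column files (file names and contents are assumptions for illustration). paste joins line 1 of the first file with line 1 of the second, and so on:

```shell
# Create two small single-column sample files in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' a b c > "$dir/names"
printf '%s\n' 1 2 3 > "$dir/values"

# Join them line by line, using ',' as the separator.
joined=$(paste -d, "$dir/names" "$dir/values")
echo "$joined"
#> a,1
#> b,2
#> c,3

rm -rf "$dir"
```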

# Simple Calculations
#################

# Integer operations can be done directly in the shell:
echo $((1+2))

# For floating point operations the "bc" tool can be used:
echo "1.23 * 2" | bc
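
One thing to watch out for: bc divides with zero decimal places by default, so a division needs an explicit scale, e.g. echo "scale=3; 10 / 3" | bc. As an alternative sketch for systems where bc is not installed (this swaps in awk, it is not a bc feature), awk can evaluate floating point expressions in a BEGIN block, which runs without any input file:

```shell
# Evaluate floating point expressions with awk; printf controls the precision.
product=$(awk 'BEGIN {printf "%.2f\n", 1.23 * 2}')
quotient=$(awk 'BEGIN {printf "%.3f\n", 10 / 3}')
echo "$product $quotient"
#> 2.46 3.333
```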

# Simple Statistics
#################

# Calculate min, mean, median, max, sd, for data in column 3:
awk '{print $3}' FILE | Rscript -e 'd<-scan("stdin", quiet=TRUE); cat(min(d), mean(d), median(d), max(d), sd(d), sep=" ")'

“Rscript”, as used in the one-liner for column statistics above, can generally be used to invoke R from the shell.
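
If R is not available, a reduced set of these statistics can be computed in awk alone. The following is only a sketch under that assumption: it covers min, mean, and max (no median or standard deviation) and uses a made-up three-row sample file.

```shell
# Create a small whitespace-separated sample file (hypothetical data).
tmp=$(mktemp)
printf '%s\n' 'a x 4' 'b y 1' 'c z 7' > "$tmp"

# Track min, max, and the running sum of column 3; print min, mean, max at the end.
# Adding 0 forces awk to treat the field as a number.
stats=$(awk 'NR == 1 {min = max = $3 + 0}
     {v = $3 + 0; sum += v; if (v < min) min = v; if (v > max) max = v}
     END {if (NR > 0) print min, sum / NR, max}' "$tmp")
echo "$stats"
#> 1 4 7

rm -f "$tmp"
```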

Another way of invoking R that I have found useful in shell scripts is via here documents. The following snippet shows a simple script that iterates over all *.raw.data files in a directory, creates a box plot for each, and stores the box plot in a *.png file:


#!/bin/bash

for f in DIRECTORY/*.raw.data
do
R --slave --vanilla --no-save --quiet <<R_END
data <- read.table("$f")
png("${f}.png")
boxplot(data)
dev.off()
R_END
done

Of course, this example assumes that the data in the raw data files is already in a format that R can read. Something like this can be handy, e.g., if you want to get a quick overview of a large number of data files.
