Do You Even Search, Bro?

I recently came across a software engineer I respect greatly who is unfamiliar with the basics of grep (I know, right? Blew my mind). This is for him; hopefully it will help others too. If you’re already familiar with this black magic || want to see a cool implementation, check out Grep is a Magical Beast.

Grep’s syntax is fairly simple:

grep [options] search_string file|dir

Grep supports regex as the search string; whip it out if you’re versed in its beauty. If your search string contains spaces, you must wrap it in quotes (single quotes also keep the shell from touching any special characters). Search every file in the current directory by listing * as the file argument. Search subdirectories as well by giving a path and adding the -r flag:

grep -r 'Carl' ~/Dropbox/LlamasHats/*

The [extremely] basic flags are:

Searching for ___?              The flag for you
Text in subfolders              -r
Whole words only                -w
Case-insensitive text           -i
File names only                 -l
Number of matching lines only   -c
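To make the flags concrete, here is a rough Python sketch of what -i, -w and -c do to a list of lines. The function name grep_lines and its parameters are made up for illustration, and \b is only an approximation of grep's idea of a word boundary; this is not how grep itself is implemented.

```python
import re

def grep_lines(pattern, lines, ignore_case=False, whole_word=False, count_only=False):
    """Mimic grep's -i, -w and -c flags on a list of strings (illustrative only)."""
    if whole_word:
        # -w: the match must not be glued to other word characters
        pattern = r'\b(?:' + pattern + r')\b'
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    matches = [line for line in lines if regex.search(line)]
    # -c: report how many lines matched instead of the lines themselves
    return len(matches) if count_only else matches

lines = ["Carl, that kills people!", "carl is a llama", "Carlsbad Caverns"]
print(grep_lines("Carl", lines))
print(grep_lines("carl", lines, ignore_case=True, whole_word=True))
print(grep_lines("Carl", lines, count_only=True))
```

Note that the count is a count of matching lines, so a line containing the pattern twice still counts once.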

Is your mind blown yet? No? Check out ack. The abundance of CL switches will blow your mind. The syntax of ack is the same as that of grep. Their documentation is great.

The main difference I’ve found between grep and ack is speed. Here is a great post by Perl Monks on reasons to switch.

Using find is great for filename searches. find’s syntax is a bit different from grep’s:

find file|dir [options] search_string

A generic search:

find . -iname '*.sty'
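If you ever need the same search from inside a script, a minimal Python sketch could look like this. The function name find_iname is my own invention; it walks the tree with os.walk and compares lowercased names, which only approximates what find -iname does.

```python
import fnmatch
import os

def find_iname(root, pattern):
    """Yield paths under root whose name matches the glob pattern, ignoring case."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            # lowercase both sides to get a case-insensitive comparison
            if fnmatch.fnmatchcase(name.lower(), pattern.lower()):
                yield os.path.join(dirpath, name)

for path in find_iname('.', '*.sty'):
    print(path)
```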

What Does This Wobbly “d” Do?

What is the difference between \(\frac{d}{dx}\) and \(\frac{\partial}{\partial x}\)?

…is a question I get surprisingly often when tutoring friends.

Short version: The difference is all about dependency!

The “regular d” in \(\frac{d}{d x}\) denotes ordinary differentiation: it assumes all variables are dependent on \(x\) (\(\rightarrow\) invoke the chain/product rule to treat the other variables as functions of \(x\)).

The “wobbly d” in \(\frac{\partial}{\partial x}\) denotes partial differentiation and assumes that all variables are independent of \(x\).

You can make a straight d from a wobbly d by using a beautiful thing:

\(\frac{df}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y}\frac{dy}{dx} + \frac{\partial f}{\partial z}\frac{dz}{dx}\)

You can remember it this way: the partial derivative is partially a “d”, or the “wobbly d” is partial to nemself and bends nir neck down to look at nirs reflection \(\partial\).

Sidenote: Ne/nem/nir/nirs/nemself is a pretty swell gender neutral pronoun set!

Longer version: Before we start – what is a derivative, anyway?

The derivative of a function at a chosen point describes the linear approximation of the function near that input value. Recall the trusty formula \(\Delta y = m\Delta x\), where \(m\) is the slope and \(\Delta\) represents the change in the variable. For a (real-valued) function of a single (real) variable, the derivative at a point is the slope of the tangent line to the graph at that point.

The beauty of math is: \(m = \frac{\Delta y}{\Delta x} \equiv\) “How much one quantity (the function) is changing in response to changes in another quantity (\(x\)) at that point (its input, assuming \(y(x)\)).”

So, it makes sense that the derivative of any constant is 0, since a constant (by definition) is constant \(\rightarrow\) unchanging!

\(\frac{d(c)}{dx}=0\), where \(c\) is any constant.

What about everything that isn’t a constant?

That means, \(\frac{d(x^2)}{dx} = 2x\), since \(x^2\) is dependent on \(x\). \(\frac{d}{dx}\) denotes ordinary differentiation, i.e. all variables are dependent on the given variable (in this case, \(x\)).

But what about \(\frac{d(y)}{dx}\)? Looking at this equation, we immediately assume \(y\) is a function of \(x\). Otherwise, it makes no sense. \(\frac{d(y)}{dx} \equiv \frac{d(y(x))}{dx}\)

On the other hand, \(\frac{\partial (y)}{\partial x}\) denotes partial differentiation. In this case, all variables are assumed to be independent.

\(\frac{\partial (y)}{\partial x} = 0\)

Let’s compare them with an example. \(f(x,y) = \ln(x)\sec(y) + y\)

\(\frac{\partial f}{\partial x} = \frac{\sec(y)}{x}\)

\(\frac{df}{dx}\) implies that \(y\) is dependent on (a function of) \(x\), i.e. \(f = \ln(x)\sec(y(x)) + y(x)\)

\(\frac{df}{dx} = \frac{dy}{dx}(\ln(x)\tan(y(x))\sec(y(x)) + 1) + \frac{\sec(y(x))}{x}\)

As you can see, \(\frac{\partial f}{\partial x} \neq \frac{df}{dx}\).
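If you would rather trust a computer than my algebra, here is a small numerical check in Python. It assumes \(y(x) = x^2\) purely for illustration (the example above leaves \(y(x)\) unspecified), and it estimates both derivatives with central differences; the helper names are mine.

```python
import math

def f(x, y):
    # f(x, y) = ln(x) sec(y) + y
    return math.log(x) / math.cos(y) + y

def y_of_x(x):
    # assumed dependence of y on x, chosen only for this illustration
    return x ** 2

def partial_x(x, y, h=1e-6):
    # wobbly d: vary x while holding y fixed
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def total_x(x, h=1e-6):
    # straight d: y moves along with x
    return (f(x + h, y_of_x(x + h)) - f(x - h, y_of_x(x - h))) / (2 * h)

x0 = 0.5
y0 = y_of_x(x0)
sec, tan = 1 / math.cos(y0), math.tan(y0)
dydx = 2 * x0  # derivative of the assumed y(x) = x^2

formula_partial = sec / x0
formula_total = dydx * (math.log(x0) * tan * sec + 1) + sec / x0

print(partial_x(x0, y0), formula_partial)
print(total_x(x0), formula_total)
```

Each printed pair should agree to several decimal places, while the two lines differ from each other.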

By the way, to use LaTeX in Blogger include the following before </head> in your Template (source):

<script type="text/x-mathjax-config"> MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}}); </script> 
<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"> </script>

English to Morse Translator

Today, I wanted to code an efficient letter to Morse code translator in Python and whipped this up. I’ve found that a familiarity with many of Python’s lesser-known built-in functions is quite useful in situations such as this!

Just for fun, I encourage the reader to use this script to leave their comments in Morse.

#!/usr/bin/python
#Catherine Ray
"""
morse.py {string} [{string} ... ]:
 Translates string of english letters and spaces to morse code.
"""
import sys
if len(sys.argv)>=2:
 #spaces between arguments are dropped; only the letters are translated
 text = ''.join(sys.argv[1:]).lower()
 if text.isalpha():
  morse = ["01","1000","1010","100","0", "0010", "110", "0000", "00", "0111", "101", "0100", "11", "10", "111", "0110", "1101", "010", "000", "1", "001", "0001", "011", "1001", "1011", "1100"]
  letter = map(chr, range(97, 123)) #'a' through 'z'
  LETTER_TO_MORSE = dict(zip(letter, morse))
  morse_out = [LETTER_TO_MORSE[x] for x in text]
  print ' '.join(morse_out).replace("1","-").replace("0",".")
 else:
  print "Restrict yourself to the english alphabet."
else:
 print "Enter a string to translate, friend!"
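Going the other way is mostly a matter of inverting the dictionary. Here is a minimal sketch that reuses the same table; the function names to_morse and from_morse are my own, and it assumes the Morse letters are separated by single spaces.

```python
MORSE = ["01", "1000", "1010", "100", "0", "0010", "110", "0000", "00",
         "0111", "101", "0100", "11", "10", "111", "0110", "1101", "010",
         "000", "1", "001", "0001", "011", "1001", "1011", "1100"]
LETTER_TO_MORSE = dict(zip(map(chr, range(97, 123)), MORSE))
# every code is unique, so the mapping can simply be flipped
MORSE_TO_LETTER = dict((code, letter) for letter, code in LETTER_TO_MORSE.items())

def to_morse(text):
    codes = [LETTER_TO_MORSE[ch] for ch in text.lower()]
    return ' '.join(codes).replace("1", "-").replace("0", ".")

def from_morse(signal):
    bits = signal.replace("-", "1").replace(".", "0")
    return ''.join(MORSE_TO_LETTER[code] for code in bits.split())

print(to_morse("sos"))
print(from_morse("... --- ..."))
```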

Playing in Vim

Today’s dose of code will be less of a program and more of an advanced vim tutorial.

By the power vested in me by my favored plain text editor, I created a todo system. All tasks are stored in a plain text file, their notes are folded, and they are sorted by priority.

Each entry looks like this, ordered by priority: [0 = Immediate, inf = Soon]

Priority Task
        - Notes
        - Notes
        - Notes
        - ...

For example, if my text file looks like this:

3 Send Email to Dr. Suess
        - Hello Sir, you are cool.
        - Send from junk email account
2 Braille Contractions +B
        - The anecdotes for how I recall each contraction
        - See draft post for details

Only the following will be displayed.

3 Send Email to Dr. Suess
2 Braille Contractions +B

This is done via vim trickery. :set fdm=indent will create folds which you can open with zo and close with zc.

3 Send Email to Dr. Suess
+--  2 lines: - Hello Sir, you are cool.--------------------------
2 Braille Contractions +B
+--  2 lines: - The anecdotes for how I recall each contraction---

The next step is being able to prioritize lines. The naive way to do this is :sort. However, this doesn’t preserve folds! What shall we do? We could manually yank and put, but that’s no fun.

Let’s see if we can come up with an atrocious command that sorts tasks by priority while preserving folds.

First, we replace the line delimiters with something not found in the text, in our case %%.
Using :set list, we see:

3 Send Email to Dr. Suess$
^I- Hello Sir, you are cool.$
^I- Send from junk email account$
^I$
2 Braille Contractions +B$
^I- The anecdotes for how I recall each contraction$
^I- See draft post for details$

We can replace the delimiters \n\t- with %%, using :%s/\n\t-/ %%/g, invoke :sort, and return the delimiters to their rightful places with :%s/ %%/\r\t-/g

Sidenote: I use \r instead of \n in the previous command because \r will show up in the editor immediately, while \n will show up as ^@ until you exit with :set nolist.

This is combined into the following, which sorts our document and eliminates blank lines, leaving their corresponding folds unchanged!

:%s/\n\t-/ %%/g | :sort | :%s/ %%/\r\t-/g

Do not fear! You can alias commands in vim with :command. Note that user-defined commands must be capitalized. The following command will create the alias :Todo.

:command Todo :%s/\n\t-/ %%/g|:sort|:%s/ %%/\r\t-/g
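For comparison, here is how the same idea might look outside of vim: group each task with its indented notes, sort the groups, and glue everything back together. This is only a sketch under my own assumptions: the function name sort_tasks is made up, every task line is assumed to start with an integer priority, and the sort is numeric, whereas a plain :sort compares lines as text.

```python
def sort_tasks(text):
    """Sort todo entries by their leading priority, keeping notes attached."""
    blocks = []
    for line in text.splitlines():
        if not line.strip():
            continue  # drop blank lines
        if line[0].isspace() and blocks:
            blocks[-1].append(line)   # indented note: belongs to the previous task
        else:
            blocks.append([line])     # unindented line starts a new task
    # assumes the first word of every task line is an integer priority
    blocks.sort(key=lambda block: int(block[0].split()[0]))
    return '\n'.join('\n'.join(block) for block in blocks)

todo = ("3 Send Email to Dr. Suess\n"
        "\t- Hello Sir, you are cool.\n"
        "\t- Send from junk email account\n"
        "\t\n"
        "2 Braille Contractions +B\n"
        "\t- The anecdotes for how I recall each contraction\n"
        "\t- See draft post for details\n")
print(sort_tasks(todo))
```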

In conclusion, although this is a fun exercise in vim – use a bash script for your todo list needs!

Consequent Challenge

Today, as I was writing a post on sorting in vim, I issued myself a challenge.

The challenge: without using a bash script, write a one-liner that reads through all the lines in a file, sorts them and prints the sorted lines to stdout. Do so in under a minute without using the internet.

Before you continue, I encourage you to try it yourself. Get out a timer … Ready? Go!

In about 35 seconds, I had this:

print '\n'.join(sorted([line.strip() for line in open("file.txt")]))

Although this one-liner works, the filename is hardcoded. Quick fix (finished at 52 seconds):

import sys; print '\n'.join(sorted([line.strip() for line in open(sys.argv[1])]))

I find that self-issuing pseudorandom timed challenges is a fun way to train yourself to work under pressure. ‘Tis one of the many ways to gamify everyday life!

Sidenote: Do not fall into the trap of using one-liners in actual code. The Pythonic way to do this is:

import sys
try:
    with open(sys.argv[1]) as f:
        #collect every line first, then sort the whole list
        lines = [line.strip() for line in f]
    print '\n'.join(sorted(lines))
except IOError:
    print "File does not exist."

Popular Weekdays

The code on my blog will range in quality from “I’m waiting in line and have 10 minutes to code” to “I’ve been working on this all day.”

Let’s begin with some quick, semi-hardcoded scripts, shall we?

# dayofweek.py
"""
Counts the amount of hits per weekday, given a file of data corresponding to "year month day amount-of hits"
"""
import datetime
from sys import argv
from calendar import day_abbr #import day_name for full name
#starts with a zero counter array (each index corresponds to a weekday, 0:Monday, 1:Tuesday, ...)
weekcount = [0]*7 #hits per day
args = argv[1:] #get rid of script declaration in args
#make sure string has year, month, day, amount-of-hits-that-day
if len(args) % 4 == 0:
  for c in range(0, len(args), 4):
    #increment slice so we are only looking at one set of (year, month, day, amount-of-hits-that-day) at one time
    if all(a.isdigit() for a in args[c:c+4]):
      year, month, day, count = [int(a) for a in args[c:c+4]]
      #get index of day that the given date corresponds to (0:Monday, same as day_abbr)
      daynum = datetime.date(year, month, day).weekday()
      #add the amount-of-hits-that-day to the counter
      weekcount[daynum] += count
  #display days from most to least popular
  print '\n'.join(day_abbr[d] for d in sorted(range(7), key=lambda k: weekcount[k], reverse=True))
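One detail worth checking whenever you index into calendar.day_abbr: it starts on Monday, just like date.weekday(), while strftime("%w") counts from Sunday. A quick sketch to see the difference (the date is simply one that falls on a Monday):

```python
import datetime
from calendar import day_abbr

d = datetime.date(2008, 9, 15)   # a Monday
print(d.weekday())               # Monday-based index, lines up with day_abbr
print(int(d.strftime("%w")))     # Sunday-based index, off by one from day_abbr
print(day_abbr[d.weekday()])
```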

Okay, we need some random data to test this bad boy on. Naively, we could create a random data set of valid days as follows:

# randomday.py
import random as r
f=[]
for x in xrange(12):
 year = r.choice(range(1997, 2013))
 month = r.choice(range(1, 13)) #1 through 12
 day = r.choice(range(1, 29)) #1 through 28, valid in every month
 hits = r.choice(range(6666,9999))
 f.extend((year, month, day, hits))
print ' '.join(map(str,f))

An example of this output is:

2008 9 15 7311 2007 7 1 9812 2011 6 9 7721 2003 7 21 6736 2010 9 13 7776 1997 9 14 8776 1999 7 14 8617 2012 9 4 8208 2006 11 26 9689 2004 11 10 8952 1997 7 19 7799 2007 9 15 7858

We can feed these test values into our original script with a quick bash one-liner:

python dayofweek.py `python randomday.py`

We will receive the weekdays in this random set of data, ordered from most to least popular:

Wed
Fri
Sat
Mon
Tue
Sun
Thu

Introduction to Hive

Let’s say we have a plain text file, erdos.txt, with the following contents:

Paul Erdos 0
Chris Godsil 1
Leo Moser 1
Hanfried Lenz 2 

Let’s make a table to query this information in HiveQL, using the appropriate types*:

hive> CREATE TABLE erdos (
    > firstname STRING, 
    > surname STRING, 
    > number INT)
    > row format delimited fields terminated by '\040' lines terminated by '\n'; 

Currently, we must use ’\040’ to represent a space. It is hive-friendly to use the octal number corresponding to the character (in our case, a single space) that is our field terminator.
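The octal escape is easy to sanity-check outside of Hive. A tiny Python sketch (the variable names are mine):

```python
# '\040' is the octal escape for character code 32, i.e. a single space
print('\040' == ' ')
print(int('40', 8))

# splitting a row of erdos.txt on that character gives the three fields
firstname, surname, number = "Paul Erdos 0".split('\040')
print(firstname + ', ' + surname + ', ' + number)
```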

Let’s load our data into our shiny new table.

hive> LOAD DATA LOCAL INPATH 'erdos.txt'
    > INTO TABLE erdos; 

Sweet! We’ve populated the erdos table. Let’s make another one with comments corresponding to a given Erdos number, and an equation that generates said number!

This file, cool.txt, looks like this:

0,HOLY CRAP YOU ARE ERDOS,e^(i*pi) + 1
1,What did Erdos smell like?,i^2
2,I will bet you put that in your resume,(1+i)(1-i)

Let’s make the table…

hive> CREATE TABLE cool (
    > number INT, 
    > comment STRING,
    > fact STRING)
    > row format delimited fields terminated by ',' lines terminated by '\n';

Let’s get the firstname and comment from these tables.

hive> SELECT erdos.firstname, cool.comment         
    > from cool join erdos   
    > on cool.number = erdos.number;
Paul HOLY CRAP YOU ARE ERDOS
Chris What did Erdos smell like?
Leo What did Erdos smell like?
Hanfried I will bet you put that in your resume
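If joins are new to you, it may help to see the same lookup spelled out by hand. This Python sketch copies the rows from the two files above into plain data structures; it only illustrates the logic of the join and says nothing about how Hive actually executes it.

```python
erdos = [("Paul", "Erdos", 0), ("Chris", "Godsil", 1),
         ("Leo", "Moser", 1), ("Hanfried", "Lenz", 2)]
cool = {0: "HOLY CRAP YOU ARE ERDOS",
        1: "What did Erdos smell like?",
        2: "I will bet you put that in your resume"}

# keep a row only when its number has a matching entry in cool
joined = [(first, cool[number]) for first, surname, number in erdos if number in cool]
for first, comment in joined:
    print(first + ' ' + comment)
```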

I want to add another entry to the erdos table for Ronald Graham. But I forgot what attributes I have in the erdos table.

DESCRIBE erdos;
OK
firstname    string    
surname    string    
number    int  

Okay, now I know to insert the information for Graham in this format: Ronald Graham 1
Unlike Impala, the syntax

INSERT INTO erdos VALUES ("Ronald", "Graham", "1")

is not currently supported in Hive. If we don’t want to switch to the impala-shell, we have other options to get the job done**.

Let’s rerun the join query.

Paul HOLY CRAP YOU ARE ERDOS
Chris What did Erdos smell like?
Leo What did Erdos smell like?
Hanfried I will bet you put that in your resume
Ronald What did Erdos smell like?

Okay, cool. Ronald is included.
I want to partition our erdos table by whether the person is alive or dead. An easy way to do this is to separate the lines corresponding to the live and dead people into two different files.
(If you love making things more autonomous than necessary as much as I do, you can write a Java program to scrape Wikipedia to see if a given person was dead or not, and a bash script to transfer lines from erdos.txt into dead.txt and alive.txt appropriately. I encourage you to write this class-script combo; I may reveal mine in a later post. More info on partitioning tables with Hive.)

We now have two files:
dead.txt

Paul Erdos 0
Leo Moser 1
Hanfried Lenz 2

alive.txt

Chris Godsil 1
Ronald Graham 1

Let’s drop the current erdos table and create a partitioned table.

hive> DROP TABLE erdos;
hive> CREATE TABLE erdos (
    > firstname STRING, 
    > surname STRING, 
    > number INT)
    > PARTITIONED BY (exists BOOLEAN)
    > row format delimited fields terminated by '\040' lines terminated by '\n';

Then, load each file into its corresponding partition.

hive> LOAD DATA LOCAL INPATH 'dead.txt'
    > OVERWRITE INTO TABLE erdos
    > PARTITION (exists=false);
hive> LOAD DATA LOCAL INPATH 'alive.txt'
    > INTO TABLE erdos
    > PARTITION (exists=true);

Did it work? Let’s check…

hive> show partitions erdos;
OK
exists=false
exists=true

We can see that the partition is treated as an attribute of the erdos table.

hive> DESCRIBE erdos;                      
OK
firstname    string    
surname    string    
number    int    
exists    boolean

That’s all for today folks. I’ll end with the types currently supported by Hive.

*

Numeric Types

  • TINYINT
  • SMALLINT
  • INT
  • BIGINT
  • FLOAT
  • DOUBLE
  • DECIMAL (Note: Only available starting with Hive 0.11.0)

Date/Time Types

  • TIMESTAMP (Note: Only available starting with Hive 0.8.0)
  • DATE (Note: Only available starting with Hive 0.12.0)

Misc Types

  • BOOLEAN
  • STRING
  • BINARY (Note: Only available starting with Hive 0.8.0)

Complex Types

  • arrays: ARRAY<data_type>
  • maps: MAP<primitive_type, data_type>
  • structs: STRUCT<col_name : data_type [COMMENT col_comment], …>
  • union: UNIONTYPE<data_type, data_type, …>

**

(1) Put ‘Ronald Graham 1’ into a file, temp.txt, and load that file.

hive> quit;
echo Ronald Graham 1 > temp.txt
hive> LOAD DATA LOCAL INPATH 'temp.txt'
    > INTO TABLE erdos;

(2) Replace the file ‘erdos.txt’ with ‘Ronald Graham 1’

hive> quit;
echo Ronald Graham 1 > erdos.txt
hive> LOAD DATA LOCAL INPATH 'erdos.txt'
    > INTO TABLE erdos;

(3) Append ‘Ronald Graham 1’ to erdos.txt and reload the entire file.

hive> quit;
echo Ronald Graham 1 >> erdos.txt
hive> LOAD DATA LOCAL INPATH 'erdos.txt'
    > OVERWRITE INTO TABLE erdos;

I prefer method 3 for small files.

Small Math Puzzles Make My Day

I was recently hanging out with some friends, and one of them brought out an old math problem sheet. This problem sheet was briefly passed around and then put away again. One of the problems was a cute math puzzle. This problem was…

Find the sum of the first 1234 elements in the following sequence
1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, …


Sidenote:
There is a difference between a sequence and a series. The difference is:
a sequence is a list of numbers (1, 2, 1, 2, 2, …)
a series is the sum of a sequence (1 + 2 + 1 + 2 + 2 + … )


Before proceeding, I encourage you to solve the problem yourself.

The first thing I recognized was that the positions of “1” in the sequence corresponded to the triangle numbers.
1,
2, 1,
2, 2, 1,
2, 2, 2, 1,
2, 2, 2, 2, 1,

After the sheet was put away, the little problem stuck in my head. When I got the opportunity, I used my phone to run my quickly written script.

x=1234
for n in range(0, x-1):
  if (n*(n+1))/2 > x: 
    tri =n-1
    print x*2-tri 
    break

This finds the number of triangle numbers less than 1234 (which corresponds to the number of 1s in the sequence), then subtracts this number from 1234*2.
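If you have a real keyboard at hand, the answer is easy to double-check by brute force. This sketch simply generates the sequence and adds up the first 1234 terms; the generator name is my own.

```python
from itertools import islice

def sequence():
    """Yield 1, 2, 1, 2, 2, 1, ...: a 1, then k twos before the next 1, for k = 1, 2, 3, ..."""
    k = 1
    yield 1
    while True:
        for _ in range(k):
            yield 2
        yield 1
        k += 1

total = sum(islice(sequence(), 1234))
print(total)  # 2419
```

There are 49 triangle numbers up to 1234 (the 49th is 1225, the 50th is 1275), so the shortcut gives 2 * 1234 - 49 = 2419 as well.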

Grep Is a Magical Beast (ft. HiveQL and Impala)

Grep is a magical beast which can be used to make your bash scripts excellent. This post will give you a taste of its utility. Let’s say I have a file, temp.txt which contains two lines:

don't forget to be awesome
so long and thanks for all the fish!

I’d like to execute one command if a word exists in the file, and a different command if the word doesn’t exist in the file. This is well-suited for grep: for example,

grep -q -Rw 'awesome' temp.txt && echo found || echo not found

echo prints to standard out; surrounding quotes are not required
-q will keep grep quiet, so that only its exit status is used
-R will recursively search through any directories it is given
-x will look for matching whole lines
-w will look for matching whole words

If we run the command in terminal, we get the result found as expected.
If we run the same command, but change w to x,

grep -q -Rx 'awesome' temp.txt && echo found || echo not found

we get the result not found because there is no line that contains only “awesome.” If we had a file with the following contents, we’d get the result found:

don't forget to be awesome
so long and thanks for all the fish!
awesome
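The difference between whole-word matching (grep's -w) and whole-line matching (grep's -x) can be sketched in a few lines of Python. The helper names are mine, and the comparison is on literal text rather than a full regex, so treat it as an illustration only.

```python
import re

lines = ["don't forget to be awesome", "so long and thanks for all the fish!"]

def word_match(word, lines):
    # like grep -w: the word appears somewhere, not glued to other word characters
    return any(re.search(r'\b' + re.escape(word) + r'\b', line) for line in lines)

def line_match(text, lines):
    # like grep -x: some line consists of exactly this text
    return any(line == text for line in lines)

print(word_match('awesome', lines))                 # True
print(line_match('awesome', lines))                 # False
print(line_match('awesome', lines + ['awesome']))   # True
```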

Now that you are beginning to appreciate grep, let’s say I have a file, tensors.txt, containing this information:

electric field,1.0
polarization,1.0
tau,0.0
stress,2.0
strain,2.0

But where does grep come in? Let’s say you want to load information from tensors.txt into the Hive table “EXAMPLETABLE”, but you are unsure about whether the table already exists or not. This calls for… *superman noises* … a bash script!

This script will query Hive, and ask politely for a list of its current tables. When Hive replies via stdout, the script will take notes into some temporary file which we can search later.

#!/bin/bash
function createtable {
    #creates new table if table does not already exist.
    NEWTABLE=$1
    TMP=$DIR"tempfile.txt"
    #check to see that table is nonexistent
    hive -S -e "SHOW TABLES" > $TMP #see explanation of flags below
    echo `grep -q -i $NEWTABLE $TMP && echo table already exists || echo we must create a new table` 
}
#Then, we call the function...
TABLENAME=EXAMPLETABLE
createtable $TABLENAME

-S runs hive in silent mode
-e runs the following string as an external query

Okay, awesome, we can print the status of the table’s existence to stdout. But friend, this is just the beginning, we can replace the echo line with:

grep -q -i $NEWTABLE $TMP && echo Table already exists. Please check your tablename. ||
    hive -S -e "CREATE TABLE $NEWTABLE(
    name STRING,
    rank FLOAT
    )
    row format delimited fields terminated by ',' lines terminated by '\n'"

So the entire script now looks like

#!/bin/bash
function createtable {
    #creates new table if table does not already exist.
    NEWTABLE=$1
    TMP=$DIR"tempfile.txt"
    #check to see that table is nonexistent
    hive -S -e "SHOW TABLES" > $TMP 
    grep -q -i $NEWTABLE $TMP && echo Table already exists. Please check your tablename. ||
        hive -S -e "CREATE TABLE $NEWTABLE(
        name STRING,
        rank FLOAT
    )
    row format delimited fields terminated by ',' lines terminated by '\n'"
}
TABLENAME=EXAMPLETABLE
createtable $TABLENAME
hive -S -e "LOAD DATA LOCAL INPATH 'tensors.txt' INTO TABLE $TABLENAME"

This function will only create a new table if the table does not already exist.
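One caveat: without -w or -x, grep matches substrings, so a table called exampletable2 would also count as a hit for EXAMPLETABLE. If that matters to you, the check can be made exact. Here it is sketched in Python with a hypothetical helper, table_exists, that takes the text Hive printed for SHOW TABLES:

```python
def table_exists(name, show_tables_output):
    """Return True if name is one of the listed tables, ignoring case."""
    tables = [line.strip().lower() for line in show_tables_output.splitlines()]
    return name.lower() in tables

listing = "erdos\ncool\nexampletable2\n"   # made-up SHOW TABLES output
print(table_exists("EXAMPLETABLE", listing))
print(table_exists("Erdos", listing))
```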

Want to go faster? Let’s run pieces of this script via Impala. We don’t have to make major alterations to our bash script; most HiveQL SELECT and INSERT statements run unmodified in Impala. Impala does not currently support CREATE statements, so we must CREATE in Hive. To be safe, we will run REFRESH in the impala-shell so that it reflects all recent changes.

The equivalent of

hive -S -e "SHOW TABLES" > $TMP

in Impala is

impala-shell -B --quiet -q "SHOW TABLES" -o $TMP

-B returns results in plain text
--quiet runs impala-shell in quiet mode
-q runs the following string as an external query

The entire script looks like this:

#!/bin/bash
function createtable {
    #creates new table if table does not already exist.
    NEWTABLE=$1
    TMP=$DIR"tempfile.txt"
    #check to see that table is nonexistent
    impala-shell -B --quiet -q "SHOW TABLES" -o $TMP
    grep -q -i $NEWTABLE $TMP && echo Table already exists. Please check your tablename. ||
        hive -S -e "CREATE TABLE $NEWTABLE(
        name STRING,
        rank FLOAT
    )
    row format delimited fields terminated by ',' lines terminated by '\n'"
    impala-shell --quiet -q "refresh" 
}
TABLENAME=EXAMPLETABLE
createtable $TABLENAME
impala-shell --quiet -q "LOAD DATA INPATH 'tensors.txt' INTO TABLE $TABLENAME"

Note that there is no LOCAL in the Impala LOAD statement. Impala does not currently support loading information from files outside of an HDFS location.

tl;dr, grep is efficient and slightly magical when implemented properly.

MySQL: LIKE vs. REGEXP

Using LIKE in a query can be an order of magnitude faster than using REGEXP.
The downside is that LIKE doesn’t offer the control and generality that REGEXP does.

For example, let’s say I’m querying my table for all distinct IP addresses that begin with ‘3.14’
The entries in the ‘ip’ column of my table look like this:

ip
3.14.15.92
3.14.455566677889000.000
3.14twasbrilligandtheslithy^&*)
3.14tobeornottobethatisthequestion%$@

With LIKE, I can say

SELECT DISTINCT ip FROM TABLENAME
WHERE ip LIKE '3.14%';

This will not only match ‘3.14.15.92’
It will also match ‘3.14.455566677889000.000’, ‘3.14twasbrilligandtheslithy^&*)’ and ‘3.14tobeornottobethatisthequestion%$@’;

With REGEXP, I can be more specific, restricting the search to only valid IP addresses. With the same query header, I replace the WHERE clause with:

 WHERE ip REGEXP '^3\\.14\\.[0-9]{1,3}\\.[0-9]{1,3}$';

This will only match ‘3.14.15.92’, and disregard the invalid IP addresses.
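You can try the two kinds of matching on the sample column without a database. This Python sketch uses a prefix test as a stand-in for LIKE '3.14%' and Python's re module as a stand-in for REGEXP; it says nothing about how fast either one is in MySQL.

```python
import re

ips = ['3.14.15.92',
       '3.14.455566677889000.000',
       '3.14twasbrilligandtheslithy^&*)',
       '3.14tobeornottobethatisthequestion%$@']

# LIKE '3.14%' only checks the prefix
like_matches = [ip for ip in ips if ip.startswith('3.14')]

# the regular expression also constrains everything after the prefix
pattern = re.compile(r'^3\.14\.[0-9]{1,3}\.[0-9]{1,3}$')
regexp_matches = [ip for ip in ips if pattern.match(ip)]

print(like_matches)
print(regexp_matches)
```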

Although it is slower, REGEXP will give you more control over your query results. However, if you are worried about speed, I suggest going with LIKE and cleaning your results as a separate process.