Job Title: Universe Debugger (the Importance of Implementation)

During a late-night conversation with Austin Russell, he stated that we are living in a simulation. Together, we discussed the job description of a Universe Debugger, and created a fun way to project the world onto coding principles and concepts.

If the universe is a simulation, we can debug the universe.

We can infer the types and scoping of variables and enforce them with conditional actions. This is already the case: we have a scope of acceptable behaviors, defined by law and sociocultural norms. And, if our law enforcement worked in 100% of cases, objects with unacceptable attributes would be collected and assigned to a designated local scope until their behavior changed.

We are part of a tensor of rank n containing bits. We exist only as information. The notion of self as a constant thing is incorrect, for we are not imaginary anecdotal constants.

Since we exist as part of the simulation, we are in a white-box environment. The only way to be a good white-box tester is to understand the program at a fundamental level. This is why we are scientists. We are universe debuggers. We find problems and create patches. We add features to improve efficiency.

The problem is that we will never be able to fully test the universe. Our unit tests are only applicable within a finite scope.

Fortunately, the universe is empirically tested by actually being run. This allows us to discover new bugs and improve previous patches.

This perspective of existence demonstrates the importance of implementing your ideas. Writing a bug report is useful, but only for the purpose of writing future patches. The patches are what improve the program; the patches allow it to grow and go on.

We are tempted to abandon uninteresting bugs and ignore bugs that do not directly affect our immediate environment.

As scientists and engineers, it is easy to get trapped in patent wars, or keep our research locked up in our ivory towers. Public bug reports are left unresolved, and patches are hidden away in local closed-source distributions.

We must take initiative in implementing our ideas. We have a responsibility to work through the tedium which inevitably appears in any interesting project: to finish what we’ve started. We must transition from learning to thinking to doing.

It is our duty to debug the universe, one patch at a time.

Installing CoffeeScript on Ubuntu 13.04

Unfortunately, the current CoffeeScript docs do not cover installation on the latest Ubuntu release. To get around this, we must manually install the dependencies. Don’t worry, I’ve done most of the work for you.

Create a plain text file (vi installcoffee.sh) and insert the following:

# grab build tools and prerequisites
sudo apt-get update
sudo apt-get install git-core curl build-essential openssl libssl-dev
# build Node.js from source
git clone https://github.com/joyent/node.git && cd node
./configure
make
sudo make install
cd
# install npm, then CoffeeScript
curl http://npmjs.org/install.sh | sudo sh
sudo npm install -g coffee-script

Save and quit (Esc, then :wq), then run the script: . installcoffee.sh
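
Once the script finishes, you can sanity-check the installation by printing the compiler version and compiling-and-running a throwaway file (hello.coffee is just a placeholder name):

coffee -v
echo 'console.log "hello"' > hello.coffee
coffee hello.coffee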

What I Learned During My Summer Internship at Cloudera

The purpose of this post is to link you to a post I wrote for Cloudera’s blog: What I Learned During My Summer Internship at Cloudera

Reprint:

Catherine Ray, a Summer Intern at Cloudera this year, was kind enough to summarize her experiences for you below. Best of luck in your new field, Catherine!

I’m currently 16 and a rising senior at George Mason University, majoring in Computational Physics. (The full title is Computational and Data Sciences with a concentration in Physics.)

I had a wonderful time working on my project. In short, I worked on an Apache Hadoop-based downloads tracking system. In this system, raw download logs are ingested via Apache Flume into HDFS, then parsed with a MapReduce job into a Cloudera Impala-friendly format. I had the opportunity to collaborate with one of our teams in New York to pull the whole system together. To fully utilize the data contained in the logs, I created a Java library that finds the organizational information associated with a given IP address. I also helped to create dashboards that run queries against the collected data to analyze it and produce sales leads.

As my internship came to an end, I was able to use a skill I developed through one of my many hobbies: making YouTube videos. Specifically, when faced with the task of creating my intern presentation, I ditched my PowerPoint and made a video in order to explain my project in a more engaging format. The video below describes the system I created this summer.

Before I began my internship, I worried that I would encounter an obstacle that has repeatedly appeared in my past: not being taken seriously due to my age. After meeting my fellow interns and conversing with my mentor, my fears were quickly assuaged. The Cloudera community was truly accommodating; I was treated as the other interns were treated: just like a full-time employee.

My experience at Cloudera revealed a perspective previously hidden to me. (I also had amazing discussions with brilliant people over lunch and in the hallways on a regular basis.) Here, it is well known that one can find success at being both a scientist and an engineer. The best data scientists have both the curiosity and passion of a scientist faced with an unsolved problem, and the methodology and efficiency of an engineer tasked with implementing an effective solution. This realization has convinced me to pursue graduate studies in computer science, instead of narrowing my future studies to the applications of computer science in physics.

I also learned coding tricks and new ways of thinking in programming languages I thought I knew well. I fell in love with regular expressions; I learned the practices of documenting code and creating readable source. Outside of computer science, I learned how to ride a RipStick (a skateboard variation), I learned the art of collaboration, and I experienced a strong sense of community. My mentor was happy to answer any questions I had in detail, and our code review sessions completely changed the way I think about object-oriented programming.

I can’t thank my mentor (Aditya Acharya) enough for the time he devoted to answering all of my questions; I learned an incredible amount from him. The members of the teams with which I worked were similarly kind, accommodating, and resourceful. My fellow interns were extremely friendly and helpful — competition did not taint our interactions.

All in all, a very successful summer.

Introduction to Hive

Let’s say we have a plain text file, erdos.txt, with the following contents:

Paul Erdos 0
Chris Godsil 1
Leo Moser 1
Hanfried Lenz 2 

Let’s make a table to query this information in HiveQL, using the appropriate types*:

hive> CREATE TABLE erdos (
    > firstname STRING, 
    > surname STRING, 
    > number INT)
    > row format delimited fields terminated by '\040' lines terminated by '\n'; 

Currently, we must use '\040' to represent a space. Hive is friendliest with the octal escape code of the field-terminator character; in our case a single space, which is ASCII 32, or 040 in octal.
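
If you want to convince yourself that octal 040 really is a space, here is a quick shell check (assuming the xxd utility is available):

printf '\040' | xxd

This prints the hex byte 20, which is ASCII for a space.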

Let’s load our data into our shiny new table.

hive> LOAD DATA LOCAL INPATH 'erdos.txt'
    > INTO TABLE erdos; 

Sweet! We’ve populated the erdos table. Let’s make another one with comments corresponding to a given Erdos number, and an equation that generates said number!

This file, cool.txt, looks like this:

0,HOLY CRAP YOU ARE ERDOS,e^(i*pi) + 1
1,What did Erdos smell like?,i^2
2,I will bet you put that in your resume,(1+i)(1-i)

Let’s make the table…

hive> CREATE TABLE cool (
    > number INT, 
    > comment STRING, 
    > fact STRING)
    > row format delimited fields terminated by ',' lines terminated by '\n';

Let’s get the firstname and comment from these tables.

hive> SELECT erdos.firstname, cool.comment         
    > from cool join erdos   
    > on cool.number = erdos.number;
Paul HOLY CRAP YOU ARE ERDOS
Chris What did Erdos smell like?
Leo What did Erdos smell like?
Hanfried I will bet you put that in your resume

I want to add another entry to the erdos table for Ronald Graham. But I forgot what attributes I have in the erdos table.

hive> DESCRIBE erdos;
OK
firstname    string    
surname    string    
number    int  

Okay, now I know to insert the information for Graham in this format: Ronald Graham 1
Unlike Impala, the syntax

INSERT INTO erdos VALUES ('Ronald', 'Graham', 1)

is not currently supported in Hive. If we don’t want to switch to the impala-shell, we have other options to get the job done**.

Let’s rerun the join query.

Paul HOLY CRAP YOU ARE ERDOS
Chris What did Erdos smell like?
Leo What did Erdos smell like?
Hanfried I will bet you put that in your resume
Ronald What did Erdos smell like?

Okay, cool. Ronald is included.
I want to partition our erdos table by whether the person is alive or dead. An easy way to do this is to separate the lines corresponding to the live and dead people into two different files.
(If you love making things more autonomous than necessary as much as I do, you can write a Java program that scrapes Wikipedia to see whether a given person is dead or alive, and a bash script that transfers lines from erdos.txt into dead.txt and alive.txt appropriately. I encourage you to write this class-script combo; I may reveal mine in a later post. More info on partitioning tables with Hive.)
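
As a taste, here is a minimal bash sketch of just the splitting step. It assumes a hypothetical file dead_names.txt that lists one "Firstname Surname" per line for the deceased:

> dead.txt; > alive.txt
while read -r first last num; do
    # route each line of erdos.txt by whether the name appears in dead_names.txt
    if grep -qw "$first $last" dead_names.txt; then
        echo "$first $last $num" >> dead.txt
    else
        echo "$first $last $num" >> alive.txt
    fi
done < erdos.txt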

We now have two files:
dead.txt

Paul Erdos 0
Leo Moser 1
Hanfried Lenz 2

alive.txt

Chris Godsil 1
Ronald Graham 1

Let’s drop the current erdos table and create a partitioned table.

hive> DROP TABLE erdos;
hive> CREATE TABLE erdos (
    > firstname STRING, 
    > surname STRING, 
    > number INT)
    > PARTITIONED BY (exists BOOLEAN)
    > row format delimited fields terminated by '\040' lines terminated by '\n';

Then, load each file into its corresponding partition.

hive> LOAD DATA LOCAL INPATH 'dead.txt'
    > OVERWRITE INTO TABLE erdos
    > PARTITION (exists=false);
hive> LOAD DATA LOCAL INPATH 'alive.txt'
    > INTO TABLE erdos
    > PARTITION (exists=true);

Did it work? Let’s check…

hive> show partitions erdos;
OK
exists=false
exists=true

We can see that the partition is treated as an attribute of the erdos table.

hive> DESCRIBE erdos;                      
OK
firstname    string    
surname    string    
number    int    
exists    boolean
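
Because the partition column can be used like any other column in a query, filtering on exists should let Hive scan only the matching partition. A quick sketch:

hive> SELECT firstname, surname FROM erdos
    > WHERE exists = false;

This should touch only the dead partition and return Paul, Leo, and Hanfried.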

That’s all for today, folks. I’ll end with the types currently supported by Hive.

*

Numeric Types

  • TINYINT
  • SMALLINT
  • INT
  • BIGINT
  • FLOAT
  • DOUBLE
  • DECIMAL (Note: Only available starting with Hive 0.11.0)

Date/Time Types

  • TIMESTAMP (Note: Only available starting with Hive 0.8.0)
  • DATE (Note: Only available starting with Hive 0.12.0)

Misc Types

  • BOOLEAN
  • STRING
  • BINARY (Note: Only available starting with Hive 0.8.0)

Complex Types

  • arrays: ARRAY<data_type>
  • maps: MAP<primitive_type, data_type>
  • structs: STRUCT<col_name : data_type [COMMENT col_comment], …>
  • union: UNIONTYPE<data_type, data_type, …>

**

(1) Put ‘Ronald Graham 1’ into a file, temp.txt, and load that file.

hive> quit;
echo Ronald Graham 1 > temp.txt
hive> LOAD DATA LOCAL INPATH 'temp.txt'
    > INTO TABLE erdos;

(2) Replace the contents of ‘erdos.txt’ with ‘Ronald Graham 1’, and load the file.

hive> quit;
echo Ronald Graham 1 > erdos.txt
hive> LOAD DATA LOCAL INPATH 'erdos.txt'
    > INTO TABLE erdos;

(3) Append ‘Ronald Graham 1’ to erdos.txt and reload the entire file.

hive> quit;
echo Ronald Graham 1 >> erdos.txt
hive> LOAD DATA LOCAL INPATH 'erdos.txt'
    > OVERWRITE INTO TABLE erdos;

I prefer method 3 for small files.

Small Math Puzzles Make My Day

I was recently hanging out with some friends when one of them brought out an old math problem sheet, which was briefly passed around and then put away again. One of the problems was a cute math puzzle:

Find the sum of the first 1234 elements in the following sequence:
1, 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, …


Sidenote:
There is a difference between a sequence and a series. The difference is:
a sequence is a list of numbers (1, 2, 1, 2, 2, …)
a series is the sum of a sequence (1 + 2 + 1 + 2 + 2 + … )


Before proceeding, I encourage you to solve the problem yourself.

The first thing I recognized was that the positions of “1” in the sequence correspond to the triangle numbers (1, 3, 6, 10, 15, …):
1,
2, 1,
2, 2, 1,
2, 2, 2, 1,
2, 2, 2, 2, 1,

After the sheet was put away, the little problem stuck in my head. When I got the opportunity, I used my phone to run my quickly written script.

x = 1234
for n in range(0, x - 1):
    if (n * (n + 1)) / 2 > x:  # first triangle number past x
        tri = n - 1            # count of triangle numbers <= x, i.e. the number of 1s
        print x * 2 - tri      # each 1 standing in for a 2 lowers the sum by 1
        break

This finds the number of triangle numbers less than or equal to 1234 (which is the number of 1s among the first 1234 elements), then subtracts that count from 1234*2, since each 1 standing in place of a 2 lowers the sum by one. There are 49 triangle numbers up to 1234, so the script prints 2*1234 - 49 = 2419.
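
If you’d rather skip the loop, the count of 1s has a closed form: the largest k with k(k+1)/2 ≤ x is floor((sqrt(8x+1)-1)/2). A quick check in the same style as the script above:

import math

x = 1234
k = int((math.sqrt(8 * x + 1) - 1) / 2)  # number of triangle numbers <= x
print x * 2 - k                          # 2419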

Grep Is a Magical Beast (ft. HiveQL and Impala)

Grep is a magical beast which can be used to make your bash scripts excellent. This post will give you a taste of its utility. Let’s say I have a file, temp.txt, which contains two lines:

don’t forget to be awesome
so long and thanks for all the fish!

I’d like to execute one command if a word exists in the file, and a different command if the word doesn’t exist in the file. This is well-suited for grep: for example,

grep -q -Rw 'awesome' temp.txt && echo found || echo not found

echo prints its arguments to standard out; the surrounding quotes are not required
-q runs grep quietly: nothing is printed, and the exit status signals whether a match was found (this is what lets && and || branch on the result)
-R will search recursively if the target is a directory
-x will only match whole lines
-w will only match whole words

If we run the command in the terminal, we get the result found, as expected.
If we run the same command, but change w to x,

grep -q -Rx 'awesome' temp.txt && echo found || echo not found

we get the result not found, because there is no line that consists only of “awesome.” If we had a file with the following contents, we’d get the result found:

don’t forget to be awesome
so long and thanks for all the fish!
awesome

Now that you are beginning to appreciate grep, let’s say I have a file, tensors.txt, containing this information:

electric field,1.0
polarization,1.0
tau,0.0
stress,2.0
strain,2.0

But where does grep come in? Let’s say you want to load information from tensors.txt into the Hive table “EXAMPLETABLE”, but you are unsure whether the table already exists. This calls for… *superman noises* … a bash script!

This script will query Hive, and ask politely for a list of its current tables. When Hive replies via stdout, the script will take notes into some temporary file which we can search later.

#!/bin/bash
function createtable {
    # creates a new table only if the table does not already exist
    NEWTABLE=$1
    TMP=$DIR"tempfile.txt" # assumes $DIR points at a scratch directory
    # check whether the table already exists
    hive -S -e "SHOW TABLES" > $TMP # see explanation of flags below
    grep -q -i $NEWTABLE $TMP && echo "table already exists" || echo "we must create a new table"
}
#Then, we call the function...
TABLENAME=EXAMPLETABLE
createtable $TABLENAME

-S runs hive in silent mode
-e runs the following string as an external query

Okay, awesome: we can print the status of the table’s existence to stdout. But friend, this is just the beginning. We can replace the echo line with:

grep -q -i $NEWTABLE $TMP && echo "Table already exists. Please check your tablename." ||
    hive -S -e "CREATE TABLE $NEWTABLE(
    name STRING,
    rank FLOAT
    )
    row format delimited fields terminated by ',' lines terminated by '\n'"

So the entire script now looks like

#!/bin/bash
function createtable {
    # creates a new table only if the table does not already exist
    NEWTABLE=$1
    TMP=$DIR"tempfile.txt" # assumes $DIR points at a scratch directory
    # check whether the table already exists
    hive -S -e "SHOW TABLES" > $TMP
    grep -q -i $NEWTABLE $TMP && echo "Table already exists. Please check your tablename." ||
        hive -S -e "CREATE TABLE $NEWTABLE(
        name STRING,
        rank FLOAT
        )
        row format delimited fields terminated by ',' lines terminated by '\n'"
}
TABLENAME=EXAMPLETABLE
createtable $TABLENAME
hive -S -e "LOAD DATA LOCAL INPATH 'tensors.txt' INTO TABLE $TABLENAME"

This function will only create a new table if the table does not already exist.

Want to go faster? Let’s run pieces of this script via Impala. We don’t have to make major alterations to our bash script; most HiveQL SELECT and INSERT statements run unmodified in Impala. Impala does not currently support CREATE statements, so we must CREATE in Hive. To be safe, we will run REFRESH in the impala-shell so that it reflects all recent changes.

The equivalent of

hive -S -e "SHOW TABLES" > $TMP

in Impala is

impala-shell -B --quiet -q "SHOW TABLES" -o $TMP

-B returns results in plain, delimiter-separated text
--quiet runs impala-shell in quiet mode
-q runs the given string as a query

The entire script looks like this:

#!/bin/bash
function createtable {
    # creates a new table only if the table does not already exist
    NEWTABLE=$1
    TMP=$DIR"tempfile.txt" # assumes $DIR points at a scratch directory
    # check whether the table already exists
    impala-shell -B --quiet -q "SHOW TABLES" -o $TMP
    grep -q -i $NEWTABLE $TMP && echo "Table already exists. Please check your tablename." ||
        hive -S -e "CREATE TABLE $NEWTABLE(
        name STRING,
        rank FLOAT
        )
        row format delimited fields terminated by ',' lines terminated by '\n'"
    impala-shell --quiet -q "refresh" 
}
TABLENAME=EXAMPLETABLE
createtable $TABLENAME
impala-shell --quiet -q "LOAD DATA INPATH 'tensors.txt' INTO TABLE $TABLENAME"

Note that there is no LOCAL in the Impala LOAD statement. Impala does not currently support loading information from files outside of an HDFS location.
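
Because of this, tensors.txt has to be copied into HDFS before the Impala LOAD will succeed. A minimal sketch, assuming the relative path in the LOAD statement resolves to your HDFS home directory:

hadoop fs -put tensors.txt tensors.txt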

tl;dr, grep is efficient and slightly magical when implemented properly.

MySQL: LIKE vs. REGEXP

Using LIKE in a query is an order of magnitude faster than using REGEXP.
The downside is that LIKE doesn’t offer the control and generality that REGEXP does.

For example, let’s say I’m querying my table for all distinct IP addresses that begin with ‘3.14’.
The entries in the ‘ip’ column of my table look like this:

ip
3.14.15.92
3.14.455566677889000.000
3.14twasbrilligandtheslithy^&*)
3.14tobeornottobethatisthequestion%$@

With LIKE, I can say

SELECT DISTINCT ip FROM TABLENAME
WHERE ip LIKE '3.14%';

This will not only match ‘3.14.15.92’; it will also match ‘3.14.455566677889000.000’, ‘3.14twasbrilligandtheslithy^&*)’, and ‘3.14tobeornottobethatisthequestion%$@’.

With REGEXP, I can be more specific and restrict the search to valid IP addresses. Keeping the same query header, I replace the WHERE clause with the following (the backslashes are doubled because MySQL string literals treat a single backslash as an escape character):

 WHERE ip REGEXP '^3\\.14\\.[0-9]{1,3}\\.[0-9]{1,3}$';

This will only match ‘3.14.15.92’, and disregard the invalid IP addresses.

Although it is slower, REGEXP will give you more control over your query results. However, if you are worried about speed, I suggest going with LIKE and cleaning your results as a separate process.
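
One compromise, sketched below, is to let the cheap LIKE filter narrow the candidate rows first, and apply REGEXP only to the survivors (TABLENAME is a placeholder, as above):

SELECT DISTINCT ip FROM TABLENAME
WHERE ip LIKE '3.14%'
  AND ip REGEXP '^3\\.14\\.[0-9]{1,3}\\.[0-9]{1,3}$';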

Avoid SQL Injection in JDBC

Let’s say I’m trying to insert some information into a mySQL table using JDBC.

In MySQL, the type varchar(255) maps to the type String in Java.

Our example table will be of the format:

name varchar(255)
field varchar(255)
university varchar(255)
alive varchar(255)

After I import,

import java.sql.*;

I’ll have to open a connection (I suggest declaring the connection variable outside of the try-catch).

public class EXAMPLEINSERT {

  public static void insertIntoTable(String jdbcURL, String USER, String PASS) {

    Connection conn = null;
    Statement stmt = null;
    try {
        // Open a connection
        conn = DriverManager.getConnection(jdbcURL, USER, PASS);

        // Execute insert
        stmt = conn.createStatement();
        String tableName = "TABLENAME";

        // (name, field, university, and alive are Strings assumed to be defined elsewhere)
        // At this point, I might be tempted to do the following ***
        String insertStatement = String.format("INSERT INTO " + tableName + " VALUES ('%s', '%s', '%s', '%s')", name, field, university, alive);
        stmt.executeUpdate(insertStatement);
        // But this is wrong!
        conn.close();
    } catch (Exception e) {
        System.err.println(e.getMessage());
    }
  }
}

But will this stand up to an attack? What if I set

String university = "university'); DROP TABLE TABLENAME;--";

(The ; ends the intended statement early, and the -- tells the database that everything after it is a comment, so the rest of the original query is ignored.)

Then my entire table will be deleted! We need to sanitize our inputs.

To avoid SQL injection, (such as the Bobby Tables post from xkcd), we must prepare the statement before execution:

public class EXAMPLEINSERT {

  public static void insertIntoTable(String jdbcURL, String USER, String PASS) {

    Connection conn = null;
    // declare the PreparedStatement outside of the try-catch
    PreparedStatement statement = null;
    try {
        // Open a connection
        conn = DriverManager.getConnection(jdbcURL, USER, PASS);

        String tableName = "TABLENAME";

        // Avoid temptation and deliver yourself from the evils of coding Java as if it were Python.
        String template = "INSERT INTO " + tableName + " (name, field, university, alive) VALUES (?, ?, ?, ?)";

        statement = conn.prepareStatement(template);
        statement.setString(1, name);
        statement.setString(2, field);
        statement.setString(3, university);
        statement.setString(4, alive); // alive is a varchar(255) column, so bind it as a String

        statement.executeUpdate();
        // More hardcoding? Slightly, but also more robust.
        conn.close();
    } catch (Exception e) {
        System.err.println(e.getMessage());
    }
  }
}
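
A hypothetical call, assuming a local MySQL server with a database named test (the URL, username, and password are placeholders):

EXAMPLEINSERT.insertIntoTable("jdbc:mysql://localhost:3306/test", "user", "pass");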