Tuesday, December 22, 2009

The most important part of Google's open letter

Today, Google published a previously internal email. It talks about the important of openness within the company. I thought it was extremely well-written, and brought up a lot of great and thought-provoking points. However, one part in particular really stuck out to me:

So if you are trying to grow an entire industry as broadly as possible, open systems trump closed. And that is exactly what we are trying to do with the Internet. Our commitment to open systems is not altruistic. Rather it's good business, since an open Internet creates a steady stream of innovations that attracts users and usage and grows the entire industry.


Generally, when a company professes its love for openness or charity or something similar, it does so with an airy "because we're so nice" attitude, and must be read with a grain of salt. This paragraph (and the explanations that followed) were extremely refreshing. There are very few philanthropic claims in this letter; it explains exactly why Google sees value in openness, and how it helps their business.

I really enjoyed reading this piece, but it was this aspect that allowed it to be enjoyable. Without it, it's just another fluff piece. With it, it becomes a powerful explanation of Google's actions and intents.

Wednesday, December 16, 2009

Snide remarks against women in technology

"It's this culture of attacking women that has especially got to stop. Whenever I post a video of a female technologist there invariably are snide remarks about body parts and other things that simply wouldn't happen if the interviewee were a man."
- Blogger Robert Scoble, responding to threats against tech author Kathy Sierra

Dreaming of Floyd-Warshall

Last night, I was studying the Floyd-Warshall algorithm. I finished at around 11:30, read a little of The Hobbit (I'm rereading it in preparation for the movie), then went to bed.

I dreamed that I was sitting inside a table cell. I was looking up above and watching the empty cells above me get filled in one at a time. The last cell in a row would fill with some gibberish, and then the first cell in the next row would fill. The filling was getting closer and closer to me. Eventually, my cell was filled (I remember feeling very stressed at this point). I looked forward, and watched my neighbors fill. I looked under me, and watched the cells beneath me fill.

Suddenly, I was off on the sidelines, watching another table (that I was not a part of) get filled. Soon after this, the dream ended.

I can only conclude that, in my dream, I was a member of one of the cells in one of the tables of the Floyd-Warshall algorithm. Once the table I was in was filled, the algorithm incremented k and started on a new table.

I notice that I never saw a third table. Perhaps I was in table k == n - 1, and the table in front of me was the last one. Or perhaps this particular version of Floyd-Warshall was optimized (as it should be) to use O(|V^2|) space, and my table (and thus, my life) was deleted from memory after its successor was filled.

Or perhaps I just have really deep-rooted issues, and dreaming about being inside an algorithm is nature's way of warning me that I should see someone.

Sunday, December 13, 2009

The Knuth-Morris-Pratt Algorithm in my own words

Over the past few days, I've been reading various explanations of the Knuth-Morris-Pratt string searching algorithms. For some reason, none of the explanations were doing it for me. I kept banging my head against a brick wall once I started reading "the prefix of the suffix of the prefix of the...".

Finally, after reading the same paragraph of CLRS over and over for about 30 minutes, I decided to sit down, do a bunch of examples, and diagram them out. I now understand the algorithm, and can explain it. For those who think like me, here it is in my own words. As a side note, I'm not going to explain why it's more efficient than naïve string matching; that's explained perfectly well in a multitude of places. I'm going to explain exactly how it works, as my brain understands it.

The Partial Match Table



The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was the fact that I didn't quite fully grasp what the values in the partial match table really meant. I will now try to explain them in the simplest words possible.

Here's the partial match table for the pattern "abababca":


char: | a | b | a | b | a | b | c | a |
index:| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value:| 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |


If I have an eight-character pattern (let's say "abababca" for the duration of this example), my partial match table will have eight cells. If I'm looking at the eighth and last cell in the table, I'm interested in the entire pattern ("abababca"). If I'm looking at the seventh cell in the table, I'm only interested in the first seven characters in the pattern ("abababc"); the eighth one ("a") is irrelevant, and can go fall off a building or something. If I'm looking at the sixth cell of the in the table... you get the idea. Notice that I haven't talked about what each cell means yet, but just what it's referring to.

Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.

Proper prefix: All the characters in a string, with one or more cut off the end. "S", "Sn", "Sna", and "Snap" are all the proper prefixes of "Snape".

Proper suffix: All the characters in a string, with one or more cut off the beginning. "agrid", "grid", "rid", "id", and "d" are all proper suffixes of "Hagrid".

With this in mind, I can now give the one-sentence meaning of the values in the partial match table:

The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.

Let's examine what I mean by that. Say we're looking in the third cell. As you'll remember from above, this means we're only interested in the first three characters ("aba"). In "aba", there are two proper prefixes ("a" and "ab") and two proper suffixes ("a" and "ba"). The proper prefix "ab" does not match either of the two proper suffixes. However, the proper prefix "a" matches the proper suffix "a". Thus, the length of the longest proper prefix that matches a proper suffix, in this case, is 1.

Let's try it for cell four. Here, we're interested in the first four characters ("abab"). We have three proper prefixes ("a", "ab", and "aba") and three proper suffixes ("b", "ab", and "bab"). This time, "ab" is in both, and is two characters long, so cell four gets value 2.

Just because it's an interesting example, let's also try it for cell five, which concerns "ababa". We have four proper prefixes ("a", "ab", "aba", and "abab") and four proper suffixes ("a", "ba", "aba", and "baba"). Now, we have two matches: "a" and "aba" are both proper prefixes and proper suffixes. Since "aba" is longer than "a", it wins, and cell five gets value 3.

Let's skip ahead to cell seven (the second-to-last cell), which is concerned with the pattern "abababc". Even without enumerating all the proper prefixes and suffixes, it should be obvious that there aren't going to be any matches; all the suffixes will end with the letter "c", and none of the prefixes will. Since there are no matches, cell seven gets 0.

Finally, let's look at cell eight, which is concerned with the entire pattern ("abababca"). Since they both start and end with "a", we know the value will be at least 1. However, that's where it ends; at lengths two and up, all the suffixes contain a c, while only the last prefix ("abababc") does. This seven-character prefix does not match the seven-character suffix ("bababca"), so cell eight gets 1.

How to use the Partial Match Table



We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches. The formula works like this:

If a partial match of length partial_match_length is found and table[partial_match_length] > 1, we may skip ahead partial_match_length - table[partial_match_length - 1] characters.

Let's say we're matching the pattern "abababca" against the text "bacbababaabcbab". Here's our partial match table again for easy reference:


char: | a | b | a | b | a | b | c | a |
index:| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value:| 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |


The first time we get a partial match is here:


bacbababaabcbab
-|
-abababca


This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we don't get to skip ahead any. The next partial match we get is here:


bacbababaabcbab
----|||||
----abababca


This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4] or 5 - 3 or 2) characters:


// x denotes a skip

bacbababaabcbab
----xx|||
------abababca


This is a partial_match_length of 3. The value at table[partial_match_length - 1] (or table[2]) is 1. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 3 - table[2] or 3 - 1 or 2) characters:


// x denotes a skip

bacbababaabcbab
------xx|
--------abababca


At this point, our pattern is longer than the remaining characters in the text, so we know there's no match.

Conclusion



So there you have it. Like I promised before, it's no exhaustive explanation or formal proof of KMP; it's a walk through my brain, with the parts I found confusing spelled out in extreme detail. If you have any questions or notice something I messed up, please leave a comment; maybe we'll all learn something.

Saturday, December 12, 2009

Unit tests need to be fast

If you have a unit test suite, you need to be able to run it fast. Like, in under a minute (if you're on a huge system with tons of tests, then each individual component's tests should run this fast).

Many programmers develop a reflex of hitting Command+S (or Ctrl+S) every few lines of written code. This is possible because saving a file takes under a second. Ideally, programmers should develop a similar reflex with unit tests. Maybe not every few lines, but after every "chunk" of work a programmer produces produce (a new or altered function, a new instance variable in a class, etc.), he or she should be running the test suite, even if they know it's going to fail. This is a fantastic rhythm to get into; it provides instant concrete feedback, and a visible goal to strive for at all times.

Unfortunately, it's pretty much impossible if your unit tests take minutes to run. No programmer is going to start a five-minute test run if they know in advance it's going to fail, and without doing that, no programmer can really get into the full unit testing rhythm.

To put it in simple terms, unit tests are one of the most important tools in the "fail fast" toolbox, but if the unit tests themselves don't "fail fast," there's no way they can do their job properly