Code Kata 3

KataFour: Data Munging

Martin Fowler gave me a hard time for KataTwo, complaining that it was yet another single-function, academic exercise. Which, of course, it was. So this week let’s mix things up a bit.

Here’s an exercise in three parts to do with real-world data. Try hard not to read ahead; do each part in turn.

Part One: Weather Data

In weather.dat you’ll find daily weather data for Morristown, NJ for June 2002. Download this text file, then write a program to output the day number (column one) with the smallest temperature spread (the maximum temperature is the second column, the minimum the third column).

Part Two: Soccer League Table

The file football.dat contains the results from the English Premier League for 2001/2. The columns labeled ‘F’ and ‘A’ contain the total number of goals scored for and against each team in that season (so Arsenal scored 79 goals against opponents, and had 36 goals scored against them). Write a program to print the name of the team with the smallest difference in ‘for’ and ‘against’ goals.

Part Three: DRY Fusion

Take the two programs written previously and factor out as much common code as possible, leaving you with two smaller programs and some kind of shared functionality.

Kata Questions

· To what extent did the design decisions you made when writing the original programs make it easier or harder to factor out common code?

· Was the way you wrote the second program influenced by writing the first?

· Is factoring out as much common code as possible always a good thing? Did the readability of the programs suffer because of this requirement? How about the maintainability?

KataFive: Bloom Filters

There are many circumstances where we need to find out if something is a member of a set, and many algorithms for doing it. If the set is small, you can use bitmaps. When they get larger, hashes are a useful technique. But when the sets get big, we start bumping into limitations. Holding 250,000 words in memory for a spell checker might be too big an overhead if your target environment is a PDA or cell phone. Keeping a list of web pages visited might be extravagant when you get up to tens of millions of pages. Fortunately, there’s a technique that can help.

Bloom filters are a 30-year-old statistical way of testing for membership in a set. They greatly reduce the amount of storage you need to represent the set, but at a price: they’ll sometimes report that something is in the set when it isn’t (but they’ll never do the opposite; if the filter says that the set doesn’t contain your object, you know that it doesn’t). And the nice thing is you can control the accuracy; the more memory you’re prepared to give the algorithm, the fewer false positives you get. I once wrote a spell checker for a PDP-11 which stored a dictionary of 80,000 words in 16 kbytes, and I very rarely saw it let through an incorrect word. (Update: I must have misremembered these figures, because they are not in line with the theory. Unfortunately, I can no longer read the 8” floppies holding the source, so I can’t get the correct numbers. Let’s just say that I got a decent-sized dictionary, along with the spell checker, all in under 64k.)

Bloom filters are very simple. Take a big array of bits, initially all zero. Then take the things you want to look up (in our case we’ll use a dictionary of words). Produce ‘n’ independent hash values for each word. Each hash is a number which is used to set the corresponding bit in the array of bits. Sometimes there’ll be clashes, where the bit will already be set from some other word. This doesn’t matter.

To check to see if a new word is already in the dictionary, perform the same hashes on it that you used to load the bitmap. Then check to see if each of the bits corresponding to these hash values is set. If any bit is not set, then you never loaded that word in, and you can reject it.

The Bloom filter reports a false positive when a set of hashes for a word all end up corresponding to bits that were set previously by other words. In practice this doesn’t happen too often as long as the bitmap isn’t too heavily loaded with one-bits (clearly if every bit is one, then it’ll give a false positive on every lookup). There’s a discussion of the math in Bloom filters at www.cs.wisc.edu/~cao/papers/summary-cache/node8.html.
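To get a feel for how the load on the bitmap affects accuracy, here is a small sketch using the standard textbook approximation for the false-positive rate, (1 - e^(-kn/m))^k, for m bits, n stored words, and k hashes. The method name is made up for illustration, and the formula assumes independent, evenly distributed hashes. Plugging in the half-remembered PDP-11 figures (80,000 words in 16 kbytes) gives roughly 46 percent with a single hash, which suggests why those numbers cannot be right.

```ruby
# Standard approximation for the false-positive rate of a Bloom filter with
# `bits` bits, `items` stored entries, and `hashes` hash functions. It assumes
# the hashes are independent and uniformly distributed.
# (false_positive_rate is a hypothetical helper name, not from the article.)
def false_positive_rate(bits, items, hashes)
  (1 - Math.exp(-hashes.to_f * items / bits))**hashes
end

bits  = 16 * 1024 * 8   # 16 kbytes, the figure quoted in the text
items = 80_000          # 80,000 words, also from the text

# Print the estimated rate for one, two, and three hashes.
(1..3).each do |hashes|
  rate = false_positive_rate(bits, items, hashes)
  puts format("%d hash(es): %.3f", hashes, rate)
end
```

The same method is handy later in the kata for predicting what different bitmap sizes and hash counts ought to produce before measuring them.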

So, this kata is fairly straightforward. Implement a Bloom filter based spell checker. You’ll need some kind of bitmap, some hash functions, and a simple way of reading in the dictionary and then the words to check. For the hash function, remember that you can always use something that generates a fairly long hash (such as MD5) and then take your smaller hash values by extracting sequences of bits from the result. On a Unix box you can find a list of words in /usr/dict/words (or possibly in /usr/share/dict/words). For others, I’ve put a word list up at pragprog.com/katadata/wordlist.txt.
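If you would like a starting point, here is one minimal sketch of the idea, not a definitive implementation. It follows the MD5 suggestion above: a single 128-bit digest is cut into 32-bit slices, and each slice becomes one hash value. The class name, the default sizes, and the limit of four hashes (128 / 32) are all assumptions made for this example.

```ruby
require "digest"

# A minimal Bloom filter sketch. BloomFilter, bit_count, and hash_count are
# illustrative names; the defaults are arbitrary, not tuned values.
class BloomFilter
  def initialize(bit_count: 1 << 16, hash_count: 4)
    raise ArgumentError, "bit_count must be positive" unless bit_count.positive?
    # An MD5 digest is 128 bits, so it yields at most four 32-bit slices.
    raise ArgumentError, "hash_count must be 1..4" unless (1..4).cover?(hash_count)

    @bit_count = bit_count
    @hash_count = hash_count
    @bytes = "\0".b * ((bit_count + 7) / 8) # the bitmap, initially all zero
  end

  # Set every bit selected by the word's hashes.
  def add(word)
    positions(word).each do |i|
      @bytes.setbyte(i / 8, @bytes.getbyte(i / 8) | (1 << (i % 8)))
    end
    self
  end

  # True if every selected bit is set. This can be a false positive,
  # but a false result is always correct.
  def include?(word)
    positions(word).all? { |i| @bytes.getbyte(i / 8)[i % 8] == 1 }
  end

  private

  # Cut the hex digest into 8-digit (32-bit) slices, one per hash.
  def positions(word)
    digest = Digest::MD5.hexdigest(word)
    Array.new(@hash_count) { |n| digest[n * 8, 8].to_i(16) % @bit_count }
  end
end

filter = BloomFilter.new
%w[apple banana cherry].each { |word| filter.add(word) }
puts filter.include?("banana") # a word that was added is always reported present
```

Loading a real word list is then a matter of calling add for each line of the file. Supporting more than four hashes would need a longer digest or several digests per word.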

Play with using different numbers of hashes, and with different bitmap sizes.

Part two of the exercise is optional. Try generating random 5-character words and feeding them into your spell checker. For each word that it says is OK, look it up in the original dictionary. See how many false positives you get.

KataSix: Anagrams

Back to non-realistic coding this week (sorry, Martin). Let’s solve some crossword puzzle clues.

In England, I used to waste hour upon hour doing newspaper crosswords. As crossword fans will know, English cryptic crosswords have a totally different feel to their American counterparts: most clues involve punning or word play, and there are lots of anagrams to work through. For example, a recent Guardian crossword had:

  Down:
    ..
    21. Most unusual form of arrest (6)

The hint is the phrase ‘form of,’ indicating that we’re looking for an anagram. Sure enough ‘arrest’ has six letters, and can be arranged nicely into ‘rarest,’ meaning ‘most unusual.’ (Other anagrams include raster, raters, Sartre, and starer.)

A while back we had a thread on the Ruby mailing list about finding anagrams, and I’d like to resurrect it here. The challenge is fairly simple: given a file containing one word per line, print out all the combinations of words that are anagrams; each line in the output contains all the words from the input that are anagrams of each other. For example, your program might include in its output:

  kinship pinkish
  enlist inlets listen silent
  boaster boaters borates
  fresher refresh
  sinks skins
  knits stink
  rots sort
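One common way to produce output like this, sketched here under a few assumptions, is to give every word a signature made of its letters in sorted order and then group words that share a signature. The method name is hypothetical, and the treatment of case, apostrophes, and duplicate entries is a choice this sketch makes, so counts on a real word list may differ depending on how you decide those points.

```ruby
# Group words by their sorted letters and keep only groups with more than one
# member. anagram_sets is an illustrative name. Words are compared exactly as
# given (no case folding), which is an assumption of this sketch.
def anagram_sets(words)
  words.map(&:strip)
       .reject(&:empty?)
       .uniq
       .group_by { |word| word.chars.sort.join }
       .values
       .select { |set| set.size > 1 }
end

sample = %w[kinship pinkish enlist inlets listen silent rots sort lonely]
anagram_sets(sample).each { |set| puts set.join(" ") }
```

Because each word is sorted once and then looked up in a hash, the work grows roughly in line with the size of the word list rather than with the number of word pairs.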

If you run this on the word list here you should find 2,530 sets of anagrams (a total of 5,680 words). Running on a larger dictionary (about 234k words) on my OSX box produces 15,048 sets of anagrams (including all-time favorites such as actaeonidae/donatiaceae, crepitant/pittancer, and (for those readers in Florida) electoral/recollate).

For added programming pleasure, find the longest words that are anagrams, and find the set of anagrams containing the most words (so “parsley players replays sparely” would not win, having only four words in the set).

Kata Objectives

Apart from having some fun with words, this kata should make you think somewhat about algorithms. The simplest algorithms to find all the anagram combinations may take inordinate amounts of time to do the job. Working through alternatives should help bring the time down by orders of magnitude. To give you a possible point of comparison, I hacked a solution together in 25 lines of Ruby. It runs on the word list from my web site in 1.5s on a 1GHz PPC. It’s also an interesting exercise in testing: can you write unit tests to verify that your code is working correctly before setting it to work on the full dictionary?

KataSeven: How’d I Do?

The last couple of kata have been programming challenges; let’s move back into mushier, people-oriented stuff this week.

This kata is about reading code critically: our own code. Here’s the challenge. Find a piece of code you wrote last year sometime. It should be a decent-sized chunk, perhaps 500 to 1,000 lines long. Pick code which isn’t still fresh in your mind.

Now we need to do some acting. Read through this code three times. Each time through, pretend something different. Each time, jot down notes on the stuff you find.

· The first time through, pretend that the person who wrote this code is the best programmer you know. Look for all the examples of great code in the program.

· The second time through, pretend that the person who wrote this code is the worst programmer you know. Look for all the examples of horrible code and bad design.

· The third (and last) time through, pretend that you’ve been told that this code contains serious bugs, and that the client is going to sue you to bankruptcy unless you fix them. Look for every potential bug in the code.

Now look at the notes you made. What is the nature of the good stuff you found? Would you find similar good stuff in the code you’re writing today? What about the bad stuff; are similar pieces of code sneaking into your current code too? And finally, did you find any bugs in the old code? If so, are any of them things that you’d want to fix now that you’ve found them? Are any of them systematic errors that you might still be making today?

Moving Forward By Looking Back

Perhaps you’re not like me, but whenever I try this exercise I find things that pleasantly surprise me and things that make me cringe in embarrassment. I find the occasional serious bug (along with more frequent but less serious issues). So I try to make a point of looking back at my code fairly frequently.

However, doing this six months after you write code is not the best way of developing good software today. So the underlying challenge of this kata is this: how can we get into the habit of critically reviewing the code that we write, as we write it? And can we use the techniques of reading code with different expectations (good coder, bad coder, and bug hunt) when we’re reviewing our colleagues’ code?

KataEight: Conflicting Objectives

Why do we write code? At one level, we’re trying to solve some particular problem, to add some kind of value to the world. But often there are also secondary objectives: the code has to solve the problem, and it also has to be fast, or easy to maintain, or extend, or whatever. So let’s look at that.

For this kata, we’re going to write a program to solve a simple problem, and we’re going to write it with three different sub-objectives. Our program is going to process the dictionary we used in previous kata, this time looking for all six-letter words which are composed of two concatenated smaller words. For example:

  al + bums => albums
  bar + ely => barely
  be + foul => befoul
  con + vex => convex
  here + by => hereby
  jig + saw => jigsaw
  tail + or => tailor
  we + aver => weaver
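As a reference point for the comparisons that follow, here is a sketch that leans toward the readable end of the spectrum. It is one possible shape, not a model answer: the method name and the length keyword are invented for this example, and it assumes the word list has already been read into an array.

```ruby
require "set"

# For each word of the target length, try every split point and keep the
# splits where both halves are themselves words in the list.
# two_word_compounds is a hypothetical name used only for this sketch.
def two_word_compounds(words, length: 6)
  dictionary = words.to_set
  candidates = words.select { |word| word.length == length }

  candidates.flat_map do |word|
    (1...length).map do |split|
      head = word[0, split]
      tail = word[split..-1]
      [head, tail, word] if dictionary.include?(head) && dictionary.include?(tail)
    end.compact
  end
end

sample = %w[al bums albums jig saw jigsaw tail or tailor we aver weaver other]
two_word_compounds(sample).each do |head, tail, word|
  puts "#{head} + #{tail} => #{word}"
end
```

A word that can be split in more than one valid way is reported once per split, which may or may not be what you want.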

Write the program three times.

· The first time, make the program as readable as you can make it.

· The second time, optimize the program to run as fast as you can make it.

· The third time, write as extendible a program as you can.

Now look back at the three programs and think about how each of the three sub-objectives interacts with the others. For example, does making the program as fast as possible make it more or less readable? Does it make it easier to extend? Does making the program readable make it slower or faster, flexible or rigid? And does making it extendible make it more or less readable, slower or faster? Are any of these correlations stronger than others? What does this mean in terms of optimizations you may perform on the code you write?

KataNine: Back to the CheckOut

Back to the supermarket. This week, we’ll implement the code for a checkout system that handles pricing schemes such as “apples cost 50 cents, three apples cost $1.30.”

Way back in KataOne we thought about how to model the various options for supermarket pricing. We looked at things such as “three for a dollar,” “$1.99 per pound,” and “buy two, get one free.”

This week, let’s implement the code for a supermarket checkout that calculates the total price of a number of items. In a normal supermarket, things are identified using Stock Keeping Units, or SKUs. In our store, we’ll use individual letters of the alphabet (A, B, C, and so on). Our goods are priced individually. In addition, some items are multipriced: buy n of them, and they’ll cost you y cents. For example, item ‘A’ might cost 50 cents individually, but this week we have a special offer: buy three ‘A’s and they’ll cost you $1.30. In fact this week’s prices are:

  Item   Unit      Special
         Price     Price
  --------------------------
    A     50       3 for 130
    B     30       2 for 45
    C     20
    D     15

Our checkout accepts items in any order, so that if we scan a B, an A, and another B, we’ll recognize the two B’s and price them at 45 (for a total price so far of 95). Because the pricing changes frequently, we need to be able to pass in a set of pricing rules each time we start handling a checkout transaction.

The interface to the checkout should look like:

   co = CheckOut.new(pricing_rules)
   co.scan(item)
   co.scan(item)
       :    :
   price = co.total

Here’s a set of unit tests for a Ruby implementation. The helper method price lets you specify a sequence of items using a string, calling the checkout’s scan method on each item in turn before finally returning the total price.

  class TestPrice < Test::Unit::TestCase
    def price(goods)
      co = CheckOut.new(RULES)
      goods.split(//).each { |item| co.scan(item) }
      co.total
    end
    def test_totals
      assert_equal(  0, price(""))
      assert_equal( 50, price("A"))
      assert_equal( 80, price("AB"))
      assert_equal(115, price("CDBA"))
      assert_equal(100, price("AA"))
      assert_equal(130, price("AAA"))
      assert_equal(180, price("AAAA"))
      assert_equal(230, price("AAAAA"))
      assert_equal(260, price("AAAAAA"))
      assert_equal(160, price("AAAB"))
      assert_equal(175, price("AAABB"))
      assert_equal(190, price("AAABBD"))
      assert_equal(190, price("DABABA"))
    end
    def test_incremental
      co = CheckOut.new(RULES)
      assert_equal(  0, co.total)
      co.scan("A");  assert_equal( 50, co.total)
      co.scan("B");  assert_equal( 80, co.total)
      co.scan("A");  assert_equal(130, co.total)
      co.scan("A");  assert_equal(160, co.total)
      co.scan("B");  assert_equal(175, co.total)
    end
  end

There are lots of ways of implementing this kind of algorithm; if you have time, experiment with several.

Objectives of the Kata

To some extent, this is just a fun little problem. But underneath the covers, it’s a stealth exercise in decoupling. The challenge description doesn’t mention the format of the pricing rules. How can these be specified in such a way that the checkout doesn’t know about particular items and their pricing strategies? How can we make the design flexible enough so that we can add new styles of pricing rule in the future?
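To make those questions concrete, here is one possible answer among many, offered as a sketch and not as the intended design. It assumes a rule format the kata never specifies: each item maps to a callable that turns a quantity into a price in cents. The unit_price and multi_price helpers are invented for this example; only CheckOut, scan, total, and RULES come from the interface and tests above.

```ruby
# Hypothetical rule builders: each returns a lambda from quantity to cents.
unit_price  = ->(price) { ->(qty) { qty * price } }
multi_price = lambda do |price, group_size, group_price|
  ->(qty) { (qty / group_size) * group_price + (qty % group_size) * price }
end

# This week's prices from the table, expressed in the assumed format.
RULES = {
  "A" => multi_price.call(50, 3, 130),
  "B" => multi_price.call(30, 2, 45),
  "C" => unit_price.call(20),
  "D" => unit_price.call(15)
}.freeze

class CheckOut
  def initialize(pricing_rules)
    @rules = pricing_rules
    @counts = Hash.new(0) # item => quantity scanned so far
  end

  def scan(item)
    raise ArgumentError, "no pricing rule for #{item.inspect}" unless @rules.key?(item)

    @counts[item] += 1
    self
  end

  # The checkout only counts items; each rule decides what a quantity costs.
  def total
    @counts.sum { |item, qty| @rules.fetch(item).call(qty) }
  end
end

co = CheckOut.new(RULES)
"DABABA".each_char { |item| co.scan(item) }
puts co.total
```

Because the checkout only calls each rule with a quantity, a new style of per-item offer becomes a new lambda with no change to CheckOut. Rules that span several items, such as a discount on a combination, would need a richer interface than this one.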
