# Norvig's spell correct in Haskell

Peter Norvig has an old post about writing a spell corrector, and I’ve always loved its succinctness. I was on a plane for a few hours and wanted to see how it translated from Python to Haskell. This is the tale of a journey I did not intend to take.

## Keeping the same structure

The first step was going through and recreating the main functionality, which is mostly lists and sets, so no big deal.

## Splitting

We split the raw text into words:

```haskell
import qualified Data.Char as Char
import qualified Data.Text as T

words :: Text -> [Text]
words = T.split (not . Char.isAsciiLower) . T.toLower
```

This code is the intuitive answer to the problem above; however, it’s very slow. We’ll look at performance later in this post.

## Counter / Bag

In Python, the `Counter` is implemented as a multiset / “bag”. We’ll create our own with a `Map` from `Text` (word) to `Int` (count):

```haskell
import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map)
import qualified Data.Text.IO as TIO

type Counter = Map Text Int

toCounter :: [Text] -> Counter
toCounter = Map.fromListWith (+) . fmap (,1)

counter <- toCounter . words <$> TIO.readFile "big.txt"
```

There is also a package called `multiset` but I didn’t know this because the wifi on the plane didn’t work.
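Had I known about it, a minimal sketch on top of it might look like this (assuming the `multiset` package’s `Data.MultiSet` module with `fromList` and `occur`; I haven’t benchmarked it):

```haskell
import qualified Data.MultiSet as MS
import qualified Data.Text as T
import Data.Text (Text)

-- Build the bag straight from the word list; `occur` looks up a word's
-- count, returning 0 for words that aren't in the bag.
toBag :: [Text] -> MS.MultiSet Text
toBag = MS.fromList

count :: Text -> MS.MultiSet Text -> Int
count = MS.occur
```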

## Probability and Correction

In order to guess the right correction we need probabilities based on the corpus’s word counts:

```haskell
import Data.Foldable (maximumBy)
import Data.Maybe (fromMaybe)
import Data.Ord (comparing)

prob :: Counter -> Text -> Double
prob counter word = fromIntegral occurrences / fromIntegral totalWords
  where
    occurrences = fromMaybe 0 $ Map.lookup word counter
    totalWords  = sum (Map.elems counter)

correction :: Counter -> Text -> Text
correction counter word = maximumBy (comparing p) $ candidates counter word
  where p = prob counter
```
```haskell
import Data.Maybe (fromMaybe, listToMaybe)
import qualified Data.Set as Set
import Data.Set (Set)

candidates :: Counter -> Text -> Set Text
candidates counter word = detect
  [ known counter (Set.singleton word)
  , known counter (Set.fromList (edits1 word))
  , known counter (Set.fromList (edits2 word))
  , Set.singleton word
  ]

detect :: [Set Text] -> Set Text
detect = fromMaybe Set.empty . listToMaybe . filter (not . Set.null)

known :: Counter -> Set Text -> Set Text
known counter = Set.filter (`Map.member` counter)
```

## Edits / Permutations

I initially squished all the logic into single list comprehensions, but you’ll see I’ve split the heavier functions out.

```haskell
edits1 :: Text -> [Text]
edits1 w = nub' $ mconcat [transposes', deletes', replaces', inserts]
  where
    alphabet    = fmap T.singleton ['a'..'z']
    splits      = zip (T.inits w) (T.tails w)
    deletes'    = deletes splits
    transposes' = transposes splits
    replaces'   = replaces splits
    inserts     = [l <> c <> r | (l,r) <- splits, c <- alphabet]
```

The `splits` gets its own type for cleanliness:

```haskell
type Splits = [(Text, Text)]
```

Instead of `if R` or `if len(R)<1` and the like from the Python version, I used a `guard` to skip over splits whose contents fail a certain criterion (e.g. skipping `(l, r)` where `r` is empty).

```haskell
import Control.Applicative (Alternative)
import Control.Monad (guard)

unSplit :: (Monad f, Alternative f) => (Text, Text) -> f (Text, Text)
unSplit = unSplitWith (/= "")

unSplitWith :: (Monad f, Alternative f) => (Text -> Bool) -> (Text, Text) -> f (Text, Text)
unSplitWith f (l, r) = guard (f r) >> pure (l, r)

-- | swap the 1st and 2nd letters across our list of splits ("derp" -> "edrp")
transposes :: Splits -> [Text]
transposes splits =
  [l <> swap' r | x <- splits, (l,r) <- unSplitWith (\a -> T.length a > 1) x]
  where
    swap' w = T.concat [two, one', rest]
      where
        two  = T.take 1 $ T.drop 1 w
        one' = T.take 1 w
        rest = T.drop 2 w

-- | remove a letter across all splits: "derp" -> ["erp","drp","dep","der"]
deletes :: Splits -> [Text]
deletes splits =
  [l <> T.tail r | x <- splits, (l,r) <- unSplit x]

-- | try replacing a letter with one from the alphabet in each spot. This one is very large
replaces :: Splits -> [Text]
replaces splits = [l <> c <> T.tail r | x <- splits, (l,r) <- unSplit x, c <- alphabet]
  where alphabet = fmap T.singleton ['a'..'z']
```

I think this comes out reasonably concise.

```haskell
edits2 :: Text -> [Text]
edits2 w = nub' [ e2 | e1 <- edits1 w, e2 <- edits1 e1 ]

nub' :: [Text] -> [Text]
nub' = Set.toList . Set.fromList
```

## Performance

The performance of the implementation I came to is… really bad. Guessing even short words took ~4 seconds. This was unacceptable considering the Python version is nearly instant.

After asking around on IRC and Slack, two main problems were pointed out:

• The `words` function was extremely inefficient (thanks to @mwutton for pointing this out)
• The `Map` and `Set` in the `containers` package are not optimized for this sort of bagging (thanks to @yaron)

In order to speed up the `words` implementation, we just shove the logic into `Data.Text`’s implementation (which is nasty). This buys us ~1 second off the ridiculous 4 seconds. So, I went further.
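For illustration, one way that kind of speedup might look (a sketch, not necessarily the exact version I ended up with): map every non-letter to a space and let `T.words` do the splitting, which also avoids producing empty chunks the way `T.split` does.

```haskell
import qualified Data.Char as Char
import qualified Data.Text as T
import Data.Text (Text)

-- Replace anything that isn't a lowercase ASCII letter with a space,
-- then split on whitespace; T.words never yields empty strings.
words' :: Text -> [Text]
words' = T.words . T.map keepLetter . T.toLower
  where keepLetter c = if Char.isAsciiLower c then c else ' '
```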

Since I wasn’t using any order-specific functions on the Sets or Maps, I just replaced the `containers` dependency with `unordered-containers` and changed the import statements to use them. Bam! This nearly halved the time! But it’s still really bad at ~1 second.
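The swap is mostly mechanical; a minimal sketch of the counter on top of `Data.HashMap.Strict` (the names mirror the earlier `containers` version):

```haskell
{-# LANGUAGE TupleSections #-}
import qualified Data.HashMap.Strict as HM
import qualified Data.Text as T
import Data.Text (Text)

type Counter = HM.HashMap Text Int

-- Same shape as before: fromListWith (+) over (word, 1) pairs,
-- but backed by a hash map instead of an ordered tree.
toCounter :: [Text] -> Counter
toCounter = HM.fromListWith (+) . fmap (,1)
```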

I used the `profiteur` tool to visualize the performance issues a bit while going through this process, which basically confirmed what we already knew: the Set/Map operations and `words` were awful. It seems as though Python’s `Counter` shouldn’t be all that different from ours (a hash map of counts), but the Haskell version lags behind. I kept the code as intuitive as I knew how, and it wasn’t quite enough for this type of problem.

## Lessons

1. Always use `unordered-containers` unless you for some reason need to keep the ordering of your data structures.
2. Sometimes pre-processing is worth the effort. You can try as hard as you want to optimize the function, but at some point you have to call it a loss.

I’d welcome any comments about how this could be improved further. The result was not encouraging, but despite this I did learn some things along the way.

The full code is here.