Wednesday, February 22, 2012

Monkeyman rising


During the last couple of months, I got so repulsed by having to use Blogger for blogging, that all blogging activity came to a grinding halt. I seriously searched for other solutions, including Tumblr, Jekyll and Middleman, but none of them seemed to cut it in the end.

Middleman actually came close. I love it. It's awesome. Until it's not. And then you need to be a Ruby programmer to truly understand what's going on. Having to learn Ruby just to write a blog just didn't feel right.

So then, on February 3, I tweeted this:


A friend a mine send me a link to Jekyll. Other than that, it stayed quiet. That's when I decided I had waited long enough. I had spent countless hours trying to twist Middleman's arm to do what I wanted to do, all in vain. Enough is enough.

So three days later, I went like:


In all honesty, it didn't do much yet. It generated something from the commandline. But it didn't have any of the fancy stuff yet. (LESS support, live preview and all of that.)

This morning I completed a version that is pretty usable, I'd say. I feel safe to start moving some of my existing sites over to a solution based on that version of Monkeyman. Behold: the Monkeyman:

Monday, December 19, 2011

DTD Resolution Be Gone!

I just read Cay Horstmanns post on The Sordid Tale of XML Catalogs. I could agree more: it's a mess. Not to so long ago, I had to write some software for pulling down XML content from a web site. I used Nathan Hamblan's Dispatch library and Scala's built-in support for traversing an XML document the easy way. And immediately noticed that parsing the document did take way more time than it should. It turns out that upon parsing the document, the parser pulled down a whole slew of DTDs and entity definitions. My first response was to find a way to bake a Catalog based resolver in somewhere. But then - I'm not particularly font of the current implementations. (Are they even maintained.) And since this was pure production, I had to make it working as quickly as possible. So this is what I did to get out of the mess: First of all, I defined a trait:
trait NoDtdResolution extends XMLLoader[Node] {

  override def parser = {
    val f = SAXParserFactory.newInstance()
    f.setNamespaceAware(false)
    f.setValidating(false)
    val result = f.newSAXParser()
    val reader = result.getXMLReader
    reader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
    result
  }

}
Normally, in Scala, you would load an XML document like this:
XML.load(...)
However, if you do it like that, you will get a parser that downloads the DTDs as well. And I don't want that. With the trait I created, I could load an XML document like this:
val loader = new NoBindingFactoryAdapter with NoDtdResolution
loader.load(...)
... without the DTDs being resolved. (I guess you could do something similar to install a catalog.) Which it great, but unfortunately, it doesn't make Dispatch aware of it. So, in dispatch, if you use the <> operator, it will still download the XML file and parse with DTD loading activated. So I needed another operator: one that uses a NoBindingFactoryAdapter without DTD resolution:
trait ImplicitXmlHandlers extends ImplicitHandlerVerbs {
  implicit def handlerToXmlHandlers(r: HandlerVerbs) = new XmlHandlers(r)
  implicit def requestToXmlHandlers(r: Request) = new XmlHandlers(r)
}

object XmlHandlers extends ImplicitXmlHandlers

class XmlHandlers(subject: HandlerVerbs) {

  private val tolerantAdapter = new NoBindingFactoryAdapter with NoDtdResolution

  def <|>[T](fn: NodeSeq => T) = subject >>~ { reader =>
    fn(tolerantAdapter.load(reader))
  }

}
With that in place, I can now use Dispatch without worrying about DTDs getting loaded:
Http(url(...) <|> {
  xml =>
    ...
})
... where I was using this before:
Http(url(...) <> {
  xml =>
    ...
})

Sunday, December 4, 2011

Generic Rotation in Scala

My son is preparing a talk on cryptography. I told him I would write some code for demo purposes, and we decided to start with a simple substitution cipher, one that - according to history - was used by Julius Ceasar. It's a really simple mechanism: you just rotate an alphabet and replace every character in your message with the character in the rotated alphabet.

I figured it would be really easy to do in Scala. I mean, rotation on a collection should be fairly easy to implement, right? Well, it is, but not if you want to write it in such a way that the output collection is of the same type as the input collection. I mean, you could define the input parameter as a Seq[T], but then - if you would pass in a String - you would not get a String as a result. So what do you do?

After some poking around, I eventually managed to write a rotation function that behaves as you would expect.

def rotate[A,C](coll: C, number: Int)(
  implicit c2i: C => Iterable[A], cbf: collection.generic.CanBuildFrom[C,A,C]
): Option[C] = 
  if (coll.hasDefiniteSize) {
    val positions = number % coll.size match {
      case i if i < 0 => i + coll.size
      case i => i
    }
    val builder = cbf()
    builder ++= coll.drop(positions)
    builder ++= coll.take(positions)
    Some(builder.result())
  } else None

Since rotation only works for collections of a definite size, and since not all collections have a definite size, it returns a None in case the collection does not have a definitive size. By default, it rotates to the left. By passing a negative number, you rotate right.

scala> rotate(List(1, 2, 3, 4), 2)
res15: Option[List[Int]] = Some(List(3, 4, 1, 2))

scala> rotate("abcdefghi", 2)    
res16: Option[java.lang.String] = Some(cdefghiab)

scala> rotate("abcdefghi", -2) // Rotate right
res17: Option[java.lang.String] = Some(hiabcdefg)

Sunday, October 9, 2011

Huffman coding done in Scala

Huffman coding is an entropy based encoding algorithm. Take an arbitrary language, and you will find that certain characters are used way more often than others. Huffman coding allows you to leverage that; it allows you to use less bits for characters that occur more often, and more bits for the exceptions.

Having Huffman encoding for Preon has been on my wish list for a while, but since all I do is Scala these days, I just couldn't resist trying to do it in Scala. And it works:

class DecodingException(depth: Int) extends Exception

class HuffmanCodec(in: Map[Char,Double]) {

  private val tree = {
    val list = { for ((char, weight) <- in) yield CharacterNode(char, weight) } toList
    def createTree(list: List[Node]): Node = list match {
      case top :: Nil => top
      case first :: second :: Nil => ChoiceNode(first.push(true), second.push(false))
      case multiple => 
        val first :: second :: others = multiple.sortBy(_.weight)
        createTree(ChoiceNode(first.push(true), second.push(false)) :: others)
    }
    createTree(list)
  }
  
  private val index = {
    tree.all.collect{ 
      case node: CharacterNode => (node.char, node.code)
    }.toMap
  }

  def decode(in: Seq[Boolean], length: Int, builder: StringBuilder = new StringBuilder): (String, Seq[Boolean]) = {
    if (length > 0) {
      val (char, remaining) = tree.decode(in)
      builder += char
      decode(remaining, length - 1, builder)
    } else {
      (builder.result, in)
    }
  }

  def encode(in: String): Seq[Boolean] = {
    in.flatMap(index(_))
  }
  
  private abstract sealed class Node {
    
    def decode(in: Seq[Boolean]): (Char, Seq[Boolean])

    def weight: Double
    
    def push(code: Boolean): Node
    
    def all: List[Node]
    
  }

  private case class CharacterNode(char: Char, weight: Double, code: List[Boolean] = Nil) extends Node {
    
    def decode(in: Seq[Boolean]): (Char, Seq[Boolean]) = (char, in)
      
    def push(code: Boolean) = CharacterNode(char, weight, code :: this.code)
    
    val all = List(this)
    
    override def toString = "CharacterNode(%s)".format(char)

  }

  private case class ChoiceNode(left: Node, right: Node, code: List[Boolean] = Nil) extends Node {
    
    val weight = left.weight + right.weight

    def push(code: Boolean) = ChoiceNode(left.push(code), right.push(code), code :: this.code)

    def decode(in: Seq[Boolean]): (Char, Seq[Boolean]) = in.headOption match {
      case Some(true) => left.decode(in.tail)
      case Some(false) => right.decode(in.tail)
      case None => throw new DecodingException(code.length)
    }

    val all = this :: (left.all ++ right.all)
    
    override def toString = "ChoiceNode(%s, %s)".format(left, right)

  }

}

With these definitions, creating a new instance of a Codec is as easy this:

scala> val codec = new HuffmanCodec(Map('a' -> 2.0, 'b' -> 3.0, 'c' -> 0.5, 'd' -> 0.7))
codec: HuffmanCodec = HuffmanCodec@52c54b3b

In the above case, the only characters we can encode are characters 'a' to 'd'. They are passed in as a map, with the character as the key and the weight as the value. (The weight says something about the relative presence of that particular character compared to all of the other characters. You could normalize everything to be fractions on a scale running from 0 to 1 that would all add up to 1, but as long as the numbers correctly reflect relative appearance, it's all good.)

The underlying tree that is getting constructed based on these arguments looks like this:

That means you can encode the 'b' character with just a single bit, and the a character with just two bits:

scala> codec.encode("b")
res3: Seq[Boolean] = Vector(false)

scala> codec.encode("a")
res4: Seq[Boolean] = Vector(true, false)

As you can see, the values are encoded as booleans for representing bit values. That's certainly not ideal, but the fastest way to get something working for now. (Ideally, bit should be slammed on top of some existing BitBuffer/BitStream abstractions.)

If you want to decode something, then in its current incarnation, you should pass in a the number of characters you expect to be encoded by the Sequence of Booleans you passed in.

scala> codec.decode(Seq(true, false), 1)
res8: (String, Seq[Boolean]) = (a,List())

Saturday, September 10, 2011

Automatic preview for reStructuredText

I have to admit, I'm a markup junkie. I do believe that with some other folks that markup picks up where punctuation left off. That's why I - when still at Sun - always fought management to be able to use DocBook instead of StarOffice. I like the WYGIWYM (What You Get Is What You Mean) paradigm; I'm not a huge fan of WYSIWYG - probably mostly because I almost never get what I want, and I don't want to get constantly reminded of that.

Now, I still like DocBook, in a way, but the last couple of years I started to see there is a point in having markup that is a little easier on the eyes. I like Markdown, but I have to say: I like reStructuredText even more, and that's probably - again - since I can be much more specific. (This is not only verbatim text, it's Scala source code.)

The other day, I stumbled across Marked. It looks like an excellent tool, and not all that expensive. I would have gotten it if only it would have offered support for reStructuredText as well - but it doesn't. Still, I couldn't stand not having the same kind of authoring experience when writing reStructuredText.

(If you don't know what Marked is: it's a tool that automatically updates a preview of the Markdown document you're authoring. It works with any editor. As soon as you hit 'Save', it will update the preview.)

Meet Rested, Marked for reStructuredText

Good news. There is a way to have the same convenience as what Marked is giving you, even if you're writing reStructuredText. Admittedly, there is a little more work you need to do, but once you get it done, the net effect is pretty much the same. (Or better, since now you can also write reStructuredText.)

Perhaps it's best to first check the video, and then read on.



What do you need?

First of all, you need to install a couple of gems:

sudo gem install guard
sudo gem install rb-fsevent
sudo gem install guard-shell
sudo gem install guard-livereload

Guard is the tool that is monitoring your filesystem for changes. It allows you to set up some rules for responding to changes in a DSL.

Once you installed these gems, next you need to add the rules for your DSL. First you need to get a basic Guardfile. In my case, I started typing a couple of commands, to get the basic file in place, but you might as well just write your own and ignore the commands.

guard init
guard init shell
guard init livereload

I made some changes, and in the end my file looked like this:

guard 'shell' do
  watch('(.*).rst') {|m| `rst2html --link-stylesheet --stylesheet-path=/Users/wilfred/workspace/rested/style.css #{m[0]} target/#{m[1]}.html` }
end

guard 'livereload' do
  watch(%r{(target/).+\.(css|js|html)})
end

So, what does this script do? Well, if you start guard, it will start monitoring the directory in which the file is placed for files with an .rst extension. In case they get modified, it will run the script for turning the file into an HTML file.

The other portion is even more interesting. It monitors the target directory for changes, and inform your browser of the changes. That is, you do need to install a Livereload plugin into your browser, and make sure it's activated for the page that you're looking at, but then for every change to the HTML it will reload the page.

Conclusion

I felt much better about using reStructuredText when I found out there is a way you can simulate Marked behavior with just a bunch of tools. With this in place, I got even more convinced that I don't need a Scrivener license any longer. Just a text editor and a couple of ruby tools, Emacs and a browser is all I need.

Thursday, August 18, 2011

Fighting filesystem entropy

I have never been good at organizing my files, with one exception: source project checkouts always go into a ~/workspace folder. That's about the only rule I managed to live by.

Unfortunately, the same also goes for managing my physical files. I tried many things. When all seemed lost, I read David Allen's GTD, which sparked a glimpse of hope. It seemed all you needed was a simple filing cabinet, a system and make sure you used a filing system without hanging file folders. I decided to live by the book (literally) and spent days finding non-hanging file folders and a cabinet for it. After searching for a couple of days, I concluded that these things don't exist in Western Europe, and settled for a hanging file folder archiving system.

However, I got the best archiving system I could find, tested the drawer carefully (since - according to David - it should be fun opening your archive) and got a label printer in order to start with a really, really tidy archive. When all was done, I set up some folders, and for a month it looked marvelous. But then entropy struck. On the plus side, I think I can now safely label my archive with one label that applies perfectly to its contents: miscellaneous.

So clearly I need to improve my archiving habits. And this time I will prevail! Why? I will tell you why.

Reasons why this time it will just work  

  1. Simple physical archive: One box for the physical paper. Whenever a letter comes in, I will eventually just dunk it into a box that is filled with - well - miscellaneous stuff. I don't ever want to look back at it. I should in fact never need it (more on that later), but I will worry about the mess when I really need to dig in. As you can see, I am already almost in that state, so this is something that seems doable. (As long as you don't set your own goals to high, everything is achievable.)
  2. Digitize: When stuff comes in, I will either digitize it or throw it away. 
  3. Automate: So, when now everything is digitized, I can automate it. I can do automation, me. 
Or will it?

Obviously, as with everything, there will be challenges. The first challenge is digitizing stuff. If that's taking a lot of work, then it will fail. I'm sure there is a law for it. If something is taking to much effort, you will give up eventually. But this is an easy problem, that you can solve with just throwing some money at it. I got myself the Fujitsu ScanSnap Pro: scanning is now simple, fast, and everything gets turned into a searchable PDF automatically. 

But there is another problem. If data is scanned, then what. I mean, it is no digitized, but I want more from it then simply having it on my disk somewhere. So what do I want?

What do I want?
  1. I want to be able to retrieve it quickly
  2. I want to have access to it forever, unless I delete it
  3. I want to be able to hide it from prying eyes of others
  4. I don't want to have a big honking folder called miscellaneous 
Retrieving it quickly

What are the different ways to retrieve something? 
  1. Search on a timeline (when did this happen?)
  2. Search on keyword
  3. Search on tag
  4. Search based on the file name (and location)
A couple of thoughts on what I think I am going to do. I think I will start relying on OpenMeta. It's an open standard for attaching tags to files on Mac OSX, with a pretty big list of tools supporting it. When a document is scanned, I want to set up the document pipeline in such a way that I need to add tags to it immediately, and make that all I need to do. For that, I could make the ScanSnap either call Tagger or Tagit. That makes sure my files are always getting stored with tags included. (For storing files from other applications, I could potentially use DefaultFolder X.


With tags included, I can now search files by tags, and that's nice, but still my file is getting stored in one location. (And remember, some tools don't know about these tags, and won't be allow me to quickly navigate there if all files are stored in a single location.)

I'm thinking about using Tag Folders for that. Tag Folders seems to be able to at least present a virtual file system layer over tagged files, in which a folder corresponds to a tag or (?) combinations of tags. I'm not sure if the file itself is actually ever physically moved. Perhaps it is. I don't know. (Let me know if you do.)



Access forever

There are different ways to accomplish this. I've heard people uploading documents to Evernote and then deleting it immediately. I'm just not familiar enough with Evernote to be able to judge if that's a good idea. An alternative approach would be to use DropBox. However, in both of these cases, your data is out there on the Internet in an unencrypted form. Evernote and DropBox would be able to read it. I don't expect that to happen just like that, but we have seen a breach of DropBox security in the passed year, so I just want to be a little careful. 

For now, I think I will just keep relying on Arq for storing data on S3. Yes, I know about the problems Amazon has been facing the last couple of months, and so I need to make sure I also do my own regular backups. (I tend to think of Arq as an Internet based RAID configuration. The redundant disk might fail, and your data might still be gone. However, if your own disk fails, there is a fair chance you will be able to get a recent copy.) And with Arq, your data will only be visible to you. (Unless somebody hacks Arq's encryption scheme. Again, fairly unlikely.)


It's not perfect though. Like, I will only be able to access my data if 1) I have access to computer that allows me to install Arq, and 2) I waited until all data got synchronized. I would rather also be able to read my data online. So perhaps I should copy (at least some of it) over to other Internet based solutions (Evernote, Google Docs, DropBox) as well. With the tags in place, I would expect that I at least should be able to set up some rules for it to do it automatically. 

Hiding

As I said, I'm using Arq to keep copies of the most important documents on the Internet. Arq encrypts your data, so that's all good. However, the data on my disk is still readable by anyone. That stinks. 

There are a number of things I'm contemplating, but haven't decided upon yet:
  1. Get Lion and use FileVault 2
  2. Use the original FileFault; don't have any experience, it might be good, it might be bad. I saw some comments people preferred Knox instead
  3. Use Knox
Option 1 seems like the most attractive option, for now. The disadvantage of using option 3 is that the tools that deal with tags probably don't know how to deal with the disk vaults defined by Knox. 

Preliminary conclusion

None of this works completely yet. It's all work in progress, but I'm getting the feeling this might get me somewhere. I will start working on it, and update this post (or post a new entry) if there is any progress. So far, I am pleasantly surprised by the potential solutions that are out there. Hopefully, in a few weeks, I will be perfectly organized. 

Wednesday, May 4, 2011

Scala FNumberClassifier

Just read Neal Ford's article on functional languages on IBM DeveloperWorks. Couldn't resist adding a Scala version, as a comment:

object ScalaFNumberClassifier {

  def isFactor(number: Int, potentialFactor: Int) = 
    number % potentialFactor == 0

  def factors(number: Int) = 
    (1 to number) filter (number % _ == 0)

  def sum(factors: Seq[Int]) =
    factors.foldLeft(0)(_ + _)

  def isPerfect(number: Int) = 
    sum(factors(number)) - number == number

  def isAbundant(number: Int) = 
    sum(factors(number)) - number > number

  def isDeficiend(number: Int) = 
    sum(factors(number)) - number < number

}

(Tried to stay as closely as possible to his Java version. Obviously, you don't need to implement sum yourself in Scala. It's already there.)