Thursday, August 18, 2011

Fighting filesystem entropy

I have never been good at organizing my files, with one exception: source project checkouts always go into a ~/workspace folder. That's about the only rule I managed to live by.

Unfortunately, the same also goes for managing my physical files. I tried many things. When all seemed lost, I read David Allen's GTD, which sparked a glimmer of hope. It seemed all you needed was a simple filing cabinet and a system, as long as you made sure to use a filing system without hanging file folders. I decided to live by the book (literally) and went looking for non-hanging file folders and a cabinet to hold them. After searching for a couple of days, I concluded that these things don't exist in Western Europe, and settled for a hanging file folder archiving system.

However, I got the best archiving system I could find, tested the drawer carefully (since - according to David - it should be fun opening your archive) and got a label printer in order to start with a really, really tidy archive. When all was done, I set up some folders, and for a month it looked marvelous. But then entropy struck. On the plus side, I think I can now safely label my archive with one label that applies perfectly to its contents: miscellaneous.

So clearly I need to improve my archiving habits. And this time I will prevail! Why? I will tell you why.

Reasons why this time it will just work  

  1. Simple physical archive: One box for the physical paper. Whenever a letter comes in, I will eventually just dunk it into a box that is filled with - well - miscellaneous stuff. I don't ever want to look back at it. I should in fact never need it (more on that later), but I will worry about the mess when I really need to dig in. As you can see, I am already almost in that state, so this is something that seems doable. (As long as you don't set your own goals too high, everything is achievable.)
  2. Digitize: When stuff comes in, I will either digitize it or throw it away. 
  3. Automate: So, once everything is digitized, I can automate it. I can do automation, me.

Or will it?

Obviously, as with everything, there will be challenges. The first challenge is digitizing stuff. If that takes a lot of work, then it will fail. I'm sure there is a law for it: if something takes too much effort, you will give up eventually. But this is an easy problem that you can solve by just throwing some money at it. I got myself the Fujitsu ScanSnap Pro: scanning is now simple, fast, and everything gets turned into a searchable PDF automatically.

But there is another problem. Once data is scanned, then what? I mean, it is now digitized, but I want more from it than simply having it on my disk somewhere. So what do I want?

What do I want?

  1. I want to be able to retrieve it quickly
  2. I want to have access to it forever, unless I delete it
  3. I want to be able to hide it from prying eyes of others
  4. I don't want to have a big honking folder called miscellaneous

Retrieving it quickly

What are the different ways to retrieve something? 
  1. Search on a timeline (when did this happen?)
  2. Search on keyword
  3. Search on tag
  4. Search based on the file name (and location)

A couple of thoughts on what I think I am going to do. I think I will start relying on OpenMeta. It's an open standard for attaching tags to files on Mac OS X, with a pretty big list of tools supporting it. When a document is scanned, I want the document pipeline set up in such a way that I have to add tags to it immediately, and that this is all I need to do. For that, I could make the ScanSnap call either Tagger or Tagit. That makes sure my files always get stored with tags included. (For storing files from other applications, I could potentially use Default Folder X.)
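
To make the retrieval routes listed above a bit more concrete, here is a minimal sketch in Python. It does not read OpenMeta tags at all: the `ScannedDoc` record, the `search` function and the sample file names are hypothetical, and the tags are simply typed in by hand. Full-text keyword search inside the PDF is left out; the keyword here only matches the file name and location.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScannedDoc:
    # Hypothetical record for one scanned PDF; not an OpenMeta structure.
    path: str
    tags: frozenset
    scanned_on: date


def search(docs, tags=(), keyword="", since=None, until=None):
    """Return docs that carry every given tag, match the keyword in their
    path, and fall inside the optional date window."""
    wanted = set(tags)
    hits = []
    for doc in docs:
        if not wanted <= doc.tags:
            continue  # at least one requested tag is missing
        if keyword.lower() not in doc.path.lower():
            continue  # keyword not in file name or location
        if since is not None and doc.scanned_on < since:
            continue
        if until is not None and doc.scanned_on > until:
            continue
        hits.append(doc)
    # Oldest first, so the result reads like a timeline.
    return sorted(hits, key=lambda d: d.scanned_on)


# Made-up sample data for illustration only.
docs = [
    ScannedDoc("scans/2011-08-02-energy-bill.pdf", frozenset({"bill", "energy"}), date(2011, 8, 2)),
    ScannedDoc("scans/2011-07-15-tax-letter.pdf", frozenset({"tax"}), date(2011, 7, 15)),
    ScannedDoc("scans/2011-06-30-water-bill.pdf", frozenset({"bill", "water"}), date(2011, 6, 30)),
]
```

Calling `search(docs, tags=["bill"])` would return the two bills in date order, and `search(docs, since=date(2011, 7, 1), until=date(2011, 7, 31))` narrows things down to July. The point is only that one small record per file is enough to answer all four kinds of question.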

With tags included, I can now search files by tags, and that's nice, but still my file is getting stored in one location. (And remember, some tools don't know about these tags, and won't allow me to quickly navigate to the right file if all files are stored in a single location.)

I'm thinking about using Tag Folders for that. Tag Folders seems to be able to at least present a virtual file system layer over tagged files, in which a folder corresponds to a tag or (?) combinations of tags. I'm not sure if the file itself is actually ever physically moved. Perhaps it is. I don't know. (Let me know if you do.)
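
I don't know how Tag Folders works internally, so the following is only a sketch of one way such a virtual layer could be built: a folder per tag, filled with symbolic links, so the file itself never moves. The function name `build_tag_folders`, the `by-tag` folder and the sample file are all hypothetical, and the tags are passed in as a plain dictionary rather than read from OpenMeta.

```python
import os
import tempfile
from pathlib import Path


def build_tag_folders(tagged_files, view_root):
    """Create view_root/<tag>/<file name> symlinks; the originals stay put."""
    view_root = Path(view_root)
    for source, tags in tagged_files.items():
        source = Path(source).resolve()
        for tag in tags:
            folder = view_root / tag
            folder.mkdir(parents=True, exist_ok=True)
            link = folder / source.name
            if link.is_symlink() or link.exists():
                link.unlink()  # refresh a stale link from an earlier run
            os.symlink(source, link)


# Demo in a throwaway directory so no real archive is touched.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive"
    archive.mkdir()
    bill = archive / "energy-bill.pdf"
    bill.write_bytes(b"dummy content")
    view = Path(tmp) / "by-tag"
    build_tag_folders({bill: ["bill", "energy"]}, view)
    linked = sorted(p.relative_to(tmp).as_posix() for p in view.rglob("*.pdf"))
    original_still_there = bill.exists()
```

Tools that know nothing about tags can then browse `by-tag/bill` like any other folder. This sketch ignores two things a real tool would have to handle: two files with the same name under one tag, and combinations of tags.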

Access forever

There are different ways to accomplish this. I've heard of people uploading documents to Evernote and then deleting them immediately. I'm just not familiar enough with Evernote to be able to judge if that's a good idea. An alternative approach would be to use DropBox. However, in both of these cases, your data is out there on the Internet in an unencrypted form. Evernote and DropBox would be able to read it. I don't expect that to happen just like that, but we have seen a breach of DropBox security in the past year, so I just want to be a little careful.

For now, I think I will just keep relying on Arq for storing data on S3. Yes, I know about the problems Amazon has been facing over the last couple of months, and so I need to make sure I also do my own regular backups. (I tend to think of Arq as an Internet-based RAID configuration. The redundant disk might fail, and your data might still be gone. However, if your own disk fails, there is a fair chance you will be able to get a recent copy.) And with Arq, your data will only be visible to you. (Unless somebody hacks Arq's encryption scheme. Again, fairly unlikely.)

It's not perfect though. Like, I will only be able to access my data if 1) I have access to a computer that allows me to install Arq, and 2) I have waited until all data has been synchronized. I would rather also be able to read my data online. So perhaps I should copy (at least some of) it over to other Internet-based solutions (Evernote, Google Docs, DropBox) as well. With the tags in place, I would expect that I should at least be able to set up some rules to do that automatically.
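
What such tag-based rules might look like, as a minimal sketch: each rule names the tags a document must carry and the service it should be copied to. The `Rule` class, the `destinations_for` function, the tag names and the destination labels are all made up for illustration, and nothing here actually uploads anything; it only decides where a document would go.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    required_tags: frozenset
    destination: str


# Hypothetical rules; tags and destination labels are illustrative only.
RULES = [
    Rule(frozenset({"manual"}), "dropbox"),
    Rule(frozenset({"receipt", "travel"}), "evernote"),
    Rule(frozenset({"shared"}), "google-docs"),
]


def destinations_for(tags, rules=RULES):
    """Return every destination whose required tags are all present,
    in rule order and without duplicates."""
    present = frozenset(tags)
    matched = []
    for rule in rules:
        if rule.required_tags <= present and rule.destination not in matched:
            matched.append(rule.destination)
    return matched
```

A document tagged `manual` and `printer` would be copied to the `dropbox` destination only, while one tagged just `receipt` matches no rule and stays where it is. Anything without a matching rule is never sent anywhere, which seems like the safer default for scanned paperwork.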


Hiding it from prying eyes

As I said, I'm using Arq to keep copies of the most important documents on the Internet. Arq encrypts your data, so that's all good. However, the data on my disk is still readable by anyone. That stinks.

There are a number of things I'm contemplating, but haven't decided upon yet:
  1. Get Lion and use FileVault 2
  2. Use the original FileVault; I don't have any experience with it, so it might be good, it might be bad. I saw some comments saying people preferred Knox instead
  3. Use Knox

Option 1 seems like the most attractive option, for now. The disadvantage of using option 3 is that the tools that deal with tags probably don't know how to deal with the disk vaults defined by Knox.

Preliminary conclusion

None of this works completely yet. It's all work in progress, but I'm getting the feeling this might get me somewhere. I will start working on it, and update this post (or post a new entry) if there is any progress. So far, I am pleasantly surprised by the potential solutions that are out there. Hopefully, in a few weeks, I will be perfectly organized. 


Wilfred Springer said...

Remember I said I wanted to have the ability to file specific types of documents to specific locations? I started using Hazel for it now. Workflow is now ScanSnap -> Tagger -> Hazel. More on that later.