
what?! a session at oscon about generating microsoft office document formats?  has he gone mad?  that is what i wanted to know so i sat in on michael koziarski's session on this topic.

recognize the name?  you should if you are a rails developer...koz is part of the rails core team.  great guy, smart dude.  fun to listen to -- very dynamic.

anyhow, i went to this session because the description said that microsoft developers have been able to generate well-formatted office documents for a long time, so why not bring some love to the open source world.  i hopped in about five minutes late and missed the setup, but the gist of the problem statement, as i gathered it, was needing to generate documents (word, pdf, odf, xls, etc.) from data, in whatever format the user chooses.  oh yeah, and through rails.

michael walked us through the research he did, citing the trade-offs of each option:

    • ms office apis: only on windows, com, they aren't .net developers, complicated apis, no pdf support, design changes may be challenging
    • pdf-writer: expensive design experience, no .doc conversion support
    • rich text format: archaic meta information, no .doc support, expensive design changes
    • LaTeX: see rich text format, oh and no pdf support
    • HTML: conversion utilities were horrible
    • ODF: open standard, x-plat tools, simplified design lifecycle.

guess which one won -- odf.

michael discussed the two primary problems that had to be solved to use ODF: creation and conversion.  creation was the simpler of the two and mostly required understanding the structure of an ODF document (which is a zip file with manifests and content, much like the openxml docs from office 2007).  he noted that there is one folder called "Configurations2" and he has no idea what it does, but deleting it did not affect the doc :-).  although the ODF format is xml, it involves over 24 namespaces, which made creation a challenge (i.e., why didn't they just use Builder in Ruby?) even for simple elements (an image alone requires 4 different namespaces).
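to make the "it's just a zip file" point concrete, here is a minimal sketch in python (not michael's code, which was rails, and which i never saw).  it assumes the smallest package i believe a reader will open: an uncompressed `mimetype` entry first, a `META-INF/manifest.xml`, and a `content.xml` using only two of those namespaces.  the `build_odt` helper and the template names are made up for illustration, and a real document would also carry styles and metadata.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# Media type for an ODF text document; stored as the first zip entry.
MIMETYPE = "application/vnd.oasis.opendocument.text"

MANIFEST = """<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest
    xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
    manifest:version="1.2">
  <manifest:file-entry manifest:full-path="/"
      manifest:media-type="application/vnd.oasis.opendocument.text"/>
  <manifest:file-entry manifest:full-path="content.xml"
      manifest:media-type="text/xml"/>
</manifest:manifest>
"""

CONTENT_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    office:version="1.2">
  <office:body>
    <office:text>
{paragraphs}
    </office:text>
  </office:body>
</office:document-content>
"""


def build_odt(paragraphs):
    """Return the bytes of a bare-bones .odt holding one <text:p> per string.

    Hypothetical helper for illustration only.
    """
    body = "\n".join(
        f"      <text:p>{escape(p)}</text:p>" for p in paragraphs
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry goes first and is stored without compression.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/manifest.xml", MANIFEST)
        zf.writestr("content.xml", CONTENT_TEMPLATE.format(paragraphs=body))
    return buf.getvalue()


if __name__ == "__main__":
    with open("hello.odt", "wb") as fh:
        fh.write(build_odt(["hello from oscon", "generated, not typed"]))
```

even this toy version needs three namespace declarations for a couple of paragraphs, so it is easy to see how a real template with images and tables gets to 24+.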

after they perfected the creation mechanism, they needed conversion -- on the fly.  this proved to be surprisingly difficult.  why?  well, openoffice (the design tool) has no command line option for producing converted documents.  so, enter UNO.  it is a bridge to openoffice for various languages -- essentially a COM-like interface.  but then came michael's other problem...no ruby bindings.  argh.  but there was a python one.

michael pointed out that he could have taken the time to create/finish a ruby binding, but why?  he merely had to understand the python implementation and make it work.  there were issues with the python one, namely that it required X11...to be running...with a logged in user.  yikes.  because of this they had to do some re-architecting of the solution.  it sounds quite slick and echoed the software/services messaging.  basically the odf docs were initially stored in amazon s3.  then, using amazon sqs (the queuing service), another server would pull them out, do the conversion, and put them back in the desired format.  michael articulated one issue with this: s3 objects are replicated across multiple servers...so when they uploaded an object from new zealand to s3, the server in northern california might request something that hadn't replicated yet -- so they had to build in a retry mechanism as well.
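michael didn't show the retry code, so the following is only a sketch of the general idea in python, assuming a simple exponential backoff.  `NotYetVisible`, `FlakyStore` and `fetch_with_retry` are names i made up; `FlakyStore` is an in-memory stand-in for a store that hasn't replicated an object yet, not the s3 api.

```python
import time


class NotYetVisible(Exception):
    """Raised by the stand-in store when an object has not replicated yet."""


class FlakyStore:
    """In-memory stand-in: the first `misses` reads fail, later ones succeed."""

    def __init__(self, data, misses):
        self.data = data
        self.misses = misses
        self.calls = 0

    def get(self, key):
        self.calls += 1
        if self.calls <= self.misses:
            raise NotYetVisible(key)
        return self.data[key]


def fetch_with_retry(fetch, key, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch(key), retrying on NotYetVisible with exponential backoff.

    Waits base_delay, then 2x, 4x, ... between tries; re-raises after the
    final attempt so the caller can put the job back on the queue.
    """
    for attempt in range(attempts):
        try:
            return fetch(key)
        except NotYetVisible:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    store = FlakyStore({"report.odt": b"odf bytes"}, misses=2)
    print(fetch_with_retry(store.get, "report.odt", base_delay=0.01))
```

the conversion worker would wrap its "go get the odf doc" step in something like this after pulling a message off the queue, rather than failing the whole job on the first miss.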

at the end of the day, their system works.  admittedly michael has some reservations about scale, but it works for the business problem at hand.  i emailed michael afterwards and asked if he had considered using the open source odf converter for openxml to provide interop...not sure if that would help or not.  apparently openoffice is also producing an odftoolkit that has "conversion" in its diagram, which might be beneficial for odf development in the future.

that wrapped up day 1 for me today.  there are a few parties tonight, maybe i'll go, likely not.  stay tuned tomorrow for microsoft news at oscon as bill hilf is a part of the keynote...er i mean general session.