Automating document conversion in Linux using JODConverter/OOo

Problem: Need to automatically convert an existing Microsoft Word document to a PDF on the fly.

My solution uses OpenOffice.org and JODConverter, calling the JODConverter webservice from PHP. I searched for quite a while for other ways to do this, and they were either nowhere near as easy or produced PDFs that didn’t look half as good. Of course, you can convert a lot more than just DOC -> PDF. At last glance, the JODConverter homepage listed the following possible conversions:

* DOC to PDF, DOC to ODT, DOC to RTF
* XLS to PDF, XLS to ODS, XLS to CSV
* PPT to PDF, PPT to ODP, PPT to SWF
* ODT to PDF, ODT to DOC, ODT to RTF
* ODS to PDF, ODS to XLS, ODS to CSV
* ODP to PDF, ODP to PPT, ODP to SWF

In any case, let’s see how to do this in 5 steps:

1. Remove all traces of whatever OpenOffice.org installations you may already have on your system. In my case, this was as simple as

[vic@ares:~$] sudo yum remove openoffice.org*

2. Download and install the newest version of OOo from http://openoffice.org

[vic@ares:~$] wget http://spout.ussg.indiana.edu/openoffice/stable/2.4.0/OOo_2.4.0_LinuxIntel_install_wJRE_en-US.tar.gz
[...]
[vic@ares:~$] tar zxf OOo_2.4.0_LinuxIntel_install_wJRE_en-US.tar.gz
[vic@ares:~$] ls
[...]
[vic@ares:~$] cd OOH680_m12_native_packed-1_en-US.9286/
[vic@ares:~/OOH680_m12_native_packed-1_en-US.9286$] sudo ./setup
Checksumming...
Extracting ...
93128 blocks
Done.
Using /var/tmp/install_21191/usr/java/jre1.6.0_04/bin/java
java version "1.6.0_04"
Java(TM) SE Runtime Environment (build 1.6.0_04-b12)
Java HotSpot(TM) Server VM (build 10.0-b19, mixed mode)

Running installer
/var/tmp/install_21191/usr/java/jre1.6.0_04/bin/java -DHOME=/home/vic -DJRE_FILE=jre-6u4-linux-i586.rpm -jar JavaSetup.jar
System locale: en_US
Root privileges
OS: Linux
Mode: installation

If you have an X display available at this point, the installer will open up some fancy windows and walk you through the installation. If you do a custom installation, make sure that ‘Headless application support’ is enabled.

3. Write a short chkconfig script so the headless server can automatically start at boot time.

#!/bin/bash
# openoffice.org  headless server script
#
# chkconfig: 2345 80 30
# description: headless openoffice server script
# processname: openoffice

OOo_HOME=/opt/openoffice.org2.4
SOFFICE_PATH=$OOo_HOME/program/soffice
PIDFILE=$OOo_HOME/openoffice-server.pid

case "$1" in
    start)
    if [ -f $PIDFILE ]; then
      echo "OpenOffice headless server has already started."
      exit
    fi
      echo "Starting OpenOffice headless server"
      $SOFFICE_PATH -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard > /dev/null 2>&1 &
      touch $PIDFILE
    ;;
    stop)
    if [ -f $PIDFILE ]; then
      echo "Stopping OpenOffice headless server."
      killall -9 soffice; killall -9 soffice.bin
      rm -f $PIDFILE
      exit
    fi
      echo "Openoffice headless server is not running, foo."
      exit
    ;;
    *)
    echo "Usage: $0 {start|stop}"
    exit 1
esac
exit 0

Save it as /etc/init.d/openoffice. Now we need to add it to chkconfig.

[vic@ares:/etc/init.d$] sudo chkconfig --add openoffice
[vic@ares:/etc/init.d$] sudo chkconfig --list openoffice
openoffice      0:off   1:off   2:on    3:on    4:on    5:on    6:off
[vic@ares:/etc/init.d$] sudo service openoffice start
Starting OpenOffice headless server
[vic@ares:/etc/init.d$]

Feel free to grep through ps aux to confirm that the soffice process did indeed start.
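
A minimal sketch of that check. The helper name proc_running is made up for illustration; the [s] in the example pattern is the usual trick to keep grep from matching its own entry in the process list.

```shell
# proc_running is a hypothetical helper, not part of OpenOffice.org.
# It succeeds if any line of `ps aux` output matches the given grep pattern.
proc_running() {
  ps aux | grep -q "$1"
}

# Example:
#   proc_running '[s]office' && echo "soffice is running"
```

To confirm the listener as well, `netstat -tln` should show a socket on the port given in the -accept option, assuming net-tools is installed.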

4. Read ‘Usage as a Web Application’ on Art of Solving, download jodconverter-tomcat-2.2.1.zip, and extract it somewhere. Make sure that either JAVA_HOME or JRE_HOME is set, then run jodconverter/bin/startup.sh. If you really feel like it, you can create another chkconfig script for this. Since I’m lazy, I just added it to my previous openoffice headless server script like so:

#!/bin/bash
# openoffice.org  headless server script
#
# chkconfig: 2345 80 30
# description: headless openoffice server script
# processname: openoffice
#
# Author: Vic Vijayakumar
#

OOo_HOME=/opt/openoffice.org2.4
SOFFICE_PATH=$OOo_HOME/program/soffice
PIDFILE=$OOo_HOME/openoffice-server.pid
JOD_HOME=/wwwroot/apps/jodconverter

case "$1" in
    start)
    if [ -f $PIDFILE ]; then
      echo "OpenOffice headless server has already started."
      exit
    fi
      echo "Starting OpenOffice headless server"
      $SOFFICE_PATH -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard > /dev/null 2>&1 &
      $JOD_HOME/bin/startup.sh > /dev/null 2>&1 &
      touch $PIDFILE
    ;;
    stop)
    if [ -f $PIDFILE ]; then
      echo "Stopping OpenOffice headless server."
      killall -9 soffice; killall -9 soffice.bin
      $JOD_HOME/bin/catalina.sh stop > /dev/null 2>&1
      rm -f $PIDFILE
      exit
    fi
      echo "Openoffice headless server is not running, foo."
      exit
    ;;
    *)
    echo "Usage: $0 {start|stop}"
    exit 1
esac
exit 0
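
Before writing any PHP, it can help to confirm from the shell that the webservice answers. The sketch below assumes curl is installed and that the webapp listens at http://localhost:8080/converter/service, the same URL used in the PHP code in step 5. The names convert_file and JOD_URL are made up here and are not part of JODConverter.

```shell
# Assumed endpoint; adjust if Tomcat runs on another host or port.
JOD_URL=${JOD_URL:-http://localhost:8080/converter/service}

# convert_file INPUT OUTPUT INPUT_MIME OUTPUT_MIME  (hypothetical helper)
# POSTs INPUT to the service and writes the response body to OUTPUT.
# Returns 2 if INPUT is unreadable, otherwise curl's exit status.
convert_file() {
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
  curl --silent --show-error --fail \
       --header "Content-Type: $3" \
       --header "Accept: $4" \
       --data-binary "@$1" \
       --output "$2" \
       "$JOD_URL"
}

# Example:
#   convert_file report.doc report.pdf application/msword application/pdf
```

With --fail, curl exits non-zero on an HTTP error instead of saving the error page as if it were the converted file.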

5. Time to write the code to connect to the jodconverter service and convert our documents for us. Here’s my code:

<?php
// prepare the file download (in my case, these actually come from a database record)
$file = '/path/to/input/file';

// instantiate the document converter class.
require_once 'HTTP/Request.php';
class Converter{
  var $url = 'http://localhost:8080/converter/service';
  function convert($input, $input_file_type, $output_file_type){
    $request = new HTTP_Request($this->url);
    $request->setMethod("POST");
    $request->addHeader("Content-Type", $input_file_type);
    $request->addHeader("Accept", $output_file_type);
    $request->setBody($input);
    $request->sendRequest();
    return $request->getResponseBody();
  }
}

// do whatever else we need to do to make the magic happen
$converter = new Converter();
$input_file_type = 'application/msword'; // for a .doc input; use 'application/vnd.oasis.opendocument.text' for .odt
$output_file_type = 'application/pdf';
// pathinfo() returns the file name without its extension (PHP 5.2+)
$output_file = '/path/to/' . pathinfo($file, PATHINFO_FILENAME) . '.pdf';

$output = $converter->convert(file_get_contents($file), $input_file_type, $output_file_type);
file_put_contents($output_file, $output);

// and now replace the file variable with what we just created
$file = $output_file;
$download_filename = basename($file);

// required for IE, otherwise Content-Disposition is ignored
if(ini_get('zlib.output_compression')) ini_set('zlib.output_compression', 'Off');

// and now pipe the file out to the customer
header("Expires: Mon, 26 Jul 1997 05:00:00 GMT"); // some day in the past
header("Pragma: public");
header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
header("Cache-Control: private", false);
header('Content-Description: File Transfer');
header("Content-Type: application/octet-stream");
header("Content-Transfer-Encoding: Binary");
header("Content-Length: " . filesize($file));
header('Content-Disposition: attachment; filename="' . $download_filename . '"');
set_time_limit(0);
readfile($file);
exit;
?>

Modify as you need, of course.
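
The Content-Type and Accept headers are plain MIME types, so supporting more of the conversions listed at the top is mostly a matter of picking the right string for each extension. Here is a sketch of such a lookup as a shell function, handy when scripting the service from the command line. The name mime_for is made up, the values are the commonly used MIME types for each format, and whether a given JODConverter build accepts every one of them is an assumption worth checking against its documentation.

```shell
# mime_for EXT: print the MIME type for a file extension (hypothetical helper).
# Returns 1 for extensions it does not know.
mime_for() {
  case "$1" in
    doc) echo "application/msword" ;;
    xls) echo "application/vnd.ms-excel" ;;
    ppt) echo "application/vnd.ms-powerpoint" ;;
    odt) echo "application/vnd.oasis.opendocument.text" ;;
    ods) echo "application/vnd.oasis.opendocument.spreadsheet" ;;
    odp) echo "application/vnd.oasis.opendocument.presentation" ;;
    pdf) echo "application/pdf" ;;
    rtf) echo "text/rtf" ;;
    csv) echo "text/csv" ;;
    swf) echo "application/x-shockwave-flash" ;;
    *)   return 1 ;;
  esac
}

# Example:
#   mime_for doc   # prints application/msword
```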

Reference: JODConverter Online Guide

