Getting Images out of Office 2007 docx files

June 26th, 2007

Office 2007 uses a new document format called Office Open XML, or OOXML for short. Word documents in this format typically have the extension .docx to distinguish them from older Word documents, which use .doc. If you open one of these files in Notepad you will notice that it is almost completely unreadable, and you may wonder how this format could possibly be better.

The same file saved in both Office 2003 doc format and Office 2007 docx format will usually be much smaller in docx format. Why is this? It's because docx files are really just zip files! Simply take any .docx file, rename it to .zip, and extract it. You will then have several directories and XML files which make up the docx file. In these XML files you can see the different styles used, as well as your text in the document.xml file (inside the word directory).
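Renaming is not strictly required, since any zip reader can open the file directly. Here is a minimal sketch using Python's built in zipfile module to list what is inside a docx. The function name list_docx_parts is made up for this example, and the exact set of entries depends on the document.

```python
import zipfile

def list_docx_parts(docx_path):
    """Return the names of all entries stored inside the docx archive."""
    # A docx is an ordinary zip archive, so ZipFile opens it unchanged.
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()
```

Calling list_docx_parts on a Word document should show entries such as word/document.xml along with the style and relationship files.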

Apart from the XML file holding your text, the extracted data will also contain a media folder (under the word directory) if the document included any media. This folder contains all the images and media that were added to the Word document, making it extremely easy to copy all the images used in a docx to somewhere else. What's even more amazing than having a folder of all the images available just by unzipping is that the images are stored at their original size. All that changes is that they are given a new name such as image1.jpg. Try it for yourself. Create a new Word doc and drag in several images. Then resize the images down and save the file as a docx. Then unzip the docx and look in the media folder. All the images will be in the folder, at their real size instead of the scaled down size shown in the Word doc.
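The unzip and copy steps can also be scripted. The following is a rough sketch, not the only way to do it: it assumes the images live under word/media/ inside the archive, and the function name extract_docx_media and the output directory are made up for this example.

```python
import os
import zipfile

def extract_docx_media(docx_path, out_dir):
    """Copy every file stored under word/media/ in the docx into out_dir."""
    written = []
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip everything that is not a file inside the media folder.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            target = os.path.join(out_dir, os.path.basename(name))
            with archive.open(name) as src, open(target, "wb") as dst:
                dst.write(src.read())
            written.append(target)
    return written
```

For example, extract_docx_media("report.docx", "images") would return the paths of the copied files, each keeping the name Word gave it, such as image1.jpg.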

It should also be pointed out that docx files make for smaller email attachments since they are already zipped up. And since they are already zipped, there is really no need to compress them further unless you use something with better compression than zip or the default docx compression. If anyone ever emails you a zipped docx file you can let them know they are just wasting their time :)

Listing Files with Python

April 26th, 2007

Getting a list of files in a directory and its subdirectories is very simple using the built in os module in Python. In this example the script is run with the directory to explore as its command line argument. Any directories found in the given directory are recursed into and their files added to the final list of all files.

# Emgarten.com April 26th, 2007
import os,sys

def findFiles(path):
  files = []

  # loop through files and directories
  for x in os.listdir(path):
    # os.path.join copes with a trailing slash and the platform's separator
    full = os.path.join(path, x)
    if os.path.isdir(full):
      # list this dir also
      files.extend(findFiles(full))
    else:
      # add the file to the list
      files.append(full)
  return files

if __name__ == "__main__":
  if len(sys.argv) != 2:
    print("Usage: %s <dir>" % sys.argv[0])
    sys.exit(1)

  # get the list of files
  files = findFiles(sys.argv[1])

  for f in files:
    print(f)

The function returns a list with the path of every file found in the directory, each one prefixed with the directory that was passed in.
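The standard library can also do the recursion for you. As an alternative sketch (not the approach used above), os.walk visits every subdirectory and hands back the file names found in each one. The name find_files_walk is made up for this example.

```python
import os

def find_files_walk(top):
    """Return the paths of all files under top, including subdirectories."""
    files = []
    # os.walk yields one (directory, subdirectories, file names) tuple per directory.
    for root, _dirs, names in os.walk(top):
        for name in names:
            files.append(os.path.join(root, name))
    return files
```

The order of the results depends on the file system, so sort the list if you need a stable order.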

MySQL Datetime string to a Timestamp

April 16th, 2007

Here is a simple function to convert a datetime string from MySQL to a UNIX timestamp using Python.

import time

# Converts an SQL datetime to a UNIX timestamp
# Example datetime: 2007-04-15 08:29:39
# Note: time.mktime interprets the value as local time
def sqlDateTimeToTimeStamp(sqlDateTime):
  return int(time.mktime(time.strptime(sqlDateTime, "%Y-%m-%d %H:%M:%S")))
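
Because time.mktime works in the local time zone, the same string gives different timestamps on machines in different zones. If your MySQL values are stored in UTC (an assumption about your setup, not something the function above checks), a variant along these lines may be closer to what you want. The name sql_datetime_to_utc_timestamp is made up for this example.

```python
import calendar
import time

def sql_datetime_to_utc_timestamp(sql_datetime):
    """Convert 'YYYY-MM-DD HH:MM:SS' to a UNIX timestamp, treating it as UTC."""
    parsed = time.strptime(sql_datetime, "%Y-%m-%d %H:%M:%S")
    # calendar.timegm is the UTC counterpart of time.mktime.
    return calendar.timegm(parsed)
```

Passing the result to time.gmtime should give back the same date and time that went in.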