Getting Images out of Office 2007 docx files
Tuesday, June 26th, 2007
Office 2007 uses a new document format called Office Open XML, or OOXML for short. Word documents in this format typically have the extension .docx to differentiate them from older Word documents, which use .doc. If you open one of these files in Notepad you will notice that it is almost completely unreadable, and you may wonder how this format could possibly be better.
The same file saved in both the Office 2003 .doc format and the Office 2007 .docx format will usually be much smaller as a .docx. Why is this? It's because docx files are really just zip files! Simply take any .docx file, rename it to .zip, and extract it. You will then have several directories and XML files which make up the docx file. In these XML files you can see the different styles used, and your text is in the document.xml file inside the word folder.
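If you would rather not rename anything, the same peek inside can be sketched in a few lines of Python using the standard zipfile module. This is only a rough illustration: the function names are made up for this post, and it assumes the main text lives at word/document.xml, which is where Word normally puts it.

```python
import zipfile


def list_docx_parts(docx_path):
    """Return the names of all files stored inside a .docx archive."""
    # A .docx is an ordinary zip archive, so no renaming is needed.
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()


def read_document_xml(docx_path):
    """Return the main document XML as text.

    Assumes the usual location, word/document.xml.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml").decode("utf-8")
```

Calling list_docx_parts("report.docx") (report.docx being a placeholder name) should show the same directories and XML files you would get by unzipping by hand.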
Apart from an XML file of your text, the extracted data will also contain a media folder (inside the word folder) if the document includes any images or other media. This folder contains all the images and media that were added to the Word document, making it extremely easy to copy every image used in a docx somewhere else. What's even more amazing than getting a folder of all the images just by unzipping is that the images are stored at their original size; all that changes is that they are given a new name such as image1.jpg. Try it for yourself. Create a new Word doc and drag in several images. Then resize the images down and save the file as a docx. Then unzip the docx and look in the media folder. All the images will be in that folder at their real size instead of the scaled-down size shown in the Word doc.
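Pulling the images out can also be scripted. Here is a minimal sketch in Python, assuming the images sit under word/media/ inside the archive; the function name and the output directory are just placeholders for illustration.

```python
import os
import zipfile


def extract_docx_images(docx_path, out_dir):
    """Copy every file under word/media/ in the .docx into out_dir.

    Returns the list of paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip everything outside the media folder, and directory entries.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            target = os.path.join(out_dir, os.path.basename(name))
            with open(target, "wb") as out_file:
                out_file.write(archive.read(name))
            saved.append(target)
    return saved
```

Something like extract_docx_images("report.docx", "images") would then leave image1.jpg and friends in an images folder, using the names Word gave them.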
It should also be pointed out that docx files make for smaller email attachments since they are already zipped up. And since they are zipped up, there is really no need to compress them further unless you use something with better compression than zip or the default docx compression. If anyone ever emails you a zipped docx file, you can let them know they are just wasting their time.
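If you want to check the "zipping a docx is pointless" claim on your own files, here is a small sketch that zips a docx a second time in memory and reports both sizes. The function name is made up for this post, and the exact numbers will depend on the document, so treat it as a way to measure rather than a guarantee.

```python
import io
import os
import zipfile


def rezip_sizes(docx_path):
    """Return (size on disk, size after zipping the file again), in bytes."""
    original_size = os.path.getsize(docx_path)
    # Build the outer zip in memory so nothing extra is written to disk.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as outer:
        outer.write(docx_path, arcname=os.path.basename(docx_path))
    return original_size, len(buffer.getvalue())
```

Comparing the two numbers for a few of your own documents shows how little, if anything, the second round of zipping saves.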