CenterForOpenScience · kah-odonnell · May 21, 2013
diff --git a/.gitignore b/.gitignore
@@ -37,3 +37,8 @@ pip-log.txt
 nosetests.xml
 *.mo
 .idea
+
+test.html
+testxml.html
+
+main.py
diff --git a/.travis.yml b/.travis.yml
@@ -2,9 +2,13 @@ language: python
 python:
   - "2.6"
   - "2.7"
-script: python main.py
+script: ./run_tests.sh
 install:
+  - python setup.py -q install
   - pip install -r requirements.txt
+env:
+  - TRAVIS_EXECUTE_PERFORMANCE=1
 notifications:
   email:
     - jason.louard.ward@gmail.com
+    - samson91787@gmail.com
diff --git a/AUTHORS b/AUTHORS
@@ -0,0 +1,2 @@
+Sam Protnow <samson91787@gmail.com>
+Jason Ward <jason.louard.ward@gmail.com>
diff --git a/CHANGELOG b/CHANGELOG
@@ -0,0 +1,74 @@
+
+Changelog
+=========
+* 0.3.13
+    * Significant performance gains for documents with a large number of table
+      cells.
+    * Significant performance gains for large documents.
+* 0.3.12
+    * Added command line support to convert from docx to either html or
+      markdown.
+* 0.3.11
+    * The non breaking hyphen tag was not correctly being imported. This issue
+      has been fixed.
+* 0.3.10
+    * Found and optimized a fairly large performance issue with tables that had
+      large amounts of content within a single cell, which includes nested
+      tables.
+* 0.3.9
+    * We now respect the `<w:tab/>` element and insert a space wherever it
+      occurs.
+    * Each style can have a default defined by values in `styles.xml`.
+      These defaults can be overridden using the `rPr` on the actual `r`
+      tag. The defaults defined in `styles.xml` are now actually
+      respected.
+* 0.3.8
+    * If `zipfile` fails to open the passed-in file, we now raise a
+      `MalformedDocxException` instead of a `BadZipfile`.
+* 0.3.7
+    * Some inline tags (most notably the underline tag) could have a `val` of
+      `none` and that would signify that the style is disabled. A `val` of
+      `none` is now correctly handled.
+* 0.3.6
+    * It is possible for a docx file to not contain a `numbering.xml` file but
+      still try to use lists. Now if this happens all lists get converted to
+      paragraphs.
+* 0.3.5
+    * Not all docx files contain a `styles.xml` file. We are no longer assuming
+      they do.
+* 0.3.4
+    * It is possible for `w:t` tags to have `text` set to `None`. This no
+      longer causes an error when escaping that text.
+* 0.3.3
+    * In the event that `cElementTree` has a problem parsing the document, a
+      `MalformedDocxException` is raised instead of a `SyntaxError`.
+* 0.3.2
+    * We were not taking into account that vertical merges should have a
+      continue attribute, but sometimes they do not, and in those cases Word
+      assumes the continue attribute. We updated the parser to handle the
+      cases in which the continue attribute is not there.
+    * We now correctly handle documents with unicode characters in the
+      namespace.
+    * In rare cases, some text would be output with a style when it should not
+      have been. This issue has been fixed.
+* 0.3.1
+    * Added support for several more OOXML tags including:
+        * caps
+        * smallCaps
+        * strike
+        * dstrike
+        * vanish
+        * webHidden
+      More details in the README.
+* 0.3.0
+    * We switched from using stock *xml.etree.ElementTree* to using
+      *xml.etree.cElementTree*. This has resulted in a fairly significant speed
+      increase for Python 2.6.
+    * It is now possible to create your own pre processor to do additional pre
+      processing.
+    * Superscripts and subscripts are now extracted correctly.
+* 0.2.1
+    * Added a changelog
+    * Added the version in pydocx.__init__
+    * Fixed an issue with duplicating content if there was indentation or
+      justification on a p element that had multiple t tags.
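The 0.3.8 and 0.3.3 entries above both describe the same pattern: low-level `zipfile` and XML parsing errors are translated into a single `MalformedDocxException`. A minimal sketch of that pattern follows. The exception class here is a stand-in defined locally, and `read_document_xml` and its `member` parameter are hypothetical names chosen for illustration; none of this is taken from the pydocx source.

```python
import io
import zipfile
import xml.etree.ElementTree as ElementTree

# Python 3 spells the zipfile error BadZipFile; Python 2 spells it BadZipfile.
BadZip = getattr(zipfile, "BadZipFile", None) or getattr(zipfile, "BadZipfile")


class MalformedDocxException(Exception):
    """Local stand-in for pydocx's exception of the same name (assumed shape)."""


def read_document_xml(file_obj, member="word/document.xml"):
    """Return the parsed root of `member`, or raise MalformedDocxException.

    Hypothetical helper: shows the error translation only, not pydocx's loader.
    """
    try:
        with zipfile.ZipFile(file_obj) as archive:
            raw = archive.read(member)
    except BadZip as error:
        # The file is not a readable zip archive at all.
        raise MalformedDocxException("not a valid zip archive: %s" % error)
    try:
        return ElementTree.fromstring(raw)
    except SyntaxError as error:
        # ElementTree's ParseError is a subclass of SyntaxError.
        raise MalformedDocxException("invalid XML: %s" % error)
```

Callers then only need a single `except MalformedDocxException` clause, regardless of which library detected the problem.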
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1,7 @@
+include AUTHORS
+include CHANGELOG
+include LICENSE
+include MANIFEST.in
+include README.rst
+include pydocx/fixtures/*
+include pydocx/tests/templates/*
diff --git a/README.md b/README.md
diff --git a/README.rst b/README.rst
@@ -0,0 +1,238 @@
+======
+pydocx
+======
+.. image:: https://travis-ci.org/CenterForOpenScience/pydocx.png?branch=master
+   :align: left
+   :target: https://travis-ci.org/CenterForOpenScience/pydocx
+
+pydocx is a parser that breaks down the elements of a docx file and converts them
+into different markup languages. Right now, HTML is supported. Markdown and LaTeX
+will be available soon. You can extend any of the available parsers to customize
+them to your needs. You can also create your own class that inherits from
+DocxParser and implements the methods for a markup language not yet supported.
+
+Currently Supported
+###################
+
+* tables
+    * nested tables
+    * rowspans
+    * colspans
+    * lists in tables
+* lists
+    * list styles
+    * nested lists
+    * list of tables
+    * list of paragraphs
+* justification
+* images
+* styles
+    * bold
+    * italics
+    * underline
+    * hyperlinks
+* headings
+
+Usage
+#####
+
+DocxParser includes abstract methods that each parser overrides to satisfy its own needs. The abstract methods are as follows:
+
+::
+
+    class DocxParser:
+
+        @property
+        def parsed(self):
+            return self._parsed
+
+        @abstractmethod
+        def escape(self, text):
+            return text
+
+        @abstractmethod
+        def linebreak(self):
+            return ''
+
+        @abstractmethod
+        def paragraph(self, text):
+            return text
+
+        @abstractmethod
+        def heading(self, text, heading_level):
+            return text
+
+        @abstractmethod
+        def insertion(self, text, author, date):
+            return text
+
+        @abstractmethod
+        def hyperlink(self, text, href):
+            return text
+
+        @abstractmethod
+        def image_handler(self, path):
+            return path
+
+        @abstractmethod
+        def image(self, path, x, y):
+            return self.image_handler(path)
+
+        @abstractmethod
+        def deletion(self, text, author, date):
+            return text
+
+        @abstractmethod
+        def bold(self, text):
+            return text
+
+        @abstractmethod
+        def italics(self, text):
+            return text
+
+        @abstractmethod
+        def underline(self, text):
+            return text
+
+        @abstractmethod
+        def superscript(self, text):
+            return text
+
+        @abstractmethod
+        def subscript(self, text):
+            return text
+
+        @abstractmethod
+        def tab(self):
+            return True
+
+        @abstractmethod
+        def ordered_list(self, text):
+            return text
+
+        @abstractmethod
+        def unordered_list(self, text):
+            return text
+
+        @abstractmethod
+        def list_element(self, text):
+            return text
+
+        @abstractmethod
+        def table(self, text):
+            return text
+        @abstractmethod
+        def table_row(self, text):
+            return text
+
+        @abstractmethod
+        def table_cell(self, text):
+            return text
+
+        @abstractmethod
+        def page_break(self):
+            return True
+
+        @abstractmethod
+        def indent(self, text, left='', right='', firstLine=''):
+            return text
+
+Docx2Html inherits from DocxParser and implements basic HTML handling. For example:
+
+::
+
+    class Docx2Html(DocxParser):
+
+        #  Escape '&', '<', and '>' so we render the HTML correctly
+        def escape(self, text):
+            return xml.sax.saxutils.quoteattr(text)[1:-1]
+
+        # return a line break
+        def linebreak(self, pre=None):
+            return '<br />'
+
+        # add paragraph tags
+        def paragraph(self, text, pre=None):
+            return '<p>' + text + '</p>'
+
+
+However, let's say you want to add a specific style to your HTML document. To do this, you want to give each paragraph the CSS class `my_implementation`. Simply extend Docx2Html and add what you need.
+
+::
+
+    class My_Implementation_of_Docx2Html(Docx2Html):
+
+        def paragraph(self, text, pre=None):
+            return '<p class="my_implementation">' + text + '</p>'
+
+
+
+Or, let's say FOO is your new favorite markup language. Simply write your own parser, overriding the abstract methods of DocxParser.
+
+::
+
+    class Docx2Foo(DocxParser):
+
+        # line breaks are denoted by '!!!!!!!!!!!!' in the FOO markup language :)
+        def linebreak(self):
+            return '!!!!!!!!!!!!'
+
+Custom Pre-Processor
+####################
+
+When creating your own parser (as described above) you can now add your own custom pre-processor. To do so, set the `pre_processor_class` field on the custom parser, like so:
+
+::
+
+    class Docx2Foo(DocxParser):
+        pre_processor_class = FooPreProcessor
+
+
+The `FooPreProcessor` will need a few things to get you going:
+
+::
+
+    class FooPreProcessor(PydocxPreProcessor):
+        def perform_pre_processing(self, root, *args, **kwargs):
+            super(FooPreProcessor, self).perform_pre_processing(root, *args, **kwargs)
+            self._set_foo(root)
+
+        def _set_foo(self, root):
+            pass
+
+If you want `_set_foo` to be called, you must call it from `perform_pre_processing`, which is called in the base parser for pydocx.
+
+Everything done during pre-processing is executed prior to `parse` being called for the first time.
+
+
+Styles
+######
+
+The base parser `Docx2Html` relies on certain CSS classes being set for certain behaviour to occur. Currently these include:
+
+* class `pydocx-insert` -> Turns the text green.
+* class `pydocx-delete` -> Turns the text red and draws a line through the text.
+* class `pydocx-center` -> Aligns the text to the center.
+* class `pydocx-right` -> Aligns the text to the right.
+* class `pydocx-left` -> Aligns the text to the left.
+* class `pydocx-comment` -> Turns the text blue.
+* class `pydocx-underline` -> Underlines the text.
+* class `pydocx-caps` -> Makes all text uppercase.
+* class `pydocx-small-caps` -> Makes all text uppercase; however, truly lowercase letters will be smaller than their uppercase counterparts.
+* class `pydocx-strike` -> Strikes a line through the text.
+* class `pydocx-hidden` -> Hides the text.
+
+Exceptions
+##########
+
+Right now there is only one custom exception (`MalformedDocxException`). It is raised if either the `xml` or `zipfile` libraries raise an exception.
+
+Optional Arguments
+##################
+
+You can pass in `convert_root_level_upper_roman=True` to the parser and it will convert all root level upper roman lists to headings instead.
+
+Command Line Execution
+######################
+
+First, install pydocx by running `pip install pydocx`. You can then run `pydocx --html path/to/file.docx path/to/output.html`. Replace `--html` with `--markdown` to convert to Markdown instead.
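The override pattern from the README's Usage section can be tried without a docx file. The sketch below uses hypothetical stand-in classes (`BaseParser`, `HtmlParser`, `StyledHtmlParser`) that mirror only the `escape`, `paragraph` and `linebreak` hooks shown above. The `render` driver is an assumption made for illustration and is not how pydocx walks a document.

```python
import xml.sax.saxutils


class BaseParser(object):
    """Hypothetical stand-in for DocxParser: only the hooks used below."""

    def escape(self, text):
        return text

    def paragraph(self, text, pre=None):
        return text

    def linebreak(self, pre=None):
        return ''

    def render(self, paragraphs):
        # Illustrative driver (an assumption, not pydocx's parse loop):
        # escape each paragraph, wrap it, and join with line breaks.
        wrapped = [self.paragraph(self.escape(p)) for p in paragraphs]
        return self.linebreak().join(wrapped)


class HtmlParser(BaseParser):
    """Mirrors the Docx2Html excerpt from the README."""

    def escape(self, text):
        # quoteattr adds surrounding quotes; slice them off again.
        return xml.sax.saxutils.quoteattr(text)[1:-1]

    def paragraph(self, text, pre=None):
        return '<p>' + text + '</p>'

    def linebreak(self, pre=None):
        return '<br />'


class StyledHtmlParser(HtmlParser):
    """Overrides a single hook; everything else is inherited."""

    def paragraph(self, text, pre=None):
        return '<p class="my_implementation">' + text + '</p>'


if __name__ == '__main__':
    print(StyledHtmlParser().render(['a & b', 'c']))
```

Because only `paragraph` is overridden, escaping and line breaks still come from the HTML parent class, which is the point of extending an existing parser rather than starting from the abstract base.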
diff --git a/main.py b/main.py