Open
519 commits
bf2705b
refs #29: updated the xml based tests for the new expected html
May 21, 2013
061d8d1
refs #29: updated white space
May 21, 2013
2aa5922
refs #29: updated the parser for valid values
May 21, 2013
7f78ad4
Merge pull request #25 from OpenScienceFramework/issue_25
jlward May 21, 2013
13a0524
Merge branch 'master' into issue_27
May 21, 2013
6b3cdd0
Merge branch 'master' into issue_28
May 21, 2013
8f387f3
refs #28: updated test based on merged master
May 21, 2013
f44ca81
Merge branch 'master' into issue_29
May 21, 2013
cbea7a9
refs #29: updated tests based on merged master
May 21, 2013
262cdd1
Merge pull request #27 from OpenScienceFramework/issue_27
jlward May 21, 2013
cf15525
Merge branch 'master' into issue_28
May 21, 2013
1c72947
refs #28: updated tests based on merged master
May 21, 2013
eb444f9
Merge branch 'master' into issue_29
May 21, 2013
e605958
refs #29: updated tests based on merged master
May 21, 2013
e254e81
refs #28: split up a line into multiple lines
May 21, 2013
04c407d
refs #28: updated how we are doing underline
May 21, 2013
8c5b39c
refs #28: Added css stuff to the README
May 21, 2013
cc1dd25
Merge branch 'issue_28' into issue_29
May 21, 2013
a0de8a9
refs #29: namespaced all the css classes
May 21, 2013
9142a9f
Merge pull request #28 from OpenScienceFramework/issue_28
jlward May 21, 2013
29a6893
Merge branch 'master' into issue_29
May 21, 2013
d584137
refs #30: Updated the tests for expected behaviour
May 21, 2013
c9c76fb
refs #30: no longer adding invalid attributes to insert and delete tags
May 21, 2013
70fc06f
refs #30: stopped break separating tags that were inline like (insert…
May 21, 2013
6dde935
added test; made size search cleaner
SamPortnow May 21, 2013
4095602
updating tests
SamPortnow May 21, 2013
1699da1
updated test
SamPortnow May 21, 2013
988b767
fixed tests
SamPortnow May 22, 2013
07e145b
updating
SamPortnow May 22, 2013
5984952
Merge pull request #29 from OpenScienceFramework/issue_29
jlward May 22, 2013
0bd3e36
Merge branch 'master' into issue_30
May 22, 2013
72bf87a
comment fixes
SamPortnow May 22, 2013
92e7b7d
fixed error
SamPortnow May 22, 2013
2c91c7e
Merge branch 'master' of https://github.com/OpenScienceFramework/pydo…
SamPortnow May 22, 2013
ecda5d2
fixed test
SamPortnow May 23, 2013
2b896df
refs #32: small refactor
May 23, 2013
31fe2a1
refs #32: code updated based on deprications
May 23, 2013
d0cf19b
refs #32: can now do underline/italics
May 23, 2013
27de3f0
refs #32: it is now possible to test headings
May 23, 2013
a061e3a
Merge pull request #30 from OpenScienceFramework/issue_30
jlward May 23, 2013
480d417
Merge branch 'master' into issue_32
May 23, 2013
add38af
refs #33: Updated the test for expected image behaviour
May 23, 2013
9329b0a
refs #33: updated DocxParser to extract images
May 23, 2013
737d351
refs #20 code cleanup, removed print statment
May 23, 2013
c418599
Merge branch 'master' into localDpi
May 23, 2013
c525515
Merge pull request #20 from OpenScienceFramework/localDpi
jlward May 23, 2013
f134050
Merge branch 'master' into issue_32
May 23, 2013
dc25a5f
Merge branch 'master' into issue_33
May 23, 2013
2cba0a4
refs #33: name change, import clenaup
May 23, 2013
e5975fc
refs #33: removed lying comments, added a comment
May 23, 2013
196522a
refs #33: on the rest of the image test cases, made sure the image wa…
May 23, 2013
a25affc
refs #33: good catch on the KeyError
May 23, 2013
5776117
refs #33: name change, no longer need try/except
May 23, 2013
999c3a2
refs #33: assume zip_path is always set
May 23, 2013
8c03bc2
Merge pull request #33 from OpenScienceFramework/issue_33
jlward May 23, 2013
85b82e4
Merge branch 'master' into issue_32
May 23, 2013
cdd8846
refs #32: removed the != None part for the style
May 23, 2013
c16b298
Merge pull request #32 from OpenScienceFramework/issue_32
jlward May 23, 2013
5dec2fb
Bumped to version 0.1.3
May 23, 2013
8897f03
bumped to version 0.1.4
May 23, 2013
2875d87
bumped to version 0.1.5
May 23, 2013
a620d81
bumped to version 0.1.6
May 23, 2013
89e996f
bumped to version 0.1.7
May 23, 2013
bf79516
refs #35: updated the readme so PyPi likes it.
May 28, 2013
fe0c875
refs #35: hopefully the name change will force github to render with …
May 28, 2013
fb47133
refs #35: Peg PyPi support for 2.6 and 2.7
May 28, 2013
d7b7d09
Merge pull request #35 from OpenScienceFramework/issue_35
jlward May 28, 2013
4ae6ee6
bumped to version 0.1.8
May 28, 2013
766f768
Fixed the manifest
May 28, 2013
6d3a372
Fixed a broken filename
May 28, 2013
e6de463
refs #34: updated the tests for expected behaviour for base 64 encodi…
May 28, 2013
5f4f649
refs #34: store and pass around the image data, instead of the image …
May 28, 2013
82735a5
refs #34: image handler now deals with image data and base 64 encodes…
May 28, 2013
5056af4
refs #34: since we no longer write the image to disk, we no longer ne…
May 28, 2013
5cb6028
refs #37: Added tests showing what should happen with upper roman num…
May 28, 2013
eab44ab
refs #37: it is now possible to convert root level upper roman lists …
May 28, 2013
f0842d9
refs #34: passed along the filename, correctly created the src for im…
May 29, 2013
38c6d48
refs #37: updates based on code review
May 29, 2013
b657f7d
Merge pull request #34 from OpenScienceFramework/issue_34
jlward May 29, 2013
f48a013
Merge branch 'master' into issue_37
May 29, 2013
433fe73
Merge pull request #37 from OpenScienceFramework/issue_37
jlward May 29, 2013
8a649f1
bumped to version 0.2.0
May 29, 2013
2d1292d
refs #38: refactored the test code to do r tags correctly
May 30, 2013
1db516e
refs #38: added a test showing the duplicated content issue
May 30, 2013
138684a
refs #38: fixed justifications
May 30, 2013
948a4fe
refs #38: removed dead code
May 30, 2013
2cfe7e6
refs #38: added a comment showing what still needs to be done.
May 30, 2013
139dce2
refs #38: code cleanup
May 30, 2013
525857c
minor change
SamPortnow May 31, 2013
3cbee23
Merge branch 'issue_38' of https://github.com/OpenScienceFramework/py…
SamPortnow May 31, 2013
6e188f4
refs #38: Added a changelog
May 31, 2013
0ff6f50
Merge pull request #38 from OpenScienceFramework/issue_38
jlward May 31, 2013
a0f1daa
bumped to version 0.2.1
May 31, 2013
4ff701d
refs #41: refactor and started using spans inline instead of divs
May 31, 2013
3ab2569
refs #43: switched to using lxml
Jun 3, 2013
9db7b51
refs #43: udpated the reqs
Jun 3, 2013
7b196af
merged with master
SamPortnow Jun 4, 2013
b2d70b6
refs #43: got the last of the failing unit tests passing
Jun 4, 2013
cb25ea0
refs #43: Big refactor. Moved all the new lxml parser to its own file.
Jun 4, 2013
f55427b
refs #43: No longer storing on attrib, storing on a global dictionary…
Jun 4, 2013
b48e8e3
refs #43: Refactor to no longer need the subclassed parser
Jun 4, 2013
ceabf65
refs #43: switched to using cElementTree
Jun 4, 2013
f78e857
refs #43: no longer need lxml
Jun 4, 2013
7389127
refs #43: small refactor, no longer skipping the test that use to take
Jun 4, 2013
39ccc37
refs #43: change log and updated readme
Jun 4, 2013
7ce47bd
refs #43: removed a dead print statement
Jun 4, 2013
514fb5a
refs #42: Added two tests for how sub/super scripts are supposed to work
Jun 4, 2013
e07b683
refs #42: sub and super scripts are now working
Jun 4, 2013
23533d6
refs #42: it would help to add the fixture for the docx test
Jun 4, 2013
9543d4e
refs #42: added an update note and updated the readme
Jun 4, 2013
898cfd0
refs #42: added a comment
Jun 4, 2013
bc87a89
Revert "refs #41: refactor and started using spans inline instead of …
Jun 5, 2013
1d4b406
Merge pull request #43 from OpenScienceFramework/issue_43
winhamwr Jun 5, 2013
b137c64
Merge branch 'master' into issue_42
Jun 5, 2013
155dc34
refs #42: changed conditional to elif
Jun 5, 2013
e460158
Merge pull request #42 from OpenScienceFramework/issue_42
jlward Jun 5, 2013
07d1da6
bumped to version 0.3.0
Jun 5, 2013
7b088ab
simple lists and simple tables working
SamPortnow Jun 6, 2013
11f196a
merged with master
SamPortnow Jun 6, 2013
2758f4a
merged with master
SamPortnow Jun 6, 2013
76dc1e7
merged with master
SamPortnow Jun 6, 2013
7b4cca9
merged with master
SamPortnow Jun 12, 2013
191a99c
table issue
SamPortnow Jun 12, 2013
bc28d22
removed uncessary import
SamPortnow Jun 12, 2013
52286ae
removed uncessary file
SamPortnow Jun 12, 2013
2531459
updated the test
SamPortnow Jun 12, 2013
4d92d84
Merge branch 'master' into table_fix
SamPortnow Jun 12, 2013
468502e
flake8 compliant
SamPortnow Jun 12, 2013
4145aa7
flake8 compliant
SamPortnow Jun 12, 2013
b4edee2
minor changes
SamPortnow Jun 12, 2013
ddc5600
refs #44: Added tests showing what the different supported r styles s…
Jun 12, 2013
cef0f9e
refs #44: Got several more r styles working correctly
Jun 12, 2013
9741e66
refs #44: update note and updated readme
Jun 12, 2013
137fbc0
made changes based on comments; added test case
SamPortnow Jun 13, 2013
1fb36a2
made changes based on comments; added test case
SamPortnow Jun 13, 2013
7774916
made changes based on comments; added test case
SamPortnow Jun 13, 2013
251a225
refs #44: First step at passing in an rPr instead
Jun 13, 2013
3fcebc6
refs #44: updated the tests to use the rpr instead
Jun 13, 2013
97ab97a
removed uncessary line
SamPortnow Jun 13, 2013
0fcfc7c
changed table tests
SamPortnow Jun 13, 2013
d50cdfd
refs #44: small refactor, not calling find twice per inline call anymore
Jun 13, 2013
e06802f
refs #44: even better performance
Jun 13, 2013
a6807b8
refs #44: no more kwargs abusing
Jun 13, 2013
f5cc362
Merge pull request #44 from OpenScienceFramework/issue_44
jlward Jun 13, 2013
c67efbf
bumped to version 0.3.1
Jun 13, 2013
bd6e8e7
change to vmerge
SamPortnow Jun 17, 2013
81ca609
merged with masteR
SamPortnow Jun 17, 2013
83505df
updated changelog
SamPortnow Jun 17, 2013
4e32fd5
updated changelog
SamPortnow Jun 17, 2013
7c1603a
changes based on comments
SamPortnow Jun 17, 2013
ebddee8
Merge pull request #45 from OpenScienceFramework/table_fix
jlward Jun 18, 2013
8f84ebe
refs #47: Added a test showing that not all unicode is handled correctly
Jul 2, 2013
139826f
refs #47: found the encoding of the document, and passed that encodin…
Jul 2, 2013
54b89f0
refs #47: update note
Jul 2, 2013
5ff7077
refs #48: added a test showing that val=0 should not add the style
Jul 2, 2013
0f7dad0
refs #48: val=0 no longer adds a style
Jul 2, 2013
6a14f56
refs #48: update note
Jul 2, 2013
84f7586
refs #47: updates based on code review
Jul 2, 2013
2969ad8
Merge pull request #47 from OpenScienceFramework/issue_47
jlward Jul 2, 2013
6141f9f
Merge branch 'master' into issue_48
Jul 2, 2013
d47be5b
refs #48: refactor
Jul 2, 2013
23f799a
Merge pull request #48 from OpenScienceFramework/issue_48
jlward Jul 2, 2013
c46e864
bumped to version 0.3.2
Jul 2, 2013
13a677f
refs #50: Catching the SyntaxError and raising a custom exception
Jul 3, 2013
101f5d2
refs #50: update note
Jul 3, 2013
a47c8bc
Merge pull request #50 from OpenScienceFramework/issue_50
jlward Jul 3, 2013
f110526
bumped to version 0.3.3
Jul 3, 2013
08360b8
refs #46: Added a test showing the error with self closing t tags
Jul 5, 2013
cbc2feb
refs #46: fixed the problem with el.text being None
Jul 5, 2013
d84b50b
refs #46: update note
Jul 5, 2013
b05e308
refs #46: code cleanup
Jul 5, 2013
5a49778
Merge pull request #46 from OpenScienceFramework/issue_46
jlward Jul 5, 2013
2bb32d6
bumped to version 0.3.4
Jul 5, 2013
b4073fb
refs #51: Added a document fixture that is missing the styles.xml fil…
Jul 8, 2013
ef6135b
refs #51: No longer assuming that all docx files must have styles.xml
Jul 8, 2013
b54e80b
refs #51: update note
Jul 8, 2013
f207307
Merge pull request #51 from OpenScienceFramework/issue_51
jlward Jul 8, 2013
1671fc9
bumped to version 0.3.5
Jul 8, 2013
bbb8b21
refs #52: Added a test showing what should be done with files that ar…
Jul 9, 2013
c5a2452
refs #52: Nothing can be a list now if there is no numbering.xml file.
Jul 9, 2013
07eeeb3
refs #52: update note
Jul 9, 2013
d8e527c
Merge pull request #52 from OpenScienceFramework/issue_52
jlward Jul 9, 2013
b8bd86a
bumped to version 0.3.6
Jul 9, 2013
ed86607
refs #53: Added a test showing that a val of none should not be trigg…
Jul 16, 2013
91a6bdb
refs #53: a val of none no longer triggers the inline handlers.
Jul 16, 2013
9313ae4
refs #53: update note
Jul 16, 2013
cfdb29b
Merge pull request #53 from OpenScienceFramework/issue_53
jlward Jul 16, 2013
b52d649
bumped to version 0.3.7
Jul 16, 2013
77c55d5
refs #54: Added a test showing expected behaviour
Jul 19, 2013
ce615a0
refs #54: if BadZipFile is raised then raise a pydocx exception instead
Jul 19, 2013
49af864
refs #54: update note
Jul 19, 2013
c67e612
refs #54: used `raises` instead
Jul 19, 2013
bd4484c
refs #54: updated the readme
Jul 19, 2013
bae0a98
Merge pull request #54 from OpenScienceFramework/issue_54
jlward Jul 19, 2013
770c930
bumped to version 0.3.8
Jul 19, 2013
8032bdc
refs #56: Added a test with expected output
Jul 30, 2013
59e9631
refs #56: We are now taking into account the rPr on the style when ap…
Aug 1, 2013
7c0bdb8
refs #56: Removed extra white space from XML in DOCX file.
Aug 1, 2013
1a38d5f
refs #56: Added an XML based test case showing how overrides should b…
Aug 1, 2013
7dee41b
refs #56: Added comments and got overrides working correctly
Aug 1, 2013
978db43
refs #56: update note
Aug 1, 2013
2bd1726
refs #55: Added a test document with a test that contains a tab.
Aug 1, 2013
02d7c59
refs #55: Handling the tab element now
Aug 1, 2013
1478f9e
refs #55: update note
Aug 1, 2013
a0b35b3
refs #56: Big refactor
Aug 1, 2013
07a6469
refs #56: moved stripping of the rpr from headers outside the main lo…
Aug 1, 2013
e8bd553
refs #56: stupid comment, why you in wrong place?
Aug 1, 2013
633dd4b
Merge pull request #56 from OpenScienceFramework/issue_56
jlward Aug 1, 2013
b533d69
Merge branch 'master' into issue_55
Aug 1, 2013
a127c12
Merge pull request #55 from OpenScienceFramework/issue_55
jlward Aug 2, 2013
b44b9ec
Bumped to version 0.3.9
Aug 2, 2013
c06dc57
refs #59: Added another performance test, (that is causing a maxrecur…
Aug 29, 2013
04c7e3f
refs #59: Finally found the performance bottle neck with tables.
Aug 29, 2013
a8d17f5
refs #59: Update note
Aug 29, 2013
2014506
refs #59: Updated the update note
Aug 29, 2013
589e22e
Merge pull request #59 from OpenScienceFramework/issue_59
jlward Aug 29, 2013
2380768
Bumped pydocx to v0.3.10
Aug 29, 2013
ced0096
I hate the main file
Aug 29, 2013
99f8c90
Now I will never be able to accidentally commit that file again.
Aug 29, 2013
d43ad0e
refs #61: Added a docx based test showing the noBreakHyphen Issue.
Sep 16, 2013
3eef803
refs #61: We are now importing the non breaking hyphen
Sep 16, 2013
ceaf7bb
refs #61: Update note
Sep 16, 2013
270efff
Merge pull request #61 from OpenScienceFramework/issue_61
jlward Sep 17, 2013
e00173e
Bumped to version 0.3.11
Sep 17, 2013
714b948
refs #62: Added a simple entry point for docx2html
Sep 17, 2013
1eb7a81
refs #62: Added a simple entry point for the markdown converter.
Sep 17, 2013
186f8f7
refs 362: Added a readme note
Sep 17, 2013
2cb9eb8
refs #62: Added the command line entry points.
Sep 17, 2013
04bb89a
refs #62: Added an update note.
Sep 17, 2013
375c03a
Added output encoding to utf8.
zobo Sep 18, 2013
d371ffe
Merge pull request #63 from zobo/issue_62
jlward Sep 18, 2013
877a3b4
refs #62: Small refactor, only one entry point now.
Sep 18, 2013
6f2f1ce
Merge branch 'issue_62' of github.com:OpenScienceFramework/pydocx int…
Sep 18, 2013
c672b9f
refs #62: Update note
Sep 18, 2013
d20a7dd
refs #62: Better update note
Sep 18, 2013
1d670be
Merge pull request #62 from OpenScienceFramework/issue_62
jlward Sep 19, 2013
321c98c
Bumped to version 0.3.12
Sep 19, 2013
56a249d
refs #64: Added memoization around certain expensive operations.
Nov 4, 2013
e3cc3ec
refs #64: Update note
Nov 4, 2013
154fded
refs #64: Something odd with el_iter in 2.7
Nov 4, 2013
ff1ba65
refs #64: fixed typos, added memoization, set() are significantly fas…
Nov 5, 2013
928bcd1
refs #64: Update note
Nov 5, 2013
014377a
refs #64: Simple name change.
Nov 5, 2013
094df92
Merge pull request #64 from OpenScienceFramework/issue_64
jlward Nov 5, 2013
ce0d830
Bumped to version 0.3.13
Nov 5, 2013
1a613cd
Update setup.py
Jan 16, 2014
f340949
Update README.rst
Jan 16, 2014
81635ea
Fixed tabs, line breaks not rendering in html
Jan 24, 2014
ac24045
line breaks rendering for both versions of MSWord
kah-odonnell Jan 24, 2014
607437d
style fixes
kah-odonnell Jan 24, 2014
5 changes: 5 additions & 0 deletions .gitignore
@@ -37,3 +37,8 @@ pip-log.txt
nosetests.xml
*.mo
.idea

test.html
testxml.html

main.py
6 changes: 5 additions & 1 deletion .travis.yml
@@ -2,9 +2,13 @@ language: python
python:
- "2.6"
- "2.7"
-script: python main.py
+script: ./run_tests.sh
install:
- python setup.py -q install
- pip install -r requirements.txt
env:
- TRAVIS_EXECUTE_PERFORMANCE=1
notifications:
email:
- jason.louard.ward@gmail.com
- samson91787@gmail.com
2 changes: 2 additions & 0 deletions AUTHORS
@@ -0,0 +1,2 @@
Sam Portnow <samson91787@gmail.com>
Jason Ward <jason.louard.ward@gmail.com>
74 changes: 74 additions & 0 deletions CHANGELOG
@@ -0,0 +1,74 @@

Changelog
=========
* 0.3.13
* Significant performance gains for documents with a large number of table
cells.
* Significant performance gains for large documents.
* 0.3.12
* Added command line support to convert from docx to either html or
markdown.
* 0.3.11
* The non breaking hyphen tag was not correctly being imported. This issue
has been fixed.
* 0.3.10
* Found and optimized a fairly large performance issue with tables that had
large amounts of content within a single cell, which includes nested
tables.
* 0.3.9
* We are now respecting the `<w:tab/>` element. We are putting a space in
everywhere they happen.
* Each styling can have a default defined based on values in `styles.xml`.
These default styles can be overwritten using the `rPr` on the actual `r`
tag. These default styles defined in `styles.xml` are actually being
respected now.
* 0.3.8
* If zipfile fails to open the passed in file, we are now raising a
`MalformedDocxException` instead of a `BadZipFile`.
* 0.3.7
* Some inline tags (most notably the underline tag) could have a `val` of
`none` and that would signify that the style is disabled. A `val` of
`none` is now correctly handled.
* 0.3.6
* It is possible for a docx file to not contain a `numbering.xml` file but
still try to use lists. Now if this happens all lists get converted to
paragraphs.
* 0.3.5
* Not all docx files contain a `styles.xml` file. We are no longer assuming
they do.
* 0.3.4
* It is possible for `w:t` tags to have `text` set to `None`. This no
longer causes an error when escaping that text.
* 0.3.3
* In the event that `cElementTree` has a problem parsing the document, a
`MalformedDocxException` is raised instead of a `SyntaxError`
* 0.3.2
* We were not taking into account that vertical merges should have a
continue attribute, but sometimes they do not, and in those cases word
assumes the continue attribute. We updated the parser to handle the
cases in which the continue attribute is not there.
* We now correctly handle documents with unicode character in the
namespace.
* In rare cases, some text would be output with a style when it should not
have been. This issue has been fixed.
* 0.3.1
* Added support for several more OOXML tags including:
* caps
* smallCaps
* strike
* dstrike
* vanish
* webHidden
More details in the README.
* 0.3.0
* We switched from using stock *xml.etree.ElementTree* to using
*xml.etree.cElementTree*. This has resulted in a fairly significant speed
increase for python 2.6
* It is now possible to create your own pre processor to do additional pre
processing.
* Superscripts and subscripts are now extracted correctly.
* 0.2.1
* Added a changelog
* Added the version in pydocx.__init__
* Fixed an issue with duplicating content if there was indentation or
justification on a p element that had multiple t tags.
7 changes: 7 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,7 @@
include AUTHORS
include CHANGELOG
include LICENSE
include MANIFEST.in
include README.rst
include pydocx/fixtures/*
include pydocx/tests/templates/*
2 changes: 0 additions & 2 deletions README.md

This file was deleted.

238 changes: 238 additions & 0 deletions README.rst
@@ -0,0 +1,238 @@
======
pydocx
======
.. image:: https://travis-ci.org/CenterForOpenScience/pydocx.png?branch=master
   :align: left
   :target: https://travis-ci.org/CenterForOpenScience/pydocx

pydocx is a parser that breaks down the elements of a docx file and converts them
into different markup languages. Right now, HTML is supported. Markdown and LaTeX
will be available soon. You can extend any of the available parsers to customize it
to your needs. You can also create your own class that inherits from DocxParser
and implements the methods for a markup language that is not yet supported.

Currently Supported
###################

* tables
    * nested tables
    * rowspans
    * colspans
    * lists in tables
* lists
    * list styles
    * nested lists
    * list of tables
    * list of paragraphs
* justification
* images
* styles
    * bold
    * italics
    * underline
    * hyperlinks
* headings

Usage
#####

DocxParser includes abstract methods that each parser overrides to satisfy its own needs. The abstract methods are as follows:

::

    class DocxParser:

        @property
        def parsed(self):
            return self._parsed

        @abstractmethod
        def escape(self, text):
            return text

        @abstractmethod
        def linebreak(self):
            return ''

        @abstractmethod
        def paragraph(self, text):
            return text

        @abstractmethod
        def heading(self, text, heading_level):
            return text

        @abstractmethod
        def insertion(self, text, author, date):
            return text

        @abstractmethod
        def hyperlink(self, text, href):
            return text

        @abstractmethod
        def image_handler(self, path):
            return path

        @abstractmethod
        def image(self, path, x, y):
            return self.image_handler(path)

        @abstractmethod
        def deletion(self, text, author, date):
            return text

        @abstractmethod
        def bold(self, text):
            return text

        @abstractmethod
        def italics(self, text):
            return text

        @abstractmethod
        def underline(self, text):
            return text

        @abstractmethod
        def superscript(self, text):
            return text

        @abstractmethod
        def subscript(self, text):
            return text

        @abstractmethod
        def tab(self):
            return True

        @abstractmethod
        def ordered_list(self, text):
            return text

        @abstractmethod
        def unordered_list(self, text):
            return text

        @abstractmethod
        def list_element(self, text):
            return text

        @abstractmethod
        def table(self, text):
            return text

        @abstractmethod
        def table_row(self, text):
            return text

        @abstractmethod
        def table_cell(self, text):
            return text

        @abstractmethod
        def page_break(self):
            return True

        @abstractmethod
        def indent(self, text, left='', right='', firstLine=''):
            return text

Docx2Html inherits from DocxParser and implements basic HTML handling. For example:

::

    class Docx2Html(DocxParser):

        # Escape '&', '<', and '>' so we render the HTML correctly
        def escape(self, text):
            return xml.sax.saxutils.quoteattr(text)[1:-1]

        # return a line break
        def linebreak(self, pre=None):
            return '<br />'

        # add paragraph tags
        def paragraph(self, text, pre=None):
            return '<p>' + text + '</p>'


However, let's say you want to add a specific style to your HTML document. To do this, you want to give each paragraph the class `my_implementation`. Simply extend Docx2Html and add what you need.

::

    class My_Implementation_of_Docx2Html(Docx2Html):

        def paragraph(self, text, pre=None):
            return '<p class="my_implementation">' + text + '</p>'



Or, let's say FOO is your new favorite markup language. Simply write your own parser, overriding the abstract methods of DocxParser.

::

    class Docx2Foo(DocxParser):

        # because line breaks are denoted by '!!!!!!!!!!!!' in the FOO markup language :)
        def linebreak(self):
            return '!!!!!!!!!!!!'
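
The subclass-and-override pattern described above can be sketched without pydocx installed. The block below is only an illustration: `SketchParser`, `SketchHtml`, `SketchHtmlWithClass`, and the `render` method are hypothetical stand-ins written for this example (in Python 3 syntax), not part of the pydocx API.

```python
from abc import ABC, abstractmethod
from xml.sax.saxutils import escape as xml_escape


class SketchParser(ABC):
    """Hypothetical stand-in for DocxParser; not the pydocx class itself."""

    def render(self, paragraphs):
        # Escape each paragraph, wrap it, then join with the line break markup.
        rendered = [self.paragraph(self.escape(text)) for text in paragraphs]
        return self.linebreak().join(rendered)

    @abstractmethod
    def escape(self, text):
        return text

    @abstractmethod
    def linebreak(self):
        return ''

    @abstractmethod
    def paragraph(self, text):
        return text


class SketchHtml(SketchParser):
    """Plays the role that Docx2Html plays in the text above."""

    def escape(self, text):
        # Escapes '&', '<', and '>'.
        return xml_escape(text)

    def linebreak(self):
        return '<br />'

    def paragraph(self, text):
        return '<p>' + text + '</p>'


class SketchHtmlWithClass(SketchHtml):
    """Overrides a single method; everything else is inherited."""

    def paragraph(self, text):
        return '<p class="my_implementation">' + text + '</p>'
```

Only `paragraph` is overridden in the last class, so escaping and line breaks still come from `SketchHtml`. The abstract base cannot be instantiated directly, which is what forces every concrete parser to supply its own markup.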

Custom Pre-Processor
####################

When creating your own parser (as described above) you can now add your own custom pre-processor. To do so, set the `pre_processor_class` field on the custom parser, like so:

::

    class Docx2Foo(DocxParser):
        pre_processor_class = FooPreProcessor


The `FooPreProcessor` will need a few things to get you going:

::

    class FooPreProcessor(PydocxPreProcessor):
        def perform_pre_processing(self, root, *args, **kwargs):
            super(FooPreProcessor, self).perform_pre_processing(root, *args, **kwargs)
            self._set_foo(root)

        def _set_foo(self, root):
            pass

If you want `_set_foo` to be called, you must call it from `perform_pre_processing`, which the base pydocx parser invokes.

Everything done during pre-processing is executed prior to `parse` being called for the first time.
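
To make the ordering concrete, here is a minimal, self-contained sketch of the hook. `SketchPreProcessor` and `SketchParser` are hypothetical stand-ins for `PydocxPreProcessor` and the base parser, and the constructor is assumed to run the pre-processor once before any parsing; the real pydocx internals may differ.

```python
import xml.etree.ElementTree as ET


class SketchPreProcessor(object):
    """Hypothetical stand-in for PydocxPreProcessor."""

    def perform_pre_processing(self, root, *args, **kwargs):
        # The base implementation only remembers which tree it processed.
        self.seen_root = root


class FooPreProcessor(SketchPreProcessor):
    def perform_pre_processing(self, root, *args, **kwargs):
        super(FooPreProcessor, self).perform_pre_processing(root, *args, **kwargs)
        self._set_foo(root)

    def _set_foo(self, root):
        # Illustrative work only: count the <p> children of the root.
        self.paragraph_count = len(root.findall('p'))


class SketchParser(object):
    """Hypothetical stand-in for the base parser."""

    pre_processor_class = SketchPreProcessor

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)
        # Assumption: pre-processing runs once, before parse() is ever called.
        self.pre_processor = self.pre_processor_class()
        self.pre_processor.perform_pre_processing(self.root)

    def parse(self):
        return [p.text or '' for p in self.root.findall('p')]


class Docx2Foo(SketchParser):
    pre_processor_class = FooPreProcessor
```

Because the parser looks the class up through `pre_processor_class`, swapping in a different pre-processor needs no change to the parsing code itself.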


Styles
######

The base parser `Docx2Html` relies on certain CSS classes being set for certain behaviour to occur. Currently these include:

* class `pydocx-insert` -> Turns the text green.
* class `pydocx-delete` -> Turns the text red and draws a line through the text.
* class `pydocx-center` -> Aligns the text to the center.
* class `pydocx-right` -> Aligns the text to the right.
* class `pydocx-left` -> Aligns the text to the left.
* class `pydocx-comment` -> Turns the text blue.
* class `pydocx-underline` -> Underlines the text.
* class `pydocx-caps` -> Makes all text uppercase.
* class `pydocx-small-caps` -> Makes all text uppercase, however truly lowercase letters will be smaller than their uppercase counterparts.
* class `pydocx-strike` -> Strikes a line through the text.
* class `pydocx-hidden` -> Hides the text.
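
One way to supply these classes is to wrap the converted HTML in a page that carries a stylesheet. The sketch below is an assumption-laden example: the class names come from the list above, but the CSS declarations and the `wrap_with_styles` helper are hypothetical and are not shipped with pydocx.

```python
# Hypothetical stylesheet. The class names are taken from the list above;
# each declaration is an assumption chosen to match the described behaviour.
PYDOCX_CSS = """
.pydocx-insert { color: green; }
.pydocx-delete { color: red; text-decoration: line-through; }
.pydocx-center { text-align: center; }
.pydocx-right { text-align: right; }
.pydocx-left { text-align: left; }
.pydocx-comment { color: blue; }
.pydocx-underline { text-decoration: underline; }
.pydocx-caps { text-transform: uppercase; }
.pydocx-small-caps { font-variant: small-caps; }
.pydocx-strike { text-decoration: line-through; }
.pydocx-hidden { display: none; }
"""


def wrap_with_styles(body_html):
    """Return a full HTML page that embeds the stylesheet above."""
    return (
        '<html><head><style>' + PYDOCX_CSS + '</style></head>'
        '<body>' + body_html + '</body></html>'
    )
```

Passing the parser's HTML output through `wrap_with_styles` would give a page in which, for example, a `pydocx-center` paragraph is centered.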

Exceptions
##########

Right now there is only one custom exception (`MalformedDocxException`). It is raised if either the `xml` or `zipfile` libraries raise an exception.
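
The wrapping described here can be sketched as follows. This is not pydocx's code: `MalformedDocxException` is redefined locally so the example runs on its own, `read_document_xml` is a hypothetical helper, and the example uses the Python 3 exception names (`zipfile.BadZipFile`, `xml.etree.ElementTree.ParseError`).

```python
import io
import zipfile
import xml.etree.ElementTree as ET


class MalformedDocxException(Exception):
    """Local stand-in so the sketch runs without pydocx installed."""


def read_document_xml(file_obj):
    """Return the parsed word/document.xml from a docx-like zip archive."""
    try:
        with zipfile.ZipFile(file_obj) as archive:
            return ET.fromstring(archive.read('word/document.xml'))
    except (zipfile.BadZipFile, ET.ParseError) as error:
        # Callers only need to handle one exception type for broken input.
        raise MalformedDocxException(str(error))
```

With this shape, calling code can catch a single exception type whether the archive is unreadable or the XML inside it is broken. A missing `word/document.xml` raises `KeyError`, which this sketch deliberately leaves uncaught.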

Optional Arguments
##################

You can pass in `convert_root_level_upper_roman=True` to the parser and it will convert all root level upper roman lists to headings instead.

Command Line Execution
######################

First, install pydocx by running `pip install pydocx`. From there you can simply call the command `pydocx --html path/to/file.docx path/to/output.html`. Change `pydocx --html` to `pydocx --markdown` in order to convert to markdown instead.
12 changes: 0 additions & 12 deletions main.py

This file was deleted.
