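Microsoft Office is technically impressive, but its products have a steep learning curve. Find and Replace is a good example: it does not support standard regular expressions (understandable for a WYSIWYG editor, which also has to take formatting into account), yet instead of building on regular expression syntax it invents its own wildcard syntax, which is frustrating. Documentation and demos for the built-in Word VBA are also scarce, so you mostly learn by recording macros, and the code a recorded macro produces is too ad hoc to generalize. This is where processing such documents with Python starts to make sense.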

Word VBA example

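Take the following macro as an example. To convert a decimal into a percentage, it first searches for the pattern 0.([0-9]{2})([0-9]@)^13. Here ^13 is the paragraph mark (carriage return), and @ means one or more occurrences of the preceding character or expression, similar to a non-greedy +. I am not entirely sure how @ and * differ in practice; in Word's wildcard syntax, * is not a quantifier but matches any string of characters on its own.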

Sub DoReplace()
    With ActiveDocument.Range.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Forward = True
        .Format = False
        .Wrap = wdFindContinue
        .MatchWildcards = True
        .Text = "0.([0-9]{2})([0-9]@)^13"
        .Replacement.Text = "\1.\2%^p" ' at this point the percent sign has been added
        .Execute Replace:=wdReplaceAll
        .Text = "0([0-9].[0-9]@%)" ' strip the leading 0
        .Replacement.Text = "\1"
        .Execute Replace:=wdReplaceAll
    End With
End Sub
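
For comparison, the same two-step replacement can be sketched with Python's re module. This is only an illustration under a few assumptions: to_percent is a hypothetical helper name, it works on a plain string rather than on a Word document, and "\n" stands in for Word's paragraph mark (^13).

```python
import re


def to_percent(text):
    """Rewrite lines such as '0.1234' as '12.34%' (hypothetical helper)."""
    # Step 1: move the decimal point two digits to the right and append '%'.
    # Like the macro, this needs at least three digits after the point.
    text = re.sub(r"0\.([0-9]{2})([0-9]+)\n", r"\1.\2%\n", text)
    # Step 2: strip the leading zero left behind by values below 10%.
    text = re.sub(r"0([0-9]\.[0-9]+%)", r"\1", text)
    return text


sample = "0.1234\n0.0567\n"
converted = to_percent(sample)
print(converted)
```

Unlike Word's wildcard syntax, a literal dot has to be escaped as \. here, and + takes the place of @.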

Word

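Working with Word documents from Python relies mainly on the python-docx library, which is imported as docx.
A Document object represents one document; docx.Document(path) opens an existing file.
A document is made up of many paragraphs, which can be enumerated as follows.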

for p in doc.paragraphs:
    print(p.text)

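Note that paragraphs is a property rather than a stored attribute: it only has a getter (it is read-only), and it builds the list of Paragraph objects each time it is accessed.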

# document.py
@property
def paragraphs(self):
    # _body is a _Body object, which inherits from BlockItemContainer
    return self._body.paragraphs
# blkcntnr.py
@property
def paragraphs(self):
    """
    A list containing the paragraphs in this container, in document
    order. Read-only.
    """
    return [Paragraph(p, self) for p in self._element.p_lst]
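
The practical consequence of that getter can be shown with a toy model. ToyParagraph and ToyContainer below are made-up stand-ins that only mirror the shape of the code quoted above; they are not python-docx classes.

```python
class ToyParagraph:
    """Made-up wrapper around one underlying element."""

    def __init__(self, element):
        self.element = element


class ToyContainer:
    """Made-up container exposing a read-only 'paragraphs' property."""

    def __init__(self, elements):
        self._elements = list(elements)

    @property
    def paragraphs(self):
        # A fresh list of wrappers is built on every access.
        return [ToyParagraph(e) for e in self._elements]


container = ToyContainer(["first", "second"])
first_access = container.paragraphs
second_access = container.paragraphs

# Appending to the returned list does not change the container itself.
first_access.append(ToyParagraph("third"))
count_after_append = len(container.paragraphs)

# A property without a setter rejects assignment.
try:
    container.paragraphs = []
    assignment_rejected = False
except AttributeError:
    assignment_rejected = True
```

Under this model, it is worth storing the list in a local variable when iterating repeatedly, and new paragraphs have to be added through the document's own methods rather than by mutating the returned list.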

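A paragraph is made up of multiple runs. Reading paragraph.text returns only the concatenated text of those runs, and assigning to it replaces every run with a single new one, so run-level formatting is lost.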

# paragraph.py
@property
def text(self):
    text = ''
    for run in self.runs:
        text += run.text
    return text

@text.setter
def text(self, text):
    self.clear()
    self.add_run(text)
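
One way to keep the formatting is to edit each run's text instead of paragraph.text. The sketch below assumes only that a paragraph exposes .runs and that each run has a writable .text, as in the code quoted above; replace_in_runs is a hypothetical helper, and the demo uses SimpleNamespace stand-ins instead of a real document.

```python
from types import SimpleNamespace


def replace_in_runs(paragraph, old, new):
    """Replace `old` with `new` inside each run; return how many runs changed."""
    changed = 0
    for run in paragraph.runs:
        if old in run.text:
            # Only the text is touched, so the run keeps its other attributes.
            run.text = run.text.replace(old, new)
            changed += 1
    return changed


# Stand-in paragraph with two runs; `bold` plays the role of run formatting.
demo = SimpleNamespace(runs=[
    SimpleNamespace(text="Hello ", bold=False),
    SimpleNamespace(text="world", bold=True),
])
changed = replace_in_runs(demo, "world", "Word")
print([(run.text, run.bold) for run in demo.runs])
```

A match that straddles two runs is not found by this approach, so it suits short replacements that sit inside a single run.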

To add a picture, document.add_picture first appends a new paragraph containing a run to the end of the document, and then adds the picture to that run.

def add_picture(self, image_path_or_stream, width=None, height=None):
    run = self.add_paragraph().add_run()
    return run.add_picture(image_path_or_stream, width, height)

Excel+Pandas

Excel files can be handled through Pandas; see the introduction to Pandas for details.

Reference

https://python-docx.readthedocs.io/en/latest/