How to Batch-Extract the Slide Containing a Keyword from PowerPoint Files with Python, and Merge the Results

April 25, 2024 · Source: OnlyGw

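Premise: I have a pile of PowerPoint files: 1.pptx, 2.pptx, 3.pptx, and so on. Each file contains several slides, and one of those slides contains a certain keyword, for example "月分析" (monthly analysis). (Every month I have to pull the monthly analysis slide out of each branch office's deck.)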

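Requirement: for each of these PPTX files, delete the unrelated slides and keep only the slide that contains the keyword; replace the keyword with a per-file label so the slides cannot be mixed up later; finally, merge all of the resulting files into one.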
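The selection rule in this requirement can be sketched on plain strings before touching any real file. This is only an illustration under assumptions: `find_keyword_pages` is a hypothetical helper, and `deck` is made-up sample data in which each slide is represented by a list of its text strings.

```python
def find_keyword_pages(pages, keyword):
    """Return the 0-based indices of pages whose text contains keyword.

    `pages` is a hypothetical stand-in for a presentation:
    one list of text strings per slide.
    """
    return [
        index
        for index, texts in enumerate(pages)
        if any(keyword in text for text in texts)
    ]


# Made-up sample deck with three slides.
deck = [
    ["Cover", "Branch 1"],
    ["月分析", "Sales went up"],
    ["Thanks"],
]

matches = find_keyword_pages(deck, "月分析")
print(matches)  # [1]
```

Returning a list rather than a single index makes the two awkward cases visible: an empty list means the file has no keyword slide, and more than one entry means a choice has to be made about which slide to keep.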

Cue the music, here comes the code (it uses the python-pptx library):

from pptx import Presentation  # provided by the python-pptx package
import os
import re

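# (old, new) text pairs used by replace_text. Placeholder values: fill in your own.
TEXT_NEED_REPLACE = [("keyword", "new keyword")]


def replace_text(text_frame):
    # Apply every (old, new) pair from TEXT_NEED_REPLACE to each run in the text frame.
    for paragraph in text_frame.paragraphs:
        for run in paragraph.runs:
            for old, new in TEXT_NEED_REPLACE:
                if old in run.text:
                    run.text = run.text.replace(old, new)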

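def process_ppt(filename_open, filename_save, Procices):
    # Keep only the slide that contains the keyword and save it as a new file.
    # Procices is appended to the replacement text (the caller passes the file's base name).
    # "keyword" and "new keyword" are placeholders: put in your real text, e.g. "月分析".
    prs = Presentation(filename_open)
    x = None  # index of the slide that contains the keyword
    for m, slide in enumerate(prs.slides):
        for shape in slide.shapes:
            if shape.has_text_frame:  # only shapes that can hold text
                for paragraph in shape.text_frame.paragraphs:
                    for run in paragraph.runs:
                        # Matching is per run, so a keyword split across runs is missed.
                        if re.search("keyword", run.text):
                            run.text = run.text.replace("keyword", "new keyword" + Procices)
                            x = m  # if several slides match, the last one wins
    if x is None:
        print("keyword not found, skipped:", filename_open)
        return
    # Remove every slide except slide x (back to front so the indices stay valid).
    for index in reversed(range(len(prs.slides))):
        if index != x:
            del_slide(prs, index)
    prs.save(filename_save)


def del_slide(prs, index):
    # Remove the slide at index. _sldIdLst is a private python-pptx attribute.
    slides = list(prs.slides._sldIdLst)
    prs.part.drop_rel(slides[index].rId)  # drop the slide part too, so it is not saved
    prs.slides._sldIdLst.remove(slides[index])
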

# Walk a folder and its subfolders and collect the files in a list.
# Input: a folder path and an empty list [].
# Returns Filelist, which holds the full path of every file found.
def get_filelist(dir, Filelist):
    if os.path.isfile(dir):
        Filelist.append(dir)
        # To collect only the file names instead, use:
        # Filelist.append(os.path.basename(dir))
    elif os.path.isdir(dir):
        for s in os.listdir(dir):
            # To skip certain folders, use:
            # if s == "xxx":
            #     continue
            newDir = os.path.join(dir, s)
            get_filelist(newDir, Filelist)
    return Filelist

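list1 = get_filelist('path/to/source_folder', [])  # folder with the original files
print(len(list1))
for e in list1:
    dir, file = os.path.split(e)
    name, ext = os.path.splitext(file)
    if ext.lower() != '.pptx' or file.startswith('~$'):
        continue  # skip non-PPTX files and PowerPoint lock files
    process_ppt(e, os.path.join('path/to/output_folder', file), name)  # where results are saved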
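One caveat about the script: it searches each run separately, and PowerPoint sometimes stores a single visible phrase as several runs, so a keyword can be missed. A common workaround is to match on the joined paragraph text. The sketch below shows the idea on plain strings; `replace_across_runs` is a hypothetical helper, and `run_texts` stands in for `[run.text for run in paragraph.runs]`.

```python
def replace_across_runs(run_texts, old, new):
    """Replace `old` with `new` even when `old` spans several runs.

    `run_texts` is a list of strings standing in for
    [run.text for run in paragraph.runs]. Returns a new list of run
    texts of the same length; the input is returned as a copy when
    `old` does not occur in the joined text.
    """
    joined = "".join(run_texts)
    if not run_texts or old not in joined:
        return list(run_texts)
    # Put the whole replaced text into the first run and empty the others,
    # so the paragraph ends up with the first run's formatting.
    return [joined.replace(old, new)] + [""] * (len(run_texts) - 1)


# The keyword "月分析" is split across two runs here.
print(replace_across_runs(["月", "分析 2024"], "月分析", "月分析-1"))
```

With python-pptx, the returned strings would be written back to the corresponding `run.text` values. The trade-off is that any formatting specific to the later runs is lost, which may or may not matter for a title line.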

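The output folder now holds many .pptx files, each containing only the keyword slide. They can be merged with a plugin; see this answer by a Zhihu expert: https://www.zhihu.com/question/68117952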
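Before merging, it can help to list the generated files in the order they should appear in the final deck. The sketch below is not part of the original script: `natural_key` and `list_pptx_in_order` are hypothetical names, and it assumes the output files keep numeric names such as 1.pptx, 2.pptx, 10.pptx.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders '2.pptx' before '10.pptx'."""
    # re.split with a capturing group alternates text and digit chunks:
    # even positions are text, odd positions are digit runs.
    parts = re.split(r"(\d+)", path.stem)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def list_pptx_in_order(folder):
    """Return the .pptx files directly inside folder, in natural order."""
    return sorted(Path(folder).glob("*.pptx"), key=natural_key)


# Usage (the folder name is a placeholder):
# for path in list_pptx_in_order("path/to/output_folder"):
#     print(path.name)
```

A plain alphabetical sort would place 10.pptx before 2.pptx, which is easy to overlook when the merged deck is supposed to follow branch numbering.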

That's it. It is a rather odd requirement, but the script saves a lot of time.

I referred to the following article while writing this; thanks to its author:

https://blog.csdn.net/fei347795790/article/details/106996817/

    Original author: OnlyGw
    Original URL: https://blog.csdn.net/weixin_42426690/article/details/107917331