Read, highlight, and save a PDF programmatically

I'd like to write a small script (which will run on a headless Linux server) that reads a PDF file, highlights any text matching the strings in an array I pass in, and then saves the modified PDF. I imagine I'll end up using something like the Python bindings to poppler, but unfortunately there is next to no documentation and I have practically no Python experience.

If anyone could point me to a tutorial, an example, or some helpful documentation to get me started, it would be much appreciated!

Have you tried looking at PDFMiner? It sounds like it does what you want.

PDFlib has Python bindings and supports these operations. To open an existing PDF you will want PDI (http://www.pdflib.com/products/pdflib-family/pdflib-pdi/), along with TET to extract the text.

Unfortunately, it's a commercial product. I've used this library in production in the past and it works very well. The bindings are very functional and not very Pythonic. I've seen attempts to make them more Pythonic: https://github.com/alexhayes/pythonic-pdflib. The call you will want is open_pdi_document().

It sounds like you want to do a search of some kind:

http://www.pdflib.com/tet-cookbook/tet-and-pdflib/highlight-search-terms/

Yes, this is possible with a combination of pdfminer (pip install pdfminer.six) and PyPDF2.

First, find the coordinates of the text you want to highlight (pdfminer can report them). Then add the highlight:

    #!/usr/bin/env python
    """Create sample highlight in a PDF file."""

    # Uses the PyPDF2 1.x/2.x names (PdfFileReader, PdfFileWriter, getPage, ...).
    from PyPDF2 import PdfFileWriter, PdfFileReader
    from PyPDF2.generic import (
        DictionaryObject,
        NumberObject,
        FloatObject,
        NameObject,
        TextStringObject,
        ArrayObject,
    )


    def create_highlight(x1, y1, x2, y2, meta, color=(0, 1, 0)):
        """
        Create a highlight for a PDF.

        Parameters
        ----------
        x1, y1 : float
            bottom left corner
        x2, y2 : float
            top right corner
        meta : dict
            keys are "author" and "contents"
        color : iterable
            Three elements, (r,g,b)
        """
        new_highlight = DictionaryObject()

        new_highlight.update({
            NameObject("/F"): NumberObject(4),  # annotation flag 4 = print
            NameObject("/Type"): NameObject("/Annot"),
            NameObject("/Subtype"): NameObject("/Highlight"),

            NameObject("/T"): TextStringObject(meta["author"]),
            NameObject("/Contents"): TextStringObject(meta["contents"]),

            NameObject("/C"): ArrayObject([FloatObject(c) for c in color]),
            NameObject("/Rect"): ArrayObject([
                FloatObject(x1), FloatObject(y1),
                FloatObject(x2), FloatObject(y2),
            ]),
            # Corners of the highlighted quadrilateral.
            NameObject("/QuadPoints"): ArrayObject([
                FloatObject(x1), FloatObject(y2),
                FloatObject(x2), FloatObject(y2),
                FloatObject(x1), FloatObject(y1),
                FloatObject(x2), FloatObject(y1),
            ]),
        })

        return new_highlight


    def add_highlight_to_page(highlight, page, output):
        """
        Add a highlight to a PDF page.

        Parameters
        ----------
        highlight : Highlight object
        page : PDF page object
        output : PdfFileWriter object
        """
        highlight_ref = output._addObject(highlight)

        if "/Annots" in page:
            page[NameObject("/Annots")].append(highlight_ref)
        else:
            page[NameObject("/Annots")] = ArrayObject([highlight_ref])


    def main():
        # Keep the input file open until the output has been written,
        # because page contents are read from it lazily.
        with open("samples/test3.pdf", "rb") as input_stream:
            pdf_input = PdfFileReader(input_stream)
            pdf_output = PdfFileWriter()

            page1 = pdf_input.getPage(0)

            highlight = create_highlight(89.9206, 573.1283, 376.849, 591.3563, {
                "author": "John Doe",
                "contents": "Lorem ipsum",
            })

            add_highlight_to_page(highlight, page1, pdf_output)

            pdf_output.addPage(page1)

            with open("output.pdf", "wb") as output_stream:
                pdf_output.write(output_stream)


    if __name__ == '__main__':
        main()
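
The coordinate-finding step is not shown above, so here is a minimal sketch of one way it might be done with pdfminer.six. The helper names `find_term_boxes` and `iter_line_glyphs` are made up for this illustration and are not part of either library. It assumes matching is case-sensitive, that a term never spans two lines, and that the page is unrotated so pdfminer's bottom-left coordinates line up with the PDF's own.

```python
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple

# (x1, y1, x2, y2) with the origin at the bottom-left of the page.
BBox = Tuple[float, float, float, float]
# One character plus its box; the box is None for layout-inserted spaces/newlines.
Glyph = Tuple[str, Optional[BBox]]


def find_term_boxes(glyphs: Sequence[Glyph],
                    terms: Iterable[str]) -> List[Tuple[str, BBox]]:
    """Return (term, bbox) for every occurrence of each term in one text line."""
    text = "".join(ch for ch, _ in glyphs)
    hits = []
    for term in terms:
        if not term:
            continue
        start = text.find(term)
        while start != -1:
            boxes = [box for _, box in glyphs[start:start + len(term)]
                     if box is not None]
            if boxes:
                # Union of the character boxes covered by the match.
                hits.append((term, (
                    min(b[0] for b in boxes), min(b[1] for b in boxes),
                    max(b[2] for b in boxes), max(b[3] for b in boxes),
                )))
            start = text.find(term, start + 1)
    return hits


def iter_line_glyphs(pdf_path: str) -> Iterator[Tuple[int, List[Glyph]]]:
    """Yield (zero-based page index, glyphs) for each text line in the PDF."""
    # Imported lazily so find_term_boxes can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTAnno, LTChar, LTTextContainer, LTTextLine

    for page_index, page in enumerate(extract_pages(pdf_path)):
        for element in page:
            # Only top-level text boxes are visited; text inside figures is skipped.
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                glyphs: List[Glyph] = []
                for obj in line:
                    if isinstance(obj, LTChar):
                        # Emit one entry per character so indices match the text.
                        glyphs.extend((c, obj.bbox) for c in obj.get_text())
                    elif isinstance(obj, LTAnno):
                        glyphs.extend((c, None) for c in obj.get_text())
                yield page_index, glyphs
```

Each box returned by `find_term_boxes` could then be passed as `x1, y1, x2, y2` to `create_highlight`, with the page index from `iter_line_glyphs` selecting the page to annotate via `getPage`.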