this post was submitted on 05 Jan 2024
19 points (100.0% liked)
Python
5396 readers
1 users here now
Welcome to the Python community on the programming.dev Lemmy instance!
📅 Events
October 2023
- PyConES Canarias 2023, 6-8th
- DjangoCon US 2023, 16-20th (!django 💬)
November 2023
- PyCon Ireland 2023, 11-12th
- PyData Tel Aviv 2023 14th
Past
July 2023
- PyDelhi Meetup, 2nd
- PyCon Israel, 4-5th
- DFW Pythoneers, 6th
- Django Girls Abraka, 6-7th
- SciPy 2023 10-16th, Austin
- IndyPy, 11th
- Leipzig Python User Group, 11th
- Austin Python, 12th
- EuroPython 2023, 17-23rd
- Austin Python: Evening of Coding, 18th
- PyHEP.dev 2023 - "Python in HEP" Developer's Workshop, 25th
August 2023
- PyLadies Dublin, 15th
- EuroSciPy 2023, 14-18th
September 2023
- PyData Amsterdam, 14-16th
- PyCon UK, 22nd - 25th
🐍 Python project:
- Python
- Documentation
- News & Blog
- Python Planet blog aggregator
💓 Python Community:
- #python IRC for general questions
- #python-dev IRC for CPython developers
- PySlackers Slack channel
- Python Discord server
- Python Weekly newsletters
- Mailing lists
- Forum
✨ Python Ecosystem:
🌌 Fediverse
Communities
- #python on Mastodon
- c/django on programming.dev
- c/pythorhead on lemmy.dbzer0.com
Projects
- Pythörhead: a Python library for interacting with Lemmy
- Plemmy: a Python package for accessing the Lemmy API
- pylemmy pylemmy enables simple access to Lemmy's API with Python
- mastodon.py, a Python wrapper for the Mastodon API
Feeds
founded 1 year ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
PyMuPDF is excellent for extracting 'structured' text from a pdf page — though I believe 'pulling out relevant information' will still be a manual task, UNLESS the text you're working with allows parsing into meaningful units.
That's because 'textual' content in a pdf is nothing other than a bunch of instructions to draw glyphs inside a rect that represents a page; utilities that come with mupdf or poppler arrange those glyphs (not always perfectly) into 'blocks', 'lines', and 'words' based solely on whitespace separation; the programmer who uses those utilities in an end-user facing application then has to figure out how to create the illusion (so to speak) that the user is selecting/copying/searching for paragraphs, sentences, and so on, in proper reading order.
PyMuPDF comes with a rich collection of convenience functions to make all that less painful; like dehyphenation, eliminating superfluous whitespace, etc. but still, need some further processing to pick out humanly relevant info.
Built-in regex capabilities of Python can suffice for that parsing; but if not, you might want to look into NLTK tools, which apply sophisticated methods to tokenize words & sentences.
EDIT: I really should've mentioned some proper full text search tools. Once you have a good plaintext representation of a pdf page, you might want to feed that representation into tools like the following to index them properly for relevant info:
https://lunr.readthedocs.io/en/latest/ -- this is easy to use, & set up, esp. in a python project.
... it's based on principles that are put to use in this full-scale, 'industrial strength' full text search engine: https://solr.apache.org/ -- it's a bit of a pain to set up; but python can interface with it through any http client. Once you set up some kind of mapping between search tokens/keywords/tags, the plaintext page, & the actual pdf, you can get from a phrase search, for example, to a bunch of vector graphics (i.e. the pdf) relatively painlessly.