PyMuPDF4LLM#

PyMuPDF4LLM is based on PyMuPDF - the fastest PDF extraction tool for Python.

This documentation explains how to use the PyMuPDF4LLM package and links to related RAG and LLM resources for PyMuPDF.

Features#

  • Support for multi-column pages

  • Support for image and vector graphics extraction (with references to them included in the Markdown text)

  • Support for page chunking output

  • Direct support for output as LlamaIndex Documents.
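As a rough illustration of the page chunking output, the sketch below flattens per-page results into records suitable for indexing. It assumes that `to_markdown` accepts `page_chunks=True` and then returns one dictionary per page with the Markdown stored under a `"text"` key. The helper `chunks_to_records`, the record layout, and the sample data are hypothetical and only serve the example.

```python
def chunks_to_records(chunks, source):
    """Turn per-page chunk dictionaries into flat records for indexing."""
    records = []
    for index, chunk in enumerate(chunks):
        # Assumption: each chunk carries its Markdown under the "text" key.
        text = chunk.get("text", "").strip()
        if not text:
            continue  # skip pages that produced no Markdown
        records.append({"id": f"{source}#{index}", "text": text})
    return records


# With the package installed, the chunks would come from a call such as:
#   chunks = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)
# Here a hand-written sample in the assumed shape stands in for that result.
sample_chunks = [
    {"text": "# Title\n\nFirst page."},
    {"text": "   "},
    {"text": "Second page."},
]
records = chunks_to_records(sample_chunks, "input.pdf")
```

Keeping the position of each chunk in the record id makes it possible to trace a retrieved passage back to its place in the source file.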

Document support#

PyMuPDF4LLM supports the following file types for text extraction:

PDF, DOCX, XLSX, PPTX, HWPX, XPS, EPUB, MOBI, FB2, CBZ

  • This package converts the pages of a file to text in Markdown format using PyMuPDF.

  • Standard text and tables are detected, arranged in the correct reading order, and then converted together to GitHub-compatible Markdown text.

  • Header lines are identified by their font size and prefixed with the appropriate number of # characters.

  • Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly. The same applies to ordered and unordered lists.

  • By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of 0-based page numbers.
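The idea of identifying headers by font size can be sketched in a few lines. This is an illustration only, not the package's actual implementation: it assumes the most frequent font size is body text and that each larger size maps to one header level. The names `build_header_map` and `prefix_line` are hypothetical.

```python
from collections import Counter


def build_header_map(span_sizes, max_levels=6):
    """Map font sizes larger than the body size to Markdown header prefixes."""
    counts = Counter(round(size) for size in span_sizes)
    if not counts:
        return {}
    # Assumption: the most frequent font size belongs to body text.
    body_size = counts.most_common(1)[0][0]
    # Larger sizes become headers, with the largest size as level 1.
    header_sizes = sorted((s for s in counts if s > body_size), reverse=True)
    return {
        size: "#" * level + " "
        for level, size in enumerate(header_sizes[:max_levels], start=1)
    }


def prefix_line(text, size, header_map):
    """Prepend the header prefix for this font size, if there is one."""
    return header_map.get(round(size), "") + text


# Font sizes of the text spans in an imaginary document.
sizes = [24, 18, 18, 11, 11, 11, 11, 11, 9]
header_map = build_header_map(sizes)

print(prefix_line("Introduction", 24, header_map))
print(prefix_line("Background", 18, header_map))
print(prefix_line("Plain paragraph text.", 11, header_map))
```

Sizes at or below the body size, such as footnote text, receive no prefix in this sketch.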

Installation#

Install the package via pip with:

pip install pymupdf4llm

Using in LLM / RAG Applications#

PyMuPDF4LLM aims to make it easier to extract PDF content in the format you need for LLM and RAG environments. It supports Markdown extraction as well as LlamaIndex document output.

Extracting a file as Markdown#

To retrieve your document content as Markdown, install the package and then call to_markdown in your Python script:

import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")

Note

Instead of a filename string as above, you can also pass a PyMuPDF Document. The pages parameter accepts a list of 0-based page numbers; for example, pages=[0, 1] selects only the first and second pages of the document.
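The two options in this note can be combined, as in the minimal sketch below. It assumes a recent PyMuPDF that can be imported as `pymupdf`; the helper names `first_pages` and `convert_first_pages` are hypothetical and not part of either package.

```python
def first_pages(page_count, n):
    """Return the 0-based numbers of the first n pages (fewer if the document is shorter)."""
    return list(range(min(n, page_count)))


def convert_first_pages(path, n=2):
    """Convert only the first n pages of the file at path to Markdown."""
    # Imported here so that first_pages stays usable without the packages installed.
    import pymupdf
    import pymupdf4llm

    # Open the file as a PyMuPDF Document and pass it instead of the filename.
    with pymupdf.open(path) as doc:
        pages = first_pages(doc.page_count, n)
        return pymupdf4llm.to_markdown(doc, pages=pages)
```

A call such as `convert_first_pages("input.pdf")` would then return the Markdown for the first two pages only.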

To store the Markdown output, for example as a UTF-8 encoded file, do:

import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())

Extracting a file as a LlamaIndex document#

PyMuPDF4LLM supports direct conversion to a LlamaIndex document. A document is first converted into Markdown format and then returned as LlamaIndex documents as follows:

import pymupdf4llm
llama_reader = pymupdf4llm.LlamaMarkdownReader()
llama_docs = llama_reader.load_data("input.pdf")

API#

See API.

Change Log#

See Change Log.

Further Resources#

Sample code#

Blogs#



This software is provided AS-IS with no warranty, either express or implied. This software is distributed under license and may not be copied, modified or distributed except as expressly authorized under the terms of that license. Refer to licensing information at artifex.com or contact Artifex Software Inc., 39 Mesa Street, Suite 108A, San Francisco CA 94129, United States for further information.