PyMuPDF4LLM is based on PyMuPDF - the fastest PDF extraction tool for Python.

This documentation explains how to use the PyMuPDF4LLM package and links to related RAG and LLM resources for PyMuPDF.


  • Support for multi-column pages

  • Support for image and vector graphics extraction (and inclusion of references in the MD text)

  • Support for page chunking output.

  • Direct support for output as LlamaIndex Documents.
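As a rough illustration of the page chunking feature listed above, the sketch below assumes that `to_markdown` accepts a `page_chunks=True` keyword and then returns one dictionary per page with a `"text"` entry. The helper names `load_page_chunks` and `chunk_texts` and the sample data are hypothetical and only serve this example.

```python
def load_page_chunks(path):
    """Return one dictionary per page instead of a single Markdown string.

    Assumption: to_markdown() supports the page_chunks=True keyword.
    """
    import pymupdf4llm  # imported lazily so the rest of the sketch runs without the package

    return pymupdf4llm.to_markdown(path, page_chunks=True)


def chunk_texts(chunks):
    """Collect the non-blank Markdown text of each page chunk.

    Assumption: every chunk is a dict that may carry a "text" key.
    """
    return [chunk["text"] for chunk in chunks if chunk.get("text", "").strip()]


# Stand-in data shaped like the assumed output, so the example runs without a PDF.
sample_chunks = [
    {"text": "# Page one\n"},
    {"text": "   "},  # a page without usable text is skipped
    {"text": "Page three body\n"},
]
texts = chunk_texts(sample_chunks)
```

With a real file, `chunk_texts(load_page_chunks("input.pdf"))` would yield one Markdown string per non-empty page, which is often a convenient unit for embedding or indexing.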

Document support

PyMuPDF4LLM supports the following file types for text extraction:






  • This package converts the pages of a file to text in Markdown format using PyMuPDF.

  • Standard text and tables are detected, arranged in the correct reading order, and then converted together into GitHub-compatible Markdown text.

  • Header lines are identified via the font size and appropriately prefixed with one or more # tags.

  • Bold, italic, and mono-spaced text as well as code blocks are detected and formatted accordingly. The same applies to ordered and unordered lists.

  • By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of 0-based page numbers.
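To make the font-size rule for header lines more concrete, here is a minimal sketch of one way such a mapping could work. It is not the package's actual implementation: the heuristic (the most frequent font size is body text, larger sizes become headers) and the names `header_levels` and `to_markdown_line` are assumptions made for illustration.

```python
from collections import Counter


def header_levels(font_sizes, max_levels=6):
    """Map font sizes above the body size to Markdown header prefixes.

    Hypothetical heuristic: the most frequent size is treated as body text,
    and each larger size gets one more '#' the smaller it is.
    """
    body_size = Counter(font_sizes).most_common(1)[0][0]
    larger = sorted({size for size in font_sizes if size > body_size}, reverse=True)
    return {size: "#" * (level + 1) + " " for level, size in enumerate(larger[:max_levels])}


def to_markdown_line(text, size, levels):
    """Prefix a line with its header tag, or leave it unchanged for body text."""
    return levels.get(size, "") + text


# (text, font size) pairs standing in for lines extracted from a page.
spans = [
    ("Title", 24.0),
    ("Intro text", 11.0),
    ("Section", 16.0),
    ("More text", 11.0),
    ("Even more", 11.0),
]
levels = header_levels([size for _, size in spans])
lines = [to_markdown_line(text, size, levels) for text, size in spans]
```

Here the largest size maps to `# `, the next one to `## `, and the body lines stay untouched.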


Install the package via pip with:

pip install pymupdf4llm

Using in LLM / RAG Applications

PyMuPDF4LLM aims to make it easier to extract PDF content in the format you need for LLM and RAG environments. It supports Markdown extraction as well as LlamaIndex document output.

Extracting a file as Markdown

To retrieve your document content in Markdown, install the package and then use a couple of lines of Python code.

Then in your Python script do:

import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")


Instead of a filename string, you can also pass a PyMuPDF Document. An optional second parameter accepts a list of 0-based page numbers; for example, [0, 1] selects only the first and second pages of the document.
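Because page numbers are 0-based, it is easy to be off by one when starting from the page numbers a reader sees. The sketch below shows one possible way to convert an inclusive 1-based range before calling `to_markdown`. The helpers `zero_based_pages` and `extract_range` are hypothetical, and passing the list via the `pages` keyword is an assumption about the parameter's name.

```python
def zero_based_pages(first, last):
    """Convert an inclusive 1-based page range to a list of 0-based page numbers."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return list(range(first - 1, last))


def extract_range(source, first, last):
    """Extract only the pages first..last (1-based, inclusive) as Markdown.

    Assumption: the page list is accepted under the keyword name `pages`.
    `source` may be a filename or an already opened PyMuPDF Document.
    """
    import pymupdf4llm  # imported lazily so the helper above works without the package

    return pymupdf4llm.to_markdown(source, pages=zero_based_pages(first, last))


# Pages 1 and 2 as a reader counts them correspond to [0, 1].
selected = zero_based_pages(1, 2)
```

A call such as `extract_range("input.pdf", 1, 2)` would then process just the first two pages.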

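To store the Markdown text, for example as a UTF-8 encoded file, do:

import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())  # "output.md" is an example name; encode() defaults to UTF-8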

Extracting a file as a LlamaIndex document

PyMuPDF4LLM supports direct conversion to a LlamaIndex document. A document is first converted into Markdown format and then a LlamaIndex document is returned as follows:

import pymupdf4llm
llama_reader = pymupdf4llm.LlamaMarkdownReader()
llama_docs = llama_reader.load_data("input.pdf")
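Once the reader has returned its documents, a quick inspection helps confirm that the conversion produced usable text. The sketch below assumes that each returned object exposes a `text` attribute, as LlamaIndex documents generally do. `FakeDocument` and `summarize` are hypothetical names, and the stand-in class exists only so the example runs without LlamaIndex installed.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a LlamaIndex document (assumption: it exposes .text and .metadata)."""

    text: str
    metadata: dict = field(default_factory=dict)


def summarize(docs, width=40):
    """Return (character count, first-line preview) for each document."""
    summary = []
    for doc in docs:
        stripped = doc.text.strip()
        first_line = stripped.splitlines()[0] if stripped else ""
        summary.append((len(doc.text), first_line[:width]))
    return summary


# Sample data in place of the list returned by load_data().
docs = [FakeDocument("# Title\n\nBody text"), FakeDocument("")]
report = summarize(docs)
```

In a real script, `summarize(llama_docs)` could be used in the same way on the list returned by `load_data`.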

