This Python script automates the process of crawling websites, saving individual pages as PDFs, and combining them into a single document with a clickable table of contents. It's ideal for creating offline archives, comprehensive documentation, or e-books from web content.
- Crawls websites starting from a given URL
- Converts each page to a PDF
- Combines all PDFs into a single document
- Generates a clickable table of contents
- Allows setting crawl depth and excluding specific links
- Handles relative and absolute URLs
- Avoids duplicate content
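The depth-limited crawl with duplicate avoidance described above can be sketched as a breadth-first walk over a link graph. This is a minimal stand-in, not the script's actual code: `get_links` here is any callable returning a page's links, so the sketch runs without network access, whereas the real script fetches pages before extracting links.

```python
from collections import deque

def crawl(root, get_links, max_level=0):
    """Breadth-first, depth-limited crawl.

    Returns pages in visit order. Pages are tracked in `seen` so the
    same URL is never processed twice (the duplicate-content guard).
    `max_level=0` visits only the root page, matching the CLI default.
    """
    order, seen = [], {root}
    queue = deque([(root, 0)])
    while queue:
        url, level = queue.popleft()
        order.append(url)
        if level >= max_level:
            continue  # depth limit reached: do not follow this page's links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return order
```

With a toy link graph, depth 0 yields only the root, and each extra level adds one ring of newly discovered pages.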
- Python 3.7+
- pip (Python package manager)
- Clone this repository:

  ```bash
  git clone https://github.com/yourusername/web-to-pdf-crawler.git
  cd web-to-pdf-crawler
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
Run the script from the command line with the following syntax:

```bash
python web_to_pdf_crawler.py <root_url> [options]
```

- `root_url`: The starting URL for the crawler (required)
- `-e` or `--exclude`: Specify link texts to exclude from crawling (can be used multiple times)
- `-L` or `--level`: Set the maximum depth of the crawl (default is 0, which crawls only the root page)
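The options above map naturally onto `argparse`. The sketch below is a hypothetical reconstruction of the CLI, not the script's actual code; the real option handling may differ (for instance, the function name `parse_cli` is invented here).

```python
import argparse

def parse_cli(argv):
    # Hypothetical reconstruction of the CLI described above.
    parser = argparse.ArgumentParser(
        description="Crawl a website and combine its pages into one PDF")
    parser.add_argument("root_url", help="starting URL for the crawler")
    parser.add_argument("-e", "--exclude", nargs="+", action="append",
                        default=[], metavar="TEXT",
                        help="link texts to exclude (repeatable)")
    parser.add_argument("-L", "--level", type=int, default=0,
                        help="maximum crawl depth (0 = root page only)")
    args = parser.parse_args(argv)
    # -e may be given several times; flatten the per-flag lists.
    args.exclude = [text for group in args.exclude for text in group]
    return args
```

For example, `parse_cli(["https://example.com", "-L", "2", "-e", "Privacy Policy", "Terms of Service"])` yields level 2 with both link texts excluded.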
- Basic usage (crawl only the root page):

  ```bash
  python web_to_pdf_crawler.py https://example.com
  ```

- Crawl to a depth of 2, excluding certain pages:

  ```bash
  python web_to_pdf_crawler.py https://example.com -L 2 -e "Privacy Policy" "Terms of Service"
  ```
- Individual PDFs are saved in the `website_pdfs` directory
- The final combined PDF is saved as `final_combined_output.pdf` in the script's directory
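Building the clickable table of contents requires knowing which page of the combined PDF each section starts on. Given the page count of every per-page PDF, those offsets are a running sum. This is a sketch of that bookkeeping only, independent of PyPDF2; the function name and the one-page-TOC assumption are illustrative, not taken from the script.

```python
def toc_offsets(page_counts, toc_pages=1):
    """Zero-based page index where each section starts in the merged PDF.

    The table of contents occupies the first `toc_pages` pages, so the
    first section begins immediately after it; each later section starts
    where the previous one ended.
    """
    offsets, cursor = [], toc_pages
    for count in page_counts:
        offsets.append(cursor)
        cursor += count
    return offsets
```

For three source PDFs of 3, 1, and 2 pages behind a one-page TOC, the sections start on pages 1, 4, and 5.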
- The script uses Playwright to render web pages and convert them to PDFs
- BeautifulSoup is used to parse HTML and extract links
- PyPDF2 is used to combine PDFs and add internal links
- A table of contents is generated using ReportLab
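The link-extraction step can be illustrated with a small parser that resolves relative URLs against the page's base URL and de-duplicates results. This sketch uses the standard library's `html.parser` purely so it is self-contained; the script itself uses BeautifulSoup, and its actual logic may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collect unique absolute links from one page's HTML.

    Stdlib stand-in for the BeautifulSoup step described above.
    """
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative hrefs and drop #fragments so the same page
        # is not collected twice under different anchors.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute not in self._seen:
            self._seen.add(absolute)
            self.links.append(absolute)
```

Feeding it `<a href="/about">` and `<a href="/about#team">` from `https://example.com/` yields a single `https://example.com/about` entry.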
- JavaScript-heavy websites may not render correctly
- The script respects `robots.txt` by default (controlled by Playwright)
- Very large websites may take a long time to crawl completely
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.