Web-to-PDF Crawler

This Python script automates the process of crawling websites, saving individual pages as PDFs, and combining them into a single document with a clickable table of contents. It's ideal for creating offline archives, comprehensive documentation, or e-books from web content.

Features

  • Crawls websites starting from a given URL
  • Converts each page to a PDF
  • Combines all PDFs into a single document
  • Generates a clickable table of contents
  • Allows setting crawl depth and excluding specific links
  • Handles relative and absolute URLs
  • Avoids duplicate content

Prerequisites

  • Python 3.7+
  • pip (Python package manager)

Installation

  1. Clone this repository:

    git clone https://github.com/mltframework/website2pdf.git
    cd website2pdf
    
  2. Install the required packages (see the note on browser binaries after these steps):

    pip install -r requirements.txt
    
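Playwright downloads its browser binaries separately from pip. If the script reports a missing browser on first run, fetch headless Chromium with:

    playwright install chromium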

Usage

Run the script from the command line with the following syntax:

python web_to_pdf_crawler.py <root_url> [options]

Arguments

  • root_url: The starting URL for the crawler (required)

Options

  • -e or --exclude: Specify link texts to exclude from crawling (can be used multiple times)
  • -L or --level: Set the maximum depth of the crawl (default is 0, which crawls only the root page)

Examples

  1. Basic usage (crawl only the root page):

    python web_to_pdf_crawler.py https://example.com
    
  2. Crawl to a depth of 2, excluding certain pages:

    python web_to_pdf_crawler.py https://example.com -L 2 -e "Privacy Policy" -e "Terms of Service"
    

Output

  • Individual PDFs are saved in the website_pdfs directory
  • The final combined PDF is saved as final_combined_output.pdf in the script's directory (a minimal sketch of the merge step follows this list)
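The combining step uses PyPDF2 (see How It Works below). A minimal sketch, assuming PyPDF2 3.x (older releases spell the keyword bookmark instead of outline_item); the titles and filenames here are hypothetical, not the script's actual naming:

    from PyPDF2 import PdfMerger

    # Hypothetical (title, path) pairs; the script's own naming may differ
    pdf_files = [
        ("Home", "website_pdfs/page_0.pdf"),
        ("Docs", "website_pdfs/page_1.pdf"),
    ]

    merger = PdfMerger()
    for title, path in pdf_files:
        # outline_item adds a clickable bookmark pointing at the appended document
        merger.append(path, outline_item=title)
    merger.write("final_combined_output.pdf")
    merger.close()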

How It Works

  1. The script uses Playwright to render web pages and convert them to PDFs (sketched after this list)
  2. BeautifulSoup is used to parse HTML and extract links
  3. PyPDF2 is used to combine PDFs and add internal links
  4. A table of contents is generated using ReportLab
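A minimal sketch of the render-and-extract loop (steps 1 and 2), assuming the synchronous Playwright API and an A4 page size; the function names are illustrative, not the script's actual ones:

    from urllib.parse import urldefrag, urljoin

    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright

    def render_page(url, pdf_path):
        """Render a page in headless Chromium, save it as a PDF, and return its HTML."""
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            html = page.content()
            page.pdf(path=pdf_path, format="A4")  # PDF export requires headless Chromium
            browser.close()
        return html

    def extract_links(html, base_url):
        """Return absolute, fragment-free URLs found in the page, deduplicated."""
        soup = BeautifulSoup(html, "html.parser")
        links = set()
        for anchor in soup.find_all("a", href=True):
            # urljoin resolves relative hrefs; urldefrag drops #fragments so the
            # same page is not crawled twice under different anchors
            absolute, _fragment = urldefrag(urljoin(base_url, anchor["href"]))
            links.add(absolute)
        return links

For step 4, a table-of-contents page can be drawn with ReportLab's canvas; the clickable links themselves are added when the PDFs are merged. A sketch, with a hypothetical entries list of (title, page_number) pairs:

    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    def build_toc(entries, toc_path="toc.pdf"):
        """Draw a simple TOC page; entries is a list of (title, page_number) pairs."""
        c = canvas.Canvas(toc_path, pagesize=A4)
        c.setFont("Helvetica-Bold", 16)
        c.drawString(72, 770, "Table of Contents")
        c.setFont("Helvetica", 11)
        y = 740
        for title, page_number in entries:
            c.drawString(72, y, f"{title}  ......  {page_number}")
            y -= 16  # move down one line per entry
        c.save()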

Limitations

  • Pages that rely on lazy loading or user interaction may not render completely before the PDF is captured
  • The script does not automatically honor robots.txt (Playwright provides no such control), so check a site's policies before crawling it
  • Very large websites may take a long time to crawl completely

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
