Wayback-Go

A command-line tool to download websites from the Wayback Machine, re-written in Go.

Overview

This program is a Go port of the popular Ruby-based wayback-machine-downloader by hartator (https://github.com/hartator/wayback-machine-downloader). It downloads the archived snapshots of a given URL from the Internet Archive's Wayback Machine and saves them locally.
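To make the idea concrete, here is a minimal sketch of how a snapshot listing and a snapshot download address can be built. It uses the parameter names of the public Wayback Machine CDX API and the `id_` URL modifier; whether wayback-go issues exactly these requests is an assumption, and the helper names `cdxQueryURL` and `snapshotURL` are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// cdxQueryURL builds a query against the Wayback Machine CDX server.
// Hypothetical helper: the parameters follow the public CDX API, but the
// exact set wayback-go sends is an assumption.
func cdxQueryURL(target, from, to string, exact bool) string {
	q := url.Values{}
	q.Set("url", target)
	q.Set("output", "json")
	q.Set("fl", "timestamp,original")
	if !exact {
		// Match every archived URL that starts with target.
		q.Set("matchType", "prefix")
	}
	if from != "" {
		q.Set("from", from)
	}
	if to != "" {
		q.Set("to", to)
	}
	return "https://web.archive.org/cdx/search/cdx?" + q.Encode()
}

// snapshotURL returns the address of one archived capture. The "id_"
// modifier asks for the original bytes, without the Wayback toolbar or
// rewritten links.
func snapshotURL(timestamp, original string) string {
	return "https://web.archive.org/web/" + timestamp + "id_/" + original
}

func main() {
	fmt.Println(cdxQueryURL("example.com", "20200101000000", "20201231235959", false))
	fmt.Println(snapshotURL("20200615120000", "https://example.com/index.html"))
}
```

The listing step returns (timestamp, original URL) pairs, and each pair maps to one download address.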

Features

  • Download Entire Websites: Recursively downloads all files associated with a given URL from the Wayback Machine.
  • Exact URL Download: Option to download only the exact URL provided, without following links.
  • Timestamp Filtering: Specify from and to timestamps to download snapshots within a particular date range.
  • Regex Filtering: Include or exclude URLs based on regular expressions.
  • All Timestamps: Download all available timestamps for each file, not just the latest.
  • Concurrency: Optionally uses multiple download threads for faster downloads (the default is 1).
  • List Only Mode: Preview the list of files that would be downloaded in JSON format without actually downloading them.
  • Error Handling: Option to download all files, even those that return errors.

Installation

To install wayback-go, you need Go installed on your system (Go 1.16 or later is recommended).

  1. Clone the repository:
    git clone https://github.com/Cat-Ling/wayback-go.git
    cd wayback-go
  2. Build the executable:
    go build -o wayback-go
  3. Move to your PATH (optional):
    sudo mv wayback-go /usr/local/bin/

Usage

./wayback-go --url <URL> [options]

Options:

  • --url <URL>: The base URL to download from Wayback Machine (required).
  • --exact-url: Download only the exact URL.
  • --dir <directory>: Directory to save the downloaded files (defaults to websites/<domain>).
  • --all-timestamps: Download all available timestamps for each file.
  • --from <timestamp>: Download snapshots from this timestamp onward, in YYYYMMDDhhmmss form (e.g., 20060102150405).
  • --to <timestamp>: Download snapshots up to this timestamp, in YYYYMMDDhhmmss form (e.g., 20060102150405).
  • --only <regex>: Only download URLs matching this regex filter.
  • --exclude <regex>: Exclude URLs matching this regex filter.
  • --all: Download all files, even if they return an error.
  • --max-pages <number>: Maximum number of snapshot pages to retrieve from Wayback Machine API (default: 100).
  • --threads <number>: Number of concurrent download threads (default: 1).
  • --list: List file URLs in JSON format without downloading anything.
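The timestamp format used by --from and --to maps directly onto a Go time layout, so values can be checked before any request is made. The sketch below assumes full 14-digit timestamps and rejects a range whose start lies after its end; wayback-go may validate differently or not at all, and `validateRange` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"time"
)

// waybackLayout is the Go time layout for a 14-digit timestamp
// (year, month, day, hour, minute, second with no separators).
const waybackLayout = "20060102150405"

// validateRange checks that from and to are well-formed timestamps and
// that from does not come after to. Empty strings mean "unbounded".
// Hypothetical helper, not taken from the wayback-go source.
func validateRange(from, to string) error {
	var start, end time.Time
	var err error
	if from != "" {
		if start, err = time.Parse(waybackLayout, from); err != nil {
			return fmt.Errorf("invalid --from value %q: %w", from, err)
		}
	}
	if to != "" {
		if end, err = time.Parse(waybackLayout, to); err != nil {
			return fmt.Errorf("invalid --to value %q: %w", to, err)
		}
	}
	if from != "" && to != "" && start.After(end) {
		return fmt.Errorf("--from %s is later than --to %s", from, to)
	}
	return nil
}

func main() {
	fmt.Println(validateRange("20200101000000", "20201231235959"))
	fmt.Println(validateRange("20210101000000", "20200101000000"))
}
```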

Examples:

  1. Download a website:
    ./wayback-go --url https://example.com
  2. Download only a specific URL:
    ./wayback-go --url https://example.com/page.html --exact-url
  3. Download with a specific output directory:
    ./wayback-go --url https://example.com --dir my_archive
  4. Download snapshots within a specific date range:
    ./wayback-go --url https://example.com --from 20200101000000 --to 20201231235959
  5. List files in JSON format:
    ./wayback-go --url https://example.com --list
  6. Download with 5 concurrent threads:
    ./wayback-go --url https://example.com --threads 5
  7. Only download CSS files:
    ./wayback-go --url https://example.com --only "\.css$"

Contributing

Contributions are welcome! Please feel free to open issues or submit pull requests.

License

This project is licensed under the MIT License. See the LICENSE file for details.