TZ Archiver CLI

⚠️ Warning:
This repository is in an experimental state. 90% of the code was generated by Sonnet 4 based on the TypeScript code from the Tezos Archiver website.

A Python command-line tool for archiving Tezos NFTs to the Wayback Machine. This tool fetches NFT metadata from Tezos wallets using the TzKT API and automatically archives IPFS-hosted artifacts to ensure long-term preservation.

Prerequisites

  • Python 3.10+
  • Internet Archive account with API access
  • Required Python packages (see installation)

Installation

  1. Clone the repository:
git clone https://github.com/melon-dog/tz-archiver-cli.git
cd tz-archiver-cli
  2. Install dependencies:
pip install requests python-dotenv wayback-utils
  3. Create a .env file in the src/ directory with your Internet Archive credentials:
ARCHIVE_ACCESS=your_access_key_here
ARCHIVE_SECRET=your_secret_key_here

Usage

Basic Usage

Archive NFTs from a specific Tezos wallet:

python src/main.py -w tz1U7C2NVwbhdvG3fJixLLUWUyZHuXWNiF7V

Spider Mode (Random Discovery)

Run without specifying a wallet to archive random tokens:

python src/main.py

Advanced Usage

Specify a custom limit for the number of tokens to process:

python src/main.py -w tz1U7C2NVwbhdvG3fJixLLUWUyZHuXWNiF7V -l 500

Command-line Options

  • -w, --wallet (optional): Tezos wallet address (e.g., tz1...). If not provided, runs in spider mode
  • -l, --limit (optional): Number of tokens to process (default: 10,000)
  • -h, --help: Show detailed help message with examples
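
A minimal sketch of how these flags could be wired up with argparse; the option names and defaults mirror the list above, but this is an illustration, not the project's actual main.py:

# Hypothetical sketch of the CLI options described above (argparse adds -h/--help automatically).
import argparse

parser = argparse.ArgumentParser(
    description="Archive Tezos NFT artifacts to the Wayback Machine."
)
parser.add_argument("-w", "--wallet",
                    help="Tezos wallet address (e.g., tz1...); omit to run in spider mode")
parser.add_argument("-l", "--limit", type=int, default=10_000,
                    help="number of tokens to process (default: 10,000)")
args = parser.parse_args()

if args.wallet is None:
    print(f"Spider mode: archiving up to {args.limit} random tokens")
else:
    print(f"Archiving up to {args.limit} tokens for wallet {args.wallet}")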

How It Works

1. Token Discovery

The tool queries the TzKT API to find:

  • Minted tokens: Tokens created by the wallet
  • Owned tokens: Tokens currently in the wallet
  • Contract tokens: Tokens from contracts associated with the wallet

2. IPFS Detection

The tool scans token metadata for artifactUri fields containing IPFS URLs (ipfs://...).
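
Roughly, that detection step amounts to the following sketch, which assumes the token metadata has already been parsed into a dict as returned by TzKT:

# Sketch: extract the IPFS CID (and any path suffix) from a token's artifactUri, if it uses the ipfs:// scheme.
def extract_ipfs_cid(metadata: dict) -> str | None:
    uri = (metadata or {}).get("artifactUri", "")
    if uri.startswith("ipfs://"):
        return uri[len("ipfs://"):]
    return None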

3. Smart Archiving Process

For each IPFS artifact:

  • Pre-check: Verifies whether the artifact is already archived (doesn't count toward the rate limit)
  • Rate limiting: Applied only to actual archiving operations
  • URL conversion: Converts the IPFS CID to an HTTP URL via ipfs.fileship.xyz (sketched below)
  • Wayback submission: Submits the URL for archiving with optimized parameters
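
The URL-conversion step can be pictured like this (a sketch assuming a plain CID-to-gateway mapping with no extra query parameters):

# Sketch: turn an ipfs:// URI into an HTTP URL on the ipfs.fileship.xyz gateway.
def ipfs_to_http(ipfs_uri: str) -> str:
    cid_and_path = ipfs_uri.removeprefix("ipfs://")
    return f"https://ipfs.fileship.xyz/{cid_and_path}"

# Example: ipfs://QmExampleCid -> https://ipfs.fileship.xyz/QmExampleCid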

4. Concurrent Processing

  • Maintains up to 4 concurrent archiving processes
  • Smart queue management with available slot detection
  • Automatic retry logic for failed operations
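
One way to picture the four-slot concurrency is a bounded thread pool; the sketch below is an assumption about the approach rather than the project's actual processor.py, and archive_one stands in for whatever callable performs a single capture:

# Sketch: cap archiving at 4 parallel operations with a bounded thread pool.
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_CONCURRENT_PROCESSES = 4

def archive_all(urls, archive_one):
    results = {}
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_PROCESSES) as pool:
        futures = {pool.submit(archive_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:   # a failed operation could be queued for retry here
                results[url] = exc
    return results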

5. State Persistence

All data is automatically saved to src/data/:

  • processed_cids.json: Successfully processed IPFS CIDs
  • errors_cids.json: CIDs that failed to archive (for manual retry)

6. Resume Capability

The tool automatically:

  • Loads previous session data on startup
  • Skips already processed CIDs
  • Continues from where it left off
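
A minimal sketch of the load/skip/save cycle behind steps 5 and 6, assuming processed_cids.json stores its CIDs under the processed_cids key shown in the Data Persistence section below:

# Sketch: load previously processed CIDs, skip them, and persist any new ones.
import json
from pathlib import Path

DATA_DIR = Path("src/data")
PROCESSED_FILE = DATA_DIR / "processed_cids.json"

def load_processed() -> set[str]:
    if PROCESSED_FILE.exists():
        return set(json.loads(PROCESSED_FILE.read_text()).get("processed_cids", []))
    return set()

def save_processed(cids: set[str]) -> None:
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    PROCESSED_FILE.write_text(json.dumps({"processed_cids": sorted(cids)}, indent=2))

processed = load_processed()
for cid in ["QmExampleA", "QmExampleB"]:     # placeholder CIDs
    if cid in processed:
        continue                             # resume: skip work done in a previous session
    # ... archive the CID here ...
    processed.add(cid)
save_processed(processed)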

Configuration

Environment Variables

Create a .env file in the src/ directory:

ARCHIVE_ACCESS=your_access_key_here
ARCHIVE_SECRET=your_secret_key_here

Note: You can obtain your API keys at the following link:
https://archive.org/account/s3.php
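
Loading those values with python-dotenv looks roughly like this (a sketch; the project's config.py may differ in detail):

# Sketch: read the Internet Archive credentials from src/.env with python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv("src/.env")                      # reads ARCHIVE_ACCESS / ARCHIVE_SECRET
ARCHIVE_ACCESS = os.getenv("ARCHIVE_ACCESS")
ARCHIVE_SECRET = os.getenv("ARCHIVE_SECRET")

if not ARCHIVE_ACCESS or not ARCHIVE_SECRET:
    raise SystemExit("Missing ARCHIVE_ACCESS / ARCHIVE_SECRET in src/.env")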

Rate Limiting

The tool implements intelligent rate limiting:

  • Wayback Machine limit: 12 captures/minute (configurable)
  • Check operations: wayback.indexed() calls don't count towards limit
  • Archive operations: Only wayback.save() calls count towards limit
  • Sliding window: 60-second rolling window for accurate rate tracking
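
The sliding-window behavior can be sketched as follows; the 12-per-minute figure comes from the list above, while the implementation details are assumptions:

# Sketch: allow at most 12 archiving calls in any rolling 60-second window.
import time
from collections import deque

MAX_CAPTURES_PER_MINUTE = 12
WINDOW_SECONDS = 60
_recent: deque[float] = deque()

def wait_for_slot() -> None:
    """Block until another archive submission would stay within the limit."""
    while True:
        now = time.monotonic()
        while _recent and now - _recent[0] >= WINDOW_SECONDS:
            _recent.popleft()                # drop calls that have left the window
        if len(_recent) < MAX_CAPTURES_PER_MINUTE:
            _recent.append(now)
            return
        time.sleep(WINDOW_SECONDS - (now - _recent[0]))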

Archiving Parameters

Optimized Wayback Machine settings:

  • js_behavior_timeout: 7 seconds
  • delay_wb_availability: False
  • if_not_archived_within: 31,536,000 seconds (1 year)
  • max_concurrent_processes: 4
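
For illustration, an equivalent direct request to the Save Page Now 2 endpoint with these parameters might look like the sketch below. The project itself submits captures through the wayback-utils wrapper; the endpoint, headers, and field names here come from the public SPN2 documentation rather than this repository:

# Sketch: a direct Save Page Now 2 submission with the parameters listed above (illustration only).
import requests

def submit_capture(url: str, access: str, secret: str) -> dict:
    response = requests.post(
        "https://web.archive.org/save",
        headers={
            "Accept": "application/json",
            "Authorization": f"LOW {access}:{secret}",
        },
        data={
            "url": url,
            "js_behavior_timeout": 7,             # seconds
            "delay_wb_availability": 0,           # False
            "if_not_archived_within": 31536000,   # 1 year
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()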

Project Structure

tz-archiver-cli/
├── src/
│   ├── data/                     # Persistent state storage (auto-created)
│   │   ├── processed_cids.json   # Successfully processed CIDs
│   │   └── errors_cids.json      # Failed CIDs for retry
│   ├── utils/                    # Utility modules
│   │   ├── __init__.py           # Package initialization
│   │   ├── logger.py             # Colored logging system
│   │   └── tzkt.py               # TzKT API client with full type hints
│   ├── main.py                   # CLI entry point with argument parsing
│   ├── config.py                 # Centralized configuration management
│   ├── processor.py              # Core business logic and rate limiting
│   ├── archiver.py               # Wayback Machine integration
│   ├── state_manager.py          # Persistent state management
│   └── .env                      # Environment variables (create this)
├── README.md                     # This documentation
└── requirements.txt              # Python dependencies (optional)

Architecture

Core Components

  • main.py: CLI entry point with comprehensive argument validation
  • processor.py: Token processing with smart rate limiting
  • archiver.py: Wayback Machine integration with concurrency control
  • state_manager.py: Atomic file operations for data persistence
  • config.py: Centralized configuration with environment variable support
  • utils/logger.py: Advanced logging with ANSI colors and Windows compatibility
  • utils/tzkt.py: Fully typed TzKT API client with comprehensive dataclasses

API Integration

TzKT API

Integrates with TzKT API for Tezos blockchain data:

  • Mints: /v1/tokens?firstMinter={address}&limit={limit}
  • Balances: /v1/tokens/balances?account={address}&limit={limit}
  • Contract Tokens: /v1/tokens?contract={address}&limit={limit}
  • Random Tokens: /v1/tokens?select=*&limit={limit}&sort=random
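
As an illustration, the mints query could be issued directly with requests (a sketch; the project wraps these calls in utils/tzkt.py with typed dataclasses):

# Sketch: fetch tokens first minted by a wallet from the public TzKT API.
import requests

TZKT_API = "https://api.tzkt.io"

def fetch_minted_tokens(address: str, limit: int = 100) -> list[dict]:
    response = requests.get(
        f"{TZKT_API}/v1/tokens",
        params={"firstMinter": address, "limit": limit},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

for token in fetch_minted_tokens("tz1U7C2NVwbhdvG3fJixLLUWUyZHuXWNiF7V", limit=10):
    print((token.get("metadata") or {}).get("artifactUri"))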

Wayback Machine API

Uses wayback-utils library:

  • Check Archive Status: wayback.indexed() (doesn't count for rate limit)
  • Submit for Archiving: wayback.save() (counts for rate limit)
  • Rate Limiting: 12 captures/minute with sliding window algorithm

Data Persistence

The tool automatically creates a data/ folder in the source directory to store:

  • processed_cids.json: List of successfully processed IPFS CIDs with timestamps
  • errors_cids.json: List of CIDs that failed to archive (for manual retry)

Data format:

{
  "processed_cids": ["Qm...", "bafy..."],
  "errors_cids": ["Qm...", "bafy..."],
}

Performance Features

  • Smart caching: Avoids reprocessing already handled CIDs
  • Concurrent processing: Up to 4 parallel archiving operations
  • Rate limit optimization: Only counts actual archiving requests
  • Memory efficient: Streams data and processes in batches
  • Resumable sessions: No work lost on interruption

Contributing

  1. Fork the repository
  2. Make your changes with proper type hints
  3. Ensure code follows the established patterns
  4. Submit a pull request

License

MIT License - see LICENSE file for details

Important Notes

  • Rate Limits: The tool respects Wayback Machine's 12 captures/minute limit
  • Processing Time: Large collections may take significant time to process
  • Asynchronous Results: Archiving on the Internet Archive is asynchronous, so results may not be immediately available
  • Network Dependency: Requires stable internet connection for API calls
  • Storage: Local state files grow with processed CID count

Advanced Usage Examples

Resume a Previous Session

# Simply run the same command - the tool automatically resumes
python src/main.py -w tz1YourWalletAddress

Monitor Rate Limiting

# The tool displays current rate status:
# "Archiving CID (rate: 8/12/min): QmHashHere"

Process Multiple Wallets

# Process different wallets sequentially
python src/main.py -w tz1FirstWallet -l 1000
python src/main.py -w tz2SecondWallet -l 1000

Spider Mode for Discovery

# Continuous random token discovery
python src/main.py
# Press Ctrl+C to stop gracefully

Generated with ❤️ for the Tezos NFT community
