Hardware Requirements
You can choose how many concurrent jobs a server accepts. Find a setting that does not overload the CPU or exhaust RAM or storage.
Don't use small servers: if you scale the numbers below down linearly, you will run out of space as soon as you get a couple of big jobs.
With the latest version of tumblr-grab, the following numbers seem to work:
- 8 cores / 16 threads (E5 / Epyc), 64 GB RAM, 1 TB storage (better 2 TB): 200 concurrency
- 4 cores / 8 threads (E3 dedicated), 32 GB RAM, 1 TB storage: 100 concurrency
Tumblr starts rate limiting above roughly 300-350 concurrency per IP.
Using HDD instead of SSD should be fine.
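The two reference configurations above can be written as a small lookup. This is only a sketch: the function name suggest_concurrency and its arguments are hypothetical, the thresholds are just the numbers from the list, and anything below them is reported as untested rather than scaled down.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: map nominal hardware to the concurrency figures listed above.
# Arguments: CPU threads, RAM in GB, storage in GB (nominal values, as sold).
suggest_concurrency() {
  local threads=$1 ram_gb=$2 storage_gb=$3
  if [ "$threads" -ge 16 ] && [ "$ram_gb" -ge 64 ] && [ "$storage_gb" -ge 1000 ]; then
    echo 200
  elif [ "$threads" -ge 8 ] && [ "$ram_gb" -ge 32 ] && [ "$storage_gb" -ge 1000 ]; then
    echo 100
  else
    # Smaller machines are not covered by the list; do not scale down linearly.
    echo "untested"
  fi
}

suggest_concurrency 16 64 2000   # 200
suggest_concurrency 8 32 1000    # 100
suggest_concurrency 4 8 250      # untested
```

Pass the nominal values by hand; the totals reported by the kernel are usually slightly below the advertised RAM size and would miss the thresholds.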
Commands
All commands are meant to be run as root on a fresh Debian 9 server.
Install
adduser --system --group --shell /bin/bash archiveteam && apt-get update && apt-get upgrade -y && apt-get install -y git-core libgnutls28-dev libgnutls30 screen lua5.1 liblua5.1-0 liblua5.1-0-dev python-dev python-pip bzip2 zlib1g-dev unzip python-setuptools build-essential flex autoconf python-gnutls atop && pip install wheel && pip install --upgrade seesaw && su -c "cd /home/archiveteam; git clone https://github.com/ArchiveTeam/tumblr-grab.git; cd tumblr-grab; ./get-wget-lua.sh" archiveteam && wget https://gist.github.com/JustAnotherArchivist/f4617c902626377532692a341794f273/raw/4a81f66b5dcbc18deb0d530979a443be12b1844a/tumblr-monitor && chmod +x tumblr-monitor
Launch
Replace YOURNICKHERE with the nick you want to appear on the leaderboard (http://tracker.archiveteam.org/tumblr/).
for i in {8001..8010}; do screen -dm su -c "cd /home/archiveteam/tumblr-grab/; run-pipeline pipeline.py --concurrent 2 --port $i --address '127.0.0.1' YOURNICKHERE" archiveteam; sleep 1; done
This launches 10 processes with 2 concurrent jobs each, for a total concurrency of 20. Adjust {8001..8010} to launch fewer or more processes: for 40 concurrency use {8001..8020}, for 80 use {8001..8040}, and so on.
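The port range for a given target can also be computed instead of worked out by hand. A minimal sketch, assuming 2 jobs per process (--concurrent 2) and a first port of 8001 as in the launch command; the helper name last_port is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the last port needed for a target total concurrency.
# Defaults mirror the launch command: 2 jobs per process, first port 8001.
last_port() {
  local total=$1 per_process=${2:-2} first_port=${3:-8001}
  # Round up so an odd target still gets enough processes.
  local processes=$(( (total + per_process - 1) / per_process ))
  echo $(( first_port + processes - 1 ))
}

last_port 20   # 8010
last_port 40   # 8020
last_port 80   # 8040
```

Bash does not expand variables inside a brace range such as {8001..8010}, so with a computed end port the loop header would be written with seq, for example: for i in $(seq 8001 "$(last_port 80)").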
Monitor
Current jobs and their progress:
./tumblr-monitor
General system information:
atop
Just count the current jobs (for example to monitor a graceful stop):
pgrep -f tumblr-blog | wc -l
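To watch a graceful stop without re-running the count by hand, the same pgrep count can be polled in a loop. This is a sketch only: the function name wait_for_jobs and the 60-second default interval are assumptions, not part of tumblr-grab.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll until no process matches the pattern any more.
# Arguments: pgrep -f pattern (default tumblr-blog), poll interval in seconds (default 60).
wait_for_jobs() {
  local pattern=${1:-tumblr-blog} interval=${2:-60} remaining
  # Same count as the one-liner above, repeated until it reaches zero.
  while remaining=$(pgrep -f "$pattern" | wc -l); [ "$remaining" -gt 0 ]; do
    echo "$(date '+%F %T') $remaining jobs still running"
    sleep "$interval"
  done
  echo "no jobs left"
}

# Example call (commented out so that sourcing this file does not block):
# wait_for_jobs tumblr-blog 60
```

Run it in its own screen window after touching STOP; it prints one line per interval and returns once the count is zero.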
Stop
Gracefully (might not be effective, or might take a long time: multiple hours or days):
su -c "cd /home/archiveteam/tumblr-grab/; touch STOP" archiveteam
Forcefully:
pkill -f wget-lua && reboot
Restart: after a forceful stop, previously active jobs are not resumed. Remove the tumblr-grab directory, fetch the latest version, and get new jobs. You may want to ask in the IRC channel for your claims to be released. Run the "Launch" part again afterwards:
rm -rf /home/archiveteam/tumblr-grab && su -c "cd /home/archiveteam; git clone https://github.com/ArchiveTeam/tumblr-grab.git; cd tumblr-grab; ./get-wget-lua.sh" archiveteam