Authored by trvz

Save our Smuts

Instructions for running https://github.com/ArchiveTeam/tumblr-grab manually.

More info: http://tracker.archiveteam.org/tumblr/, #tumbledown on efnet

Hardware Requirements

You can choose how many concurrent jobs to accept. Find a sweet spot that does not overburden the CPU or exhaust RAM or storage.

Don't use small servers - if you scale the numbers below down linearly, you'll run out of space as soon as you get a couple of big jobs.

With the latest version of tumblr-grab, the following numbers seem to work:

  • 8 cores / 16 threads (E5 / Epyc), 64 GB RAM, 1 TB storage (better 2 TB): 200 concurrency
  • 4 cores / 8 threads (E3 dedicated), 32 GB RAM, 1 TB storage: 100 concurrency
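
The two configurations above can be encoded as a simple lookup. The following is only a sketch: the function name `suggest_concurrency` and its arguments (hardware threads, nominal RAM in GB, storage in TB) are hypothetical, and it deliberately returns 0 for anything below the smaller configuration instead of scaling linearly.

```shell
# Hypothetical helper: map hardware to one of the two concurrency
# values listed above. Arguments: threads, RAM in GB, storage in TB.
suggest_concurrency() {
    local threads=$1 ram_gb=$2 storage_tb=$3
    if [ "$threads" -ge 16 ] && [ "$ram_gb" -ge 64 ] && [ "$storage_tb" -ge 1 ]; then
        echo 200
    elif [ "$threads" -ge 8 ] && [ "$ram_gb" -ge 32 ] && [ "$storage_tb" -ge 1 ]; then
        echo 100
    else
        # Smaller than the tested configurations: no recommendation.
        echo 0
    fi
}
```

For example, `suggest_concurrency 16 64 2` prints 200 and `suggest_concurrency 8 32 1` prints 100. Pass the nominal values of the server; the kernel usually reports slightly less RAM than the nominal size.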

Tumblr starts rate limiting above roughly 300-350 concurrency per IP.

Using HDD instead of SSD should be fine.

Commands

All commands are meant to be run as root on a fresh Debian 9 server.

Install

adduser --system --group --shell /bin/bash archiveteam && \
apt-get update && apt-get upgrade -y && \
apt-get install -y git-core libgnutls28-dev libgnutls30 screen lua5.1 liblua5.1-0 liblua5.1-0-dev python-dev python-pip bzip2 zlib1g-dev unzip python-setuptools build-essential flex autoconf python-gnutls atop && \
pip install wheel && pip install --upgrade seesaw && \
su -c "cd /home/archiveteam; git clone https://github.com/ArchiveTeam/tumblr-grab.git; cd tumblr-grab; ./get-wget-lua.sh" archiveteam && \
wget https://gist.github.com/JustAnotherArchivist/f4617c902626377532692a341794f273/raw/4a81f66b5dcbc18deb0d530979a443be12b1844a/tumblr-monitor && \
chmod +x tumblr-monitor

Launch

Replace YOURNICKHERE with the nick that should appear on the leaderboard (http://tracker.archiveteam.org/tumblr/).

for i in {8001..8010}; do screen -dm su -c "cd /home/archiveteam/tumblr-grab/; run-pipeline pipeline.py --concurrent 2 --port $i --address '127.0.0.1' YOURNICKHERE" archiveteam; sleep 1; done

This launches 10 instances with 2 concurrent jobs each, so 20 concurrency overall. Adjust {8001..8010} to launch fewer or more: for 40 concurrency use {8001..8020}, for 80 use {8001..8040}, and so on.
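
The port arithmetic can be sketched as a small helper. The function name `port_range` is hypothetical; it assumes 2 concurrent jobs per instance and a first port of 8001, as in the launch command, and rounds up when the target is not a multiple of the per-instance concurrency.

```shell
# Hypothetical helper: print the first and last port needed to reach
# a target concurrency. Arguments: target, jobs per instance
# (default 2), first port (default 8001).
port_range() {
    local target=$1 per_instance=${2:-2} first_port=${3:-8001}
    # Round up so that the target is always reached.
    local instances=$(( (target + per_instance - 1) / per_instance ))
    local last_port=$(( first_port + instances - 1 ))
    echo "$first_port $last_port"
}
```

`port_range 40` prints `8001 8020`, matching the {8001..8020} range given for 40 concurrency.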

Monitor

Current jobs and their progress:

./tumblr-monitor

General system information:

atop

Just count the current jobs (for example, to monitor a graceful stop):

pgrep -f tumblr-blog | wc -l
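
To avoid re-running that count by hand, it can be wrapped in a polling loop. This is a minimal sketch: the function names `count_jobs` and `wait_until_idle` and the 60-second default interval are assumptions, not part of tumblr-grab.

```shell
# Count running jobs with the same pgrep pattern as above.
count_jobs() {
    pgrep -f tumblr-blog | wc -l
}

# Hypothetical helper: poll until no jobs are left.
# Argument: seconds between checks (default 60).
wait_until_idle() {
    local interval=${1:-60}
    local remaining
    remaining=$(count_jobs)
    while [ "$remaining" -gt 0 ]; do
        echo "$remaining jobs still running"
        sleep "$interval"
        remaining=$(count_jobs)
    done
    echo "no jobs left"
}
```

Call `wait_until_idle 300` to check every five minutes; it returns once the count reaches zero.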

Stop

Gracefully - this might not be effective, or might take a long time (multiple hours or days):

su -c "cd /home/archiveteam/tumblr-grab/; touch STOP" archiveteam

Forcefully:

pkill -f wget-lua && reboot

Restart - after a forceful stop, previously active jobs are not resumed. Remove the tumblr-grab directory, fetch the latest version, and get new jobs. Consider asking in the IRC channel for your claims to be released, and run the "Launch" part again afterwards:

rm -rf /home/archiveteam/tumblr-grab && su -c "cd /home/archiveteam; git clone https://github.com/ArchiveTeam/tumblr-grab.git; cd tumblr-grab; ./get-wget-lua.sh" archiveteam