browsertrix: fetch all on forummuenchen.org

I installed browsertrix locally, including a UI. The first crawl was set up to fetch everything at forummuenchen.org, including subdomains. The main websites are forummuenchen.org and archiv.forummuenchen.org.
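The UI drives the same browsertrix-crawler under the hood, so a roughly equivalent crawl could be sketched with the crawler's Docker image directly. This is a hedged sketch, not the exact configuration used; the collection name is made up:

```shell
# Sketch of a comparable crawl using the browsertrix-crawler Docker image.
# The collection name "forummuenchen" is an assumption for illustration.
docker run -v "$PWD/crawls:/crawls/" webrecorder/browsertrix-crawler crawl \
  --url https://forummuenchen.org \
  --scopeType domain \
  --generateWACZ \
  --collection forummuenchen
# --scopeType domain keeps the crawl on forummuenchen.org while also
# following subdomains such as archiv.forummuenchen.org.
```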

The crawl took 17 hours 56 minutes and fetched 17,478 pages, producing 14.3 GB of WARC files.

This leaves me with these observations:

  • Just like the browsertrix CLI tool, browsertrix with a UI works smoothly when you follow the installation procedure.
  • Browsertrix produces A LOT of data very quickly. This is not surprising at all, but I am still new enough to web archiving that it amazes me. The web hoster reports that forummuenchen.org uses 16.38 GB for web content and database combined, so the browsertrix output amounts to roughly 87% of the original site's size.
  • Are there best practices concerning pages that show tags, e.g. https://archiv.forummuenchen.org/schlagwort/feminismus/? Many of these pages are not "original content" but a different view on it, e.g. listings by tag, author, and so on. Arguments can be made both ways: saving them gives the best replay experience, while skipping them saves resources.
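If one decides against archiving such listing pages, browsertrix-crawler accepts `--exclude` regexes for URLs to skip. A hedged sketch, assuming the German WordPress paths ("schlagwort" = tag, "autor" = author) used on this site; the author path is an assumption about the URL layout:

```shell
# Sketch: skip tag and author listing pages during the crawl.
# "schlagwort" appears in the source; "autor" is an assumed path.
docker run -v "$PWD/crawls:/crawls/" webrecorder/browsertrix-crawler crawl \
  --url https://archiv.forummuenchen.org \
  --scopeType domain \
  --exclude "schlagwort" \
  --exclude "autor" \
  --generateWACZ
```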