I want to scrape an entire wiki that runs on MediaWiki software. The number of pages is pretty small, but they have plenty of revisions, and I'd preferably like to scrape the revisions as well.

Unlike Wikipedia, the wiki does not offer database dumps. Are there any existing tools or scripts designed to scrape MediaWiki sites?

Accepted answer (3 votes)

If the maintainer of the wiki hasn't turned it off, you can export pages with their history through Special:Export. This will give you an XML dump similar to Wikipedia's database dumps, which you can then import into another wiki.
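
For example, here is a minimal sketch of such an export request in Python (using the requests library). The wiki URL and page titles are placeholders, and the form field names (pages, history, templates) should be checked against your wiki's Special:Export form, since they can vary between MediaWiki versions:

# Sketch: POST a newline-separated page list to Special:Export and
# save the resulting XML dump. All URLs and titles are placeholders.
import requests

WIKI_INDEX = "http://wiki.example.org/index.php"   # placeholder wiki
pages = ["Main_Page", "Some_Other_Page"]           # placeholder titles

resp = requests.post(
    WIKI_INDEX,
    params={"title": "Special:Export"},
    data={
        "pages": "\n".join(pages),  # one title per line, like the textarea
        "history": "1",             # request the full revision history
        "templates": "1",           # also include transcluded templates
    },
)
resp.raise_for_status()

with open("export.xml", "wb") as f:
    f.write(resp.content)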

Another way to obtain page history from MediaWiki in XML format is to use the prop=revisions API query. However, the API results format is somewhat different from that produced by Special:Export, so you'll probably have to process the output a bit before you can feed it to standard import scripts.
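
As a rough illustration, a history query through the API might look like the sketch below. The api.php URL and page title are placeholders; rvprop and rvlimit are standard revisions-module parameters, but very old MediaWiki versions handle continuation differently, so treat this as a starting point rather than a drop-in script:

# Sketch: fetch one page's revision history via api.php (prop=revisions).
# The endpoint and title are placeholders; continuation uses the modern
# "continue" convention.
import requests

API = "http://wiki.example.org/api.php"   # placeholder endpoint
params = {
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "titles": "Main_Page",                # placeholder page title
    "rvprop": "ids|timestamp|user|comment|content",
    "rvlimit": "max",
}

revisions = []
while True:
    data = requests.get(API, params=params).json()
    for page in data["query"]["pages"].values():
        revisions.extend(page.get("revisions", []))
    if "continue" not in data:
        break
    params.update(data["continue"])       # follow the continuation token

print("Fetched %d revisions" % len(revisions))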

  • Thanks, but with Special:Export can I only export one page at a time? – apscience Feb 9 '12 at 4:37
  • No, you can export as many pages as you like; just separate the names with linefeeds. That's why the user interface has this whole big <textarea> just for the page list. – Ilmari Karonen Feb 9 '12 at 13:22
  • However, see the footnotes on the page I linked to: if some of the pages have a lot of revisions, you may find that you need to export them one by one anyway. You could always write a simple script to loop over the page list and send export requests for them one at a time. – Ilmari Karonen Feb 9 '12 at 13:30
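
If you do end up exporting pages one at a time, as the last comment suggests, the loop can be very short. A sketch along those lines, again with placeholder URL, titles and field names:

# Sketch: export each page separately so pages with huge histories
# don't run into per-request limits. Everything here is a placeholder.
import requests

WIKI_INDEX = "http://wiki.example.org/index.php"   # placeholder wiki
page_titles = ["Main_Page", "Help:Contents"]       # placeholder list

for title in page_titles:
    resp = requests.post(
        WIKI_INDEX,
        params={"title": "Special:Export"},
        data={"pages": title, "history": "1"},
    )
    resp.raise_for_status()
    filename = title.replace("/", "_").replace(":", "_") + ".xml"
    with open(filename, "wb") as f:
        f.write(resp.content)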

Check out the tools available from WikiTeam: http://archiveteam.org/index.php?title=WikiTeam

I personally use WikiTeam's dumpgenerator.py, which is available here: https://github.com/WikiTeam/wikiteam

It depends on Python 2. You can get the software using Git or download the ZIP from GitHub:

git clone https://github.com/WikiTeam/wikiteam.git

The basic usage is:

python dumpgenerator.py http://wiki.domain.org --xml --images
  • Welcome to Super User! Can you add the relevant parts of the link into your answer? We ask this to help the OP out, so they will not have to search through information that may not pertain to them. This is also to preserve the relevant information in case the hosting site goes down. For more information, see this meta post. – Cfinley Aug 5 '15 at 19:48
