I want to scrape an entire wiki that runs on MediaWiki software. The number of pages is pretty small, but they have plenty of revisions, and I'd preferably like to scrape the revisions as well.

Unlike Wikipedia, the wiki does not offer database dumps. Are there any existing tools or scripts designed to scrape MediaWiki sites?

Accepted answer (3 votes)

If the maintainer of the wiki hasn't turned it off, you can export pages with their history through Special:Export. This will give you an XML dump similar to Wikipedia's database dumps, which you can then import into another wiki.
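
For example, here is a minimal sketch of such an export request in Python (using the requests library). The wiki URL and page titles are placeholders, and the form field names (pages, history, templates) should be checked against your wiki's Special:Export form, since they can vary between MediaWiki versions:

# Sketch: POST a newline-separated page list to Special:Export and
# save the resulting XML dump. All URLs and titles are placeholders.
import requests

WIKI_INDEX = "http://wiki.example.org/index.php"   # placeholder wiki
pages = ["Main_Page", "Some_Other_Page"]           # placeholder titles

resp = requests.post(
    WIKI_INDEX,
    params={"title": "Special:Export"},
    data={
        "pages": "\n".join(pages),  # one title per line, like the textarea
        "history": "1",             # request the full revision history
        "templates": "1",           # also include transcluded templates
    },
)
resp.raise_for_status()

with open("export.xml", "wb") as f:
    f.write(resp.content)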

Another way to obtain page history from MediaWiki in XML format is to use the prop=revisions API query. However, the API results format is somewhat different from that produced by Special:Export, so you'll probably have to process the output a bit before you can feed it to standard import scripts.
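
As a rough illustration, a history query through the API might look like the sketch below. The api.php URL and page title are placeholders; rvprop and rvlimit are standard revisions-module parameters, but very old MediaWiki versions handle continuation differently, so treat this as a starting point rather than a drop-in script:

# Sketch: fetch one page's revision history via api.php (prop=revisions).
# The endpoint and title are placeholders; continuation uses the modern
# "continue" convention.
import requests

API = "http://wiki.example.org/api.php"   # placeholder endpoint
params = {
    "action": "query",
    "format": "json",
    "prop": "revisions",
    "titles": "Main_Page",                # placeholder page title
    "rvprop": "ids|timestamp|user|comment|content",
    "rvlimit": "max",
}

revisions = []
while True:
    data = requests.get(API, params=params).json()
    for page in data["query"]["pages"].values():
        revisions.extend(page.get("revisions", []))
    if "continue" not in data:
        break
    params.update(data["continue"])       # follow the continuation token

print("Fetched %d revisions" % len(revisions))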

  • Thanks, but with Special:Export can I only export one page at a time? – apscience Feb 9 '12 at 4:37
  • No, you can export as many pages as you like; just separate the names with linefeeds. That's why the user interface has this whole big <textarea> just for the page list. – Ilmari Karonen Feb 9 '12 at 13:22
  • However, see the footnotes on the page I linked to: if some of the pages have a lot of revisions, you may find that you need to export them one by one anyway. You could always write a simple script to loop over the page list and send export requests for them one at a time. – Ilmari Karonen Feb 9 '12 at 13:30
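
If you do end up exporting pages one at a time, as the last comment suggests, the loop can be very short. A sketch along those lines, again with placeholder URL, titles and field names:

# Sketch: export each page separately so pages with huge histories
# don't run into per-request limits. Everything here is a placeholder.
import requests

WIKI_INDEX = "http://wiki.example.org/index.php"   # placeholder wiki
page_titles = ["Main_Page", "Help:Contents"]       # placeholder list

for title in page_titles:
    resp = requests.post(
        WIKI_INDEX,
        params={"title": "Special:Export"},
        data={"pages": title, "history": "1"},
    )
    resp.raise_for_status()
    filename = title.replace("/", "_").replace(":", "_") + ".xml"
    with open(filename, "wb") as f:
        f.write(resp.content)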

Check out the tools available from WikiTeam: http://archiveteam.org/index.php?title=WikiTeam

I personally use WikiTeam's dumpgenerator.py, which is available here: https://github.com/WikiTeam/wikiteam

It depends on Python 2. You can get the software using Git or download the ZIP from GitHub:

git clone https://github.com/WikiTeam/wikiteam.git

The basic usage is:

python dumpgenerator.py http://wiki.domain.org --xml --images
  • Welcome to Super User! Can you add the relevant parts of the link into your answer? We ask this to help the OP out, so they will not have to search through information that may not pertain to them. This is also to preserve the relevant information in case the hosting site goes down. For more information, see this meta post. – Cfinley Aug 5 '15 at 19:48
