Mediawiki Deleted Page and Spam Account Cleanup

Background
I've decided to remove all of the deleted pages from Mediawiki's database, along with removing spammer user accounts. My definition of spammer: (1) the account is currently blocked, or (2) the user's only revisions are deleted pages.

I'm making notes about the process, so that we know what to do next time. (Or, what not to do, if that happens to be the case).

Every non-spam page has been marked as "patrolled"; all of the spam has been deleted. This makes the cleanup a lot easier.

backup stuff
First, we backup the entire database.

mysqldump --complete-insert ob_wiki | xz -c > obwiki-pre-cleanup.sql.xz

And, let's make a separate backup of the user table. In case we accidentally delete an user we shouldn't, this will make it easy to put them back. Note that this form of mysqldump writes each wrote as a separate insert statement.

mysqldump --extended-insert=0 \ --complete-insert \ --no-create-info \ ob_wiki user obwiki-user.sql

Of course, you should rehearse this procedure, by loading the database dump into your own development environment.

deleted page removal
First, truncate the archive table. This removes metadata from deleted pages, along with their edit history. MYSQL> truncate table archive;

Next, remove the text of delete pages. (Media wiki store all text -- deleted and non-deleted in the same text table).

SHELL> php maintenance/purgeOldText.php --purge

Removing Orphaned Accounts
Mediawiki comes with maintenance/removeUnusedAccounts.php which is a pretty conservative account removal tool. It removes accounts that don't meet the following criteria:


 * (No edits, no deleted edits, no log entries, no current/old uploads)

I've made a modified version of this program, that omits the "no log entries" condition.

--- maintenance/removeUnusedAccounts-obit.php  2012-09-01 19:33:45.672785090 -0400 */       private function isInactiveAccount( $id, $master = false ) { $dbo = wfGetDB( $master ? DB_MASTER : DB_SLAVE ); !              $checks = array( 'revision' => 'rev', 'archive' => 'ar', 'logging' => 'log', !                                                'image' => 'img', 'oldimage' => 'oi', 'filearchive' => 'fa' ); $count = 0; $dbo->begin; --- 91,102         */        private function isInactiveAccount( $id, $master = false ) { $dbo = wfGetDB( $master ? DB_MASTER : DB_SLAVE ); !              $checks = array( 'revision' => 'rev', !                                  'archive' => 'ar', !                                  //'logging' => 'log', !                                  'image' => 'img', !                                  'oldimage' => 'oi', !                                  'filearchive' => 'fa' ); $count = 0; $dbo->begin;
 * maintenance/removeUnusedAccounts.php       2012-09-01 19:33:53.875901952 -0400
 * 91,98 ****
 * 91,98 ****

So, let's run this

php maintenance/removeUnusedAccounts-obit.php --delete

Finally, let's remove users that appear in the blocked list.

-- See who we're about to delete MYSQL> select ipb_address from ipblocks where ipb_address in (select user_name from user);

-- now delete them MYSQL> delete from user where user_name in (select ipb_address from ipblocks);

Cleaning IP Blocks
Now that we've removed a bunch of users, we no longer need blocks for those users.

Range based blocks contain text in ipb_range_start and ipb_range_end. For user blocks, these will be empty fields.

MYSQL> delete from ipblocks where ipb_range_start =  and ipb_range_end = ;

Stuff removed
These statistics come from a 1 Sep. 2012 cleanup run.


 * 3019 pages from archive, representing 1162 pages.
 * 4133 records from text
 * 3010 accounts from user
 * 2006 records from ipblocks