Mediawiki Deleted Page and Spam Account Cleanup

From wiki.occupyboston.org
Revision as of 19:58, 1 September 2012 by Srevilak (talk | contribs) (Notes on mediawiki cleanup. Initial revision.)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Background

I've decided to remove all of the deleted pages from Mediawiki's database, along with removing spammer user accounts. My definition of spammer: (1) the account is currently blocked, or (2) the user's only revisions are deleted pages.

I'm making notes about the process, so that we know what to do next time. (Or, what not to do, if that happens to be the case).

Every non-spam page has been marked as "patrolled"; all of the spam has been deleted. This makes the cleanup a lot easier.

backup stuff

First, we backup the entire database.

mysqldump --complete-insert ob_wiki | xz -c > obwiki-pre-cleanup.sql.xz

And, let's make a separate backup of the user table. In case we accidentally delete an user we shouldn't, this will make it easy to put them back. Note that this form of mysqldump writes each wrote as a separate insert statement.

mysqldump --extended-insert=0 \
  --complete-insert \
  --no-create-info \
  ob_wiki user  obwiki-user.sql

Of course, you should rehearse this procedure, by loading the database dump into your own development environment.

deleted page removal

First, truncate the archive table. This removes metadata from deleted pages, along with their edit history.

MYSQL> truncate table archive;

Next, remove the text of delete pages. (Media wiki store all text -- deleted and non-deleted in the same text table).

SHELL> php maintenance/purgeOldText.php --purge


Removing Orphaned Accounts

Mediawiki comes with maintenance/removeUnusedAccounts.php which is a pretty conservative account removal tool. It removes accounts that don't meet the following criteria:

(No edits, no deleted edits, no log entries, no current/old uploads)

I've made a modified version of this program, that omits the "no log entries" condition.

*** maintenance/removeUnusedAccounts.php        2012-09-01 19:33:53.875901952 -0400
--- maintenance/removeUnusedAccounts-obit.php   2012-09-01 19:33:45.672785090 -0400
***************
*** 91,98 ****
         */
        private function isInactiveAccount( $id, $master = false ) {
                $dbo = wfGetDB( $master ? DB_MASTER : DB_SLAVE );
!               $checks = array( 'revision' => 'rev', 'archive' => 'ar', 'logging' => 'log',
!                                                'image' => 'img', 'oldimage' => 'oi', 'filearchive' => 'fa' );
                $count = 0;
  
                $dbo->begin();
--- 91,102 ----
         */
        private function isInactiveAccount( $id, $master = false ) {
                $dbo = wfGetDB( $master ? DB_MASTER : DB_SLAVE );
!               $checks = array( 'revision' => 'rev', 
!                                  'archive' => 'ar', 
!                                  //'logging' => 'log',
!                                  'image' => 'img', 
!                                  'oldimage' => 'oi', 
!                                  'filearchive' => 'fa' );
                $count = 0;
  
                $dbo->begin();

So, let's run this

php maintenance/removeUnusedAccounts-obit.php --delete

Finally, let's remove users that appear in the blocked list.

-- See who we're about to delete
MYSQL> select ipb_address from ipblocks where ipb_address in (select user_name from user);

-- now delete them
MYSQL> delete from user where user_name in (select ipb_address from ipblocks);

Cleaning IP Blocks

Now that we've removed a bunch of users, we no longer need blocks for those users.

Range based blocks contain text in ipb_range_start and ipb_range_end. For user blocks, these will be empty fields.

MYSQL> delete from ipblocks where ipb_range_start = '' and ipb_range_end = '';


Stuff removed

These statistics come from a 1 Sep. 2012 cleanup run.

  • 3019 pages from archive, representing 1162 pages.
  • 4133 records from text
  • 3010 accounts from user
  • 2006 records from ipblocks