Clean Sweep 404 Manager for Movable Type

Clean Sweep is a plugin that helps administrators find and fix broken inbound links to their website. It was built to support two use cases:

  • to help users get a clean start with their blog by letting them completely restructure their permalink URLs, while the system adapts automatically by redirecting stale inbound links to the proper destination

  • to help users in the process of migrating to Movable Type who are forced to modify their web site's URL and permalink structure

Both of these use cases are about preserving a site's page rank through a major redesign.

Features and Benefits

  • Manage redirects in your blog using an easy-to-use interface.
  • Help maintain good SEO and page rank by keeping links fresh.
  • No need to hack Apache configuration files until you are sure your redirects are correct.

Screenshots

  • Dashboard Widget
  • Create a Redirect (map a 404 to a destination)
  • View a List of All 404s
  • View a List of Recommended Rewrite Rules (Apache)

Installation Instructions

  1. Unpack the Clean Sweep archive.
  2. Copy the contents of CleanSweep-1.1x/plugins to: /path/to/mt/plugins/ and copy the contents of CleanSweep-1.1x/mt-static to: /path/to/your/mt-static/.
  3. Create a page in Movable Type called "URL Not Found". Give it a basename of "404". Add whatever personalized message you want your visitors to see when Clean Sweep cannot map a request to the correct page or destination.
  4. Publish the page and remember the complete URL to this page on your published blog.
  5. Navigate to the Plugin Settings area for Clean Sweep.
  6. Enter the full URL of the "URL Not Found" page you created in step #3 into the "404 URL" configuration parameter for Clean Sweep.
  7. In your plugin settings area for Clean Sweep, make note of the Apache configuration directive that Clean Sweep asks that you place in your httpd.conf or in an .htaccess file.
  8. Add the Apache configuration directive to your web server. This may be placed in your httpd.conf file or in an .htaccess file located in the DocumentRoot for your blog.
  9. Restart Apache.
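
For steps 7 and 8, the directive to use is whatever the plugin settings screen shows. Purely as an illustration, the sketch below builds a directive of the shape quoted in the comments on this page (ErrorDocument 404 /mt/mt.cgi?__mode=404&blog_id=1). The function name and its parameters are hypothetical, and /mt/mt.cgi is an assumption about where Movable Type is installed.

```python
def error_document_directive(mt_cgi_path: str, blog_id: int) -> str:
    """Build an ErrorDocument line that hands 404s to Movable Type.

    mt_cgi_path is assumed to be a server-relative path such as /mt/mt.cgi.
    Always prefer the directive shown in the plugin settings screen.
    """
    # Apache answers with an external redirect when ErrorDocument is given a
    # full http:// URL, so this sketch insists on a server-relative path.
    if not mt_cgi_path.startswith("/"):
        raise ValueError("use a server-relative path, e.g. /mt/mt.cgi")
    return f"ErrorDocument 404 {mt_cgi_path}?__mode=404&blog_id={blog_id}"


print(error_document_directive("/mt/mt.cgi", 1))
```

The path check reflects documented Apache behavior: when ErrorDocument points at a full URL, the client is redirected and the original request path is not passed to the handler.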

How it Works

Once properly configured, Clean Sweep will track all inbound links that result in a 404. Administrators can monitor the list of 404s on their web site through a dedicated listing screen in the application found under the Manage > 404s menu, or through a convenient dashboard widget.

If Clean Sweep can determine what file to serve in place of a request that resulted in a 404, it will serve it. If all else fails, the plugin will serve a custom 404 page that you design.

Clean Sweep will also provide you with a list of Apache mod_rewrite rules that you can add to your web server's configuration to permanently redirect users to the proper resource, thereby bypassing the Clean Sweep plugin from that point forward for that specific set of links.
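
To give a feel for what such rules look like, here is a minimal sketch that formats one mod_rewrite rule per broken link. It is not the plugin's own generator: the function name and parameters are hypothetical. It assumes a permanent (301) redirect for mapped URLs and a "gone" (410) response for URLs with no destination.

```python
import re
from typing import Optional


def rewrite_rule(old_path: str, new_url: Optional[str], per_directory: bool = True) -> str:
    """Format one mod_rewrite rule for a path that returned a 404.

    old_path:      the missing path, e.g. /archives/000675.php
    new_url:       the destination, or None to mark the old path as gone
    per_directory: True for .htaccess, where the pattern is matched
                   without the leading slash; False for httpd.conf
    """
    path = old_path.lstrip("/") if per_directory else old_path
    pattern = "^" + re.escape(path) + "$"
    if new_url is None:
        # "-" stands in for the substitution; [G] returns 410 Gone.
        return f"RewriteRule {pattern} - [G]"
    return f"RewriteRule {pattern} {new_url} [R=301,L]"


print(rewrite_rule("/archives/000675.php",
                   "http://www.majordojo.com/2005/07/goodbye-bookque.php"))
```

The per_directory switch exists because mod_rewrite strips the directory prefix, including the leading slash, before matching patterns in .htaccess files, while patterns in the main server configuration see the full path.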

The Redirection Decision Making Process

Clean Sweep works through the following checks, in order, to decide what to serve in place of a requested file that could not be found on the file system:

  1. Check to see if a redirect has been set up by a user for the specific file being requested. If one exists, redirect the client to that file.

  2. Is the target resource using the entry id as a URL? This is a prevalent URL pattern for older MT installations. This will:

    Map: http://www.majordojo.com/archives/000675.php To: http://www.majordojo.com/2005/07/goodbye-bookque.php

  3. Is the target resource using underscores when it should be using hyphens? Many users have switched to using hyphens for purported SEO benefits. This rule looks for a file in the system with the same name, but using '-' instead of '_'. This will:

    Map: http://www.majordojo.com/2005/07/goodbye_bookque.php To: http://www.majordojo.com/2005/07/goodbye-bookque.php

  4. Is there a target resource with the same basename somewhere? If a user switches their primary mapping to use a date-based URL as opposed to a category-based URL, then this rule will apply. This will:

    Map: http://www.majordojo.com/personal-projects/goodbye-bookque.php To: http://www.majordojo.com/2005/07/goodbye-bookque.php

  5. Know of another common pattern? Let me know and I will add it!

  6. If all else fails, serve up the user's configured custom 404 URL.
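
The chain of checks above can be sketched in a few lines. This is an illustrative approximation, not the plugin's actual code (which is written in Perl inside Movable Type). The resolve function, its parameters, and the in-memory lookups are all hypothetical stand-ins, and the /archives/NNNNNN.ext pattern for entry-id URLs is assumed from the example in rule 2.

```python
import re
from typing import Dict, Set


def resolve(path: str,
            user_redirects: Dict[str, str],
            entries_by_id: Dict[int, str],
            published: Set[str],
            fallback_404: str) -> str:
    """Return the path to serve in place of a missing one."""
    # Rule 1: an administrator has mapped this exact path.
    if path in user_redirects:
        return user_redirects[path]

    # Rule 2: old-style entry-id URL, e.g. /archives/000675.php (assumed shape).
    match = re.fullmatch(r"/archives/0*(\d+)\.\w+", path)
    if match and int(match.group(1)) in entries_by_id:
        return entries_by_id[int(match.group(1))]

    # Rule 3: same path with hyphens instead of underscores.
    hyphenated = path.replace("_", "-")
    if hyphenated != path and hyphenated in published:
        return hyphenated

    # Rule 4: any published file that shares the basename.
    basename = path.rsplit("/", 1)[-1]
    for candidate in sorted(published):
        if candidate.rsplit("/", 1)[-1] == basename:
            return candidate

    # Rule 6: nothing matched, so serve the custom 404 page.
    return fallback_404


published = {"/2005/07/goodbye-bookque.php"}
entries = {675: "/2005/07/goodbye-bookque.php"}
print(resolve("/archives/000675.php", {}, entries, published, "/404.html"))
print(resolve("/personal-projects/goodbye-bookque.php", {}, {}, published, "/404.html"))
```

Ordering matters here: the explicit user mapping is checked first so that an administrator's choice always wins over the guesses made by the later rules.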

Reporting Bugs

For the duration of the beta, please use the comment form at the bottom of the page to report any bugs with Clean Sweep.

License

Clean Sweep is licensed under the GPL (v2).

Copyright

Donated to the Movable Type Open Source Project. Copyright 2007-2008 Six Apart Ltd.

55 Comments

Byrne

I'm glad it's not automated and hope that if you offer an automated version at some stage that the automation is optional. I really don't like handing over too much control to something else :)

Michele

I get this error after installing Clean Sweep: Can't call method "id" on an undefined value at /home/23753/domains/mysite.com/html/plugins/CleanSweep/lib/CleanSweep/CMS.pm line 111.

I am having the same issue. Did you figure out a cure for this?

Carlo, w3.myopenid.com...

You must add the Clean Sweep widget to the dashboard for the blog in question.

If you try and add the widget to the 'System Overview' dashboard, you will receive this error.

(Hint: The 'id' parameter in the MT error string refers to a blog id).

For some reason, the dashboard widget doesn't show up on my system (perhaps because I haven't collected any 404s yet). Something happened to it though, because now the graph in the dashboard widget covers up part of the Manage menu (or the Manage menu is displayed behind the dashboard graph).

The code snippet that's supposed to go in the httpd.conf file gets cut off for longer URLs. It's still selectable, but not fully displayed, which might be a little confusing to some people.

Also, the Location directive is not allowed in .htaccess (at least for Apache). If people are going to put the ErrorDocument in .htaccess, they need to leave off the Location wrapper.

I'm having trouble implementing this for my site. I'm using dynamic publishing for most of my templates, so each of my blogs has the .htaccess file MT creates. How do I modify this to work with Clean Sweep?

How does Clean Sweep's Apache directive in .htaccess reconcile with the MT4-generated directives?

As an example, MT4 generates the following:

<IfModule !mod_rewrite.c>
  # if mod_rewrite is unavailable, we forward any missing page
  # or unresolved directory index requests to mtview
  # if mtview.php can resolve the request, it returns a 200
  # result code which prevents any 4xx error code from going
  # to the server's access logs. However, an error will be
  # reported in the error log file. If this is your only choice,
  # and you want to suppress these messages, adding a "LogLevel crit"
  # directive within your VirtualHost or root configuration for
  # Apache will turn them off.
  ErrorDocument 404 /mtview.php
  ErrorDocument 403 /mtview.php
</IfModule>

I've put the Clean Sweep 404 directive BELOW this MT4 directive, but my 404s are still picked up by mtview.php and not by Clean Sweep. Does the CS directive need to be ABOVE the MT one?

@Kelly - in thinking more about it, I am thinking that under dynamic publishing there will need to be an alternative to mtview.php and mtview.cgi. That is the only way to have CS work with MTDP.

Thanks for responding so quickly, Byrne. It sounds like Clean Sweep is not yet compatible with sites published dynamically. Is this a fair assertion? If so, you might want to put that in the documentation above to avoid excess user frustration.

All the same, it still looks like a great plugin - and as soon as my bug report on dynamic publishing clears I'll switch my site to static publishing and will use it in earnest.

Hey Byrne.

I don't understand these steps:

Navigate to the Plugin Settings area for Clean Sweep. Enter in the full URL to your "URL Not Found" page you created in step #3. Copy that URL into the "404 URL" configuration parameter for Clean Sweep.

Where do I navigate to? When I go to System Overview > Plugins, I get a listing of plugins, but no place to add a URL.

Thanks, Eli

@Eli - you need to:

  1. navigate to the dashboard for the blog you wish to enable Clean Sweep for.
  2. from the Preferences menu, select Plugins
  3. listed there you will see Clean Sweep

That is where you need to be.

Burning question: Where do I "clean" the Clean Sweep log file?

Wow... I don't think you can yet. I will need to add that capability.

That would be smashing! You have the header check box there to check all (that doesn't check all, btw) but no 'Delete' button like on the other pages with this same UI layout, such as Entries, Pages, etc, etc.

Have another question for ya Byrne, what is this:

'

That right now has the highest count (15, last was 1 hr. ago). I assume that is an apostrophe, but what does it mean? Many thanks.

I've got the plugin installed, I added the line to my htaccess without the tags, I go to "Manage 404's" but it says: "No cleansweep_logs could be found"

I have not yet restarted Apache because I don't know how (with Media Temple). Could this be the reason why it's not working?

thanks, shane

I also forgot to ask, will this work with dynamic publishing?

Shane, it will say that until it logs some 404s. Then the log file will be created, and it will show them on the Manage > 404s screen.

I noticed that message too the first time I checked.

I do not know about dynamic publishing, as I do not use it. But I think that was discussed earlier here in the comments. It sounds like it might be possible with some modifications to mtview.php and the ErrorDocument definition.

Ok thanks a lot, Ken. Does anyone know how to modify the "mtview.php and the ErrorDocument definition" to work with Dynamic Publishing? Our site has over 5,000 entries, so static publishing is not an option.

thanks!

To get the functionality (sorta) of this plugin, you could just look in your log file. Do you use AwStats or some other stats package?

In other words I have no idea what to modify, and I won't make any suggestions because of that ;)

Hi,

I uploaded the folders to both locations, after which the console said it needed updating and according to the log was successful:

User 'admin' installed plugin 'Clean Sweep', version 1.02 (schema version 0.17).

CleanSweep shows under System, but when I go to plugins for a blog I get a 500 error (no details in the raw log). Under my Manage menu for the blog, I do have the new 404s item which appears fine.

I'm on MT 4.1 and Apache/2.2.8 with mod_rewrite installed (though I've never used it).

Help?

Installed this version today and when I try to show the broken link report on the dashboard I get this error:

Can't call method "id" on an undefined value at /Library/WebServer/CGI-Executables/mt/plugins/CleanSweep/lib/CleanSweep/CMS.pm line 123.

Just installed it but it's only showing "mt4/mt.cgi?_mode=404&blogid=1" in the 404 list without any of my actual 404.

Also, it's outputting "Reading Config" to my logs. Debug statement left in there?

Hmm, I fixed my error by changing my ErrorDocument to be a relative path vs. absolute in my .htaccess.

ErrorDocument 404 /mt4/mt.cgi...

My 404 log only contains one item: cgi-bin/MT-4.1-en/mt.cgi?_mode=404&blogid=2

No matter what URL I try to go to. I'm turning CleanSweep off (from my .htaccess) until I can figure out why this is - but does anyone have any thoughts? Byrne?

If I'd bother to read @seth's comment, then re-comment, I wouldn't have had to bother posting - or re-posting myself. OOoooops. It was late....

Similar to Carlo and kimonostereo, I'm getting: Can't call method "id" on an undefined value at /home/virtual/site50/fst/var/www/html/adam/cgi-bin/MT-4.1-en/plugins/CleanSweep/lib/CleanSweep/CMS.pm line 124.

This is when I try to go to my blog home page, where I have the module set to display. The only way I can get to my homepage is to disable CleanSweep. If it's linked to the wrong blog ID, where can I change it?

Adam

Ok, I think it's time for me to back away from the computers. First I had the same problem as someone else and didn't read thru previous posts. Now I've done all my homework, read thru everything, tested it a few times, and complained of a problem (see above).

Now I go back into my blog admin console, and sure enough, even with CleanSweep on, everything seems to be working ok. On my blogs homepage.

When I try to go to the system overview/dashboard, I find two different URL's being used: http://www.server.com/cgi-bin/MT-4.1-en/mt.cgi?blogid=0&mode=list404s&blog_id=2

http://www.server.com/cgi-bin/MT-4.1-en/mt.cgi?__mode=dashboard

And I also find that I get that error message (above) when I try to go to either page. If I just try to go to MT's homepage: http://www.server.com/cgi-bin/MT-4.1-en/mt.cgi

Same thing - error on line 124.

So, I'm not crazy, just was a little confused. I hope that this will help things a little.

Since I was having plugin challenges, found another way to bandaid my rearranged site.

http://httpd.apache.org/docs/2.2/mod/mod_alias.html#redirectpermanent

In my site's root htaccess, I'm just adding one entry per article (single line wrapping here):

RedirectPermanent /2008/03/watching-for-ov.html http://www.practicalsurveys.com/questionnaires/broadquestions.php

Tedious, but compared to spending more hours looking for an elegant solution... This command sends a 301 along with the redirect, so hopefully it will also prompt the search engines to update.

I don't recall anyone else mentioning it, but the navigations while using Clean Sweep (jump to start, page back, page forward, jump to end) aren't working for me - on the initial CleanSweep page, I can only jump to the end, and then from there, jump to the beginning, page back, or jump to the end. When I'm somewhere in the middle, the page forward never works.

Also, it would be nice to have the option to re-sort based on how recently the item appeared in the list - in addition to frequency of being requested, sometimes it's nice to make changes for things that are starting to go wrong.

Another idea: I used to have a different, much harder to deal with method of logging and fixing 404's - the page would note the referrer, time, date, etc. and log it into a file.

Now that Clean Sweep is handing it (quite nicely) I just had to take the logging part out since it's just logging hits to the 404.php page and not producing anything else - CleanSweep is keeping all the good stuff to itself.

I'm wondering if it might be possible to capture the referrers for two reasons: 1) if it's an honest, actual link, notifying the other site that the URL is changed 2) for the error page, where you could say something like "You just came from THIS page, and THIS is what you were requesting" - this way, visitors might be able to then use the search field on the page (in my case, it's right in with the message apologizing that the link wasn't found) to try to find where they're going.

Just another idea from an over-active mind...

Works well generally (MT 4.2rc2-en here). Problems I've noticed:

  • the "next" button on the list of 404s doesn't work. I have to go to "last" and then work backwards with "previous".

  • desperately need the ability to clear the log and reset all rules. Once the RewriteRules have been put into .htaccess, there's no point keeping the maps or logs for those entries. I have to do this manually in the database at the moment.

  • RewriteRule generation puts a leading slash on the URLs. This is incorrect for .htaccess use: I have to manually remove the slashes for it to work.

  • RewriteRule generation for "gone" URLs is incorrect. There should be a hyphen between the pattern and the flags to replace the missing substitution.

  • Should be able to sort the log by URL, not just by 404 count.

Keep up the good work!

http://majordojo.com/travelogue/photosfromtha.php An error occurred DBD::mysql::st execute failed: Column 'cleansweeploguri' cannot be null at /srv/www/vhosts/www.majordojo.com/htdocs/cgi-bin/mt/extlib/Data/ObjectDriver/Driver/DBI.pm line 348.

Hey I've just installed Clean Sweep. When I click the Manage -> 404s link I get the generic Movable Type error page except without any text. Do you have any ideas how I can begin to troubleshoot this?

What is the error message saying?

I get an error when I try to Clean or Reset a broken link from the management window. The error is: Can't locate object method "load" via package "CleanSweep::Log" at /var/www/vhosts/mydomain.com/cgi-bin/mt/plugins/CleanSweep/lib/CleanSweep/CMS.pm line 267.

Could that be a permissions issue? I tried altering permissions but still no luck.

Thanks for this plugin btw, it works great other than this issue.

Matt - I've been having similar issues for months :(

If it makes any difference, I'm using FastCGI on CentOS 32bit, with memcached enabled and mod_perl

Michele, maybe you have a similar configuration?

Matt

I'm using Ubuntu without memcached, fastcgi

Michele

Byrne - in step 3 do you mean create an index template or a "page"?

Nevermind... I see that it doesn't really matter if you create an index template or "page" or if you already have a 404 page. =)

Clean Sweep version 1.16 gives me two errors.

First:

Click on http://example.com/blog/this_site_is_not_here.html
Can't locate object method "_guess_file_path" via package "MT::App::CMS"

Second:

Click on "Generate Rewrite Rules"
Undefined subroutine &CleanSweep::CMS::plugin called
#####
Movable Type version 4.23-de with SQLite.
The URL for 404-Page is: http://example.com/blog/404.html

.htaccess
ErrorDocument 404 /mt/mt.cgi?__mode=404&blog_id=1

Just released 1.17. See if that fixes your problems.

Hi Byrne, the problems are fixed with Clean Sweep 1.17. :-) Thank you for this useful plugin. Sabine

Hi Byrne.

I cannot get this to work with subdomains like my url above but it does work on urls that have subfolders like http://gokboet.nu/album

The MT install is at http://gokboet.nu/mt/.

Maybe I am just not thinking in ways of subdomains.

This should work for sub-domains, but each sub-domain should be given its own 404 handler in your web server using some kind of virtualhost config.

I have this error in the 404s: "No cleansweep_logs could be found."

This would indicate that the plugin is not properly installed. Did it prompt you to upgrade your installation after you copied the files onto your server? Is "Clean Sweep" appearing in your list of plugins?

Thank you for your reply, I installed properly and upgraded to V. 4.21 and it appears in my plugins' list. But when I put apache code in my .htaccess, I get error 500.

Byrne and Amin, that is exactly what happens to me when I try it out on my subdomain webpage, the one that is linked in my comment.

I have added the correct stuff to the .htaccess, but it kills the site, as it does not understand the path to MT. I tried it with the full URL to the mt folder, but that did not help either.

MT is installed on our main url in an mt4/ folder so it supports other websites under our main domain and they work but with a subdomain (not folder, which on some servers are treated as subdomains) I cannot get it working.

Carina and Amin,

I am not sure how to troubleshoot the problems you are having. My hunch is that perhaps Apache does not permit some of the rules being specified in your .htaccess file?

I am not sure.

Have you or can you try to cut and paste the rules directly into your web server's httpd.conf or virtual host config file to see if that works?

Just installed with a fresh copy of 4.25, and I'm getting:

An error occurred

DBD::mysql::st execute failed: Column 'cleansweeploguri' cannot be null

Checked to make sure that the upgrades have been run, and everything appears to be ok - any help out there?

This is a slick blogging platform. Which is it?

Hey Byrne--is Clean Sweep in your github account? I didn't see it. Anyway, bug in the widget: CMS.pm, line 150 should use an equal sign, not a big arrow.

I can't believe I missed it when migrating all my work from code.sixapart.com! Here you go:

http://github.com/byrnereese/mt-plugin-cleansweep/

Does it work with MT3? I want to use it with MT 3.38. Please help, thanks.
