After years of very little activity and no way to upgrade Drupal from 6.0 because of modules missing in 7.0, this site will soon be archived, and all the existing content will be served as static pages.
Given the use of the Acidfree and Forum modules, plus RSS feeds, the task was a bit more difficult than the standard techniques blogged about elsewhere. In addition, I wanted to archive the site in a sub-dir and continue to serve the old pages from the old paths (as much as possible), so I needed a bunch of new .htaccess rules too. Putting the archive in a sub-dir lets me install another web site management tool on the web site in the future and keep both the new pages and the old Drupal pages around.
First check out all the info at:
The reason we use httrack instead of wget is to keep the internal links as they are, without a .html suffix being added. wget also adds .1, .2, etc. to downloaded pages in an unpredictable way and makes handling Drupal pagination messy, so fixing up the .html files becomes harder (one could partly do it by using sed scripts on the downloaded files to remove the .html from links and .htaccess RewriteRule commands to add it back in). HTTrack has a -N rewrite option to do this. The only problem is that it allows just a single rewrite rule, so all filename.jpg files become filename/index.jpg, etc. (That is easily fixed by the httrack patch provided below, which allows two -N rewrite rules.)
The steps below assume you have access to the site's folders to create/update .htaccess files and edit Drupal module code, the ability to install software (like httrack, if necessary), etc. Many of the steps are optional, depending on your requirements.
The goal: Try to keep the links in the archive as similar as possible to the original Drupal site. This means: no trailing / at the end of node links and, optionally, avoid changing non-html files to filename/index.suffix and keep them as filename.suffix.
Here's what I actually did:
- Make the web site clean, without any forms or input capability for non-logged in users. This involved a bunch of steps listed below. It still allows the admin user to create and update content. After the first few trial runs, I also noticed a bunch of broken links and incorrect url paths for some of the web pages, so I cleaned all that up too before the final run.
- Site Configuration, User Permissions (admin/user/permissions): remove all create/edit permissions from anonymous users for the Comment module, Forum module, etc.
- Disable "Add new comment" link for anonymous users. No need to do this for all users (which requires running a SQL command
update node set comment = '1';
on the node table) since it may be useful to allow admin to fix up comments if bad links are discovered in them during the mirroring process. So just change the User Permissions for the comment module, removing add/edit/update of comments for anonymous users. Do check that anonymous users never see any link to add new comments, any "Login to post comments" link, or any link to edit comments.
- The Forum module will still show "Login to create new forum topic". There is an outstanding bug report on the Drupal site about this; it has had a lot of discussion but is not yet fixed in the code. Work around it manually by editing the forum.module code:
function _forum_new($tid) {
// $forum_types['login'] = array('title' => t('Login to post new content in the forum.', ...
// don't show login button, moving to static web site
- [Optional] One of the external links above suggests installing and enabling "Disable all forms" module. I did do that, which also requires "Bad Judgement" module, but in the end, not sure this was really necessary. Seems like it is not easy/possible to undo that install (there are certainly no undo instructions), so avoid it if you can.
- Acidfree changes: go to Image module site configuration, and make the Preview image size larger than any image in your albums. I used 6400 x 6400. This will help avoid the unnecessary middle-sized Preview image in the Acidfree image views and help avoid messy links such as
?size=_original
etc. After this, only the Thumbnail and Original size will be displayed on the web site.
- Acidfree: Edit the code to remove the link to "Thumbnail" from below the image. Go to modules/image/image.module and change this function to return an empty array:
function image_link($type, $node, $main = 0) {
$links = array();
return $links; // don't show any links at bottom of image, moving to static web site
...
- [Optional] Acidfree: on the Site Configuration for Acidfree, set albums image count to 0 to allow unlimited thumbnails on album page to avoid pagination. I only had a max of 60 images per album, and since pagination results in ugly URLs, best to avoid it when possible.
- [Optional] Forum module changes: raise the number of forum topics per page to avoid pagination. I had around 140 topics, but the Forum module admin page only allows "Topics per page" to go up to 100. Edit the forum module admin.inc file and add 200 to the list of allowed numbers. Then go back to the admin page and select 200.
- [Optional] Remove RSS links.
<a href="feed/" class="feed-icon">...
at bottom of page can be removed by editing theme page.tpl.php and remove the feed line or comment it out: <?php /* creating static site, no rss feed anymore: print $feed_icons */ ?>
and then edit the includes/common.inc
file:
function drupal_add_feed($url = NULL, $title = '') {
static $stored_feed_links = array();
return $stored_feed_links; // example.com: don't want any RSS feed in header or elsewhere
...
- [Optional] Remove the query string used for the .css and .js files in the header. Edit the
includes/common.inc
file:
function drupal_get_css($css = NULL) {
...
$query_string = '?'. substr(variable_get('css_js_query_string', '0'), 0, 1);
$query_string = ''; // static archive created, don't need this
...
- Manually check that a non-logged in user does not see any forms or interactive links anymore.
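The manual check can be backed up by a rough scan of pages saved from an anonymous session (for example, from a trial httrack run). This is only a sketch: the function name find_interactive_pages and the two patterns (a form tag and a user/login link) are my assumptions about what counts as interactive, so adjust them for your theme.

```shell
# Hypothetical helper: list saved .html files that still contain a <form> tag
# or a link to user/login. The patterns are guesses, not a complete list.
find_interactive_pages() {
    local dir="$1"
    # -r recurse, -l print file names only, -i ignore case, -E extended regex
    grep -rliE --include='*.html' '<form[ >]|user/login' "$dir" | sort
}

# Usage (path is an example):
#   find_interactive_pages ~/static.archive.backup
```

An empty result does not prove the site is clean, it only means these two patterns were not found.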
- Locally installed httrack by running
./configure --prefix=$HOME && make && make install
. This now provides access to ~/bin/httrack on the web host. This was done to avoid running httrack over the network: since it needs to be run a couple of times to iron out the problems, it is best if this can be done on the web host.
- [Optional] Apply this patch to your httrack code to support
-N ?html?
to allow a separate file save pattern for html files.
- If you have a XML sitemap, regenerate it one last time and save over the sitemap files. For example:
http://www.example.com/sitemap.xml
. This can be later incorporated into a site-wide sitemap.
- Run httrack! Best to run it on a small section of your web site first, watch the pages being retrieved and make corrections as needed. Assuming you are keeping pre-existing static content, make httrack skip links to the appropriate folders as shown below:
~/bin/httrack http://www.example.com/ -O ~/static.archive.backup,~/httrack/cache -wq%v -s0 \
-N "?html?%p/%n/index%[page].%t" -N "%p/%n.%t" \
+www.example.com/* -*/feed -www.example.com/cgi-bin/* -www.example.com/files/* \
-www.example.com/fonts/* -www.example.com/gallery/* -www.example.com/images/*
If you are not using the patched httrack, use a single -N like this: -N "%p/%n/index%[page].%t"
.
If you use -W instead of -w, the first time httrack encounters an outside domain, it will show a prompt (due to the -W arg). Just type in * and ENTER here for it to skip such links and leave them unchanged.
The -O arg above keeps the archive and the httrack cache in a non-web-visible folder (not in public www, for example) so as to keep a backup. The -N argument does not use %h (host), which avoids a second level of directory under www/static.archive.
Add --depth=1 and --debug-log arguments to see debug information which will be stored in the ~/httrack/cache/hts-log.txt file.
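If the list of folders to skip is long, the exclusion filters can be generated instead of typed. A small sketch, assuming the same "-host/dir/*" filter form used in the command above; the function name build_skip_filters is made up for illustration.

```shell
# Hypothetical helper: print one httrack exclusion filter per directory,
# in the same -host/dir/* form used in the httrack command above.
build_skip_filters() {
    local host="$1"
    shift
    local dir
    for dir in "$@"; do
        # The leading '-' is passed as an argument so printf does not
        # mistake it for an option.
        printf '%s%s/%s/*\n' '-' "$host" "$dir"
    done
}

# Usage:
#   build_skip_filters www.example.com cgi-bin files fonts gallery images
# prints lines such as:
#   -www.example.com/cgi-bin/*
```

The printed filters can then be pasted into the httrack command line.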
- It is usually necessary to run httrack multiple times and fix the problems with the mirroring. Check the top-level mirror directories: do they look right? Look at the errors in the hts-log.txt file and fix them; they are usually bad links in the original site. Look at the hts-cache/new.lst and new.txt files to see if anything strange pops up there. All these problems are usually easily fixed by editing the original site content, comments or code pages.
- Before running sed on the retrieved files, make a copy of the static archive since it is likely the sed script may make mistakes in initial iterations and will need updates. That is why we use the folder name ~/static.archive.backup in the command above. Make a copy of it to ~/www/static.archive which is where we'll run sed and then run diff on it to see what all changed.
cp -pur ~/static.archive.backup ~/www/static.archive
is one example of making a directory copy, if the destination does not exist.
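Since cp behaves differently when the destination already exists (the source ends up nested inside it), a guard can help. A minimal sketch; the function name make_working_copy is hypothetical, and it uses plain cp -pr rather than the -u variant above.

```shell
# Hypothetical helper: copy the backup to a working directory, but refuse
# to run if the destination already exists.
make_working_copy() {
    local src="$1" dst="$2"
    if [ -e "$dst" ]; then
        echo "refusing to copy: $dst already exists" >&2
        return 1
    fi
    # -p preserve mode and timestamps, -r recurse into directories
    cp -pr "$src" "$dst"
}

# Usage:
#   make_working_copy ~/static.archive.backup ~/www/static.archive
```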
- Fix up the html code created by httrack. Any text edits are risky since the danger is that the edits may be applied to text that is not meant to be fixed up! There are multiple fixes needed and a single sed script can do it, shown in the next few points below:
- Since the Apache web server can serve index.html automatically from a folder, no need to have internal page links that end in
/index.html
, so that suffix can be removed. And just like normal Drupal pages, we don't want a trailing / either.
- httrack will change the root url
/
to be inside a folder called index
when run with the above args. We will manually move that index/index.html
one level up, so need to fix up all the links to it from the html pages.
- Remove the query string in the static css/js resources. (This is a Drupal issue, not an httrack one. Since the pages are static, there is no need to support updates and static resource query strings.)
- All changes above can be accomplished using this bash shell script:
SED_SCRIPT='# Fix up HTTrack pages
# Use top-level home page instead of HTTrack index/ folder
s|\.\./index/index\.html"|../"|g
# Removing any trailing /index.html text, no / needed at end
s|/index\.html"|"|g
s|"index\.html"|"./"|g'
find ~/www/static.archive -name "*.html" -type f -print0 | xargs -0 sed -i -e "$SED_SCRIPT"
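The three substitutions can be tried on a single line before letting them loose on the whole archive. A minimal sketch: the rules are the same as in the script above, while the function name fix_links and the sample line in the usage comment are made up for illustration.

```shell
# Same three substitutions as in the script above, without the comments.
SED_SCRIPT='s|\.\./index/index\.html"|../"|g
s|/index\.html"|"|g
s|"index\.html"|"./"|g'

# Hypothetical helper: apply the rules to stdin and print to stdout,
# so nothing on disk is modified.
fix_links() {
    sed -e "$SED_SCRIPT"
}

# Usage:
#   echo '<a href="../index/index.html">Home</a> <a href="../node/12/index.html">Node 12</a>' | fix_links
# prints:
#   <a href="../">Home</a> <a href="../node/12">Node 12</a>
```

Running it this way avoids sed -i, so a mistake in a rule shows up in the output instead of in the files.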
- Verify that sed did not change anything we didn't want changed. Since we changed all hrefs, the diff can be huge. But at the very least, we can look at the diff while skipping all the href lines. Example commands to experiment with:
diff -rbwBN -U 0 static.archive.backup static.archive > diff.full
less diff.full # for a quick glance at the full file. This will be huge.
egrep "^[-+][^-+]" diff.full | grep -v href | less # This should be empty or require manual fix ups.
- Move the index/index.html file created by httrack one level up so it is in the right place. Fix up the links - remove the
../
prefixes in the href links: sed -e 's/"\.\.\//"/g' < index/index.html > index.html
and make any other changes - especially look for the home page url hrefs.
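The move can be wrapped in a small function so it is easy to repeat after each trial run. A sketch only: promote_index is a hypothetical name, and the substitution is the same one as in the sed command above, written with | as the delimiter.

```shell
# Hypothetical helper: write <archive>/index.html from <archive>/index/index.html,
# stripping the leading ../ from quoted links since the page moves one level up.
promote_index() {
    local archive="$1"
    sed -e 's|"\.\./|"|g' < "$archive/index/index.html" > "$archive/index.html"
}

# Usage:
#   promote_index ~/www/static.archive
```

Note that a home page link already rewritten to "../" becomes an empty href after this substitution, which is one of the home page hrefs worth checking by hand.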
- Since we skipped some directories when running httrack, copy them over manually. In my example above,
files
needs to be copied over: cp -purv drupal-dir/files ~/www/static.archive/
- Create .htaccess rewrite rules. The static archive is in a top level folder called
static.archive
. Any incoming link that represents a folder or file in the static.archive/ subtree should be transparently rewritten to that path. All other paths should be left unchanged.
- Example htaccess updates: Put this somewhere appropriate in the .htaccess file:
RewriteEngine on
# Adding the trailing / here is necessary to keep other mod_dir or .htaccess
# rules from doing an external redirect to the hidden archive subdir.
# Must use external redirect here.
RewriteCond %{REQUEST_URI} !/$
RewriteCond %{REQUEST_URI} -d [or]
RewriteCond %{DOCUMENT_ROOT}/static.archive%{REQUEST_URI} -d
RewriteRule ^(.*)$ /$1/ [last,redirect=301]
# If necessary, override top level uri /. Otherwise, the RewriteCond
# below by default assumes we want the static archive root to be displayed.
# RewriteRule ^/*$ new-root-path [last]
# If archived drupal page exists, show it keeping the original url unchanged.
#
# Skip if already starting with /static.archive/...
RewriteCond %{REQUEST_URI} !^/static.archive/
# Skip if already pointing to any existing file anywhere on site
RewriteCond %{REQUEST_FILENAME} !-f
# Skip if already pointing to any existing directory anywhere on site
# but accept if it is the root URI /
RewriteCond %{REQUEST_URI} ^/*$ [or]
RewriteCond %{REQUEST_FILENAME} !-d
# Accept if the uri points to an actual file or dir in static archive
RewriteCond %{DOCUMENT_ROOT}/static.archive%{REQUEST_URI} -f [or]
RewriteCond %{DOCUMENT_ROOT}/static.archive%{REQUEST_URI} -d
# Checks passed, rewrite URL to use static archive file
RewriteRule (.*) static.archive/$1 [last]
# Direct user web access of archive directory is forbidden.
RewriteCond %{THE_REQUEST} ^(GET|HEAD)\ /static.archive [nocase]
RewriteCond %{ENV:REDIRECT_STATUS} !(403|404|500)
RewriteRule .* /404.shtml [nosubreq,redirect=404,last]
Note: be careful with the .htaccess changes. The above works for me but it all depends on each site so use the above only as a guideline for what needs to be done.
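To reason about which requests end up in the archive, the second block of conditions can be modelled roughly in shell. This is my simplified reading of those RewriteCond lines, not Apache behaviour: the function name resolve_uri is hypothetical, only a single / is treated as the root URI, and the trailing-slash redirect and the 404 rule are ignored.

```shell
# Hypothetical model: print the path a request URI would map to under the
# "archived drupal page exists" rules above.
resolve_uri() {
    local docroot="$1" uri="$2"
    # Already inside the archive: not rewritten by this rule.
    case "$uri" in
        /static.archive/*) echo "$uri"; return ;;
    esac
    # An existing file anywhere else on the site wins.
    if [ -f "$docroot$uri" ]; then echo "$uri"; return; fi
    # So does an existing directory, except for the root URI.
    if [ "$uri" != "/" ] && [ -d "$docroot$uri" ]; then echo "$uri"; return; fi
    # Otherwise use the archive if it has a matching file or directory.
    if [ -f "$docroot/static.archive$uri" ] || [ -d "$docroot/static.archive$uri" ]; then
        echo "/static.archive$uri"
        return
    fi
    # Nothing matched: leave the URI unchanged.
    echo "$uri"
}

# Usage:
#   resolve_uri ~/www /node/12/
```

Checking a handful of real paths against the live site with a browser is still needed, since other rules in the same .htaccess can interfere.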
- All archived .html pages will contain a HTML comment string:
Mirrored from ... by HTTrack ...
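That comment gives a quick way to tell mirrored pages apart from anything copied in by hand. A small sketch, assuming the comment text starts with "Mirrored from" as shown above; the function name list_unmarked_pages is made up.

```shell
# Hypothetical helper: list .html files under a directory that do NOT
# contain the "Mirrored from" comment.
list_unmarked_pages() {
    local dir="$1"
    # -L prints the names of files without a match
    grep -rL --include='*.html' 'Mirrored from' "$dir" | sort
}

# Usage:
#   list_unmarked_pages ~/www/static.archive
```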