Archiving web sites the Planet Bods way

Published on 6 August 2010 in Web Development, Planet Bods, web development

As I mentioned yesterday, I’ve been having a bit of a spring clean of my website and archiving several areas of content. It took me quite a while to come up with my archiving strategy so I thought I’d share what I decided to do.

Photography by Dolescum. Creative Commons licensed.

When coming up with my archiving plan I eventually decided on a number of steps I’d go through.

1. All the files would be moved into a subdirectory called /archived

Over the years pages on this site have come and gone and moved around. I keep finding directories with long-abandoned content, intermingled with stuff that’s regularly updated.

I wanted a major tidy-up, so I decided that all archived content would be moved to a sub-directory of www.planetbods.org/archived. The Wise and Sage Words of Tim Westwood, for example, would move from www.planetbods.org/theshed/westwood to www.planetbods.org/archived/westwood. All the file names would remain the same and a simple redirect would be put in place to transfer people seamlessly from one place to the other. File extensions – where they exist – would be kept the same.
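As a rough illustration of that redirect rule (a sketch only, not the code actually used on this site), the mapping boils down to swapping one path prefix for another and leaving the rest of the URL alone. The function name and the `index.html` file in the example are assumptions made for illustration.

```python
from typing import Optional


def redirect_target(path: str,
                    old_root: str = "/theshed/westwood",
                    new_root: str = "/archived/westwood") -> Optional[str]:
    """Return the archived URL path for a request, or None if it has not moved.

    Only the directory prefix changes; file names and extensions are kept.
    """
    # Match the section root itself, or anything underneath it.
    if path == old_root or path.startswith(old_root + "/"):
        return new_root + path[len(old_root):]
    return None


# index.html is a hypothetical file name used only for this example.
print(redirect_target("/theshed/westwood/index.html"))
```

Returning None for anything outside the old directory keeps unrelated pages, such as a neighbouring directory with a similar name, from being redirected by accident.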

Now I know there will be one person here sucking through their teeth shrieking “That’s bad SEO!!!!!” but frankly I don’t care. This is content I’m essentially abandoning. Some of it is so little read that it really doesn’t matter either way. I’m not fussed about pagerank or whatever. And it means everything is stashed away and I know exactly where it is.

It also meant one other thing: I would need to change all the links in the site to point to the new URLs. Not a major disaster – I tended to use local links in hand-built pages.
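Updating the links can be sketched as a search and replace over `href` and `src` attributes. This is an illustrative sketch rather than the method actually used here: it assumes root-relative links such as `/theshed/westwood/...`, and the function name is made up for the example.

```python
import re


def rewrite_links(html: str, old_root: str, new_root: str) -> str:
    """Point href/src attributes that start with old_root at new_root instead."""
    # The lookahead stops "/theshed/westwood" from matching "/theshed/westwoodish".
    pattern = re.compile(
        r"((?:href|src)=[\"'])" + re.escape(old_root) + r"(?=[/\"'#?])"
    )
    # A function is used as the replacement so new_root is inserted literally.
    return pattern.sub(lambda m: m.group(1) + new_root, html)


page = '<a href="/theshed/westwood/index.html">Westwood</a>'
print(rewrite_links(page, "/theshed/westwood", "/archived/westwood"))
```

Purely relative links (for example `page2.html` inside the same directory) need no change at all, since the files keep their names and move together.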

2. All sites would be saved as simple, straight HTML – all PHP or other dynamic code would be removed

Right now there are three different ways I use to build this site. Most of it is published within Movable Type, with shared template modules to control a standard design.

Several older sections are built in PHP files with global elements pulled in and coupled with lots of messy file manipulation. Originally each section had its own PHP file because I always liked the idea of different sections having different designs. Over the years, however, I finally managed to standardise on one PHP file which controls all the PHP pages. Well, all the ones except any pages I managed to forget to convert!

Then there’s tech 3 – Hitop. This is a HTML pre-processor called HitopLive that in some respects is a bit like PHP.

Hitop was around before PHP started to hit mass adoption, and it had some funky features that take me about 200 lines of code to do now in PHP. Plus I knew the people who wrote it, and my website used to be hosted on their server. At one time my entire site was built in it, but PHP began to dominate and Hitop development ended.

Over the years I’ve slowly but surely tried to convert my content, but there’s still about 100 pages of this site built in Hitop format. And it causes problems.

Last year I decided to move my site off my friend’s server as it was causing a few performance issues, and Movable Type was having to run off SQLite and was getting very slow. However, Hitop isn’t found on any hosting package and I didn’t have time to migrate all the remaining Hitop pages to PHP. In the end I bought a virtual server from Bytemark and managed to install Hitop on it – it was a nightmare, as Hitop hasn’t been updated since 2002 and I always struggle to get it to compile these days.

Eventually I managed to find an old Debian installation file (Hitop was briefly packaged for Debian systems) and coaxed it to install. My content was safe, but I knew it was just a temporary solution. The Hitop code had to go.

For content I was going to keep, I’d port it into Movable Type. However for stuff I was going to archive, that would be pointless effort. I decided to save the files as flat HTML instead – I just saved them using wget.
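The flattening itself was done with wget. Purely as an illustration of the same idea, here is a small Python stand-in (not what was actually run): request each page from the live server, so the PHP or Hitop code has already been executed, and write whatever comes back to a file of the same name. The function names are assumptions for the example, and so is saving directory-style URLs as `index.html`.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch(url: str) -> bytes:
    """Download the rendered page exactly as a browser would receive it."""
    with urlopen(url) as response:
        return response.read()


def local_name(url_path: str) -> str:
    """Turn a URL path into a relative file name, keeping any extension."""
    rel = url_path.lstrip("/")
    # Assumption: directory-style URLs are saved as index.html.
    if rel == "" or rel.endswith("/"):
        rel += "index.html"
    return rel


def save_flat(base_url: str, url_paths, out_dir, fetch=fetch):
    """Save the server's output for each path under out_dir; return the files."""
    saved = []
    for url_path in url_paths:
        target = Path(out_dir) / local_name(url_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(base_url + url_path))
        saved.append(target)
    return saved
```

A call might look like `save_flat("http://www.planetbods.org", ["/theshed/westwood/"], "flat")`. Passing in a different `fetch` function makes it possible to try the file handling without touching the network.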

After much deliberation, I decided to do that with the PHP files too – whilst I doubt PHP is going anywhere for some time, saving them entirely as flat HTML files would mean I could run them from anywhere in the future without problem. If I wanted to quickly run them off a flash drive, I could.

This decision caused just one problem – I had a set of three quizzes in which the user selected radio buttons and got a score based on their answers. In the end I converted them to pure text and added a magazine-style “Mostly A’s” section for working out the result. Not ideal, but frankly only about 2 people a month ever used them, so so what?

3. All archived sites would be entirely self contained

This was the decision that created the most work. I didn’t want any changes I make elsewhere on the site to potentially break the archived pages, so I decided they would be entirely standalone. All necessary images and CSS files that were previously shared would be copied, and links updated. I could move everything to an entirely different server on a completely different domain, and the archived content would just work.

JavaScript I stripped out – I have a standard set of files loaded on all pages. They all do relatively minor things and none of the pages I have so far archived actually used them, so out they went. I also stripped out a bit of navigation on some pages if it pointed to non-archived content.
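Finding what needs copying can be sketched as a scan of each flattened page for images and stylesheets that live outside the archive. This is a hedged illustration, not the process actually followed here: the class and function names are invented, and it assumes shared files are referenced with root-relative URLs.

```python
from html.parser import HTMLParser


class AssetFinder(HTMLParser):
    """Collect the image and stylesheet URLs referenced by a page."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.assets.append(attrs["src"])
        elif (tag == "link" and attrs.get("href")
              and (attrs.get("rel") or "").lower() == "stylesheet"):
            self.assets.append(attrs["href"])


def shared_assets(html: str, archive_root: str = "/archived/"):
    """List root-relative assets that are not already inside the archive."""
    finder = AssetFinder()
    finder.feed(html)
    return [url for url in finder.assets
            if url.startswith("/") and not url.startswith(archive_root)]
```

Each URL it returns is a file to copy into the archived directory, after which the reference in the page is updated to point at the copy.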
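Stripping the scripts from a flattened page can be done with a single regular expression. This is a minimal sketch rather than the exact approach used here, and it assumes the pages contain ordinary, well-formed `<script>` elements; the function name is made up.

```python
import re

# Matches a whole <script>...</script> element, inline or external,
# across multiple lines and regardless of letter case.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                       re.IGNORECASE | re.DOTALL)


def strip_scripts(html: str) -> str:
    """Remove every script element from a page."""
    return SCRIPT_RE.sub("", html)


print(strip_scripts('<p>a</p><script src="/js/site.js"></script><p>b</p>'))
# prints <p>a</p><p>b</p>
```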

One bit of JavaScript remained – Google Analytics code so I could continue to check user stats. This was put into the code that generated the “This page is no longer updated” banner below…

4. A “this page is no longer updated” banner would be added

Taking best practice from the BBC here – a lot of websites just let old webpages rot and rot and rot. They may have been killed off for years but you’d never know. Some years ago the BBC started doing something to point out such pages on its website – a “mothballed” banner proclaimed, very boldly and very dominantly, that the page was archived.

It makes sense and I’m surprised more people don’t do it. I decided I would.

I decided to make it a shared include, just in case I ever wanted to update the banner for any reason – the original reason I decided to do it was in case Google Analytics changed their JavaScript.

The banner is pulled in as two simple includes by PHP. One contains the CSS and Google Analytics JavaScript; the second contains the actual banner itself. All the flattened pages would be served as PHP (regardless of their original file type) but because it was very simple PHP, I would be able to change it very easily in the future if I had to.
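Adding the two includes to every flattened page is easy to script. The sketch below is an illustration under assumptions, not the code used on this site: the include file names are hypothetical, and it assumes each page has a normal `</head>` and `<body>` tag.

```python
import re

# Hypothetical include files: one for the CSS and analytics code,
# one for the banner markup itself.
HEAD_INCLUDE = "<?php include $_SERVER['DOCUMENT_ROOT'] . '/archived/banner-head.php'; ?>"
BODY_INCLUDE = "<?php include $_SERVER['DOCUMENT_ROOT'] . '/archived/banner.php'; ?>"


def add_banner(html: str) -> str:
    """Insert the head include before </head> and the banner after <body>."""
    html = re.sub(r"</head\s*>",
                  lambda m: HEAD_INCLUDE + "\n" + m.group(0),
                  html, count=1, flags=re.IGNORECASE)
    html = re.sub(r"<body\b[^>]*>",
                  lambda m: m.group(0) + "\n" + BODY_INCLUDE,
                  html, count=1, flags=re.IGNORECASE)
    return html
```

Because every page only carries the two include lines, changing the banner or the analytics code later means editing two files rather than every archived page.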

5. And finally, leave it as it is…

For me, one of the things I felt I really needed to do was one that took next to no effort. I wanted to leave it online looking just as it does now: the design shown to the user should reflect the way the webpage was when I archived it.

I wanted to do this because every now and then I come across a webpage that looks rather structured but consists of Times New Roman text on a white background – the stylesheets have gone; the images lost.

That was what I wanted to avoid in this process. Yes, URLs may change and there may be a whacking banner at the top, but each page would still look roughly the same.

Maybe in ten years’ time it will all be moot and the web browsers of the future won’t be able to display the pages I’ve archived now. I bloomin’ well hope that won’t be the case! But maybe by then only about one person a year will actually view the pages anyway…

And perhaps that’s the question – how long do you leave stuff online? Does there reach a point when it all becomes a trifle irrelevant?

Probably… but hey, server space is cheap. Right?