eoghan posted on November 15, 2007 16:28
 
    
        
        Anybody who tried to access the CLUAS home page, blogs or discussion board this morning (from about 6am GMT to 1pm GMT) would have seen a big fat 'sorry you can't access this website, mate' error.
These sorts of errors happen once in a while around these parts, so normally it's no big deal, just a bit of an inconvenience. However, this was the first time that access to parts of the site was deliberately blocked by our hosting company because there was too much traffic to the site (to be perfectly precise, access to CLUAS itself was not blocked, but our website was disconnected from the database that contains the content of our blogs and discussion board; the parts of the site that do not depend on the database were still accessible).
Being blocked for attracting too much traffic is good news (hey, hey we're attracting more users!) and bad news ("damn, we're going to have to toughen up the hosting infrastructure to deal with the increased levels of traffic").
While sorting out the problem with CLUAS.com's hosting company, I was curious to see that the increase in traffic was less because of an increase in humans accessing the site and more because of a (big) increase in visits from search engine 'bots' accessing CLUAS to retrieve our content for their own purposes (i.e. knowing what's on CLUAS so that they can present relevant results to their users when they search for various keywords).
So what did I have to do to persuade our hosters that we would no longer completely hog access to the server which hosts our database (and databases of other websites) and so they would be confident enough to allow CLUAS to reconnect with its database? Basically I reduced the traffic the site will get from search engine bots by:
- Reducing the frequency with which the Google 'bot' visits CLUAS from the default of 'Normal' to 'Slower' (it's possible to set this via CLUAS.com's account on Google's rather good Webmaster Tools)
- Adding extra lines into CLUAS.com's robots.txt file that instruct the Yahoo bot (called 'Slurp') to stop crawling any files in the parts of the site that are database-driven.
- To be sure I also barred another major Yahoo bot ("Yahooseeker") from crawling the entire website.
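For anyone curious what the robots.txt side of this looks like, here is a rough sketch rather than the actual CLUAS file. The /blogs/ and /discussion/ paths are made-up stand-ins for the database-driven parts of the site, and the bot names are simply the ones mentioned above. It uses Python's standard urllib.robotparser module to check that the rules would do what was intended.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the two Disallow paths under Slurp are illustrative
# stand-ins for the database-driven sections, not the real CLUAS paths.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /blogs/
Disallow: /discussion/

User-agent: Yahooseeker
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Check a few (bot, path) combinations against the rules.
checks = [
    ("Slurp", "/blogs/some-post"),        # database-driven: blocked for Slurp
    ("Slurp", "/index.html"),             # static page: still crawlable
    ("Yahooseeker", "/index.html"),       # barred from the whole site
    ("Googlebot", "/blogs/some-post"),    # no rule for this bot here
]
for agent, path in checks:
    print(f"{agent:12} {path:20} allowed={parser.can_fetch(agent, path)}")
```

Note that nothing in this file slows down the Google bot; in the steps above, its crawl rate was changed through Webmaster Tools, not through robots.txt.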
Following these steps the site was reconnected by the hosting company and, bingo, CLUAS was working again around lunchtime today.
The steps I had to take have their obvious downsides: some CLUAS pages will be indexed either less frequently or not at all by major search engines, which in time is going to reduce the amount of traffic we get from search engines (which has always been very good). But it is a short-term solution until I move the CLUAS site to a more robust hosting environment.
But one encouraging lesson I have learned from this is that the migration of CLUAS from its previous dated environment almost exactly one year ago is (finally!) paying dividends, at least judging by the fact that the search engines have decided our content merits a significant increase in the rate and frequency at which they trawl our site. I always knew it would be a long road but that, in the long run, CLUAS would be much better equipped for the future. We are certainly not there yet. There is a ton more to do. But the tools to do what we need are at our disposal.
Anyways. To mark the 1 year anniversary of the beginning of the massive (and ongoing) operation to bring CLUAS (kicking and screaming) into the 21st century I am going to, this week, start a series of blog entries that outline what CLUAS has done in the last year, with what technologies, why, and where this all may lead the site in the future (and, er, you can wake up from your slumber now).
Fret not, though: I will also be getting back into posting, in parallel, some more blog entries about music and technology and what is going on out there. Watch this space, etc.