Rebooting ColdFusion server nightmare

June 15, 2008

I recently came across an "interesting problem" (by which I mean nightmare) when rebooting a ColdFusion server. The server hosts approximately 30 web applications, all of which use a framework (mostly Fusebox). As soon as the server came back up, every website received multiple hits at once and the server was unable to start all of the applications, because of the overhead that a framework has when it is first initialised. Each request that initialises an application scope ties up a thread, and ColdFusion can only process a limited number of requests simultaneously; the rest are queued. The result is JRun taking 100% of the CPU time, the applications failing to start up, the request queue getting longer and the server becoming unresponsive. It's a vicious circle!

I believe that the Enterprise edition handles this much better than the Standard edition, but the issue still exists. As I understand it, frameworks use a named lock when starting up the application. So multiple requests to the same web application will result in a queue, and multiple requests to multiple sites will produce a very large queue, which is not good!
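To make the queueing concrete, here is a minimal sketch of the kind of start-up lock I mean. It is not taken from the Fusebox source: the function `loadFrameworkConfig`, the key `frameworkConfig` and the lock name are all made up for illustration, and it assumes the page runs inside a named application so that `application.applicationName` exists.

```cfml
<!--- Stand-in for the expensive part of a framework's first request (hypothetical) --->
<cffunction name="loadFrameworkConfig" returntype="struct" output="false">
  <cfset var config = StructNew() />
  <cfset config.loadedAt = Now() />
  <cfreturn config />
</cffunction>

<cfif Not StructKeyExists(application, "frameworkConfig")>
  <!--- every request that arrives before start up has finished queues here --->
  <cflock name="#application.applicationName#_init" type="exclusive" timeout="30">
    <!--- re-check inside the lock: an earlier request may already have done the work --->
    <cfif Not StructKeyExists(application, "frameworkConfig")>
      <cfset application.frameworkConfig = loadFrameworkConfig() />
    </cfif>
  </cflock>
</cfif>
```

Every request that arrives before `frameworkConfig` is set waits at the cflock for up to 30 seconds, holding a request thread the whole time, which is how a handful of hits per site turns into a long queue.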

I got the server up and running by stopping each individual web site at the web server level and then letting the sites start up one at a time (read the config files and initialise the application scope). This works, but it's a manual process and not recommended.

So, this leads me to wonder how to handle this programmatically. These are my possible solutions.

Request Tuning settings

This idea is a bit of a non-starter as far as I can tell, as the optimal request tuning settings for web sites once they have started up are completely different to the ones they need while they are starting up. I also wouldn't have been able to change them after the server rebooted, as the ColdFusion Administrator (under CFIDE) was unresponsive.

Server scope "token"

The idea here is that each application will have some code in the Application.cfc that checks for a server scope variable to see if any other applications are currently starting up. If so, the application aborts with a nice message. Something like this (please note this is "concept" code rather than the production code):

<cffunction name="onApplicationStart">
  <cfif Not StructKeyExists(server, "appstarting")
    OR Not server.appstarting>
  
    <!--- no applications on the server have been started or are starting --->
    <cfset server.appstarting = True />
    <!--- start up this application --->
    <cftry>
      <cflock timeout="30">
        ... start up code here ...
      </cflock>
      <cfset server.appstarting = False />
      <cfcatch>
        <!--- failed to start, so release the token --->
        <cfset server.appstarting = False />
        ... show nice error message ...
        <cfabort />
      </cfcatch>
    </cftry>
  <cfelse>
    ... show nice error message ...
    <cfabort />
  </cfif>
  ...
</cffunction>

Application scope "token"

The idea here is that the application only allows the first request to get as far as the named lock which starts the application. The Application.cfc checks for an application scope variable to see whether the application has already started or whether another request is currently starting it. Something like this (please note this is "concept" code rather than the production code):

<cffunction name="onApplicationStart">
  <cfif Not StructKeyExists(application, "starting")
    OR Not application.starting>
    <!--- application has not been started and isn't starting --->
    <cfset application.starting = True />
    <cfset application.started = False />
    <!--- start up the application --->
    <cftry>
      <cflock timeout="30">
        ... start up code here ...
      </cflock>
      <cfset application.started = True />
      <cfset application.starting = False />
      <cfcatch>
        <!--- application failed to start, so clear the flag to let a later request retry --->
        <cfset application.starting = False />
        ... show nice error message ...
        <cfabort />
      </cfcatch>
    </cftry>
  <cfelse>
    <!--- another request is already starting the application --->
    ... show nice error message ...
    <cfabort />
  </cfif>
  ...
</cffunction>

Start individual IIS websites with .NET

I have no idea how to do this in .NET. I just know that it can be done, so it should be considered as an option.

I haven't deployed any of these ideas, so I don't know which of them is the best. To me, the server variable idea looks best, followed by the application variable idea. I also saw a post on the Lynch Consulting Blog about long-running web service requests, which is well worth a look and could potentially be adapted to cope with this issue.

I'd welcome any thoughts!


3 comments

  1. Good post

    I'd prefer to handle it on the application level with CF. I think you might be over complicating it.

    I'd suggest lowering the CFLOCK (exclusive) timeout and catching the timeout error to display the friendly message. That way you could get rid of the application token (this is what we do).

    Figure out how long it takes to start up the framework with a single request, then set the timeout to that value * 1.5 (or whatever works).

    If the framework has issues starting up in multiple application scopes at the same time, I'd reckon that would be a serious issue of the framework itself.

    Comment by Brett S – June 15, 2008
  2. Hi Brett, Thanks for the comment.

    Setting the cflock timeout lower does not solve the problem as it means that the application tries to start up and fails (timeout) as the server is too busy. As the application failed to start, each subsequent request tries to start the application as well - which is a lot of overhead. I had Adobe support on the end of the phone and they think this is what caused the server to lock up.

    Starting up one application at a time takes a maximum of 5 seconds. The trouble is when you try to start 30 applications (each is a completely separate install) at the same time.

    Another solution occurred to me last night - I could use the same named lock for all applications (and then catch the timeout on queued requests), so that only one is starting up at any given moment. That might be easier...

    I'm just dreading my next reboot!

    Comment by John Whish – June 16, 2008
  3. @Brett,
    Just reading your comments again and wanted to add that:
    - Yes, I'm probably over complicating things. I just wanted to explore all avenues and then dismiss the crazy ones :)
    - Even if I set the timeout to 1 second, then each thread will wait for 1 second for a lock before failing. If you have lots of requests hitting your server then the queue of threads waiting for a lock grows very quickly.

    Comment by John Whish – June 16, 2008
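
The shared named lock idea from the comments could be sketched roughly as follows. This is untested concept code: the application name, the lock name `serverWideAppStartup`, the `startFramework` function and the 5 second timeout are all assumptions for illustration. It relies on named locks being shared across applications on the same ColdFusion instance, and on ColdFusion calling onApplicationStart again on a later request when it returns false.

```cfml
<cfcomponent output="false">
  <cfset this.name = "exampleApp" />

  <!--- Stand-in for the real framework start up (hypothetical) --->
  <cffunction name="startFramework" access="private" returntype="void" output="false">
    <cfset application.startedAt = Now() />
  </cffunction>

  <cffunction name="onApplicationStart" returntype="boolean" output="false">
    <cftry>
      <!--- same lock name in every application, so only one starts at a time --->
      <cflock name="serverWideAppStartup" type="exclusive" timeout="5">
        <cfset startFramework() />
      </cflock>
      <cfcatch type="lock">
        <!--- another application holds the lock: give up and let a later request retry --->
        <cfreturn false />
      </cfcatch>
    </cftry>
    <cfreturn true />
  </cffunction>
</cfcomponent>
```

Requests that time out still occupy a thread while they wait, so the timeout needs to stay short, which is the concern raised in the last comment. A friendlier message could be produced in the cfcatch instead of simply returning false.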
