Monk’ing IDS

May2011: “Monk” to the Rescue – playing a bit of a “private investigator” for a site that has an apparently random response time for a specific query of 8-10 minutes. Confident we found the issue last week – still testing and eliminating other possibilities. Fun work, even though it can be a bit of “hit and miss” until you start to hone in on a pattern, or on this case, something kept appearing that just “shouldn’t be there during this time of day.”

As it turns out, it was in fact the “rogue job” that wasn’t supposed to be running. And of course, it was an UPDATE STATISTICS job that had crept into a longer and longer job, not finishing nearly when it was supposed to finish. I wasn’t really looking for this as the culprit though. I was told that the client thought for sure it was a “locking issue.” And from what they described, their diagnosis sounded pretty good. So I was heading down that path to capture queries during that time, see who’s waiting on locks, etc… Much easier in 11.x if you fully understand and have used the new sysadmin database and the ph stuff. But since I was under the gun, I was starting to craft my own scripts (as we all have done over the years) to drill down a little a time.

Once I had some scripts in place, I decided to look closer at the historical “Watcher” data I had been collecting, focusing of course on the window of time the client specified. I noticed a few odd things, but nothing really stuck out. Some clients have to run a flavor of UPD STATS during the day, but typically they’re quick-hitters. I noticed a high value of “2” for UPD STATS jobs during the window mentioned. So at most, there were 2 UPD STATS jobs running. That seemed odd(er). I checked the crontab and the UPD STATS logs and the nightly UPD STATS was supposed to kick-off around 2am, and it only typically ran for a couple of hours. The “bad window” was 7-10am though! So for whatever reason, this job had spread out in to the morning work hours. I noticed it had a PDQ of only 50, and the DS_TOTAL_MEMORY was not set very high at all. So the PDQ wasn’t helping too much at all. I had just finished a new “MSTATS” package that generates all needed and up-to-date UPD STATS scripts for an engine, based completely on SQL and shell scripts. I decided to implement the new MSTATS instead of trying to backtrack as to why the old UPD STATS jobs ran longer than normal. (In Art’s defense, we all love do_stats and have for years. I wanted one that didn’t depend on esql/c for now. In John’s defense, the new “Auto Update Stats” will be the Jaguar of UPD STATS at some point, but it only works against logged databases. All these databases were non-logged, so …


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s