Cluster group takes five minutes to fail over
Guest






Posted: Wed Sep 21, 2005 12:50 pm    Post subject: Cluster group takes five minutes to fail over

We have several two-node clusters (all are Windows 2000 Advanced
Server, SP4, all hotfixes). A single group on one of these clusters
takes five minutes to manually fail over between nodes. It has 12 File
Share resources, a Generic Service resource (which is present on every
cluster; it runs our backup software), and the IP Address, Network
Name, and Physical Disk resources.

When I manually fail over the group, the resources go offline fairly
quickly until the group reaches the Physical Disk resource. That resource
takes about two minutes to go offline. The process is then reversed on
the other node; the Physical Disk resource takes about two minutes to
come online, and then all the other resources follow in normal fashion.

What could be causing this? I can't see anything in the event log that
indicates a problem. The shared storage is an IBM FAStT.
Chuck Timon [MSFT]
Guest





Posted: Wed Sep 21, 2005 12:50 pm    Post subject: Re: Cluster group takes five minutes to fail over

I would test by taking all the resources in the group offline. Then bring
just the disk resource online. Test failover several times between the
nodes. This will give you a baseline for how long a failover of the disk
resource should take when nothing obvious is holding handles open on it.

I would also like to know what third-party apps are running that may not be
cluster-aware but may be placing handles on the disk: things like
antivirus software, quota software, etc.

--
Chuck Timon, Jr.
Microsoft Corporation
CCE Beta Engineer
This posting is provided "AS IS" with no
warranties, and confers no rights.
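
The baseline test described above could be scripted roughly as follows. This is only a sketch under several assumptions: the other resources in the group have already been taken offline by hand, cluster.exe is on the PATH of a machine that also has Python, and the `cluster group ... /moveto:` command returns only once the move has finished. The function names and the injectable `run` and `clock` parameters are made up for illustration.

```python
import subprocess
import time


def cluster_cmd(kind, name, action):
    """Build an argv list for cluster.exe, e.g. cluster group "X" /moveto:NODE2."""
    return ["cluster", kind, name, action]


def timed(argv, run=subprocess.run, clock=time.monotonic):
    """Run one command and return the elapsed wall-clock seconds."""
    start = clock()
    run(argv, check=True)  # assumption: blocks until the operation completes
    return clock() - start


def baseline(group, nodes, trials=4, run=subprocess.run, clock=time.monotonic):
    """Move the group between the given nodes in turn, timing every move."""
    results = []
    for i in range(trials):
        target = nodes[i % len(nodes)]
        argv = cluster_cmd("group", group, "/moveto:" + target)
        results.append((target, timed(argv, run, clock)))
    return results
```

Calling `baseline("File Group", ["NODE2", "NODE1"])` (hypothetical group and node names) would return a list of (node, seconds) pairs. Comparing those numbers with the same test on one of the well-behaved groups should show whether the disk alone accounts for the delay.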
John Toner [MVP]
Guest





Posted: Wed Sep 21, 2005 4:51 pm    Post subject: Re: Cluster group takes five minutes to fail over

Are you using any multi-path software for your HBAs? If so, this might cause
this type of behavior. Also check your HBA firmware and drivers and make
sure these are current.

Regards,
John

Guest






Posted: Wed Sep 21, 2005 4:51 pm    Post subject: Re: Cluster group takes five minutes to fail over

Thanks for the rapid response, Chuck. I'll have to schedule an outage
before I can test your suggestion, as this is unfortunately a
production cluster.

The only third-party app that should be involved is NetShield (scan
engine version is 4.4.00, same as on our other clusters). It's also
worth noting that we have other groups on this cluster that don't
exhibit the same behaviour. No quota software is running.
Guest






Posted: Wed Sep 21, 2005 4:51 pm    Post subject: Re: Cluster group takes five minutes to fail over

Yes, we are using multipath software: IBM RDAC. But would that be likely
to cause the problem on one group and not on others?

The drivers and firmware are not completely up to date, so I'll start
working on resolving this.

MarkFox
Guest





Posted: Wed Sep 21, 2005 8:52 pm    Post subject: Re: Cluster group takes five minutes to fail over

You may also want to check your event logs to see if CHKDSK is running. In a
clustered environment, if the cluster detects corruption (dirty bit set), it
will automatically run CHKDSK to fix the problem before bringing the Physical
Disk resource online. CHKDSK is run on the node that the resource
is being moved to.
--
Mark
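
One way to check for the dirty bit mentioned above, besides the event log, is to run `chkntfs` against the volume on the node that currently owns the disk and read its answer. The sketch below assumes the English output wording ("X: is dirty." or "X: is not dirty."); the function names are hypothetical, and the drive letter in the usage note is a placeholder.

```python
import subprocess


def is_dirty(chkntfs_output):
    """Return True or False from chkntfs text, or None if neither phrase appears."""
    text = chkntfs_output.lower()
    # Test the negative phrase first: "is not dirty" never contains "is dirty",
    # but checking it first keeps the intent obvious.
    if "is not dirty" in text:
        return False
    if "is dirty" in text:
        return True
    return None  # unexpected or localized output; inspect it by hand


def query_volume(drive, run=subprocess.run):
    """Run chkntfs for one drive (e.g. "X:") and interpret the result."""
    result = run(["chkntfs", drive], capture_output=True, text=True)
    return is_dirty(result.stdout)
```

For example, `query_volume("X:")` run before a planned failover would indicate whether CHKDSK is likely to run when the disk comes online on the other node. A result of `None` means the output did not match either assumed phrase.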

