| Author |
Message |
Guest
|
Posted:
Wed Sep 21, 2005 12:50 pm Post subject:
Cluster group takes five minutes to fail over |
|
|
We have several two-node clusters (all are Windows 2000 Advanced
Server, SP 4, all hotfixes). A single group on one of these clusters
takes five minutes to manually fail over between nodes. It has 12 File
Share resources, a Generic Service resource (which is present on every
cluster; it runs our backup software), and the IP Address, Network
Name, and Physical Disk resources.
When I manually fail over the group, all the resources go offline in
fairly short order until the Physical Disk resource. That resource
takes about two minutes to go offline. The process is then reversed on
the other node; the Physical Disk resource takes about two minutes to
come online, and then all the other resources follow in normal fashion.
What could be causing this? I can't see anything to indicate problems
in the event log. The shared storage is an IBM FAStT. |
|
|
 |
Chuck Timon [MSFT]
Guest
|
Posted:
Wed Sep 21, 2005 12:50 pm Post subject:
Re: Cluster group takes five minutes to fail over |
|
|
I would test by taking all the resources in the group offline. Then bring
just the disk resource online. Test failover several times between the
nodes. This will give you a baseline of how long a failover for the disk
resource should take when nothing else has open handles on it.
I would also like to know what third-party apps are running that may not be
cluster-aware but may be placing handles on the disk: things like
anti-virus software, quota software, etc.
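The baseline test described above could be timed with a small script rather than a stopwatch. What follows is only a rough sketch in Python, not anything taken from the cluster tooling: `move_group` is a hypothetical callable that you would supply yourself (for example, a wrapper around whatever command you normally use to move the group), and it is assumed to block until the group is online on the target node. The node names are placeholders.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_moves(move_group: Callable[[str], None],
               nodes: List[str],
               rounds: int = 3) -> List[float]:
    """Move the group to each node in turn, `rounds` times over, and
    return the wall-clock duration of every move in seconds.

    `move_group` is a hypothetical, caller-supplied function; it is
    assumed to return only once the group is online on the given node.
    """
    durations = []
    for _ in range(rounds):
        for node in nodes:
            start = time.perf_counter()
            move_group(node)
            durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Reduce a list of move durations to a few baseline figures."""
    return {
        "runs": len(durations),
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Stand-in for a real move; replace with your own blocking call.
    def fake_move(node: str) -> None:
        time.sleep(0.01)

    print(summarize(time_moves(fake_move, ["NODE1", "NODE2"], rounds=2)))
```

Running it once with only the disk resource online, and again with the full group, would give two sets of figures to compare.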
--
Chuck Timon, Jr.
Microsoft Corporation
CCE Beta Engineer
This posting is provided "AS IS" with no
warranties, and confers no rights.
|
|
|
 |
John Toner [MVP]
Guest
|
Posted:
Wed Sep 21, 2005 4:51 pm Post subject:
Re: Cluster group takes five minutes to fail over |
|
|
Are you using any multi-path software for your HBAs? If so, this might cause
this type of behavior. Also check your HBA firmware and drivers and make
sure these are current.
Regards,
John
|
|
|
 |
Guest
|
Posted:
Wed Sep 21, 2005 4:51 pm Post subject:
Re: Cluster group takes five minutes to fail over |
|
|
Thanks for the rapid response, Chuck. I'll have to schedule an outage
before I can test your suggestion, as this is unfortunately a
production cluster.
The only third-party app that should be involved is NetShield (scan
engine version is 4.4.00, same as on our other clusters). It's also
worth noting that we have other groups on this cluster that don't
exhibit the same behaviour. No quota software is running. |
|
|
 |
Guest
|
Posted:
Wed Sep 21, 2005 4:51 pm Post subject:
Re: Cluster group takes five minutes to fail over |
|
|
Yes we are -- IBM RDAC software. But would that be likely to cause the
problem on one group and not on others?
The drivers and firmware are not completely up to date, so I'll start
working on resolving this.
John Toner [MVP] wrote:
| Quote: | Are you using any multi-path software for your HBAs? If so, this might cause
this type of behavior. Also check your HBA firmware and drivers and make
sure these are current.
Regards,
John
|
|
|
|
 |
MarkFox
Guest
|
Posted:
Wed Sep 21, 2005 8:52 pm Post subject:
Re: Cluster group takes five minutes to fail over |
|
|
You may also want to check your event logs to see if CHKDSK is running. In a
clustered environment, if the cluster detects corruption (the dirty bit is set
on the volume), it will automatically run CHKDSK to fix the problem before
bringing the Physical Disk resource online. CHKDSK is run on the node that the
resource is being moved to.
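If the event log has been exported to a plain-text file, a quick scan for CHKDSK mentions might look like the sketch below. This is only an illustration in Python: the export format (one entry per line) is an assumption, and `find_chkdsk_lines` is a made-up helper name, not part of any Windows or cluster tool.

```python
import re
from typing import Iterable, List, Tuple

# Case-insensitive, so "CHKDSK", "chkdsk" and "Chkdsk" all match.
CHKDSK_PATTERN = re.compile(r"chkdsk", re.IGNORECASE)


def find_chkdsk_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every line that mentions CHKDSK.

    `lines` is assumed to be the lines of a text export of the event
    log; line numbers start at 1.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if CHKDSK_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

It could be fed an open file object directly, since iterating over a text file yields its lines. Hits whose timestamps fall inside the failover window on the destination node would be the ones worth a closer look.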
--
Mark
"dbcricket@hotmail.com" wrote:
| Quote: | Yes we are -- IBM RDAC software. But would that be likely to cause the
problem on one group and not on others?
The drivers and firmware are not completely up to date, so I'll start
working on resolving this.
|
|
|
|
 |