


As you may have noticed, I have spent some hours on the vSphere vShield product recently, and I have come across a design flaw I would like to discuss with you.

First of all, let me briefly describe my test environment.

I have two physical HP boxes and an EMC SAN in my test lab. In this setup, vCenter is built as a VM sitting on one of the ESX hosts, so I can even take snapshots of it if I want to. However, this setup has generated some issues for the vShield product.

Symptoms:

When testing the installation and configuration of vShield, I normally install vShield on one host and move some test VMs onto it to see how the VMs respond. Then I vMotion the vCenter VM to that host and install vShield on the second host, since some vShield components require a host reboot. I had done that a couple of times. Eventually, it happened.

[Screenshot: shissue-03]

I initiated a vMotion of the vCenter VM from a host which has the zone, firewall, and vApp components to a host which doesn't have those settings. vCenter froze.

I waited a couple of minutes but was still not able to connect to vCenter. It was not even pingable.

So I connected directly to the new host with the vSphere Client and found that vCenter was up and running on the new host, but it was not pingable. Other VMs sitting on the same vSwitch had no issues at all, and I had vMotioned vCenter before installing vShield without any problems. Why couldn't I connect to the vCenter VM this time?

Cause:

The reason is simple: it is caused by vShield Zones and the other components. Let's take a look at what happens when I vMotion a normal VM to a host with vShield installed.

[Screenshot: shissue-01]

 

The normal procedure should be:

  1. Query
  2. Migrate the VM onto the new host.

 

However, as you can see from the picture, it actually reconfigures the VM afterwards.

Notice:

And if you monitor ping to the VM during the vMotion, the outage grows from one ping timeout to as many as ten, depending on how you configure vShield.

[Screenshot: shissue-02]

 

So what exactly does this reconfiguration step do?

The answer is that the virtual machine's vmx file gets reconfigured with vShield information. More importantly, this step is done by vCenter!

On a host with vShield products (such as Zones) installed, any VM vMotioned onto that host is automatically configured with the zone information. If the zone information is not configured, the VM will not be able to communicate with other VMs, even VMs on the same vSwitch, because the filtering happens at the vNIC level.

Now imagine what happens when you vMotion vCenter itself: nobody is left to modify the vCenter VM, because vCenter, the component that performs the reconfiguration, is temporarily disconnected from the network!

Solution:

I think this is a design flaw, since running vCenter as a VM is an option provided by VMware.

What I did was use PuTTY to connect to the ESX host and manually modify the vmx file of the vCenter VM.

This is what the old vmx file looks like. This host has all the vShield components.

[Screenshot: shissue-05]

We need to remove the filter0.name and param1 entries and add the vEndpoint entry to match whatever the new host has. The result is shown below.

[Screenshot: shissue-04]
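
For readers who cannot make out the screenshots, the edit amounts to something like the sketch below. The ethernet0.filter0.* keys follow the usual dvfilter naming, but the values and the exact vEndpoint key name are placeholders, so copy the real entries from a working VM on the destination host rather than from here.

    Lines removed from the vCenter VM's vmx (carried over from the old host):

        ethernet0.filter0.name = "<vShield filter name from the old host>"
        ethernet0.filter0.param1 = "<filter parameter from the old host>"

    Line added so the vmx matches the new host:

        <vEndpoint entry, as shown in the screenshot> = "<value copied from a working VM on the new host>"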

After the modification, vCenter is able to start and connect to the network.

Conclusion:

vShield is still a new product. VMware needs to resolve the issues that appear when vCenter runs as a VM, and let the host, instead of vCenter, reconfigure the vmx file every time a VM is vMotioned onto the host or newly registered.

Plus, the reconfiguration takes too long to finish. For important, time-sensitive machines, ten ping timeouts may not be acceptable.


One of the biggest changes in vSphere 4.1 is the introduction of Network I/O Control and Storage I/O Control.

This post will give you an introduction to Network I/O Control (NetIOC). It is a new technology and we still need to wait and see more real-world cases, but for now, let's look at what Network I/O Control is.

Why do we need Network I/O Control (NetIOC)?

1. A 1Gbit network is not enough

As you may know, the demands on network traffic keep growing: FT traffic, iSCSI traffic (don't you team it up?), NFS traffic, vMotion traffic, and so on. Although you can team multiple physical NICs together, from a single VM's perspective only one physical NIC can be used at a time, no matter what teaming method you use. Plus, the networking industry has already started to talk about 100Gbit networks, and it's about time to push 10Gbit networks into mainstream use.

2. Blade server demands

All new blade servers have a 10Gbit switch module in the chassis. The blade server architecture has changed from each blade having its own ports to a central Ethernet module. This saves a lot of resources, and traffic can easily have QoS applied and be scaled.

Prerequisites for Network I/O Control

1. You need an Enterprise Plus license

The reason is that NetIOC is only available on a vDS, which requires Enterprise Plus. On a vSS, you can only control outbound traffic.

2. You need vSphere 4.1 and ESX 4.1

With vSphere 4.0, you can control traffic by port group, but you can't configure traffic by type (or, if you prefer, by class). This is a fundamental architecture change, and we will talk about it later. ESX 4.1 is also required; otherwise you won't see the new tab in vCenter.

How does Network I/O Control (NetIOC) work?

If you recall, vSphere 4.0 also has ingress and egress traffic control for the vDS (for the vSS, only outbound control). Traffic shaping is controlled by average bandwidth, peak bandwidth, and burst size, and you have to divide the dvUplinks by function manually: this dvUplink is for FT, this one is for vMotion, this one is for iSCSI, and so on. Everything is done by hand.
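
As a rough mental model of those three shaping parameters (my own sketch, not VMware's implementation; the numbers are made up):

    # Hypothetical traffic-shaping values for one port group
    average_kbps = 100_000    # long-term rate the traffic is held to
    peak_kbps = 500_000       # ceiling allowed while a burst lasts
    burst_size_kb = 50_000    # maximum amount of data (KB) allowed in one burst

    # Roughly how long a burst at the peak rate can last before traffic is
    # throttled back toward the average rate:
    burst_seconds = burst_size_kb * 8 / peak_kbps
    print(f"a burst at peak rate lasts about {burst_seconds:.1f} seconds")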

With vSphere 4.1, we are not only able to control traffic by port group, we can also control traffic by class.

The NetIOC concept revolves around resource pools that are similar in many ways to the ones already existing for CPU and Memory.
NetIOC classifies traffic into six predefined resource pools as follows:
• vMotion
• iSCSI
• FT logging
• Management
• NFS
• Virtual machine traffic

If you open vCenter, you will see the new tab on the dvSwitch for your ESXi 4.1 hosts.

This means all traffic going through this vDS will have QoS applied according to these rules. Remember, it only works for this vDS.

Now, let's look at the architecture picture first, and then we will talk about how it works.

As you can see, there are three layers in NetIOC: the teaming policy, the shaper, and the scheduler. As my previous post mentioned, a vDS is actually a combination of a special hidden vSS and policy profiles downloaded from vCenter.

Teaming policy (new policy: LBT)

There is a new teaming method called LBT (load-based teaming). It detects how busy the physical NICs are and then moves flows to different cards. LBT will only move a flow when the mean send or receive utilization on an uplink exceeds 75 percent of capacity over a 30-second period, and it will not move flows more often than every 30 seconds.
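
Here is a minimal sketch of that decision rule (my own illustration in Python, not VMware code; only the 75 percent and 30-second values come from the documentation):

    THRESHOLD = 0.75                  # move a flow only above 75% mean utilization
    WINDOW_SECONDS = 30               # utilization is averaged over a 30-second period
    last_move_time = -WINDOW_SECONDS  # allow a move at time zero

    def should_move_flow(mean_utilization, now):
        """Decide whether an LBT-style rebalance would move a flow right now."""
        global last_move_time
        if now - last_move_time < WINDOW_SECONDS:
            return False              # never rebalance more often than every 30 seconds
        if mean_utilization > THRESHOLD:
            last_move_time = now
            return True               # uplink is busy enough to justify moving a flow
        return False

    # Example: 80% mean utilization at t=60s moves a flow; asking again at t=70s does not.
    print(should_move_flow(0.80, 60), should_move_flow(0.80, 70))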

I haven't done any tests on how many extra CPU cycles are required to run LBT, but we will keep an eye on it.

Limits

There are two attributes (shares and limits) you can use to control traffic via Resource Allocation. Resource Allocation is configured per vDS and only applies to that vDS; it works at the vDS level, not at the port group or dvUplink level. The shaper is where limits apply: it limits traffic by traffic class. Note that in 4.1, each vDS has its own resource pools, and resource pools are not shared between vDSes.

A user can specify an absolute shaping limit for a given resource-pool flow using a bandwidth capacity limiter. As opposed to shares that are enforced at the dvUplink level, limits are enforced on the overall vDS set of dvUplinks, which means that a flow of a given resource pool will never exceed a given limit for a vDS out of a given vSphere host.

Shares

Shares apply at the dvUplink level, and the share rates are calculated based on the traffic of each dvUplink. They control the relative share of the traffic going through a particular dvUplink and make sure the share percentages are honored.

The network flow scheduler is the entity responsible for enforcing shares and therefore is in charge of the overall arbitration under overcommitment. Each resource-pool flow has its own dedicated software queue inside the scheduler so that packets from a given resource pool won't be dropped due to high utilization by other flows.
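
To make shares concrete, here is a minimal sketch (my own illustration with hypothetical share values, not the real scheduler) of how a saturated dvUplink's bandwidth would be split in proportion to shares:

    UPLINK_CAPACITY_GBPS = 10.0

    # Hypothetical share values for three of the predefined resource pools
    pools = {"virtual_machine": 100, "vmotion": 50, "iscsi": 50}

    def allocate(shares, capacity_gbps):
        """Split a saturated uplink's bandwidth in proportion to shares."""
        total = sum(shares.values())
        return {name: capacity_gbps * s / total for name, s in shares.items()}

    print(allocate(pools, UPLINK_CAPACITY_GBPS))
    # -> {'virtual_machine': 5.0, 'vmotion': 2.5, 'iscsi': 2.5}

Unlike a limit, an idle pool's share of the bandwidth is simply redistributed to the busy pools, which is exactly the flexibility best practice 1 below refers to.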

NetIOC Best Practices

NetIOC is a very powerful feature that will make your vSphere deployment even more suitable for your I/O-consolidated datacenter. However, follow these best practices to optimize the usage of this feature:

Best practice 1: When using bandwidth allocation, use “shares” instead of “limits,” as the former has greater flexibility for unused capacity redistribution. Partitioning the available network bandwidth among different types of network traffic flows using limits has shortcomings. For instance, allocating 2Gbps bandwidth by using a limit for the virtual machine resource pool provides a maximum of 2Gbps bandwidth for all the virtual machine traffic even if the team is not saturated. In other words, limits impose hard limits on the amount of the bandwidth usage by a traffic flow even when there is network bandwidth available.

Best practice 2: If you are concerned about physical switch and/or physical network capacity, consider imposing limits on a given resource pool. For instance, you might want to put a limit on vMotion traffic flow to help in situations where multiple vMotion traffic flows initiated on different ESX hosts at the same time could possibly oversubscribe the physical network. By limiting the vMotion traffic bandwidth usage at the ESX host level, we can prevent the possibility of jeopardizing performance for other flows going through the same points of contention.

Best practice 3: Fault tolerance is a latency-sensitive traffic flow, so it is recommended to always set the corresponding resource-pool shares to a reasonably high relative value in the case of custom shares. However, in the case where you are using the predefined default shares value for VMware FT, leaving it set to high is recommended.

Best practice 4: We recommend that you use LBT as your vDS teaming policy while using NetIOC in order to maximize the networking capacity utilization.

NOTE: As LBT moves flows among uplinks it may occasionally cause reordering of packets at the receiver.

Best practice 5: Use the DV Port Group and Traffic Shaper features offered by the vDS to maximum effect when configuring the vDS. Configure each of the traffic flow types with a dedicated DV Port Group. Use DV Port Groups as a means to apply configuration policies to different traffic flow types, and more important, to provide additional Rx bandwidth controls through the use of Traffic Shaper. For instance, you might want to enable Traffic Shaper for the egress traffic on the DV Port Group used for vMotion. This can help in situations when multiple vMotions initiated on different vSphere hosts converge to the same destination vSphere server.

Let me know if you have more questions.

Reference:

http://h18000.www1.hp.com/products/blades/components/ethernet/10-10gb-f/specifications.html

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1022585

http://www.vmware.com/files/pdf/techpaper/VMW_Netioc_BestPractices.pdf


This is the second post on troubleshooting routing for ESX. I am only posting a diagram, without the actual commands, because every environment uses its own system to monitor and manage its ESX hosts. You may use vMA, PowerCLI, SSH, vCenter, or even SCVMM. However, the steps you follow to diagnose your system should be similar. Please let me know how you would do it.

– Silver


It's not easy to troubleshoot VMware ESX or vCenter. When an emergency happens (like a host disconnecting from vCenter), you may feel lost among a million things with no idea where to start. However, before you pick up the phone and call VMware support, there are a few things you can do, and this article is meant to show you the basic methods.

I'm giving an example; please let me know what you think about it.

-Silver


This is a fix-up for the PVSCSI issue in ESX 4.0 U2. Previous post: https://geeksilver.wordpress.com/2010/06/21/upgrade-esx-3-5-to-vsphere-upgrade-your-vms-to-get-performance-jump-part-1/

Basically, there are two errors you may encounter when adding a PVSCSI device to Windows 2003 and 2008.

1. PXE issue

After you add new disks, you encounter a PXE boot error. Even if the disk you installed is a secondary disk, you will still encounter this issue.

Reason:

VMware bug

New Updates:

I just spoke to VMware support, and they were able to reproduce this issue in their lab.

Solution:

VMware vCenter 4 U2 can't handle too many hardware additions and changes at the same time, so do one step at a time.

E.g., add the disk, then click OK. Go back into the settings, change the SCSI controller type, click OK, and so on.

New fix:

This issue is caused by the boot SCSI controller sequence changing after you delete the old SCSI controller and add a new one.

All you need to do is make sure SCSI (0:0) is in the first bootable position, as you can see in the diagram.

2. Windows blue screen

After changing the boot disk's SCSI controller type, Windows starts to boot, then you hit a blue screen and the system keeps restarting.

Cause: Windows doesn't have the required SCSI driver.

Solution: You must have all the SCSI drivers available in Device Manager before you attach disks of every type. If you build the machine with PVSCSI, you won't have the LSI SCSI driver, so you need to add a secondary LSI disk to let VMware Tools install the driver.

-Silver