<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Control Groups Resource Management</h1>

    <ul id="toc"></ul>

    <p>
      The QEMU and LXC drivers make use of the Linux "Control Groups" facility
      for applying resource management to their virtual machines and containers.
    </p>

    <h2><a name="requiredControllers">Required controllers</a></h2>

    <p>
      The control groups filesystem supports multiple "controllers". By default
      the init system (such as systemd) should mount all controllers compiled
      into the kernel at <code>/sys/fs/cgroup/$CONTROLLER-NAME</code>. Libvirt
      will never attempt to mount any controllers itself, merely detect where
      they are mounted.
    </p>
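
    <p>
      For example, to check which controllers the kernel provides and where
      the init system has mounted them, inspect <code>/proc/cgroups</code>
      and <code>/proc/mounts</code> (the exact output varies from host to
      host):
    </p>

    <pre>
# cat /proc/cgroups
# grep cgroup /proc/mounts
    </pre>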

    <p>
      The QEMU driver is capable of using the <code>cpuset</code>,
      <code>cpu</code>, <code>memory</code>, <code>blkio</code> and
      <code>devices</code> controllers. None of them are compulsory.
      If any controller is not mounted, the resource management APIs
      which use it will cease to operate. It is possible to explicitly
      turn off use of a controller, even when mounted, via the
      <code>/etc/libvirt/qemu.conf</code> configuration file.
    </p>
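
    <p>
      The setting in question is <code>cgroup_controllers</code>. As a
      sketch, the following restricts the QEMU driver to every controller
      except <code>blkio</code>; libvirtd must be restarted for the change
      to take effect:
    </p>

    <pre>
# /etc/libvirt/qemu.conf
cgroup_controllers = [ "cpu", "devices", "memory", "cpuset" ]
    </pre>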

    <p>
      The LXC driver is capable of using the <code>cpuset</code>,
      <code>cpu</code>, <code>cpuacct</code>, <code>freezer</code>,
      <code>memory</code>, <code>blkio</code> and <code>devices</code>
      controllers. The <code>cpuacct</code>, <code>devices</code>
      and <code>memory</code> controllers are compulsory. Without
      them mounted, no containers can be started. If any of the
      other controllers are not mounted, the resource management APIs
      which use them will cease to operate.
    </p>

    <h2><a name="currentLayout">Current cgroups layout</a></h2>

    <p>
      As of libvirt 1.0.5 or later, the cgroups layout created by libvirt has been
      simplified, in order to facilitate the setup of resource control policies by
      administrators / management applications. The new layout is based on the concepts
      of "partitions" and "consumers". A "consumer" is a cgroup which holds the
      processes for a single virtual machine or container. A "partition" is a cgroup
      which does not contain any processes, but can have resource controls applied.
      A "partition" will have zero or more child directories, each of which may be
      either a "consumer" or a "partition".
    </p>

    <p>
      As of libvirt 1.1.1 or later, the cgroups layout will have some slight
      differences when running on a host with systemd 205 or later. The overall
      tree structure is the same, but there are some differences in the naming
      conventions for the cgroup directories. Thus the following docs split
      in two, one describing systemd hosts and the other non-systemd hosts.
    </p>

    <h3><a name="currentLayoutSystemd">Systemd cgroups integration</a></h3>

    <p>
      On hosts which use systemd, each consumer maps to a systemd scope unit,
      while each partition maps to a systemd slice unit.
    </p>

    <h4><a name="systemdScope">Systemd scope naming</a></h4>

    <p>
      The systemd convention is for the scope name of virtual machines / containers
      to be of the general format <code>machine-$NAME.scope</code>. Libvirt forms the
      <code>$NAME</code> part of this by concatenating the driver type with the name
      of the guest, and then escaping any systemd reserved characters.
      So for a guest <code>demo</code> running under the <code>lxc</code> driver,
      we get a <code>$NAME</code> of <code>lxc-demo</code> which when escaped is
      <code>lxc\x2ddemo</code>. So the complete scope name is <code>machine-lxc\x2ddemo.scope</code>.
      The scope names map directly to the cgroup directory names.
    </p>
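
    <p>
      The escaping applied is the standard systemd unit name escaping, so the
      <code>systemd-escape</code> tool can be used to double-check a name.
      For example, for the <code>demo</code> guest above (the
      <code>systemctl</code> line assumes the guest is running):
    </p>

    <pre>
$ systemd-escape lxc-demo
lxc\x2ddemo
# systemctl status 'machine-lxc\x2ddemo.scope'
    </pre>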

    <h4><a name="systemdSlice">Systemd slice naming</a></h4>

    <p>
      The systemd convention for slice naming is that a slice name should
      include the names of all of its parents, prepended to its own name.
      So for a libvirt partition <code>/machine/engineering/testing</code>,
      the slice name will be <code>machine-engineering-testing.slice</code>.
      Again the slice names map directly to the cgroup directory names.
      Systemd creates three top level slices by default,
      <code>system.slice</code>, <code>user.slice</code> and
      <code>machine.slice</code>. All virtual machines or containers created
      by libvirt will be associated with <code>machine.slice</code> by default.
    </p>

    <h4><a name="systemdLayout">Systemd cgroup layout</a></h4>

    <p>
      Given this, a possible systemd cgroups layout involving 3 qemu guests,
      3 lxc containers and 3 custom child slices would be:
    </p>

    <pre>
$ROOT
  |
  +- system.slice
  |   |
  |   +- libvirtd.service
  |
  +- machine.slice
      |
      +- machine-qemu\x2dvm1.scope
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- machine-qemu\x2dvm2.scope
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- machine-qemu\x2dvm3.scope
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- machine-engineering.slice
      |   |
      |   +- machine-engineering-testing.slice
      |   |   |
      |   |   +- machine-lxc\x2dcontainer1.scope
      |   |
      |   +- machine-engineering-production.slice
      |       |
      |       +- machine-lxc\x2dcontainer2.scope
      |
      +- machine-marketing.slice
          |
          +- machine-lxc\x2dcontainer3.scope
    </pre>
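
    <p>
      On a systemd host the live tree can be compared against this picture
      with the standard systemd tools, for example:
    </p>

    <pre>
# systemd-cgls
# systemctl status machine.slice
    </pre>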

    <h3><a name="currentLayoutGeneric">Non-systemd cgroups layout</a></h3>

    <p>
      On hosts which do not use systemd, each consumer has a corresponding cgroup
      named <code>$VMNAME.libvirt-{qemu,lxc}</code>. Each consumer is associated
      with exactly one partition, which also has a corresponding cgroup usually
      named <code>$PARTNAME.partition</code>. The exceptions to this naming rule
      are the three top level default partitions, named <code>/system</code> (for
      system services), <code>/user</code> (for user login sessions) and
      <code>/machine</code> (for virtual machines and containers). By default
      every consumer will of course be associated with the <code>/machine</code>
      partition.
    </p>

    <p>
      Given this, a possible non-systemd cgroups layout involving 3 qemu guests,
      3 lxc containers and 3 custom child partitions would be:
    </p>

    <pre>
$ROOT
  |
  +- system
  |   |
  |   +- libvirtd.service
  |
  +- machine
      |
      +- vm1.libvirt-qemu
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- vm2.libvirt-qemu
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- vm3.libvirt-qemu
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- engineering.partition
      |   |
      |   +- testing.partition
      |   |   |
      |   |   +- container1.libvirt-lxc
      |   |
      |   +- production.partition
      |       |
      |       +- container2.libvirt-lxc
      |
      +- marketing.partition
          |
          +- container3.libvirt-lxc
    </pre>

    <h2><a name="customPartiton">Using custom partitions</a></h2>

    <p>
      If there is a need to apply resource constraints to groups of
      virtual machines or containers, then the single default
      partition <code>/machine</code> may not be sufficiently
      flexible. The administrator may wish to sub-divide the
      default partition, for example into "testing" and "production"
      partitions, and then assign each guest to a specific
      sub-partition. This is achieved via a small element addition
      to the guest domain XML config, just below the main <code>domain</code>
      element:
    </p>

    <pre>
  ...
  <resource>
    <partition>/machine/production</partition>
  </resource>
  ...
    </pre>
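
    <p>
      For example, one way to add this element to an existing guest named
      <code>demo</code> is with <code>virsh edit</code>; the new placement
      takes effect the next time the guest is started:
    </p>

    <pre>
# virsh edit demo
  (add the <resource> element shown above)
# virsh shutdown demo
# virsh start demo
    </pre>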

    <p>
      Note that the partition names in the guest XML use a
      generic naming format, not the low level naming convention
      required by the underlying host OS. That is, you should not include
      any of the <code>.partition</code> or <code>.slice</code>
      suffixes in the XML config. Given a partition name
      <code>/machine/production</code>, libvirt will automatically
      apply the platform specific translation required to get
      <code>/machine/production.partition</code> (non-systemd)
      or <code>/machine.slice/machine-production.slice</code>
      (systemd) as the underlying cgroup name.
    </p>

    <p>
      Libvirt will not auto-create the cgroups directory to back
      this partition. In the future, libvirt / virsh will provide
      APIs / commands to create custom partitions, but currently
      this is left as an exercise for the administrator.
    </p>

    <p>
      <strong>Note:</strong> the ability to place guests in custom
      partitions is only available with libvirt >= 1.0.5, using
      the new cgroup layout. The legacy cgroups layout described
      later in this document did not support customization per guest.
    </p>

    <h3><a name="createSystemd">Creating custom partitions (systemd)</a></h3>

    <p>
      Given the XML config above, the admin on a systemd based host would
      need to create a unit file <code>/etc/systemd/system/machine-production.slice</code>:
    </p>

    <pre>
# cat > /etc/systemd/system/machine-production.slice <<EOF
[Unit]
Description=VM production slice
Before=slices.target
Wants=machine.slice
EOF
# systemctl start machine-production.slice
    </pre>
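
    <p>
      Once the slice is active, resource controls can also be applied to the
      partition as a whole through systemd. As a sketch, the following raises
      the relative CPU weight of everything placed under the production
      partition (<code>CPUShares</code> is the property name used by systemd
      of this era; newer releases use different property names):
    </p>

    <pre>
# systemctl set-property machine-production.slice CPUShares=2048
    </pre>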

    <h3><a name="createNonSystemd">Creating custom partitions (non-systemd)</a></h3>

    <p>
      Given the XML config above, the admin on a non-systemd based host
      would need to create a cgroup named <code>/machine/production.partition</code>:
    </p>

    <pre>
# cd /sys/fs/cgroup
# for i in blkio cpu,cpuacct cpuset devices freezer memory net_cls perf_event
  do
    mkdir $i/machine/production.partition
  done
# for i in cpuset.cpus cpuset.mems
  do
    cat cpuset/machine/$i > cpuset/machine/production.partition/$i
  done
    </pre>
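
    <p>
      Resource controls for the partition can then be set by writing to its
      tunable files directly. As a sketch, assuming the cpu controller is
      co-mounted at <code>cpu,cpuacct</code> as in the loop above, the
      following raises the partition's relative CPU share:
    </p>

    <pre>
# echo 2048 > /sys/fs/cgroup/cpu,cpuacct/machine/production.partition/cpu.shares
    </pre>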

    <h2><a name="resourceAPIs">Resource management APIs/commands</a></h2>

    <p>
      Since libvirt aims to provide an API which is portable across
      hypervisors, the concept of cgroups is not exposed directly
      in the API or XML configuration. It is considered to be an
      internal implementation detail. Instead, libvirt provides a
      set of APIs for applying resource controls, which are then
      mapped to the corresponding cgroup tunables.
    </p>

    <h3>Scheduler tuning</h3>

    <p>
      Parameters from the "cpu" controller are exposed via the
      <code>schedinfo</code> command in virsh.
    </p>

    <pre>
# virsh schedinfo demo
Scheduler      : posix
cpu_shares     : 1024
vcpu_period    : 100000
vcpu_quota     : -1
emulator_period: 100000
emulator_quota : -1</pre>
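
    <p>
      The same command can also update these tunables. For example, to double
      the relative CPU shares of the <code>demo</code> guest:
    </p>

    <pre>
# virsh schedinfo demo --set cpu_shares=2048
    </pre>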


    <h3>Block I/O tuning</h3>

    <p>
      Parameters from the "blkio" controller are exposed via the
      <code>blkiotune</code> command in virsh.
    </p>


    <pre>
# virsh blkiotune demo
weight         : 500
device_weight  : </pre>
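
    <p>
      Here too the same command can change the settings, for example to raise
      the overall I/O weight of the <code>demo</code> guest:
    </p>

    <pre>
# virsh blkiotune demo --weight 600
    </pre>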

    <h3>Memory tuning</h3>

    <p>
      Parameters from the "memory" controller are exposed via the
      <code>memtune</code> command in virsh.
    </p>

    <pre>
# virsh memtune demo
hard_limit     : 580192
soft_limit     : unlimited
swap_hard_limit: unlimited
    </pre>
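
    <p>
      Limits can be adjusted in the same way, for example to set a soft limit
      of 512 MiB (values are given in KiB) on the <code>demo</code> guest:
    </p>

    <pre>
# virsh memtune demo --soft-limit 524288
    </pre>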

    <h3>Network tuning</h3>

    <p>
      The <code>net_cls</code> controller is not currently used. Instead,
      traffic filter policies are set directly against individual virtual
      network interfaces.
    </p>
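
    <p>
      For example, per-interface bandwidth limits are applied with the
      <code>domiftune</code> command rather than via cgroups (a sketch,
      assuming the <code>demo</code> guest has a target interface named
      <code>vnet0</code>; average rates are in kilobytes per second):
    </p>

    <pre>
# virsh domiftune demo vnet0 --inbound 1024 --outbound 1024
    </pre>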

    <h2><a name="legacyLayout">Legacy cgroups layout</a></h2>

    <p>
      Prior to libvirt 1.0.5, the cgroups layout created by libvirt was different
      from that described above, and did not allow for administrator customization.
      Libvirt used a fixed, 3-level hierarchy <code>libvirt/{qemu,lxc}/$VMNAME</code>
      which was rooted at the point in the hierarchy where libvirtd itself was
      located. So if libvirtd was placed at <code>/system/libvirtd.service</code>
      by systemd, the groups for each virtual machine / container would be located
      at <code>/system/libvirtd.service/libvirt/{qemu,lxc}/$VMNAME</code>. In addition
      to this, the QEMU driver created further child groups for each vCPU thread and
      the emulator thread(s). This led to a hierarchy that looked like:
    </p>


    <pre>
$ROOT
  |
  +- system
      |
      +- libvirtd.service
           |
           +- libvirt
               |
               +- qemu
               |   |
               |   +- vm1
               |   |   |
               |   |   +- emulator
               |   |   +- vcpu0
               |   |   +- vcpu1
               |   |
               |   +- vm2
               |   |   |
               |   |   +- emulator
               |   |   +- vcpu0
               |   |   +- vcpu1
               |   |
               |   +- vm3
               |       |
               |       +- emulator
               |       +- vcpu0
               |       +- vcpu1
               |
               +- lxc
                   |
                   +- container1
                   |
                   +- container2
                   |
                   +- container3
    </pre>

    <p>
      Although current releases are much improved, historically the use of deep
      hierarchies had a significant negative impact on kernel scalability.
      The legacy libvirt cgroups layout highlighted these problems, to the detriment
      of the performance of virtual machines and containers.
    </p>
  </body>
</html>