Beyond Virtual Machines and Hypervisors: Overview of Bare Metal Provisioning with OpenStack Cloud


http://www.mirantis.com/blog/bare-metal-provisioning-with-openstack-cloud/

Many people refer to 'cloud' and 'virtualization' in the same breath, and from there assume that the cloud is all about managing the virtual machines that run on your hypervisor. Currently, OpenStack supports virtual machine management through a number of hypervisors, the most widespread being KVM and Xen.

As it turns out, in certain circumstances, using virtualization is not optimal—for example, if there are substantial requirements for performance (e.g., I/O and CPU) that are not compatible with the overhead of virtualization. However, it’s still very convenient to utilize OpenStack features such as instance management, image management, authentication services and so forth for IaaS use cases that require provisioning on bare metal. In addressing these cases we implemented a driver for OpenStack compute, Nova, to support bare-metal provisioning.

Review of the status of bare-metal provisioning in OpenStack

When we undertook our first bare-metal provisioning implementation, there was code implemented by USC/ISI to support bare-metal provisioning on Tilera hardware. We weren't going to target Tilera hardware, but the other bits of that bare-metal implementation were quite useful. NTT Docomo also had code to support a more generic scheme using PXE boot and an IPMI-based power manager, but unfortunately it took some time to open source it, so we had to start development of the generic backend before the NTT Docomo code was released.

A blueprint on bare-metal provisioning can be found on the OpenStack Wiki here: General Bare Metal Provisioning Framework.

Bare-metal provisioning framework architecture

Our driver implements the standard interface for an OpenStack hypervisor driver, with the difference that it doesn't actually talk to any hypervisor. Instead it manages a pool of physical nodes. Each physical node can host only one "Virtual" (sorry for the pun) Machine (VM) instance. When a new provisioning request arrives, the driver chooses a physical host from the pool to place the VM on, and the instance stays there until it is destroyed. The operator can add, remove, and modify the physical nodes in the pool.
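As a rough illustration of that contract (this is not the actual driver code; the class shape and method signatures are trimmed-down assumptions), the driver boils down to a node pool plus spawn/destroy operations:

    # Simplified sketch of the bare-metal driver contract described above. The real
    # driver implements the full Nova virt driver interface; names and signatures
    # here are simplified assumptions.
    class BareMetalDriver(object):
        def __init__(self, node_pool):
            # node_pool: node id -> record with IPMI address, MAC, and bound instance
            self.node_pool = node_pool

        def spawn(self, context, instance, image_meta, network_info):
            # Pick any node that is not yet bound to an instance.
            node = next(n for n in self.node_pool.values() if n["instance"] is None)
            node["instance"] = instance["uuid"]
            # From here the driver plugs the node into the service network, queues a
            # deployment task for the agent, and powers the node on over IPMI.
            return node

        def destroy(self, context, instance, network_info):
            # Power the node down and return it to the free pool.
            for node in self.node_pool.values():
                if node["instance"] == instance["uuid"]:
                    node["instance"] = None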


[Figure: bare-metal provisioning architecture]

The main components related to the bare-metal provisioning support are:

  • nova-compute with the bare-metal driver: The bare-metal driver itself consists of several components:
    • The power manager is responsible for operations such as setting boot devices, powering nodes up and down, and so on. It is flexible enough to support several management protocol implementations (we developed two, based on IPMItool and FreeIPMI, to cover a wider range of hardware).
    • The network manager interacts with the rack switch and is responsible for switching nodes back and forth between the service network and the projects' networks (the service network is used to deploy the bare-metal instance via PXE/TFTP). Currently we have an implementation for Juniper switches. More details will be provided in another post devoted to networking support.
    • dnsmasq provides the netboot (DHCP/TFTP) environment for instance provisioning.
  • nova-baremetal-agent: The agent that runs on bootstrap-linux (see the next bullet) and executes the provisioning tasks spawned by the bare-metal driver.
  • bootstrap-linux: A tiny Linux image that is booted over the network and performs basic initialization. It is based on Tiny Core Linux and contains a basic set of packages, such as Python to run nova-baremetal-agent (which is implemented in Python) and curl to download an image from Glance. Additionally, it contains an init script that downloads nova-baremetal-agent using curl and executes it (a minimal sketch of such a script follows this list).
  • nova-baremetal-service: A service that is responsible for orchestration of the provisioning tasks (tasks are applied by nova-baremetal-agent directly to the bare-metal server it is running on).
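The bootstrap init script is tiny; as a hedged sketch (the service URL and file paths are placeholders, not the actual image contents), it amounts to something like:

    #!/bin/sh
    # Illustrative bootstrap init step: pull the agent from the bare-metal service
    # and hand control over to it. The URL is a placeholder.
    SERVICE_URL="http://nova-baremetal-service.example:8080"
    curl -o /tmp/nova-baremetal-agent.py "$SERVICE_URL/agent"
    python /tmp/nova-baremetal-agent.py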

Let's see what each component actually does in the course of provisioning a new VM (i.e., when you call nova boot). I won't go into the details of how this request travels until it reaches nova-compute and the spawn call lands in our bare-metal driver.

The following diagram illustrates this workflow:


[Figure: bare-metal provisioning flow]

  1. The driver chooses a free physical node from the pool.
  2. The node is plugged into the service network (a detailed blog post on networking is forthcoming, so I will skip that for now).
  3. The driver places a spawn task for the agent, which contains all the necessary information, such as what image to boot from.
  4. The driver issues IPMI commands to enable network boot for a node and power it up.
  5. Bootstrap Linux boots over the network from an image served by dnsmasq.
  6. Bootstrap Linux initialization scripts fetch the agent code from nova-baremetal-service (which provides a REST interface for that).
  7. nova-baremetal-agent polls the nova-baremetal-service REST service for tasks.
  8. nova-baremetal-service sees a task for this node and sends a response with the task, which includes a URL for the image from Glance and the authentication token to be able to fetch it.
  9. nova-baremetal-agent fetches the image from the URL specified in the task, 'dd's it to the hard drive, and then informs nova-baremetal-service that it's done with the task (this poll/fetch/write loop is sketched in code after the list).
  10. As soon as nova-baremetal-service is notified about task completion, it informs the driver that it’s time to reboot the node.
  11. The driver sees that provisioning is almost complete, so it switches the node over to the project's network.
  12. The driver sets the node to boot from the hard drive and reboots it.
  13. The node is up.
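Steps 7 through 9 boil down to a simple poll/fetch/write loop on the agent side. The sketch below only illustrates that loop; the service endpoint, REST paths, JSON field names and target disk are assumptions, not the actual nova-baremetal-agent code:

    # Illustrative sketch of the agent-side loop in steps 7-9. All URLs, JSON
    # fields and the target disk are placeholder assumptions.
    import json
    import subprocess
    import time
    import urllib.request

    SERVICE = "http://nova-baremetal-service.example:8080"  # placeholder endpoint
    NODE_ID = "node-01"                                      # placeholder node identity
    TARGET_DISK = "/dev/sda"

    def poll_for_task():
        # Step 7: keep asking the service for work; assume it returns JSON null
        # while no task is pending for this node.
        while True:
            with urllib.request.urlopen("%s/tasks/%s" % (SERVICE, NODE_ID)) as resp:
                task = json.load(resp)
            if task:
                return task
            time.sleep(5)

    def run_task(task):
        # Steps 8-9: stream the Glance image (URL and token come from the task)
        # straight onto the local disk with dd, then report completion.
        req = urllib.request.Request(task["image_url"],
                                     headers={"X-Auth-Token": task["auth_token"]})
        dd = subprocess.Popen(["dd", "of=" + TARGET_DISK, "bs=1M"],
                              stdin=subprocess.PIPE)
        with urllib.request.urlopen(req) as image:
            chunk = image.read(1024 * 1024)
            while chunk:
                dd.stdin.write(chunk)
                chunk = image.read(1024 * 1024)
        dd.stdin.close()
        dd.wait()
        # Step 10 trigger: tell the service we are done so it can power-cycle the node.
        done = urllib.request.Request("%s/tasks/%s/done" % (SERVICE, NODE_ID),
                                      data=b"", method="POST")
        urllib.request.urlopen(done)

    if __name__ == "__main__":
        run_task(poll_for_task())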

Configuration

A typical configuration for the compute will look like this:
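The exact option names varied between Nova releases and our patches, so treat the snippet below as an illustrative assumption rather than the literal flags; the point is that nova-compute is told to load the bare-metal driver and which power and network managers to use:

    # Illustrative nova.conf excerpt; option names are assumptions.
    compute_driver = nova.virt.baremetal.driver.BareMetalDriver
    baremetal_power_manager = ipmitool        # or a FreeIPMI-based manager
    baremetal_network_manager = juniper       # the rack-switch driver described above
    baremetal_tftp_root = /tftpboot           # served by dnsmasq for PXE boot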


But before the system becomes useful, we have to register switches and nodes. Information about them is stored in the database. We have created an extension to the OpenStack REST API to manage these objects, and two CLI clients for it: nova-baremetal-switchmanager and nova-baremetal-nodemanager. Let's use them to show how to add new switches and nodes.

Switches could be added using a command like this:
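A hypothetical invocation (the argument names are assumptions built from the parameters the tool takes, not its documented syntax):

    nova-baremetal-switchmanager add \
        --ip 10.0.0.2 \
        --user manager --password <secret> \
        --driver juniper \
        --description "rack 3 top-of-rack switch"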

You have to specify the IP address of the switch, credentials for the manager user, which switch driver to use, and an optional description.

nova-baremetal-switchmanager also supports other essential commands like list and delete. Once we have at least one switch, we can start adding nodes:
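Again, the argument names below are a hypothetical sketch; what matters is the set of attributes a node record carries:

    nova-baremetal-nodemanager add \
        --ip 10.0.1.21 \
        --mac 00:25:90:aa:bb:cc \
        --cpus 8 --ram 16384 --hdd 500 \
        --ipmi-host 10.0.2.21 --ipmi-user admin --ipmi-password <secret> \
        --switch-id 1 --port ge-0/0/5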

As you can see, it takes a few more options: the IP address of the node, the MAC address of its first network interface (used to identify the node), the number of CPUs, the amount of RAM in MB, the HDD capacity in GB, IPMI information, the ID of the switch it's connected to, and the name of the port on the switch.

After successful execution of this command, the specified node is added to the pool. With nova-baremetal-nodemanager you can also list and remove nodes in the pool, using the list and delete commands respectively.

Summary

Bare-metal provisioning has proved to be a useful and stable feature for our customers. It has other aspects, such as network management and image preparation, that we will cover in upcoming posts.

http://www.mirantis.com/blog/baremetal-provisioning-multi-tenancy-placement-control-isolation/

In a previous post, we introduced the bare-metal use cases for OpenStack Cloud and gave an overview of its bare-metal provisioning capabilities. Here, we're going to talk about how you can apply some of these approaches to a scenario mixing virtualization with isolation of key components.

Isolation requirements are pretty common for OpenStack deployments. In fact, one could simply say: "Without proper resource isolation you can wave goodbye to the public cloud." OpenStack tries to fulfill this need in a number of ways, involving (among many other things):

  • GUI & API authentication with Keystone
  • private images in Glance
  • security groups

However, if we go under the hood of OpenStack, we see a bunch of well-known open source components, such as KVM, iptables, Linux bridges, and iSCSI shares. How does OpenStack treat these components in terms of security? Frankly, it does hardly anything here. It is up to the sysadmin to go to each compute node and harden the underlying components on their own.

At Mirantis, one OpenStack deployment we dealt with had especially heavy security requirements. All the systems had to comply with several governmental standards for processing sensitive data, yet we still had to provide multi-tenancy. To satisfy the standards, we decided that "sensitive" tenants should be given isolated compute nodes with a hardened configuration.

The component responsible for distributing instances across an OpenStack cluster is nova-scheduler. Its most sophisticated scheduler type, FilterScheduler, enforces policies on instance placement based on "filters". For a given user request to spawn an instance, the filters determine the set of compute nodes capable of running it. A number of filters are already provided with the default nova-scheduler installation (they are listed here). However, none of them fully satisfied our requirements, so we decided to implement our own, which we called the "PlacementFilter".

The main goal of the PlacementFilter is to "reserve" a whole compute node for one tenant's instances only, thus isolating them from other tenants' instances at the hardware level. When a tenant is created, it can be marked as isolated; by default it is not. For isolated tenants, only designated compute nodes are used for provisioning VM instances. We define and assign these nodes to specific tenants manually, by creating a number of host aggregates. In short, host aggregates are a way to group compute nodes with similar capabilities or purpose. The job of the PlacementFilter is to pick the proper aggregate (set of compute nodes) for a given tenant, while regular (non-isolated) tenants use "shared" compute nodes for VM provisioning. In this deployment we also used OpenStack to provision bare-metal nodes. Bare-metal nodes are isolated by nature, so there is no need to assign them to a pool of isolated nodes for isolated tenants. (In fact, this post builds a bit on one of my previous posts about bare-metal provisioning.)

Solution architecture

During the initial cloud configuration, all servers dedicated to compute should be split into three pools:

  • servers for multi-tenant VMs
  • servers for single-tenant VMs
  • servers for bare-metal provisioning

Such grouping is required to introduce two types of tenants: "isolated tenants" and "common tenants". For isolated tenants, aggregates are used to create dedicated sets of compute nodes. These aggregates are later taken into account during the scheduling phase by the PlacementFilter.

The PlacementFilter has two missions:

  • schedule the VM on a compute node dedicated to the specific tenant, or on one of the default compute nodes if the tenant is non-isolated
  • schedule the VM on a bare-metal host if a bare-metal instance was requested (no aggregate is required here, as a bare-metal instance is isolated from other instances by nature, at the hardware level)

The PlacementFilter passes only bare-metal hosts if the 'compute_type' parameter in scheduler_hints is set to 'bare_metal'.

NOTE: We can instruct the scheduler to take our provisioning requirements into account by giving it so-called "hints" (the --hint option of the nova command); e.g., to specify the compute node's CPU architecture: --hint arch=i386. In the above case, the hint for bare metal is: nova boot ... --hint compute_type=bare_metal

If a non-bare-metal instance is requested, the filter looks up the aggregate of the project this instance belongs to and passes only hosts from that aggregate. If no aggregate is found for the project, a host from the default aggregate is chosen.
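To make that logic concrete, here is a simplified, standalone sketch of such a filter. The real PlacementFilter subclasses Nova's scheduler filter base class and reads aggregate membership through Nova's APIs; the class shape and data structures below are assumptions made so the decision logic stands on its own:

    # Simplified stand-in for the PlacementFilter decision logic. A real Nova
    # filter would subclass nova.scheduler.filters.BaseHostFilter and look up
    # aggregates through Nova's APIs; here plain dicts and sets are used.
    class PlacementFilter(object):
        def __init__(self, tenant_aggregates, default_hosts, baremetal_hosts):
            self.tenant_aggregates = tenant_aggregates  # project_id -> set of host names
            self.default_hosts = default_hosts          # shared pool for common tenants
            self.baremetal_hosts = baremetal_hosts      # hosts backed by the bare-metal driver

        def host_passes(self, host_name, filter_properties):
            hints = filter_properties.get("scheduler_hints") or {}
            project_id = filter_properties.get("project_id")

            # Bare-metal request: only bare-metal hosts pass, no aggregate needed.
            if hints.get("compute_type") == "bare_metal":
                return host_name in self.baremetal_hosts

            # Isolated tenant: only hosts from its dedicated aggregate pass.
            if project_id in self.tenant_aggregates:
                return host_name in self.tenant_aggregates[project_id]

            # Common tenant: fall back to the shared/default pool.
            return host_name in self.default_hosts

With project 'p1' mapped to hosts {'c1', 'c2'}, for example, a normal request from p1 passes only c1 and c2, while a request hinted with compute_type=bare_metal passes only hosts in the bare-metal pool.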

The following diagram illustrates how the PlacementFilter works for both bare-metal and virtual instances:


(1) A member of project#1 requests an instance on its own isolated set of compute nodes. The instance lands within the project's dedicated host aggregate.
(2) A member of project#1 requests a bare-metal instance. This time no aggregate is needed, as bare-metal nodes are by nature isolated at the hardware level, so the bare-metal node is taken from the general pool.
(3) Instances of tenants not assigned to any host aggregate land in the default "public" aggregate, where compute nodes can be shared among tenants' instances.

PlacementFilter setup

This is the procedure we follow to implement instance placement control:
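In broad strokes, on a stock FilterScheduler such a setup comes down to registering the filter with the scheduler and then creating an aggregate per isolated tenant; the option values, metadata key and names below are illustrative assumptions rather than the exact steps:

    # 1. Enable the custom filter on the scheduler node (nova.conf, illustrative):
    scheduler_available_filters = nova.scheduler.filters.all_filters
    scheduler_available_filters = placement_filter.PlacementFilter
    scheduler_default_filters = AvailabilityZoneFilter,RamFilter,ComputeFilter,PlacementFilter

    # 2. Create a dedicated aggregate for an isolated tenant and attach its hosts:
    nova aggregate-create tenant-a-aggregate
    nova aggregate-add-host <aggregate-id> compute-07
    nova aggregate-set-metadata <aggregate-id> project_id=<tenant-a-uuid>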
