NVMe Emulation

QEMU provides NVMe emulation through the nvme, nvme-ns and nvme-subsys devices.

The following sections describe how to add NVMe devices and which optional features are available.

Adding NVMe Devices

Controller Emulation

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express specification. All mandatory features are implemented, with a couple of exceptions and limitations:

  • Accounting numbers in the SMART/Health log page are reset when the device is power cycled.

  • Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the following parameters:

-drive file=nvm.img,if=none,id=nvm
-device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the nvme device. Some are mentioned here, but see -device nvme,help to list all possible parameters.

max_ioqpairs=UINT32 (default: 64)

Set the maximum number of allowed I/O queue pairs. This replaces the deprecated num_queues parameter.

msix_qsize=UINT16 (default: 65)

The number of MSI-X vectors that the device should support.

mdts=UINT8 (default: 7)

Set the Maximum Data Transfer Size of the device.

use-intel-id (default: off)

Since QEMU 5.2, the device uses a QEMU allocated “Red Hat” PCI Device and Vendor ID. Set this to on to revert to the unallocated Intel ID previously used.
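The mdts value is an exponent rather than a byte count: it is a power of two in units of the minimum memory page size (CAP.MPSMIN). The following is a minimal sketch of that arithmetic. The helper name mdts_bytes is hypothetical, and the 4096-byte page size is an assumption that should be checked against what the guest reports. The sketch does not treat mdts=0 specially (in the NVM Express specification, 0 means no limit).

```shell
# Convert an mdts exponent into a transfer limit in bytes.
# Usage: mdts_bytes MDTS PAGE_SIZE_BYTES
# PAGE_SIZE_BYTES is the minimum memory page size implied by CAP.MPSMIN;
# 4096 is an assumption used in the calls below.
mdts_bytes() {
    echo $(( (1 << $1) * $2 ))
}

mdts_bytes 7 4096   # default mdts=7      -> 524288 (512 KiB)
mdts_bytes 5 4096   # e.g. with mdts=5    -> 131072 (128 KiB)
```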

Additional Namespaces

In the simplest possible invocation sketched above, the device only supports a single namespace with the namespace identifier 1. To support multiple namespaces and additional features, the nvme-ns device must be used.

-device nvme,id=nvme-ctrl-0,serial=deadbeef
-drive file=nvm-1.img,if=none,id=nvm-1
-device nvme-ns,drive=nvm-1
-drive file=nvm-2.img,if=none,id=nvm-2
-device nvme-ns,drive=nvm-2

The namespaces defined by the nvme-ns device will attach to the most recently defined nvme-bus that is created by the nvme device. Namespace identifiers are allocated automatically, starting from 1.

There are a number of parameters available:

nsid (default: 0)

Explicitly set the namespace identifier.

uuid (default: autogenerated)

Set the UUID of the namespace. This will be reported as a “Namespace UUID” descriptor in the Namespace Identification Descriptor List.

eui64

Set the EUI-64 of the namespace. This will be reported as a “IEEE Extended Unique Identifier” descriptor in the Namespace Identification Descriptor List. Since machine type 6.1 a non-zero default value is used if the parameter is not provided. For earlier machine types the field defaults to 0.

bus

If more than one nvme device is defined, this parameter may be used to attach the namespace to a specific nvme device (identified by an id parameter on the controller device).
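As a sketch of the bus parameter, the following fragment defines two controllers and pins one namespace to each. The second serial, the image file names and the id values are illustrative assumptions.

```shell
-device nvme,id=nvme-ctrl-0,serial=deadbeef
-device nvme,id=nvme-ctrl-1,serial=cafebabe
-drive file=nvm-1.img,if=none,id=nvm-1
-device nvme-ns,drive=nvm-1,bus=nvme-ctrl-0
-drive file=nvm-2.img,if=none,id=nvm-2
-device nvme-ns,drive=nvm-2,bus=nvme-ctrl-1
```

Without bus=, both namespaces would attach to nvme-ctrl-1, since it is the most recently defined controller.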

NVM Subsystems

Additional features become available if the controller device (nvme) is linked to an NVM Subsystem device (nvme-subsys).

The NVM Subsystem emulation allows features such as shared namespaces and multipath I/O.

-device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
-device nvme,serial=deadbeef,subsys=nvme-subsys-0
-device nvme,serial=deadbeef,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers linked to an nvme-subsys device allows additional nvme-ns parameters:

shared (default: on since 6.2)

Specifies that the namespace will be attached to all controllers in the subsystem. If set to off, the namespace will remain a private namespace and may only be attached to a single controller at a time. Shared namespaces are always automatically attached to all controllers (also when controllers are hotplugged).

detached (default: off)

If set to on, the namespace will be available in the subsystem, but not attached to any controllers initially. A shared namespace with this set to on will never be automatically attached to controllers.

Thus, adding

-drive file=nvm-1.img,if=none,id=nvm-1
-device nvme-ns,drive=nvm-1,nsid=1
-drive file=nvm-2.img,if=none,id=nvm-2
-device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both controllers. NSID 3 will be a private namespace due to shared=off and only attachable to a single controller at a time. Additionally, it will not be attached to any controller initially (due to detached=on) or to hotplugged controllers.

Optional Features

Controller Memory Buffer

nvme device parameters related to the Controller Memory Buffer support:

cmb_size_mb=UINT32 (default: 0)

This adds a Controller Memory Buffer of the given size at offset zero in BAR 2.

legacy-cmb (default: off)

By default, the device uses the “v1.4 scheme” for the Controller Memory Buffer support (i.e., the CMB is initially disabled and must be explicitly enabled by the host). Set this to on to behave as a v1.3 device with respect to the CMB.
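For illustration, a controller with a 64 MiB Controller Memory Buffer could be requested as follows. The size and the image file name are arbitrary assumptions.

```shell
-drive file=nvm.img,if=none,id=nvm
-device nvme,serial=deadbeef,drive=nvm,cmb_size_mb=64
```

Adding legacy-cmb=on to the same -device line would select the v1.3 behaviour instead.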

Simple Copy

The device includes support for TP 4065 (“Simple Copy Command”). A number of additional nvme-ns device parameters may be used to control the Copy command limits:

mssrl=UINT16 (default: 128)

Set the Maximum Single Source Range Length (MSSRL). This is the maximum number of logical blocks that may be specified in each source range.

mcl=UINT32 (default: 128)

Set the Maximum Copy Length (MCL). This is the maximum number of logical blocks that may be specified in a Copy command (the total for all source ranges).

msrc=UINT8 (default: 127)

Set the Maximum Source Range Count (MSRC). This is the maximum number of source ranges that may be used in a Copy command. This is a 0’s based value.
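The three limits interact: a Copy command must satisfy all of them at once. The sketch below checks a list of source range lengths against given mssrl, mcl and msrc values. The function name copy_within_limits is hypothetical and the check only mirrors the descriptions above; it is not taken from the device implementation.

```shell
# Check source range lengths (in logical blocks) against the Copy limits.
# Usage: copy_within_limits MSSRL MCL MSRC LEN...
# Returns 0 if the request fits within all three limits, 1 otherwise.
copy_within_limits() {
    mssrl=$1 mcl=$2 msrc=$3
    shift 3
    # msrc is a 0's based value: msrc=127 permits 128 source ranges.
    [ "$#" -le $((msrc + 1)) ] || return 1
    total=0
    for len in "$@"; do
        [ "$len" -le "$mssrl" ] || return 1   # a single range is too long
        total=$((total + len))
    done
    [ "$total" -le "$mcl" ]                   # total over all source ranges
}

# With the default limits (mssrl=128, mcl=128, msrc=127):
copy_within_limits 128 128 127 64 64 && echo "fits"
copy_within_limits 128 128 127 64 65 || echo "exceeds mcl"
```

Note that with the defaults, mcl rather than msrc is the binding limit for most requests, because the total length may not exceed 128 logical blocks.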

Zoned Namespaces

A namespace may be “Zoned” as defined by TP 4053 (“Zoned Namespaces”). Set zoned=on on an nvme-ns device to configure it as a zoned namespace.

The namespace may be configured with additional parameters:

zoned.zone_size=SIZE (default: 128MiB)

Define the zone size (ZSZE).

zoned.zone_capacity=SIZE (default: 0)

Define the zone capacity (ZCAP). If left at the default (0), the zone capacity will equal the zone size.

zoned.descr_ext_size=UINT32 (default: 0)

Set the Zone Descriptor Extension Size (ZDES). Must be a multiple of 64 bytes.

zoned.cross_read=BOOL (default: off)

Set to on to allow reads to cross zone boundaries.

zoned.max_active=UINT32 (default: 0)

Set the maximum number of active resources (MAR). The default (0) allows all zones to be active.

zoned.max_open=UINT32 (default: 0)

Set the maximum number of open resources (MOR). The default (0) allows all zones to be open. If zoned.max_active is specified, this value must be less than or equal to that.

zoned.zasl=UINT8 (default: 0)

Set the maximum data transfer size for the Zone Append command. Like mdts, the value is specified as a power of two (2^n) and is in units of the minimum memory page size (CAP.MPSMIN). The default value (0) has this property inherit the mdts value.
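As a sketch, a zoned namespace combining several of these parameters might be configured as follows. The image file name, the ids and all sizes and limits are illustrative assumptions; max_open is kept less than or equal to max_active, as required above.

```shell
-drive file=zns.img,if=none,id=zns
-device nvme,id=nvme-ctrl-0,serial=deadbeef
-device nvme-ns,drive=zns,zoned=on,zoned.zone_size=64M,zoned.zone_capacity=48M,zoned.max_active=32,zoned.max_open=16
```

Here each 64 MiB zone exposes 48 MiB of writable capacity, and zoned.zasl is left at its default so that Zone Append inherits the mdts limit.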

Flexible Data Placement

The device may be configured to support TP4146 (“Flexible Data Placement”) by configuring it (fdp=on) on the subsystem:

-device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16

The subsystem emulates a single Endurance Group, on which Flexible Data Placement will be supported. Also note that the device emulation deviates slightly from the specification, by always enabling the “FDP Mode” feature on the controller if the subsystem is configured for Flexible Data Placement.

Enabling Flexible Data Placement on the subsystem enables the following parameters:

fdp.nrg (default: 1)

Set the number of Reclaim Groups.

fdp.nruh (default: 0)

Set the number of Reclaim Unit Handles. This is a mandatory parameter and must be non-zero.

fdp.runs (default: 96M)

Set the Reclaim Unit Nominal Size. Defaults to 96 MiB.

Namespaces within this subsystem may request Reclaim Unit Handles:

-device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST

The RUHLIST is a semicolon-separated list (e.g. 0;1;2;3) and may include ranges (e.g. 0;8-15). If no reclaim unit handle list is specified, the controller will assign the controller-specified reclaim unit handle to placement handle identifier 0.
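To make the list syntax concrete, the sketch below expands a RUHLIST into individual handle numbers. The helper name expand_ruhlist is hypothetical, and treating ranges as inclusive on both ends is an assumption. Note also that when the option is typed in a shell, the value must be quoted, because an unquoted semicolon ends the command.

```shell
# Expand a RUHLIST such as "0;8-15" into a space-separated list of handles.
# Ranges are assumed to be inclusive on both ends.
expand_ruhlist() {
    old_ifs=$IFS
    IFS=';'
    set -- $1          # split the list on semicolons
    IFS=$old_ifs
    out=""
    for item in "$@"; do
        case $item in
            *-*)
                lo=${item%-*}
                hi=${item#*-}
                i=$lo
                while [ "$i" -le "$hi" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)
                out="$out $item"
                ;;
        esac
    done
    echo "${out# }"
}

expand_ruhlist '0;8-15'   # prints: 0 8 9 10 11 12 13 14 15

# Quoted form for use on a shell command line:
#   -device 'nvme-ns,drive=nvm-1,fdp.ruhs=0;8-15'
```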

Metadata

The virtual namespace device supports LBA metadata in the form of separate metadata (MPTR-based) and extended LBAs.

ms=UINT16 (default: 0)

Defines the number of metadata bytes per LBA.

mset=UINT8 (default: 0)

Set to 1 to enable extended LBAs.

End-to-End Data Protection

The virtual namespace device supports DIF- and DIX-based protection information (depending on mset).

pi=UINT8 (default: 0)

Enable protection information of the specified type (type 1, 2 or 3).

pil=UINT8 (default: 0)

Controls the location of the protection information within the metadata. Set to 1 to transfer protection information as the first bytes of metadata. Otherwise, the protection information is transferred as the last bytes of metadata.

pif=UINT8 (default: 0)

By default, the namespace device uses 16 bit guard protection information format (pif=0). Set to 2 to enable 64 bit guard protection information format. This requires at least 16 bytes of metadata. Note that pif=1 (32 bit guards) is currently not supported.
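As a sketch of how the metadata and protection information parameters combine, the following fragment requests 16 bytes of metadata per LBA, transferred as extended LBAs, with type 1 protection information using the 64 bit guard format. The image file name and the ids are illustrative assumptions.

```shell
-drive file=nvm-1.img,if=none,id=nvm-1
-device nvme,id=nvme-ctrl-0,serial=deadbeef
-device nvme-ns,drive=nvm-1,ms=16,mset=1,pi=1,pif=2
```

The ms=16 setting satisfies the 16 byte minimum for pif=2. Because pil is left at its default, the protection information is transferred as the last bytes of the metadata.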

Virtualization Enhancements and SR-IOV (Experimental Support)

The nvme device supports Single Root I/O Virtualization and Sharing along with Virtualization Enhancements. The controller has to be linked to an NVM Subsystem device (nvme-subsys) for use with SR-IOV.

A number of parameters are present (please note, that they may be subject to change):

sriov_max_vfs (default: 0)

Indicates the maximum number of PCIe virtual functions supported by the controller. Specifying a non-zero value enables reporting of both SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities by the NVMe device. Virtual function controllers will not report SR-IOV.

sriov_vq_flexible

Indicates the total number of flexible queue resources assignable to all the secondary controllers. Implicitly sets the number of primary controller’s private resources to (max_ioqpairs - sriov_vq_flexible).

sriov_vi_flexible

Indicates the total number of flexible interrupt resources assignable to all the secondary controllers. Implicitly sets the number of primary controller’s private resources to (msix_qsize - sriov_vi_flexible).

sriov_max_vi_per_vf (default: 0)

Indicates the maximum number of virtual interrupt resources assignable to a secondary controller. The default 0 resolves to (sriov_vi_flexible / sriov_max_vfs).

sriov_max_vq_per_vf (default: 0)

Indicates the maximum number of virtual queue resources assignable to a secondary controller. The default 0 resolves to (sriov_vq_flexible / sriov_max_vfs).

The simplest possible invocation enables the capability to set up one VF controller and assign an admin queue, an IO queue, and an MSI-X interrupt.

-device nvme-subsys,id=subsys0
-device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,sriov_vq_flexible=2,sriov_vi_flexible=1
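The resource split implied by these parameters can be worked out from the formulas given above. The sketch below does so for the invocation shown, using the default max_ioqpairs=64 and msix_qsize=65. The helper name sriov_split is hypothetical, and the use of integer division for the per-VF defaults is an assumption.

```shell
# Usage: sriov_split MAX_IOQPAIRS MSIX_QSIZE VQ_FLEXIBLE VI_FLEXIBLE MAX_VFS
# MAX_VFS must be non-zero. Prints the primary controller's private
# resources and the per-VF limits that the 0 defaults resolve to.
sriov_split() {
    private_vq=$(( $1 - $3 ))   # max_ioqpairs - sriov_vq_flexible
    private_vi=$(( $2 - $4 ))   # msix_qsize   - sriov_vi_flexible
    vq_per_vf=$(( $3 / $5 ))    # sriov_vq_flexible / sriov_max_vfs
    vi_per_vf=$(( $4 / $5 ))    # sriov_vi_flexible / sriov_max_vfs
    echo "private_vq=$private_vq private_vi=$private_vi vq_per_vf=$vq_per_vf vi_per_vf=$vi_per_vf"
}

sriov_split 64 65 2 1 1
# prints: private_vq=62 private_vi=64 vq_per_vf=2 vi_per_vf=1
```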

The minimum steps required to configure a functional NVMe secondary controller are:

  • unbind flexible resources from the primary controller

 nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
 nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

  • perform a Function Level Reset on the primary controller to actually release the resources

 echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

  • enable VF

 echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

  • assign the flexible resources to the VF and set it ONLINE

 nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
 nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
 nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

  • bind the NVMe driver to the VF

 echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind