1. PCIe for DV Engineers - Architecture & Overview

Part 1 of the PCIe for DV Engineers series

Welcome to the PCIe for DV Engineers series! This series provides verification engineers with the architectural understanding and protocol knowledge needed to effectively verify PCIe-based designs. We'll cover everything from fundamentals to advanced verification scenarios, with practical SystemVerilog examples throughout.

Why PCIe Knowledge Matters for DV

PCIe is everywhere in modern computing. As a DV engineer, you'll encounter it in:

  • SoC Verification — Most SoCs include PCIe controllers for external connectivity
  • IP Verification — Endpoint, Root Complex, or Switch IP blocks
  • System-Level Testing — End-to-end transaction flows across the fabric
  • Performance Validation — Identifying bottlenecks in high-speed data paths
  • Compliance Testing — Ensuring spec-compliant behavior across all scenarios

DV Insight: Understanding PCIe architecture isn't just about the protocol—it's about knowing where bugs hide. Most PCIe bugs occur at layer boundaries and during state transitions.

What is PCIe?

PCIe (Peripheral Component Interconnect Express) is a high-speed serial interconnect standard that replaced the legacy parallel PCI bus. The key architectural shift was from a shared bus to dedicated point-to-point links.

Feature        | PCI (Legacy)                 | PCIe
Topology       | Shared parallel bus          | Point-to-point serial
Bandwidth      | Fixed (shared among devices) | Scalable (dedicated per link)
Arbitration    | Bus arbitration required     | No arbitration needed
Hot-plug       | Limited support              | Native support
Error Handling | Basic                        | Advanced (AER)
Lanes          | N/A                          | x1, x2, x4, x8, x16, x32

DV Insight: The point-to-point nature simplifies arbitration verification but introduces new challenges: link training, speed negotiation, and lane-width negotiation must all be thoroughly verified.

PCIe Generations at a Glance

Each PCIe generation roughly doubles the effective bandwidth of its predecessor (Gen3 achieves its doubling with a smaller rate increase plus the switch to 128b/130b encoding):

Gen  | Transfer Rate | Encoding  | x1 Bandwidth | x16 Bandwidth | Year
Gen1 | 2.5 GT/s      | 8b/10b    | 250 MB/s     | 4 GB/s        | 2003
Gen2 | 5.0 GT/s      | 8b/10b    | 500 MB/s     | 8 GB/s        | 2007
Gen3 | 8.0 GT/s      | 128b/130b | ~1 GB/s      | ~16 GB/s      | 2010
Gen4 | 16.0 GT/s     | 128b/130b | ~2 GB/s      | ~32 GB/s      | 2017
Gen5 | 32.0 GT/s     | 128b/130b | ~4 GB/s      | ~64 GB/s      | 2019
Gen6 | 64.0 GT/s     | PAM4/FLIT | ~8 GB/s      | ~128 GB/s     | 2022
Gen7 | 128 GT/s      | PAM4      | ~16 GB/s     | ~242 GB/s     | 2025

Key Verification Points:

  • 8b/10b encoding has 20% overhead; 128b/130b reduces this to ~1.5%
  • Gen6 introduced PAM4 signaling (4 voltage levels) and FLIT-based encoding
  • Backward compatibility must work across all generations
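
To sanity-check the bandwidth columns above, the per-lane numbers follow directly from the transfer rate and the encoding efficiency. Here's a minimal SystemVerilog helper you might drop into a testbench utility package — the package and function names are illustrative, and the simple model ignores FLIT/FEC overhead for Gen6+:

// Minimal sketch: effective per-lane bandwidth from transfer rate and encoding
package pcie_bw_util_pkg;

  // rate_gt_s : transfer rate in GT/s (e.g. 8.0 for Gen3)
  // enc_eff   : encoding efficiency (0.8 for 8b/10b, 128.0/130.0 for 128b/130b)
  // returns   : approximate x1 bandwidth in GB/s (one direction)
  function automatic real x1_bandwidth_gb_s(real rate_gt_s, real enc_eff);
    // 1 GT/s = 1e9 transfers/s, one bit per transfer per lane; divide by 8 for bytes
    return (rate_gt_s * enc_eff) / 8.0;
  endfunction

endpackage

// Example (in a testbench initial block):
//   $display("Gen1 x1 = %0.2f GB/s", pcie_bw_util_pkg::x1_bandwidth_gb_s(2.5, 0.8));          // ~0.25
//   $display("Gen3 x1 = %0.2f GB/s", pcie_bw_util_pkg::x1_bandwidth_gb_s(8.0, 128.0/130.0));  // ~0.98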

Verification Coverage: Speed Negotiation

One of the most critical verification scenarios is speed negotiation. Here's a SystemVerilog coverage example:

// PCIe Speed Negotiation Coverage
covergroup cg_speed_negotiation @(posedge link_up);

  // Cover all supported speeds
  cp_link_speed: coverpoint current_link_speed {
    bins gen1 = {3'b001};
    bins gen2 = {3'b010};
    bins gen3 = {3'b011};
    bins gen4 = {3'b100};
    bins gen5 = {3'b101};
  }

  // Cover speed transitions (training scenarios)
  cp_speed_transition: coverpoint {prev_speed, current_link_speed} {
    bins gen4_to_gen3 = {6'b100_011};
    bins gen4_to_gen2 = {6'b100_010};
    bins gen4_to_gen1 = {6'b100_001};
    bins gen3_to_gen2 = {6'b011_010};
    bins gen3_to_gen1 = {6'b011_001};
    // Add more as needed
  }

  // Cover advertised capability vs negotiated speed
  cp_advertised_speed: coverpoint advertised_max_speed {
    bins gen1 = {3'b001};
    bins gen2 = {3'b010};
    bins gen3 = {3'b011};
    bins gen4 = {3'b100};
    bins gen5 = {3'b101};
  }
  cp_capability_vs_negotiated: cross cp_advertised_speed, cp_link_speed;

endgroup

DV Insight: Always test the "downgrade" scenarios. A Gen5 device connecting to a Gen2 slot is a common real-world case that reveals negotiation bugs.

Lane Configurations

PCIe links aggregate multiple lanes (x1, x2, x4, x8, x16, x32). Each lane is a differential pair providing full-duplex communication.

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#e0f2fe', 'primaryTextColor': '#0f172a', 'primaryBorderColor': '#0369a1', 'lineColor': '#0369a1', 'fontSize': '13px'}}}%%
flowchart LR
    subgraph DevA[" "]
        direction TB
        HA["<b>Device A</b>"]
        TXA["TX Lane 0<br/>TX Lane 1<br/>TX Lane 2<br/>TX Lane 3"]
        RXA["RX Lane 0<br/>RX Lane 1<br/>RX Lane 2<br/>RX Lane 3"]
    end
    subgraph DevB[" "]
        direction TB
        HB["<b>Device B</b>"]
        RXB["RX Lane 0<br/>RX Lane 1<br/>RX Lane 2<br/>RX Lane 3"]
        TXB["TX Lane 0<br/>TX Lane 1<br/>TX Lane 2<br/>TX Lane 3"]
    end
    TXA ==>|"x4 Link"| RXB
    TXB ==>|"x4 Link"| RXA
    style HA fill:#0c4a6e,color:#fff,stroke:#0c4a6e
    style HB fill:#0c4a6e,color:#fff,stroke:#0c4a6e
    style TXA fill:#e0f2fe,stroke:#0369a1
    style RXA fill:#e0f2fe,stroke:#0369a1
    style TXB fill:#e0f2fe,stroke:#0369a1
    style RXB fill:#e0f2fe,stroke:#0369a1

Width Negotiation Coverage

// PCIe Link Width Coverage
covergroup cg_link_width @(posedge link_up);

  cp_negotiated_width: coverpoint link_width {
    bins x1  = {6'b00_0001};
    bins x2  = {6'b00_0010};
    bins x4  = {6'b00_0100};
    bins x8  = {6'b00_1000};
    bins x16 = {6'b01_0000};
    bins x32 = {6'b10_0000};
  }

  // Width degradation scenarios
  cp_width_degradation: coverpoint {max_width, link_width} {
    bins x16_to_x8  = {{5'd16, 6'b00_1000}};
    bins x16_to_x4  = {{5'd16, 6'b00_0100}};
    bins x16_to_x1  = {{5'd16, 6'b00_0001}};
    bins x8_to_x4   = {{5'd8,  6'b00_0100}};
    bins x8_to_x1   = {{5'd8,  6'b00_0001}};
  }

  // Cross speed and width (a local speed coverpoint is needed for the cross)
  cp_link_speed: coverpoint current_link_speed;
  cp_speed_width_cross: cross cp_link_speed, cp_negotiated_width;

endgroup

DV Insight: Lane reversal and polarity inversion are often overlooked. PCIe allows physical lanes to be reversed (Lane 0 ↔ Lane 15) and polarity inverted—your testbench must cover these.
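
A hedged sketch of what that coverage might look like — the signal names lane_reversal_detected and any_polarity_inverted are placeholders for whatever your PHY monitor exposes:

// Lane reversal & polarity inversion coverage (sketch)
covergroup cg_lane_config @(posedge link_up);

  cp_lane_reversal: coverpoint lane_reversal_detected {
    bins normal   = {1'b0};
    bins reversed = {1'b1};
  }

  // Set when any lane trained with inverted polarity (per-lane flags OR-reduced by the monitor)
  cp_polarity_inversion: coverpoint any_polarity_inverted {
    bins none     = {1'b0};
    bins inverted = {1'b1};
  }

  cp_width: coverpoint link_width {
    bins x1  = {6'b00_0001};
    bins x4  = {6'b00_0100};
    bins x16 = {6'b01_0000};
  }

  // Reversal and inversion should be exercised at multiple link widths
  cx_lane_cfg: cross cp_lane_reversal, cp_polarity_inversion, cp_width;

endgroup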

System Topology

PCIe forms a tree hierarchy with the Root Complex at the apex:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#e0f2fe', 'primaryTextColor': '#0f172a', 'primaryBorderColor': '#0369a1', 'lineColor': '#334155', 'fontSize': '13px'}}}%%
flowchart TB
    subgraph HOST["Host System"]
        CPU(["CPU"]) ~~~ MEM(["Memory"])
    end
    RC["<b>Root Complex</b><br/><i>Type 1 Config</i>"]
    HOST --- RC
    RC ---|"Link"| SW["<b>Switch</b><br/><i>Upstream + Downstream Ports</i>"]
    RC ---|"Link"| EP1["<b>Endpoint</b><br/>GPU<br/><i>Type 0 Config</i>"]
    SW ---|"Link"| EP2["<b>Endpoint</b><br/>NVMe SSD"]
    SW ---|"Link"| EP3["<b>Endpoint</b><br/>Network Card"]
    style RC fill:#0c4a6e,color:#fff,stroke:#0c4a6e
    style SW fill:#0369a1,color:#fff,stroke:#0369a1
    style EP1 fill:#e0f2fe,stroke:#0369a1
    style EP2 fill:#e0f2fe,stroke:#0369a1
    style EP3 fill:#e0f2fe,stroke:#0369a1
    style HOST fill:#f1f5f9,stroke:#94a3b8

Component Summary

Component    | Config Type | Description                                       | Key DV Focus
Root Complex | Type 1      | Root of hierarchy; connects CPU/Memory to fabric  | Configuration routing, memory mapping
Switch       | Type 1      | Packet router with upstream + downstream ports    | Routing tables, peer-to-peer, arbitration
Endpoint     | Type 0      | Terminal device (GPU, NVMe, NIC)                  | BAR programming, TLP handling
Bridge       | Type 1      | Legacy PCI/PCI-X connection                       | Protocol translation

BDF Addressing

Every PCIe function is identified by a Bus:Device:Function (BDF) address:

// BDF Address Structure
typedef struct packed {
  logic [7:0] bus;    // 256 buses max
  logic [4:0] device; // 32 devices per bus
  logic [2:0] func;   // 8 functions per device ("function" is a reserved keyword)
} pcie_bdf_t;

// Example: GPU at Bus 1, Device 0, Function 0
pcie_bdf_t gpu_bdf = '{bus: 8'h01, device: 5'h00, func: 3'h0};
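
The same three fields, packed in this order, form the 16-bit Routing ID (the Requester/Completer ID carried in TLP headers). A small helper built on the struct above — the function name is testbench-local:

// Sketch: pack a BDF into the 16-bit Routing ID used in TLP headers
function automatic logic [15:0] bdf_to_routing_id(pcie_bdf_t bdf);
  return {bdf.bus, bdf.device, bdf.func};  // {8, 5, 3} = 16 bits
endfunction

// Example: bdf_to_routing_id('{bus: 8'h01, device: 5'h00, func: 3'h0}) == 16'h0100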

Protocol Stack

PCIe uses a three-layer architecture, each with distinct responsibilities:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#e0f2fe', 'primaryTextColor': '#0f172a', 'primaryBorderColor': '#0369a1', 'lineColor': '#334155', 'fontSize': '13px'}}}%%
flowchart TB
    subgraph STACK["PCIe Protocol Stack"]
        direction TB
        TL["<b>Transaction Layer</b><br/><i>TLPs: Memory, IO, Config, Message</i>"]
        DL["<b>Data Link Layer</b><br/><i>Seq#, LCRC, ACK/NAK, Flow Control</i>"]
        PL["<b>Physical Layer</b><br/><i>Encoding, LTSSM, Electrical</i>"]
        TL --- DL --- PL
    end
    PL <-.->|"Serial Link"| REMOTE(["Remote Device"])
    style TL fill:#bae6fd,stroke:#0369a1
    style DL fill:#7dd3fc,stroke:#0369a1
    style PL fill:#38bdf8,stroke:#0369a1,color:#0f172a
    style STACK fill:#f0f9ff,stroke:#0369a1

Layer       | Packet Unit | Key Functions                                  | Verification Focus
Transaction | TLP         | Request/Completion, Ordering, Virtual Channels | TLP format, ordering rules, VC arbitration
Data Link   | DLLP        | Reliable delivery, Credit management           | ACK/NAK timing, replay buffer, credit flow
Physical    | Symbol      | Encoding, Link training, Electrical            | LTSSM states, encoding errors, electrical compliance

Layer Verification Mapping

// Simplified PCIe Monitor Structure
class pcie_monitor extends uvm_monitor;

  // Layer-specific analysis ports
  uvm_analysis_port #(pcie_tlp_t)  tlp_ap;   // Transaction Layer
  uvm_analysis_port #(pcie_dllp_t) dllp_ap;  // Data Link Layer
  uvm_analysis_port #(pcie_os_t)   os_ap;    // Physical Layer (Ordered Sets)

  virtual task run_phase(uvm_phase phase);
    fork
      monitor_transaction_layer();  // TLP decode & check
      monitor_datalink_layer();     // DLLP, ACK/NAK, credits
      monitor_physical_layer();     // LTSSM, training sequences
    join
  endtask

endclass

Verification Assertions

Here are essential assertions for PCIe architecture verification:

// Link must train within specification timeout
property p_link_training_timeout;
  @(posedge clk)
  $rose(ltssm_detect_quiet) |->
    ##[1:TRAINING_TIMEOUT_CYCLES] link_up;
endproperty
assert property (p_link_training_timeout) else
  $error("Link training timeout exceeded");

// Negotiated speed must not exceed advertised capability
property p_speed_within_capability;
  @(posedge clk)
  link_up |-> (negotiated_speed <= advertised_max_speed);
endproperty
assert property (p_speed_within_capability);

// Negotiated width must not exceed physical lanes
property p_width_within_capability;
  @(posedge clk)
  link_up |-> (negotiated_width <= max_physical_lanes);
endproperty
assert property (p_width_within_capability);

// BDF fields must be within legal ranges (device < 32, function < 8)
// Note: use literals wide enough to hold 32 and 8, otherwise they truncate to 0
property p_valid_bdf;
  @(posedge clk)
  config_request |-> (bdf.device < 6'd32) && (bdf.func < 4'd8);
endproperty
assert property (p_valid_bdf);

Interview Corner

Q1: What's the difference between PCI and PCIe?

PCI uses a shared parallel bus where all devices compete for bandwidth. PCIe uses dedicated point-to-point serial links, providing each device with guaranteed bandwidth and eliminating bus arbitration.

Q2: How does PCIe achieve backward compatibility across generations?

During link training, both sides advertise their maximum supported speed. They negotiate down to the highest common speed. A Gen5 device in a Gen2 slot will operate at Gen2 speeds.

Q3: What happens if some lanes fail during training?

PCIe supports link width degradation. If a x16 link has lane failures, it can train at x8, x4, x2, or x1. The LTSSM handles this during the Configuration state.

Q4: Why did PCIe move from 8b/10b to 128b/130b encoding?

8b/10b has 20% overhead (10 bits transmitted for 8 bits of data). 128b/130b reduces overhead to ~1.5%, enabling higher effective bandwidth without increasing the raw transfer rate.

Q5: What is a Root Complex and why does it matter for verification?

The Root Complex connects the CPU/Memory subsystem to the PCIe fabric. It originates configuration transactions and is the ultimate destination for upstream memory writes. For verification, it's crucial because it defines the memory map and handles enumeration.

Real-World: Linux RC ↔ EP Interactions

The PCIe concepts above aren't abstract — they're exercised billions of times per second on every Linux machine. This section traces real RC ↔ EP interactions using lspci, dmesg, and sysfs output to show exactly how the architecture works in practice.

1. Linux Boot: How the RC Discovers Every Endpoint

Teaches: BDF Addressing, Config Space, Type 0/1 Headers, Topology Discovery

When a Linux system boots, the Root Complex performs enumeration — systematically scanning every possible Bus:Device:Function address by issuing Configuration Read TLPs. Each responding device returns its Vendor/Device ID, header type, and BAR layout.

$ lspci -tv
-[0000:00]-+-00.0  Intel Root Complex
           +-01.0--[01]----00.0  NVIDIA GPU
           +-1c.0--[02]----00.0  Intel NIC
           \-1d.0--[03]----00.0  Samsung NVMe

Reading this tree: the root bus is 00. Device 01.0 on bus 00 is a bridge (Type 1 header) — it creates secondary bus 01, behind which sits the GPU at 01:00.0 (Type 0 header, an endpoint). The same pattern repeats for the NIC and NVMe.

Bus numbers are assigned top-down during enumeration. The RC writes each bridge's Secondary Bus Number and Subordinate Bus Number registers so that TLPs can be routed to the correct device. When the RC probes a BDF where no device exists, the configuration read terminates with an Unsupported Request completion and software reads back all 1s — the PCIe equivalent of a legacy Master Abort — and the RC marks that slot as empty.
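
In a testbench, the same bus-number routing rule is a one-line predicate. A minimal sketch of the check a switch/bridge model or scoreboard might apply before claiming a TLP — the function name and argument packaging are illustrative, not from any particular VIP:

// Sketch: does a TLP targeting bus 'dest_bus' route downstream through a bridge
// programmed with these bus numbers?
function automatic bit routes_downstream(
  logic [7:0] dest_bus,
  logic [7:0] secondary_bus,    // first bus directly below the bridge
  logic [7:0] subordinate_bus   // highest bus reachable below the bridge
);
  return (dest_bus >= secondary_bus) && (dest_bus <= subordinate_bus);
endfunction

// Example: a bridge with Secondary = 01 and Subordinate = 01 claims TLPs for bus 01 only.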

$ lspci -s 01:00.0 -vvv | head -3
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090]
        Control: I/O+ Mem+ BusMaster+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-

The Mem+ and BusMaster+ flags confirm that the kernel enabled memory-space access and bus mastering for this endpoint — without these, the device can't respond to Memory TLPs or initiate DMA.

Explains: BDF addressing isn't arbitrary bookkeeping — it's the routing label on every Configuration TLP. Type 1 headers (bridges) create the bus hierarchy; Type 0 headers (endpoints) terminate it. Enumeration is the RC methodically building a map of what exists and where.


2. NVMe SSD Read: The Full TLP Conversation

Teaches: Memory-Mapped I/O, Posted vs Non-Posted TLPs, Completions, MSI-X Interrupts

When Linux reads a 4 KB block from an NVMe SSD, the interaction between RC and EP is a precise sequence of TLPs. Here's what actually happens:

First, check where the NVMe controller's registers are mapped:

$ lspci -s 03:00.0 -vv | grep Region
Region 0: Memory at fb000000 (64-bit, non-prefetchable) [size=16K]

The kernel mapped the NVMe controller's BAR to host physical address 0xfb000000. Every register access is a PCIe transaction.

The TLP sequence for a single NVMe read:

Step | Direction | TLP Type                 | Purpose
1    | RC → EP   | Memory Write (posted)    | Driver writes the doorbell register at BAR+0x1000 to notify the SSD of a new submission queue entry
2    | EP → RC   | Memory Read (non-posted) | SSD fetches the 64-byte submission queue entry from host memory
3    | RC → EP   | Completion with Data     | RC returns the submission queue entry data to the SSD
4    | EP → RC   | Memory Read (non-posted) | SSD reads the actual 4 KB data block from host memory (may split into multiple TLPs)
5    | RC → EP   | Completion with Data     | RC returns data — up to 32 completions for a 4 KB read with a 128-byte max payload
6    | EP → RC   | Memory Write (posted)    | SSD writes a completion queue entry to host memory
7    | EP → RC   | Memory Write (posted)    | SSD sends an MSI-X interrupt — a memory write to a special RC address

Notice the asymmetry:

  • Posted writes (steps 1, 6, 7) fire and forget — no response TLP is needed. This is why PCIe writes are fast.
  • Non-posted reads (steps 2, 4) require Completion TLPs to carry the data back. The requester blocks until the completion arrives or a timeout fires.

$ cat /proc/interrupts | grep nvme
 45:      18923  IR-PCI-MSI 524288-edge  nvme0q0
 46:      94521  IR-PCI-MSI 524289-edge  nvme0q1

Each MSI-X interrupt is a Memory Write TLP targeting a specific address in the RC's interrupt controller. The number after IR-PCI-MSI (524288, 524289, …) is the kernel's internal MSI identifier, which encodes the originating device plus the vector index. Each queue gets its own independent vector — no sharing, no polling.

Explains: The posted/non-posted distinction is fundamental to PCIe performance. Writes are fire-and-forget (posted) because the data flows one way. Reads must have completions because the requester needs data back. MSI-X interrupts are elegant — they're just memory writes, using the same TLP mechanism as regular data transfer.
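
This asymmetry maps directly onto a common scoreboard pattern: track non-posted requests by tag and retire them when their (possibly split) completions arrive, while posted writes are never tracked. A minimal sketch — the TLP kind and field names here are simplified placeholders, not a real VIP's transaction class:

// Sketch: track outstanding non-posted reads by tag and match completions
typedef enum {MRD, MWR, CPL} tlp_kind_e;   // simplified TLP kinds (illustrative)
typedef struct {
  tlp_kind_e kind;
  byte       tag;
  bit        last_cpl;   // set on the final completion of a split read
} np_tlp_s;

class np_tracker;
  protected np_tlp_s outstanding[byte];    // outstanding reads, keyed by TLP tag

  function void on_request(np_tlp_s req);
    if (req.kind == MRD) begin             // non-posted: completion(s) expected
      if (outstanding.exists(req.tag))
        $error("Tag 0x%0h reused while still outstanding", req.tag);
      outstanding[req.tag] = req;
    end
    // MWR is posted: fire-and-forget, nothing to track
  endfunction

  function void on_completion(np_tlp_s cpl);
    if (!outstanding.exists(cpl.tag))
      $error("Unexpected completion, tag 0x%0h", cpl.tag);
    else if (cpl.last_cpl)                 // reads may be split across several CplD TLPs
      outstanding.delete(cpl.tag);
  endfunction
endclass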


3. SR-IOV: One Physical NIC Becomes Many Endpoints

Teaches: Physical Functions, Virtual Functions, BDF Expansion, BAR Allocation

SR-IOV (Single Root I/O Virtualization) allows a single physical NIC to present multiple independent endpoints to the RC. Each Virtual Function (VF) gets its own BDF address, BAR space, and config space — the RC treats them as separate devices:

$ echo 4 > /sys/bus/pci/devices/0000:05:00.0/sriov_numvfs

$ lspci | grep Ethernet
05:00.0 Ethernet: Intel X710 (PF)
05:02.0 Ethernet: Intel X710 Virtual Function
05:02.1 Ethernet: Intel X710 Virtual Function
05:02.2 Ethernet: Intel X710 Virtual Function
05:02.3 Ethernet: Intel X710 Virtual Function

The Physical Function (PF) at 05:00.0 created 4 VFs. Notice VFs appear at 05:02.x — the device number offset is configured in the SR-IOV capability structure. Each VF has independent config space:

$ lspci -s 05:02.0 -vv | grep Region
Region 0: Memory at fb100000 (64-bit, non-prefetchable) [size=4K]
Region 3: Memory at fb110000 (64-bit, prefetchable) [size=4K]

The RC allocated a separate BAR range for each VF. The SR-IOV capability structure in the PF specifies the VF BAR base addresses, and each VF's BARs are carved out of that aperture at a fixed spacing:

VF[n] BARx = VF BARx Base + (n × VF BARx aperture size), with n counting from 0

The VF Offset and VF Stride fields in the same capability determine where the VFs appear in BDF space:

$ lspci -s 05:00.0 -vv | grep -A2 "SR-IOV"
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
  Initial VFs: 64, Total VFs: 64, Number of VFs: 4
  VF offset: 16, VF stride: 1

The VF offset: 16 means VFs start at device 2 (offset 16 functions = 2 devices × 8 functions), hence 05:02.0. The VF stride: 1 means consecutive function numbers.
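
The offset/stride arithmetic is worth modeling explicitly in a testbench, since VF Routing IDs must be predicted for TLP checking. A minimal sketch — the function name is illustrative; VFs are numbered from 1, as in the SR-IOV capability definition:

// Sketch: compute VF n's Routing ID from the PF's Routing ID and the
// SR-IOV First VF Offset / VF Stride fields
function automatic logic [15:0] vf_routing_id(
  logic [15:0] pf_rid,           // e.g. 16'h0500 for 05:00.0
  logic [15:0] first_vf_offset,
  logic [15:0] vf_stride,
  int unsigned vf_num            // 1..NumVFs
);
  return pf_rid + first_vf_offset + (vf_num - 1) * vf_stride;
endfunction

// Example: pf_rid = 16'h0500, offset = 16, stride = 1 -> VF1 = 16'h0510 = 05:02.0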

Explains: SR-IOV works entirely within the PCIe spec's existing BDF and BAR model. The RC doesn't need special SR-IOV awareness for data plane operations — VFs look like regular endpoints. The PF manages VF lifecycle through its SR-IOV capability registers, and the RC's normal enumeration and BAR allocation handles the rest.


4. VFIO GPU Passthrough: A VM Talks Directly to Hardware

Teaches: IOMMU/Address Translation, BAR Remapping, TLP Routing Through Virtualization

When a GPU is passed through to a KVM/QEMU virtual machine via VFIO, the guest OS gets direct access to the physical PCIe endpoint. The setup involves unbinding the host driver and binding VFIO:

# Host unbinds GPU from native driver
$ echo 0000:01:00.0 > /sys/bus/pci/drivers/nvidia/unbind

# Bind to VFIO-PCI (a stub driver that grants userspace access)
$ echo "vfio-pci" > /sys/bus/pci/devices/0000:01:00.0/driver_override
$ echo 0000:01:00.0 > /sys/bus/pci/drivers/vfio-pci/bind

Inside the VM, the guest sees what appears to be a normal PCIe endpoint:

guest$ lspci
00:05.0 VGA compatible controller: NVIDIA Corporation [10de:2204]

guest$ lspci -s 00:05.0 -vv | grep Region
Region 0: Memory at fc000000 (64-bit, non-prefetchable) [size=16M]
Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]
Region 3: Memory at ee000000 (64-bit, non-prefetchable) [size=32M]

The guest BAR addresses (0xfc000000) are guest-physical — not the same as host-physical addresses. Here's the TLP path:

Guest CPU writes to 0xfc000000 (guest-physical)
    ↓
Virtual RC (QEMU) generates Memory Write TLP
    ↓
Host IOMMU intercepts, translates guest-physical → host-physical
    ↓
Real RC forwards TLP to physical GPU endpoint

When the GPU DMAs data back to "system memory," the reverse happens:

GPU EP sends Memory Write TLP to address X
    ↓
Host IOMMU intercepts, validates X belongs to this VM's IOMMU domain
    ↓
TLP delivered to correct host-physical page (mapped into guest memory)

The IOMMU provides isolation — the GPU can only DMA to memory pages explicitly mapped for that VM's IOMMU domain. A rogue GPU (or buggy driver) cannot access other VMs' memory:

$ dmesg | grep -i iommu
DMAR: IOMMU enabled
DMAR: Setting RMRR for device 0000:01:00.0 [0xef000000 - 0xefffffff]

Explains: IOMMU intercepts every TLP that crosses the RC boundary. It provides the address translation that makes virtualization possible — guest-physical ≠ host-physical, and the IOMMU bridges this gap on every transaction. Without it, GPU passthrough would require trusting the device to only access the correct memory, which is not viable for multi-tenant systems.
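
From a verification standpoint, the IOMMU boundary is another address-range check: every EP-initiated DMA must land inside a range mapped for that device's domain. A hedged sketch of a scoreboard-side model — the types and names are illustrative:

// Sketch: check that an EP-initiated DMA falls inside its IOMMU domain's mappings
typedef struct {
  logic [63:0] base;
  logic [63:0] size;
} iova_range_t;

class iommu_domain_model;
  iova_range_t mapped[$];   // ranges programmed for this domain

  function bit dma_allowed(logic [63:0] addr, int unsigned len);
    foreach (mapped[i])
      if (addr >= mapped[i].base &&
          (addr + len) <= (mapped[i].base + mapped[i].size))
        return 1;
    return 0;               // would be blocked and reported as an IOMMU fault
  endfunction
endclass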


5. AER: How the RC Reports Endpoint Errors

Teaches: Advanced Error Reporting, Correctable vs Uncorrectable Errors, Error TLP Flow

PCIe defines a three-tier error hierarchy through Advanced Error Reporting (AER). When something goes wrong on the link, the RC detects it and reports through standard Linux interfaces:

$ dmesg | grep AER
pcieport 0000:00:1c.0: AER: Corrected error received: 0000:03:00.0
pcieport 0000:00:1c.0: AER: device [03:00.0] error status: 0x00000040 [Bad TLP]

The three error tiers and their consequences:

Tier                    | Examples                                                       | Impact                      | RC Response
Correctable             | Bad TLP, Bad DLLP, Replay Timer Timeout                        | None — hardware retransmits | Log and increment counter
Non-Fatal Uncorrectable | Completion Timeout, Unexpected Completion, ACS Violation       | Specific transaction fails  | Notify OS, driver handles
Fatal Uncorrectable     | Flow Control Protocol Error, Malformed TLP, ECRC Check Failed  | Link is unreliable          | Reset the link (Secondary Bus Reset)

Error counters are exposed through sysfs:

$ cat /sys/bus/pci/devices/0000:03:00.0/aer_dev_correctable
RxErr 0
BadTLP 3
BadDLLP 0
Rollover 1
Timeout 0
NonFatalErr 0
CorrIntErr 0
HeaderOF 0

The error flow through the PCIe topology:

  1. EP detects or causes an error condition
  2. EP sets the corresponding bit in its Device Status Register (offset 0x0A within the PCI Express Capability structure)
  3. If error reporting is enabled, EP sends an Error Message TLP upstream to the RC
  4. RC receives the error message and logs it via the AER driver
  5. For fatal errors, the RC may issue a Secondary Bus Reset to recover the link

$ setpci -s 03:00.0 CAP_EXP+0x0a.w
0011
# Bit 0: Correctable Error Detected (matches the Bad TLP logged above)
# Bit 4: AUX Power Detected

Explains: PCIe's error model is layered by severity. Correctable errors are invisible to software — the Data Link Layer's ACK/NAK protocol handles retransmission automatically. Non-fatal errors kill individual transactions but leave the link operational. Fatal errors indicate the link itself is compromised. This hierarchy lets PCIe degrade gracefully rather than failing catastrophically.
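
For DV, this severity hierarchy translates naturally into coverage: every error type should be observed at its expected severity. A sketch — the enum values and signal names (aer_err_valid, aer_err_code, aer_severity) are placeholders for whatever your AER monitor produces:

// AER severity coverage (sketch)
typedef enum bit [1:0] {CORRECTABLE, NONFATAL, FATAL} aer_severity_e;
typedef enum bit [3:0] {ERR_BAD_TLP, ERR_BAD_DLLP, ERR_REPLAY_TIMEOUT,
                        ERR_CPL_TIMEOUT, ERR_UNEXPECTED_CPL,
                        ERR_MALFORMED_TLP, ERR_ECRC} aer_code_e;

covergroup cg_aer @(posedge clk iff aer_err_valid);

  cp_severity: coverpoint aer_severity {
    bins correctable = {CORRECTABLE};
    bins nonfatal    = {NONFATAL};
    bins fatal       = {FATAL};
  }

  cp_code: coverpoint aer_err_code;  // automatic bins, one per enum value

  // Every error code should be seen with its expected severity
  cx_code_x_severity: cross cp_code, cp_severity;

endgroup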


6. Suspend/Resume: Tearing Down and Rebuilding Every Link

Teaches: Power States (D0–D3), PM TLPs, Link State Transitions, Config Space Save/Restore

When a Linux laptop suspends, the kernel systematically powers down every PCIe device, and on resume, reconstructs the entire bus state from saved data:

$ dmesg | grep -i "power state"
pci 0000:03:00.0: power state changed [D0 -> D3hot]
pci 0000:02:00.0: power state changed [D0 -> D3hot]
pci 0000:01:00.0: power state changed [D0 -> D3cold]

The suspend sequence:

  1. Save config space — the kernel reads and stores all 256 bytes (or 4096 for PCIe extended) of each device's configuration space: BARs, Command register, MSI-X table, and capability structures.
  2. Transition devices to D3 — the RC writes to each device's Power Management Control/Status Register (PMCSR), setting the power state field:
    • D3hot: Device powered but non-operational, Vcc still supplied, config space partially accessible
    • D3cold: Power completely removed, device is inert, config space inaccessible
  3. Links enter low-power states — as devices enter D3, links transition to L2 (low-power with aux power) or L3 (completely off).

The resume sequence:

$ dmesg | grep -i "pci\|restoring"
pci 0000:01:00.0: power state changed [D3cold -> D0]
pci 0000:01:00.0: restoring config space at offset 0x4 (was 0x0, writing 0x100406)
pci 0000:01:00.0: restoring config space at offset 0x10 (was 0x0, writing 0xfb000000)
pci 0000:01:00.0: restoring config space at offset 0x3c (was 0x0, writing 0x1ff)

  1. Re-power links — the RC re-applies power, and each link's LTSSM goes through the full training sequence: Detect → Polling → Configuration → L0.
  2. Restore config space — the kernel writes back all saved register values. Offset 0x04 is the Command register (re-enabling memory access and bus mastering). Offset 0x10 is BAR0 (restoring the memory mapping). Offset 0x3C is the Interrupt Line/Pin register.
  3. Re-enable interrupts and resume drivers — MSI-X vectors are reprogrammed, and device drivers resume normal operation.

Explains: Config space is volatile in D3cold — the device loses all register state when power is removed. This is why the kernel must save and restore it. The resume process is essentially a fast re-enumeration: the bus topology is known, so the kernel skips discovery and directly programs each device back to its pre-suspend state. The LTSSM must fully retrain every link, which is why resume from deep sleep takes measurable time.
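
A simple way to verify this in a testbench is a shadow copy of the interesting config registers, captured before D3 entry and compared after the return to D0. A minimal sketch — cfg_rd in the usage comment is a hypothetical config-read task in your environment:

// Sketch: shadow-copy check that config space survives a D3 -> D0 cycle
class cfg_shadow_checker;
  protected logic [31:0] shadow[int unsigned];   // dword offset -> pre-suspend value

  function void save(int unsigned offset, logic [31:0] value);
    shadow[offset] = value;
  endfunction

  function void check_restored(int unsigned offset, logic [31:0] value);
    if (!shadow.exists(offset))
      $warning("No saved value for config offset 0x%0h", offset);
    else if (value !== shadow[offset])
      $error("Config offset 0x%0h not restored: saved 0x%0h, read 0x%0h",
             offset, shadow[offset], value);
  endfunction
endclass

// Typical usage: save(32'h04, cfg_rd(32'h04)) before D3 entry, then
// check_restored(32'h04, cfg_rd(32'h04)) after the return to D0.
// Offsets worth checking: 0x04 (Command), 0x10 (BAR0), 0x3C (Interrupt Line/Pin).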


7. Live Migration: Serializing a Virtual EP's PCIe State

Teaches: Config Space as Stateful Contract, BAR Semantics, Interrupt State

When a VM with an emulated PCIe NIC (e.g., virtio-net) is live-migrated from Host A to Host B, the hypervisor must serialize and transfer the complete state of the virtual endpoint:

State that must be captured:

  • Config space (256 or 4096 bytes): Vendor/Device IDs, BARs, Command register, Status register, all capability structures
  • MSI-X state: Interrupt vector table entries, pending bit array
  • Device-specific state: TX/RX queue pointers, ring buffer descriptors, device status flags

After migration, the guest sees an identical device:

# Before migration (Host A)              # After migration (Host B)
guest$ lspci -s 00:03.0 -vv              guest$ lspci -s 00:03.0 -vv
00:03.0 Ethernet: Red Hat virtio          00:03.0 Ethernet: Red Hat virtio
  Region 0: Memory at feb40000            Region 0: Memory at feb40000
  Region 1: Memory at fe000000            Region 1: Memory at fe000000
  Capabilities: [98] MSI-X: Enable+       Capabilities: [98] MSI-X: Enable+

Same BDF, same BAR addresses, same capabilities. The guest kernel never re-enumerates — from its perspective, the RC ↔ EP relationship was never interrupted.

Why config space is a contract:

The guest OS programmed BARs to specific addresses and set up DMA mappings accordingly. If any of these values changed during migration:

  • Changing BAR0 would invalidate all existing MMIO mappings → guest driver crash
  • Changing the MSI-X table would lose interrupt routing → device appears dead
  • Changing the BDF would require full re-enumeration → potential OS panic

Hardware passthrough makes this harder:

For VFIO passthrough devices (real hardware, not emulated), migration requires:

  1. Draining all in-flight DMA transactions on Host A
  2. Saving the physical device's actual config space and device-specific state
  3. Programming an identical physical device on Host B with the exact same state
  4. Ensuring the Host B IOMMU mappings match what the guest expects

This is why live migration of emulated devices is routine, while live migration of passthrough hardware is an active area of development — the EP state lives in silicon on Host A and must be perfectly replicated in silicon on Host B.

Explains: PCIe config space isn't just a set of registers — it's the contractual interface between host software and the endpoint. Every BAR address, every capability bit, every interrupt vector represents an agreement. Breaking any part of that contract mid-operation violates assumptions baked into driver code, DMA mappings, and interrupt routing. This is why PCIe's seemingly rigid config space model enables the flexibility of live migration — predictability is the feature.

Key Takeaways

  • PCIe replaced shared parallel buses with dedicated point-to-point serial links
  • Each generation doubles bandwidth; Gen6 reaches ~128 GB/s at x16
  • The three-layer stack (Transaction, Data Link, Physical) maps to distinct verification concerns
  • Speed and width negotiation are critical verification scenarios—always test downgrades
  • BDF addressing (Bus:Device:Function) uniquely identifies every PCIe function
  • Backward compatibility testing across generations is essential

What's Next

In Part 2, we'll dive deep into the Physical Layer and LTSSM (Link Training and Status State Machine). We'll explore:

  • All LTSSM states and transitions
  • Link training sequences (TS1, TS2)
  • Recovery mechanisms
  • Practical verification strategies for physical layer testing

Next: Part 2 — Physical Layer & LTSSM

Author
Mayur Kubavat
VLSI Design and Verification Engineer sharing knowledge about SystemVerilog, UVM, and hardware verification methodologies.
