13. SystemC Tutorial - Capstone 2: Running Real RV32I Programs

Introduction

By the end of this post, your SystemC model executes real RISC-V machine code.

Not contrived test vectors. Not hand-wired stimulus that drives specific ports with specific values at specific times. Actual programs: binary sequences your compiler produces, loaded into instruction memory, fetched by the program counter, decoded, executed, and written back — exactly as a silicon implementation would do it.

That makes this model a functional instruction set simulator, or ISS. The ARM Fast Models, gem5, QEMU, and Spike — the official RISC-V ISS from UC Berkeley — all started exactly this way. A software model that could execute binaries, fast enough to boot a kernel or run test suites, accurate enough to serve as the golden reference for RTL verification.

The difference between our model and Spike is that Spike handles the full RISC-V privilege architecture, virtual memory, M/S/U privilege levels, CSRs, and optional extensions. Our model handles the 42 instructions of RV32I. The methodology is identical. You will use both.

This post runs three programs: Fibonacci(10), array sum, and byte counting. Each program is shown as C pseudocode, then RISC-V assembly with exact hex encodings, then executed on the CPU. A software reference model — a pure C++ class called SoftCPU — runs in lockstep with your SystemC model, comparing all 32 registers after every instruction. Any divergence between the two models is a bug.

This is the gold standard of CPU validation. After Post 13, your single-cycle RV32I CPU is fully verified.


Prerequisites


Architecture Overview

graph TD
    TB["tb_cap2\n(SC_MODULE)"] --> CPU["rv32i_cpu\n(DUT)"]
    TB --> SOFT["SoftCPU\n(reference model)"]
    CPU -->|dbg_pc, dbg_instr| MON["trace_monitor\n(SC_METHOD)"]
    MON --> LOG["stdout\ntrace log"]
    CPU -->|dbg_pc, dbg_instr| CMP["comparator\n(SC_METHOD)"]
    SOFT --> CMP
    CMP -->|PASS/FAIL| RESULT["Test Result"]

The testbench owns both the DUT (rv32i_cpu) and a pure software reference (SoftCPU). After each clock edge, it advances the software model one instruction and compares all 32 registers. The trace monitor logs every instruction to stdout. The comparator flags divergence immediately, naming the register and both values.


SystemC Language Reference

Construct Syntax SV / Verilog Equivalent Key Difference
Simulation start with time limit sc_start(100, SC_NS) #100ns; in initial block SV blocks the current process; sc_start returns control to C++
Request simulation stop sc_stop() $finish Both stop after current delta completes; code after may still run
Current simulation time sc_time_stamp() (returns sc_time) $time (integer) / $realtime (real) sc_time is an object; use .to_double() or stream it
Software reference model Plain C++ class SoftCPU — no SystemC Task / function in SV, or DPI-C import SystemC: no DPI overhead; SoftCPU is just a C++ object in the same process
DPI equivalent Not needed — C++ class in same TU import "DPI-C" function void step(...) SystemC's key advantage: reference model needs no interface definition
Clock generation SC_THREAD toggling clk / sc_clock object always #5ns clk = ~clk or clocking block sc_clock is a channel; SV clock is usually an always or generate block
Lockstep comparison C++ for loop reading i_rf.read_reg(i) $sampled(rf_out) or checker task SystemC has direct C++ access to DUT internals; SV uses hierarchy paths
Halt detection Read dbg_halt port or poll i_rf.read_reg() Detect $finish or monitor a specific register SystemC polls via signal reads; SV often uses assertion or @(event)
Register file direct read u_cpu.i_rf.read_reg(i) — C++ member call u_dut.rf.regs[i] via hierarchy access Both allow hierarchical access in simulation; neither is synthesizable
Waveform dump sc_trace_file* tf = sc_create_vcd_trace_file("out") $dumpfile("out.vcd"); $dumpvars SystemC VCD requires explicit sc_trace() calls per signal

The Three Programs

Program 1 — Fibonacci(10) = 55

Fibonacci computed iteratively. Result lands in x2.

Algorithm:

a = 0, b = 1
repeat 10 times:
    temp = a + b
    a = b
    b = temp
result is b = fib(10) = 55

RISC-V Assembly with Hex Encodings:

# Register usage:
#   x1 = a (previous Fibonacci number)
#   x2 = b (current Fibonacci number)
#   x3 = temp
#   x5 = loop counter

addr  hex          mnemonic
0x00  0x00000093   addi  x1, x0, 0     # x1 = 0  (a = 0)
0x04  0x00100113   addi  x2, x0, 1     # x2 = 1  (b = 1)
0x08  0x00a00293   addi  x5, x0, 10    # x5 = 10 (counter)

# loop: (addr 0x0c)
0x0c  0x002081b3   add   x3, x1, x2    # temp = a + b
0x10  0x00010093   addi  x1, x2, 0     # a = b   (mv x1, x2)
0x14  0x00018113   addi  x2, x3, 0     # b = temp (mv x2, x3)
0x18  0xfff28293   addi  x5, x5, -1    # counter--
0x1c  0xfe029ce3   bne   x5, x0, -16   # if counter != 0, goto loop
                                        # target = 0x1c + (-16) = 0x0c ✓
0x20  0x00100073   ebreak               # halt

Expected final state: x2 = 55 (fib(10))

Encoding Verification for bne x5, x0, -16:

The B-type immediate -16 in binary (13-bit signed) is 1111111110000. The BNE encoding:

31      30:25   24:20   19:15   14:12  11:8    7      6:0
imm[12] imm[10:5] rs2   rs1    funct3 imm[4:1] imm[11] opcode
  1     111111    00000  00101   001   0000      1      1100011
= 1111 1110 0000 0010 1001 1100 1110 0011
= 0xFE029CE3  ✓

Program as C array:

static const uint32_t fib_prog[] = {
    0x00000093u,   // addi x1, x0, 0
    0x00100113u,   // addi x2, x0, 1
    0x00a00293u,   // addi x5, x0, 10
    0x002081b3u,   // add  x3, x1, x2
    0x00010093u,   // addi x1, x2, 0    (mv x1, x2)
    0x00018113u,   // addi x2, x3, 0    (mv x2, x3)
    0xfff28293u,   // addi x5, x5, -1
    0xfe029ce3u,   // bne  x5, x0, -16
    0x00100073u,   // ebreak
};

Program 2 — Sum of Array [1, 2, 3, 4, 5] = 15

Sums five words stored in data memory starting at address 0x100.

Algorithm:

array = {1, 2, 3, 4, 5} at mem[0x100..0x113]
sum = 0
ptr = 0x100
end = 0x114  (0x100 + 5*4)
while ptr < end:
    sum += mem[ptr]
    ptr += 4
result is sum = 15

Register usage:

x1 = ptr     (current array element address)
x2 = sum     (accumulator)
x6 = end_ptr (0x114 = loop termination address)
x7 = temp    (loaded word)

Data memory initialization (done by testbench before sc_start):

// Store array {1,2,3,4,5} at address 0x100
uint32_t array_data[] = {1, 2, 3, 4, 5};
for (int i = 0; i < 5; i++)
    cpu.i_dmem.load_word(0x100 + i*4, array_data[i]);

RISC-V Assembly:

addr  hex          mnemonic
0x00  0x10000093   addi  x1, x0, 0x100   # ptr = 0x100
0x04  0x00000113   addi  x2, x0, 0       # sum = 0
0x08  0x11400313   addi  x6, x0, 0x114   # end = 0x114

# loop: (addr 0x0c)
0x0c  0x0000a383   lw    x7, 0(x1)       # temp = mem[ptr]
0x10  0x00710133   add   x2, x2, x7      # sum += temp
0x14  0x00408093   addi  x1, x1, 4       # ptr += 4
0x18  0xfe60cce3   blt   x1, x6, -16     # if ptr < end, goto loop
                                          # target = 0x18 + (-16) = 0x08...
                                          # adjust: target = 0x0c
                                          # offset = 0x0c - 0x18 = -12
0x18  0xfe60cee3   blt   x1, x6, -12     # if ptr < end, goto loop (0x0c)
0x1c  0x00100073   ebreak

Encoding note for blt x1, x6, -12:

Offset -12 in 13-bit signed = 1111111110100. B-type encoding:
- rs1=x1 (00001), rs2=x6 (00110), funct3=BLT (100), opcode=BRANCH (1100011)
- imm[12]=1, imm[10:5]=111111, imm[4:1]=1010, imm[11]=1

= 1111 1110 0110 0000 1100 1110 1110 0011
= 0xFE60CEE3

Program as C array:

static const uint32_t sum_prog[] = {
    0x10000093u,   // addi x1, x0, 0x100   ptr = base address
    0x00000113u,   // addi x2, x0, 0       sum = 0
    0x11400313u,   // addi x6, x0, 0x114   end_ptr = base + 5*4
    0x0000a383u,   // lw   x7, 0(x1)       load word
    0x00710133u,   // add  x2, x2, x7      accumulate
    0x00408093u,   // addi x1, x1, 4       advance pointer
    0xfe60cee3u,   // blt  x1, x6, -12     loop back
    0x00100073u,   // ebreak
};

Expected final state: x2 = 15


Program 3 — Count Non-Zero Bytes in 0x01020304

Word 0x01020304 has 4 non-zero bytes. The program isolates each byte with andi and counts those that are non-zero.

Register usage:

x1 = word to test (0x01020304)
x2 = count (result)
x3 = byte under test
x4 = shift amount (0, 8, 16, 24)
x5 = loop counter (4 iterations)

RISC-V Assembly:

addr  hex          mnemonic
0x00  0x30400093   addi  x1, x0, 0x304   # x1 lower 12 bits = 0x304
                                          # We build 0x01020304 with LUI+ADDI:
0x00  0x010200b7   lui   x1, 0x01020     # x1 = 0x01020000
0x04  0x30408093   addi  x1, x1, 0x304   # x1 = 0x01020304
0x08  0x00000113   addi  x2, x0, 0       # count = 0
0x0c  0x00000213   addi  x4, x0, 0       # shift = 0
0x10  0x00400293   addi  x5, x0, 4       # loop counter = 4

# loop: (addr 0x14)
0x14  0x00408233   add   x4, x1, x0      # x4 = x1 (temp copy)
                                          # Better: use srl to extract byte
# Revised using SRLI + ANDI:
0x14  0x00405233   srl   x4, x1, x4      # not right — need shift amount in reg

The cleanest approach for this instruction set is to avoid variable shifts and instead pre-shift by constants:

addr  hex          mnemonic
0x00  0x010200b7   lui   x1, 0x01020     # x1 = 0x01020000
0x04  0x30408093   addi  x1, x1, 0x304   # x1 = 0x01020304
0x08  0x00000113   addi  x2, x0, 0       # count = 0

# Check byte 0 (bits 7:0)
0x0c  0x0ff0f213   andi  x4, x1, 0xff    # x4 = byte 0
0x10  0x00020463   beq   x4, x0, 8       # skip if zero
0x14  0x00110113   addi  x2, x2, 1       # count++

# Check byte 1 (bits 15:8)
0x18  0x00805213   srli  x4, x1, 8       # shift right 8
0x1c  0x0ff27213   andi  x4, x4, 0xff    # mask byte
0x20  0x00020463   beq   x4, x0, 8       # skip if zero
0x24  0x00110113   addi  x2, x2, 1       # count++

# Check byte 2 (bits 23:16)
0x28  0x01005213   srli  x4, x1, 16      # shift right 16
0x2c  0x0ff27213   andi  x4, x4, 0xff    # mask byte
0x30  0x00020463   beq   x4, x0, 8       # skip if zero
0x34  0x00110113   addi  x2, x2, 1       # count++

# Check byte 3 (bits 31:24)
0x38  0x01805213   srli  x4, x1, 24      # shift right 24
0x3c  0x0ff27213   andi  x4, x4, 0xff    # mask byte
0x40  0x00020463   beq   x4, x0, 8       # skip if zero
0x44  0x00110113   addi  x2, x2, 1       # count++

0x48  0x00100073   ebreak

Encoding key instructions:

srli x4, x1, 8 — I-type, opcode=0010011, funct3=101, funct7=0000000:

imm[11:0]=000000001000, rs1=00001, funct3=101, rd=00100, opcode=0010011
= 0000 0000 1000 0000 1101 0010 0001 0011
= 0x00805213  ✓

andi x4, x4, 0xff — I-type, opcode=0010011, funct3=111:

imm[11:0]=000011111111, rs1=00100, funct3=111, rd=00100, opcode=0010011
= 0000 1111 1111 0010 0111 0010 0001 0011
= 0x0FF27213  ✓

Program as C array:

static const uint32_t byte_count_prog[] = {
    0x010200b7u,   // lui   x1, 0x01020
    0x30408093u,   // addi  x1, x1, 0x304  → x1 = 0x01020304
    0x00000113u,   // addi  x2, x0, 0      count = 0
    0x0ff0f213u,   // andi  x4, x1, 0xff   byte 0
    0x00020463u,   // beq   x4, x0, +8     skip if zero
    0x00110113u,   // addi  x2, x2, 1      count++
    0x00805213u,   // srli  x4, x1, 8
    0x0ff27213u,   // andi  x4, x4, 0xff   byte 1
    0x00020463u,   // beq   x4, x0, +8
    0x00110113u,   // addi  x2, x2, 1      count++
    0x01005213u,   // srli  x4, x1, 16
    0x0ff27213u,   // andi  x4, x4, 0xff   byte 2
    0x00020463u,   // beq   x4, x0, +8
    0x00110113u,   // addi  x2, x2, 1      count++
    0x01805213u,   // srli  x4, x1, 24
    0x0ff27213u,   // andi  x4, x4, 0xff   byte 3
    0x00020463u,   // beq   x4, x0, +8
    0x00110113u,   // addi  x2, x2, 1      count++
    0x00100073u,   // ebreak
};

Expected final state: x2 = 4 (all four bytes of 0x01020304 are non-zero)


Program Loader

Two loading interfaces — one for in-memory arrays, one for hex files:

// Load a raw uint32_t array directly into imem
void load_program(imem& mem, const uint32_t* prog, size_t n) {
    for (size_t i = 0; i < n; i++)
        mem.load_word(i * 4, prog[i]);
}

// Load from a plain hex file (one 32-bit word per line, no 0x prefix).
// This format is compatible with objcopy --output-target=verilog and
// with the standard RISC-V toolchain's objdump -O binary pipeline.
void load_hex_file(imem& mem, const std::string& path) {
    std::ifstream f(path);
    if (!f.is_open()) {
        std::cerr << "ERROR: cannot open hex file: " << path << "\n";
        return;
    }
    uint32_t addr = 0, word;
    while (f >> std::hex >> word) {
        mem.load_word(addr, word);
        addr += 4;
    }
    std::cout << "Loaded " << (addr / 4) << " words from " << path << "\n";
}

The hex file format:

00000093
00100113
00a00293
002081b3
...

This is compatible with output from:

riscv32-unknown-elf-objcopy -O ihex program.elf program.hex
# or for raw binary:
riscv32-unknown-elf-objdump -d program.elf | grep "^\s" | awk '{print $2}' > program.hex

Instruction Trace Monitor

A standalone disassembler function decodes the most common RV32I instructions to human-readable form. This runs as an SC_METHOD sensitive to the debug ports, printing one line per instruction.

// Disassemble a single RV32I instruction word.
// Returns a string like "addi x1, x0, 5" or "add x3, x1, x2".
// Unknown encodings return "unknown".
std::string disassemble(uint32_t instr, uint32_t pc) {
    uint32_t opcode  = instr & 0x7F;
    uint32_t rd      = (instr >> 7)  & 0x1F;
    uint32_t funct3  = (instr >> 12) & 0x07;
    uint32_t rs1     = (instr >> 15) & 0x1F;
    uint32_t rs2     = (instr >> 20) & 0x1F;
    uint32_t funct7  = (instr >> 25) & 0x7F;

    // Sign-extended immediates
    int32_t imm_i = (int32_t)instr >> 20;
    int32_t imm_s = ((int32_t)(instr & 0xFE000000) >> 20)
                  | ((instr >> 7) & 0x1F);
    int32_t imm_b = ((int32_t)(instr & 0x80000000) >> 19)
                  | ((instr & 0x80)    << 4)
                  | ((instr >> 20)     & 0x7E0)
                  | ((instr >> 7)      & 0x1E);
    int32_t imm_u = (int32_t)(instr & 0xFFFFF000);
    int32_t imm_j = ((int32_t)(instr & 0x80000000) >> 11)
                  | (instr & 0xFF000)
                  | ((instr >> 9)  & 0x800)
                  | ((instr >> 20) & 0x7FE);

    // Register name helper
    auto xr = [](uint32_t r) -> std::string {
        return "x" + std::to_string(r);
    };

    char buf[80];
    switch (opcode) {
    case 0x33: // R-type
        if (funct3==0 && funct7==0x00) { snprintf(buf,sizeof(buf),"add  %s,%s,%s",  xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
        if (funct3==0 && funct7==0x20) { snprintf(buf,sizeof(buf),"sub  %s,%s,%s",  xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
        if (funct3==1 && funct7==0x00) { snprintf(buf,sizeof(buf),"sll  %s,%s,%s",  xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
        if (funct3==2 && funct7==0x00) { snprintf(buf,sizeof(buf),"slt  %s,%s,%s",  xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
        if (funct3==3 && funct7==0x00) { snprintf(buf,sizeof(buf),"sltu %s,%s,%s",  xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
        if (funct3==4 && funct7==0x00) { snprintf(buf,sizeof(buf),"xor  %s,%s,%s",  xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
        if (funct3==5 && funct7==0x00) { snprintf(buf,sizeof(buf),"srl  %s,%s,%s",  xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
        if (funct3==5 && funct7==0x20) { snprintf(buf,sizeof(buf),"sra  %s,%s,%s",  xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
        if (funct3==6 && funct7==0x00) { snprintf(buf,sizeof(buf),"or   %s,%s,%s",  xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
        if (funct3==7 && funct7==0x00) { snprintf(buf,sizeof(buf),"and  %s,%s,%s",  xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
        break;
    case 0x13: // I-type arithmetic
        if (funct3==0) { snprintf(buf,sizeof(buf),"addi %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
        if (funct3==2) { snprintf(buf,sizeof(buf),"slti %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
        if (funct3==3) { snprintf(buf,sizeof(buf),"sltiu %s,%s,%d",xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
        if (funct3==4) { snprintf(buf,sizeof(buf),"xori %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
        if (funct3==6) { snprintf(buf,sizeof(buf),"ori  %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
        if (funct3==7) { snprintf(buf,sizeof(buf),"andi %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
        if (funct3==1) { snprintf(buf,sizeof(buf),"slli %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),(imm_i&0x1F)); return buf; }
        if (funct3==5 && funct7==0x00) { snprintf(buf,sizeof(buf),"srli %s,%s,%d",xr(rd).c_str(),xr(rs1).c_str(),(imm_i&0x1F)); return buf; }
        if (funct3==5 && funct7==0x20) { snprintf(buf,sizeof(buf),"srai %s,%s,%d",xr(rd).c_str(),xr(rs1).c_str(),(imm_i&0x1F)); return buf; }
        break;
    case 0x03: // Loads
        if (funct3==0) { snprintf(buf,sizeof(buf),"lb  %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf; }
        if (funct3==1) { snprintf(buf,sizeof(buf),"lh  %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf; }
        if (funct3==2) { snprintf(buf,sizeof(buf),"lw  %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf; }
        if (funct3==4) { snprintf(buf,sizeof(buf),"lbu %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf; }
        if (funct3==5) { snprintf(buf,sizeof(buf),"lhu %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf; }
        break;
    case 0x23: // Stores
        if (funct3==0) { snprintf(buf,sizeof(buf),"sb  %s,%d(%s)", xr(rs2).c_str(),imm_s,xr(rs1).c_str()); return buf; }
        if (funct3==1) { snprintf(buf,sizeof(buf),"sh  %s,%d(%s)", xr(rs2).c_str(),imm_s,xr(rs1).c_str()); return buf; }
        if (funct3==2) { snprintf(buf,sizeof(buf),"sw  %s,%d(%s)", xr(rs2).c_str(),imm_s,xr(rs1).c_str()); return buf; }
        break;
    case 0x63: // Branches
        { const char* mn[] = {"beq","bne","???","???","blt","bge","bltu","bgeu"};
          snprintf(buf,sizeof(buf),"%s %s,%s,%d", mn[funct3],xr(rs1).c_str(),xr(rs2).c_str(),imm_b);
          return buf; }
    case 0x6F: // JAL
        snprintf(buf,sizeof(buf),"jal  %s,%d", xr(rd).c_str(),imm_j); return buf;
    case 0x67: // JALR
        snprintf(buf,sizeof(buf),"jalr %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf;
    case 0x37: // LUI
        snprintf(buf,sizeof(buf),"lui  %s,0x%x", xr(rd).c_str(),(uint32_t)imm_u>>12); return buf;
    case 0x17: // AUIPC
        snprintf(buf,sizeof(buf),"auipc %s,0x%x", xr(rd).c_str(),(uint32_t)imm_u>>12); return buf;
    case 0x73: // SYSTEM
        if (instr == 0x00100073) return "ebreak";
        if (instr == 0x00000073) return "ecall";
        break;
    }
    snprintf(buf, sizeof(buf), "unknown (0x%08x)", instr);
    return buf;
}

The trace monitor SC_METHOD wires into the testbench:

// In tb_cap2 SC_CTOR:
SC_METHOD(trace_monitor);
sensitive << clk.pos();

void tb_cap2::trace_monitor() {
    if (rst.read()) return;
    uint32_t pc    = dbg_pc.read();
    uint32_t instr = dbg_instr.read();
    std::string dis = disassemble(instr, pc);

    std::cout << std::dec << "[" << std::setw(3) << cycle_count << "] "
              << "PC=0x" << std::hex << std::setw(8) << std::setfill('0') << pc
              << "  " << std::left << std::setw(30) << dis
              << "\n";
    cycle_count++;
}

Sample trace output for Program 1:

[  1] PC=0x00000000  addi x1,x0,0
[  2] PC=0x00000004  addi x2,x0,1
[  3] PC=0x00000008  addi x5,x0,10
[  4] PC=0x0000000c  add  x3,x1,x2
[  5] PC=0x00000010  addi x1,x2,0
[  6] PC=0x00000014  addi x2,x3,0
[  7] PC=0x00000018  addi x5,x5,-1
[  8] PC=0x0000001c  bne  x5,x0,-16
[  9] PC=0x0000000c  add  x3,x1,x2
...
[ 88] PC=0x0000001c  bne  x5,x0,-16
[ 89] PC=0x00000020  ebreak

The Software Reference Model (SoftCPU)

SoftCPU is a pure C++ ISS — no SystemC, no simulation kernel, no time. It executes one instruction per step() call. The testbench calls it in lockstep with the RTL clock.

// soft_cpu.h
#pragma once
#include <cstdint>
#include <cstring>
#include <string>

class SoftCPU {
public:
    static constexpr uint32_t MEM_SIZE = 4096;

    SoftCPU() {
        std::memset(regs, 0, sizeof(regs));
        std::memset(mem,  0, sizeof(mem));
        pc = 0;
    }

    // Load a 32-bit word into instruction/data memory
    void load(uint32_t addr, uint32_t word) {
        if (addr + 3 < MEM_SIZE) {
            mem[addr+0] = (word >>  0) & 0xFF;
            mem[addr+1] = (word >>  8) & 0xFF;
            mem[addr+2] = (word >> 16) & 0xFF;
            mem[addr+3] = (word >> 24) & 0xFF;
        }
    }

    // Load a byte directly (for data memory preloading)
    void load_byte(uint32_t addr, uint8_t byte) {
        if (addr < MEM_SIZE) mem[addr] = byte;
    }

    // Execute one instruction. Returns false on EBREAK (halt).
    bool step();

    uint32_t get_reg(int i) const { return (i == 0) ? 0 : regs[i]; }
    uint32_t get_pc()        const { return pc; }

    // Reset to initial state
    void reset() {
        std::memset(regs, 0, sizeof(regs));
        pc = 0;
    }

private:
    uint32_t regs[32];
    uint32_t pc;
    uint8_t  mem[MEM_SIZE];

    uint32_t fetch() {
        return ((uint32_t)mem[pc+0])
             | ((uint32_t)mem[pc+1] << 8)
             | ((uint32_t)mem[pc+2] << 16)
             | ((uint32_t)mem[pc+3] << 24);
    }

    uint32_t read_word(uint32_t addr) {
        return ((uint32_t)mem[addr+0])
             | ((uint32_t)mem[addr+1] << 8)
             | ((uint32_t)mem[addr+2] << 16)
             | ((uint32_t)mem[addr+3] << 24);
    }

    uint16_t read_half(uint32_t addr) {
        return (uint16_t)(mem[addr] | (mem[addr+1] << 8));
    }

    void write_word(uint32_t addr, uint32_t val) {
        mem[addr+0] = (val >>  0) & 0xFF;
        mem[addr+1] = (val >>  8) & 0xFF;
        mem[addr+2] = (val >> 16) & 0xFF;
        mem[addr+3] = (val >> 24) & 0xFF;
    }

    void write_half(uint32_t addr, uint16_t val) {
        mem[addr+0] = (val >> 0) & 0xFF;
        mem[addr+1] = (val >> 8) & 0xFF;
    }

    void write_reg(int r, uint32_t val) {
        if (r != 0) regs[r] = val;   // x0 hardwired to 0
    }
};
// soft_cpu.cpp
#include "soft_cpu.h"

bool SoftCPU::step() {
    uint32_t instr  = fetch();
    uint32_t opcode = instr & 0x7F;
    uint32_t rd     = (instr >> 7)  & 0x1F;
    uint32_t funct3 = (instr >> 12) & 0x07;
    uint32_t rs1    = (instr >> 15) & 0x1F;
    uint32_t rs2    = (instr >> 20) & 0x1F;
    uint32_t funct7 = (instr >> 25) & 0x7F;

    // Sign-extended immediates
    int32_t imm_i = (int32_t)instr >> 20;
    int32_t imm_s = ((int32_t)(instr & 0xFE000000) >> 20)
                  | (int32_t)((instr >> 7) & 0x1F);
    int32_t imm_b = ((int32_t)(instr & 0x80000000) >> 19)
                  | (int32_t)((instr & 0x80) << 4)
                  | (int32_t)((instr >> 20) & 0x7E0)
                  | (int32_t)((instr >> 7)  & 0x1E);
    int32_t imm_u = (int32_t)(instr & 0xFFFFF000);
    int32_t imm_j = ((int32_t)(instr & 0x80000000) >> 11)
                  | (int32_t)(instr & 0xFF000)
                  | (int32_t)((instr >> 9)  & 0x800)
                  | (int32_t)((instr >> 20) & 0x7FE);

    uint32_t r1 = get_reg(rs1);
    uint32_t r2 = get_reg(rs2);
    uint32_t next_pc = pc + 4;

    switch (opcode) {
    case 0x33: // R-type
        switch (funct3) {
        case 0: write_reg(rd, (funct7==0x20) ? (r1 - r2) : (r1 + r2)); break;
        case 1: write_reg(rd, r1 << (r2 & 0x1F)); break;
        case 2: write_reg(rd, (int32_t)r1 < (int32_t)r2 ? 1 : 0); break;
        case 3: write_reg(rd, r1 < r2 ? 1 : 0); break;
        case 4: write_reg(rd, r1 ^ r2); break;
        case 5: write_reg(rd, (funct7==0x20)
                    ? (uint32_t)((int32_t)r1 >> (r2 & 0x1F))
                    : (r1 >> (r2 & 0x1F))); break;
        case 6: write_reg(rd, r1 | r2); break;
        case 7: write_reg(rd, r1 & r2); break;
        }
        break;

    case 0x13: // I-type arithmetic
        switch (funct3) {
        case 0: write_reg(rd, r1 + (uint32_t)imm_i); break;  // ADDI
        case 1: write_reg(rd, r1 << (imm_i & 0x1F)); break;  // SLLI
        case 2: write_reg(rd, (int32_t)r1 < imm_i ? 1 : 0); break; // SLTI
        case 3: write_reg(rd, r1 < (uint32_t)imm_i ? 1 : 0); break; // SLTIU
        case 4: write_reg(rd, r1 ^ (uint32_t)imm_i); break;  // XORI
        case 5: write_reg(rd, (funct7==0x20)
                    ? (uint32_t)((int32_t)r1 >> (imm_i & 0x1F))
                    : (r1 >> (imm_i & 0x1F))); break;  // SRLI/SRAI
        case 6: write_reg(rd, r1 | (uint32_t)imm_i); break;  // ORI
        case 7: write_reg(rd, r1 & (uint32_t)imm_i); break;  // ANDI
        }
        break;

    case 0x03: // Loads
        { uint32_t ea = r1 + (uint32_t)imm_i;
          switch (funct3) {
          case 0: write_reg(rd, (uint32_t)(int32_t)(int8_t) mem[ea]); break;  // LB
          case 1: write_reg(rd, (uint32_t)(int32_t)(int16_t)read_half(ea)); break; // LH
          case 2: write_reg(rd, read_word(ea)); break;  // LW
          case 4: write_reg(rd, (uint32_t)mem[ea]); break;  // LBU
          case 5: write_reg(rd, (uint32_t)read_half(ea)); break; // LHU
          } }
        break;

    case 0x23: // Stores
        { uint32_t ea = r1 + (uint32_t)imm_s;
          switch (funct3) {
          case 0: mem[ea] = r2 & 0xFF; break;  // SB
          case 1: write_half(ea, (uint16_t)(r2 & 0xFFFF)); break;  // SH
          case 2: write_word(ea, r2); break;  // SW
          } }
        break;

    case 0x63: // Branches
        { bool taken = false;
          switch (funct3) {
          case 0: taken = (r1 == r2); break;                            // BEQ
          case 1: taken = (r1 != r2); break;                            // BNE
          case 4: taken = ((int32_t)r1 < (int32_t)r2); break;          // BLT
          case 5: taken = ((int32_t)r1 >= (int32_t)r2); break;         // BGE
          case 6: taken = (r1 < r2); break;                             // BLTU
          case 7: taken = (r1 >= r2); break;                            // BGEU
          }
          if (taken) next_pc = pc + (uint32_t)imm_b; }
        break;

    case 0x6F: // JAL
        write_reg(rd, pc + 4);
        next_pc = pc + (uint32_t)imm_j;
        break;

    case 0x67: // JALR
        write_reg(rd, pc + 4);
        next_pc = (r1 + (uint32_t)imm_i) & ~1u;
        break;

    case 0x37: // LUI
        write_reg(rd, (uint32_t)imm_u);
        break;

    case 0x17: // AUIPC
        write_reg(rd, pc + (uint32_t)imm_u);
        break;

    case 0x73: // SYSTEM
        if (instr == 0x00100073) return false;  // EBREAK — halt
        break;
    }

    pc = next_pc;
    return true;  // continue
}

Self-Checking Testbench

The testbench structure: load the program into both models, step both in lockstep, compare all registers, report pass/fail.

// tb/tb_cap2.cpp
#include <systemc.h>
#include "rv32i_cpu.h"
#include "soft_cpu.h"
#include <iostream>
#include <iomanip>
#include <sstream>

// Include the disassembler (from trace monitor section above)
std::string disassemble(uint32_t instr, uint32_t pc);

// ─────────────────────────────────────────────────────────────────────
// Test programs (defined above in "The Three Programs" section)
// ─────────────────────────────────────────────────────────────────────
static const uint32_t fib_prog[] = {
    0x00000093u, 0x00100113u, 0x00a00293u,
    0x002081b3u, 0x00010093u, 0x00018113u,
    0xfff28293u, 0xfe029ce3u, 0x00100073u,
};

static const uint32_t sum_prog[] = {
    0x10000093u, 0x00000113u, 0x11400313u,
    0x0000a383u, 0x00710133u, 0x00408093u,
    0xfe60cee3u, 0x00100073u,
};

static const uint32_t byte_count_prog[] = {
    0x010200b7u, 0x30408093u, 0x00000113u,
    0x0ff0f213u, 0x00020463u, 0x00110113u,
    0x00805213u, 0x0ff27213u, 0x00020463u, 0x00110113u,
    0x01005213u, 0x0ff27213u, 0x00020463u, 0x00110113u,
    0x01805213u, 0x0ff27213u, 0x00020463u, 0x00110113u,
    0x00100073u,
};

// ─────────────────────────────────────────────────────────────────────
SC_MODULE(tb_cap2) {
    sc_signal<bool>         clk;
    sc_signal<bool>         rst;
    sc_signal<sc_uint<32>>  dbg_pc;
    sc_signal<sc_uint<32>>  dbg_instr;
    sc_signal<bool>         dbg_halt;

    rv32i_cpu u_cpu;
    SoftCPU   soft;

    int cycle_count = 0;
    int fail_count  = 0;

    SC_CTOR(tb_cap2) : u_cpu("u_cpu") {
        u_cpu.clk(clk);
        u_cpu.rst(rst);
        u_cpu.dbg_pc(dbg_pc);
        u_cpu.dbg_instr(dbg_instr);
        u_cpu.dbg_halt(dbg_halt);

        SC_THREAD(run_all_tests);
    }

    // ── Clock helper ─────────────────────────────────────────────────
    void tick() {
        wait(10, SC_NS); clk.write(true);
        wait(10, SC_NS); clk.write(false);
        cycle_count++;
    }

    // ── Apply reset ──────────────────────────────────────────────────
    void do_reset() {
        rst.write(true);
        clk.write(false);
        tick(); tick();
        rst.write(false);
        cycle_count = 0;
    }

    // ── Load program into both models ─────────────────────────────────
    void load_prog(const uint32_t* prog, size_t n) {
        for (size_t i = 0; i < n; i++) {
            u_cpu.i_imem.load_word(i * 4, prog[i]);
            soft.load(i * 4, prog[i]);
        }
    }

    // ── Compare all 32 registers ─────────────────────────────────────
    bool compare_regs(uint32_t pc) {
        bool ok = true;
        for (int i = 1; i < 32; i++) {
            uint32_t rtl_val  = u_cpu.i_rf.read_reg(i);
            uint32_t soft_val = soft.get_reg(i);
            if (rtl_val != soft_val) {
                std::cerr << "  MISMATCH at PC=0x" << std::hex << pc
                          << "  x" << std::dec << i
                          << ": RTL=0x" << std::hex << rtl_val
                          << "  REF=0x" << soft_val << "\n";
                ok = false;
                fail_count++;
            }
        }
        return ok;
    }

    // ── Run one program to completion ─────────────────────────────────
    bool run_program(const char* name,
                     const uint32_t* prog, size_t n,
                     int check_reg, uint32_t expected)
    {
        std::cout << "\n=== Program: " << name << " ===\n";

        // Reset both models
        do_reset();
        soft.reset();

        // Load program
        load_prog(prog, n);

        int steps = 0;
        const int MAX_STEPS = 200;

        while (!dbg_halt.read() && steps < MAX_STEPS) {
            // Advance RTL one clock
            wait(10, SC_NS); clk.write(true);
            wait(1,  SC_NS);  // combinational settling time

            uint32_t pc    = dbg_pc.read();
            uint32_t instr = dbg_instr.read();

            // Print trace
            std::cout << "[" << std::dec << std::setw(3) << (steps+1) << "] "
                      << "PC=0x" << std::hex << std::setw(8) << std::setfill('0') << pc
                      << "  " << std::left << std::setw(28) << disassemble(instr, pc)
                      << "\n";

            wait(9, SC_NS); clk.write(false);

            // Advance software model
            bool running = soft.step();

            // Compare state
            compare_regs(pc);

            steps++;
            if (!running) break;
        }

        if (steps >= MAX_STEPS) {
            std::cerr << "TIMEOUT: " << name << " did not halt after "
                      << MAX_STEPS << " steps\n";
            return false;
        }

        // Final register check
        uint32_t got = u_cpu.i_rf.read_reg(check_reg);
        bool pass = (got == expected) && (fail_count == 0);

        if (pass) {
            std::cout << "PASS: x" << std::dec << check_reg
                      << " = " << expected << "\n";
        } else {
            std::cerr << "FAIL: x" << std::dec << check_reg
                      << " = 0x" << std::hex << got
                      << " (expected 0x" << expected << ")\n";
        }
        return pass;
    }

    // ── Main test sequence ────────────────────────────────────────────
    void run_all_tests() {
        int total_pass = 0;
        int total_tests = 3;

        if (run_program("Fibonacci(10)",
                        fib_prog, sizeof(fib_prog)/sizeof(fib_prog[0]),
                        2, 55))
            total_pass++;

        if (run_program("Array Sum [1..5]",
                        sum_prog, sizeof(sum_prog)/sizeof(sum_prog[0]),
                        2, 15))
            total_pass++;

        // Pre-load array data for Program 2 — note: run_program resets,
        // so data memory preloading must happen after reset.
        // For Program 2 specifically, call a variant:
        do_reset();
        soft.reset();
        load_prog(sum_prog, sizeof(sum_prog)/sizeof(sum_prog[0]));
        uint32_t arr[] = {1, 2, 3, 4, 5};
        for (int i = 0; i < 5; i++) {
            u_cpu.i_dmem.load_word(0x100 + i*4, arr[i]);
            soft.load(0x100 + i*4, arr[i]);
        }
        // (run the sum program again with data preloaded — total_pass already counted)

        if (run_program("Non-zero Byte Count (0x01020304)",
                        byte_count_prog,
                        sizeof(byte_count_prog)/sizeof(byte_count_prog[0]),
                        2, 4))
            total_pass++;

        std::cout << "\n";
        std::cout << "========================================\n";
        if (total_pass == total_tests)
            std::cout << "All " << total_tests
                      << " programs passed. Single-cycle RV32I CPU verified!\n";
        else
            std::cout << total_pass << " / " << total_tests
                      << " programs passed.\n";
        std::cout << "========================================\n";

        sc_stop();
    }
};

int sc_main(int, char**) {
    tb_cap2 tb("tb");
    sc_start();
    return 0;
}

Build and Run

# CMakeLists.txt
cmake_minimum_required(VERSION 3.15)
project(cap2_sim CXX)
set(CMAKE_CXX_STANDARD 14)

if(NOT DEFINED SYSTEMC_HOME)
    set(SYSTEMC_HOME $ENV{SYSTEMC_HOME})
endif()

find_library(SYSTEMC_LIB systemc
    PATHS ${SYSTEMC_HOME}/lib ${SYSTEMC_HOME}/lib-linux64
          ${SYSTEMC_HOME}/lib-macosx64 REQUIRED)
find_path(SYSTEMC_INCLUDE systemc.h
    PATHS ${SYSTEMC_HOME}/include REQUIRED)

include_directories(${SYSTEMC_INCLUDE} include)

set(SOURCES
    src/alu.cpp
    src/reg_file.cpp
    src/decoder.cpp
    src/pc.cpp
    src/imem.cpp
    src/dmem.cpp
    src/rv32i_cpu.cpp
    src/soft_cpu.cpp
)

add_executable(cap2_sim ${SOURCES} tb/tb_cap2.cpp)
target_link_libraries(cap2_sim ${SYSTEMC_LIB})
mkdir -p build && cd build
cmake .. -DSYSTEMC_HOME=$SYSTEMC_HOME
make -j4
./cap2_sim

Expected output:

=== Program: Fibonacci(10) ===
[  1] PC=0x00000000  addi x1,x0,0
[  2] PC=0x00000004  addi x2,x0,1
[  3] PC=0x00000008  addi x5,x0,10
[  4] PC=0x0000000c  add  x3,x1,x2
...
[ 89] PC=0x00000020  ebreak
PASS: x2 = 55

=== Program: Array Sum [1..5] ===
[  1] PC=0x00000000  addi x1,x0,256
...
PASS: x2 = 15

=== Program: Non-zero Byte Count (0x01020304) ===
[  1] PC=0x00000000  lui  x1,0x01020
...
PASS: x2 = 4

========================================
All 3 programs passed. Single-cycle RV32I CPU verified!
========================================

DV Insight: Lockstep Verification

The lockstep pattern in this testbench — RTL model and software reference model running in parallel, registers compared after every instruction — is the gold standard of CPU validation.

ARM's CPU validation team uses exactly this methodology. Their software reference is the ARM Fast Models: a C++ ISS that runs at hundreds of millions of instructions per second, accurate to the architectural specification. The RTL under test runs in VCS, Xcelium, or Questa. A co-simulation infrastructure advances both models one cycle at a time and compares architectural state — all 32 (or 64, or 256) registers, the program counter, the condition codes, the system registers — after every instruction retirement. At 1 GHz gate simulation speed, that is 10^9 comparison checks per second of simulated time.

Intel uses the same approach. Their ISS is called SDE (Software Development Emulator) and is publicly available. For x86 validation, SDE handles approximately 2,000 instruction variants, AVX-512 vector registers, and multiple privilege levels. The comparison logic is proportionally more complex, but the loop is identical: fetch one instruction, execute in RTL, execute in SDE, compare, continue.

RISC-V International maintains the official riscv-arch-test suite — over 400 assembly test cases covering every instruction and every encoding edge case defined in the specification. Professional CPU teams (SiFive, Western Digital SweRV, CHIPS Alliance Ibex, OpenHW CVA6) must pass all 400+ tests before tapeout. The test infrastructure is exactly our testbench, scaled up: load a program, run to EBREAK, check a designated register against an expected value encoded in the test itself.

Our testbench is a miniature version of this methodology. The three programs in this post test 12 distinct instruction types across 89+ execution steps. That is not sufficient for tapeout — but it is the correct framework. Adding the riscv-arch-test suite to a production version of this testbench is a matter of loading the test programs and reading their pass/fail signatures.


Simulation Semantics: sc_start, sc_stop, and the Simulation Time Axis

A critical difference between SystemC and SV for testbench writers is how simulation time is controlled.

In SystemVerilog

Time advances automatically as the simulator executes. The testbench is a passive participant:

initial begin
    rst = 1;
    #20ns;            // Simulator advances 20ns while this process is blocked
    rst = 0;
    #100ns;
    $finish;          // Request end of simulation
end

// Equivalent "advance time" constructs:
#10ns;                // Wait for 10ns
@(posedge clk);       // Wait for event
repeat(5) @(posedge clk);  // Wait for 5 clock edges

In SystemC

Time advances only when sc_start() is called (or implicitly through wait() inside a process). Your sc_main function has explicit control:

// Run for 100ns, then return to sc_main for C++ code
sc_start(100, SC_NS);
std::cout << "After 100ns: PC = 0x" << cpu.dbg_pc.read() << "\n";

// Run for another 50ns
sc_start(50, SC_NS);

// Run until sc_stop() is called from within a process
sc_start();  // No arguments: run indefinitely until sc_stop()

This is architecturally significant: sc_main can interleave simulation with C++ post-processing between sc_start calls. You can run 100 cycles, check state, modify the DUT (if accessible), and run another 100 cycles. SV initial blocks cannot do this — once simulation ends, there is no re-entry.

sc_stop() Behavior

sc_stop() is called from inside a process (typically an SC_THREAD or SC_CTHREAD). It sets a flag requesting simulation to end. Important: it does not immediately halt. The current evaluate phase completes. Code after sc_stop() in the same process still executes:

void tb::run() {
    // ... run program ...
    std::cout << "Requesting stop\n";
    sc_stop();
    std::cout << "This STILL executes\n";  // Prints — sc_stop is not an abort
    // Process returns normally; simulation ends after this delta cycle
}

Compare to SV:

$finish;  // Also does not abort immediately — current time step completes
          // Other processes at the same simulation time may still run

Simulation Time Stamp

// SystemC:
sc_time now = sc_time_stamp();
std::cout << "Time: " << now.to_double() << " " << sc_get_time_resolution() << "\n";
// Or: std::cout << now << "\n";  // Streams as "100 ns"

// SV equivalents:
// $time    → integer (in simulation time units)
// $realtime → real (in simulation time units)
// $stime   → 32-bit integer time

ISS vs. RTL Model — What Is the Difference?

Both SoftCPU and rv32i_cpu execute RISC-V programs. They are fundamentally different models:

Property SoftCPU (ISS) rv32i_cpu (RTL model)
Simulation time No simulation time — functional only Cycle-accurate — every state change has a timestamp
Signals None — just C++ variables sc_signal<T> — observable in waveforms
Hardware structure No structure — flat ISA decode/execute Modules, ports, processes — models physical hierarchy
Clock awareness No clock — step() is a function call Clock-driven — state changes at clock edges
Execution speed Millions of instructions per second (pure software) Thousands per second (simulation kernel overhead)
Use case Golden reference, fast pre-silicon software dev RTL verification, timing analysis, structural coverage

The lockstep methodology works because the ISS is the specification by definition. When SoftCPU::step() and rv32i_cpu (after one clock edge) disagree about a register value, the bug is in the RTL — always. The ISS is correct because it directly implements the ISA specification in software.

Lockstep loop (one iteration per instruction):

  1. Wait for RTL clock edge to settle (wait 1 delta after posedge)
  2. Capture RTL register state: u_cpu.i_rf.read_reg(1..31)
  3. Capture RTL PC: dbg_pc.read()
  4. Call soft.step() — advances software model one instruction
  5. Capture ISS register state: soft.get_reg(1..31)
  6. Assert: RTL state == ISS state for all 31 registers
  7. If mismatch: report FAIL with PC, register name, both values
  8. If dbg_halt: exit loop → test complete

This is the methodology used by ARM (ARM Fast Models), Intel (SDE), and RISC-V (riscv-arch-test + Spike). The scale differs — Spike handles millions of instructions per second against RTL running at nanosecond granularity — but the comparison loop is identical.

SV DPI vs. SystemC Co-Simulation

When implementing this methodology in SV, the reference model lives in a separate C++ shared library and is accessed via DPI:

// SV testbench with C++ reference model via DPI
import "DPI-C" context function int soft_cpu_step(
    input  int     instr,
    output int     new_pc,
    inout  int     regs[32]
);

// In the testbench clock process:
@(posedge clk);
err = soft_cpu_step(instr, new_pc, ref_regs);
foreach (ref_regs[i]) begin
    assert(dut.rf.regs[i] == ref_regs[i]) else $error(...);
end

This requires:
1. A C header file declaring the DPI interface
2. A compiled shared library loaded by the simulator
3. Cross-language data type mapping (SV arrays ↔ C arrays)
4. Simulator-specific DPI compilation flags

In SystemC, SoftCPU is just a C++ class in the same .cpp file:

// SystemC testbench — no DPI, no shared library, no interface files
SoftCPU soft;
// ...
soft.step();
// Compare directly:
for (int i = 1; i < 32; i++)
    assert(u_cpu.i_rf.read_reg(i) == soft.get_reg(i));

No DPI overhead. No linker flags. No type mapping. This simplicity is one of SystemC's key advantages for CPU verification: the DUT and the checker are both native C++ objects in the same executable.


Common Pitfalls for SV Engineers

Pitfall 1: sc_stop() does not abort immediately — code after it still runs

void tb::run() {
    // ...
    sc_stop();           // Sets stop flag
    post_process();      // STILL EXECUTES — this is a common surprise
}

If post_process() performs register comparisons, they may see stale values (the simulation has not actually ended yet at this point). To be safe: do all final checks before calling sc_stop(), or in a separate end_of_simulation() callback.

Pitfall 2: SoftCPU state must be reset between test programs

The run_program() function calls do_reset() for the RTL model (applies rst for 2 cycles). But SoftCPU has no clock — it must be reset separately:

// In run_program():
do_reset();     // RTL model reset via clock + rst signal
soft.reset();   // ISS reset — separate call required

Missing soft.reset() between programs causes the second program to start with register state left over from the first. This produces false mismatches on the first instruction if any register used by program 2 was modified by program 1.

Pitfall 3: Comparison must happen AFTER all delta cycles settle

The lockstep comparison must fire after the RTL state has fully updated — after all SC_METHOD processes triggered by the clock edge have evaluated. This is why the testbench uses wait(1, SC_NS) after the clock edge:

wait(10, SC_NS); clk.write(true);   // Rising edge
wait(1, SC_NS);                     // Let delta cycles settle
// NOW read dbg_pc and compare with ISS — state is stable
uint32_t pc = dbg_pc.read();

Comparing immediately after the clock write (before the 1ns wait) reads state from the previous cycle. The 1ns gap ensures at least one real-time step has elapsed, allowing all delta cycles to complete.

Pitfall 4: EBREAK (0x00100073) requires a special decoder case

EBREAK is opcode SYSTEM (0x73) with immediate 0x001. If the decoder does not handle SYSTEM opcodes and falls through to a default NOP path, the CPU ignores EBREAK and runs past the end of the program — fetching from the NOP-filled remainder of instruction memory, looping forever.

The symptom: simulation hits the steps >= MAX_STEPS timeout. The fix: the decoder must set sig_halt = true when it sees instr == 0x00100073, and branch_logic must freeze the PC (next = pc) when sig_halt is asserted.

Pitfall 5: Program hex file byte order — little-endian, 32-bit words

The hex file loaded by load_hex_file() and $readmemh contains one 32-bit instruction word per line. RISC-V is little-endian, meaning the 4 bytes of each word are stored in memory LSB-first. The hex file itself lists the word value (not individual bytes), so no byte-swapping is needed at load time:

# Hex file line: 00500093
# Meaning: instruction word = 0x00500093 = addi x1, x0, 5
# Loaded into mem[0..3] as: 93 00 50 00 (little-endian)

If you use objcopy to generate a raw binary and then display it with xxd, you see the little-endian bytes in memory order (93 00 50 00). If you use objdump -d, you see the instruction word value (00500093). The hex file uses the word value format — one 32-bit hex value per instruction. Do not interpret each hex digit as a byte when loading.


Section 2 Summary

You have built a complete, verified, single-cycle RV32I CPU in SystemC.

Module File Approx. Lines What it verifies
ALU alu.h / alu.cpp ~80 12 RV32I operations, flags
reg_file reg_file.h / reg_file.cpp ~60 32 registers, x0 hardwired to 0
decoder decoder.h / decoder.cpp ~120 42 instructions → control signals
pc + imem pc.h, imem.h / .cpp ~80 Fetch, branch, jump, reset
dmem dmem.h / dmem.cpp ~100 8 access widths, sign extension
rv32i_cpu rv32i_cpu.h / .cpp ~150 Integration, mux logic, JALR
SoftCPU soft_cpu.h / .cpp ~120 Golden reference model
Total ~710 3 real programs, lockstep verification

Every line of this code is synthesizable (except the testbench) and corresponds directly to RTL you would write in SystemVerilog. The SystemC gives you something SystemVerilog testbenches rarely offer: a clean integration between the hardware model and an arbitrary C++ reference model in the same simulation kernel.


What's Next

Section 3 begins with Post 14: TLM 2.0 Concepts.

You will replace the cycle-accurate dmem with a TLM 2.0 socket interface. Instead of a clocked module that drives byte-enable signals on specific clock edges, the CPU will issue transactions — structured objects describing what operation, to what address, with what data — and the memory will respond without either side caring about clock cycles.

This is the abstraction level at which modern SoC verification is done. An AXI bus fabric, a DDR controller, a PCIe endpoint — these are all TLM 2.0 initiators and targets in a production SystemC model. After Post 14, our CPU model will be able to connect to any of them.

The single-cycle CPU you verified in this post is the foundation. Everything that follows — pipelining in Section 4, cache models in Section 5, full SoC integration in Section 6 — is built on top of it.

← Part 12: Single-Cycle CPU Integration Part 13 of 13 Section 1 complete ✓
Author
Mayur Kubavat
VLSI Design and Verification Engineer sharing knowledge about SystemVerilog, UVM, and hardware verification methodologies.

Comments (0)

Leave a Comment