13. SystemC Tutorial - Capstone 2: Running Real RV32I Programs
Introduction
By the end of this post, your SystemC model executes real RISC-V machine code.
Not contrived test vectors. Not hand-wired stimulus that drives specific ports with specific values at specific times. Actual programs: binary sequences your compiler produces, loaded into instruction memory, fetched by the program counter, decoded, executed, and written back — exactly as a silicon implementation would do it.
That makes this model a functional instruction set simulator, or ISS. The ARM Fast Models, gem5, QEMU, and Spike — the official RISC-V ISS from UC Berkeley — all started exactly this way. A software model that could execute binaries, fast enough to boot a kernel or run test suites, accurate enough to serve as the golden reference for RTL verification.
The difference between our model and Spike is that Spike handles the full RISC-V privilege architecture, virtual memory, M/S/U privilege levels, CSRs, and optional extensions. Our model handles the 42 instructions of RV32I. The methodology is identical. You will use both.
This post runs three programs: Fibonacci(10), array sum, and byte counting. Each program is shown as C pseudocode, then RISC-V assembly with exact hex encodings, then executed on the CPU. A software reference model — a pure C++ class called SoftCPU — runs in lockstep with your SystemC model, comparing all 32 registers after every instruction. Any divergence between the two models is a bug.
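This is the gold standard of CPU validation. After Post 13, your single-cycle RV32I CPU has run real programs in lockstep against an independent reference model, which is the foundation that larger verification suites build on.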
Prerequisites
- Post 12: Single-cycle CPU integration (rv32i_cpu, all five submodules wired)
- Code for this post: GitHub (section2/post13)
Architecture Overview
graph TD
TB["tb_cap2\n(SC_MODULE)"] --> CPU["rv32i_cpu\n(DUT)"]
TB --> SOFT["SoftCPU\n(reference model)"]
CPU -->|dbg_pc, dbg_instr| MON["trace_monitor\n(SC_METHOD)"]
MON --> LOG["stdout\ntrace log"]
CPU -->|dbg_pc, dbg_instr| CMP["comparator\n(SC_METHOD)"]
SOFT --> CMP
CMP -->|PASS/FAIL| RESULT["Test Result"]
The testbench owns both the DUT (rv32i_cpu) and a pure software reference (SoftCPU). After each clock edge, it advances the software model one instruction and compares all 32 registers. The trace monitor logs every instruction to stdout. The comparator flags divergence immediately, naming the register and both values.
SystemC Language Reference
| Construct | Syntax | SV / Verilog Equivalent | Key Difference |
|---|---|---|---|
| Simulation start with time limit | sc_start(100, SC_NS) | #100ns; in initial block | SV blocks the current process; sc_start returns control to C++ |
| Request simulation stop | sc_stop() | $finish | sc_stop requests a stop and the calling process keeps running until it yields; $finish ends the simulation at that statement |
| Current simulation time | sc_time_stamp() (returns sc_time) | $time (integer) / $realtime (real) | sc_time is an object; use .to_double() or stream it |
| Software reference model | Plain C++ class SoftCPU, no SystemC | Task / function in SV, or DPI-C import | SystemC: no DPI overhead; SoftCPU is just a C++ object in the same process |
| DPI equivalent | Not needed: C++ class in same TU | import "DPI-C" function void step(...) | SystemC's key advantage: reference model needs no interface definition |
| Clock generation | SC_THREAD toggling clk / sc_clock object | always #5ns clk = ~clk or clocking block | sc_clock is a channel; SV clock is usually an always or generate block |
| Lockstep comparison | C++ for loop reading i_rf.read_reg(i) | $sampled(rf_out) or checker task | SystemC has direct C++ access to DUT internals; SV uses hierarchy paths |
| Halt detection | Read dbg_halt port or poll i_rf.read_reg() | Detect $finish or monitor a specific register | SystemC polls via signal reads; SV often uses assertion or @(event) |
| Register file direct read | u_cpu.i_rf.read_reg(i), a C++ member call | u_dut.rf.regs[i] via hierarchy access | Both allow hierarchical access in simulation; neither is synthesizable |
| Waveform dump | sc_trace_file* tf = sc_create_vcd_trace_file("out") | $dumpfile("out.vcd"); $dumpvars | SystemC VCD requires explicit sc_trace() calls per signal |
The Three Programs
Program 1 — Fibonacci(10) = 55
Fibonacci computed iteratively. After ten iterations the result fib(10) = 55 is in x1 (a); x2 (b) is one step ahead and holds fib(11) = 89.
Algorithm:
a = 0, b = 1
repeat 10 times:
temp = a + b
a = b
b = temp
result is a = fib(10) = 55 (b is one step ahead: fib(11) = 89)
RISC-V Assembly with Hex Encodings:
# Register usage:
# x1 = a (previous Fibonacci number)
# x2 = b (current Fibonacci number)
# x3 = temp
# x5 = loop counter
addr hex mnemonic
0x00 0x00000093 addi x1, x0, 0 # x1 = 0 (a = 0)
0x04 0x00100113 addi x2, x0, 1 # x2 = 1 (b = 1)
0x08 0x00a00293 addi x5, x0, 10 # x5 = 10 (counter)
# loop: (addr 0x0c)
0x0c 0x002081b3 add x3, x1, x2 # temp = a + b
0x10 0x00010093 addi x1, x2, 0 # a = b (mv x1, x2)
0x14 0x00018113 addi x2, x3, 0 # b = temp (mv x2, x3)
0x18 0xfff28293 addi x5, x5, -1 # counter--
0x1c 0xfe0298e3 bne x5, x0, -16 # if counter != 0, goto loop
# target = 0x1c + (-16) = 0x0c ✓
0x20 0x00100073 ebreak # halt
Expected final state: x1 = 55 (fib(10)), x2 = 89 (fib(11))
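Tracing the loop in plain C++ is a cheap way to confirm which register holds which Fibonacci number when the program halts. The sketch below mirrors the assembly one instruction per statement; the names FibState, run_fib_loop, and check_fib_loop are hypothetical helpers invented for this illustration and are not part of the post's code base.

```cpp
#include <cassert>
#include <cstdint>

// Final values of the two registers the loop updates.
struct FibState {
    uint32_t x1;  // a
    uint32_t x2;  // b
};

// Mirrors the assembly statement by statement. Like the assembly, the body
// runs before the counter is tested, so `iterations` must be >= 1.
inline FibState run_fib_loop(uint32_t iterations) {
    uint32_t x1 = 0, x2 = 1, x5 = iterations;
    do {
        uint32_t x3 = x1 + x2;  // add  x3, x1, x2
        x1 = x2;                // addi x1, x2, 0
        x2 = x3;                // addi x2, x3, 0
        --x5;                   // addi x5, x5, -1
    } while (x5 != 0);          // bne  x5, x0, loop
    return {x1, x2};
}

// Call from a unit test or from main().
inline void check_fib_loop() {
    const FibState s = run_fib_loop(10);
    assert(s.x1 == 55);  // fib(10)
    assert(s.x2 == 89);  // fib(11)
}
```

Starting from a = fib(0) and b = fib(1), every pass advances both by one position, so n passes leave fib(n) in x1 and fib(n+1) in x2.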
Encoding Verification for bne x5, x0, -16:
The B-type immediate -16 as a 13-bit signed value is 1 1111 1111 0000, so imm[12]=1, imm[11]=1, imm[10:5]=111111, and imm[4:1]=1000. The BNE encoding:
31 30:25 24:20 19:15 14:12 11:8 7 6:0
imm[12] imm[10:5] rs2 rs1 funct3 imm[4:1] imm[11] opcode
1 111111 00000 00101 001 1000 1 1100011
= 1111 1110 0000 0010 1001 1000 1110 0011
= 0xFE0298E3 ✓
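Hand-packing the scattered B-type immediate is error-prone, so it can help to compute the word in code and compare. The following is a minimal sketch, assuming the standard RV32I B-type field layout; encode_b and check_branch_encodings are hypothetical helper names, not functions from the post's repository.

```cpp
#include <cassert>
#include <cstdint>

// Pack an RV32I branch. `offset` is the byte offset from the branch itself
// (even, in the range [-4096, 4094]); the opcode is 0x63 for all branches.
constexpr uint32_t encode_b(uint32_t funct3, uint32_t rs1, uint32_t rs2,
                            int32_t offset) {
    const uint32_t imm = static_cast<uint32_t>(offset);
    return (((imm >> 12) & 0x1u)  << 31)   // imm[12]
         | (((imm >> 5)  & 0x3Fu) << 25)   // imm[10:5]
         | ((rs2 & 0x1Fu) << 20)
         | ((rs1 & 0x1Fu) << 15)
         | ((funct3 & 0x7u) << 12)
         | (((imm >> 1)  & 0xFu)  << 8)    // imm[4:1]
         | (((imm >> 11) & 0x1u)  << 7)    // imm[11]
         | 0x63u;
}

// Call from a unit test or from main().
inline void check_branch_encodings() {
    assert(encode_b(1, 5, 0, -16) == 0xFE0298E3u);  // bne x5, x0, -16
    assert(encode_b(4, 1, 6, -12) == 0xFE60CAE3u);  // blt x1, x6, -12
    assert(encode_b(0, 4, 0, 8)   == 0x00020463u);  // beq x4, x0, +8
}
```

The function is constexpr, so the same checks could also be written as static_assert and fail at compile time instead of at run time.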
Program as C array:
static const uint32_t fib_prog[] = {
0x00000093u, // addi x1, x0, 0
0x00100113u, // addi x2, x0, 1
0x00a00293u, // addi x5, x0, 10
0x002081b3u, // add x3, x1, x2
0x00010093u, // addi x1, x2, 0 (mv x1, x2)
0x00018113u, // addi x2, x3, 0 (mv x2, x3)
0xfff28293u, // addi x5, x5, -1
0xfe0298e3u, // bne x5, x0, -16
0x00100073u, // ebreak
};
Program 2 — Sum of Array [1, 2, 3, 4, 5] = 15
Sums five words stored in data memory starting at address 0x100.
Algorithm:
array = {1, 2, 3, 4, 5} at mem[0x100..0x113]
sum = 0
ptr = 0x100
end = 0x114 (0x100 + 5*4)
while ptr < end:
sum += mem[ptr]
ptr += 4
result is sum = 15
Register usage:
x1 = ptr (current array element address)
x2 = sum (accumulator)
x6 = end_ptr (0x114 = loop termination address)
x7 = temp (loaded word)
Data memory initialization (done by the testbench before the program runs):
// Store array {1,2,3,4,5} at address 0x100
uint32_t array_data[] = {1, 2, 3, 4, 5};
for (int i = 0; i < 5; i++)
cpu.i_dmem.load_word(0x100 + i*4, array_data[i]);
RISC-V Assembly:
addr hex mnemonic
0x00 0x10000093 addi x1, x0, 0x100 # ptr = 0x100
0x04 0x00000113 addi x2, x0, 0 # sum = 0
0x08 0x11400313 addi x6, x0, 0x114 # end = 0x114
# loop: (addr 0x0c)
0x0c 0x0000a383 lw x7, 0(x1) # temp = mem[ptr]
0x10 0x00710133 add x2, x2, x7 # sum += temp
0x14 0x00408093 addi x1, x1, 4 # ptr += 4
0x18 0xfe60cae3 blt x1, x6, -12 # if ptr < end, goto loop
# target = 0x18 + (-12) = 0x0c ✓
0x1c 0x00100073 ebreak
Encoding note for blt x1, x6, -12:
Offset -12 in 13-bit signed = 1111111110100. B-type encoding:
- rs1=x1 (00001), rs2=x6 (00110), funct3=BLT (100), opcode=BRANCH (1100011)
- imm[12]=1, imm[10:5]=111111, imm[4:1]=1010, imm[11]=1
= 1111 1110 0110 0000 1100 1010 1110 0011
= 0xFE60CAE3
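A useful cross-check on any hand-assembled branch is to decode the word back into its byte offset. This is a small sketch under the standard B-type layout; decode_b_offset and check_branch_offsets are hypothetical names introduced here for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Recover the signed byte offset from an RV32I B-type instruction word.
constexpr int32_t decode_b_offset(uint32_t instr) {
    const uint32_t imm = (((instr >> 31) & 0x1u)  << 12)   // imm[12]
                       | (((instr >> 7)  & 0x1u)  << 11)   // imm[11]
                       | (((instr >> 25) & 0x3Fu) << 5)    // imm[10:5]
                       | (((instr >> 8)  & 0xFu)  << 1);   // imm[4:1]
    // Sign-extend from bit 12 without relying on signed right shifts.
    return static_cast<int32_t>(imm ^ 0x1000u) - 0x1000;
}

// Call from a unit test or from main().
inline void check_branch_offsets() {
    assert(decode_b_offset(0xFE60CAE3u) == -12);  // blt x1, x6, -12
    // Flipping only bit 10 (part of imm[4:1]) silently retargets the branch:
    assert(decode_b_offset(0xFE60CEE3u) == -4);
    assert(decode_b_offset(0x00020463u) == 8);    // beq x4, x0, +8
}
```

The second assertion shows why a round trip is worth the effort: a one-bit slip in imm[4:1] still yields a valid BLT, just with a different target.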
Program as C array:
static const uint32_t sum_prog[] = {
0x10000093u, // addi x1, x0, 0x100 ptr = base address
0x00000113u, // addi x2, x0, 0 sum = 0
0x11400313u, // addi x6, x0, 0x114 end_ptr = base + 5*4
0x0000a383u, // lw x7, 0(x1) load word
0x00710133u, // add x2, x2, x7 accumulate
0x00408093u, // addi x1, x1, 4 advance pointer
0xfe60cae3u, // blt x1, x6, -12 loop back
0x00100073u, // ebreak
};
Expected final state: x2 = 15
Program 3 — Count Non-Zero Bytes in 0x01020304
Word 0x01020304 has 4 non-zero bytes. The program isolates each byte with srli and andi and counts those that are non-zero.
Register usage:
x1 = word to test (0x01020304)
x2 = count (result)
x4 = byte under test
The constant 0x01020304 does not fit in a 12-bit immediate, so it is built with LUI + ADDI. A loop over the four bytes would need the shift amount in a register (srl) plus a loop counter; for a four-byte word it is cleaner to unroll the loop and shift by constants with srli:
RISC-V Assembly:
addr hex mnemonic
0x00 0x010200b7 lui x1, 0x01020 # x1 = 0x01020000
0x04 0x30408093 addi x1, x1, 0x304 # x1 = 0x01020304
0x08 0x00000113 addi x2, x0, 0 # count = 0
# Check byte 0 (bits 7:0)
0x0c 0x0ff0f213 andi x4, x1, 0xff # x4 = byte 0
0x10 0x00020463 beq x4, x0, 8 # skip if zero
0x14 0x00110113 addi x2, x2, 1 # count++
# Check byte 1 (bits 15:8)
0x18 0x0080d213 srli x4, x1, 8 # shift right 8
0x1c 0x0ff27213 andi x4, x4, 0xff # mask byte
0x20 0x00020463 beq x4, x0, 8 # skip if zero
0x24 0x00110113 addi x2, x2, 1 # count++
# Check byte 2 (bits 23:16)
0x28 0x0100d213 srli x4, x1, 16 # shift right 16
0x2c 0x0ff27213 andi x4, x4, 0xff # mask byte
0x30 0x00020463 beq x4, x0, 8 # skip if zero
0x34 0x00110113 addi x2, x2, 1 # count++
# Check byte 3 (bits 31:24)
0x38 0x0180d213 srli x4, x1, 24 # shift right 24
0x3c 0x0ff27213 andi x4, x4, 0xff # mask byte
0x40 0x00020463 beq x4, x0, 8 # skip if zero
0x44 0x00110113 addi x2, x2, 1 # count++
0x48 0x00100073 ebreak
Encoding key instructions:
srli x4, x1, 8 — I-type, opcode=0010011, funct3=101, funct7=0000000:
imm[11:0]=000000001000, rs1=00001, funct3=101, rd=00100, opcode=0010011
= 0000 0000 1000 0000 1101 0010 0001 0011
= 0x0080D213 ✓
andi x4, x4, 0xff — I-type, opcode=0010011, funct3=111:
imm[11:0]=000011111111, rs1=00100, funct3=111, rd=00100, opcode=0010011
= 0000 1111 1111 0010 0111 0010 0001 0011
= 0x0FF27213 ✓
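The I-type fields can be packed programmatically in the same way, which makes it easy to spot a dropped rs1 field. A minimal sketch, assuming the standard I-type layout; encode_i and check_itype_encodings are hypothetical helper names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Pack an RV32I I-type instruction. For the shift-immediate forms (slli,
// srli) the shift amount goes in imm[4:0] with the upper bits zero.
constexpr uint32_t encode_i(uint32_t opcode, uint32_t funct3, uint32_t rd,
                            uint32_t rs1, int32_t imm) {
    return ((static_cast<uint32_t>(imm) & 0xFFFu) << 20)  // imm[11:0]
         | ((rs1 & 0x1Fu) << 15)
         | ((funct3 & 0x7u) << 12)
         | ((rd & 0x1Fu) << 7)
         | (opcode & 0x7Fu);
}

// Call from a unit test or from main().
inline void check_itype_encodings() {
    assert(encode_i(0x13, 5, 4, 1, 8)    == 0x0080D213u);  // srli x4, x1, 8
    assert(encode_i(0x13, 5, 4, 0, 8)    == 0x00805213u);  // srli x4, x0, 8 (rs1 dropped)
    assert(encode_i(0x13, 7, 4, 4, 0xFF) == 0x0FF27213u);  // andi x4, x4, 0xff
    assert(encode_i(0x13, 0, 5, 5, -1)   == 0xFFF28293u);  // addi x5, x5, -1
    assert(encode_i(0x03, 2, 7, 1, 0)    == 0x0000A383u);  // lw   x7, 0(x1)
}
```

The second assertion is the instructive one: omitting rs1 changes only bit 15, and the result is still a legal instruction that shifts x0 and therefore always produces zero.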
Program as C array:
static const uint32_t byte_count_prog[] = {
0x010200b7u, // lui x1, 0x01020
0x30408093u, // addi x1, x1, 0x304 → x1 = 0x01020304
0x00000113u, // addi x2, x0, 0 count = 0
0x0ff0f213u, // andi x4, x1, 0xff byte 0
0x00020463u, // beq x4, x0, +8 skip if zero
0x00110113u, // addi x2, x2, 1 count++
0x0080d213u, // srli x4, x1, 8
0x0ff27213u, // andi x4, x4, 0xff byte 1
0x00020463u, // beq x4, x0, +8
0x00110113u, // addi x2, x2, 1 count++
0x0100d213u, // srli x4, x1, 16
0x0ff27213u, // andi x4, x4, 0xff byte 2
0x00020463u, // beq x4, x0, +8
0x00110113u, // addi x2, x2, 1 count++
0x0180d213u, // srli x4, x1, 24
0x0ff27213u, // andi x4, x4, 0xff byte 3
0x00020463u, // beq x4, x0, +8
0x00110113u, // addi x2, x2, 1 count++
0x00100073u, // ebreak
};
Expected final state: x2 = 4 (all four bytes of 0x01020304 are non-zero)
Program Loader
Two loading interfaces — one for in-memory arrays, one for hex files:
// Load a raw uint32_t array directly into imem
void load_program(imem& mem, const uint32_t* prog, size_t n) {
for (size_t i = 0; i < n; i++)
mem.load_word(i * 4, prog[i]);
}
// Load from a plain hex file (one 32-bit word per line, no 0x prefix).
// objcopy does not emit this layout directly: convert a raw binary
// (objcopy -O binary) to one little-endian word per line first.
// Needs <fstream>, <iostream>, <string>, and <cstdint>.
void load_hex_file(imem& mem, const std::string& path) {
std::ifstream f(path);
if (!f.is_open()) {
std::cerr << "ERROR: cannot open hex file: " << path << "\n";
return;
}
uint32_t addr = 0, word;
while (f >> std::hex >> word) {
mem.load_word(addr, word);
addr += 4;
}
std::cout << "Loaded " << (addr / 4) << " words from " << path << "\n";
}
The hex file format:
00000093
00100113
00a00293
002081b3
...
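Neither Intel HEX (objcopy -O ihex) nor objcopy's Verilog output uses this one-word-per-line layout, so generate it from a raw binary (on a little-endian host) or from the disassembly listing:
riscv32-unknown-elf-objcopy -O binary program.elf program.bin
hexdump -v -e '1/4 "%08x\n"' program.bin > program.hex
# or scrape the encoding column from the disassembly:
riscv32-unknown-elf-objdump -d program.elf | grep "^\s" | awk '{print $2}' > program.hex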
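If you would rather skip the text conversion, a raw binary image can be turned into 32-bit words directly in C++. The sketch below assumes the image is a flat little-endian file such as the output of objcopy -O binary; words_from_binary and check_words_from_binary are hypothetical names, and the resulting vector could be handed to a word-array loader like load_program.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read a raw little-endian image and return it as 32-bit words.
// A trailing partial word is zero-padded in its upper bytes.
inline std::vector<uint32_t> words_from_binary(std::istream& in) {
    std::vector<uint32_t> words;
    char buf[4];
    while (true) {
        in.read(buf, 4);
        const std::streamsize got = in.gcount();
        if (got <= 0) break;
        uint32_t w = 0;
        for (std::streamsize i = 0; i < got; ++i)
            w |= static_cast<uint32_t>(static_cast<unsigned char>(buf[i]))
                 << (8 * i);
        words.push_back(w);
        if (got < 4) break;  // short read: end of file
    }
    return words;
}

// Call from a unit test or from main().
inline void check_words_from_binary() {
    // First two Fibonacci instructions as they appear in a raw binary.
    const char raw[] = {'\x93', '\x00', '\x00', '\x00',
                        '\x13', '\x01', '\x10', '\x00'};
    const std::string bytes(raw, sizeof raw);
    std::istringstream in(bytes);
    const std::vector<uint32_t> w = words_from_binary(in);
    assert(w.size() == 2);
    assert(w[0] == 0x00000093u);  // addi x1, x0, 0
    assert(w[1] == 0x00100113u);  // addi x2, x0, 1
}
```

For a real file, open a std::ifstream with std::ios::binary and pass it in place of the string stream.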
Instruction Trace Monitor
A standalone disassembler function decodes the most common RV32I instructions to human-readable form. This runs as an SC_METHOD sensitive to the debug ports, printing one line per instruction.
// Disassemble a single RV32I instruction word.
// Returns a string like "addi x1, x0, 5" or "add x3, x1, x2".
// Unknown encodings return "unknown".
std::string disassemble(uint32_t instr, uint32_t pc) {
uint32_t opcode = instr & 0x7F;
uint32_t rd = (instr >> 7) & 0x1F;
uint32_t funct3 = (instr >> 12) & 0x07;
uint32_t rs1 = (instr >> 15) & 0x1F;
uint32_t rs2 = (instr >> 20) & 0x1F;
uint32_t funct7 = (instr >> 25) & 0x7F;
// Sign-extended immediates
int32_t imm_i = (int32_t)instr >> 20;
int32_t imm_s = ((int32_t)(instr & 0xFE000000) >> 20)
| ((instr >> 7) & 0x1F);
int32_t imm_b = ((int32_t)(instr & 0x80000000) >> 19)
| ((instr & 0x80) << 4)
| ((instr >> 20) & 0x7E0)
| ((instr >> 7) & 0x1E);
int32_t imm_u = (int32_t)(instr & 0xFFFFF000);
int32_t imm_j = ((int32_t)(instr & 0x80000000) >> 11)
| (instr & 0xFF000)
| ((instr >> 9) & 0x800)
| ((instr >> 20) & 0x7FE);
// Register name helper
auto xr = [](uint32_t r) -> std::string {
return "x" + std::to_string(r);
};
char buf[80];
switch (opcode) {
case 0x33: // R-type
if (funct3==0 && funct7==0x00) { snprintf(buf,sizeof(buf),"add %s,%s,%s", xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
if (funct3==0 && funct7==0x20) { snprintf(buf,sizeof(buf),"sub %s,%s,%s", xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
if (funct3==1 && funct7==0x00) { snprintf(buf,sizeof(buf),"sll %s,%s,%s", xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
if (funct3==2 && funct7==0x00) { snprintf(buf,sizeof(buf),"slt %s,%s,%s", xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
if (funct3==3 && funct7==0x00) { snprintf(buf,sizeof(buf),"sltu %s,%s,%s", xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
if (funct3==4 && funct7==0x00) { snprintf(buf,sizeof(buf),"xor %s,%s,%s", xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
if (funct3==5 && funct7==0x00) { snprintf(buf,sizeof(buf),"srl %s,%s,%s", xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
if (funct3==5 && funct7==0x20) { snprintf(buf,sizeof(buf),"sra %s,%s,%s", xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
if (funct3==6 && funct7==0x00) { snprintf(buf,sizeof(buf),"or %s,%s,%s", xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
if (funct3==7 && funct7==0x00) { snprintf(buf,sizeof(buf),"and %s,%s,%s", xr(rd).c_str(),xr(rs1).c_str(),xr(rs2).c_str()); return buf; }
break;
case 0x13: // I-type arithmetic
if (funct3==0) { snprintf(buf,sizeof(buf),"addi %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
if (funct3==2) { snprintf(buf,sizeof(buf),"slti %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
if (funct3==3) { snprintf(buf,sizeof(buf),"sltiu %s,%s,%d",xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
if (funct3==4) { snprintf(buf,sizeof(buf),"xori %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
if (funct3==6) { snprintf(buf,sizeof(buf),"ori %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
if (funct3==7) { snprintf(buf,sizeof(buf),"andi %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),imm_i); return buf; }
if (funct3==1) { snprintf(buf,sizeof(buf),"slli %s,%s,%d", xr(rd).c_str(),xr(rs1).c_str(),(imm_i&0x1F)); return buf; }
if (funct3==5 && funct7==0x00) { snprintf(buf,sizeof(buf),"srli %s,%s,%d",xr(rd).c_str(),xr(rs1).c_str(),(imm_i&0x1F)); return buf; }
if (funct3==5 && funct7==0x20) { snprintf(buf,sizeof(buf),"srai %s,%s,%d",xr(rd).c_str(),xr(rs1).c_str(),(imm_i&0x1F)); return buf; }
break;
case 0x03: // Loads
if (funct3==0) { snprintf(buf,sizeof(buf),"lb %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf; }
if (funct3==1) { snprintf(buf,sizeof(buf),"lh %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf; }
if (funct3==2) { snprintf(buf,sizeof(buf),"lw %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf; }
if (funct3==4) { snprintf(buf,sizeof(buf),"lbu %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf; }
if (funct3==5) { snprintf(buf,sizeof(buf),"lhu %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf; }
break;
case 0x23: // Stores
if (funct3==0) { snprintf(buf,sizeof(buf),"sb %s,%d(%s)", xr(rs2).c_str(),imm_s,xr(rs1).c_str()); return buf; }
if (funct3==1) { snprintf(buf,sizeof(buf),"sh %s,%d(%s)", xr(rs2).c_str(),imm_s,xr(rs1).c_str()); return buf; }
if (funct3==2) { snprintf(buf,sizeof(buf),"sw %s,%d(%s)", xr(rs2).c_str(),imm_s,xr(rs1).c_str()); return buf; }
break;
case 0x63: // Branches
{ const char* mn[] = {"beq","bne","???","???","blt","bge","bltu","bgeu"};
snprintf(buf,sizeof(buf),"%s %s,%s,%d", mn[funct3],xr(rs1).c_str(),xr(rs2).c_str(),imm_b);
return buf; }
case 0x6F: // JAL
snprintf(buf,sizeof(buf),"jal %s,%d", xr(rd).c_str(),imm_j); return buf;
case 0x67: // JALR
snprintf(buf,sizeof(buf),"jalr %s,%d(%s)", xr(rd).c_str(),imm_i,xr(rs1).c_str()); return buf;
case 0x37: // LUI
snprintf(buf,sizeof(buf),"lui %s,0x%x", xr(rd).c_str(),(uint32_t)imm_u>>12); return buf;
case 0x17: // AUIPC
snprintf(buf,sizeof(buf),"auipc %s,0x%x", xr(rd).c_str(),(uint32_t)imm_u>>12); return buf;
case 0x73: // SYSTEM
if (instr == 0x00100073) return "ebreak";
if (instr == 0x00000073) return "ecall";
break;
}
snprintf(buf, sizeof(buf), "unknown (0x%08x)", instr);
return buf;
}
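The immediate extraction above leans on right-shifting negative signed values, which is implementation-defined before C++20 even though mainstream compilers shift arithmetically. If you prefer not to depend on that, the fields can be extracted unsigned and sign-extended explicitly. This is a sketch of one such approach; sign_extend, imm_i_of, and check_sign_extend are hypothetical names and not part of the disassembler shown here.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low `bits` bits of `value`. Valid for 1 <= bits <= 31.
constexpr int32_t sign_extend(uint32_t value, unsigned bits) {
    const uint32_t sign = 1u << (bits - 1);
    const uint32_t masked = value & ((sign << 1) - 1u);
    // masked ^ sign maps the field onto [0, 2^bits); subtracting the sign
    // weight shifts it to [-2^(bits-1), 2^(bits-1)).
    return static_cast<int32_t>(masked ^ sign) - static_cast<int32_t>(sign);
}

// I-type immediate: instruction bits 31:20.
constexpr int32_t imm_i_of(uint32_t instr) {
    return sign_extend(instr >> 20, 12);
}

// Call from a unit test or from main().
inline void check_sign_extend() {
    assert(sign_extend(0x7FFu, 12) == 2047);
    assert(sign_extend(0x800u, 12) == -2048);
    assert(sign_extend(0xFFFu, 12) == -1);
    assert(imm_i_of(0xFFF28293u) == -1);   // addi x5, x5, -1
    assert(imm_i_of(0x0FF27213u) == 255);  // andi x4, x4, 0xff
}
```

Both styles produce the same values on two's-complement targets; the explicit version simply makes the intent visible and keeps every intermediate in range.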
The trace monitor SC_METHOD wires into the testbench:
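// In tb_cap2 SC_CTOR (clk is an sc_signal<bool>, so use its posedge event):
SC_METHOD(trace_monitor);
sensitive << clk.posedge_event();
dont_initialize(); // skip the automatic call at time zero
void tb_cap2::trace_monitor() {
if (rst.read()) return;
uint32_t pc = dbg_pc.read();
uint32_t instr = dbg_instr.read();
std::string dis = disassemble(instr, pc);
// Fill character and number base are sticky, so set them for every field.
std::cout << "[" << std::dec << std::setfill(' ') << std::setw(3) << cycle_count << "] "
<< "PC=0x" << std::hex << std::setfill('0') << std::setw(8) << pc
<< " " << std::setfill(' ') << std::left << std::setw(30) << dis
<< std::right << std::dec << "\n";
cycle_count++;
}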
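Stream manipulators such as std::setfill and std::hex stay in effect after the statement that sets them, so trace formatting on a shared stream like std::cout can leak into later output. One way around this, sketched below, is to build each line in a local std::ostringstream. The function names format_trace_line and check_format_trace_line are hypothetical and the two-space column gap is an arbitrary choice.

```cpp
#include <cassert>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Build one trace line; all formatting state lives in the local stream and
// disappears when the function returns.
inline std::string format_trace_line(int step, uint32_t pc,
                                     const std::string& dis) {
    std::ostringstream os;
    os << "[" << std::dec << std::setfill(' ') << std::setw(3) << step << "] "
       << "PC=0x" << std::hex << std::setfill('0') << std::setw(8) << pc
       << "  " << dis;
    return os.str();
}

// Call from a unit test or from main().
inline void check_format_trace_line() {
    assert(format_trace_line(1, 0x0, "addi x1,x0,0")
           == "[  1] PC=0x00000000  addi x1,x0,0");
    assert(format_trace_line(54, 0x20, "ebreak")
           == "[ 54] PC=0x00000020  ebreak");
}
```

The caller then writes the returned string with a plain std::cout << line << "\n", leaving the global stream state untouched.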
Sample trace output for Program 1:
[ 1] PC=0x00000000 addi x1,x0,0
[ 2] PC=0x00000004 addi x2,x0,1
[ 3] PC=0x00000008 addi x5,x0,10
[ 4] PC=0x0000000c add x3,x1,x2
[ 5] PC=0x00000010 addi x1,x2,0
[ 6] PC=0x00000014 addi x2,x3,0
[ 7] PC=0x00000018 addi x5,x5,-1
[ 8] PC=0x0000001c bne x5,x0,-16
[ 9] PC=0x0000000c add x3,x1,x2
...
[ 53] PC=0x0000001c bne x5,x0,-16
[ 54] PC=0x00000020 ebreak
The Software Reference Model (SoftCPU)
SoftCPU is a pure C++ ISS — no SystemC, no simulation kernel, no time. It executes one instruction per step() call. The testbench calls it in lockstep with the RTL clock.
// soft_cpu.h
#pragma once
#include <cstdint>
#include <cstring>
#include <string>
class SoftCPU {
public:
static constexpr uint32_t MEM_SIZE = 4096;
SoftCPU() {
std::memset(regs, 0, sizeof(regs));
std::memset(mem, 0, sizeof(mem));
pc = 0;
}
// Load a 32-bit word into instruction/data memory
void load(uint32_t addr, uint32_t word) {
if (addr <= MEM_SIZE - 4) { // avoids wrap-around of addr + 3 near UINT32_MAX
mem[addr+0] = (word >> 0) & 0xFF;
mem[addr+1] = (word >> 8) & 0xFF;
mem[addr+2] = (word >> 16) & 0xFF;
mem[addr+3] = (word >> 24) & 0xFF;
}
}
// Load a byte directly (for data memory preloading)
void load_byte(uint32_t addr, uint8_t byte) {
if (addr < MEM_SIZE) mem[addr] = byte;
}
// Execute one instruction. Returns false on EBREAK (halt).
bool step();
uint32_t get_reg(int i) const { return (i == 0) ? 0 : regs[i]; }
uint32_t get_pc() const { return pc; }
// Reset to initial state
void reset() {
std::memset(regs, 0, sizeof(regs));
pc = 0;
}
private:
uint32_t regs[32];
uint32_t pc;
uint8_t mem[MEM_SIZE];
uint32_t fetch() {
return ((uint32_t)mem[pc+0])
| ((uint32_t)mem[pc+1] << 8)
| ((uint32_t)mem[pc+2] << 16)
| ((uint32_t)mem[pc+3] << 24);
}
uint32_t read_word(uint32_t addr) {
return ((uint32_t)mem[addr+0])
| ((uint32_t)mem[addr+1] << 8)
| ((uint32_t)mem[addr+2] << 16)
| ((uint32_t)mem[addr+3] << 24);
}
uint16_t read_half(uint32_t addr) {
return (uint16_t)(mem[addr] | (mem[addr+1] << 8));
}
void write_word(uint32_t addr, uint32_t val) {
mem[addr+0] = (val >> 0) & 0xFF;
mem[addr+1] = (val >> 8) & 0xFF;
mem[addr+2] = (val >> 16) & 0xFF;
mem[addr+3] = (val >> 24) & 0xFF;
}
void write_half(uint32_t addr, uint16_t val) {
mem[addr+0] = (val >> 0) & 0xFF;
mem[addr+1] = (val >> 8) & 0xFF;
}
void write_reg(int r, uint32_t val) {
if (r != 0) regs[r] = val; // x0 hardwired to 0
}
};
// soft_cpu.cpp
#include "soft_cpu.h"
bool SoftCPU::step() {
uint32_t instr = fetch();
uint32_t opcode = instr & 0x7F;
uint32_t rd = (instr >> 7) & 0x1F;
uint32_t funct3 = (instr >> 12) & 0x07;
uint32_t rs1 = (instr >> 15) & 0x1F;
uint32_t rs2 = (instr >> 20) & 0x1F;
uint32_t funct7 = (instr >> 25) & 0x7F;
// Sign-extended immediates
int32_t imm_i = (int32_t)instr >> 20;
int32_t imm_s = ((int32_t)(instr & 0xFE000000) >> 20)
| (int32_t)((instr >> 7) & 0x1F);
int32_t imm_b = ((int32_t)(instr & 0x80000000) >> 19)
| (int32_t)((instr & 0x80) << 4)
| (int32_t)((instr >> 20) & 0x7E0)
| (int32_t)((instr >> 7) & 0x1E);
int32_t imm_u = (int32_t)(instr & 0xFFFFF000);
int32_t imm_j = ((int32_t)(instr & 0x80000000) >> 11)
| (int32_t)(instr & 0xFF000)
| (int32_t)((instr >> 9) & 0x800)
| (int32_t)((instr >> 20) & 0x7FE);
uint32_t r1 = get_reg(rs1);
uint32_t r2 = get_reg(rs2);
uint32_t next_pc = pc + 4;
switch (opcode) {
case 0x33: // R-type
switch (funct3) {
case 0: write_reg(rd, (funct7==0x20) ? (r1 - r2) : (r1 + r2)); break;
case 1: write_reg(rd, r1 << (r2 & 0x1F)); break;
case 2: write_reg(rd, (int32_t)r1 < (int32_t)r2 ? 1 : 0); break;
case 3: write_reg(rd, r1 < r2 ? 1 : 0); break;
case 4: write_reg(rd, r1 ^ r2); break;
case 5: write_reg(rd, (funct7==0x20)
? (uint32_t)((int32_t)r1 >> (r2 & 0x1F))
: (r1 >> (r2 & 0x1F))); break;
case 6: write_reg(rd, r1 | r2); break;
case 7: write_reg(rd, r1 & r2); break;
}
break;
case 0x13: // I-type arithmetic
switch (funct3) {
case 0: write_reg(rd, r1 + (uint32_t)imm_i); break; // ADDI
case 1: write_reg(rd, r1 << (imm_i & 0x1F)); break; // SLLI
case 2: write_reg(rd, (int32_t)r1 < imm_i ? 1 : 0); break; // SLTI
case 3: write_reg(rd, r1 < (uint32_t)imm_i ? 1 : 0); break; // SLTIU
case 4: write_reg(rd, r1 ^ (uint32_t)imm_i); break; // XORI
case 5: write_reg(rd, (funct7==0x20)
? (uint32_t)((int32_t)r1 >> (imm_i & 0x1F))
: (r1 >> (imm_i & 0x1F))); break; // SRLI/SRAI
case 6: write_reg(rd, r1 | (uint32_t)imm_i); break; // ORI
case 7: write_reg(rd, r1 & (uint32_t)imm_i); break; // ANDI
}
break;
case 0x03: // Loads
{ uint32_t ea = r1 + (uint32_t)imm_i;
switch (funct3) {
case 0: write_reg(rd, (uint32_t)(int32_t)(int8_t) mem[ea]); break; // LB
case 1: write_reg(rd, (uint32_t)(int32_t)(int16_t)read_half(ea)); break; // LH
case 2: write_reg(rd, read_word(ea)); break; // LW
case 4: write_reg(rd, (uint32_t)mem[ea]); break; // LBU
case 5: write_reg(rd, (uint32_t)read_half(ea)); break; // LHU
} }
break;
case 0x23: // Stores
{ uint32_t ea = r1 + (uint32_t)imm_s;
switch (funct3) {
case 0: mem[ea] = r2 & 0xFF; break; // SB
case 1: write_half(ea, (uint16_t)(r2 & 0xFFFF)); break; // SH
case 2: write_word(ea, r2); break; // SW
} }
break;
case 0x63: // Branches
{ bool taken = false;
switch (funct3) {
case 0: taken = (r1 == r2); break; // BEQ
case 1: taken = (r1 != r2); break; // BNE
case 4: taken = ((int32_t)r1 < (int32_t)r2); break; // BLT
case 5: taken = ((int32_t)r1 >= (int32_t)r2); break; // BGE
case 6: taken = (r1 < r2); break; // BLTU
case 7: taken = (r1 >= r2); break; // BGEU
}
if (taken) next_pc = pc + (uint32_t)imm_b; }
break;
case 0x6F: // JAL
write_reg(rd, pc + 4);
next_pc = pc + (uint32_t)imm_j;
break;
case 0x67: // JALR
write_reg(rd, pc + 4);
next_pc = (r1 + (uint32_t)imm_i) & ~1u;
break;
case 0x37: // LUI
write_reg(rd, (uint32_t)imm_u);
break;
case 0x17: // AUIPC
write_reg(rd, pc + (uint32_t)imm_u);
break;
case 0x73: // SYSTEM
if (instr == 0x00100073) return false; // EBREAK — halt
break;
}
pc = next_pc;
return true; // continue
}
Self-Checking Testbench
The testbench structure: load the program into both models, step both in lockstep, compare all registers, report pass/fail.
// tb/tb_cap2.cpp
#include <systemc.h>
#include "rv32i_cpu.h"
#include "soft_cpu.h"
#include <iostream>
#include <iomanip>
#include <sstream>
// Include the disassembler (from trace monitor section above)
std::string disassemble(uint32_t instr, uint32_t pc);
// ─────────────────────────────────────────────────────────────────────
// Test programs (defined above in "The Three Programs" section)
// ─────────────────────────────────────────────────────────────────────
static const uint32_t fib_prog[] = {
0x00000093u, 0x00100113u, 0x00a00293u,
0x002081b3u, 0x00010093u, 0x00018113u,
0xfff28293u, 0xfe0298e3u, 0x00100073u,
};
static const uint32_t sum_prog[] = {
0x10000093u, 0x00000113u, 0x11400313u,
0x0000a383u, 0x00710133u, 0x00408093u,
0xfe60cae3u, 0x00100073u,
};
static const uint32_t byte_count_prog[] = {
0x010200b7u, 0x30408093u, 0x00000113u,
0x0ff0f213u, 0x00020463u, 0x00110113u,
0x0080d213u, 0x0ff27213u, 0x00020463u, 0x00110113u,
0x0100d213u, 0x0ff27213u, 0x00020463u, 0x00110113u,
0x0180d213u, 0x0ff27213u, 0x00020463u, 0x00110113u,
0x00100073u,
};
// ─────────────────────────────────────────────────────────────────────
SC_MODULE(tb_cap2) {
sc_signal<bool> clk;
sc_signal<bool> rst;
sc_signal<sc_uint<32>> dbg_pc;
sc_signal<sc_uint<32>> dbg_instr;
sc_signal<bool> dbg_halt;
rv32i_cpu u_cpu;
SoftCPU soft;
int cycle_count = 0;
int fail_count = 0;
SC_CTOR(tb_cap2) : u_cpu("u_cpu") {
u_cpu.clk(clk);
u_cpu.rst(rst);
u_cpu.dbg_pc(dbg_pc);
u_cpu.dbg_instr(dbg_instr);
u_cpu.dbg_halt(dbg_halt);
SC_THREAD(run_all_tests);
}
// ── Clock helper ─────────────────────────────────────────────────
void tick() {
wait(10, SC_NS); clk.write(true);
wait(10, SC_NS); clk.write(false);
cycle_count++;
}
// ── Apply reset ──────────────────────────────────────────────────
void do_reset() {
rst.write(true);
clk.write(false);
tick(); tick();
rst.write(false);
cycle_count = 0;
}
// ── Load program into both models ─────────────────────────────────
void load_prog(const uint32_t* prog, size_t n) {
for (size_t i = 0; i < n; i++) {
u_cpu.i_imem.load_word(i * 4, prog[i]);
soft.load(i * 4, prog[i]);
}
}
// ── Compare all 32 registers ─────────────────────────────────────
bool compare_regs(uint32_t pc) {
bool ok = true;
for (int i = 1; i < 32; i++) {
uint32_t rtl_val = u_cpu.i_rf.read_reg(i);
uint32_t soft_val = soft.get_reg(i);
if (rtl_val != soft_val) {
std::cerr << " MISMATCH at PC=0x" << std::hex << pc
<< " x" << std::dec << i
<< ": RTL=0x" << std::hex << rtl_val
<< " REF=0x" << soft_val << "\n";
ok = false;
fail_count++;
}
}
return ok;
}
// ── Run one program to completion ─────────────────────────────────
bool run_program(const char* name,
const uint32_t* prog, size_t n,
int check_reg, uint32_t expected)
{
std::cout << "\n=== Program: " << name << " ===\n";
// Reset both models
do_reset();
soft.reset();
// Load program
load_prog(prog, n);
int steps = 0;
const int MAX_STEPS = 200;
// fail_count is cumulative, so remember where this program started.
const int fails_before = fail_count;
while (!dbg_halt.read() && steps < MAX_STEPS) {
// Advance RTL one clock
wait(10, SC_NS); clk.write(true);
wait(1, SC_NS); // combinational settling time
uint32_t pc = dbg_pc.read();
uint32_t instr = dbg_instr.read();
// Print trace (fill character and base are sticky, so set them per field)
std::cout << "[" << std::dec << std::setfill(' ') << std::setw(3) << (steps+1) << "] "
<< "PC=0x" << std::hex << std::setfill('0') << std::setw(8) << pc
<< " " << std::setfill(' ') << std::left << std::setw(28) << disassemble(instr, pc)
<< std::right << std::dec << "\n";
wait(9, SC_NS); clk.write(false);
// Advance software model
bool running = soft.step();
// Compare state
compare_regs(pc);
steps++;
if (!running) break;
}
if (steps >= MAX_STEPS) {
std::cerr << "TIMEOUT: " << name << " did not halt after "
<< MAX_STEPS << " steps\n";
return false;
}
// Final register check: the value must match and this program must not
// have produced any new lockstep mismatches.
uint32_t got = u_cpu.i_rf.read_reg(check_reg);
bool pass = (got == expected) && (fail_count == fails_before);
if (pass) {
std::cout << "PASS: x" << std::dec << check_reg
<< " = " << expected << "\n";
} else {
std::cerr << "FAIL: x" << std::dec << check_reg
<< " = 0x" << std::hex << got
<< " (expected 0x" << expected << ")\n";
}
return pass;
}
// ── Main test sequence ────────────────────────────────────────────
void run_all_tests() {
int total_pass = 0;
int total_tests = 3;
if (run_program("Fibonacci(10)",
fib_prog, sizeof(fib_prog)/sizeof(fib_prog[0]),
1, 55)) // fib(10) ends up in x1 (a); x2 (b) holds fib(11) = 89
total_pass++;
// Program 2 reads its input from data memory, so preload the array into
// both models before the run. This assumes do_reset() leaves data memory
// contents intact; SoftCPU::reset() clears only the registers and the PC.
uint32_t arr[] = {1, 2, 3, 4, 5};
for (int i = 0; i < 5; i++) {
u_cpu.i_dmem.load_word(0x100 + i*4, arr[i]);
soft.load(0x100 + i*4, arr[i]);
}
if (run_program("Array Sum [1..5]",
sum_prog, sizeof(sum_prog)/sizeof(sum_prog[0]),
2, 15))
total_pass++;
if (run_program("Non-zero Byte Count (0x01020304)",
byte_count_prog,
sizeof(byte_count_prog)/sizeof(byte_count_prog[0]),
2, 4))
total_pass++;
std::cout << "\n";
std::cout << "========================================\n";
if (total_pass == total_tests)
std::cout << "All " << total_tests
<< " programs passed. Single-cycle RV32I CPU verified!\n";
else
std::cout << total_pass << " / " << total_tests
<< " programs passed.\n";
std::cout << "========================================\n";
sc_stop();
}
};
int sc_main(int, char**) {
tb_cap2 tb("tb");
sc_start();
return 0;
}
Build and Run
# CMakeLists.txt
cmake_minimum_required(VERSION 3.18) # REQUIRED in find_library/find_path needs 3.18+
project(cap2_sim CXX)
set(CMAKE_CXX_STANDARD 14)
if(NOT DEFINED SYSTEMC_HOME)
set(SYSTEMC_HOME $ENV{SYSTEMC_HOME})
endif()
find_library(SYSTEMC_LIB systemc
PATHS ${SYSTEMC_HOME}/lib ${SYSTEMC_HOME}/lib-linux64
${SYSTEMC_HOME}/lib-macosx64 REQUIRED)
find_path(SYSTEMC_INCLUDE systemc.h
PATHS ${SYSTEMC_HOME}/include REQUIRED)
include_directories(${SYSTEMC_INCLUDE} include)
set(SOURCES
src/alu.cpp
src/reg_file.cpp
src/decoder.cpp
src/pc.cpp
src/imem.cpp
src/dmem.cpp
src/rv32i_cpu.cpp
src/soft_cpu.cpp
)
add_executable(cap2_sim ${SOURCES} tb/tb_cap2.cpp)
target_link_libraries(cap2_sim ${SYSTEMC_LIB})
mkdir -p build && cd build
cmake .. -DSYSTEMC_HOME=$SYSTEMC_HOME
make -j4
./cap2_sim
Expected output:
=== Program: Fibonacci(10) ===
[ 1] PC=0x00000000 addi x1,x0,0
[ 2] PC=0x00000004 addi x2,x0,1
[ 3] PC=0x00000008 addi x5,x0,10
[ 4] PC=0x0000000c add x3,x1,x2
...
[ 54] PC=0x00000020 ebreak
PASS: x1 = 55
=== Program: Array Sum [1..5] ===
[ 1] PC=0x00000000 addi x1,x0,256
...
PASS: x2 = 15
=== Program: Non-zero Byte Count (0x01020304) ===
[ 1] PC=0x00000000 lui x1,0x1020
...
PASS: x2 = 4
========================================
All 3 programs passed. Single-cycle RV32I CPU verified!
========================================
DV Insight: Lockstep Verification
The lockstep pattern in this testbench — RTL model and software reference model running in parallel, registers compared after every instruction — is the gold standard of CPU validation.
ARM's CPU validation flow follows the same pattern. The software reference is an instruction-accurate C++ model in the style of the ARM Fast Models: fast enough to execute many millions of instructions per second and faithful to the architectural specification. The RTL under test runs in a simulator such as VCS, Xcelium, or Questa. A co-simulation layer advances both models and compares architectural state (the general-purpose registers, the program counter, the condition flags, the system registers) after every instruction retirement. For a core designed to retire about one instruction per cycle at 1 GHz, that amounts to roughly 10^9 comparisons per second of simulated time.
Intel uses the same approach. Their ISS is called SDE (Software Development Emulator) and is publicly available. For x86 validation, SDE handles approximately 2,000 instruction variants, AVX-512 vector registers, and multiple privilege levels. The comparison logic is proportionally more complex, but the loop is identical: fetch one instruction, execute in RTL, execute in SDE, compare, continue.
RISC-V International maintains the official riscv-arch-test suite: several hundred assembly tests covering the instructions and encoding edge cases defined in the specification. Production cores such as Western Digital SweRV, lowRISC Ibex, and OpenHW CVA6 are expected to pass the suite before tapeout. The infrastructure is our testbench scaled up: load a program, run it to completion, and compare the result with a reference. The main difference is that riscv-arch-test compares a memory signature written by the test against the signature produced by a reference model, rather than checking a single register.
Our testbench is a miniature version of this methodology. The three programs in this post test 12 distinct instruction types across 89+ execution steps. That is not sufficient for tapeout — but it is the correct framework. Adding the riscv-arch-test suite to a production version of this testbench is a matter of loading the test programs and reading their pass/fail signatures.
Simulation Semantics: sc_start, sc_stop, and the Simulation Time Axis
A critical difference between SystemC and SV for testbench writers is how simulation time is controlled.
In SystemVerilog
Time advances automatically as the simulator executes. The testbench is a passive participant:
initial begin
rst = 1;
#20ns; // Simulator advances 20ns while this process is blocked
rst = 0;
#100ns;
$finish; // Request end of simulation
end
// Equivalent "advance time" constructs:
#10ns; // Wait for 10ns
@(posedge clk); // Wait for event
repeat(5) @(posedge clk); // Wait for 5 clock edges
In SystemC
Time advances only while sc_start() is running; inside that call, processes move time forward by calling wait(). Your sc_main function has explicit control:
// Run for 100ns, then return to sc_main for C++ code
sc_start(100, SC_NS);
std::cout << "After 100ns: PC = 0x" << std::hex << cpu.dbg_pc.read() << std::dec << "\n"; // std::hex so the value matches the 0x prefix
// Run for another 50ns
sc_start(50, SC_NS);
// Run until sc_stop() is called from within a process
sc_start(); // No arguments: run indefinitely until sc_stop()
This is architecturally significant: sc_main can interleave simulation with C++ post-processing between sc_start calls. You can run 100 cycles, check state, modify the DUT (if accessible), and run another 100 cycles. SV initial blocks cannot do this — once simulation ends, there is no re-entry.
sc_stop() Behavior
sc_stop() is called from inside a process (typically an SC_THREAD or SC_CTHREAD). It sets a flag requesting simulation to end. Important: it does not halt immediately. In the default stop mode (SC_STOP_FINISH_DELTA), the current delta cycle runs to completion, and code after sc_stop() in the same process still executes:
void tb::run() {
// ... run program ...
std::cout << "Requesting stop\n";
sc_stop();
std::cout << "This STILL executes\n"; // Prints — sc_stop is not an abort
// Process returns normally; simulation ends after this delta cycle
}
Compare to SV:
$finish; // Also does not abort immediately — current time step completes
// Other processes at the same simulation time may still run
Simulation Time Stamp
// SystemC:
sc_time now = sc_time_stamp();
std::cout << "Time: " << now.to_double() << " x " << sc_get_time_resolution() << "\n"; // to_double() counts time-resolution units
// Or: std::cout << now << "\n"; // Streams as "100 ns"
// SV equivalents:
// $time → integer (in simulation time units)
// $realtime → real (in simulation time units)
// $stime → 32-bit integer time
ISS vs. RTL Model — What Is the Difference?
Both SoftCPU and rv32i_cpu execute RISC-V programs. They are fundamentally different models:
| Property | SoftCPU (ISS) | rv32i_cpu (RTL model) |
|---|---|---|
| Simulation time | No simulation time — functional only | Cycle-accurate — every state change has a timestamp |
| Signals | None — just C++ variables | sc_signal<T> — observable in waveforms |
| Hardware structure | No structure — flat ISA decode/execute | Modules, ports, processes — models physical hierarchy |
| Clock awareness | No clock — step() is a function call | Clock-driven — state changes at clock edges |
| Execution speed | Millions of instructions per second (pure software) | Thousands per second (simulation kernel overhead) |
| Use case | Golden reference, fast pre-silicon software dev | RTL verification, timing analysis, structural coverage |
The lockstep methodology works because the ISS is the specification by definition. When SoftCPU::step() and rv32i_cpu (after one clock edge) disagree about a register value, the bug is in the RTL — always. The ISS is correct because it directly implements the ISA specification in software.
Lockstep loop (one iteration per instruction):
1. Wait for RTL clock edge to settle (wait 1 delta after posedge)
2. Capture RTL register state: u_cpu.i_rf.read_reg(1..31)
3. Capture RTL PC: dbg_pc.read()
4. Call soft.step() — advances software model one instruction
5. Capture ISS register state: soft.get_reg(1..31)
6. Assert: RTL state == ISS state for all 31 registers
7. If mismatch: report FAIL with PC, register name, both values
8. If dbg_halt: exit loop → test complete
This is the methodology used by ARM (ARM Fast Models), Intel (SDE), and RISC-V (riscv-arch-test + Spike). The scale differs — Spike handles millions of instructions per second against RTL running at nanosecond granularity — but the comparison loop is identical.
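The loop above can be sketched in plain C++, independent of the SystemC kernel. This is a minimal illustration under stated assumptions, not the testbench from this post: `ArchState`, `StepFn`, `LockstepResult`, and `run_lockstep` are hypothetical names, and each model is reduced to a callback that advances one instruction and reports whether it halted. In the real testbench, the DUT callback corresponds to "clock edge, settle, capture registers" and the reference callback to `soft.step()`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>

// Architectural state compared after every instruction.
struct ArchState {
    std::array<uint32_t, 32> regs{};
    uint32_t pc = 0;
};

// Returns the index of the first mismatching register in x1..x31,
// or 0 if all match (x0 is hardwired to zero and never compared).
inline int compare_state(const ArchState& dut, const ArchState& ref) {
    for (int i = 1; i < 32; ++i) {
        if (dut.regs[i] != ref.regs[i]) return i;
    }
    return 0;
}

// A step function advances one model by one instruction and
// returns true once that model has halted.
using StepFn = std::function<bool(ArchState&)>;

struct LockstepResult {
    int steps = 0;         // instructions executed before stopping
    bool halted = false;   // true if a model reported halt with no mismatch
    int bad_reg = 0;       // first mismatching register, 0 if none
    uint32_t pc = 0;       // reference PC at the mismatch
    uint32_t dut_val = 0;  // DUT value of bad_reg
    uint32_t ref_val = 0;  // reference value of bad_reg
};

inline LockstepResult run_lockstep(const StepFn& dut_step,
                                   const StepFn& ref_step,
                                   int max_steps) {
    assert(max_steps > 0);
    ArchState dut, ref;
    LockstepResult result;
    while (result.steps < max_steps) {
        const bool dut_halt = dut_step(dut);  // stands in for: clock edge + capture
        const bool ref_halt = ref_step(ref);  // stands in for: soft.step()
        ++result.steps;
        const int bad = compare_state(dut, ref);
        if (bad != 0) {                       // FAIL: record details and stop
            result.bad_reg = bad;
            result.pc = ref.pc;
            result.dut_val = dut.regs[bad];
            result.ref_val = ref.regs[bad];
            return result;
        }
        if (dut_halt || ref_halt) {           // program finished cleanly
            result.halted = true;
            return result;
        }
    }
    return result;  // max_steps reached without halt: treat as a timeout
}
```

Passing the same callback for both models should finish with `halted == true` and `bad_reg == 0`; a DUT callback that corrupts one register should stop on the first instruction where the values diverge.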
SV DPI vs. SystemC Co-Simulation
When implementing this methodology in SV, the reference model lives in a separate C++ shared library and is accessed via DPI:
// SV testbench with C++ reference model via DPI
import "DPI-C" context function int soft_cpu_step(
input int instr,
output int new_pc,
inout int regs[32]
);
// In the testbench clock process:
@(posedge clk);
err = soft_cpu_step(instr, new_pc, ref_regs);
foreach (ref_regs[i]) begin
assert(dut.rf.regs[i] == ref_regs[i]) else $error(...);
end
This requires:
1. A C header file declaring the DPI interface
2. A compiled shared library loaded by the simulator
3. Cross-language data type mapping (SV arrays ↔ C arrays)
4. Simulator-specific DPI compilation flags
In SystemC, SoftCPU is just a C++ class in the same .cpp file:
// SystemC testbench — no DPI, no shared library, no interface files
SoftCPU soft;
// ...
soft.step();
// Compare directly:
for (int i = 1; i < 32; i++)
assert(u_cpu.i_rf.read_reg(i) == soft.get_reg(i));
No DPI overhead. No linker flags. No type mapping. This simplicity is one of SystemC's key advantages for CPU verification: the DUT and the checker are both native C++ objects in the same executable.
Common Pitfalls for SV Engineers
Pitfall 1: sc_stop() does not abort immediately — code after it still runs
void tb::run() {
// ...
sc_stop(); // Sets stop flag
post_process(); // STILL EXECUTES — this is a common surprise
}
If post_process() performs register comparisons, they may see stale values (the simulation has not actually ended yet at this point). To be safe: do all final checks before calling sc_stop(), or in a separate end_of_simulation() callback.
Pitfall 2: SoftCPU state must be reset between test programs
The run_program() function calls do_reset() for the RTL model (applies rst for 2 cycles). But SoftCPU has no clock — it must be reset separately:
// In run_program():
do_reset(); // RTL model reset via clock + rst signal
soft.reset(); // ISS reset — separate call required
Missing soft.reset() between programs causes the second program to start with register state left over from the first. This produces false mismatches on the first instruction if any register used by program 2 was modified by program 1.
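The stale-state problem is easy to reproduce with a toy reference model. The sketch below is a hypothetical, heavily reduced stand-in for SoftCPU (the class name `MiniSoftCPU` and its `load()` and `run()` helpers are assumptions, and it only understands ADDI and EBREAK). It shows that loading a new program does not clear the registers; only `reset()` does.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical reduced reference model: ADDI and EBREAK only.
class MiniSoftCPU {
public:
    // Clears registers, PC, and the halt flag. Instruction memory is kept.
    void reset() {
        regs_.fill(0);
        pc_ = 0;
        halted_ = false;
    }

    // Replaces instruction memory. Deliberately does NOT touch CPU state.
    void load(const std::vector<uint32_t>& program) { imem_ = program; }

    // Executes one instruction; returns false once the CPU has halted.
    bool step() {
        if (halted_ || pc_ / 4 >= imem_.size()) {
            halted_ = true;
            return false;
        }
        const uint32_t instr = imem_[pc_ / 4];
        if (instr == 0x00100073u) {  // EBREAK
            halted_ = true;
            return false;
        }
        const uint32_t opcode = instr & 0x7Fu;
        const uint32_t funct3 = (instr >> 12) & 0x7u;
        if (opcode == 0x13u && funct3 == 0u) {  // ADDI rd, rs1, imm
            const uint32_t rd = (instr >> 7) & 0x1Fu;
            const uint32_t rs1 = (instr >> 15) & 0x1Fu;
            // Arithmetic right shift sign-extends the 12-bit immediate
            // (guaranteed since C++20, universal in practice before that).
            const int32_t imm = static_cast<int32_t>(instr) >> 20;
            if (rd != 0) regs_[rd] = regs_[rs1] + static_cast<uint32_t>(imm);
        }
        pc_ += 4;
        return true;
    }

    void run() { while (step()) {} }

    uint32_t get_reg(unsigned i) const {
        assert(i < 32);
        return regs_[i];
    }

private:
    std::array<uint32_t, 32> regs_{};
    std::vector<uint32_t> imem_;
    uint32_t pc_ = 0;
    bool halted_ = false;
};
```

After running `addi x1, x0, 5` (0x00500093) followed by EBREAK, x1 holds 5. Loading a second program leaves x1 at 5 and the halt flag set, so the second program would not even start; calling `reset()` before `run()` restores the clean state the RTL model gets from `do_reset()`.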
Pitfall 3: Comparison must happen AFTER all delta cycles settle
The lockstep comparison must fire after the RTL state has fully updated — after all SC_METHOD processes triggered by the clock edge have evaluated. This is why the testbench uses wait(1, SC_NS) after the clock edge:
wait(10, SC_NS); clk.write(true); // Rising edge
wait(1, SC_NS); // Let delta cycles settle
// NOW read dbg_pc and compare with ISS — state is stable
uint32_t pc = dbg_pc.read();
Comparing immediately after the clock write (before the 1ns wait) reads state from the previous cycle. The 1ns gap ensures at least one real-time step has elapsed, allowing all delta cycles to complete.
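The reason a read right after a write returns the old value is the evaluate/update split in signal semantics. The following is a plain C++ analogy, not the SystemC `sc_signal` class: `ToySignal` is a hypothetical name, and `update()` is called by hand here, whereas the SystemC kernel performs it between delta cycles.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ analogy for a two-phase signal: writes are deferred until update().
template <typename T>
class ToySignal {
public:
    explicit ToySignal(T init = T{}) : cur_(init), next_(init) {}

    // Evaluate phase: request a new value. Readers do not see it yet.
    void write(const T& v) { next_ = v; }

    // Always returns the last committed value.
    const T& read() const { return cur_; }

    // Update phase: commit the pending value. Returns true if it changed,
    // which is what would trigger sensitive processes in a real kernel.
    bool update() {
        const bool changed = !(cur_ == next_);
        cur_ = next_;
        return changed;
    }

private:
    T cur_;
    T next_;
};
```

With `ToySignal<uint32_t> pc(0)`, calling `pc.write(4)` and then `pc.read()` still yields 0; the new value becomes visible only after `pc.update()`. A clock edge in the RTL model can trigger several such rounds in sequence, which is why the testbench waits before sampling.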
Pitfall 4: EBREAK (0x00100073) requires a special decoder case
EBREAK is opcode SYSTEM (0x73) with immediate 0x001. If the decoder does not handle SYSTEM opcodes and falls through to a default NOP path, the CPU ignores EBREAK and runs past the end of the program — fetching from the NOP-filled remainder of instruction memory, looping forever.
The symptom: simulation hits the steps >= MAX_STEPS timeout. The fix: the decoder must set sig_halt = true when it sees instr == 0x00100073, and branch_logic must freeze the PC (next = pc) when sig_halt is asserted.
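As a rough sketch of that fix in plain C++ (the function names `is_ebreak` and `next_pc` are hypothetical, and the sequential `pc + 4` path ignores branches and jumps for brevity), the decoder check and the PC freeze could look like this:

```cpp
#include <cassert>
#include <cstdint>

// True only for EBREAK: SYSTEM opcode (0x73), rd = 0, funct3 = 0,
// rs1 = 0, and 12-bit immediate 0x001. The full word is 0x00100073.
inline bool is_ebreak(uint32_t instr) {
    const uint32_t opcode = instr & 0x7Fu;
    const uint32_t rd = (instr >> 7) & 0x1Fu;
    const uint32_t funct3 = (instr >> 12) & 0x7u;
    const uint32_t rs1 = (instr >> 15) & 0x1Fu;
    const uint32_t imm12 = instr >> 20;
    return opcode == 0x73u && rd == 0u && funct3 == 0u && rs1 == 0u &&
           imm12 == 0x001u;
}

// Next-PC selection for the non-branch case: hold the PC while halted.
inline uint32_t next_pc(uint32_t pc, bool halt) {
    assert(pc % 4 == 0);  // RV32I without the C extension: word-aligned PC
    return halt ? pc : pc + 4;
}
```

Checking the individual fields is equivalent to comparing against 0x00100073, but it makes clear why ECALL (0x00000073), which shares the SYSTEM opcode, must not set the halt signal.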
Pitfall 5: Program hex file byte order — little-endian, 32-bit words
The hex file loaded by load_hex_file() and $readmemh contains one 32-bit instruction word per line. RISC-V is little-endian, meaning the 4 bytes of each word are stored in memory LSB-first. The hex file itself lists the word value (not individual bytes), so no byte-swapping is needed at load time:
# Hex file line: 00500093
# Meaning: instruction word = 0x00500093 = addi x1, x0, 5
# Loaded into mem[0..3] as: 93 00 50 00 (little-endian)
If you use objcopy to generate a raw binary and then display it with xxd, you see the little-endian bytes in memory order (93 00 50 00). If you use objdump -d, you see the instruction word value (00500093). The hex file uses the word value format — one 32-bit hex value per instruction. Do not interpret each hex digit as a byte when loading.
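To make the word-versus-byte distinction concrete, here is a minimal loader sketch. It is not the post's `load_hex_file()`: the names `load_hex_words` and `read_word` are hypothetical, the input is any `std::istream` with one hex word per line, and memory is modeled as a flat byte vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads one 32-bit hex word per line and stores it little-endian
// (least significant byte at the lowest address). Returns the word count.
inline std::size_t load_hex_words(std::istream& in, std::vector<uint8_t>& mem) {
    std::size_t addr = 0;
    std::string line;
    while (std::getline(in, line)) {
        // Skip blank lines (including lines holding only whitespace or '\r').
        if (line.find_first_not_of(" \t\r") == std::string::npos) continue;
        // Parse the whole line as ONE word value, not as separate bytes.
        const uint32_t word =
            static_cast<uint32_t>(std::stoul(line, nullptr, 16));
        assert(addr + 4 <= mem.size());
        for (std::size_t b = 0; b < 4; ++b) {
            mem[addr + b] = static_cast<uint8_t>(word >> (8 * b));
        }
        addr += 4;
    }
    return addr / 4;
}

// Reassembles a 32-bit word from four little-endian bytes.
inline uint32_t read_word(const std::vector<uint8_t>& mem, std::size_t addr) {
    assert(addr + 4 <= mem.size());
    uint32_t word = 0;
    for (std::size_t b = 0; b < 4; ++b) {
        word |= static_cast<uint32_t>(mem[addr + b]) << (8 * b);
    }
    return word;
}
```

Feeding it the line `00500093` places the bytes 93 00 50 00 at addresses 0 through 3, and `read_word(mem, 0)` returns 0x00500093 again, matching what the fetch stage should see.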
Section 2 Summary
You have built a complete, verified, single-cycle RV32I CPU in SystemC.
| Module | File | Approx. Lines | What it verifies |
|---|---|---|---|
| ALU | alu.h / alu.cpp | ~80 | 12 RV32I operations, flags |
| reg_file | reg_file.h / reg_file.cpp | ~60 | 32 registers, x0 hardwired to 0 |
| decoder | decoder.h / decoder.cpp | ~120 | 42 instructions → control signals |
| pc + imem | pc.h, imem.h / .cpp | ~80 | Fetch, branch, jump, reset |
| dmem | dmem.h / dmem.cpp | ~100 | 8 access widths, sign extension |
| rv32i_cpu | rv32i_cpu.h / .cpp | ~150 | Integration, mux logic, JALR |
| SoftCPU | soft_cpu.h / .cpp | ~120 | Golden reference model |
| Total | | ~710 | 3 real programs, lockstep verification |
Every line of the hardware model is synthesizable (the testbench and SoftCPU are not) and corresponds directly to RTL you would write in SystemVerilog. SystemC gives you something SystemVerilog testbenches rarely offer: a clean integration between the hardware model and an arbitrary C++ reference model in the same simulation kernel.
What's Next
Section 3 begins with Post 14: TLM 2.0 Concepts.
You will replace the cycle-accurate dmem with a TLM 2.0 socket interface. Instead of a clocked module that drives byte-enable signals on specific clock edges, the CPU will issue transactions — structured objects describing what operation, to what address, with what data — and the memory will respond without either side caring about clock cycles.
This is the abstraction level at which modern SoC verification is done. An AXI bus fabric, a DDR controller, a PCIe endpoint — these are all TLM 2.0 initiators and targets in a production SystemC model. After Post 14, our CPU model will be able to connect to any of them.
The single-cycle CPU you verified in this post is the foundation. Everything that follows — pipelining in Section 4, cache models in Section 5, full SoC integration in Section 6 — is built on top of it.