11. SystemC Tutorial - Data Memory Interface
Introduction
The instruction memory from Post 10 is a ROM: read-only, word-granular, simple. Data memory is different in almost every dimension. Programs write to it. They read bytes, halfwords, and words from arbitrary addresses. The values they read back depend on previous writes, on alignment, and on whether the read is signed or unsigned. A byte at address 0x1001 has a different meaning depending on whether the instruction is LB (load byte, sign-extend) or LBU (load byte, zero-extend) — the bit pattern in memory is identical; the value that lands in the destination register is not.
This sounds straightforward, but the details have produced real silicon bugs and real OS-level workarounds throughout computing history.
The DEC Alpha AXP architecture (1992) prohibited unaligned memory access in hardware entirely. A halfword load from an odd address raised an alignment fault. Operating systems such as Digital UNIX handled these faults in software, emulating the unaligned access with multiple aligned loads and a merge operation, which added a measurable performance penalty to any program with unaligned data. The DEC Alpha Architecture Reference Manual (Second Edition, 1995, section 4.1) states plainly: "Memory data references must be naturally aligned."
Early MIPS processors (R2000/R3000, 1985-1988) took a different approach: they added LWL/LWR (load word left/right) and SWL/SWR (store word left/right) instructions specifically to handle unaligned word access, at the cost of requiring two instructions per unaligned operation. The MIPS R4000 Microprocessor User's Manual documents these instructions as the standard idiom for unaligned access in C compiler output.
RISC-V takes the modern position: the base ISA (RV32I) expects naturally aligned accesses (word accesses aligned to 4 bytes, halfword accesses aligned to 2 bytes), but leaves the handling of misaligned data accesses to the execution environment. Per the RISC-V ISA Manual Volume I (version 2.2, section 2.6), an implementation may either complete a misaligned load or store in hardware or raise a load/store address-misaligned exception; software cannot assume misaligned accesses are fast, only that they are eventually resolved one way or the other.
Our dmem module models a correctly-aligned byte-addressable memory with the full RV32I load/store instruction set. It will raise a simulation warning (not a hardware exception, which comes later with the control unit) on alignment violations. The focus here is on the data path: byte enables, sign extension, and the funct3 encoding that selects between eight different load/store widths.
Prerequisites
- Post 10 — PC and IMEM (byte-addressed arrays, hex loading)
- Post 9 — Instruction Decoder (funct3 encoding for load/store)
- Code for this post: GitHub — section2/post11
RV32I Load/Store Instructions
RISC-V encodes memory access width and signedness entirely in the funct3 field of the instruction. The decoder (Post 9) extracts this field; dmem consumes it.
Load Instructions (I-type)
| funct3 | Mnemonic | Width | Sign | Example | Result when accessed bytes are all 0xFF |
|---|---|---|---|---|---|
| 000 | LB | Byte (8-bit) | Signed | lb x1, 0(x2) | 0xFFFFFFFF (-1) |
| 001 | LH | Half (16-bit) | Signed | lh x1, 0(x2) | 0xFFFFFFFF (-1) |
| 010 | LW | Word (32-bit) | n/a | lw x1, 0(x2) | 0xFFFFFFFF |
| 100 | LBU | Byte (8-bit) | Unsigned | lbu x1, 0(x2) | 0x000000FF (255) |
| 101 | LHU | Half (16-bit) | Unsigned | lhu x1, 0(x2) | 0x0000FFFF (65535) |
Store Instructions (S-type)
| funct3 | Mnemonic | Width | Byte enables | Writes |
|---|---|---|---|---|
| 000 | SB | Byte | be = 0001 << addr[1:0] | 1 byte at byte offset |
| 001 | SH | Half | be = 0011 << addr[1:0] | 2 bytes (aligned) |
| 010 | SW | Word | be = 1111 | All 4 bytes |
The funct3 encoding is the same field for both loads and stores — the mem_write/mem_read control signals from the decoder select the operation direction. Width and signedness are fully specified by funct3.
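The byte-enable column above reduces to a shift of a base mask by the byte offset. As a quick illustration in plain C++ (independent of the SystemC module; the helper name byte_enable is invented for this sketch):

```cpp
#include <cstdint>

// Byte-enable mask for a store: bit i set means byte lane i of the aligned
// word (addr[1:0] == i) is written. funct3: 0 = SB, 1 = SH, 2 = SW.
uint8_t byte_enable(uint32_t funct3, uint32_t addr) {
    uint32_t offset = addr & 0x3;                  // byte offset within the word
    switch (funct3) {
        case 0:  return (uint8_t)(0x1u << offset); // SB: one lane
        case 1:  return (uint8_t)(0x3u << offset); // SH: two lanes (offset 0 or 2)
        case 2:  return 0xF;                       // SW: all four lanes
        default: return 0x0;                       // unknown width: write nothing
    }
}
```

For example, an SB to address 0x103 lands in lane 3 (mask 1000), while an SH to 0x102 covers the upper two lanes (mask 1100).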
Memory Layout: Byte-Addressable, Little-Endian
RISC-V is little-endian by default: the base ISA has always specified little-endian data memory, with big-endian operation defined only as an optional extension in later versions of the privileged specification. This means the least significant byte of a multi-byte value is stored at the lowest address.
Memory address: 0x100 0x101 0x102 0x103
┌──────┬──────┬──────┬──────┐
Content: │ 0x78 │ 0x56 │ 0x34 │ 0x12 │
└──────┴──────┴──────┴──────┘
LW at 0x100 → 0x12345678 (little-endian word)
LH at 0x100 → 0x00005678 (lower 2 bytes; MSB=0, so sign extension leaves it positive)
LH at 0x102 → 0x00001234 (upper 2 bytes)
LB at 0x100 → 0x00000078 (byte, sign-extended: positive)
LBU at 0x101 → 0x00000056 (byte, zero-extended)
A 4-byte array uint8_t mem[4] holding the word 0x12345678 contains:
mem[0] = 0x78; // LSB — address 0x100
mem[1] = 0x56;
mem[2] = 0x34;
mem[3] = 0x12; // MSB — address 0x103
This means assembling a word from bytes for a little-endian read is:
uint32_t word = mem[0] | (mem[1] << 8) | (mem[2] << 16) | (mem[3] << 24);
And writing a word back requires splitting bytes in the same order:
mem[0] = (word >> 0) & 0xFF;
mem[1] = (word >> 8) & 0xFF;
mem[2] = (word >> 16) & 0xFF;
mem[3] = (word >> 24) & 0xFF;
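The assembly and split fragments above are inverses of each other. A minimal standalone round-trip check in plain C++ (function names le_pack/le_unpack are invented for this sketch):

```cpp
#include <cstdint>

// Assemble a little-endian word from four bytes (LSB at index 0).
uint32_t le_pack(const uint8_t m[4]) {
    return (uint32_t)m[0] | ((uint32_t)m[1] << 8)
         | ((uint32_t)m[2] << 16) | ((uint32_t)m[3] << 24);
}

// Split a word back into four little-endian bytes (LSB first).
void le_unpack(uint32_t word, uint8_t m[4]) {
    for (int i = 0; i < 4; i++)
        m[i] = (uint8_t)((word >> (8 * i)) & 0xFF);
}
```

Unpacking 0x12345678 yields m[0]=0x78 and m[3]=0x12, and packing those bytes reproduces the original word.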
Alignment Rules and Why They Matter
Natural alignment means an N-byte access must occur at an address that is a multiple of N:
| Access | Width | Address constraint | Check |
|---|---|---|---|
| LB/SB | 1 byte | Any address | (addr & 0) == 0 (always true) |
| LH/SH | 2 bytes | Even address | (addr & 1) == 0 |
| LW/SW | 4 bytes | 4-byte aligned | (addr & 3) == 0 |
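The Check column reduces to a single mask test, since the widths are powers of two. A generic sketch in plain C++ (the helper name is invented here):

```cpp
#include <cstdint>

// Natural alignment: an N-byte access (N a power of two) must start at a
// multiple of N, i.e. the low log2(N) address bits must all be zero.
bool naturally_aligned(uint32_t addr, uint32_t width_bytes) {
    return (addr & (width_bytes - 1)) == 0;
}
```

For width 1 the mask is 0, so every address passes, matching the LB/SB row.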
Why does this matter in hardware?
Modern memory systems transfer data at bus granularity — a 32-bit bus transfers 4 bytes per cycle. If you request a word load from address 0x1001 (unaligned), the word spans two bus transfers: bytes 1-3 from the first transfer and byte 0 (of the next word) from the second. The hardware must issue two bus transactions, merge the result, and return it. For cache-line based systems (e.g., ARM Cortex-A series with 64-byte cache lines), an unaligned access that crosses a cache line boundary requires two cache-line fills — a 2x bandwidth penalty.
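To make the cost concrete, here is a small plain-C++ calculation of how many aligned beats a 32-bit bus needs to satisfy an access of a given width (a sketch; the function name is invented):

```cpp
#include <cstdint>

// Number of 4-byte-aligned bus transfers needed to cover [addr, addr+width).
// A naturally aligned access needs one; an access that straddles a word
// boundary needs two.
uint32_t bus_transfers(uint32_t addr, uint32_t width) {
    uint32_t first_word = addr >> 2;               // word index of first byte
    uint32_t last_word  = (addr + width - 1) >> 2; // word index of last byte
    return last_word - first_word + 1;
}
```

A word load from 0x1001 spans word indices 0x400 and 0x401, so it costs two transfers; the same load from 0x1000 costs one.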
The AMBA AHB specification (IHI0033B, section 3.3) handles this with the HSIZE signal: HSIZE[2:0] encodes the transfer size (byte, halfword, word, etc.) and the HADDR must be aligned to that size. An AHB slave is permitted to return an error response for misaligned addresses. Our dmem mirrors this concept: funct3 encodes the width (like HSIZE), and we check alignment.
SystemC Language Reference
| Construct | Syntax | SV / Verilog Equivalent | Key Difference |
|---|---|---|---|
| Byte-addressable memory | uint8_t mem[MEM_BYTES] | logic [7:0] mem [0:MEM_BYTES-1] | Both are byte arrays; SV unpacked dimension syntax vs. C array |
| Combinational load | SC_METHOD(read_proc); sensitive << addr << mem_read << funct3 | always_comb with implicit sensitivity | SV auto-senses all reads; SystemC needs explicit list |
| Synchronous store | SC_CTHREAD(write_proc, clk.pos()) | always_ff @(posedge clk) | Identical semantics; different syntax |
| Sign extend byte (LB) | (sc_int<32>)(sc_int<8>)raw_byte | {{24{raw_byte[7]}}, raw_byte} | C++ cast chain vs. explicit bit replication |
| Sign extend halfword (LH) | (sc_int<32>)(sc_int<16>)raw_half | {{16{raw_half[15]}}, raw_half} | Same pattern, different width |
| Zero extend (LBU/LHU) | (uint32_t)mem[a] — no cast chain needed | {24'b0, raw_byte} or 32'(raw_byte) (SV) | C++ uint8_t → uint32_t widening always zero-extends |
| Little-endian word assembly | mem[a]\|(mem[a+1]<<8)\|(mem[a+2]<<16)\|(mem[a+3]<<24) | {mem[a+3],mem[a+2],mem[a+1],mem[a]} (SV concatenation) | SV concatenation puts the MSB on the left; the C++ shift-or builds up from the LSB |
| funct3 switch | switch ((uint32_t)funct3.read()) | unique case (funct3) or casez (SV) | unique case adds parallel-case checking; C++ switch has no direct equivalent |
| Byte enable generation | sc_uint<4> be = 0x1; be <<= addr.range(1,0) | be = 4'b0001 << addr[1:0] | Identical logic; addr.range() extracts a sub-field in SystemC |
| Access width check | check_alignment(a, 4, "LW") — C++ function | assert((addr & 2'b11) == 0) or SVA property | SystemC runtime C++ call vs. SV runtime assertion |
Translation Table
| Concept | SystemVerilog | C++ | SystemC |
|---|---|---|---|
| Byte array | logic [7:0] mem [0:MEM_SIZE-1] | uint8_t mem[MEM_SIZE] | uint8_t mem[MEM_SIZE] |
| Combinational read | always_comb | function call | SC_METHOD(read_proc) + sensitive << addr << ... |
| Synchronous write | always_ff @(posedge clk) | n/a | SC_CTHREAD(write_proc, clk.pos()) |
| Sign extend byte | {{24{data[7]}}, data[7:0]} | (int32_t)(int8_t)byte | (sc_int<32>)(sc_int<8>)byte |
| Sign extend halfword | {{16{data[15]}}, data[15:0]} | (int32_t)(int16_t)half | (sc_int<32>)(sc_int<16>)half |
| funct3 decode | unique case (funct3) / casez | switch (funct3) | switch ((uint32_t)funct3.read()) |
| Byte enable for SH | be = 4'b0011 << addr[1:0] | be = 0x3 << (addr & 3) | sc_uint<4> be = 0x3; be <<= addr.range(1,0) |
| Little-endian byte assembly | {mem[a+3],mem[a+2],mem[a+1],mem[a]} | mem[a]\|(mem[a+1]<<8)\|... | same as C++ |
The C++/SystemC sign-extension trick ((int32_t)(int8_t)byte) works because the C++ standard guarantees that converting a smaller signed integer type to a larger signed integer type preserves the value, which on two's-complement targets means sign extension. It is the most readable form and typically compiles to a single sign-extension instruction on modern compilers.
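A standalone demonstration of the cast chain versus plain unsigned widening, using nothing beyond standard C++ (helper names invented for this sketch):

```cpp
#include <cstdint>

// LB semantics: interpret the byte as signed, widen (sign-extends), then
// reinterpret the 32-bit pattern as unsigned for the register value.
uint32_t lb_extend(uint8_t raw)  { return (uint32_t)(int32_t)(int8_t)raw; }

// LBU semantics: unsigned widening always zero-extends.
uint32_t lbu_extend(uint8_t raw) { return (uint32_t)raw; }
```

The same bit pattern 0x80 yields 0xFFFFFF80 through lb_extend and 0x00000080 through lbu_extend, which is exactly the LB/LBU distinction the ISA encodes in funct3 bit 2.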
dmem Module: Full Implementation
// File: dmem.h
// Data Memory module for RV32I single-cycle CPU
//
// Supports all 8 RV32I load/store widths via funct3 encoding:
// Loads: LB(000) LH(001) LW(010) LBU(100) LHU(101)
// Stores: SB(000) SH(001) SW(010)
//
// Memory model: flat uint8_t array, byte-addressable, little-endian
// Read: combinational (SC_METHOD)
// Write: synchronous, rising clock edge (SC_CTHREAD)
//
// SystemC 2.3.x compatible
#ifndef DMEM_H
#define DMEM_H
#include <systemc.h>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <iomanip>
#include <stdexcept>
// funct3 encoding for loads and stores (matches RV32I ISA)
enum dmem_funct3_t : uint32_t {
FUNCT3_LB = 0b000, // Load Byte, signed
FUNCT3_LH = 0b001, // Load Halfword, signed
FUNCT3_LW = 0b010, // Load Word
FUNCT3_LBU = 0b100, // Load Byte, unsigned
FUNCT3_LHU = 0b101, // Load Halfword, unsigned
// Stores share the same encoding; direction controlled by mem_write
FUNCT3_SB = 0b000,
FUNCT3_SH = 0b001,
FUNCT3_SW = 0b010,
};
static const int DMEM_DEFAULT_BYTES = 4096; // 4 KB
SC_MODULE(dmem) {
// ── Clock ───────────────────────────────────────────────────────────────
sc_in<bool> clk;
// ── Address and data ────────────────────────────────────────────────────
sc_in<sc_uint<32>> addr; // Byte address
sc_in<sc_uint<32>> wr_data; // Write data (stores)
sc_out<sc_uint<32>> rd_data; // Read data (loads)
// ── Control ─────────────────────────────────────────────────────────────
sc_in<bool> mem_read; // High: perform a load this cycle
sc_in<bool> mem_write; // High: perform a store this cycle
sc_in<sc_uint<3>> funct3; // Width + signedness selector
// ── Memory array ────────────────────────────────────────────────────────
static const int MEM_BYTES = DMEM_DEFAULT_BYTES;
uint8_t mem[MEM_BYTES];
// ── Processes ───────────────────────────────────────────────────────────
void read_proc(); // SC_METHOD: combinational load
void write_proc(); // SC_CTHREAD: synchronous store
SC_CTOR(dmem) {
memset(mem, 0, sizeof(mem));
SC_METHOD(read_proc);
sensitive << addr << mem_read << funct3;
// Note: we also need to re-evaluate after a write completes.
// In a real implementation you would add a "mem_updated" event.
// For single-cycle operation (read and write never overlap), this
// sensitivity is sufficient.
SC_CTHREAD(write_proc, clk.pos());
}
// ── Alignment checker (reports warning, does not abort simulation) ───────
bool check_alignment(uint32_t a, uint32_t width_bytes, const char* op) const {
if ((a & (width_bytes - 1)) != 0) {
std::cerr << "[dmem] ALIGNMENT WARNING: " << op
<< " at 0x" << std::hex << std::setw(8) << std::setfill('0') << a
<< " not " << std::dec << width_bytes << "-byte aligned" << std::endl;
return false;
}
return true;
}
// ── Bounds checker ──────────────────────────────────────────────────────
bool check_bounds(uint32_t a, uint32_t width_bytes, const char* op) const {
if ((a + width_bytes) > (uint32_t)MEM_BYTES) {
std::cerr << "[dmem] BOUNDS ERROR: " << op
<< " at 0x" << std::hex << a
<< " width=" << std::dec << width_bytes
<< " exceeds MEM_BYTES=" << MEM_BYTES << std::endl;
return false;
}
return true;
}
};
#endif // DMEM_H
// File: dmem.cpp
#include "dmem.h"
// ─── Combinational load ───────────────────────────────────────────────────────
//
// Executes whenever addr, mem_read, or funct3 changes.
// If mem_read is low, output is undefined (set to 0 for cleanliness).
// Sign extension uses C++ signed-cast idiom which is both readable and
// generates optimal machine code (no manual bit masking needed).
//
void dmem::read_proc() {
if (!mem_read.read()) {
rd_data.write(0);
return;
}
uint32_t a = (uint32_t)addr.read();
uint32_t f = (uint32_t)funct3.read();
uint32_t result = 0;
switch (f) {
// ── LB: load 1 byte, sign-extend to 32 bits ───────────────────────
case FUNCT3_LB: {
if (!check_bounds(a, 1, "LB")) { rd_data.write(0); return; }
uint8_t raw = mem[a];
// C++ guarantees: (int8_t)raw sign-extends to int32_t when widened
result = (uint32_t)(int32_t)(int8_t)raw;
break;
}
// ── LH: load 2 bytes, sign-extend to 32 bits (little-endian) ──────
case FUNCT3_LH: {
check_alignment(a, 2, "LH");
if (!check_bounds(a, 2, "LH")) { rd_data.write(0); return; }
uint16_t raw = (uint16_t)mem[a] | ((uint16_t)mem[a+1] << 8);
result = (uint32_t)(int32_t)(int16_t)raw;
break;
}
// ── LW: load 4 bytes (little-endian), no sign extension needed ────
case FUNCT3_LW: {
check_alignment(a, 4, "LW");
if (!check_bounds(a, 4, "LW")) { rd_data.write(0); return; }
result = (uint32_t)mem[a]
| ((uint32_t)mem[a+1] << 8)
| ((uint32_t)mem[a+2] << 16)
| ((uint32_t)mem[a+3] << 24);
break;
}
// ── LBU: load 1 byte, zero-extend (no cast needed — uint8_t is always < 2^8) ─
case FUNCT3_LBU: {
if (!check_bounds(a, 1, "LBU")) { rd_data.write(0); return; }
result = (uint32_t)mem[a]; // zero-extend: high bytes are 0
break;
}
// ── LHU: load 2 bytes, zero-extend (little-endian) ────────────────
case FUNCT3_LHU: {
check_alignment(a, 2, "LHU");
if (!check_bounds(a, 2, "LHU")) { rd_data.write(0); return; }
result = (uint32_t)mem[a] | ((uint32_t)mem[a+1] << 8);
break;
}
default:
std::cerr << "[dmem] UNKNOWN funct3=" << f << " on read" << std::endl;
result = 0;
break;
}
rd_data.write(result);
}
// ─── Synchronous store ────────────────────────────────────────────────────────
//
// Executes on every rising clock edge. If mem_write is high, writes
// the appropriate bytes determined by funct3 and the byte offset within
// the aligned word (addr[1:0]).
//
// Byte enable logic:
// SB: write 1 byte at addr[1:0] within the word
// SH: write 2 bytes at addr[1:0] (must be 0 or 2)
// SW: write all 4 bytes
//
void dmem::write_proc() {
while (true) {
wait(); // Rising clock edge
if (!mem_write.read()) continue;
uint32_t a = (uint32_t)addr.read();
uint32_t data = (uint32_t)wr_data.read();
uint32_t f = (uint32_t)funct3.read();
switch (f) {
// ── SB: store byte ────────────────────────────────────────────
case FUNCT3_SB: {
if (!check_bounds(a, 1, "SB")) break;
mem[a] = (uint8_t)(data & 0xFF);
break;
}
// ── SH: store halfword (2 bytes, little-endian) ───────────────
case FUNCT3_SH: {
check_alignment(a, 2, "SH");
if (!check_bounds(a, 2, "SH")) break;
mem[a] = (uint8_t)(data & 0x00FF);
mem[a+1] = (uint8_t)((data >> 8) & 0xFF);
break;
}
// ── SW: store word (4 bytes, little-endian) ───────────────────
case FUNCT3_SW: {
check_alignment(a, 4, "SW");
if (!check_bounds(a, 4, "SW")) break;
mem[a] = (uint8_t)(data & 0xFF);
mem[a+1] = (uint8_t)((data >> 8) & 0xFF);
mem[a+2] = (uint8_t)((data >> 16) & 0xFF);
mem[a+3] = (uint8_t)((data >> 24) & 0xFF);
break;
}
default:
std::cerr << "[dmem] UNKNOWN funct3=" << f << " on write" << std::endl;
break;
}
}
}
Module Ports at a Glance
graph TD
subgraph "dmem module"
direction TB
R["read_proc\n(SC_METHOD)\ncombinational"]
W["write_proc\n(SC_CTHREAD)\nrising edge"]
MEM["uint8_t mem[4096]\nbyte array"]
R -->|reads| MEM
W -->|writes| MEM
end
CLK["clk"] --> W
ADDR["addr[31:0]\nbyte address"] --> R
ADDR --> W
MR["mem_read"] --> R
MW["mem_write"] --> W
F3["funct3[2:0]\n000=B 001=H 010=W\n100=BU 101=HU"] --> R
F3 --> W
WD["wr_data[31:0]"] --> W
RD["rd_data[31:0]"] --> OUT["Load result\n(register file\nwrite-back)"]
R --> RD
style R fill:#10b981,color:#fff
style W fill:#06b6d4,color:#fff
style MEM fill:#f59e0b,color:#000
The ARM AMBA Parallel
In production SoCs, data memory is accessed through a bus protocol. ARM's AMBA AHB (Advanced High-performance Bus) is the industry standard for on-chip memory-mapped peripherals. The conceptual mapping to our dmem interface:
| AMBA AHB Signal | Our dmem Signal | Description |
|---|---|---|
| HADDR[31:0] | addr[31:0] | Byte address |
| HWDATA[31:0] | wr_data[31:0] | Write data |
| HRDATA[31:0] | rd_data[31:0] | Read data |
| HWRITE | mem_write | 1=write, 0=read |
| HSIZE[2:0] | funct3[2:0] | Transfer size (byte/half/word) |
| HTRANS[1:0] | mem_read \| mem_write | Nonsequential transfer |
| HREADY | n/a (always ready) | Slave ready signal |
| HRESP | alignment warning | Error response |
Our dmem behaves like an AMBA AHB slave with zero wait states: HREADY is always high, HRESP is always OKAY (we just print warnings instead of signalling an error response). The ARM AMBA 3 AHB-Lite Protocol Specification (IHI0033B) documents this interface in section 3.
The RISC-V SoC ecosystem uses TileLink instead of AMBA (documented in the SiFive TileLink Specification, version 1.8). TileLink adds message-passing handshake channels (A, D, and optional B, C, E), but the fundamental concept — address, data, size, and read/write control — maps to the same dmem interface at the leaf level.
Testbench
The testbench covers eight scenarios matching the three dimensions of the coverage model (access type × alignment × data pattern):
// File: tb_dmem.cpp
// Testbench for dmem module
// Covers all 8 RV32I load/store types, alignment variants, sign extension,
// and the unsigned vs signed distinction for LB/LBU and LH/LHU
#include <systemc.h>
#include <iostream>
#include <iomanip>
#include <cassert>
#include <cstring>
#include "dmem.h"
static int test_pass = 0;
static int test_fail = 0;
void check(const char* name, uint32_t got, uint32_t expected) {
if (got == expected) {
std::cout << " PASS " << name
<< " = 0x" << std::hex << std::setw(8) << std::setfill('0') << got
<< std::dec << std::endl;
test_pass++;
} else {
std::cout << " FAIL " << name
<< " got=0x" << std::hex << std::setw(8) << std::setfill('0') << got
<< " exp=0x" << std::setw(8) << std::setfill('0') << expected
<< std::dec << std::endl;
test_fail++;
}
}
SC_MODULE(tb_dmem) {
sc_clock clk{"clk", 10, SC_NS};
sc_signal<sc_uint<32>> addr{"addr"};
sc_signal<sc_uint<32>> wr_data{"wr_data"};
sc_signal<sc_uint<32>> rd_data{"rd_data"};
sc_signal<bool> mem_read{"mem_read"};
sc_signal<bool> mem_write{"mem_write"};
sc_signal<sc_uint<3>> funct3{"funct3"};
dmem dut{"dut"};
SC_CTOR(tb_dmem) {
dut.clk(clk);
dut.addr(addr);
dut.wr_data(wr_data);
dut.rd_data(rd_data);
dut.mem_read(mem_read);
dut.mem_write(mem_write);
dut.funct3(funct3);
SC_THREAD(test_proc);
}
// ── Utility: drive a store, tick clock, then idle ───────────────────────
void do_store(uint32_t a, uint32_t data, uint32_t f3) {
addr.write(a);
wr_data.write(data);
funct3.write(f3);
mem_write.write(true);
mem_read.write(false);
wait(clk.posedge_event());
wait(SC_ZERO_TIME);
mem_write.write(false);
}
// ── Utility: drive a load, wait for combinational output ────────────────
uint32_t do_load(uint32_t a, uint32_t f3) {
addr.write(a);
funct3.write(f3);
mem_read.write(true);
mem_write.write(false);
wait(SC_ZERO_TIME); // Let SC_METHOD evaluate
uint32_t result = (uint32_t)rd_data.read();
mem_read.write(false);
return result;
}
void test_proc();
};
void tb_dmem::test_proc() {
mem_read.write(false);
mem_write.write(false);
addr.write(0);
wr_data.write(0);
funct3.write(0);
wait(clk.posedge_event());
// =========================================================================
// TEST 1: SW + LW — store and load a full word
// =========================================================================
std::cout << "\n=== TEST 1: SW / LW (word store + load) ===\n";
do_store(0x00, 0xDEADBEEFu, FUNCT3_SW);
check("LW at 0x00 after SW", do_load(0x00, FUNCT3_LW), 0xDEADBEEFu);
do_store(0x04, 0x12345678u, FUNCT3_SW);
check("LW at 0x04", do_load(0x04, FUNCT3_LW), 0x12345678u);
// =========================================================================
// TEST 2: SH + LH — halfword store, signed load
// =========================================================================
std::cout << "\n=== TEST 2: SH / LH (halfword, signed) ===\n";
// Store 0x1234 at 0x100 (even, aligned), then 0x8765 at 0x102
do_store(0x100, 0x00001234u, FUNCT3_SH);
do_store(0x102, 0x00008765u, FUNCT3_SH);
check("LH at 0x100 (positive)", do_load(0x100, FUNCT3_LH), 0x00001234u);
// 0x8765 has MSB set → sign-extend should give 0xFFFF8765
check("LH at 0x102 (negative, sign-extended)", do_load(0x102, FUNCT3_LH), 0xFFFF8765u);
// =========================================================================
// TEST 3: LHU — halfword zero-extend (same raw bytes, different result)
// =========================================================================
std::cout << "\n=== TEST 3: LHU (halfword, unsigned) ===\n";
// mem[0x102..0x103] still holds 0x8765 from TEST 2
check("LHU at 0x102 (zero-extend)", do_load(0x102, FUNCT3_LHU), 0x00008765u);
check("LHU at 0x100 (zero-extend)", do_load(0x100, FUNCT3_LHU), 0x00001234u);
// =========================================================================
// TEST 4: SB + LB — byte store, signed load
// =========================================================================
std::cout << "\n=== TEST 4: SB / LB (byte, signed) ===\n";
do_store(0x200, 0x000000ABu, FUNCT3_SB); // 0xAB = 171 unsigned = -85 signed
do_store(0x201, 0x0000007Fu, FUNCT3_SB); // 0x7F = 127 (positive max for int8)
do_store(0x202, 0x000000FFu, FUNCT3_SB); // 0xFF = -1 signed
do_store(0x203, 0x00000000u, FUNCT3_SB); // 0x00 = 0
// LB sign-extends: 0xAB = 1010_1011 → MSB=1 → sign-extend → 0xFFFFFFAB
check("LB at 0x200 (0xAB sign-ext)", do_load(0x200, FUNCT3_LB), 0xFFFFFFABu);
check("LB at 0x201 (0x7F positive)", do_load(0x201, FUNCT3_LB), 0x0000007Fu);
check("LB at 0x202 (0xFF = -1)", do_load(0x202, FUNCT3_LB), 0xFFFFFFFFu);
check("LB at 0x203 (0x00 = 0)", do_load(0x203, FUNCT3_LB), 0x00000000u);
// =========================================================================
// TEST 5: LBU — same bytes, zero-extend
// =========================================================================
std::cout << "\n=== TEST 5: LBU (byte, unsigned) ===\n";
// Same memory from TEST 4
check("LBU at 0x200 (0xAB unsigned)", do_load(0x200, FUNCT3_LBU), 0x000000ABu);
check("LBU at 0x202 (0xFF unsigned)", do_load(0x202, FUNCT3_LBU), 0x000000FFu);
// =========================================================================
// TEST 6: Byte stores at all 4 positions within a word
// =========================================================================
std::cout << "\n=== TEST 6: SB at all 4 byte positions ===\n";
// Clear a word first
do_store(0x300, 0x00000000u, FUNCT3_SW);
// Write individual bytes: address = base + byte_offset
do_store(0x300, 0xAAu, FUNCT3_SB); // byte 0 (addr[1:0]=00)
do_store(0x301, 0xBBu, FUNCT3_SB); // byte 1
do_store(0x302, 0xCCu, FUNCT3_SB); // byte 2
do_store(0x303, 0xDDu, FUNCT3_SB); // byte 3
// Full word should now be 0xDDCCBBAA (little-endian: AA at lowest addr)
check("LW after 4x SB (little-endian)", do_load(0x300, FUNCT3_LW), 0xDDCCBBAAu);
// And individual bytes
check("LBU byte 0 = 0xAA", do_load(0x300, FUNCT3_LBU), 0xAAu);
check("LBU byte 1 = 0xBB", do_load(0x301, FUNCT3_LBU), 0xBBu);
check("LBU byte 2 = 0xCC", do_load(0x302, FUNCT3_LBU), 0xCCu);
check("LBU byte 3 = 0xDD", do_load(0x303, FUNCT3_LBU), 0xDDu);
// =========================================================================
// TEST 7: Overwrite byte within word (partial write)
// =========================================================================
std::cout << "\n=== TEST 7: Partial byte overwrite ===\n";
do_store(0x400, 0x11223344u, FUNCT3_SW);
// Overwrite only byte 2 (address 0x402)
do_store(0x402, 0xFFu, FUNCT3_SB);
// Word should now be 0x11FF3344
check("LW after SB overwrite of byte 2", do_load(0x400, FUNCT3_LW), 0x11FF3344u);
// =========================================================================
// TEST 8: SH at both halfword positions within a word
// =========================================================================
std::cout << "\n=== TEST 8: SH at both halfword positions ===\n";
do_store(0x500, 0x00000000u, FUNCT3_SW);
do_store(0x500, 0xABCDu, FUNCT3_SH); // low halfword
do_store(0x502, 0x1234u, FUNCT3_SH); // high halfword
// Word = 0x1234ABCD
check("LW after SH low + SH high", do_load(0x500, FUNCT3_LW), 0x1234ABCDu);
check("LH low half", do_load(0x500, FUNCT3_LH), 0xFFFFABCDu); // sign-ext: MSB=1
check("LH high half", do_load(0x502, FUNCT3_LH), 0x00001234u); // positive
// =========================================================================
// TEST 9: Sign extension corner cases
// =========================================================================
std::cout << "\n=== TEST 9: Sign extension corner cases ===\n";
// Most-negative byte: 0x80 = -128 signed
do_store(0x600, 0x80u, FUNCT3_SB);
check("LB 0x80 = -128 (0xFFFFFF80)", do_load(0x600, FUNCT3_LB), 0xFFFFFF80u);
check("LBU 0x80 = 128 (0x00000080)", do_load(0x600, FUNCT3_LBU), 0x00000080u);
// Most-negative halfword: 0x8000 = -32768 signed
do_store(0x604, 0x8000u, FUNCT3_SH);
check("LH 0x8000 = -32768 (0xFFFF8000)", do_load(0x604, FUNCT3_LH), 0xFFFF8000u);
check("LHU 0x8000 = 32768 (0x00008000)", do_load(0x604, FUNCT3_LHU), 0x00008000u);
// 0x7F = +127 (max positive byte) — must NOT sign-extend
do_store(0x608, 0x7Fu, FUNCT3_SB);
check("LB 0x7F = +127 (0x0000007F)", do_load(0x608, FUNCT3_LB), 0x0000007Fu);
// =========================================================================
// TEST 10: Alignment warning verification (should print warning, not crash)
// =========================================================================
std::cout << "\n=== TEST 10: Alignment warnings (expect stderr warnings) ===\n";
// LH at odd address — alignment violation, produces warning
do_store(0x700, 0xAABBCCDDu, FUNCT3_SW);
std::cout << " (Expect alignment warning for LH at 0x701):\n";
do_load(0x701, FUNCT3_LH); // Misaligned — warning expected
std::cout << " [continuing after misaligned access — no crash]\n";
test_pass++; // Counts as pass if we reach here without crashing
// =========================================================================
// Summary
// =========================================================================
std::cout << "\n========================================\n";
std::cout << " PASS: " << test_pass << " FAIL: " << test_fail << std::endl;
std::cout << "========================================\n";
if (test_fail == 0) {
std::cout << " ALL TESTS PASSED — dmem verified\n";
} else {
std::cout << " FAILURES DETECTED — review output\n";
}
sc_stop();
}
int sc_main(int argc, char* argv[]) {
tb_dmem tb{"tb"};
sc_start();
return 0;
}
CMake Build
# CMakeLists.txt — Post 11: Data Memory
cmake_minimum_required(VERSION 3.16)
project(post11_dmem CXX)
set(CMAKE_CXX_STANDARD 17)
find_package(SystemCLanguage QUIET)
if(NOT SystemCLanguage_FOUND)
set(SYSTEMC_HOME $ENV{SYSTEMC_HOME})
include_directories(${SYSTEMC_HOME}/include)
link_directories(${SYSTEMC_HOME}/lib-linux64)
set(SC_LIBS systemc)
endif()
add_executable(tb_dmem
dmem.cpp
tb_dmem.cpp
)
target_link_libraries(tb_dmem ${SC_LIBS})
Build and run:
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug
make -j4
./tb_dmem
Expected output excerpt:
=== TEST 4: SB / LB (byte, signed) ===
PASS LB at 0x200 (0xAB sign-ext) = 0xffffffab
PASS LB at 0x201 (0x7F positive) = 0x0000007f
PASS LB at 0x202 (0xFF = -1) = 0xffffffff
PASS LB at 0x203 (0x00 = 0) = 0x00000000
=== TEST 5: LBU (byte, unsigned) ===
PASS LBU at 0x200 (0xAB unsigned) = 0x000000ab
PASS LBU at 0x202 (0xFF unsigned) = 0x000000ff
=== TEST 10: Alignment warnings (expect stderr warnings) ===
(Expect alignment warning for LH at 0x701):
[dmem] ALIGNMENT WARNING: LH at 0x00000701 not 2-byte aligned
[continuing after misaligned access — no crash]
========================================
 PASS: 27 FAIL: 0
ALL TESTS PASSED — dmem verified
========================================
Coverage Model: The 3-Way Cross
SW followed by LW is the first thing every engineer tests. The bugs hide in the corners: sign extension on the maximum negative value, byte position 3 in a partial write, back-to-back stores to adjacent addresses with aliased byte enables, and the LB/LBU distinction on a value with bit 7 set.
The industry-standard approach is to define a cross coverage model with three dimensions:
Coverage Group: dmem_access
Dimension 1 — access_type (8 bins):
LB, LH, LW, LBU, LHU, SB, SH, SW
Dimension 2 — alignment (3 bins):
ALIGNED — access meets natural alignment requirement
UNALIGNED — access violates alignment (tests exception handling)
ZERO — address is exactly 0x000 (reset value corner case)
Dimension 3 — data_pattern (5 bins):
ALL_ZEROS — 0x00000000
ALL_ONES — 0xFFFFFFFF (= -1 signed)
MSB_SET — MSB of the accessed width is 1 (sign extension trigger)
MAX_POSITIVE — 0x7F/0x7FFF/0x7FFFFFFF (max positive for signed interpretation)
RANDOM — arbitrary pattern
3-way cross: access_type × alignment × data_pattern
Total bins: 8 × 3 × 5 = 120 coverage points
In SystemVerilog UVM, this would be a covergroup with cross between three coverpoints. In SystemC, you implement it as a monitor struct tracking hit/miss for each combination. Here is the SystemC coverage collector:
// File: dmem_coverage.h
// Functional coverage collector for dmem
// Tracks the 3-way cross: access_type × alignment × data_pattern
#ifndef DMEM_COVERAGE_H
#define DMEM_COVERAGE_H
#include <systemc.h>
#include <bitset>
#include <iostream>
#include <iomanip>
#include "dmem.h"
SC_MODULE(dmem_coverage) {
sc_in<bool> clk;
sc_in<sc_uint<32>> addr;
sc_in<sc_uint<32>> wr_data;
sc_in<sc_uint<32>> rd_data;
sc_in<bool> mem_read;
sc_in<bool> mem_write;
sc_in<sc_uint<3>> funct3;
// ── Coverage dimensions ─────────────────────────────────────────────────
// Dimension 1: access type. funct3 values: LB=0 LH=1 LW=2 LBU=4 LHU=5;
// stores reuse 0/1/2 with mem_write high. We map to indices 0-7 for the array.
enum AccessType { AT_LB=0, AT_LH=1, AT_LW=2, AT_LBU=3, AT_LHU=4,
AT_SB=5, AT_SH=6, AT_SW=7, AT_COUNT=8 };
// Dimension 2: alignment
enum AlignType { AL_ALIGNED=0, AL_UNALIGNED=1, AL_ZERO=2, AL_COUNT=3 };
// Dimension 3: data pattern
enum DataPat { DP_ALL_ZEROS=0, DP_ALL_ONES=1, DP_MSB_SET=2,
DP_MAX_POS=3, DP_OTHER=4, DP_COUNT=5 };
// 3-way cross: hit[access_type][align][data_pattern]
bool hit[AT_COUNT][AL_COUNT][DP_COUNT];
int total_hits = 0;
AccessType classify_access(uint32_t f3, bool is_write) const {
if (is_write) {
if (f3 == 0) return AT_SB;
if (f3 == 1) return AT_SH;
return AT_SW;
} else {
if (f3 == 0) return AT_LB;
if (f3 == 1) return AT_LH;
if (f3 == 2) return AT_LW;
if (f3 == 4) return AT_LBU;
return AT_LHU;
}
}
AlignType classify_align(uint32_t a, uint32_t f3) const {
if (a == 0) return AL_ZERO;
uint32_t width = (f3 == 2) ? 4 : (f3 == 1) ? 2 : 1;
if ((a & (width - 1)) != 0) return AL_UNALIGNED;
return AL_ALIGNED;
}
DataPat classify_data(uint32_t data, uint32_t f3) const {
// Mask to accessed width for stores
uint32_t mask = (f3 == 2) ? 0xFFFFFFFFu :
(f3 == 1) ? 0x0000FFFFu : 0x000000FFu;
uint32_t d = data & mask;
if (d == 0) return DP_ALL_ZEROS;
if (d == mask) return DP_ALL_ONES;
uint32_t msb_mask = (f3 == 2) ? 0x80000000u :
(f3 == 1) ? 0x00008000u : 0x00000080u;
if (d & msb_mask) {
if (d == msb_mask) return DP_MSB_SET; // exactly MSB set
}
uint32_t max_pos = mask >> 1;
if (d == max_pos) return DP_MAX_POS;
return DP_OTHER;
}
void sample() {
bool is_write = mem_write.read();
bool is_read = mem_read.read();
if (!is_write && !is_read) return;
uint32_t a = (uint32_t)addr.read();
uint32_t d = (uint32_t)(is_write ? wr_data.read() : rd_data.read());
uint32_t f3 = (uint32_t)funct3.read();
AccessType at = classify_access(f3, is_write);
AlignType al = classify_align(a, f3 & 0x3); // mask to width bits
DataPat dp = classify_data(d, f3 & 0x3);
if (!hit[at][al][dp]) {
hit[at][al][dp] = true;
total_hits++;
}
}
void report() const {
const char* at_names[] = {"LB","LH","LW","LBU","LHU","SB","SH","SW"};
const char* al_names[] = {"aligned","unaligned","zero-addr"};
const char* dp_names[] = {"all-zeros","all-ones","msb-set","max-pos","other"};
int total = AT_COUNT * AL_COUNT * DP_COUNT;
std::cout << "\n=== dmem 3-Way Coverage Report ===\n";
std::cout << " Total bins: " << total << std::endl;
std::cout << " Bins hit: " << total_hits << std::endl;
std::cout << " Coverage: "
<< std::fixed << std::setprecision(1)
<< (100.0 * total_hits / total) << "%\n";
std::cout << "\n Unhit bins:\n";
for (int a = 0; a < AT_COUNT; a++)
for (int al = 0; al < AL_COUNT; al++)
for (int dp = 0; dp < DP_COUNT; dp++)
if (!hit[a][al][dp])
std::cout << " [ ] " << at_names[a] << " × "
<< al_names[al] << " × " << dp_names[dp] << "\n";
}
SC_CTOR(dmem_coverage) {
memset(hit, 0, sizeof(hit));
SC_METHOD(sample);
sensitive << clk.pos();
}
};
#endif // DMEM_COVERAGE_H
This is a direct SystemC translation of what a UVM covergroup with cross does. The sample() method fires on every clock edge (mirroring sample() in a UVM covergroup), classifies the current transaction into the 3-way bin, and marks it hit. The report() method at end-of-simulation prints coverage percentage and lists the unhit bins — exactly what you'd see from a UVM coverage report or Synopsys VCS urg output.
Connecting dmem to the CPU Datapath
When the single-cycle CPU comes together in Post 18, dmem connects to the ALU output (effective address) and the register file (write data and read-back data):
graph LR
RF["Register File\n(Post 8)"]
ALU["ALU\n(Post 5)"]
DEC["Decoder\n(Post 9)"]
DMEM["dmem\n(Post 11)"]
WB["Write-Back\nMux"]
RF -->|rs1, rs2| ALU
ALU -->|result = addr| DMEM
RF -->|rs2 = store data| DMEM
DEC -->|mem_read\nmem_write\nfunct3| DMEM
DMEM -->|rd_data| WB
ALU -->|alu_result| WB
WB -->|rd_data| RF
style DMEM fill:#06b6d4,color:#fff
style ALU fill:#10b981,color:#fff
style RF fill:#f59e0b,color:#fff
style WB fill:#6366f1,color:#fff
The address path (rs1 + imm → ALU → dmem.addr) and the data path (rs2 → dmem.wr_data, dmem.rd_data → register file) are the two memory-access data flows. The funct3 field flows directly from the decoder to dmem without transformation — the encoding is designed so that dmem can use it directly.
Section 2 Progress
graph LR
P8["Post 8\nRegister File"]
P9["Post 9\nInstruction Decoder"]
P10["Post 10\nPC + IMEM"]
P11["Post 11\ndmem\n← You are here"]
P12["Post 12\nSection 2 Capstone"]
P18["Post 18\nSingle-Cycle CPU"]
P8 --> P12
P9 --> P12
P10 --> P12
P11 --> P12
P12 --> P18
style P11 fill:#f59e0b,color:#fff
style P12 fill:#10b981,color:#fff
style P18 fill:#6366f1,color:#fff
All five Section 2 data-path blocks are now complete (Post 10 contributed two of them, the PC and the instruction memory):
- Post 8: Register File (32 × 32-bit registers, synchronous write, combinational read)
- Post 9: Instruction Decoder (opcode/funct3/funct7 → control signals)
- Post 10: PC + IMEM (sequential counter + combinational ROM)
- Post 11: Data Memory (byte-addressable, all 8 load/store widths)
The only remaining Section 2 work is the capstone (Post 12), which wires these modules together and runs an end-to-end fetch-decode-memory simulation.
Simulation Semantics: SC_METHOD for Combinational Load, SC_CTHREAD for Store
The dmem module uses the same two-process split as the PC module, but applied to memory rather than a register.
Why Split the Processes?
The combinational read (read_proc as SC_METHOD) models the behavior of an SRAM's read port: address in, data out, no clock required. The synchronous write (write_proc as SC_CTHREAD) models the SRAM's write port: data is captured on the rising clock edge, then becomes visible on the next read.
In the single-cycle CPU, these processes fire in this order during one clock period:
Rising edge at time T:
1. SC_CTHREAD (write_proc) resumes
→ If mem_write is high: mem[addr] ← wr_data (C++ array write — immediate, no delta)
→ wait() — suspends until next edge
Note: C++ array writes are NOT sc_signal writes. They take effect immediately,
not at the end of the delta cycle. This is a key difference from SV.
2. SC_METHOD (read_proc) evaluates (sensitive to addr, mem_read, funct3)
→ Reads mem[] with C++ array access — sees the value just written above
[No further events until next clock edge]
This is subtly different from how SV always_ff and always_comb interact. In SV:
// SV — write and read in same cycle with non-blocking assignment
always_ff @(posedge clk)
if (mem_write) mem[addr] <= wr_data; // NB: schedules for end of time step
always_comb
rd_data = mem[addr]; // Reads OLD value (before write commits)
The SV non-blocking assignment (<=) means the write does not take effect until the NBA update region at the end of the time step. A read in the same cycle sees the old value — read-before-write ("read-first") semantics, which is the default behavior of SV RTL memory models.
In our SystemC model, write_proc uses a plain C++ array write:
mem[a] = (uint8_t)(data & 0xFF); // Immediate, not a signal write
This is a direct memory mutation, visible immediately to any code that reads mem[] afterward in the same delta cycle. The write takes effect before read_proc re-evaluates (if both fire in the same delta). For the single-cycle CPU — where read and write to the same address in the same cycle is undefined behavior anyway — this difference does not matter. But be aware of it when modeling more complex memory subsystems.
Byte-Addressable Memory Model — Theory
RISC-V (and virtually all modern architectures) use byte-addressable memory. A word at address 0x100 occupies bytes at 0x100, 0x101, 0x102, 0x103. Little-endian means the least significant byte is at the lowest address:
Address Byte content (for word value 0x12345678)
0x100 0x78 ← LSB (word[7:0])
0x101 0x56 ← word[15:8]
0x102 0x34 ← word[23:16]
0x103 0x12 ← MSB (word[31:24])
This is important when comparing SystemC (byte-array model) to SystemVerilog (which commonly uses word-addressed memories):
// Word-addressed SV memory — common in RTL testbenches
logic [31:0] mem [0:255]; // 256 words = 1 KB
assign rd_data = mem[addr >> 2]; // Word index = byte addr / 4
// Byte-addressed SV memory — closer to hardware
logic [7:0] mem [0:1023]; // 1024 bytes = 1 KB
assign rd_data = {mem[addr+3], mem[addr+2], mem[addr+1], mem[addr]};
// SV concatenation: leftmost = MSB (mem[addr+3] = byte[31:24])
The byte-addressed model (uint8_t mem[] in SystemC, logic [7:0] mem[] in SV) handles all access widths naturally without any word-granularity conversion. Our dmem uses this model to correctly implement LB, LH, LW, LBU, and LHU with the same underlying array.
Byte-Enable vs. funct3-Based Access
RTL memories — especially those connected to AXI or AHB buses — use byte-enable signals (4 bits for a 32-bit word, one bit per byte). Our dmem uses funct3 directly, but the two approaches are equivalent:
| funct3 | Instruction | Byte enables |
|---|---|---|
| 010 | LW / SW (at addr[1:0]=00) | BE = 1111 — all 4 bytes |
| 001 | LH / SH (at addr[1:0]=00) | BE = 0011 — lower 2 bytes |
| 000 | LB / SB (at addr[1:0]=00) | BE = 0001 — byte 0 only |
| 000 | LB / SB (at addr[1:0]=01) | BE = 0010 — byte 1 only |
| 000 | LB / SB (at addr[1:0]=10) | BE = 0100 — byte 2 only |
| 000 | LB / SB (at addr[1:0]=11) | BE = 1000 — byte 3 only |
The byte-enable approach is standard on AMBA 5 AHB (HWSTRB[3:0]) and AXI4 (WSTRB[3:0]). Converting between the two:
// Derive byte enables from funct3 + addr[1:0]
uint32_t byte_offset = addr & 3; // addr[1:0]
uint32_t be = 0;
switch (funct3 & 0x3) { // mask away the sign/unsigned bit
case 0: be = 0x1 << byte_offset; break; // byte
case 1: be = 0x3 << byte_offset; break; // halfword
case 2: be = 0xF; break; // word
}
Our dmem avoids this conversion by indexing the byte array directly with the full byte address. Both approaches are correct; the byte-enable approach maps more directly to bus interface signals.
Sign Extension in Hardware vs. C++
Sign extension is required for LB and LH (signed loads). The question is: what is the most correct and efficient way to express it in each language?
Classic Verilog (explicit bit replication):
// Sign-extend an 8-bit value to 32 bits
assign rd_data = {{24{raw_byte[7]}}, raw_byte[7:0]};
// Explicit: replicate bit 7 twenty-four times, then append the byte
SystemVerilog (cast with $signed):
// SV: reinterpret as signed, then let assignment-width extension sign-extend
logic [31:0] rd_data;
logic [7:0] raw_byte;
assign rd_data = $signed(raw_byte); // signed 8-bit RHS widened to 32 bits → sign extension
// Or more directly:
assign rd_data = signed'(raw_byte); // SV 2012: static cast to signed
C++ / SystemC (cast chain — most idiomatic):
// Sign-extend a byte (uint8_t → int8_t → int32_t)
uint8_t raw = mem[a];
int8_t signed_byte = (int8_t)raw; // Reinterpret bit pattern as signed
int32_t extended = (int32_t)signed_byte; // C++ widens signed type by sign extension
uint32_t result = (uint32_t)extended; // Reinterpret as unsigned for sc_uint
The C++ standard (ISO/IEC 14882, section on integral conversions) guarantees: converting a signed integer to a wider signed integer preserves the value (i.e., sign-extends). This means (int32_t)(int8_t)0xAB = 0xFFFFFFAB on every conforming C++ implementation.
The cast chain (uint32_t)(int32_t)(int8_t)raw is:
1. (int8_t)raw — reinterpret the bit pattern as signed (no change to bits)
2. (int32_t)(int8_t) — widen: C++ sign-extends from bit 7
3. (uint32_t)(int32_t) — reinterpret final result as unsigned (no change to bits)
Zero extension (LBU, LHU) is simpler — no cast needed:
// uint8_t widened to uint32_t: C++ always zero-extends unsigned types
result = (uint32_t)mem[a]; // mem[a] is uint8_t, widened to uint32_t with zeroes
Common Pitfalls for SV Engineers
Pitfall 1: Byte-enable confusion when halfword is at non-zero offset
When storing a halfword at address 0x102 (byte offset 2 within a word), the byte enables are BE = 1100 (bytes 2 and 3), not BE = 0011. Engineers who think "halfword = two bytes starting from position 0" get this wrong. The formula: BE = 0b0011 << addr[1:0]. At addr[1:0] = 10 (binary), BE = 0b1100.
Our dmem avoids this confusion by using full byte addresses directly: mem[a] and mem[a+1] for SH, regardless of offset.
Pitfall 2: LB and LBU are different instructions — a decoder bug silently gives wrong results
LB (funct3=000) sign-extends. LBU (funct3=100) zero-extends. The funct3 encoding bit 2 is the sign/unsigned selector. A decoder that does not output this bit separately, or that maps both to the same code path, will produce correct results for positive bytes (0x00–0x7F) and wrong results for "negative" bytes (0x80–0xFF). This bug is easily missed because most data in test programs is positive.
Pitfall 3: The C++ cast chain (int32_t)(uint8_t) does NOT sign-extend
The common mistake:
// WRONG — zero-extends, not sign-extends
result = (uint32_t)(int32_t)(uint8_t)raw;
// ↑ uint8_t → int32_t widens unsigned → zero extension
// CORRECT — sign-extends via int8_t intermediate
result = (uint32_t)(int32_t)(int8_t)raw;
// ↑ int8_t → int32_t widens signed → sign extension
The intermediate type must be int8_t (signed), not uint8_t (unsigned). The compiler cannot warn you about this — both compile without error and produce different behavior.
Pitfall 4: Little-endian byte order — assembling in the wrong direction
RISC-V is little-endian. When assembling a 32-bit word from four bytes:
// CORRECT little-endian assembly (LSB at lowest address):
result = (uint32_t)mem[a]
| ((uint32_t)mem[a+1] << 8)
| ((uint32_t)mem[a+2] << 16)
| ((uint32_t)mem[a+3] << 24);
// mem[a] is the LEAST significant byte
// WRONG (big-endian assembly — incorrect for RISC-V):
result = ((uint32_t)mem[a] << 24) // ← this puts mem[a] in MSB position
| ((uint32_t)mem[a+1] << 16)
| ((uint32_t)mem[a+2] << 8)
| (uint32_t)mem[a+3];
The SV concatenation syntax {mem[a+3], mem[a+2], mem[a+1], mem[a]} is visually confusing because SV concatenation places the first operand at the MSB position. So mem[a+3] goes in bits [31:24] — correct for little-endian where mem[a+3] holds the MSB.
Pitfall 5: Alignment assumptions — document what your model supports
Some implementations (DEC Alpha, original MIPS R2000/R3000) trap on misaligned access in hardware. RISC-V allows but does not require misalignment support in the base ISA. Our dmem handles misaligned access by printing a warning and continuing — it does not abort simulation or model an exception.
This is a modeling choice that must be documented and communicated to testbench writers. If a testbench intentionally tests misaligned access and expects a trap (exception to the control unit), it will see the wrong behavior with our model. A complete implementation would add a misalign_fault output port and connect it to the control unit. For Section 2 purposes, the warning message is sufficient.
What's Next
Post 12 is the Section 2 capstone: connecting all five blocks — register file, decoder, PC, imem, and dmem — into a top-level testbench. We will load a small RV32I program (using the imem hex loader from Post 10), simulate fetch and decode for each instruction, manually drive the ALU results and memory control signals, and verify that the full data-path produces correct outcomes for a sequence of load/store operations.
After the capstone, Section 3 begins with the control unit — the logic that generates mem_read, mem_write, reg_write, branch_taken, alu_src, and the other control signals that currently come from the testbench. When the control unit is in place, the CPU runs real programs without manual testbench intervention.
Key takeaways from this post:
- `dmem` splits into `SC_METHOD` (combinational load) and `SC_CTHREAD` (synchronous store) — the same split used in the PC module, and the same split RTL designers use with `always_comb` + `always_ff`
- Sign extension is a C++ cast: `(int32_t)(int8_t)byte` is correct, readable, and optimally compiled — no manual bit masking required
- Little-endian byte assembly is always: `mem[a] | (mem[a+1]<<8) | (mem[a+2]<<16) | (mem[a+3]<<24)`
- The AMBA AHB `HSIZE` field is conceptually identical to `funct3` in our interface — both encode transfer width as a 3-bit selector
- Coverage for memory interfaces is a 3-way cross of access type, alignment, and data pattern — 120 bins, not "did the word load work"
- Alignment constraints are not arbitrary: they exist because bus protocols transfer at word granularity and cache lines have minimum access granularity