# ichip2 Daily Technical Report

Date: 2026-05-29

## Scope
- MiniCPM5-1B hardware accelerator research and RTL evaluation.
- Focus: model structure, operators, datapath, DMA/memory, verification, synthesis, and design decisions.

## Sources
- [SCNet MiniCPM5-1B](https://www.scnet.cn/ui/aihub/models?order=updateTime&keyword=MiniCPM5-1B)
- [OpenBMB MiniCPM README](https://github.com/OpenBMB/MiniCPM/blob/main/README.md)
- [openbmb/MiniCPM5-1B model card](https://huggingface.co/openbmb/MiniCPM5-1B)
- Local model metadata: `downloads/MiniCPM5-1B_config.json`
- Local synthesis logs: `reports/synth_blocks/*.log`
- Local verification logs: `reports/verification_status.md`

## Model Structure
- MiniCPM5-1B uses `LlamaForCausalLM`, hidden size 1536, 24 layers, 16 Q heads, 2 KV heads, head dim 128.
- The MLP intermediate size of 4608 and the untied `lm_head` (200,540,160 parameters on top of an equally sized embedding table) make weight streaming important.
- Context length 131072 makes KV paging a first-class design constraint.
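
The shape numbers above are enough to reproduce the parameter totals that the resource estimate script reports. The following is a minimal sketch, not the project's script. `VOCAB` is an assumption derived from the reported embedding size (200,540,160 / 1536), and the count of three MLP matrices and two norms per layer plus one final norm follows the usual Llama layout.

```python
# Shape values come from the Model Structure list.
HIDDEN = 1536
LAYERS = 24
Q_HEADS = 16
KV_HEADS = 2
HEAD_DIM = 128
INTERMEDIATE = 4608
VOCAB = 130_560  # assumption: embedding_params / HIDDEN


def per_layer_linear_params() -> int:
    """Weights in the attention and MLP projections of one decoder layer."""
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections
    return q_proj + k_proj + v_proj + o_proj + mlp


def major_params() -> int:
    """Layer weights plus embedding, untied lm_head, and RMSNorm weights."""
    layer_linear = LAYERS * per_layer_linear_params()
    embedding = VOCAB * HIDDEN
    lm_head = VOCAB * HIDDEN  # untied, so counted separately
    norms = (2 * LAYERS + 1) * HIDDEN  # two per layer plus the final norm
    return layer_linear + embedding + lm_head + norms


print(f"per_layer_linear_params={per_layer_linear_params():,}")
print(f"major_params={major_params():,}")
print(f"int4_weight_bytes={major_params() // 2:,}")
```

The untied `lm_head` alone accounts for roughly 19% of the total, which is why it shows up in the weight-streaming budget.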

## Architecture Principle
- Use a 16x16 MAC tile (256 MAC/cycle) as the basic compute primitive and scale by replicating it across multiple clusters.
- Add a vector unit for RMSNorm, RoPE, SiLU, residual add, and elementwise math.
- Keep control, DMA, and SRAM simple enough to synthesize independently.
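
For orientation, the vector-unit operators can be written as a small floating-point reference. This is only a sketch of the math, not the fixed-point behaviour of the RTL. The `eps` default is an assumed value, and which element pairs RoPE rotates depends on the model's layout, which is not covered here.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by sqrt(mean(x^2) + eps), then apply a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(x):
    """SiLU: x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def rope_rotate(x0, x1, angle):
    """Rotate one (x0, x1) pair by `angle` radians, the core step of RoPE."""
    c, s = math.cos(angle), math.sin(angle)
    return x0 * c - x1 * s, x0 * s + x1 * c


def residual_add(x, y):
    """Elementwise residual connection."""
    return [a + b for a, b in zip(x, y)]
```

A reference like this is mainly useful as the comparison point when a fixed-point error budget is defined for these operators.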

### Plain-Language Explanation
A decoder-only LLM repeatedly turns one hidden vector into the next token. Most of the work is multiplying that vector by many large weight matrices. The accelerator therefore needs a fast repeated multiply-accumulate engine, enough local memory to feed it continuously, and a DMA system that keeps weight and KV-cache pages moving before the compute unit runs dry.

Grouped-query attention reduces KV cache size because 16 query heads share only 2 K/V head groups, an 8x reduction compared with giving every query head its own K/V head. Even so, the 131072-token context needs 1,610,612,736 bytes of KV cache at INT8, so KV bandwidth remains a dominant problem. The hardware plan therefore treats KV paging and INT8/INT4 compression as architecture requirements, not optional optimizations.
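
The KV cache arithmetic behind this paragraph can be sketched as follows. The function names are hypothetical, and the INT4 figure ignores any scale or zero-point metadata a real compression scheme would add.

```python
LAYERS = 24
HEAD_DIM = 128
CONTEXT = 131_072


def kv_elements_per_token(kv_heads: int = 2) -> int:
    """K and V elements stored per token across all layers."""
    return 2 * LAYERS * kv_heads * HEAD_DIM


def kv_cache_bytes(context_tokens: int, bits_per_element: int, kv_heads: int = 2) -> int:
    """Raw KV cache size in bytes for a given context length and element width."""
    return context_tokens * kv_elements_per_token(kv_heads) * bits_per_element // 8


for name, bits in (("BF16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"{name}: {kv_cache_bytes(CONTEXT, bits):,} bytes")

# Hypothetical comparison: the same model with 16 K/V heads (no grouping).
print(f"INT8 without GQA: {kv_cache_bytes(CONTEXT, 8, kv_heads=16):,} bytes")
```

Calling `kv_cache_bytes` with a shorter `context_tokens` value gives the footprint of a reduced-context profile, since the size scales linearly with context length.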

## Results Summary
- `tb_mac_array PASS`
- `tb_compute_cluster PASS`
- `tb_qk_score_unit PASS`
- `tb_kv_addr_gen PASS`
- `tb_kv_page_walker PASS`
- `tb_axi_read_adapter PASS`
- `tb_top_smoke PASS`
- `golden_checks.py PASS` for INT8 GEMV tile, QK score, mask, and KV address arithmetic.
- Yosys synthesis passed for top-level and block-level RTL.
- `ichip2_compute_cluster` now connects MAC, vector, and SRAM primitives into a real datapath layer.
- `ichip2_qk_score_unit`, `ichip2_kv_addr_gen`, and `ichip2_kv_page_walker` provide the first attention/KV baseline blocks.
- `ichip2_axi_read_adapter` now bridges burst descriptors into AXI4 read requests with backpressure and response tracking.
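
To illustrate what a golden check for the signed INT8 GEMV tile has to compute, here is a minimal sketch. It is not the contents of `golden_checks.py`. The function name, the 16x16 tile orientation (16 output rows, 16 input columns), and the unbounded accumulator are assumptions made for the example.

```python
TILE = 16  # 16x16 tile, 256 multiply-accumulates per call


def _check_int8(value: int) -> int:
    """Reject operands outside the signed INT8 range."""
    if not -128 <= value <= 127:
        raise ValueError(f"{value} is outside the signed INT8 range")
    return value


def gemv_tile_int8(weights, activations, acc_in=None):
    """Return acc_in + weights @ activations for one tile.

    Python integers do not overflow, so this models an accumulator that is
    wide enough for the workload; the RTL accumulator width is not modeled.
    """
    if len(weights) != TILE or len(activations) != TILE:
        raise ValueError("expected a 16x16 weight tile and 16 activations")
    acc = list(acc_in) if acc_in is not None else [0] * TILE
    for row in range(TILE):
        if len(weights[row]) != TILE:
            raise ValueError("weight rows must have 16 entries")
        for col in range(TILE):
            acc[row] += _check_int8(weights[row][col]) * _check_int8(activations[col])
    return acc
```

The largest magnitude a single tile can add to one accumulator is 16 x 128 x 128 = 262,144, which is a convenient corner case for a directed test.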

## Test Data

````text
# Verification Status

Date: 2026-05-28

## Completed

- Project skeleton created under `E:\HermesWorkspace\ichip2`.
- MiniCPM5-1B metadata copied from local reference download.
- Model parameter/resource estimate script executed successfully.
- ASCII scan passed for authored source, docs, scripts, and reports outside
  copied model metadata files.
- OSS CAD Suite installed on E drive:
  `E:\HermesWorkspace\tools\oss-cad-suite\oss-cad-suite`.
- Icarus Verilog simulation passed:
  - `tb_mac_array PASS`
  - `tb_top_smoke PASS`
- Yosys synthesis passed for the top-level control/DMA shell.
- Yosys block-level synthesis passed for MAC, vector, special-function helper,
  DMA, control, and top modules.
- `tb_compute_cluster PASS`.
- `tb_qk_score_unit PASS`.
- `tb_kv_addr_gen PASS`.
- `tb_kv_page_walker PASS`.
- `tb_axi_read_adapter PASS`.
- `golden_checks.py PASS` for signed INT8 GEMV tile, QK score, mask score,
  and KV cache address arithmetic.
- Top-level synthesis now includes `ichip2_compute_cluster` and its MAC/vector
  submodules.
- Added synthesizable attention/KV baseline blocks:
  - `rtl/attention/qk_score_unit.v`: 16-lane signed INT8 QK dot-product tile,
    raw accumulation, Q15 output scaling, and causal mask minimum score.
  - `rtl/memory/kv_addr_gen.v`: deterministic KV cache byte address generator
    for token/layer/K/V/head/element indexing.
  - `rtl/memory/kv_page_walker.v`: page descriptor walker that emits 128-byte
    ready/valid bursts for one physical K or V head.
  - `rtl/memory/axi_read_adapter.v`: burst-to-AXI read adapter with outstanding
    tracking, response-error capture, and done gating.
- Architecture evaluation report generated:
  `reports/architecture_eval.md`.
- Block synthesis summary generated:
  `reports/synth_blocks_summary.md`.

Resource estimate script output:

```text
per_layer_linear_params=28,311,552
layer_linear_params=679,477,248
embedding_params=200,540,160
lm_head_params=200,540,160
norm_params=75,264
major_params=1,080,632,832
bf16_weight_bytes=2,161,265,664
int8_weight_bytes=1,080,632,832
int4_weight_bytes=540,316,416
kv_elements_per_token=12,288
kv_cache_int8_bytes_full_context=1,610,612,736
kv_cache_bf16_bytes_full_context=3,221,225,472
baseline_mac_array=256 MAC/cycle
baseline_peak_at_500MHz=128.000 GMAC/s
```

## Tooling

Use the project setup script before direct EDA commands:

```powershell
powershell -ExecutionPolicy Bypass -Command ". .\scripts\setup_eda.ps1; yosys -V"
```

Project scripts:

- `scripts/setup_eda.ps1`
- `scripts/run_sim.ps1`
- `scripts/run_synth.ps1`
- `scripts/run_synth_blocks.ps1`
- `scripts/synth_yosys.ys`

## Next Verification Steps

1. Connect `kv_page_walker` to `axi_read_adapter` in a KV read DMA wrapper.
2. Extend `golden_checks.py` into randomized RTL-vs-golden comparisons.
3. Add online softmax RTL plan and fixed-point error budget.
4. Add AXI4 DMA wrappers and bus functional model tests.
````
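
The verification log describes `kv_addr_gen` as a deterministic byte address generator over token, layer, K/V, head, and element indices. One way such a layout could look is sketched below. The token-major ordering, the function name, and the argument names are assumptions for illustration and are not taken from the RTL. KV cache layout is still listed as an open question in this report.

```python
LAYERS = 24
KV_HEADS = 2
HEAD_DIM = 128


def kv_byte_address(base, token, layer, is_value, head, element, bytes_per_element=1):
    """Byte address of one KV element in an assumed token-major layout.

    Nesting order (outermost first): token, layer, K or V, head, element.
    """
    if token < 0:
        raise ValueError("token index must be non-negative")
    if not 0 <= layer < LAYERS:
        raise ValueError("layer index out of range")
    if not 0 <= head < KV_HEADS:
        raise ValueError("head index out of range")
    if not 0 <= element < HEAD_DIM:
        raise ValueError("element index out of range")
    index = token
    index = index * LAYERS + layer
    index = index * 2 + (1 if is_value else 0)
    index = index * KV_HEADS + head
    index = index * HEAD_DIM + element
    return base + index * bytes_per_element


# Consecutive tokens are one full per-token KV record apart.
print(kv_byte_address(0, token=1, layer=0, is_value=False, head=0, element=0))
```

Under this assumed layout with INT8 storage, one head vector occupies 128 contiguous bytes, the same size as the bursts the page walker is described as emitting.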

## Key Numbers
- Major parameters: 1,080,632,832
- BF16 weight bytes: 2,161,265,664
- INT4 weight bytes: 540,316,416
- Full-context INT8 KV cache: 1,610,612,736 bytes
- Baseline MAC array: 256 MAC/cycle

## Synthesis Data

### Synth Blocks Summary

Generated from `reports/synth_blocks/*.json`.

| Top | Total cells | Key cells |
|---|---:|---|
| `ichip2_axi_read_adapter` | 77 | `$add`=4, `$sub`=3, `$adff`=2, `$adffe`=9 |
| `ichip2_compute_cluster` | 46 | `$adff`=4, `$adffe`=10 |
| `ichip2_control_regs` | 31 | `$adff`=1, `$adffe`=8 |
| `ichip2_dma_engine` | 36 | `$add`=2, `$sub`=1, `$adff`=1, `$adffe`=5 |
| `ichip2_kv_addr_gen` | 7 | `$mul`=1, `$add`=5 |
| `ichip2_kv_page_walker` | 49 | `$add`=1, `$sub`=1, `$adff`=1, `$adffe`=6 |
| `ichip2_mac_array` | 517 | `$mul`=256, `$add`=256, `$adff`=1, `$adffe`=1 |
| `ichip2_qk_score_unit` | 42 | `$mul`=17, `$add`=16, `$adff`=1, `$adffe`=2 |
| `ichip2_rmsnorm_unit` | 4 | `$mul`=2, `$adff`=1, `$adffe`=1 |
| `ichip2_rope_unit` | 9 | `$mul`=4, `$add`=1, `$sub`=1, `$adff`=1, `$adffe`=2 |
| `ichip2_silu_unit` | 3 | `$adff`=1, `$adffe`=1 |
| `ichip2_vector_unit` | 84 | `$mul`=16, `$add`=16, `$adff`=1, `$adffe`=16 |
| `minicpm5_accel_top` | 27 | `$add`=1, `$adff`=1, `$adffe`=1 |

## Important Verification Waveform

The cluster smoke waveform checks command launch, busy assertion, compute latency, and done pulse. The QK score waveform checks raw score launch, valid output, and masked minimum score behavior.

![cluster waveform](../../site/assets/waveforms/cluster_smoke.svg)

![qk waveform](../../site/assets/waveforms/qk_score_unit.svg)
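
The behaviour the QK waveform exercises (raw accumulation, scaled valid output, masked minimum score) can be mirrored by a small integer model. This is a sketch under stated assumptions, not the RTL: the masked value is assumed to be the most negative 32-bit integer, the Q15 scale is assumed to be 1/sqrt(128), and the output shift is assumed to truncate toward negative infinity.

```python
import math

LANES = 16
HEAD_DIM = 128
MIN_SCORE = -(1 << 31)  # assumed masked score value
SCALE_Q15 = round((1 << 15) / math.sqrt(HEAD_DIM))  # assumed scale, 1/sqrt(128) in Q15


def qk_score(q, k, query_pos, key_pos, scale_q15=SCALE_Q15):
    """Score one query/key pair from signed INT8 vectors of length HEAD_DIM."""
    if len(q) != HEAD_DIM or len(k) != HEAD_DIM:
        raise ValueError("q and k must each hold HEAD_DIM elements")
    raw = 0
    # Accumulate 16 lanes per step; the accumulator stays unscaled throughout.
    for base in range(0, HEAD_DIM, LANES):
        raw += sum(q[base + i] * k[base + i] for i in range(LANES))
    # Causal mask: a key later than the query gets the minimum score.
    if key_pos > query_pos:
        return MIN_SCORE
    # Scale only at the output; >> on Python ints is an arithmetic shift.
    return (raw * scale_q15) >> 15
```

Keeping `raw` separate from the scaled result matches the decision that scale and mask belong at the score output rather than in the accumulation state.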

## Decisions
- INT4 weights are the target path; INT8 remains the bring-up path.
- Multi-cluster scaling is preferable to a monolithic wide array.
- Full logits should be streamed to top-k instead of stored.
- The first cluster uses shadow registers to feed the MAC and vector units while SRAM mirrors the tile write. The next iteration should replace this with real operand SRAM read timing and a tiled scheduler.
- QK accumulation must stay raw internally; scale and mask belong at the score output, not inside the accumulation state.
- KV page walking is now a separate ready/valid stage so the later AXI DMA wrapper can focus only on burst protocol.
- AXI read protocol is now isolated in a dedicated adapter so burst policy can be verified without mixing bus timing into KV address generation.

## Decision Reasoning
- Compute scaling alone looks attractive: 8 clusters at 500 MHz give 1,024 GMAC/s, so the roughly 880M linear MACs per token set a lower bound near 0.86 ms/token. That bound ignores DDR stalls and attention over the context.
- Full-context attention at 131072 tokens costs about 12.88B attention MACs/token, roughly 14.6x the 880M linear MACs/token. This makes attention/KV layout the next high-value work item.
- The top level was initially only a control shell. Adding `compute_cluster` makes synthesis statistics closer to the real accelerator and exposes timing/area pressure from actual multipliers and accumulators.
- A standalone KV address generator is the right first step before page tables because it proves the cache byte layout independently of burst policy.
- A dedicated AXI read adapter is the right next step after page walking because it lets bus pressure and response handling be tested without coupling them to KV arithmetic.
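
The MAC counts and the compute-only bound quoted in this section can be reproduced with a few lines of arithmetic. The sketch assumes one 256-MAC array per cluster, one MAC per weight per token at batch size 1, and perfect utilization. The function names are hypothetical.

```python
LAYERS = 24
Q_HEADS = 16
HEAD_DIM = 128
CONTEXT = 131_072
PER_LAYER_LINEAR = 28_311_552  # per_layer_linear_params from the estimate script
LM_HEAD = 200_540_160          # lm_head_params from the estimate script
MAC_PER_CYCLE = 256            # per cluster (assumed: one baseline array each)
CLOCK_HZ = 500e6


def linear_macs_per_token() -> int:
    """Decoder-layer projections plus lm_head; the embedding is a lookup."""
    return LAYERS * PER_LAYER_LINEAR + LM_HEAD


def attention_macs_per_token(context: int = CONTEXT) -> int:
    """QK scores and the weighted V sum each cost HEAD_DIM MACs per cached token per query head."""
    return 2 * LAYERS * Q_HEADS * HEAD_DIM * context


def compute_bound_ms(macs: int, clusters: int) -> float:
    """Time in milliseconds if every MAC slot is used every cycle."""
    return macs / (clusters * MAC_PER_CYCLE * CLOCK_HZ) * 1e3


print(f"linear MACs/token:    {linear_macs_per_token():,}")
print(f"attention MACs/token: {attention_macs_per_token():,}")
print(f"linear bound, 8 clusters:    {compute_bound_ms(linear_macs_per_token(), 8):.2f} ms")
print(f"attention bound, 8 clusters: {compute_bound_ms(attention_macs_per_token(), 8):.2f} ms")
```

Under the same idealized assumptions, full-context attention alone would take about 12.6 ms per token on 8 clusters, which is why the attention path dominates the plan even before memory stalls are counted.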

## Open Questions
- Exact AXI4 DMA shape and burst policy beyond the current read adapter.
- KV cache layout and compression policy.
- Fixed-point error budget for softmax/RMSNorm/RoPE.
- Whether the product target should guarantee full 131072 context or expose shorter edge-device profiles.

## References
- [Condensed web report](../../site/reports/daily/2026-05-29.html)