OpenRisc-66: RTL simulation of Linux based on ORPSoC

Date: June 20, 2015

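Introduction

Earlier, we covered RTL simulation of bare-metal programs. Those programs were small, only a few KB in size.

We have also already run an RTL simulation of the SoC for the O_board (http://blog.csdn.net/rill_zhen/article/details/21190757). In this section, we will run an RTL simulation of Linux on the ML501 platform.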

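1. Modifying the DDR2 simulation model

In the ORPSoC project for the ML501, the default configuration of the DDR2 simulation model does not match the parameters of the DDR2 SDRAM actually fitted on the board, so we need to modify it.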

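a. Actual memory parameters

Before modifying the DDR2 SDRAM simulation model, we first need to be clear about a few concepts.

Rank, bank, row, and column: these are all logical concepts.

In addition, there are physical concepts such as channel, module, chip, and device.

For the DDR2 SDRAM used on the ML501, the specific parameters are as follows: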

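Looking at the memory module, we can read the following label: MT4HTF3264HY-667F1 1RX16 256MB PC-5300S.

Here 3264 refers to the module organization, 32 Meg x 64; x64 means the data bus (DQ) of the whole module is 64 bits wide.

667 is the module's speed grade. PC-5300 also denotes the speed grade.

1RX16 means a single rank of x16 devices: the module carries 4 devices, each with a 16-bit data width, and 16 x 4 is exactly 64 bits.

256MB is, of course, the module capacity: 256M bytes.

The label alone already gives us a lot of information. The datasheet provides more detailed parameters: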

Rank: single rank.

Bank: BA is 2 bits wide, so there are 4 banks, each of size 256MB/4 = 64MB.

Row: the address is [12:0], 13 bits in total.

Column: the address is [9:0], 10 bits in total.
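As a quick cross-check of these figures, the capacity can be recomputed from the geometry. The sketch below is illustrative Python and not part of ORPSoC; the function name `ddr2_capacity_bytes` is made up for this example, and it assumes a single rank of 4 devices with 16 data bits each, as described above.

```python
def ddr2_capacity_bytes(row_bits, col_bits, bank_bits, dq_bits, devices):
    """Return (bytes per device, bytes per module) for a single-rank module."""
    # Every (bank, row, column) combination addresses one dq_bits-wide location.
    locations = 1 << (row_bits + col_bits + bank_bits)
    per_device = locations * dq_bits // 8
    return per_device, per_device * devices


MB = 1 << 20
per_device, module = ddr2_capacity_bytes(
    row_bits=13, col_bits=10, bank_bits=2, dq_bits=16, devices=4
)
print(per_device // MB, module // MB)  # 64 256
print(module // 4 // MB)               # one of the 4 banks, across the module: 64
```

The 13 + 10 + 2 address bits give 2^25 locations of 2 bytes per device, which matches the 512Mb part and the 256MB module.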

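b. Simulation model parameters

With the parameters of the actual memory module settled, we can modify the specific parameters of the simulation model.

Note that ddr2_model.v is only a timing model; the actual storage has to be sized by us according to the real hardware.

The parameter to change here is MEM_BITS. ddr2_model.v models a single device, and each device holds one quarter of each of the 4 banks, 64MB in total. So, for the following definition: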

    // Memory Storage
`ifdef MAX_MEM
    reg     [BL_MAX*DQ_BITS-1:0] memory  [0:`MAX_SIZE-1];
`else//     [8      *  16   -1:0]        [0:(1<<22) -1] ==> 2^22 entries x 16 bytes = 2^26 bytes = 64MB
    reg     [BL_MAX*DQ_BITS-1:0] memory  [0:`MEM_SIZE-1];
    reg     [`MAX_BITS-1:0]      address [0:`MEM_SIZE-1];
    reg     [MEM_BITS:0]         memory_index;
    reg     [MEM_BITS:0]         memory_used;
`endif

we need to set MEM_BITS to 22.
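The value 22 follows from the size of one storage entry. Here is a minimal sketch of that arithmetic in Python, assuming BL_MAX = 8 and DQ_BITS = 16 as in the comment in the listing above; `mem_bits_for` is a hypothetical helper for this example, not something defined in ddr2_model.v.

```python
def mem_bits_for(device_bytes, bl_max=8, dq_bits=16):
    """Return MEM_BITS such that (1 << MEM_BITS) burst entries hold device_bytes."""
    entry_bytes = bl_max * dq_bits // 8       # one burst entry: 8 * 16 bits = 16 bytes
    entries = device_bytes // entry_bytes
    if entries == 0 or entries & (entries - 1):
        raise ValueError("device size must be a power-of-two number of entries")
    return entries.bit_length() - 1


MB = 1 << 20
print(mem_bits_for(64 * MB))  # 22
print(mem_bits_for(2 * MB))   # 17
```

With MEM_BITS = 17 each device model can store only 2MB (8MB across the 4 devices), which is why it has to be raised before a 32MB image can be preloaded.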

The complete set of parameters is shown below:





    /****************************************************************************************
     * This software code and all associated documentation, comments or other information
     * (collectively "Software") is provided "AS IS" without warranty of any kind. MICRON
     * TECHNOLOGY, INC. ("MTI") EXPRESSLY DISCLAIMS ALL WARRANTIES EXPRESS OR IMPLIED,
     * INCLUDING BUT NOT LIMITED TO, NONINFRINGEMENT OF THIRD PARTY RIGHTS, AND ANY IMPLIED
     * WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. MTI DOES NOT
     * WARRANT THAT THE SOFTWARE WILL MEET YOUR REQUIREMENTS, OR THAT THE OPERATION OF THE
     * SOFTWARE WILL BE UNINTERRUPTED OR ERROR-FREE. FURTHERMORE, MTI DOES NOT MAKE ANY
     * REPRESENTATIONS REGARDING THE USE OR THE RESULTS OF THE USE OF THE SOFTWARE IN TERMS
     * OF ITS CORRECTNESS, ACCURACY, RELIABILITY, OR OTHERWISE. THE ENTIRE RISK ARISING OUT
     * OF USE OR PERFORMANCE OF THE SOFTWARE REMAINS WITH YOU. IN NO EVENT SHALL MTI, ITS
     * AFFILIATED COMPANIES OR THEIR SUPPLIERS BE LIABLE FOR ANY DIRECT, INDIRECT,
     * CONSEQUENTIAL, INCIDENTAL, OR SPECIAL DAMAGES (INCLUDING, WITHOUT LIMITATION, DAMAGES
     * FOR LOSS OF PROFITS, BUSINESS INTERRUPTION, OR LOSS OF INFORMATION) ARISING OUT OF
     * YOUR USE OF OR INABILITY TO USE THE SOFTWARE, EVEN IF MTI HAS BEEN ADVISED OF THE
     * POSSIBILITY OF SUCH DAMAGES. Because some jurisdictions prohibit the exclusion or
     * limitation of liability for consequential or incidental damages, the above limitation
     * may not apply to you.
     *
     * Copyright 2003 Micron Technology, Inc. All rights reserved.
     ****************************************************************************************/

    // current with 512Mb datasheet rev N
    // parameters based on Speed Grade
    //            SYMBOL     UNITS DESCRIPTION
    =     3750; // tCK        ps    Minimum Clock Cycle Time
    =      125; // tJIT(per)  ps    Period JItter
    =      125; // tJIT(duty) ps    Half Period Jitter
    =      250; // tJIT(cc)   ps    Cycle to Cycle jitter
    =      175; // tERR(nper) ps    Accumulated Error (2-cycle)
    =      225; // tERR(nper) ps    Accumulated Error (3-cycle)
    =      250; // tERR(nper) ps    Accumulated Error (4-cycle)
    =      250; // tERR(nper) ps    Accumulated Error (5-cycle)
    =      350; // tERR(nper) ps    Accumulated Error (6-10-cycle)
    =      450; // tERR(nper) ps    Accumulated Error (11-50-cycle)
    =      400; // tQHS       ps    Data hold skew factor
    =      500; // tAC        ps    DQ output access time from CK/CK#
    =      100; // tDS        ps    DQ and DM input setup time relative to DQS
    =      225; // tDH        ps    DQ and DM input hold time relative to DQS
    =      450; // tDQSCK     ps    DQS output access time from CK/CK#
    =      300; // tDQSQ      ps    DQS-DQ skew, DQS to last DQ valid, per group, per access
    =      250; // tIS        ps    Input Setup Time
    =      375; // tIH        ps    Input Hold Time
    =    55000; // tRC        ps    Active to Active/Auto Refresh command time
    =    15000; // tRCD       ps    Active to Read/Write command time
    =     7500; // tWTR       ps    Write to Read command delay
    =    15000; // tRP        ps    Precharge command period
    =    15000; // tRPA       ps    Precharge All period
    =        6; // tXARDS     tCK   Exit low power active power down to a read command
    =        2; // tXARD      tCK   Exit active power down to a read command
    =        2; // tXP        tCK   Exit power down to a non-read command
    =        3; // tANPD      tCK   ODT to power-down entry latency
    =        8; // tAXPD      tCK   ODT power-down exit latency
    =    15000; // CL         ps    Minimum CAS Latency
    //            ------     ----- -----------
    =    50000; // tFAW       ps    Four Bank Activate window
    =        0; // AL         tCK   Minimum Additive Latency
    =        6; // AL         tCK   Maximum Additive Latency
    =        3; // CL         tCK   Minimum CAS Latency
    =        7; // CL         tCK   Maximum CAS Latency
    =        2; // WR         tCK   Minimum Write Recovery
    =        8; // WR         tCK   Maximum Write Recovery
    =        4; // BL         tCK   Minimum Burst Length
    =        8; // BL         tCK   Minimum Burst Length
    =     8000; // tCK        ps    Maximum Clock Cycle Time
    =     0.48; // tCH        tCK   Minimum Clock High-Level Pulse Width
    =     0.52; // tCH        tCK   Maximum Clock High-Level Pulse Width
    =     0.48; // tCL        tCK   Minimum Clock Low-Level Pulse Width
    =     0.52; // tCL        tCK   Maximum Clock Low-Level Pulse Width
    =      TAC; // tLZ        ps    Data-out low-impedance window from CK/CK#
    =      TAC; // tHZ        ps    Data-out high impedance window from CK/CK#
    =     0.35; // tDIPW      tCK   DQ and DM input Pulse Width
    =     0.35; // tDQSH      tCK   DQS input High Pulse Width
    =     0.35; // tDQSL      tCK   DQS input Low Pulse Width
    =     0.20; // tDSS       tCK   DQS falling edge to CLK rising (setup time)
    =     0.20; // tDSH       tCK   DQS falling edge from CLK rising (hold time)
    =     0.35; // tWPRE      tCK   DQS Write Preamble
    =     0.40; // tWPST      tCK   DQS Write Postamble
    =     0.25; // tDQSS      tCK   Rising clock edge to DQS/DQS# latching transition
    =      0.6; // tIPW       tCK   Control and Address input Pulse Width
    =        2; // tCCD       tCK   Cas to Cas command delay
    =    40000; // tRAS       ps    Minimum Active to Precharge command time
    = 70000000; // tRAS       ps    Maximum Active to Precharge command time
    =     7500; // tRTP       ps    Read to Precharge command delay
    =    15000; // tWR        ps    Write recovery time
    =        2; // tMRD       tCK   Load Mode Register command cycle time
    =      200; // tDLLK      tCK   DLL locking time
    =   105000; // tRFC       ps    Refresh to Refresh Command interval minimum value
    = 70000000; // tRFC       ps    Refresh to Refresh Command Interval maximum value
    = TRFC_MIN + 10000; // tXSNR  ps  Exit self refesh to a non-read command
    =      200; // tXSRD      tCK   Exit self refresh to a read command
    =      TIS; // tISXR      ps    CKE setup time during self refresh exit.
    =        2; // tAOND      tCK   ODT turn-on delay
    =      2.5; // tAOFD      tCK   ODT turn-off delay
    =     2000; // tAONPD     ps    ODT turn-on (precharge power-down mode)
    =     2000; // tAOFPD     ps    ODT turn-off (precharge power-down mode)
    =    12000; // tMOD       ps    ODT enable in EMR to ODT pin transition
    =        3; // tCKE       tCK   CKE minimum high or low pulse width

    // Parameters based on Part Width
    =       13; // Address Bits
    =       13; // Number of Address bits
    =       10; // Number of Column bits
    =        2; // Number of Data Mask bits
    =       16; // Number of Data bits
    =        2; // Number of Dqs bits
    =    10000; // tRRD   Active bank a to Active bank b command time

    // DUAL_RANK // also define DUAL_RANK
    =        4; // Number of Chip Select Bits
    =        4; // Number of Chip Select Bits
    =        2; // Number of Chip Select Bits
    =        2; // Number of Chip Select Bits
    =        2; // Number of Chip Select Bits
    =        1; // Number of Chip Select Bits

    =        2; // Set this parmaeter to control how many Bank Address bits
    // 14, a DQ=16 each part, DQ=64 total (4 parts) => 1MB total (256KB each)
    // 15, a DQ=16 each part, DQ=64 total (4 parts) => 2MB total (512KB each)
    // 16, a DQ=16 each part, DQ=64 total (4 parts) => 4MB total (1MB each)
    // 17, a DQ=16 each part, DQ=64 total (4 parts) => 8MB total (2MB each)
    =       14; // Number of write data bursts can be stored in memory.  The default is 2^10=1024.
    =       22; // Number of write data bursts can be stored in memory.
                //256MB total(64MB each),Rill modify from 17 to 22 140410
    =       10; // the address bit that controls auto-precharge and precharge-all
    =        3; // the number of bits required to count to MAX_BL
    =        2; // the number of Burst Order Bits
    =        1; // If set to 1, the model will halt on command sequence/major errors
    =        0; // Turn on Debug messages
    =        0; // delay in nanoseconds
    =        0; // If set to 1, the model will put a random amount of delay on DQ/DQS during reads
    = 711689044; //seed value for random generator.
    =        2; // DQS driving time prior to first read strobe
    =        1; // DQS driving time after last read strobe
    =        2; // DQS low time prior to first read strobe
    =        1; // DQS low time after last valid read strobe
    =        0; // DQ/DM driving time prior to first read data
    =        0; // DQ/DM driving time after last read data
    =        1; // DQS half clock periods prior to first write strobe
    =        1; // DQS half clock periods after last valid write strobe


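c. Modifying the preload
We now have a simulation model that matches the real hardware. Before simulating, however, the Linux image must be loaded into the model in advance. This requires understanding the internal organization of DDR2 SDRAM, the exact meaning of parameters such as BL_MAX, BL_BITS, and DQ_BITS, and the DDR2 SDRAM read/write procedure and timing. For these topics, please refer to the book "Memory Systems: Cache, DRAM, Disk"; they are not repeated here.
For the Linux simulation, the memory size specified at compile time is 32MB, so I only preload 32MB. One bank is 64MB, so we only need to load bank0. Note, however, that bank0 is spread across the 4 devices.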
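The sizes involved can be sanity-checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names are made up for the example, and it assumes 32-bit program words and a 16-byte burst line per device (8 beats of 16 bits), with the image spread evenly over the 4 devices.

```python
MB = 1 << 20
IMAGE_BYTES = 32 * MB        # memory size Linux was built for
BANK_BYTES = 64 * MB         # one bank across the whole module
DEVICES = 4
LINE_BYTES = 8 * 16 // 8     # one burst line in one device: 16 bytes

program_words = IMAGE_BYTES // 4                        # 32-bit words in the image
lines_per_device = IMAGE_BYTES // DEVICES // LINE_BYTES

print(program_words)              # 8388608
print(lines_per_device)           # 524288
print(IMAGE_BYTES <= BANK_BYTES)  # True, so bank0 alone is enough
```

The first number is the word count needed for the image buffer in the testbench, and the second is the number of burst lines each device model would have to receive to hold a full 32MB image.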
Below is part of the modified orpsoc_testbench.v:

`ifdef XILINX_DDR2
 `ifndef GATE_SIM
   defparam dut.xilinx_ddr2_0.xilinx_ddr2_if0.ddr2_mig0.SIM_ONLY = 1;
 `endif

   always @( * ) begin
      ddr2_ck_sdram        <=  #(TPROP_PCB_CTRL) ddr2_ck_fpga;
      ddr2_ck_n_sdram      <=  #(TPROP_PCB_CTRL) ddr2_ck_n_fpga;
      ddr2_a_sdram    <=  #(TPROP_PCB_CTRL) ddr2_a_fpga;
      ddr2_ba_sdram         <=  #(TPROP_PCB_CTRL) ddr2_ba_fpga;
      ddr2_ras_n_sdram      <=  #(TPROP_PCB_CTRL) ddr2_ras_n_fpga;
      ddr2_cas_n_sdram      <=  #(TPROP_PCB_CTRL) ddr2_cas_n_fpga;
      ddr2_we_n_sdram       <=  #(TPROP_PCB_CTRL) ddr2_we_n_fpga;
      ddr2_cs_n_sdram       <=  #(TPROP_PCB_CTRL) ddr2_cs_n_fpga;
      ddr2_cke_sdram        <=  #(TPROP_PCB_CTRL) ddr2_cke_fpga;
      ddr2_odt_sdram        <=  #(TPROP_PCB_CTRL) ddr2_odt_fpga;
      ddr2_dm_sdram_tmp     <=  #(TPROP_PCB_DATA) ddr2_dm_fpga;//DM signal generation
   end // always @ ( * )
   
   // Model delays on bi-directional BUS
   genvar dqwd;
   generate
      for (dqwd = 0;dqwd < DQ_WIDTH;dqwd = dqwd+1) begin : dq_delay
	 wiredelay #
	   (
            .Delay_g     (TPROP_PCB_DATA),
            .Delay_rd    (TPROP_PCB_DATA_RD)
	    )
	 u_delay_dq
	   (
            .A           (ddr2_dq_fpga[dqwd]),
            .B           (ddr2_dq_sdram[dqwd]),
            .reset       (rst_n)
	    );
      end
   endgenerate
   
   genvar dqswd;
   generate
      for (dqswd = 0;dqswd < DQS_WIDTH;dqswd = dqswd+1) begin : dqs_delay
	 wiredelay #
	   (
            .Delay_g     (TPROP_DQS),
            .Delay_rd    (TPROP_DQS_RD)
	    )
	 u_delay_dqs
	   (
            .A           (ddr2_dqs_fpga[dqswd]),
            .B           (ddr2_dqs_sdram[dqswd]),
            .reset       (rst_n)
	    );
	 
	 wiredelay #
	   (
            .Delay_g     (TPROP_DQS),
            .Delay_rd    (TPROP_DQS_RD)
	    )
	 u_delay_dqs_n
	   (
            .A           (ddr2_dqs_n_fpga[dqswd]),
            .B           (ddr2_dqs_n_sdram[dqswd]),
            .reset       (rst_n)
	    );
      end
   endgenerate
   
   assign ddr2_dm_sdram = ddr2_dm_sdram_tmp;
   //parameter NUM_PROGRAM_WORDS=1048576;
   parameter NUM_PROGRAM_WORDS=8388608;   //Rill modify from 1048576
   integer ram_ptr, program_word_ptr, k;
   reg [31:0] tmp_program_word;
   reg [31:0] program_array [0:NUM_PROGRAM_WORDS-1]; // 1M words = 4MB//8M words = 32MB
   reg [8*16-1:0] ddr2_ram_mem_line; //8*16-bits= 8 shorts (half-words)
   genvar 	  i, j;
   generate
      // if the data width is multiple of 16
      for(j = 0; j < CS_NUM; j = j+1) begin : gen_cs // Loop of 1
         for(i = 0; i < DQS_WIDTH/2; i = i+1) begin : gen // Loop of 4 (DQS_WIDTH=8)
	    initial
	      begin

 `ifdef PRELOAD_RAM
  `include "ddr2_model_preload.v"
 `endif
	      end
	    
	    ddr2_model u_mem0
	      (
	       .ck        (ddr2_ck_sdram[CLK_WIDTH*i/DQS_WIDTH]),
	       .ck_n      (ddr2_ck_n_sdram[CLK_WIDTH*i/DQS_WIDTH]),
	       .cke       (ddr2_cke_sdram[j]),
	       .cs_n      (ddr2_cs_n_sdram[CS_WIDTH*i/DQS_WIDTH]),
	       .ras_n     (ddr2_ras_n_sdram),
	       .cas_n     (ddr2_cas_n_sdram),
	       .we_n      (ddr2_we_n_sdram),
	       .dm_rdqs   (ddr2_dm_sdram[(2*(i+1))-1 : i*2]),
	       .ba        (ddr2_ba_sdram),
	       .addr      (ddr2_a_sdram),
	       .dq        (ddr2_dq_sdram[(16*(i+1))-1 : i*16]),
	       .dqs       (ddr2_dqs_sdram[(2*(i+1))-1 : i*2]),
	       .dqs_n     (ddr2_dqs_n_sdram[(2*(i+1))-1 : i*2]),
	       .rdqs_n    (),
	       .odt       (ddr2_odt_sdram[ODT_WIDTH*i/DQS_WIDTH])
	       );
         end
      end
   endgenerate
   
`endif

Below is the modified ddr2_model_preload.v:

// File intended to be included in the generate statement for each DDR2 part.
// The following loads a vmem file, "sram.vmem" by default, into the SDRAM.

// Wait until the DDR memory is initialised, and then magically
// load it
$display("%t: wait phy_init_done",$time);
@(posedge dut.xilinx_ddr2_0.xilinx_ddr2_if0.phy_init_done);
$display("%t: Loading DDR2",$time);

$readmemh("sram.vmem", program_array);
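/* Now dish it out to the DDR2 model's memory */
// Each ram_ptr step consumes 16 words (64 bytes) of the image: 16 bytes in each
// of the 4 devices. 64*1024 iterations therefore load the first 4MB; raise the
// bound to 512*1024 if the full 32MB must be loaded.
for(ram_ptr = 0 ; ram_ptr < 64*1024/*4096*/ ; ram_ptr = ram_ptr + 1)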
  begin

     // Construct the burst line, with every second word from where we
     // started, and picking the correct half of the word with i%2
     program_word_ptr = ram_ptr * 16 + (i/2) ; // Start on word0 or word1

     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[15:0] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];

     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr]; 
     ddr2_ram_mem_line[31:16] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];
     
     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[47:32] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];
     
     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[63:48] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];
     
     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[79:64] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];
     
     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[95:80] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];
     
     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[111:96] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];
     
     program_word_ptr = program_word_ptr + 2;
     tmp_program_word = program_array[program_word_ptr];
     ddr2_ram_mem_line[127:112] = tmp_program_word[15 + ((i%2)*16):((i%2)*16)];
     
     // Put this assembled line into the RAM using its memory writing TASK
     //                 (bank ,row          , { col               }, data
     u_mem0.memory_write(2'b00,ram_ptr[19:7], {ram_ptr[6:0],3'b000},ddr2_ram_mem_line);
     
     //$display("Writing 0x%h, ramline=%d",ddr2_ram_mem_line, ram_ptr);
     
  end // for (ram_ptr = 0 ; ram_ptr < ...
$display("(%t) * DDR2 RAM %1d preloaded",$time, i);

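There are two points to note here:
First, program_array[] is contiguous and linear, but the storage across the 4 devices is not. Before calling memory_write(), the data must therefore be rearranged into the layout the DDR2 SDRAM actually uses.
Second, because we preload at most 32MB, which is less than one bank, the bank address is always 2'b00. If the program to be simulated ever grows beyond the size of one bank, the bank address will have to be changed as well.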
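To make the first point concrete, here is a small Python model of the same rearrangement. It is an illustrative sketch that mirrors the index arithmetic of ddr2_model_preload.v; the function names are invented for this example, and labelling the low half of word n as half-word 2n is only a convention chosen for the demo data.

```python
def device_halfwords(program_array, ram_ptr, device):
    """Half-words that device `device` (0..3) stores for burst line `ram_ptr`."""
    shift = (device % 2) * 16              # low or high half of each 32-bit word
    first = ram_ptr * 16 + device // 2     # start on word0 or word1 of the 16-word group
    return [(program_array[first + 2 * k] >> shift) & 0xFFFF for k in range(8)]


def pack_line(halfwords):
    """Pack 8 half-words into one 128-bit line, half-word k at bits [16k+15:16k]."""
    line = 0
    for k, hw in enumerate(halfwords):
        line |= hw << (16 * k)
    return line


def line_address(ram_ptr):
    """(bank, row, column) passed to memory_write for burst line `ram_ptr`."""
    row = (ram_ptr >> 7) & 0x1FFF          # ram_ptr[19:7]
    col = (ram_ptr & 0x7F) << 3            # {ram_ptr[6:0], 3'b000}
    return 0, row, col                     # bank is fixed at 0


# Demo image: word n carries half-word 2n in its low half and 2n+1 in its high half.
words = [((2 * n + 1) << 16) | (2 * n) for n in range(16)]
print(device_halfwords(words, 0, 0))  # [0, 4, 8, 12, 16, 20, 24, 28]
print(device_halfwords(words, 0, 3))  # [3, 7, 11, 15, 19, 23, 27, 31]
print(line_address(129))              # (0, 1, 8)
```

With this labelling, device d receives half-words d, d+4, d+8, and so on: each 64-bit beat on the module's data bus carries two consecutive 32-bit words, split 16 bits per device.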


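2. Verification
In orpsocv2/sw/makefile.inc, make the change that uses a prebuilt ELF file to generate the vmem file. The exact modification was described in an earlier post and is not repeated here.
Run: make rtl-test TEST=linux PRELOAD_RAM=1
This produces the Linux simulation result, which matches what is seen on the actual board.
Needless to say, because Linux is a large program, waiting for the boot to complete takes a very long time.
Below is part of the output: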


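3. Summary
From earlier embedded work I was very familiar with the Linux boot messages, but finding out the state of every device on the board at every clock cycle during boot was almost impossible. With RTL simulation, that is now achievable.
Enjoy!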