Intro

We have a development environment going now, so it is time to execute on our first piece of the puzzle - installing the trap handler.

~~compile a trap handler~~ ✔️
upload the trap handler shader & install it and ask the wave to be launched with ttmps reserved
invoke one of the methods of triggering the trap handler for a specific wave
use a piece of memory to communicate to the host (👋 hey host! i am in the trap handler now)
- if we want to breakpoint on an instruction instead of just the wave, we can now enable single stepping
park the wave somehow until the host tells us to continue
repeat entering the trap handler and waiting for the host as required

Uploading and installing the trap handler

To check if we managed to successfully install a trap handler, we can use umr. We dump the waves executing on the GPU and check STATUS.TRAP_EN. If that flag is 1, then the trap handler is ready to go. Below is a part of the output we can get from umr by running sudo umr -O bits,halt_waves -go 0 -wa gfx_0.0.0: all wave state is displayed for each active wave. Furthermore, we get a disassembly of the code that the wave is currently executing, where * marks the current instruction to be executed.

------------------------------------------------------
se0.sa0.wgp0.simd0.wave0


Main Registers:
  pc_hi: 00008000 | pc_lo:  00004e30 |  wave_inst_dw0:  f8001c0f |  exec_hi:  ffffffff | 
exec_lo: ffffffff |    m0:  00000000 |  ib_dbg1:        01000000 | 


Wave_Status[0801a100]:
                scc:  0 |        execz:  0 |        vccz:  0 |          in_tg:  0 | 
               halt:  1 |        valid:  1 |    spi_prio:  0 |      wave_prio:  0 | 
            trap_en:  0 |    ttrace_en:  0 |  export_rdy:  1 |     in_barrier:  0 | 
               trap:  0 |      ecc_err:  0 | skip_export:  0 |        perf_en:  0 | 
      cond_dbg_user:  0 | cond_dbg_sys:  0 |    data_atc:  0 |       inst_atc:  0 | 
dispatch_cache_ctrl:  0 |  must_export:  1 |  fatal_halt:  0 | ttrace_simd_en:  1 | 


HW_ID1[00000000]:
wave_id:  0 | simd_id:  0 | wgp_id:  0 | se_id:  0 | sa_id:  0 | 


HW_ID2[07001000]:
queue_id:  0 | pipe_id:  0 | me_id:  0 | state_id:  1 | wg_id:  0 | vm_id:  7 | 


GPR_ALLOC[0f000100]:
vgpr_base:  0 | vgpr_size:  1 | sgpr_base:  0 | sgpr_size:  15 | 

SGPRS:
[... omitted for posterity ...]


VGPRS:
[... omitted for posterity ...]

PGM_MEM:
 (found shader at: 7@0x800000004e00 of 60 bytes)
    pgm[7@0x800000004e00 + 0x10  ] = 0xc8000100		v_interp_p1_f32_e32 v0, v0, attr0.y
    pgm[7@0x800000004e00 + 0x14  ] = 0xc8010101		v_interp_p2_f32_e32 v0, v1, attr0.y
    pgm[7@0x800000004e00 + 0x18  ] = 0x100000f0		v_mul_f32_e32 v0, 0.5, v0          
    pgm[7@0x800000004e00 + 0x1c  ] = 0x060204ff		v_add_f32_e32 v1, 0x7f7ffffd, v2   
    pgm[7@0x800000004e00 + 0x20  ] = 0x7f7ffffd	;;                                     
    pgm[7@0x800000004e00 + 0x24  ] = 0x5e000103		v_cvt_pkrtz_f16_f32_e32 v0, v3, v0 
    pgm[7@0x800000004e00 + 0x28  ] = 0xd52f0001		v_cvt_pkrtz_f16_f32_e64 v1, v1, 1.0
    pgm[7@0x800000004e00 + 0x2c  ] = 0x0001e501	;;                                     
 *  pgm[7@0x800000004e00 + 0x30  ] = 0xf8001c0f		exp mrt0 v0, v0, v1, v1 done compr vm
    pgm[7@0x800000004e00 + 0x34  ] = 0x80800100	;;                                       
    pgm[7@0x800000004e00 + 0x38  ] = 0xbf810000		s_endpgm                             
    pgm[7@0x800000004e00 + 0x3c  ] = 0xbf9f0000		s_code_end                           
    pgm[7@0x800000004e00 + 0x40  ] = 0xbf9f0000		s_code_end                           
    pgm[7@0x800000004e00 + 0x44  ] = 0xbf9f0000		s_code_end                           
    pgm[7@0x800000004e00 + 0x48  ] = 0xbf9f0000		s_code_end                           
    pgm[7@0x800000004e00 + 0x4c  ] = 0xbf9f0000		s_code_end                           
End of disassembly.



LDS_ALLOC[00002018]:
lds_base:  24 | lds_size:  2 | vgpr_shared_size:  0 | 


IB_STS[00000000]:
vm_cnt:  0 | exp_cnt:  0 | lgkm_cnt:   0 | valu_cnt:   0 | vs_cnt:  0 | 


IB_STS2[00000a03]:
inst_prefetch:  3 | resource_override:  0 | mem_order:  2 | fwd_progress:  0 | wave64:  1 | 


TRAPSTS[40000000]:
          excp:  0 | illegal_inst:  0 |      buffer_oob:  0 | excp_cycle:  0 | 
excp_wave64hi:   0 |      dp_rate:  2 | excp_group_mask:  0 | utc_error:   0 | 


MODE[001ff1c0]:
    fp_round:  0 | fp_denorm:  12 | dx10_clamp:  1 |      ieee:  0 | 
 lod_clamped:  0 |   debug_en:  0 |  excp_en:  511 | fp16_ovfl:  0 | 
disable_perf:  0 | 

... more waves follow ...

To install the trap handler, we now need to enlist the help of the driver, since we want to alter the state of the debugee. So let’s search for the most appropriate place to do this in radv…

… and its already there!

A surprise, to be sure, but a welcome one. This MR seems to do what we want: uses aco to compile a trap handler shader, uploads it, installs it and sets all the appropriate state for the hardware for it to be used. Excellent, we only need to set RADV_TRAP_HANDLER as an env variable to have trap handlers in radv.

Except the relevant code for trap handlers in radv is gated behind asserts requiring gfx8 or lower and I want to run this on gfx10.3. Let’s see what happens if we just remove the asserts and let it run anyway - welcome baby’s first mesa MR for fixing an uninitialized struct on the trap handler path.

I suppose that is a testament to the number of users for this so far. But no matter, with that fixed we try to enable the trap handler - and nothing happens. This is too anticlimactic, so let’s do some more research. It turns out Samuel Pitoiset has done more work on this: a second MR for gfx8+. And this MR comes with a related stalled kernel patch as well.

It appears that the requirement for installing the trap handler is setting the TBA¹ and TMA² registers. On gfx8 and below, setting these are nonprivileged operations and can be done via packets (what the first mesa MR enables). However from gfx9 and above, setting TBA and TMA is a privileged operation and only the kernel can do it. And currently it doesn’t.

I wavered a bit at this point. Could it be that this functionality doesn’t work on gfx10? Even if it does, is my 10 day Linux driver stack experience enough to patch the kernel to enable this? Fortunately, I realized that the kernel can be sidestepped here - enter umr.

Cutting out the middlemen

So far we have used umr to look at the state of the GPU. umr stands for “User Mode Register Debugger”, so how can it interface with the GPU if it is user mode? In fact, how can any program talk to the GPU?

Generally, applications talk to a user-mode driver (UMD), like radv, via an API, such as Vulkan, OpenCL, you name it. The UMD does some work transforming this incoming data into a format that the kernel mode driver (KMD) understands and talks to the KMD via ioctls. An ioctl is comparable to a syscall: an entrypoint into the kernel. You prepare a command and some data, invoke the ioctl, the kernel driver takes over and performs what you requested.

Then the KMD needs to talk to the GPU. The easiest way to do this is by toggling certain GPU registers, and perform writes to the VRAM. But this does not scale well, you’d need the KMD to continuously keep up with commands executed. Instead the KMD writes commands for what the GPU should do, which the GPU can execute on its own. The unit consuming these instruction is called the Command Processor (CP). The GPU has multiple independent command processors, which map to queues in Vulkan or rings³ in amdgpu.

In fact when we previously dumped the waves with umr, we specified the graphics ring (gfx_0.0.0).

But it is still possible to write registers from the host (done via memory mapped I/O (MMIO) and others), which is usually done by the KMD for maintenance tasks, such as starting or resetting the GPU, and other housekeeping.

Fortunately, amdgpu bestows the userspace with powers of the kernel. It is possible (with elevated privileges) to access KMD and GPU functions - this is what umr is doing. The KMD exposes a debugfs interface (essentially file descriptors located in /sys/kernel/debug/dri/<device_id>/) that will call into the KMD when read or written.

What this means for us is that if the trap handler installation requires setting some registers, we can probably do it ourselves instead of having to get a new kernel!

Banked registers

An important aspect of setting registers is banking. The GPU can have many of a certain type of register, because the GPU has many instances of the same hardware block. To access the register of a certain hardware block, we need to first select hardware block (or bank). There are two banking mechanisms important for us: GRBM, which selects by SE, SA and INSTANCE (which is a combination of WGP and SIMD indices), and SRBM which selects based on ME, PIPE, QUEUE and VMID⁴. When we perform a banked register read or write, umr tells the KMD to first set the GRBM or SRBM, thereby activating the bank we want, then performs the access on the register instance. For writes it is also possible to have them broadcast into multiple banks. The reason we need a bit of dance is because the GRBM and SRBM selector registers are global - another process (or the kernel itself) could change them simultaneously, and then our write will go to the wrong bank. For this reason we need the KMD arbitrating (with a lock) to facilitate race-free MMIO.

Incantations of trap enablement

We have code in radv that sets up a shader and a buffer, so we don’t need to write that code ourselves for now. For simplicity, we can just write these virtual addresses to a file, then have the debugger program enable them. In the debugger, we use MMIO to write the relevant registers (TBA_LO, TBA_HI, TMA_LO, TMA_HI). The addresses need to be aligned, so we drop the low 8 bits. There is also a bit in TBA_HI that must be set for gfx10+ for the trap to be enabled. If we want to disable the trap, we can just clear this bit.

  
#define SQ_SHADER_TBA_HI__TRAP_ENABLE (1 << 0x1f)

int enable_debugging(uint64_t tba_va, uint64_t tma_va){
    umr_reg* reg_tba_lo = umr_find_reg_data(asic, "mmSQ_SHADER_TBA_LO");
    if(!reg_tba_lo){
        asic->err_msg("[BUG]: Cannot find TBA_LO register\n");
        return -1;
    }

    umr_reg* reg_tba_hi = umr_find_reg_data(asic, "mmSQ_SHADER_TBA_HI");
    if(!reg_tba_hi){
        asic->err_msg("[BUG]: Cannot find TBA_HI register\n");
        return -1;
    }

    umr_reg* reg_tma_lo = umr_find_reg_data(asic, "mmSQ_SHADER_TMA_LO");
    if(!reg_tma_lo){
        asic->err_msg("[BUG]: Cannot find TMA_LO register\n");
        return -1;
    }

    umr_reg* reg_tma_hi = umr_find_reg_data(asic, "mmSQ_SHADER_TMA_HI");
    if(!reg_tma_hi){
        asic->err_msg("[BUG]: Cannot find TMA_HI register\n");
        return -1;
    }
  
    asic->options.use_bank = 2;
    asic->options.bank.srbm.me = 0;
    asic->options.bank.srbm.pipe = 0;
    for(int i = 1; i < 15; i++){
        asic->options.bank.srbm.vmid = i;
        
        // drop low 8 bits, and take low 32 bits of the result
        asic->reg_funcs.write_reg(asic, reg_tba_lo->addr * 4, tba_va >> 8, reg_tba_lo->type);
        // drop low 8 bits, and take high 32 bits of the result (not all used)
        asic->reg_funcs.write_reg(asic, reg_tba_hi->addr * 4, (tba_va >> 40) | SQ_SHADER_TBA_HI__TRAP_ENABLE, reg_tba_hi->type);
        // drop low 8 bits, and take low 32 bits of the result
        asic->reg_funcs.write_reg(asic, reg_tma_lo->addr * 4, tma_va >> 8, reg_tma_lo->type);
        // drop low 8 bits, and take high 32 bits of the result (not all used)
        asic->reg_funcs.write_reg(asic, reg_tma_hi->addr * 4, (tma_va >> 40), reg_tma_hi->type);
    }

    return 0;
}

int disable_debugging(){
    umr_reg *reg = umr_find_reg_data(asic, "mmSQ_SHADER_TBA_HI");
    if(!reg){
        asic->err_msg("[BUG]: Cannot find TBA_HI register\n");
        return -1;
    }

    asic->options.use_bank = 2;
    asic->options.bank.srbm.me = 0;
    asic->options.bank.srbm.pipe = 0;

    for(int i = 1; i < 15; i++){
        asic->options.bank.srbm.vmid = i;
        
        unsigned v = asic->reg_funcs.read_reg(asic, reg->addr * 4, reg->type);
        asic->reg_funcs.write_reg(asic, reg->addr * 4, v & ~(1 << SQ_SHADER_TBA_HI__TRAP_EN__SHIFT), reg->type);
    }

    return 0;
}

Update (31/01/2023): AMD has begun upstreaming the debug code for gfx10.3 into the kernel, yay! We can see by that code that I missed some steps (possibly preventing hangs). Ironically, I had similar code before, but removed it as it seemed not to do anything.

For this initial implementation we don’t bother to be selective and just loop over all of the VMIDs and set the trap handler registers (VMID0 is special, so that we skip).

And if we now query the waves with umr:

Grepping for trap_en in the umr output shows us that the waves are now launching with the trap handler.

Invoking the trap handler

Since umr is indicating our success, it is time to test actually invoking the trap handler. We will install very simple trap handler, that just calls s_sleep before returning. We will see in the next part how to make proper trap handlers, but suffice to say that this following one just delays the wave before continuing.

  
; depending on how many waves are running per frame, we might need more or less
; sleeping for the frame time to be significantly different.
; one s_sleep is limited to 127*64 cycles (the actual amount slept is not precise), 
; therefore we can add more s_sleeps to increase the time taken.
s_sleep 127
s_sleep 127
s_sleep 127
s_sleep 127

s_add_u32 ttmp0, ttmp0, 4
s_addc_u32 ttmp1, ttmp1, 0
s_and_b32 ttmp1, ttmp1, 0xffff
s_rfe_b64 [ttmp0, ttmp1]

To put this shader on the GPU, we change out the trap handler shader to our binary:

  
struct radv_trap_handler_shader *
radv_create_trap_handler_shader(struct radv_device *device)
{
   struct radv_trap_handler_shader *trap;

   trap = malloc(sizeof(struct radv_trap_handler_shader));
   if (!trap)
      return NULL;

   FILE *f = fopen("/home/deck/radbg-poc/asmc.hex", "rb");
   fseek(f, 0, SEEK_END);
   long asmc_len = ftell(f);
   fseek(f, 0, SEEK_SET);

   char *asmc = malloc(asmc_len);
   fread(asmc, asmc_len, 1, f);
   fclose(f);

   trap->alloc = radv_alloc_shader_memory(device, asmc_len, NULL);

   trap->bo = trap->alloc->arena->bo;
   char *dest_ptr = trap->alloc->arena->ptr + trap->alloc->offset;
   
   memcpy(dest_ptr, asmc, asmc_len);

   return trap;
}

Although there are a number of ways to invoke the trap handler, the easiest is to have the s_trap # instruction. The immediate value is the trap ID, which must be non-zero. We can modify the instruction selection in aco to have s_trap 2 emitted into all shaders:

  
void
select_program(Program* program, unsigned shader_count, struct nir_shader* const* shaders,
               ac_shader_config* config, const struct aco_compiler_options* options,
               const struct aco_shader_info* info,
               const struct radv_shader_args* args)
{
   isel_context ctx = setup_isel_context(program, shader_count, shaders, config, options, info, args, false);
   if_context ic_merged_wave_info;
   bool ngg_gs = ctx.stage.hw == HWStage::NGG && ctx.stage.has(SWStage::GS);

   for (unsigned i = 0; i < shader_count; i++) {
      nir_shader* nir = shaders[i];
      init_context(&ctx, nir);

      setup_fp_mode(&ctx, nir);

      if (!i) {
         /* needs to be after init_context() for FS */
         Pseudo_instruction* startpgm = add_startpgm(&ctx);
         append_logical_start(ctx.block);

         if (unlikely(ctx.options->has_ls_vgpr_init_bug && ctx.stage == vertex_tess_control_hs))
            fix_ls_vgpr_init_bug(&ctx, startpgm);

         split_arguments(&ctx, startpgm);

         if (!info->vs.has_prolog &&
             (program->stage.has(SWStage::VS) || program->stage.has(SWStage::TES))) {
            Builder(ctx.program, ctx.block).sopp(aco_opcode::s_setprio, -1u, 0x3u);
         }
      }
+     {
+        Builder bld(ctx.program, ctx.block);
+        bld.sopp(aco_opcode::s_trap, -1u, 2);
+     }

After running with the app with a rebuilt radv, success - we can toggle an FPS drop in our test app! We can tick off another two steps from the plan.

In the next part we will make the trap handler talk to the host, and make the host talk to the trap handler, to give control over shader execution to the debugger.

Footnotes and glossary

Trap Base Address - address of the trap handler shader ↩
Trap Memory Address - an address available to the trap handler shader, for temporary storage or other use ↩
the host writes to ringbuffers for the various command processors to consume ↩
multiple virtual memory tables can be active at the same time on the GPU to achieve process isolation. The KMD assigns VMIDs dynamically to processes to identify which tables they use. There can be at most 15 concurrent processes executing (VMIDs 1-15). VMID0 is reserved for physical addresses. ↩

Making an AMDGPU debugger part III - Trap handler

Intro

Uploading and installing the trap handler

Cutting out the middlemen

Banked registers

Incantations of trap enablement

Invoking the trap handler

Footnotes and glossary

Further Reading

Making an AMDGPU debugger part I - The Plan

Making an AMDGPU debugger part IV - Grand finale

Making an AMDGPU debugger part II - The Devk