Making an AMDGPU debugger part IV - Grand finale

Intro

In the last part we successfully installed the trap handler and forced the wave to trap - now it is time to communicate with the host and wrap this up!

  1. compile a trap handler ✔️
  2. upload the trap handler shader & install it and ask the wave to be launched with ttmps reserved ✔️
  3. invoke one of the methods of triggering the trap handler for a specific wave ✔️
  4. use a piece of memory to communicate to the host (👋 hey host! i am in the trap handler now) - if we want to breakpoint on an instruction instead of just the wave, we can now enable single stepping
  5. park the wave somehow until the host tells us to continue
  6. repeat entering the trap handler and waiting for the host as required

BO-nanza

Our task now is to share information with the host, and for this we will need some memory. We can take advantage of the existing code in radv that allocates a TMA¹ buffer object (BO), but we need this memory to be available to our debugger.

A memory map of a BO cannot be shared with another process, presumably due to the implicit sync behaviour that is attached to BOs. What we have to do instead is share the BO itself, then map it in the second process.

To do this we will need to communicate with the DRM (Direct Rendering Manager) from the debugger as well. For this proof-of-concept, I went with creating a radv winsys object, but creating a Vulkan instance and then using external memory exercises the same paths. We export the BO in radv to associate an fd with it, then write the fd identifier to a file for the debugger to read.

result = ws->buffer_create(ws, TMA_BO_SIZE, 256, RADEON_DOMAIN_VRAM,
                           RADEON_FLAG_CPU_ACCESS | RADEON_FLAG_ZERO_VRAM | RADEON_FLAG_32BIT,
                           RADV_BO_PRIORITY_SCRATCH, 0, &device->tma_bo);
if (result != VK_SUCCESS)
  return false;

int fd = -1;
// retrieve an fd for this BO (this means exporting)
result = ws->buffer_get_fd(ws, device->tma_bo, &fd);
if (!result){
  fprintf(stderr, "could not get fd %d\n", fd);
  return false;
}
// write the fd to a file for the debugger to pick up
FILE* f = fopen("/home/deck/tma_fd", "wb");
assert(f);
fwrite(&fd, sizeof(fd), 1, f);
fclose(f);

In the debugger, we read the fd identifier from the file, and then use the pidfd API to retrieve the fd from the inferior process. To achieve this, we first open an fd in our process that represents the inferior, then use pidfd_getfd to duplicate the fd from the inferior into our process. With the duplicated fd, we can import the BO from the DRM and request a CPU mapping of it, so that we can freely read and write its contents.

// retrieve the pidfd of the inferior identified by its pid
int inferior_pidfd = syscall(SYS_pidfd_open, inferior_pid, 0);
if(inferior_pidfd < 0){
    return -1;
}

// open the file and read the fd
FILE *f = fopen("/home/deck/tma_fd", "rb");
assert(f);
int tgt_fd = 0;
fread(&tgt_fd, sizeof(int), 1, f);
fclose(f);

// import the fd into our process using pidfd_getfd, 
// which will be identified by our_fd in this process
auto our_fd = syscall(SYS_pidfd_getfd, inferior_pidfd, tgt_fd, 0);
if(our_fd < 0){
    return -1;
}

// import the BO into this process using the fd
uint64_t alloc_size;
auto result = ws->buffer_from_fd(ws, our_fd, RADV_BO_PRIORITY_SCRATCH, &out_bo, &alloc_size);
if (result != 0){
    return -1;
}

// get CPU map of the BO
volatile uint32_t* base_ptr = (volatile uint32_t*) ws->buffer_map(out_bo);

With memory shared between us and the inferior, we can now attempt more complex trap manipulation, so it is time to learn more about trap handlers.

How to write a trap handler

Trap handlers are quite simple; the important bit to note is that the hardware prefills some ttmp registers for us on entry. For our current purposes, we only care about ttmp[0:1], which contains the trap invocation information. On gfx10.3 this is {1'h0, PCRewind[5:0], HT[0], TrapID[7:0], PC[47:0]}² (decoded field by field in the sketch after the list):

  • PCRewind is used to offset the faulting PC
  • HT is a bit indicating that the trap is host-initiated
  • TrapID is a 1 byte identifier, for example the value passed to s_trap if that was the invocation method
  • PC is the saved faulting PC
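
To make the layout concrete, here is a minimal host-side sketch (a hypothetical decode_ttmp01 helper, not part of radv or umr) that unpacks these fields from the two ttmp registers:

#include <cstdint>

struct trap_info {
  uint64_t pc;        // saved PC (48 bits)
  uint8_t  trap_id;   // 0 for hardware/single-step traps, otherwise the s_trap immediate
  bool     host_trap; // HT bit: the trap was host-initiated
  uint8_t  pc_rewind; // amount to rewind the saved PC by
};

// decode the gfx10.3 layout {1'h0, PCRewind[5:0], HT[0], TrapID[7:0], PC[47:0]}
static trap_info decode_ttmp01(uint32_t ttmp0, uint32_t ttmp1) {
  trap_info info;
  info.pc        = ((uint64_t)(ttmp1 & 0xffff) << 32) | ttmp0; // PC[47:0]
  info.trap_id   = (ttmp1 >> 16) & 0xff;                       // TrapID[7:0]
  info.host_trap = ((ttmp1 >> 24) & 0x1) != 0;                 // HT[0]
  info.pc_rewind = (ttmp1 >> 25) & 0x3f;                       // PCRewind[5:0]
  return info;
}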

We also want to make sure we don’t alter the state of the wave inadvertently (the Prime Directive). To this end, we save any state we might modify at the beginning of the handler (e.g. the exec mask), and restore it before returning³.

Once we finish with the trap handler, we need to resume execution. There are two cases:

  • if we entered the trap handler from an s_trap instruction, then the saved PC will be of the trap instruction itself
  • otherwise, the saved PC will be the next instruction to execute

Therefore, in the first case we advance the PC by 4 (the size of an s_trap instruction) before returning.

.equ PC_HI_TRAP_ID_MASK, 0x00FF0000
.equ PC_HI_TRAP_ID_SHIFT, 16

; save the STATUS word into ttmp8
s_getreg_b32 ttmp8, hwreg(HW_REG_STATUS)
; save exec into ttmp[2:3]
s_mov_b64 ttmp[2:3], exec

; ...
; body of trap handler
;                  ...

; extract the trap ID from ttmp1
s_and_b32 ttmp9, ttmp1, PC_HI_TRAP_ID_MASK
s_lshr_b32 ttmp9, ttmp9, PC_HI_TRAP_ID_SHIFT
; if the trapID == 0, then this is a hardware trap,
; we don't need to fix up the return address
s_cmpk_eq_u32 ttmp9, 0
s_cbranch_scc1 RETURN_FROM_NON_S_TRAP

; this was an s_trap: advance the saved PC past it
; add 4 to the saved address, with carry into the high half
s_add_u32 ttmp0, ttmp0, 4
s_addc_u32 ttmp1, ttmp1, 0

RETURN_FROM_NON_S_TRAP:
; mask off non-address high bits from ttmp1
s_and_b32 ttmp1, ttmp1, 0xffff

; restore exec
s_mov_b64 exec, ttmp[2:3]

; restore STATUS.EXECZ, not writable by s_setreg_b32
s_and_b64 exec, exec, exec
; restore STATUS.VCCZ, not writable by s_setreg_b32
s_and_b64 vcc, vcc, vcc
; restore STATUS.SCC
s_setreg_b32 hwreg(HW_REG_STATUS, 0, 1), ttmp8

; return from trap handler and restore STATUS.PRIV
s_rfe_b64 [ttmp0, ttmp1]

With the skeleton of the trap handler done, let’s write the body to give control to the host.

Host control

For the host to take control, we need to be able to stop the wave, let the host know that the wave is stopped, then resume after the host releases the wave.

To stop a wave, we implement a spin loop on the GPU. The host can then write to the memory the GPU is spinning on to release the wave:

SPIN:
; issue loading of value into v1
global_load_dword v1, v[4:5], off offset:4 glc slc dlc
; wait until load has finished
s_waitcnt vmcnt(0)
; read v1[first] into ttmp13
v_readfirstlane_b32 ttmp13, v1
; set SCC = ttmp13 != 0
s_and_b32 ttmp13, ttmp13, ttmp13
; jump to SPIN if SCC == 0
s_cbranch_scc0 SPIN

Unfortunately this doesn’t work: we get stuck and the GPU hangs. It seems the value written from the device never becomes visible on the host (and similarly, in the reverse scenario, host writes get trapped somewhere). Even though we used glc slc dlc (all the cache coherency bits), that is not sufficient to get the value to the host mid-submission. Fortunately, @Bas had the solution: make the TMA BO uncached by adding RADEON_FLAG_VA_UNCACHED to the buffer flags on creation, which bypasses any cache that could hold our writes hostage.
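
For reference, this is the only change needed to the earlier TMA allocation; a sketch of the modified call:

result = ws->buffer_create(ws, TMA_BO_SIZE, 256, RADEON_DOMAIN_VRAM,
                           RADEON_FLAG_CPU_ACCESS | RADEON_FLAG_ZERO_VRAM |
                           RADEON_FLAG_32BIT | RADEON_FLAG_VA_UNCACHED, /* bypass caches */
                           RADV_BO_PRIORITY_SCRATCH, 0, &device->tma_bo);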

Curiously, destroying the imported BO in the debugger process also made host writes propagate. This indicates that explicit flushing might be possible instead of uncached memory (since uncached memory is very slow to access).

Communication now works, but currently all waves enter the trap handler and start talking to the host. If we want meaningful output (at least in this first proof-of-concept), it would be better to single out one wave. Furthermore, it would be nice if we could trap the same logical wave (i.e. the one running the same fragment) every frame. Let’s see how we can selectively trap waves.

Wave filtering

To provide a graphics debugging experience instead of a compute one, we need to be able to map waves to API concepts. We want the waves to enter the trap handler on start, and evaluate a filter expression. If the filter fails, we return and continue on as normal. If the filter passes, it signals that we are interested in the wave and we let the host know. These filter expressions should be mappable to a graphics concept, such as a specific fragment/pixel or vertex.

Currently, the FS waves are filtered only on a hardcoded gl_FragCoord.xy (we don’t need more for a single triangle!). To force the shader compiler to emit VGPRs containing gl_FragCoord.xy, a store is added to the fragment shader behind a branch that is never hit. This ensures that v[2:3] will contain gl_FragCoord.xy for this program.

In the future, the compiler could be made to emit gl_FragCoord.xy unconditionally, without needing a source-level workaround.

#version 450
#pragma shader_stage(fragment)

layout (location = 0) in vec3 inColor;
layout (location = 0) out vec4 outFragColor;

layout (push_constant) uniform PushConstants {
    uint always_zero;
};

layout(binding = 0) buffer dummybuf {
  vec2 dummy[];
};

void main() {
  if(always_zero > 0){
    dummy[0] = gl_FragCoord.xy; // force emission of gl_FragCoord.xy
  }
  outFragColor = vec4(inColor, 1.0);
}

Once we have this fragment shader, we can add our filter expression to the trap handler:

v_cmpx_eq_f32 0x42970000, v2 ; disable threads where gl_FragCoord.x != 75.5f
v_cmpx_eq_f32 0x42970000, v3 ; disable threads where gl_FragCoord.y != 75.5f
s_cbranch_execz RET_FROM_TRAP_S ; return if no threads are of interest
; only 1 thread active here with gl_FragCoord.xy == vec2(75.5f)
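
As a sanity check, the immediate 0x42970000 in the filter is simply the IEEE-754 bit pattern of 75.5f; a quick standalone check (nothing radv-specific here):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  float coord = 75.5f;
  uint32_t bits;
  memcpy(&bits, &coord, sizeof(bits)); // type-pun via memcpy
  printf("0x%08x\n", bits);            // prints 0x42970000
  return 0;
}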

If the filter passes, the trap handler copies the HW_ID1 register into the TMA buffer to let the host know the identity of this wave. The wave is then parked to await the host’s intervention.

Single-step and wave trace

The first features I implemented are single-stepping and wave tracing.

On the host, we wait for the wave to be caught in the wait loop. We use umr to stop the ring, so that the wave state does not change while we read the contents. We already know which wave we are debugging from the HW_ID1 we received earlier, since it contains the wave’s location (se.sa.wgp.simd.wave_id), which lets us read that specific wave slot.

int umr_scan_wave_slot(struct umr_asic *asic, uint32_t se, uint32_t sh, uint32_t cu,
                       uint32_t simd, uint32_t wave, struct umr_wave_data *pwd);

...


// wait until wave has passed filter
while(base_ptr[10] == 0) {
    usleep(10);
}

// read HW_ID1
auto value = base_ptr[10];
// decompose HW_ID1 using umr's register tables
auto reg = umr_find_reg_data(asic, "ixSQ_WAVE_HW_ID1");
auto wave_id = umr_bitslice_reg(asic, reg, "WAVE_ID", value);
auto simd_id = umr_bitslice_reg(asic, reg, "SIMD_ID", value);
auto wgp_id  = umr_bitslice_reg(asic, reg, "WGP_ID", value);
auto sa_id   = umr_bitslice_reg(asic, reg, "SA_ID", value);
auto se_id   = umr_bitslice_reg(asic, reg, "SE_ID", value);

umr_wave_data wdt = {};

// scan wave slot
int result = umr_scan_wave_slot(asic, se_id, sa_id, wgp_id, simd_id, wave_id, &wdt);

// wdt is now filled with wave registers

After this, we can read the GPRs and state registers of the wave from wdt.
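
For example, a rough sketch of pulling a few interesting things out of wdt (the ttmp offset within the SGPR file and the VGPR indexing follow the snippets later in this post, and may differ between umr versions; as the next section explains, the GPRs only read back reliably once the wave waits "busier"):

// ttmp0..ttmp15 live at offset 0x6C in the scanned SGPR file
unsigned ttmp[16];
memcpy(ttmp, &wdt.sgprs[0x6C], sizeof(unsigned) * 16);
printf("trap ID  = %u\n", (ttmp[1] >> 16) & 0xff);
printf("saved PC = 0x%llx\n",
       ((unsigned long long)(ttmp[1] & 0xffff) << 32) | ttmp[0]);

// dump v2/v3 (gl_FragCoord.xy in our shader) for every lane
for (int thread = 0; thread < wdt.num_threads; ++thread) {
  float x = *(float*)&wdt.vgprs[thread * 256 + 2];
  float y = *(float*)&wdt.vgprs[thread * 256 + 3];
  printf("lane %2d: gl_FragCoord.xy = (%f, %f)\n", thread, x, y);
}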

Busier waiting

We run into an issue here: the GPRs and state registers cannot be read reliably. It turns out that our spin loop uses s_waitcnt, which makes the wave inactive on the SIMD unit, and that makes the reads unreliable. To fix this, we need to wait busier. The busier wait forgoes s_waitcnt in favour of reading the VMCNT value (the count of outstanding vector memory reads) directly:

SPIN:
; issue loading of value into v1
buffer_load_dword v1, off, ttmp[4:7], null offset:4 glc slc dlc
SPIN1:
; retrieve IB_STS.VM_CNT into ttmp13
s_getreg_b32 ttmp13, hwreg(HW_REG_IB_STS, 0, 3);
; set SCC = ttmp13 != 0
s_and_b32 ttmp13, ttmp13, ttmp13
; spin until IB_STS.VM_CNT becomes 0
s_cbranch_scc1 SPIN1

; read v1[first] into ttmp13
v_readfirstlane_b32 ttmp13, v1
; set SCC = ttmp13 != 0
s_and_b32 ttmp13, ttmp13, ttmp13
; jump to SPIN if SCC == 0
s_cbranch_scc0 SPIN

Shader disassembly

Since we know that ttmp[0:1] contains the faulting PC, we can use this to disassemble the shader on the host using umr’s disassembly function:

unsigned ttmp[16];
// read ttmp registers from the SGPR register file
memcpy(ttmp, &wdt.sgprs[0x6C], sizeof(unsigned) * 16);
unsigned fault_lo = ttmp[0];
unsigned fault_hi = ttmp[1] & 0xffff;
// assemble faulting PC based on the contents of ttmp[0:1]
uint64_t fault_pc = (((uint64_t)fault_hi << 32) | fault_lo);

// for now disassemble 2 opcodes, otherwise we can get incomplete output
const unsigned shader_size = 4 * 2;

// if we are stopped by s_trap, then we need to step 
// the PC to disassemble the next instruction
if(wp){ 
  umr_vm_disasm(asic, -1, wd->ws.hw_id2.vm_id, fault_pc + 4, 
                fault_pc + 4, shader_size, 0, NULL);
} else {
  umr_vm_disasm(asic, -1, wd->ws.hw_id2.vm_id, fault_pc, fault_pc, 
                shader_size, 0, NULL);
}

Stepping

We have now read all the state of the wave we are interested in, and need to advance to the next instruction. We release the wave from the busy wait by writing to the TMA buffer, and then turn on single stepping by flipping a bit in the MODE register:

; enable single stepping: write 1 to the 1-bit field at bit 11 of MODE (DEBUG_EN)
s_setreg_imm32_b32 hwreg(HW_REG_MODE, 11, 1), 1

Once we have read all the state we want, we can resume the ring via umr and let the shader step to the next instruction. After each instruction, the wave invokes the trap handler with a trap ID of 0. By repeatedly releasing the wave from the busy wait, we can step through the shader:

...
PC=0x80000000017c FAULT_PC=0x800000004f40 m0=0

 *  pgm[2@0x800000004f40 + 0x0   ] = 0x5e020702         v_cvt_pkrtz_f16_f32_e32 v1, v2, v3
    pgm[2@0x800000004f40 + 0x4   ] = 0xd52f0000 ...                                       
End of disassembly.
-----------------------------

PC=0x80000000017c FAULT_PC=0x800000004f44 m0=0

 *  pgm[2@0x800000004f44 + 0x0   ] = 0xd52f0000         v_cvt_pkrtz_f16_f32_e64 v0, v0, 1.0
    pgm[2@0x800000004f44 + 0x4   ] = 0x0001e500 ;;                                         
End of disassembly.
-----------------------------

PC=0x80000000017c FAULT_PC=0x800000004f4c m0=0

 *  pgm[2@0x800000004f4c + 0x0   ] = 0xf8001c0f         exp mrt0 v1, v1, v0, v0 done compr vm
    pgm[2@0x800000004f4c + 0x4   ] = 0x80800001 ;;                                           
End of disassembly.
-----------------------------

PC=0x80000000017c FAULT_PC=0x800000004f54 m0=0

 *  pgm[2@0x800000004f54 + 0x0   ] = 0xbf810000         s_endpgm                             
    pgm[2@0x800000004f54 + 0x4   ] = 0xbf9f0000         s_code_end                           
End of disassembly.
-----------------------------

Stepping through the shader and disassembling the faulting instruction gives a trace.
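
Putting the host side together, the stepping loop looks roughly like the sketch below. It assumes the TMA layout of this proof-of-concept (the wave reports in via base_ptr[10] and spins on the dword at byte offset 4, i.e. base_ptr[1], matching the spin loops above); stop_ring() and resume_ring() are hypothetical stand-ins for the umr calls that halt and resume the ring.

for (int step = 0; step < max_steps; ++step) {
  // wait for the wave to enter the trap handler and park itself
  // (assumes the handler signals base_ptr[10] on every entry)
  while (base_ptr[10] == 0)
    usleep(10);

  stop_ring(asic); // hypothetical: freeze wave state while we read it

  umr_wave_data wdt = {};
  umr_scan_wave_slot(asic, se_id, sa_id, wgp_id, simd_id, wave_id, &wdt);
  // ... disassemble at the saved PC and dump GPRs here ...

  resume_ring(asic); // hypothetical: let the GPU run again

  base_ptr[10] = 0;  // re-arm the "wave is parked" flag
  base_ptr[1]  = 1;  // release the busy wait; the wave executes one instruction
                     // and re-enters the trap handler with trap ID 0
}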

Write access to the wave

Yes, we have read all the state we wanted, but what about writing?

To be fair, this was more of a stretch goal. Since we control the shader, we could make the trap handler read and write registers into the TMA buffer. umr has functions for reading SGPRs and VGPRs, but not for writing them. Let’s see what these do under the hood: ultimately we end up calling into the KMD.

static void wave_read_regs(struct amdgpu_device *adev, uint32_t wave,
			   uint32_t thread, uint32_t regno,
			   uint32_t num, uint32_t *out)
{
	WREG32_SOC15(GC, 0, mmSQ_IND_INDEX,
		(wave << SQ_IND_INDEX__WAVE_ID__SHIFT) |
		(regno << SQ_IND_INDEX__INDEX__SHIFT) |
		(thread << SQ_IND_INDEX__WORKITEM_ID__SHIFT) |
		(SQ_IND_INDEX__AUTO_INCR_MASK));
	while (num--)
		*(out++) = RREG32_SOC15(GC, 0, mmSQ_IND_DATA);
}

We are selecting a single item of a register within the bank by writing a combination of wave | register_no | thread to mmSQ_IND_INDEX, and then reading the scalar from mmSQ_IND_DATA. There is also a bit of “hardware sugar” here (the auto-increment bit) that lets multiple values be read with a single mmSQ_IND_INDEX write. Unfortunately, there is no wave_write_regs anywhere to be found but…

… what if we just write to mmSQ_IND_DATA instead of reading?

Yep, that just works⁴:

#define VGPR_OFFSET 0x400

int write_vgpr(unsigned se, unsigned sa, unsigned wgp, unsigned simd, unsigned wave, 
               unsigned vgpr, unsigned thread, unsigned value){
    struct umr_reg *ind_index, *ind_data;
    uint32_t data;

    // GPR R/W uses GRBM banking
    asic->options.use_bank           = 1;
    asic->options.bank.grbm.se       = se;
    asic->options.bank.grbm.sh       = sa;
    asic->options.bank.grbm.instance = (wgp << 2) | simd;

    ind_index = umr_find_reg_data(asic, "mmSQ_IND_INDEX");
    ind_data  = umr_find_reg_data(asic, "mmSQ_IND_DATA");

    if (ind_index && ind_data) {
        data = umr_bitslice_compose_value(asic, ind_index, "WAVE_ID", wave);
        data |= umr_bitslice_compose_value(asic, ind_index, "INDEX", VGPR_OFFSET + vgpr);
        data |= umr_bitslice_compose_value(asic, ind_index, "WORKITEM_ID", thread);
        umr_write_reg(asic, ind_index->addr * 4, data, REG_MMIO);
        umr_write_reg(asic, ind_data->addr * 4, value, REG_MMIO);
    } else {
        asic->err_msg("BUG]: Cannot find SQ_IND_{INDEX,DATA} registers\n", 
                      asic->asicname);
        return -1;
    }
    return 0;
}

Armed with this function, we can manipulate the shader state directly. Our fragment shader looks like the following:

    pgm[7@0x800000004f00 + 0x0   ] 		s_trap 2                                       
    pgm[7@0x800000004f00 + 0x4   ] 		s_cmp_lt_u32 0, s3                             
    pgm[7@0x800000004f00 + 0x8   ] 		s_cbranch_scc0 6                               
    pgm[7@0x800000004f00 + 0xc   ] 		s_movk_i32 s3, 0x8000                          
    pgm[7@0x800000004f00 + 0x10  ] 		s_load_dwordx4 s[0:3], s[2:3], 0x0             
    pgm[7@0x800000004f00 + 0x14  ] 	;;                                                 
    pgm[7@0x800000004f00 + 0x18  ] 		s_waitcnt lgkmcnt(0)                           
    pgm[7@0x800000004f00 + 0x1c  ] 		buffer_store_dwordx2 v[2:3], off, s[0:3], 0 glc
    pgm[7@0x800000004f00 + 0x20  ] 	;;                                                 
    pgm[7@0x800000004f00 + 0x24  ] 		s_mov_b32 m0, s4                               
    pgm[7@0x800000004f00 + 0x28  ] 		v_interp_p1_f32_e32 v2, v0, attr0.x            
    pgm[7@0x800000004f00 + 0x2c  ] 		v_interp_p2_f32_e32 v2, v1, attr0.x            
    pgm[7@0x800000004f00 + 0x30  ] 		v_interp_p1_f32_e32 v3, v0, attr0.y            
    pgm[7@0x800000004f00 + 0x34  ] 		v_interp_p2_f32_e32 v3, v1, attr0.y            
    pgm[7@0x800000004f00 + 0x38  ] 		v_interp_p1_f32_e32 v0, v0, attr0.z            
    pgm[7@0x800000004f00 + 0x3c  ] 		v_interp_p2_f32_e32 v0, v1, attr0.z            
--> pgm[7@0x800000004f00 + 0x40  ] 		v_cvt_pkrtz_f16_f32_e32 v1, v2, v3             
    pgm[7@0x800000004f00 + 0x44  ] 		v_cvt_pkrtz_f16_f32_e64 v0, v0, 1.0            
    pgm[7@0x800000004f00 + 0x48  ] 	;;                                                 
    pgm[7@0x800000004f00 + 0x4c  ] 		exp mrt0 v1, v1, v0, v0 done compr vm          
    pgm[7@0x800000004f00 + 0x50  ] 	;;                                                 
    pgm[7@0x800000004f00 + 0x54  ] 		s_endpgm   

We set a breakpoint in the shader, on the instruction marked with an arrow, and when that is hit, we shuffle the VGPRs responsible for the colour output:

if(fault_pc == 0x800000004f40){ // before packing, rotate the colour VGPRs (v0, v2, v3) across threads
  for (int thread = 0; thread < wdt.num_threads; ++thread) {
      auto prev_thread = (thread + wdt.num_threads - vgpr_rotate_counter) % wdt.num_threads;

      float fv = *(float*)&wdt.vgprs[prev_thread * 256 + 0];
      write_vgpr(se, sa, wgp, simd, wave, 0, thread, *(unsigned*)&fv);
      fv = *(float*)&wdt.vgprs[prev_thread * 256 + 2];
      write_vgpr(se, sa, wgp, simd, wave, 2, thread, *(unsigned*)&fv);
      fv = *(float*)&wdt.vgprs[prev_thread * 256 + 3];
      write_vgpr(se, sa, wgp, simd, wave, 3, thread, *(unsigned*)&fv);
  }

  vgpr_rotate_counter = (vgpr_rotate_counter + 1) % 64;
}

And tada! We get our demo:

While voted as the most boring video in existence, this demo nicely confirms things are working.

As an aside, having write access to the wave state makes the trap handler shader simpler, including allowing us to write the following zero GPR busy wait:

; set SCC to 0
s_setreg_imm32_b32 hwreg(HW_REG_STATUS, 0, 1), 0
SPINNING:
s_cbranch_scc0 SPINNING

Which is of dubious use, but looks interesting nonetheless.

Future uses

The above two features were the ones that made the cut for this post series, but it is not that difficult to imagine more:

Memory violation debugging
No wave filtering required. Once a memory violation is detected, the wave can be stopped and examined from the debugger. Memory violations are generally not precise (the faulting PC is not the one issuing the access), but with extra work they can be made so. It might even be possible to stop the context loss from occurring.
Data breakpoints
Just as on the CPU, it seems like some registers can be programmed to trap when interesting addresses are accessed.
Fragment debugging
Want to examine a variable written in the FS visually? Just swap out the color VGPRs before they get written!
Conditional breakpoints
Shader or host can evaluate arbitrary (even memory dependent) expressions to trigger breakpoints.
True compute debugging
Shared memory and non-uniform behaviour are all accessible.
Assertions in shader code
Can trap when hit with a debugger present.
Source level debugging
Needs a significant amount of plumbing for sure, but the GLSL -> SPIR-V side is done. addr2line support in NIR is being worked on.
Arbitrary code execution
Additional code could be compiled by the debugger, then called by the shader.
Debugging barrier hangs
Waves can be stopped when stuck in barriers, and released to prevent context loss, while diagnosing issues.
Multi-wave debugging
More waves can be trapped and worked on at the same time.
Time travel debugging
It is possible to record the entire state evolution of a wave, which afterwards the host can replay, freely stepping forwards and backwards.
GPU exceptions

(meme: "Finally, exceptions on the GPU!")

Put away the pitchforks, it is a joke, but this could indeed be used to implement setjmp/longjmp on the GPU, should someone desire that.

Conclusion

I hope you enjoyed this trip into the heart of the GPU as much as I did! I learned a lot about AMD GPUs and the Linux graphics stack, and I hope I managed to pass some of that on to you as well.

I think there is great potential, but most of the work remains ahead. I am hoping something small can be built using these pieces soon, so that at least some features become available to developers.

I would like to thank @nanokatze for sparking this journey and Samuel Pitoiset for his work on trap handlers in radv/amdgpu. Many thanks to @ishitatsuyuki, @pixelcluster, @An0num0us, @Jaker, @clepirelli of the GP discord server and @Bas, @DadSchoorse and @Plagman of the LGD discord server for helping me along the way!

Blooper reel

  • stray add-with-carry randomly corrupting the PC because the carry bit is the same as the scalar condition code (STATUS.SCC)
  • setting address VGPRs with narrower exec than the load itself, getting a memory violation
  • forgetting to wait for the atomic pre-op value

Footnotes and glossary

  1. Trap Memory Address - an address available to the trap handler shader, for temporary storage or other use 

  2. RDNA 2 ISA manual 

  3. a good reference for writing trap handlers is the compute wave save/restore code. 

  4. Astute readers will note how unsafe it is that we split setting mmSQ_IND_INDEX and writing mmSQ_IND_DATA into separate calls - in practice these registers are not used outside of debugging. Alternatively, the writes could be checked, or in the long term, this function could be moved into the kernel. 

This post is licensed under CC BY 4.0 by the author.