Debugging a hard fault in ARM Cortex-M4












2














I am at my wit's end trying to debug a hard fault on an EFR32BG12 processor. I've been following the instructions in the Silicon Labs knowledge base here:



https://www.silabs.com/community/mcu/32-bit/knowledge-base.entry.html/2014/05/26/debug_a_hardfault-78gc



I've also been using the Keil app note here to fill in some details:



http://www.keil.com/appnotes/files/apnt209.pdf



I've managed to get the hard fault to occur quite consistently in one place. When the hard fault occurs, the code from the knowledge base article gives me the following values (pushed onto the stack by the processor before calling the hard fault handler):



Name     Type        Value               Location
~~~~ ~~~~ ~~~~~ ~~~~~~~~
cfsr uint32_t 0x20000 (Hex) 0x2000078c
hfsr uint32_t 0x40000000 (Hex) 0x20000788
mmfar uint32_t 0xe000ed34 (Hex) 0x20000784
bfar uint32_t 0xe000ed38 (Hex) 0x20000780
r0 uint32_t 0x0 (Hex) 0x2000077c
r1 uint32_t 0x8 (Hex) 0x20000778
r2 uint32_t 0x0 (Hex) 0x20000774
r3 uint32_t 0x0 (Hex) 0x20000770
r12 uint32_t 0x1 (Hex) 0x2000076c
lr uint32_t 0xab61 (Hex) 0x20000768
pc uint32_t 0x38dc8 (Hex) 0x20000764
psr uint32_t 0x0 (Hex) 0x20000760


Looking at the Keil app note, I believe a CFSR value of 0x20000 indicates a Usage Fault with the INVSTATE bit set, i.e.:




INVSTATE: Invalid state: 0 = no invalid state 1 = the processor has
attempted to execute an instruction that makes illegal use of the
Execution Program Status Register (EPSR). When this bit is set, the PC
value stacked for the exception return points to the instruction that
attempted the illegal use of the EPSR. Potential reasons: a) Loading a
branch target address to PC with LSB=0. b) Stacked PSR corrupted
during exception or interrupt handling. c) Vector table contains a
vector address with LSB=0.




The PC value pushed onto the stack by the exception (provided by the code from the knowledge base article) seems to be 0x38dc8. If I go to this address in the Simplicity Studio "Disassembly" window, I see the following:



00038db8:   str     r5,[r5,#0x14]
00038dba: str r0,[r7,r1]
00038dbc: str r4,[r5,#0x14]
00038dbe: ldr r4,[pc,#0x1e4] ; 0x38fa0
00038dc0: strb r1,[r4,#0x11]
00038dc2: ldr r5,[r4,#0x64]
00038dc4: ldrb r3,[r4,#0x5]
00038dc6: movs r3,r6
00038dc8: strb r1,[r4,#0x15]
00038dca: ldr r4,[r4,#0x14]
00038dcc: cmp r7,#0x6f
00038dce: cmp r6,#0x30
00038dd0: str r7,[r6,#0x14]
00038dd2: lsls r6,r6,#1
00038dd4: movs r5,r0
00038dd6: movs r0,r0


The address appears to be well past the end of my code. If I look at the same address in the "Memory" window, this is what I see:



0x00038DC8  69647561 2E302F6F 00766177 00000005  audio/0.wav.....
0x00038DD8 00000000 000F4240 00000105 00000000 ....@B..........
0x00038DE8 00000000 00000000 00000005 00000000 ................
0x00038DF8 0001C200 00000500 00001000 00000000 .Â..............
0x00038E08 00000000 F00000F0 02F00001 0003F000 ....ð..ð..ð..ð..
0x00038E18 F00004F0 06010005 01020101 01011201 ð..ð............
0x00038E28 35010121 01010D01 6C363025 2E6E6775 !..5....%06lugn.
0x00038E38 00746164 00000001 000008D0 00038400 dat.....Ð.......


Curiously, "audio/0.wav" is a static string which is part of the firmware. If I understand correctly, what I've learned here is that PC somehow gets set to this point in memory, which of course is not a valid instruction and causes the hard fault.



To debug the issue, I need to know how PC came to be set to this incorrect value. I believe the LR register should give me an idea. The LR register pushed onto the stack by the exception seems to be 0xab61. If I look at this location, I see the following in the Disassembly window:



1270            dp->sect = clst2sect(fs, clst);
0000ab58: ldr r0,[r7,#0x10]
0000ab5a: ldr r1,[r7,#0x14]
0000ab5c: bl 0x00009904
0000ab60: mov r2,r0
0000ab62: ldr r3,[r7,#0x4]
0000ab64: str r2,[r3,#0x18]


It looks to me like the problem occurs during this call specifically:



0000ab5c:   bl      0x00009904


This makes me think that the problem occurs as a result of a corrupt stack, which causes clst2sect to return to an invalid part of memory rather than to 0xab60. The code for clst2sect is pretty innocuous:



/*-----------------------------------------------------------------------*/
/* Get physical sector number from cluster number */
/*-----------------------------------------------------------------------*/

DWORD clst2sect ( /* !=0:Sector number, 0:Failed (invalid cluster#) */
FATFS* fs, /* Filesystem object */
DWORD clst /* Cluster# to be converted */
)
{
clst -= 2; /* Cluster number is origin from 2 */
if (clst >= fs->n_fatent - 2) return 0; /* Is it invalid cluster number? */
return fs->database + fs->csize * clst; /* Start sector number of the cluster */
}


Does this analysis sound about right?



I suppose the problem I've run into is that I have no idea what might cause this kind of behaviour... I've tried putting breakpoints in all of my interrupt handlers, to see if one of them might be corrupting the stack, but there doesn't seem to be any pattern--sometimes, no interrupt handler is called but the problem still occurs.



In that case, though, it's hard for me to see how a program might try to execute code at a location well past the actual end of the code... I feel like a function pointer might be a likely candidate, but in that case I would expect to see the problem show up, e.g., where a function pointer is used. However, I don't see any function pointers used near where the error is occurring.



Perhaps there is more information I can extract from the debug information I've given above? The problem is quite reproducible, so if there's something I have not tried, but which you think might give some insight, I would love to hear it.



Thanks for any help you can offer!










share|improve this question






















  • Could you give a bit more context? Is this bare metal, or is there an OS?
    – cooperised
    Nov 12 '18 at 9:19










  • The error is between 0x0000ab5c, and the next BL, somewhere where there is a PC modifying instruction. It seems your stack is OK here, are there any hints above/below this stack location?
    – Sean Houlihane
    Nov 12 '18 at 13:05










  • @cooperised This is bare metal with a simple cooperative scheduler.
    – Michael Cooper
    Nov 13 '18 at 16:13






  • 1




    this return is presumably a pop to the PC? It could be past that, until you find something that modifies the LR, and it could be that return which is taking a corrupted value.
    – Sean Houlihane
    Nov 13 '18 at 16:37






  • 1




    If you've tried doubling the stack size and it hasn't helped then this probably isn't the problem. The most reliable way to check for an overflowing stack is to configure the MPU to make a small region at the bottom of the allocated stack space inaccessible; an overflowing stack will then trigger a data abort exception and you can work back from there. An easier but less reliable method is to try and work out where your stack usage is greatest and put a breakpoint in so you can peek r13.
    – cooperised
    Nov 15 '18 at 12:26
















2














I am at my wit's end trying to debug a hard fault on an EFR32BG12 processor. I've been following the instructions in the Silicon Labs knowledge base here:



https://www.silabs.com/community/mcu/32-bit/knowledge-base.entry.html/2014/05/26/debug_a_hardfault-78gc



I've also been using the Keil app note here to fill in some details:



http://www.keil.com/appnotes/files/apnt209.pdf



I've managed to get the hard fault to occur quite consistently in one place. When the hard fault occurs, the code from the knowledge base article gives me the following values (pushed onto the stack by the processor before calling the hard fault handler):



Name     Type        Value               Location
~~~~ ~~~~ ~~~~~ ~~~~~~~~
cfsr uint32_t 0x20000 (Hex) 0x2000078c
hfsr uint32_t 0x40000000 (Hex) 0x20000788
mmfar uint32_t 0xe000ed34 (Hex) 0x20000784
bfar uint32_t 0xe000ed38 (Hex) 0x20000780
r0 uint32_t 0x0 (Hex) 0x2000077c
r1 uint32_t 0x8 (Hex) 0x20000778
r2 uint32_t 0x0 (Hex) 0x20000774
r3 uint32_t 0x0 (Hex) 0x20000770
r12 uint32_t 0x1 (Hex) 0x2000076c
lr uint32_t 0xab61 (Hex) 0x20000768
pc uint32_t 0x38dc8 (Hex) 0x20000764
psr uint32_t 0x0 (Hex) 0x20000760


Looking at the Keil app note, I believe a CFSR value of 0x20000 indicates a Usage Fault with the INVSTATE bit set, i.e.:




INVSTATE: Invalid state: 0 = no invalid state 1 = the processor has
attempted to execute an instruction that makes illegal use of the
Execution Program Status Register (EPSR). When this bit is set, the PC
value stacked for the exception return points to the instruction that
attempted the illegal use of the EPSR. Potential reasons: a) Loading a
branch target address to PC with LSB=0. b) Stacked PSR corrupted
during exception or interrupt handling. c) Vector table contains a
vector address with LSB=0.




The PC value pushed onto the stack by the exception (provided by the code from the knowledge base article) seems to be 0x38dc8. If I go to this address in the Simplicity Studio "Disassembly" window, I see the following:



00038db8:   str     r5,[r5,#0x14]
00038dba: str r0,[r7,r1]
00038dbc: str r4,[r5,#0x14]
00038dbe: ldr r4,[pc,#0x1e4] ; 0x38fa0
00038dc0: strb r1,[r4,#0x11]
00038dc2: ldr r5,[r4,#0x64]
00038dc4: ldrb r3,[r4,#0x5]
00038dc6: movs r3,r6
00038dc8: strb r1,[r4,#0x15]
00038dca: ldr r4,[r4,#0x14]
00038dcc: cmp r7,#0x6f
00038dce: cmp r6,#0x30
00038dd0: str r7,[r6,#0x14]
00038dd2: lsls r6,r6,#1
00038dd4: movs r5,r0
00038dd6: movs r0,r0


The address appears to be well past the end of my code. If I look at the same address in the "Memory" window, this is what I see:



0x00038DC8  69647561 2E302F6F 00766177 00000005  audio/0.wav.....
0x00038DD8 00000000 000F4240 00000105 00000000 ....@B..........
0x00038DE8 00000000 00000000 00000005 00000000 ................
0x00038DF8 0001C200 00000500 00001000 00000000 .Â..............
0x00038E08 00000000 F00000F0 02F00001 0003F000 ....ð..ð..ð..ð..
0x00038E18 F00004F0 06010005 01020101 01011201 ð..ð............
0x00038E28 35010121 01010D01 6C363025 2E6E6775 !..5....%06lugn.
0x00038E38 00746164 00000001 000008D0 00038400 dat.....Ð.......


Curiously, "audio/0.wav" is a static string which is part of the firmware. If I understand correctly, what I've learned here is that PC somehow gets set to this point in memory, which of course is not a valid instruction and causes the hard fault.



To debug the issue, I need to know how PC came to be set to this incorrect value. I believe the LR register should give me an idea. The LR register pushed onto the stack by the exception seems to be 0xab61. If I look at this location, I see the following in the Disassembly window:



1270            dp->sect = clst2sect(fs, clst);
0000ab58: ldr r0,[r7,#0x10]
0000ab5a: ldr r1,[r7,#0x14]
0000ab5c: bl 0x00009904
0000ab60: mov r2,r0
0000ab62: ldr r3,[r7,#0x4]
0000ab64: str r2,[r3,#0x18]


It looks to me like the problem occurs during this call specifically:



0000ab5c:   bl      0x00009904


This makes me think that the problem occurs as a result of a corrupt stack, which causes clst2sect to return to an invalid part of memory rather than to 0xab60. The code for clst2sect is pretty innocuous:



/*-----------------------------------------------------------------------*/
/* Get physical sector number from cluster number */
/*-----------------------------------------------------------------------*/

DWORD clst2sect ( /* !=0:Sector number, 0:Failed (invalid cluster#) */
FATFS* fs, /* Filesystem object */
DWORD clst /* Cluster# to be converted */
)
{
clst -= 2; /* Cluster number is origin from 2 */
if (clst >= fs->n_fatent - 2) return 0; /* Is it invalid cluster number? */
return fs->database + fs->csize * clst; /* Start sector number of the cluster */
}


Does this analysis sound about right?



I suppose the problem I've run into is that I have no idea what might cause this kind of behaviour... I've tried putting breakpoints in all of my interrupt handlers, to see if one of them might be corrupting the stack, but there doesn't seem to be any pattern--sometimes, no interrupt handler is called but the problem still occurs.



In that case, though, it's hard for me to see how a program might try to execute code at a location well past the actual end of the code... I feel like a function pointer might be a likely candidate, but in that case I would expect to see the problem show up, e.g., where a function pointer is used. However, I don't see any function pointers used near where the error is occurring.



Perhaps there is more information I can extract from the debug information I've given above? The problem is quite reproducible, so if there's something I have not tried, but which you think might give some insight, I would love to hear it.



Thanks for any help you can offer!










share|improve this question






















  • Could you give a bit more context? Is this bare metal, or is there an OS?
    – cooperised
    Nov 12 '18 at 9:19










  • The error is between 0x0000ab5c, and the next BL, somewhere where there is a PC modifying instruction. It seems your stack is OK here, are there any hints above/below this stack location?
    – Sean Houlihane
    Nov 12 '18 at 13:05










  • @cooperised This is bare metal with a simple cooperative scheduler.
    – Michael Cooper
    Nov 13 '18 at 16:13






  • 1




    this return is presumably a pop to the PC? It could be past that, until you find something that modifies the LR, and it could be that return which is taking a corrupted value.
    – Sean Houlihane
    Nov 13 '18 at 16:37






  • 1




    If you've tried doubling the stack size and it hasn't helped then this probably isn't the problem. The most reliable way to check for an overflowing stack is to configure the MPU to make a small region at the bottom of the allocated stack space inaccessible; an overflowing stack will then trigger a data abort exception and you can work back from there. An easier but less reliable method is to try and work out where your stack usage is greatest and put a breakpoint in so you can peek r13.
    – cooperised
    Nov 15 '18 at 12:26














2












2








2







I am at my wit's end trying to debug a hard fault on an EFR32BG12 processor. I've been following the instructions in the Silicon Labs knowledge base here:



https://www.silabs.com/community/mcu/32-bit/knowledge-base.entry.html/2014/05/26/debug_a_hardfault-78gc



I've also been using the Keil app note here to fill in some details:



http://www.keil.com/appnotes/files/apnt209.pdf



I've managed to get the hard fault to occur quite consistently in one place. When the hard fault occurs, the code from the knowledge base article gives me the following values (pushed onto the stack by the processor before calling the hard fault handler):



Name     Type        Value               Location
~~~~ ~~~~ ~~~~~ ~~~~~~~~
cfsr uint32_t 0x20000 (Hex) 0x2000078c
hfsr uint32_t 0x40000000 (Hex) 0x20000788
mmfar uint32_t 0xe000ed34 (Hex) 0x20000784
bfar uint32_t 0xe000ed38 (Hex) 0x20000780
r0 uint32_t 0x0 (Hex) 0x2000077c
r1 uint32_t 0x8 (Hex) 0x20000778
r2 uint32_t 0x0 (Hex) 0x20000774
r3 uint32_t 0x0 (Hex) 0x20000770
r12 uint32_t 0x1 (Hex) 0x2000076c
lr uint32_t 0xab61 (Hex) 0x20000768
pc uint32_t 0x38dc8 (Hex) 0x20000764
psr uint32_t 0x0 (Hex) 0x20000760


Looking at the Keil app note, I believe a CFSR value of 0x20000 indicates a Usage Fault with the INVSTATE bit set, i.e.:




INVSTATE: Invalid state: 0 = no invalid state 1 = the processor has
attempted to execute an instruction that makes illegal use of the
Execution Program Status Register (EPSR). When this bit is set, the PC
value stacked for the exception return points to the instruction that
attempted the illegal use of the EPSR. Potential reasons: a) Loading a
branch target address to PC with LSB=0. b) Stacked PSR corrupted
during exception or interrupt handling. c) Vector table contains a
vector address with LSB=0.




The PC value pushed onto the stack by the exception (provided by the code from the knowledge base article) seems to be 0x38dc8. If I go to this address in the Simplicity Studio "Disassembly" window, I see the following:



00038db8:   str     r5,[r5,#0x14]
00038dba: str r0,[r7,r1]
00038dbc: str r4,[r5,#0x14]
00038dbe: ldr r4,[pc,#0x1e4] ; 0x38fa0
00038dc0: strb r1,[r4,#0x11]
00038dc2: ldr r5,[r4,#0x64]
00038dc4: ldrb r3,[r4,#0x5]
00038dc6: movs r3,r6
00038dc8: strb r1,[r4,#0x15]
00038dca: ldr r4,[r4,#0x14]
00038dcc: cmp r7,#0x6f
00038dce: cmp r6,#0x30
00038dd0: str r7,[r6,#0x14]
00038dd2: lsls r6,r6,#1
00038dd4: movs r5,r0
00038dd6: movs r0,r0


The address appears to be well past the end of my code. If I look at the same address in the "Memory" window, this is what I see:



0x00038DC8  69647561 2E302F6F 00766177 00000005  audio/0.wav.....
0x00038DD8 00000000 000F4240 00000105 00000000 ....@B..........
0x00038DE8 00000000 00000000 00000005 00000000 ................
0x00038DF8 0001C200 00000500 00001000 00000000 .Â..............
0x00038E08 00000000 F00000F0 02F00001 0003F000 ....ð..ð..ð..ð..
0x00038E18 F00004F0 06010005 01020101 01011201 ð..ð............
0x00038E28 35010121 01010D01 6C363025 2E6E6775 !..5....%06lugn.
0x00038E38 00746164 00000001 000008D0 00038400 dat.....Ð.......


Curiously, "audio/0.wav" is a static string which is part of the firmware. If I understand correctly, what I've learned here is that PC somehow gets set to this point in memory, which of course is not a valid instruction and causes the hard fault.



To debug the issue, I need to know how PC came to be set to this incorrect value. I believe the LR register should give me an idea. The LR register pushed onto the stack by the exception seems to be 0xab61. If I look at this location, I see the following in the Disassembly window:



1270            dp->sect = clst2sect(fs, clst);
0000ab58: ldr r0,[r7,#0x10]
0000ab5a: ldr r1,[r7,#0x14]
0000ab5c: bl 0x00009904
0000ab60: mov r2,r0
0000ab62: ldr r3,[r7,#0x4]
0000ab64: str r2,[r3,#0x18]


It looks to me like the problem occurs during this call specifically:



0000ab5c:   bl      0x00009904


This makes me think that the problem occurs as a result of a corrupt stack, which causes clst2sect to return to an invalid part of memory rather than to 0xab60. The code for clst2sect is pretty innocuous:



/*-----------------------------------------------------------------------*/
/* Get physical sector number from cluster number */
/*-----------------------------------------------------------------------*/

DWORD clst2sect ( /* !=0:Sector number, 0:Failed (invalid cluster#) */
FATFS* fs, /* Filesystem object */
DWORD clst /* Cluster# to be converted */
)
{
clst -= 2; /* Cluster number is origin from 2 */
if (clst >= fs->n_fatent - 2) return 0; /* Is it invalid cluster number? */
return fs->database + fs->csize * clst; /* Start sector number of the cluster */
}


Does this analysis sound about right?



I suppose the problem I've run into is that I have no idea what might cause this kind of behaviour... I've tried putting breakpoints in all of my interrupt handlers, to see if one of them might be corrupting the stack, but there doesn't seem to be any pattern--sometimes, no interrupt handler is called but the problem still occurs.



In that case, though, it's hard for me to see how a program might try to execute code at a location well past the actual end of the code... I feel like a function pointer might be a likely candidate, but in that case I would expect to see the problem show up, e.g., where a function pointer is used. However, I don't see any function pointers used near where the error is occurring.



Perhaps there is more information I can extract from the debug information I've given above? The problem is quite reproducible, so if there's something I have not tried, but which you think might give some insight, I would love to hear it.



Thanks for any help you can offer!










share|improve this question













I am at my wit's end trying to debug a hard fault on an EFR32BG12 processor. I've been following the instructions in the Silicon Labs knowledge base here:



https://www.silabs.com/community/mcu/32-bit/knowledge-base.entry.html/2014/05/26/debug_a_hardfault-78gc



I've also been using the Keil app note here to fill in some details:



http://www.keil.com/appnotes/files/apnt209.pdf



I've managed to get the hard fault to occur quite consistently in one place. When the hard fault occurs, the code from the knowledge base article gives me the following values (pushed onto the stack by the processor before calling the hard fault handler):



Name     Type        Value               Location
~~~~ ~~~~ ~~~~~ ~~~~~~~~
cfsr uint32_t 0x20000 (Hex) 0x2000078c
hfsr uint32_t 0x40000000 (Hex) 0x20000788
mmfar uint32_t 0xe000ed34 (Hex) 0x20000784
bfar uint32_t 0xe000ed38 (Hex) 0x20000780
r0 uint32_t 0x0 (Hex) 0x2000077c
r1 uint32_t 0x8 (Hex) 0x20000778
r2 uint32_t 0x0 (Hex) 0x20000774
r3 uint32_t 0x0 (Hex) 0x20000770
r12 uint32_t 0x1 (Hex) 0x2000076c
lr uint32_t 0xab61 (Hex) 0x20000768
pc uint32_t 0x38dc8 (Hex) 0x20000764
psr uint32_t 0x0 (Hex) 0x20000760


Looking at the Keil app note, I believe a CFSR value of 0x20000 indicates a Usage Fault with the INVSTATE bit set, i.e.:




INVSTATE: Invalid state: 0 = no invalid state 1 = the processor has
attempted to execute an instruction that makes illegal use of the
Execution Program Status Register (EPSR). When this bit is set, the PC
value stacked for the exception return points to the instruction that
attempted the illegal use of the EPSR. Potential reasons: a) Loading a
branch target address to PC with LSB=0. b) Stacked PSR corrupted
during exception or interrupt handling. c) Vector table contains a
vector address with LSB=0.




The PC value pushed onto the stack by the exception (provided by the code from the knowledge base article) seems to be 0x38dc8. If I go to this address in the Simplicity Studio "Disassembly" window, I see the following:



00038db8:   str     r5,[r5,#0x14]
00038dba: str r0,[r7,r1]
00038dbc: str r4,[r5,#0x14]
00038dbe: ldr r4,[pc,#0x1e4] ; 0x38fa0
00038dc0: strb r1,[r4,#0x11]
00038dc2: ldr r5,[r4,#0x64]
00038dc4: ldrb r3,[r4,#0x5]
00038dc6: movs r3,r6
00038dc8: strb r1,[r4,#0x15]
00038dca: ldr r4,[r4,#0x14]
00038dcc: cmp r7,#0x6f
00038dce: cmp r6,#0x30
00038dd0: str r7,[r6,#0x14]
00038dd2: lsls r6,r6,#1
00038dd4: movs r5,r0
00038dd6: movs r0,r0


The address appears to be well past the end of my code. If I look at the same address in the "Memory" window, this is what I see:



0x00038DC8  69647561 2E302F6F 00766177 00000005  audio/0.wav.....
0x00038DD8 00000000 000F4240 00000105 00000000 ....@B..........
0x00038DE8 00000000 00000000 00000005 00000000 ................
0x00038DF8 0001C200 00000500 00001000 00000000 .Â..............
0x00038E08 00000000 F00000F0 02F00001 0003F000 ....ð..ð..ð..ð..
0x00038E18 F00004F0 06010005 01020101 01011201 ð..ð............
0x00038E28 35010121 01010D01 6C363025 2E6E6775 !..5....%06lugn.
0x00038E38 00746164 00000001 000008D0 00038400 dat.....Ð.......


Curiously, "audio/0.wav" is a static string which is part of the firmware. If I understand correctly, what I've learned here is that PC somehow gets set to this point in memory, which of course is not a valid instruction and causes the hard fault.



To debug the issue, I need to know how PC came to be set to this incorrect value. I believe the LR register should give me an idea. The LR register pushed onto the stack by the exception seems to be 0xab61. If I look at this location, I see the following in the Disassembly window:



1270            dp->sect = clst2sect(fs, clst);
0000ab58: ldr r0,[r7,#0x10]
0000ab5a: ldr r1,[r7,#0x14]
0000ab5c: bl 0x00009904
0000ab60: mov r2,r0
0000ab62: ldr r3,[r7,#0x4]
0000ab64: str r2,[r3,#0x18]


It looks to me like the problem occurs during this call specifically:



0000ab5c:   bl      0x00009904


This makes me think that the problem occurs as a result of a corrupt stack, which causes clst2sect to return to an invalid part of memory rather than to 0xab60. The code for clst2sect is pretty innocuous:



/*-----------------------------------------------------------------------*/
/* Get physical sector number from cluster number */
/*-----------------------------------------------------------------------*/

DWORD clst2sect ( /* !=0:Sector number, 0:Failed (invalid cluster#) */
FATFS* fs, /* Filesystem object */
DWORD clst /* Cluster# to be converted */
)
{
clst -= 2; /* Cluster number is origin from 2 */
if (clst >= fs->n_fatent - 2) return 0; /* Is it invalid cluster number? */
return fs->database + fs->csize * clst; /* Start sector number of the cluster */
}


Does this analysis sound about right?



I suppose the problem I've run into is that I have no idea what might cause this kind of behaviour... I've tried putting breakpoints in all of my interrupt handlers, to see if one of them might be corrupting the stack, but there doesn't seem to be any pattern--sometimes, no interrupt handler is called but the problem still occurs.



In that case, though, it's hard for me to see how a program might try to execute code at a location well past the actual end of the code... I feel like a function pointer might be a likely candidate, but in that case I would expect to see the problem show up, e.g., where a function pointer is used. However, I don't see any function pointers used near where the error is occurring.



Perhaps there is more information I can extract from the debug information I've given above? The problem is quite reproducible, so if there's something I have not tried, but which you think might give some insight, I would love to hear it.



Thanks for any help you can offer!







debugging arm






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 11 '18 at 21:57









Michael Cooper

1,36811116




1,36811116












  • Could you give a bit more context? Is this bare metal, or is there an OS?
    – cooperised
    Nov 12 '18 at 9:19










  • The error is between 0x0000ab5c, and the next BL, somewhere where there is a PC modifying instruction. It seems your stack is OK here, are there any hints above/below this stack location?
    – Sean Houlihane
    Nov 12 '18 at 13:05










  • @cooperised This is bare metal with a simple cooperative scheduler.
    – Michael Cooper
    Nov 13 '18 at 16:13






  • 1




    this return is presumably a pop to the PC? It could be past that, until you find something that modifies the LR, and it could be that return which is taking a corrupted value.
    – Sean Houlihane
    Nov 13 '18 at 16:37






  • 1




    If you've tried doubling the stack size and it hasn't helped then this probably isn't the problem. The most reliable way to check for an overflowing stack is to configure the MPU to make a small region at the bottom of the allocated stack space inaccessible; an overflowing stack will then trigger a data abort exception and you can work back from there. An easier but less reliable method is to try and work out where your stack usage is greatest and put a breakpoint in so you can peek r13.
    – cooperised
    Nov 15 '18 at 12:26


















  • Could you give a bit more context? Is this bare metal, or is there an OS?
    – cooperised
    Nov 12 '18 at 9:19










  • The error is between 0x0000ab5c, and the next BL, somewhere where there is a PC modifying instruction. It seems your stack is OK here, are there any hints above/below this stack location?
    – Sean Houlihane
    Nov 12 '18 at 13:05










  • @cooperised This is bare metal with a simple cooperative scheduler.
    – Michael Cooper
    Nov 13 '18 at 16:13






  • 1




    this return is presumably a pop to the PC? It could be past that, until you find something that modifies the LR, and it could be that return which is taking a corrupted value.
    – Sean Houlihane
    Nov 13 '18 at 16:37






  • 1




    If you've tried doubling the stack size and it hasn't helped then this probably isn't the problem. The most reliable way to check for an overflowing stack is to configure the MPU to make a small region at the bottom of the allocated stack space inaccessible; an overflowing stack will then trigger a data abort exception and you can work back from there. An easier but less reliable method is to try and work out where your stack usage is greatest and put a breakpoint in so you can peek r13.
    – cooperised
    Nov 15 '18 at 12:26
















Could you give a bit more context? Is this bare metal, or is there an OS?
– cooperised
Nov 12 '18 at 9:19




Could you give a bit more context? Is this bare metal, or is there an OS?
– cooperised
Nov 12 '18 at 9:19












The error is between 0x0000ab5c, and the next BL, somewhere where there is a PC modifying instruction. It seems your stack is OK here, are there any hints above/below this stack location?
– Sean Houlihane
Nov 12 '18 at 13:05




The error is between 0x0000ab5c, and the next BL, somewhere where there is a PC modifying instruction. It seems your stack is OK here, are there any hints above/below this stack location?
– Sean Houlihane
Nov 12 '18 at 13:05












@cooperised This is bare metal with a simple cooperative scheduler.
– Michael Cooper
Nov 13 '18 at 16:13




@cooperised This is bare metal with a simple cooperative scheduler.
– Michael Cooper
Nov 13 '18 at 16:13




1




1




this return is presumably a pop to the PC? It could be past that, until you find something that modifies the LR, and it could be that return which is taking a corrupted value.
– Sean Houlihane
Nov 13 '18 at 16:37




this return is presumably a pop to the PC? It could be past that, until you find something that modifies the LR, and it could be that return which is taking a corrupted value.
– Sean Houlihane
Nov 13 '18 at 16:37




1




1




If you've tried doubling the stack size and it hasn't helped then this probably isn't the problem. The most reliable way to check for an overflowing stack is to configure the MPU to make a small region at the bottom of the allocated stack space inaccessible; an overflowing stack will then trigger a data abort exception and you can work back from there. An easier but less reliable method is to try and work out where your stack usage is greatest and put a breakpoint in so you can peek r13.
– cooperised
Nov 15 '18 at 12:26




If you've tried doubling the stack size and it hasn't helped then this probably isn't the problem. The most reliable way to check for an overflowing stack is to configure the MPU to make a small region at the bottom of the allocated stack space inaccessible; an overflowing stack will then trigger a data abort exception and you can work back from there. An easier but less reliable method is to try and work out where your stack usage is greatest and put a breakpoint in so you can peek r13.
– cooperised
Nov 15 '18 at 12:26












1 Answer
1






active

oldest

votes


















0














After about a month of chasing this one, I managed to identify the cause of the problem. I hope I can give enough information here that this will be useful to someone else.



In the end, the problem was caused by passing a pointer to a non-static local variable to a state machine which changed the value at that memory location later on. Because the local variable was no longer in scope, that memory location was a random point in the stack, and changing the value there corrupted the stack.



The problem was difficult to track down for two reasons:




  1. Depending on how the code compiled, the changed memory location could be something non-critical like another local variable, which would cause a much more subtle error. Only when I got lucky would the change affect the PC register and cause a hard fault.


  2. Even when I found a version of the code that consistently generated a hard fault, the actual hard fault typically occurred somewhere up the call stack, when a function returned and popped the stack value into PC. This made it difficult to identify the cause of the problem--all I knew was that something was corrupting the stack before that function return.



A few tools were really helpful in identifying the cause of the problem:




  1. Early on, I had identified a block of code where the hard fault usually occurred using GPIO pins. I would toggle a pin high before entering the block and low when exiting the block. Then I performed many tests, checking if the pin was high or low when the hard fault occurred, and used a sort of binary search to determine the smallest block of code which consistently contained all the hard faults.


  2. The hard fault pushes a number of important registers onto the stack. These helped me confirm where the PC register was becoming corrupt, and also helped me understand that it was becoming corrupt as a result of a stack corruption.


  3. Starting somewhere before that block of code and stepping forward while keeping an eye on local variables, I was able to identify a function call that was corrupting the stack. I could confirm this using Simplicity Studio's memory view.


  4. Finally, stepping through the offending function in detail, I realized that the problem was occurring when I dereferenced a stored pointer and wrote to that memory location. Looking back at where that pointer value was set, I realized it had been set to point to a non-static local variable that was now out of scope.



Thanks to @SeanHoulihane and @cooperised, who helped me eliminate a few possible causes and gave me a little more confidence with the debugging tools.






share|improve this answer





















    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53253652%2fdebugging-a-hard-fault-in-arm-cortex-m4%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    After about a month of chasing this one, I managed to identify the cause of the problem. I hope I can give enough information here that this will be useful to someone else.



    In the end, the problem was caused by passing a pointer to a non-static local variable to a state machine which changed the value at that memory location later on. Because the local variable was no longer in scope, that memory location was a random point in the stack, and changing the value there corrupted the stack.



    The problem was difficult to track down for two reasons:




    1. Depending on how the code compiled, the changed memory location could be something non-critical like another local variable, which would cause a much more subtle error. Only when I got lucky would the change affect the PC register and cause a hard fault.


    2. Even when I found a version of the code that consistently generated a hard fault, the actual hard fault typically occurred somewhere up the call stack, when a function returned and popped the stack value into PC. This made it difficult to identify the cause of the problem--all I knew was that something was corrupting the stack before that function return.



    A few tools were really helpful in identifying the cause of the problem:




    1. Early on, I had identified a block of code where the hard fault usually occurred using GPIO pins. I would toggle a pin high before entering the block and low when exiting the block. Then I performed many tests, checking if the pin was high or low when the hard fault occurred, and used a sort of binary search to determine the smallest block of code which consistently contained all the hard faults.


    2. The hard fault pushes a number of important registers onto the stack. These helped me confirm where the PC register was becoming corrupt, and also helped me understand that it was becoming corrupt as a result of a stack corruption.


    3. Starting somewhere before that block of code and stepping forward while keeping an eye on local variables, I was able to identify a function call that was corrupting the stack. I could confirm this using Simplicity Studio's memory view.


    4. Finally, stepping through the offending function in detail, I realized that the problem was occurring when I dereferenced a stored pointer and wrote to that memory location. Looking back at where that pointer value was set, I realized it had been set to point to a non-static local variable that was now out of scope.



    Thanks to @SeanHoulihane and @cooperised, who helped me eliminate a few possible causes and gave me a little more confidence with the debugging tools.






    share|improve this answer


























      0














      After about a month of chasing this one, I managed to identify the cause of the problem. I hope I can give enough information here that this will be useful to someone else.



      In the end, the problem was caused by passing a pointer to a non-static local variable to a state machine which changed the value at that memory location later on. Because the local variable was no longer in scope, that memory location was a random point in the stack, and changing the value there corrupted the stack.



      The problem was difficult to track down for two reasons:




      1. Depending on how the code compiled, the changed memory location could be something non-critical like another local variable, which would cause a much more subtle error. Only when I got lucky would the change affect the PC register and cause a hard fault.


      2. Even when I found a version of the code that consistently generated a hard fault, the actual hard fault typically occurred somewhere up the call stack, when a function returned and popped the stack value into PC. This made it difficult to identify the cause of the problem--all I knew was that something was corrupting the stack before that function return.



      A few tools were really helpful in identifying the cause of the problem:




      1. Early on, I had identified a block of code where the hard fault usually occurred using GPIO pins. I would toggle a pin high before entering the block and low when exiting the block. Then I performed many tests, checking if the pin was high or low when the hard fault occurred, and used a sort of binary search to determine the smallest block of code which consistently contained all the hard faults.


      2. The hard fault pushes a number of important registers onto the stack. These helped me confirm where the PC register was becoming corrupt, and also helped me understand that it was becoming corrupt as a result of a stack corruption.


      3. Starting somewhere before that block of code and stepping forward while keeping an eye on local variables, I was able to identify a function call that was corrupting the stack. I could confirm this using Simplicity Studio's memory view.


      4. Finally, stepping through the offending function in detail, I realized that the problem was occurring when I dereferenced a stored pointer and wrote to that memory location. Looking back at where that pointer value was set, I realized it had been set to point to a non-static local variable that was now out of scope.



      Thanks to @SeanHoulihane and @cooperised, who helped me eliminate a few possible causes and gave me a little more confidence with the debugging tools.






      share|improve this answer
























        0












        0








        0






        After about a month of chasing this one, I managed to identify the cause of the problem. I hope I can give enough information here that this will be useful to someone else.



        In the end, the problem was caused by passing a pointer to a non-static local variable to a state machine which changed the value at that memory location later on. Because the local variable was no longer in scope, that memory location was a random point in the stack, and changing the value there corrupted the stack.



        The problem was difficult to track down for two reasons:




        1. Depending on how the code compiled, the changed memory location could be something non-critical like another local variable, which would cause a much more subtle error. Only when I got lucky would the change affect the PC register and cause a hard fault.


        2. Even when I found a version of the code that consistently generated a hard fault, the actual hard fault typically occurred somewhere up the call stack, when a function returned and popped the stack value into PC. This made it difficult to identify the cause of the problem--all I knew was that something was corrupting the stack before that function return.



        A few tools were really helpful in identifying the cause of the problem:




        1. Early on, I had identified a block of code where the hard fault usually occurred using GPIO pins. I would toggle a pin high before entering the block and low when exiting the block. Then I performed many tests, checking if the pin was high or low when the hard fault occurred, and used a sort of binary search to determine the smallest block of code which consistently contained all the hard faults.


        2. The hard fault pushes a number of important registers onto the stack. These helped me confirm where the PC register was becoming corrupt, and also helped me understand that it was becoming corrupt as a result of a stack corruption.


        3. Starting somewhere before that block of code and stepping forward while keeping an eye on local variables, I was able to identify a function call that was corrupting the stack. I could confirm this using Simplicity Studio's memory view.


        4. Finally, stepping through the offending function in detail, I realized that the problem was occurring when I dereferenced a stored pointer and wrote to that memory location. Looking back at where that pointer value was set, I realized it had been set to point to a non-static local variable that was now out of scope.



        Thanks to @SeanHoulihane and @cooperised, who helped me eliminate a few possible causes and gave me a little more confidence with the debugging tools.






        share|improve this answer












        After about a month of chasing this one, I managed to identify the cause of the problem. I hope I can give enough information here that this will be useful to someone else.



        In the end, the problem was caused by passing a pointer to a non-static local variable to a state machine which changed the value at that memory location later on. Because the local variable was no longer in scope, that memory location was a random point in the stack, and changing the value there corrupted the stack.



        The problem was difficult to track down for two reasons:




        1. Depending on how the code compiled, the changed memory location could be something non-critical like another local variable, which would cause a much more subtle error. Only when I got lucky would the change affect the PC register and cause a hard fault.


        2. Even when I found a version of the code that consistently generated a hard fault, the actual hard fault typically occurred somewhere up the call stack, when a function returned and popped the stack value into PC. This made it difficult to identify the cause of the problem--all I knew was that something was corrupting the stack before that function return.



        A few tools were really helpful in identifying the cause of the problem:




        1. Early on, I had identified a block of code where the hard fault usually occurred using GPIO pins. I would toggle a pin high before entering the block and low when exiting the block. Then I performed many tests, checking if the pin was high or low when the hard fault occurred, and used a sort of binary search to determine the smallest block of code which consistently contained all the hard faults.


        2. The hard fault pushes a number of important registers onto the stack. These helped me confirm where the PC register was becoming corrupt, and also helped me understand that it was becoming corrupt as a result of a stack corruption.


        3. Starting somewhere before that block of code and stepping forward while keeping an eye on local variables, I was able to identify a function call that was corrupting the stack. I could confirm this using Simplicity Studio's memory view.


        4. Finally, stepping through the offending function in detail, I realized that the problem was occurring when I dereferenced a stored pointer and wrote to that memory location. Looking back at where that pointer value was set, I realized it had been set to point to a non-static local variable that was now out of scope.



        Thanks to @SeanHoulihane and @cooperised, who helped me eliminate a few possible causes and gave me a little more confidence with the debugging tools.







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Nov 16 '18 at 20:41









        Michael Cooper

        1,36811116




        1,36811116






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.





            Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


            Please pay close attention to the following guidance:


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53253652%2fdebugging-a-hard-fault-in-arm-cortex-m4%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            How to change which sound is reproduced for terminal bell?

            Title Spacing in Bjornstrup Chapter, Removing Chapter Number From Contents

            Can I use Tabulator js library in my java Spring + Thymeleaf project?