Commit 3697cd9a authored by Amerigo Wang, committed by Linus Torvalds

Doc: update Documentation/exception.txt

Update Documentation/exception.txt.
Remove trailing whitespaces in it.
Signed-off-by: WANG Cong <amwang@redhat.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parent 097041e5
-Kernel level exception handling in Linux 2.1.8
+Kernel level exception handling in Linux
Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>
When a process runs in kernel mode, it often has to access user
mode memory whose address has been passed by an untrusted program.
To protect itself the kernel has to verify this address.

In older versions of Linux this was done with the
int verify_area(int type, const void * addr, unsigned long size)
function (which has since been replaced by access_ok()).

This function verified that the memory area starting at address
'addr' and of size 'size' was accessible for the operation specified
in type (read or write). To do this, verify_area had to look up the
virtual memory area (vma) that contained the address addr. In the
normal case (correctly working program), this test was successful.
It only failed for a few buggy programs. In some kernel profiling
tests, this normally unneeded verification used up a considerable
amount of time.

To overcome this situation, Linus decided to let the virtual memory
hardware present in every Linux-capable CPU handle this test.
How does this work?

Whenever the kernel tries to access an address that is currently not
accessible, the CPU generates a page fault exception and calls the
page fault handler

void do_page_fault(struct pt_regs *regs, unsigned long error_code)

-in arch/i386/mm/fault.c. The parameters on the stack are set up by
-the low level assembly glue in arch/i386/kernel/entry.S. The parameter
+in arch/x86/mm/fault.c. The parameters on the stack are set up by
+the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
regs is a pointer to the saved registers on the stack, error_code
contains a reason code for the exception.
do_page_fault first obtains the inaccessible address from the CPU
control register CR2. If the address is within the virtual address
space of the process, the fault probably occurred because the page
was not swapped in, write protected or something similar. However,
we are interested in the other case: the address is not valid, there
is no vma that contains this address. In this case, the kernel jumps
to the bad_area label.

There it uses the address of the instruction that caused the exception
(i.e. regs->eip) to find an address where the execution can continue
(fixup). If this search is successful, the fault handler modifies the
return address (again regs->eip) and returns. The execution will
continue at the address in fixup.
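The lookup-and-redirect step described above can be sketched in C. This is a simplified user-space model, not the kernel's code: the names ex_entry, saved_regs, find_fixup, fixup_fault and demo_table are hypothetical, the addresses in demo_table are made up, and the sketch assumes the table is sorted by faulting-instruction address so that a binary search applies.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one exception table entry. */
struct ex_entry {
    uint32_t insn;   /* address of an instruction that may fault */
    uint32_t fixup;  /* address where execution should continue */
};

/* Hypothetical stand-in for the saved register frame; only eip is modeled. */
struct saved_regs {
    uint32_t eip;
};

/* Binary search over a table sorted by insn; returns 0 if no entry matches. */
static uint32_t find_fixup(const struct ex_entry *tab, size_t n, uint32_t addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].insn == addr)
            return tab[mid].fixup;
        if (tab[mid].insn < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/*
 * Model of the bad_area path: if the faulting instruction has a fixup,
 * rewrite the saved eip and report success (1); otherwise leave the
 * registers untouched and report failure (0).
 */
static int fixup_fault(struct saved_regs *regs,
                       const struct ex_entry *tab, size_t n)
{
    uint32_t fixup = find_fixup(tab, n, regs->eip);

    if (fixup == 0)
        return 0;
    regs->eip = fixup;
    return 1;
}

/* Made-up table, for illustration only. */
static const struct ex_entry demo_table[] = {
    { 0x1000, 0x9000 },
    { 0x1010, 0x9008 },
    { 0x2040, 0x9010 },
};
```

With regs.eip set to 0x1010, fixup_fault(&regs, demo_table, 3) returns 1 and leaves regs.eip at 0x9008; an address with no entry returns 0 and leaves eip unchanged.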
Where does fixup point to?

Since we jump to the contents of fixup, fixup obviously points
to executable code. This code is hidden inside the user access macros.
-I have picked the get_user macro defined in include/asm/uaccess.h as an
-example. The definition is somewhat hard to follow, so let's peek at
+I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
+as an example. The definition is somewhat hard to follow, so let's peek at
the code generated by the preprocessor and the compiler. I selected
-the get_user call in drivers/char/console.c for a detailed examination.
+the get_user call in drivers/char/sysrq.c for a detailed examination.

-The original code in console.c line 1405:
+The original code in sysrq.c line 587:
        get_user(c, buf);

The preprocessor output (edited to become somewhat readable):
(
  {
    long __gu_err = - 14 , __gu_val = 0;
    const __typeof__(*( ( buf ) )) *__gu_addr = ((buf));
    if (((((0 + current_set[0])->tss.segment) == 0x18 ) ||
       (((sizeof(*(buf))) <= 0xC0000000UL) &&
       ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
      do {
        __gu_err = 0;
        switch ((sizeof(*(buf)))) {
          case 1:
            __asm__ __volatile__(
              "1: mov" "b" " %2,%" "b" "1\n"
              "2:\n"
              ".section .fixup,\"ax\"\n"
              "3: movl %3,%0\n"
              " xor" "b" " %" "b" "1,%" "b" "1\n"
              " jmp 2b\n"
              ".section __ex_table,\"a\"\n"
              " .align 4\n"
              " .long 1b,3b\n"
              ".text" : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
                            ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err )) ;
              break;
          case 2:
            __asm__ __volatile__(
              "1: mov" "w" " %2,%" "w" "1\n"
              "2:\n"
              ".section .fixup,\"ax\"\n"
              "3: movl %3,%0\n"
              " xor" "w" " %" "w" "1,%" "w" "1\n"
              " jmp 2b\n"
              ".section __ex_table,\"a\"\n"
              " .align 4\n"
              " .long 1b,3b\n"
              ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                            ( __gu_addr )) ), "i"(- 14 ), "0"( __gu_err ));
              break;
          case 4:
            __asm__ __volatile__(
              "1: mov" "l" " %2,%" "" "1\n"
              "2:\n"
              ".section .fixup,\"ax\"\n"
              "3: movl %3,%0\n"
              " xor" "l" " %" "" "1,%" "" "1\n"
              " jmp 2b\n"
              ".section __ex_table,\"a\"\n"
              " .align 4\n"
              " .long 1b,3b\n"
              ".text" : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                            ( __gu_addr )) ), "i"(- 14 ), "0"(__gu_err));
              break;
          default:
            (__gu_val) = __get_user_bad();
        }
      } while (0) ;
    ((c)) = (__typeof__(*((buf))))__gu_val;
    __gu_err;
  }
);
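The address check at the top of the expansion can be restated as a small C function. This is only a sketch of the arithmetic: the names USER_LIMIT and user_range_ok are hypothetical, the segment comparison against 0x18 is left out, and 32-bit addresses are assumed, as in the expansion.

```c
#include <stdint.h>

/* Limit value taken from the preprocessor output; the macro name is hypothetical. */
#define USER_LIMIT UINT32_C(0xC0000000)

/* Returns 1 if the range [addr, addr + size) ends at or below USER_LIMIT, else 0. */
static int user_range_ok(uint32_t addr, uint32_t size)
{
    /* Testing size first keeps USER_LIMIT - size from wrapping around. */
    return size <= USER_LIMIT && addr <= USER_LIMIT - size;
}
```

For a one-byte access the highest address that passes is 0xBFFFFFFF; for a four-byte access it is 0xBFFFFFFC.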
@@ -127,12 +127,12 @@ see what code gcc generates:
 >         xorl %edx,%edx
 >         movl current_set,%eax
 >         cmpl $24,788(%eax)
 >         je .L1424
 >         cmpl $-1073741825,64(%esp)
 >         ja .L1423
 > .L1424:
 >         movl %edx,%eax
 >         movl 64(%esp),%ebx
 > #APP
 > 1:      movb (%ebx),%dl                /* this is the actual user access */
@@ -149,17 +149,17 @@ see what code gcc generates:
 > .L1423:
 >         movzbl %dl,%esi

The optimizer does a good job and gives us something we can actually
understand. Can we? The actual user access is quite obvious. Thanks
to the unified address space we can just access the address in user
memory. But what does the .section stuff do?????

To understand this we have to look at the final kernel:

 > objdump --section-headers vmlinux
 >
 > vmlinux:     file format elf32-i386
 >
 > Sections:
 > Idx Name          Size      VMA       LMA       File off  Algn
 >   0 .text         00098f40  c0100000  c0100000  00001000  2**4
@@ -198,18 +198,18 @@ final kernel executable:

The whole user memory access is reduced to 10 x86 machine instructions.
The instructions bracketed in the .section directives are no longer
in the normal execution path. They are located in a different section
of the executable file:

 > objdump --disassemble --section=.fixup vmlinux
 >
 > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
 > c0199ffa <.fixup+10ba> xorb   %dl,%dl
 > c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>
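
The immediate $0xfffffff2 in the first fixup instruction is -14 written as an unsigned 32-bit value. A minimal C sketch of that conversion follows; the helper name as_signed32 is hypothetical and not part of the kernel.

```c
#include <stdint.h>

/* Reinterpret a 32-bit immediate as a signed two's-complement value. */
static int32_t as_signed32(uint32_t imm)
{
    /* Done arithmetically so the result does not depend on how the
     * compiler converts out-of-range unsigned values to signed ones. */
    if (imm <= (uint32_t)INT32_MAX)
        return (int32_t)imm;
    return (int32_t)(imm - UINT32_C(0x80000000)) + INT32_MIN;
}
```

as_signed32(0xfffffff2) yields -14. The same conversion explains the constant in the cmpl $-1073741825,64(%esp) line of the gcc output: it is 0xBFFFFFFF, i.e. 0xC0000000 - 1, printed as a signed number.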

And finally:

 > objdump --full-contents --section=__ex_table vmlinux
 >
 >  c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
 >  c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
 >  c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................
@@ -235,8 +235,8 @@ sections in the ELF object file. So the instructions
ended up in the .fixup section of the object file and the addresses
        .long 1b,3b
ended up in the __ex_table section of the object file. 1b and 3b
are local labels. The local label 1b (1b stands for next label 1
backward) is the address of the instruction that might fault, i.e.
in our case the address of the label 1 is c017e7a5:
the original assembly code: > 1:      movb (%ebx),%dl
and linked in vmlinux     : > c017e7a5 <do_con_write+e1> movb (%ebx),%dl
@@ -254,7 +254,7 @@ The assembly code
becomes the value pair
 > c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
                               ^this is ^this is
                               1b       3b
c017e7a5,c0199ff5 in the exception table of the kernel.
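
The raw objdump row shows the same pair byte-swapped (a5e717c0 f59f19c0) because x86 stores 32-bit values least significant byte first. A small C sketch of the decoding follows; the names le32 and row_tail are hypothetical, and the byte values are copied from the dump row at c01aa7d4.

```c
#include <stdint.h>

/* Assemble a 32-bit value from four bytes stored least significant first. */
static uint32_t le32(const uint8_t b[4])
{
    return (uint32_t)b[0] | (uint32_t)b[1] << 8 |
           (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
}

/* The last eight bytes of the dump row at c01aa7d4, in file order. */
static const uint8_t row_tail[8] = {
    0xa5, 0xe7, 0x17, 0xc0,   /* a5e717c0: address of label 1 */
    0xf5, 0x9f, 0x19, 0xc0,   /* f59f19c0: address of label 3 */
};
```

le32(row_tail) gives c017e7a5 and le32(row_tail + 4) gives c0199ff5, matching the value pair above.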

So, what actually happens if a fault from kernel mode with no suitable
@@ -266,9 +266,9 @@ vma occurs?
 3.) CPU calls do_page_fault
 4.) do_page_fault calls search_exception_table (regs->eip == c017e7a5);
 5.) search_exception_table looks up the address c017e7a5 in the
     exception table (i.e. the contents of the ELF section __ex_table)
     and returns the address of the associated fault handling code c0199ff5.
 6.) do_page_fault modifies its own return address to point to the fault
     handling code and returns.
 7.) execution continues in the fault handling code.
 8.) 8a) EAX becomes -EFAULT (== -14)
......