Compilers are amazing #1: GCC's built-in strcpy() implementation

This is part 1 of possibly a 1-part series. We’ll see.

Has this ever happened to you? You have some code like this:

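#include <string.h>
#include <stdio.h>
int main(void)
{
  /* Note: the literal below needs 35 bytes (34 characters plus the NUL),
     so it overflows buf by 3 bytes; a real program needs a bigger buffer. */
  char buf[32] = {0};
  strcpy(buf, "Hello world, this is a long string");
  puts(buf);
}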

You compile it, run it under a debugger like GDB (or a tracer like ltrace) and set a breakpoint on the call to strcpy():

$ gcc test.c -o test
$ gdb ./test 
GNU gdb (Debian N.NN-N+NN) N.NN
Reading symbols from ./test...(no debugging symbols found)...done.
(gdb) break strcpy
Function "strcpy" not defined.
Make breakpoint pending on future shared library load? (y or [n])

Why isn’t strcpy() defined? We’re definitely calling it in that tiny program, right? So what gives? We double-check with nm:

$ nm -D test
   w __gmon_start__
   U __libc_start_main
   U puts

puts() is there but strcpy() is not! It turns out that GCC has built-in implementations of many string functions. The GCC manual explains:

GCC provides a large number of built-in functions other than the ones mentioned above. Some of these are for internal use in the processing of exceptions or variable-length argument lists and are not documented here because they may change from time to time; we do not recommend general use of these functions.

The remaining functions are provided for optimization purposes.

The generated code on 32-bit x86 (below) is pretty neat; the call to strcpy() simply becomes a sequence of immediate-to-memory movs!

0x0804843a <+46>:    lea    eax,[esp+0x10]
0x0804843e <+50>:    mov    DWORD PTR [eax],0x6c6c6548
0x08048444 <+56>:    mov    DWORD PTR [eax+0x4],0x6f77206f
0x0804844b <+63>:    mov    DWORD PTR [eax+0x8],0x2c646c72
0x08048452 <+70>:    mov    DWORD PTR [eax+0xc],0x69687420
0x08048459 <+77>:    mov    DWORD PTR [eax+0x10],0x73692073
0x08048460 <+84>:    mov    DWORD PTR [eax+0x14],0x6c206120
0x08048467 <+91>:    mov    DWORD PTR [eax+0x18],0x20676e6f
0x0804846e <+98>:    mov    DWORD PTR [eax+0x1c],0x69727473
0x08048475 <+105>:   mov    WORD PTR [eax+0x20],0x676e
0x0804847b <+111>:   mov    BYTE PTR [eax+0x22],0x0

I suppose this has a few performance benefits:

One less symbol for the dynamic linker (ld.so) to resolve (at startup, or at the first call with lazy binding)
No call, so no argument setup on the stack (GCC on 32-bit x86 uses the “cdecl” calling convention, where all arguments are passed on the stack)
The string lives in the mov instructions themselves, so there are no data-cache or TLB misses for the source string!

It’s a little more convoluted on amd64. As far as I can tell from page 218 of the AMD64 Architecture Programmer’s Manual, the only 64-bit immediate mov is to a register, not to memory!

[Image: snippet of the AMD64 Architecture Programmer's Manual, rev 3.22]

The generated code on amd64 reflects this, using two moves (immediate to register, then register to memory) for every 8 bytes of the string:

0x000000000040050e <+40>:    lea    rax,[rbp-0x20]
0x0000000000400512 <+44>:    movabs rdx,0x6f77206f6c6c6548
0x000000000040051c <+54>:    mov    QWORD PTR [rax],rdx
0x000000000040051f <+57>:    movabs rcx,0x696874202c646c72
0x0000000000400529 <+67>:    mov    QWORD PTR [rax+0x8],rcx
0x000000000040052d <+71>:    movabs rsi,0x6c20612073692073
0x0000000000400537 <+81>:    mov    QWORD PTR [rax+0x10],rsi
0x000000000040053b <+85>:    movabs rdi,0x6972747320676e6f
0x0000000000400545 <+95>:    mov    QWORD PTR [rax+0x18],rdi
0x0000000000400549 <+99>:    mov    WORD PTR [rax+0x20],0x676e
0x000000000040054f <+105>:   mov    BYTE PTR [rax+0x22],0x0

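There’s a code size overhead here. Going by the addresses in the 32-bit listing, each full mov is 7 bytes long (the first one, with no displacement, is 6) and carries 4 bytes of string, so that’s about 3 bytes of instruction encoding for every 4 bytes of string. On amd64 each movabs/mov pair takes 13 or 14 bytes for 8 bytes of string. As this is a significant overhead, it’s disabled by gcc -Os. If you absolutely need to set a breakpoint on strcpy(), strcat(), or the other GCC built-ins, compile with -fno-builtin to turn this behaviour off (or -fno-builtin-strcpy to turn it off for just that one function).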

Alex's blog
