riley's blog

Thoughts, baseball, engineering

-march and -mtune: What's the Difference?

Recently, David Iserovich and I at AppNexus ran into an issue with build scripts while porting our builds to OBS. Prior to OBS, we had dedicated, older build machines that built our releases.

At the same time, I was upgrading Coverity to the latest release for better static analysis of our stack. During this upgrade I was wrestling with getting our various apps to compile from a single trigger script. Turns out that some of our apps did:

gcc ... -march=native -mtune=native

Using -march=native causes the compiler to generate machine code that matches the processor it is currently running on. This produces the best possible code for that chipset, but will likely break the compiled object on older chipsets (newer chips being backwards compatible with older ones). -mtune=native will “tune” the optimized code to run best on the current chipset while remaining backwards compatible with older chipsets. It is important to note here that -march trumps -mtune: if you specify both (like we were), you get optimized code that can only run on that chip or newer ones.

In practice this wasn’t an issue for us because our sad build machine was old. Until OBS, that is. The OBS build server had a shiny new Sandy Bridge Xeon E5 chip, while the old, sad, outcast build machine had:

model name   : Intel(R) Xeon(R) CPU           L5630  @ 2.13GHz

So -march used on the L5630 machine would produce optimized code for that chip. The newer E5 Sandy Bridge chips in production would run it fine, because they contain all the instructions the L5630s on the build machine do. However, when we started building on OBS with an E5 chip, we decided to test in a sandbox environment (which had the older L5630s), and you can imagine what happened:

Program received signal SIGILL, Illegal instruction.
0x000000000045ad1d in an_md_rdtsc_calibrate_scale (rdtsc=0x45b2a0 <an_md_x86_64_i_rdtscp>) at ../common/an_md.c:108
108    in ../common/an_md.c
(gdb)
0x000000000045ad1d <an_md_rdtsc_calibrate_scale+205>:    vcvtsi2sd %rax,%xmm0,%xmm0

This instruction, vcvtsi2sd, is part of the Advanced Vector Extensions (AVX), introduced with Sandy Bridge on Intel and the Bulldozer processors from AMD, both of which started shipping in 2011. vcvtsi2sd converts a 64-bit integer to a double. The code that generated this instruction is essentially:

     uint64_t start, end;
     double diff;

     for (i = 1, j = k = 0; i <= MAX_GETTIME_SAMPLES; ++i) {
             struct timespec ts;

             start = rdtsc();
             clock_gettime(CLOCK_MONOTONIC, &ts);
             end = rdtsc();
             if (end < start)
                     continue;
             ++j;
             diff = end - start;

             /* more stuff here which averages time, etc... */
     }

Here rdtsc() is a wrapper function around some assembly that selects the proper rdtsc instruction for the current chipset. We are timing how long a call to clock_gettime() takes in clock cycles. More on rdtsc.

So I got to thinking about the best way to deploy our code. Should we compile specifically for each chipset we run in production and make disparate RPMs for those installs? Should we compile for each chipset and package the differing binaries into one large RPM? Is there even a benefit to using -march in our environment with our applications, or would -mtune (or no tuning) suffice? Welp, I guess I have to test it.

My development box has:

model name   : Intel(R) Xeon(R) CPU           L5640  @ 2.27GHz
...
flags     : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dtherm tpr_shadow vnmi flexpriority ept vpid

The test program.

test_march.c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define CONVERSIONS 1000000

static uint64_t
rdtscp(void)
{
  uint32_t eax = 0, edx;
  asm volatile ("rdtscp"
      : "+a" (eax), "=d" (edx)
      :
      : "%ecx", "memory");

  return (((uint64_t)edx << 32) | eax);
}

int
main(int argc, char **argv)
{
  struct timespec ts;
  uint64_t start, end, total, rval;
  double number;
  double total_number = 0.0;
  int i;

  srand(23492340);

  start = rdtscp();
  for (i = 0; i < CONVERSIONS; i++) {

      /* get some random number */
      rval = rand();

      /*  force cast to double (our problem instruction) */
      number = rval;

      /* prevent compiler from optimizing this out by using value */
      total_number += number;
  }
  end = rdtscp();

  total = end - start;
  printf("%d conversions took %llu ticks\n", CONVERSIONS,
      (unsigned long long)total);
  printf("Total number: %f\n", total_number);

  return 0;
}

Compiling and running:

[rberton@117.bm-rberton.dev.nym2 ~ []]$ gcc44 -o test_march -O0 -ggdb test_march.c
[rberton@117.bm-rberton.dev.nym2 ~ []]$ ./test_march
1000000 conversions took 32445304 ticks
Total number: 1073827214394388.000000

Note we compiled this without optimizations and without -march or -mtune to get the baseline assembly for the uint64_t-to-double conversion:

  /*  force cast to double (our problem instruction) */
  number = rval;
  40059c:    48 8b 45 e0           mov    0xffffffffffffffe0(%rbp),%rax
  4005a0:    48 85 c0              test   %rax,%rax
  4005a3:    78 07                 js     4005ac <main+0x4f>
  4005a5:    f2 48 0f 2a c0          cvtsi2sdq %rax,%xmm0
  4005aa:    eb 15                   jmp    4005c1 <main+0x64>
  4005ac:    48 89 c2              mov    %rax,%rdx
  4005af:    48 d1 ea               shr    %rdx
  4005b2:    83 e0 01               and    $0x1,%eax
  4005b5:    48 09 c2              or     %rax,%rdx
  4005b8:    f2 48 0f 2a c2          cvtsi2sdq %rdx,%xmm0
  4005bd:    f2 0f 58 c0             addsd  %xmm0,%xmm0
  4005c1:    f2 0f 11 45 e8          movsd  %xmm0,0xffffffffffffffe8(%rbp)

The instructions at 4005a0 and 4005a3 test for a negative number and jump to the negative-number handling. Assuming a non-negative value, we call cvtsi2sdq to move %rax into an SSE register and then jump to 4005c1, where we copy that SSE register (%xmm0) to the stack.

So even unoptimized code chooses SSE extensions. Let's switch those off to get a true baseline:

[rberton@117.bm-rberton.dev.nym2 ~ []]$ gcc44 -o test_march -O0 -g -mno-sse test_march.c

Now the assembly chooses x87 instructions.

  /*  force cast to double (our problem instruction) */
  number = rval;
  40059c:    df 6d e0                fildll 0xffffffffffffffe0(%rbp)
  40059f:    48 83 7d e0 00           cmpq   $0x0,0xffffffffffffffe0(%rbp)
  4005a4:    79 08                 jns    4005ae <main+0x51>
  4005a6:    db 2d b4 01 00 00       fldt   436(%rip)        # 400760 <__dso_handle+0x48>
  4005ac:    de c1                   faddp  %st,%st(1)
  4005ae:    dd 5d e8                fstpl  0xffffffffffffffe8(%rbp)

OK. This is as simple as we are going to get. Let’s benchmark it.

[rberton@117.bm-rberton.dev.nym2 ~ []]$ ./test_march
1000000 conversions took 30796404 ticks

Almost 31MM cycles to convert 1MM uint64_t to double. Great.

Now let's leave SSE off and see what happens with -mtune=native.

  /*  force cast to double (our problem instruction) */
  number = rval;
  40059c:    df 6d e0                fildll 0xffffffffffffffe0(%rbp)
  40059f:    48 83 7d e0 00           cmpq   $0x0,0xffffffffffffffe0(%rbp)
  4005a4:    79 08                 jns    4005ae <main+0x51>
  4005a6:    db 2d b4 01 00 00       fldt   436(%rip)        # 400760 <__dso_handle+0x48>
  4005ac:    de c1                   faddp  %st,%st(1)
  4005ae:    dd 5d e8                fstpl  0xffffffffffffffe8(%rbp)

Exactly the same code. It appears that -mno-sse trumps -mtune. I also tried with -march=native, and -mno-sse trumps that as well. If you switch off SSE, it's really off, regardless of further options. OK, let's put SSE back on and see what -mtune does to the code.

  /*  force cast to double (our problem instruction) */
  number = rval;
  40059c:    48 8b 45 e0           mov    0xffffffffffffffe0(%rbp),%rax
  4005a0:    48 85 c0              test   %rax,%rax
  4005a3:    78 07                 js     4005ac <main+0x4f>
  4005a5:    f2 48 0f 2a c0          cvtsi2sdq %rax,%xmm0
  4005aa:    eb 15                   jmp    4005c1 <main+0x64>
  4005ac:    48 89 c2              mov    %rax,%rdx
  4005af:    48 d1 ea               shr    %rdx
  4005b2:    83 e0 01               and    $0x1,%eax
  4005b5:    48 09 c2              or     %rax,%rdx
  4005b8:    f2 48 0f 2a c2          cvtsi2sdq %rdx,%xmm0
  4005bd:    f2 0f 58 c0             addsd  %xmm0,%xmm0
  4005c1:    f2 0f 11 45 e8          movsd  %xmm0,0xffffffffffffffe8(%rbp)

This gets us the same code as the default unoptimized compile.

Switching over to the E5-based machine, let's ask gcc what -march we should use.

[rberton@229.bm-general.dev.nym2 ~]$ gcc -march=native -Q --help=target | grep march
  -march=                          core2

So this is saying this machine's architecture is core2. Compiling with -march=core2:

  /*  force cast to double (our problem instruction) */
  number = rval;
  4005bc:    48 8b 45 e0           mov    -0x20(%rbp),%rax
  4005c0:    48 85 c0              test   %rax,%rax
  4005c3:    78 07                 js     4005cc <main+0x4f>
  4005c5:    f2 48 0f 2a c0          cvtsi2sd %rax,%xmm0
  4005ca:    eb 15                   jmp    4005e1 <main+0x64>
  4005cc:    48 89 c2              mov    %rax,%rdx
  4005cf:    48 d1 ea               shr    %rdx
  4005d2:    83 e0 01               and    $0x1,%eax
  4005d5:    48 09 c2              or     %rax,%rdx
  4005d8:    f2 48 0f 2a c2          cvtsi2sd %rdx,%xmm0
  4005dd:    f2 0f 58 c0             addsd  %xmm0,%xmm0
  4005e1:    f2 0f 11 45 e8          movsd  %xmm0,-0x18(%rbp)

Interestingly, even though this machine is AVX capable, it chose the core2 instruction set. Yet when I use -march=native:

  /*  force cast to double (our problem instruction) */
  number = rval;
  4005bc:    48 8b 45 e0           mov    -0x20(%rbp),%rax
  4005c0:    48 85 c0              test   %rax,%rax
  4005c3:    78 07                 js     4005cc <main+0x4f>
  4005c5:    c4 e1 fb 2a c0          vcvtsi2sd %rax,%xmm0,%xmm0
  4005ca:    eb 15                   jmp    4005e1 <main+0x64>
  4005cc:    48 89 c2              mov    %rax,%rdx
  4005cf:    48 d1 ea               shr    %rdx
  4005d2:    83 e0 01               and    $0x1,%eax
  4005d5:    48 09 c2              or     %rax,%rdx
  4005d8:    c4 e1 fb 2a c2          vcvtsi2sd %rdx,%xmm0,%xmm0
  4005dd:    c5 fb 58 c0             vaddsd %xmm0,%xmm0,%xmm0
  4005e1:    c5 fb 11 45 e8          vmovsd %xmm0,-0x18(%rbp)

We get AVX instructions. This tells me that -march=native turns on every x86 instruction-set extension the chip supports, while -mtune=native keeps the instruction set at the family's lowest common denominator (i.e. SSE) and only tunes scheduling for the current chip.

As for performance, let’s compile at -O2 and -march=native and time them.

The L5630 Chipset with -O2 and -march=native
[rberton@117.bm-rberton.dev.nym2 ~ []]$ ./test_march
1000000 conversions took 30582172 ticks
The L5630 Chipset with -O2 and -mno-sse
[rberton@117.bm-rberton.dev.nym2 ~ []]$ gcc44 -o test_march -O2 -mno-sse test_march.c
In file included from test_march.c:4:
/usr/include/stdlib.h: In function ‘strtod’:
/usr/include/stdlib.h:329: error: SSE register return with SSE disabled
[rberton@117.bm-rberton.dev.nym2 ~ []]$ gcc44 -o test_march -O1 -mno-sse test_march.c
In file included from test_march.c:4:
/usr/include/stdlib.h: In function ‘strtod’:
/usr/include/stdlib.h:329: error: SSE register return with SSE disabled
[rberton@117.bm-rberton.dev.nym2 ~ []]$ gcc44 -o test_march -O0 -mno-sse test_march.c
[rberton@117.bm-rberton.dev.nym2 ~ []]$ ./test_march
1000000 conversions took 30668020 ticks

As you can see, my glibc won't allow this combination with printf in play: the x86-64 ABI returns doubles in SSE registers, so functions like strtod can't be compiled with SSE disabled. I had to step down to the unoptimized build, which still ran at the same speed.

The E5 Chipset with -O2 and -march=native
[rberton@229.bm-general.dev.nym2 ~]$ gcc -o test_march -O2 -march=native test_march.c
[rberton@229.bm-general.dev.nym2 ~]$ ./test_march
1000000 conversions took 28981364 ticks

About 5% faster. This is probably a byproduct of other optimizations, not the double-conversion code.

This is getting absurdly long.

tl;dr Understand the trumping order of -march vs. -mtune vs. -m<other switches> on x86. For double conversion, it likely doesn't matter whether you bother enabling them at all. If in doubt, or you don't know the chipset you will deploy on, use -mtune=generic. If you know the chipset is within a family of processors, you can use -mtune=native. If you know the chipset for sure (and think it matters), use -march=native. Wherever I say native, feel free to substitute the chipset you deploy on.