momo zone

调核人的blog

Monthly Archives: 四月 2013

如何高效的访问内存

影响内存访问速度的因素主要有:
1.内存带宽:每秒读写内存的数据量,由硬件配置决定。
2.CACHE高速缓冲:CPU与内存之间的缓冲器,当命中率比较高时能大大提供内存平均访问速度。
3.TLB转换旁视缓冲:系统虚拟地址向物理地址转换的高速查表机制,转换速度比普通转换机制要快。

我们能够优化的只有第2点和第3点。由于CACHE的小容量与SMP的同步竞争,如何最大限度的利用高速缓冲就是我们的明确优化突破口(以常用的数据结构体为例):
1.压缩结构体大小:针对CACHE的小容量。
2.对结构体进行对齐:针对内存地址读写特性与SMP上CACHE的同步竞争。
3.申请地址连续的内存空间:针对TLB的小容量和CACHE命中。
4.其它优化:综合考虑多种因素

具体优化方法
1.压缩结构体大小
系统CACHE是有限的,并且容量很小,充分压缩结构体大小,使得CACHE能缓存更多的被访问数据,无非是提高内存平均访问速度的有效方法之一。
压缩结构体大小除了需要我们对应用逻辑做好更合理的设计,尽量去除不必要的字段,还有一些额外针对结构体本身的压缩方法。

1.1.对结构体字段进行合理的排列
由于结构体自身对齐的特性,具有同样字段的结构体,不同的字段排列顺序会产生不同大小的结构体。
大小:12字节

1
2
3
4
5
6
7
struct box_a
{
    char a;
    short b;
    int c;
    char d;
};

大小:8字节

1
2
3
4
5
6
7
struct box_b
{
    char a;
    char d;
    short b;
    int c;
};

1.2.利用位域
实际中,有些结构体字段并不需要那么大的存储空间,比如表示真假标记的flag字段只取两个值之一,0或1,此时用1个bit位即可,如果使用int类型的单一字段就大大的浪费了空间。
示例:tcp.h

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
struct tcphdr {
    __be16  source;
    __be16  dest;
    __be32  seq;
    __be32  ack_seq;
#if defined(__LITTLE_ENDIAN_BITFIELD)
    __u16   res1:4,
        doff:4,
        fin:1,
        syn:1,
        rst:1,
        psh:1,
        ack:1,
        urg:1,
        ece:1,
        cwr:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
    __u16   doff:4,
        res1:4,
        cwr:1,
        ece:1,
        urg:1,
        ack:1,
        psh:1,
        rst:1,
        syn:1,
        fin:1;
#else
#error  "Adjust your <asm/byteorder.h> defines"
#endif
    __be16  window;
    __sum16 check;
    __be16  urg_ptr;
};

1.3.利用union
union结构体也是压缩结构体大小的方法之一,它允许我们在某些情况下能对结构体的多个字段进行合并或把小字节字段存放到大字节字段内。
示例:skbuff.h

1
2
3
4
5
6
7
8
9
10
11
struct sk_buff {
    
    union {
        __wsum      csum;
        struct {
            __u16   csum_start;
            __u16   csum_offset;
        };
    };
    
};

2.对结构体进行对齐
对结构体进行对齐有两层意思,一是指对较小结构体进行机器字对齐,二是指对较大结构体进行CACHE LINE对齐。

2.1.对较小结构体进行机器字对齐
我们知道,对于现代计算机硬件来说,内存只能通过特定的对齐地址(比如按照机器字)进行访问。举个例子来说,比如在64位的机器上,不管我们是要读取第0个字节还是要读取第1个字节,在硬件上传输的信号都是一样的。因为它都会把地址0到地址7,这8个字节全部读到CPU,只是当我们是需要读取第0个字节时,丢掉后面7个字节,当我们是需要读取第1个字节,丢掉第1个和后面6个字节。
当我们要读取的字节刚好落在两个机器字内时,就出现两次访问内存的情况,同时通过一些逻辑计算才能得到最终的结果。
因此,为了更好的提升性能,我们须尽量将结构体做到机器字(或倍数)对齐,而结构体中一些频繁访问的字段也尽量安排在机器字对齐的位置。
大小:12字节

1
2
3
4
5
6
7
8
struct box_c
{
    char a;
    char d;
    short b;
    int c;
    int e;
};

大小:16字节

1
2
3
4
5
6
7
8
9
struct box_d
{
    char a;
    char d;
    short b;
    int c;
    int e;
    char padding[4];
};

上面表格右边的box_d结构体,通过增加一个填充字段padding将结构体大小增加到16字节,从而与机器字倍数对齐,这在我们申请连续的box_d结构体数组时,仍能保证数组内的每一个结构体都与机器字倍数对齐。
通过填充字段padding使得结构体大小与机器字倍数对齐是一种常见的做法,在Linux内核源码里随处可见。

2.2.对较大结构体进行CACHE LINE对齐
我们知道,CACHE与内存交换的最小单位为CACHE LINE,一个CACHE LINE大小以64字节为例。当我们的结构体大小没有与64字节对齐时,一个结构体可能就要占用比原本需要更多的CACHE LINE。比如,把一个内存中没有64字节长的结构体缓存到CACHE时,即使该结构体本身长度或许没有还没有64字节,但由于其前后搭占在两条CACHE LINE上,那么对其进行淘汰时就会淘汰出去两条CACHE LINE。
这还不是最严重的问题,非CACHE LINE对齐结构体在SMP机器上容易引发名为错误共享的CACHE问题。比如,结构体T1和T2都没做CACHE LINE对齐,如果它们(T1后半部和T2前半部)在SMP机器上合占了同一条CACHE,如果CPU 0对结构体T1后半部做了修改则将导致CPU 1的CACHE LINE 1失效,同样,如果CPU 1对结构体T2前半部做了修改则也将导致CPU 0的CACHE LINE 1失效。如果CPU 0和CPU 1反复做相应的修改则导致的不良结果显而易见。本来逻辑上没有共享的结构体T1和T2,实际上却共享了CACHE LINE 1,这就是所谓的错误共享。
Linux源码里提供了利用GCC的__attribute__扩展属性定义的宏来做这种对齐处理,在文件/linux-2.6.xx/include/linux/cache.h内可以找到多个相类似的宏,比如:

1
#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))

该宏可以用来修饰结构体字段,作用是强制该字段地址与CACHE LINE映射起始地址对齐。
看/linux-2.6.xx/drivers/net/e100.c内结构体nic的实现,三个____cacheline_aligned修饰字段,表示强制这些字段与CACHE LINE映射起始地址对齐。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
struct nic {
    /* Begin: frequently used values: keep adjacent for cache effect */
    u32 msg_enable              ____cacheline_aligned;
    /* 4字节空洞 */
    struct net_device *netdev;
    struct pci_dev *pdev;
    /* 40字节空洞 */
    struct rx *rxs              ____cacheline_aligned;
    struct rx *rx_to_use;
    struct rx *rx_to_clean;
    struct rfd blank_rfd;
    enum ru_state ru_running;
    /* 20字节空洞 */
    spinlock_t cb_lock          ____cacheline_aligned;
    spinlock_t cmd_lock;
    struct csr __iomem *csr;
    enum scb_cmd_lo cuc_cmd;
    unsigned int cbs_avail;
    struct napi_struct napi;
    
}

回到前面的问题,如果我们对结构体T2的第一个字段加上____cacheline_aligned修饰,则该错误共享即可解决。

2.3.只读字段和读写字段隔离对齐
只读字段和读写字段隔离对齐的目的就是为了尽量保证那些只读字段和读写字段分别集中在CACHE的不同CACHE LINE中。由于只读字段几乎不需要进行更新,因而能在CACHE中得以稳定的缓存,减少由于混合有读写字段导致的对应CACHE LINE的频繁失效问题,以便提高效率;而读写字段相对集中在一起,这样也能保证当程序读写结构体时,污染的CACHE LINE条数也就相对的较少。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
typedef struct {
    /* ro data */
    size_t block_count;     // number of total blocks
    size_t meta_block_size; // sizeof per skb meta block
    size_t data_block_size; // sizeof per skb data block
    u8 *meta_base_addr;     // base address of skb meta buffer
    u8 *data_base_addr;     // base address of skb data buffer
    /* rw data */
    size_t current_index    ____cacheline_aligned;  // index
} bc_buff, * bc_buff_t;

3.申请地址连续的内存空间
随着地址空间由32位转到64位,页内存管理的目录分级也越来越多,4级的目录地址转换也是一笔不小是开销。硬件产商为我们提供了TLB缓冲,加速虚拟地址到物理地址的换算。但是,毕竟TLB是有限,对地址连续的内存空间进行访问时,TLB能得到更多的命中,同时CACHE高速缓冲命中的几率也更大。
两段代码,实现同一功能,但第一种方法在实际使用中,内存读写效率就会相对较好,特别是在申请的内存很大时(未考虑malloc异常):
方法一:

1
2
3
4
5
6
7
8
9
#define MAX 100
int i;
char *p;
struct box_d *box[MAX];
p = (char *)malloc(sizeof(struct box_d) * MAX);
for (i = 0; i < MAX; i ++)
{
    box[i] = (struct box_d *)(p + sizeof(struct box_d) * i);
}

方法二:

1
2
3
4
5
6
7
#define MAX 100
int i;
struct box_d *box[MAX];
for (i = 0; i < MAX; i ++)
{
    box[i] = (struct box_d *)malloc(sizeof(struct box_d));
}

另外,如果我们使用更大页面(比如2M或1G)的分页机制,同样能够提升性能;因为相比于原本每页4K大小的分页机制,应用程序申请同样大小的内存,大页面分页机制需要的页面数目更少,从而占用的TLB项目也更少,减少虚拟地址到物理地址的转换次数的同时,提高TLB的命中率,缩短每次转换所需要的时间。因为大多数操作系统在分配内存时候都需要按页对齐,所以大页面分页机制的缺点就是内存浪费相对比较严重。只有在物理内存足够充足的情况下,大页面分页机制才能够体现出优势。

4.其它优化
4.1.预读指令读内存
提前预取内存中数据到CACHE内,提高CACHE的命中率,加速内存读取速度,这是设计预读指令的主要目的。如果当前运算复杂度比较高,那么预取和运算就可同步进行,从而消除下一步内存访问的时延。相应的预读汇编指令有prefetch0、prefetch1、prefetch2、 prefetchnta。
预取指令只是给CPU一个提示,所以它可被CPU忽略,而且就算预取一段错误的地址也不会导致CPU异常。一般使用prefetchnta预取指令,因为它不会污染CACHE,它把每次取得的数据都存放到L2 CACHE的第一条CACHE LINE,而另外几条指令会替换CACHE中最近最少使用的CACHE LINE。

4.2.非暂时移动指令写内存
我们知道为了保证CACHE与内存之间的数据一致性,CPU对CACHE的写操作主要有两种方式同步到内存,写透式(Write Through)和写回式(Write-back)。不管哪种同步方式都是要消耗性能的,而在某些情况下,写CACHE是不必要的:
有哪些情况不需要写CACHE呢?比如做数据拷贝(高效memcpy函数实现)时,或者我们已经知道写的数据在最近一段时间内(或者永远)都不会再使用了,那么此时就可以不用写CACHE,让对应的CACHE LINE自动失效,以便缓存其它数据。这在某些特殊场景非常有用,相应的汇编指令有movntq、movntsd、movntss、movntps、movntpd、movntdq、movntdqa。
完整的利用预读指令和非暂时移动指令实现的高速内存拷贝函数:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
void X_aligned_memcpy_sse2(void* dest, const void* src, const unsigned long size_t)
{
  __asm
  {
    mov esi, src;    //src pointer
    mov edi, dest;   //dest pointer
    mov ebx, size_t; //ebx is our counter
    shr ebx, 7;      //divide by 128 (8 * 128bit registers)
    loop_copy:
      prefetchnta 128[ESI]; //SSE2 prefetch
      prefetchnta 160[ESI];
      prefetchnta 192[ESI];
      prefetchnta 224[ESI];
      movdqa xmm0, 0[ESI]; //move data from src to registers
      movdqa xmm1, 16[ESI];
      movdqa xmm2, 32[ESI];
      movdqa xmm3, 48[ESI];
      movdqa xmm4, 64[ESI];
      movdqa xmm5, 80[ESI];
      movdqa xmm6, 96[ESI];
      movdqa xmm7, 112[ESI];
      movntdq 0[EDI], xmm0; //move data from registers to dest
      movntdq 16[EDI], xmm1;
      movntdq 32[EDI], xmm2;
      movntdq 48[EDI], xmm3;
      movntdq 64[EDI], xmm4;
      movntdq 80[EDI], xmm5;
      movntdq 96[EDI], xmm6;
      movntdq 112[EDI], xmm7;
      add esi, 128;
      add edi, 128;
      dec ebx;
      jnz loop_copy; //loop please
    loop_copy_end:
  }
}

总结
要高效的访问内存,必须充分利用系统CACHE的缓存功能,因为就目前来说,CACHE的访问速度比内存快太多了。具体优化方法有:
1.用设计上压缩结构体大小。
2.结构体尽量做到机器字(倍数)对齐。
3.结构体中频繁访问的字段尽量放在机器字对齐的位置。
4.频繁读写的多个结构体变量尽量同时申请,使得它们尽可能的分布在较小的线性空间范围内,这样可利用TLB缓冲。
5.当结构体比较大时,对结构体字段进行初始化或设置值时最好从第一个字段依次往后进行,这样可保证对内存的访问是顺序进行。
6.额外的优化可以采用非暂时移动指令(如movntdq)与预读指令(如prefetchnta)。
7.特殊情况可考虑利用多媒体指令SSE2、SSE4等。
当然,上面某些步骤之间存在冲突,比如压缩结构体和结构体对齐,这就需要实际综合考虑。

linux SMP 关于cpu的几个变量和宏

最近有点迷惑kernel关于cpu枚举的几个宏和全局变量,看了一下文章感觉清醒一点

Overview

This page is a completely impromptu coverage of managing multiple CPUs, cpumasks, iterating through all possible CPUs and so on. I didn’t think it was going to be that big a deal, but there’s a lot more going on here than you might think. And plenty more to come, depending on how ambitious I get.

If you want to keep up with notifications on new Linux kernel tutorials, you’re welcome to follow me on Twitter athttps://twitter.com/rpjday.

Files

A tentative list of kernel source files related to this topic (list will probably change over time as I run across additional entries to add or delete some of these that aren’t really relevant):

  • include/linux/
    • cpu.h
    • cpumask.h
    • smp.h
    • smpboot.h
  • arch/x86/
    • configs/x86_64_defconfig
    • kernel/smpboot.c
    • include/asm/
      • smp.h
      • cpumask.h
  • kernel/
    • cpu.c
    • smp.c
    • smpboot.h
    • smpboot.c
  • lib/cpumask.c

What am I missing?

Your current setup

All this will be demonstrated on my quad-core ASUS laptop so, technically, I will have eight processors.

First, let’s verify that:

$ grep ^processor /proc/cpuinfo
processor	: 0
processor	: 1
processor	: 2
processor	: 3
processor	: 4
processor	: 5
processor	: 6
processor	: 7
$

Also, check under sysfs:

$ ls -1 /sys/devices/system/cpu
cpu0
cpu1
cpu2
cpu3
cpu4
cpu5
cpu6
cpu7
cpufreq
cpuidle
... snip ...

So far, so good. Onward.

NR_CPUS versus nr_cpu_ids

The variable NR_CPUS scattered throughout the kernel source is not your actual number of CPUs, it’s the maximumnumber set by the config process found in arch/x86/Kconfig:

config NR_CPUS
        int "Maximum number of CPUs" if SMP && !MAXSMP
        range 2 8 if SMP && X86_32 && !X86_BIGSMP
        range 2 512 if SMP && !MAXSMP
        default "1" if !SMP
        default "4096" if MAXSMP
        default "32" if SMP && (X86_NUMAQ || X86_SUMMIT || X86_BIGSMP || X86_ES7000)
        default "8" if SMP
        ---help---
          This allows you to specify the maximum number of CPUs which this
          kernel will support.  The maximum supported value is 512 and the
          minimum value which makes sense is 2.

          This is purely to save memory - each supported CPU adds
          approximately eight kilobytes to the kernel image.

In my case, I can verify that my .config file contains:

CONFIG_NR_CPUS=256

That’s different from the global variable nr_cpu_ids initialized in kernel/smp.c:

/* Setup number of possible processor ids */
int nr_cpu_ids __read_mostly = NR_CPUS;
EXPORT_SYMBOL(nr_cpu_ids);

/* An arch may set nr_cpu_ids earlier if needed, so this would be redundant */
void __init setup_nr_cpu_ids(void)
{
        nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
}

You can see the initialization scattered throughout the dmesg file at boot time (the following lines are not sequential in that file):

[    0.000000] smpboot: Allowing 16 CPUs, 8 hotplug CPUs
[    0.000000] setup_percpu: NR_CPUS:256 nr_cpumask_bits:256 nr_cpu_ids:16 nr_node_ids:1
[    0.000000] SLUB: Genslabs=15, HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=1
[    0.000000]  RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=16.
[    0.344852] smpboot: Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7
[    0.437322] Brought up 8 CPUs
[    0.437326] smpboot: Total of 8 processors activated (31925.90 BogoMIPS)

The default value for NR_CPUS from the x86_64 defconfig file is:

CONFIG_NR_CPUS=64

I just used an overly generous, explicit value of 256. No reason.

Command-line parameters controlling SMP

From kernel/smp.c, there appear to be three kernel command-line parameters you can play with related to this topic:

  • nosmp
  • nr_cpus=
  • maxcpus=
/* Setup configured maximum number of CPUs to activate */
unsigned int setup_max_cpus = NR_CPUS;
EXPORT_SYMBOL(setup_max_cpus);

/*
 * Setup routine for controlling SMP activation
 *
 * Command-line option of "nosmp" or "maxcpus=0" will disable SMP
 * activation entirely (the MPS table probe still happens, though).
 *
 * Command-line option of "maxcpus=<NUM>", where <NUM> is an integer
 * greater than 0, limits the maximum number of CPUs activated in
 * SMP mode to <NUM>.
 */

void __weak arch_disable_smp_support(void) { }

static int __init nosmp(char *str)
{
        setup_max_cpus = 0;
        arch_disable_smp_support();

        return 0;
}

early_param("nosmp", nosmp);

/* this is hard limit */
static int __init nrcpus(char *str)
{
        int nr_cpus;

        get_option(&str, &nr_cpus);
        if (nr_cpus > 0 && nr_cpus < nr_cpu_ids)
                nr_cpu_ids = nr_cpus;

        return 0;
}

early_param("nr_cpus", nrcpus);

static int __init maxcpus(char *str)
{
        get_option(&str, &setup_max_cpus);
        if (setup_max_cpus == 0)
                arch_disable_smp_support();

        return 0;
}

early_param("maxcpus", maxcpus);

/* Setup number of possible processor ids */
int nr_cpu_ids __read_mostly = NR_CPUS;
EXPORT_SYMBOL(nr_cpu_ids);

/* An arch may set nr_cpu_ids earlier if needed, so this would be redundant */
void __init setup_nr_cpu_ids(void)
{
        nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
}

Bringing up the CPUs

Also from kernel/smp.c:

/* Called by boot processor to activate the rest. */
void __init smp_init(void)
{
        unsigned int cpu;

        idle_threads_init();

        /* FIXME: This should be done in userspace --RR */
        for_each_present_cpu(cpu) {
                if (num_online_cpus() >= setup_max_cpus)
                        break;
                if (!cpu_online(cpu))
                        cpu_up(cpu);
        }

        /* Any cleanup work */
        printk(KERN_INFO "Brought up %ld CPUs\n", (long)num_online_cpus());
        smp_cpus_done(setup_max_cpus);
}

which explains the line you saw in dmesg (in my case):

[    0.437322] Brought up 8 CPUs

The cpumask for identifying CPUs

Pretty much everything you need is declared/defined in the header file include/linux/cpumask.h:

/*
 * The following particular system cpumasks and operations manage
 * possible, present, active and online cpus.
 *
 *     cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
 *     cpu_present_mask - has bit 'cpu' set iff cpu is populated
 *     cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
 *     cpu_active_mask  - has bit 'cpu' set iff cpu available to migration
... snip ...
#if NR_CPUS > 1
#define num_online_cpus()       cpumask_weight(cpu_online_mask)
#define num_possible_cpus()     cpumask_weight(cpu_possible_mask)
#define num_present_cpus()      cpumask_weight(cpu_present_mask)
#define num_active_cpus()       cpumask_weight(cpu_active_mask)
#define cpu_online(cpu)         cpumask_test_cpu((cpu), cpu_online_mask)
#define cpu_possible(cpu)       cpumask_test_cpu((cpu), cpu_possible_mask)
#define cpu_present(cpu)        cpumask_test_cpu((cpu), cpu_present_mask)
#define cpu_active(cpu)         cpumask_test_cpu((cpu), cpu_active_mask)
#else

and you can use these macros to iterate through your CPUs:

#define for_each_cpu(cpu, mask)                         \
        for ((cpu) = -1;                                \
                (cpu) = cpumask_next((cpu), (mask)),    \
                (cpu) < nr_cpu_ids;)

#define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)
#define for_each_online_cpu(cpu)   for_each_cpu((cpu), cpu_online_mask)
#define for_each_present_cpu(cpu)  for_each_cpu((cpu), cpu_present_mask)

and so on, and so on. There is a ridiculous amount of detail you can get into here, but this is a start until I decide how to rewrite it. Feel free to scan the kernel source tree to see the numerous calls to for_each_cpu()for_each_online_cpu(),for_each_possible_cpu(), etc.

关于cpumask的一些解释:

/*
* The following particular system cpumasks and operations manage
* possible, present, active and online cpus.
*
* cpu_possible_mask- has bit ‘cpu’ set iff cpu is populatable
* cpu_present_mask – has bit ‘cpu’ set iff cpu is populated
* cpu_online_mask – has bit ‘cpu’ set iff cpu available to scheduler
* cpu_active_mask – has bit ‘cpu’ set iff cpu available to migration
*
* If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
*
* The cpu_possible_mask is fixed at boot time, as the set of CPU id’s
* that it is possible might ever be plugged in at anytime during the
* life of that system boot. The cpu_present_mask is dynamic(*),
* representing which CPUs are currently plugged in. And
* cpu_online_mask is the dynamic subset of cpu_present_mask,
* indicating those CPUs available for scheduling.
*
* If HOTPLUG is enabled, then cpu_possible_mask is forced to have
* all NR_CPUS bits set, otherwise it is just the set of CPUs that
* ACPI reports present at boot.
*
* If HOTPLUG is enabled, then cpu_present_mask varies dynamically,
* depending on what ACPI reports as currently plugged in, otherwise
* cpu_present_mask is just a copy of cpu_possible_mask.
*
* (*) Well, cpu_present_mask is dynamic in the hotplug case. If not
* hotplug, it’s a copy of cpu_possible_mask, hence fixed at boot.
*
* Subtleties:
* 1) UP arch’s (NR_CPUS == 1, CONFIG_SMP not defined) hardcode
* assumption that their single CPU is online. The UP
* cpu_{online,possible,present}_masks are placebos. Changing them
* will have no useful affect on the following num_*_cpus()
* and cpu_*() macros in the UP case. This ugliness is a UP
* optimization – don’t waste any instructions or memory references
* asking if you’re online or how many CPUs there are if there is
* only one CPU.
*/