momo zone

A kernel tuner's blog

Monthly Archives: June 2015

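kdbus was not merged into kernel 4.1

On the kernel mailing list recently, Linus Torvalds gave his own explanation of why the merge did not happen:

kdbus is sold on performance, but dbus performs badly because of terrible user-space code, as if it had been written by a monkey on crack. I can't push another body of code into the kernel just because the user-space code is crap.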

With the new Linux kernel mailing list thread about the prospects of merging KDBUS into the mainline Linux kernel, Linus Torvalds has provided his thoughts on the matter for this controversial feature backed by systemd developers for trying to provide a high-performance, kernel-based IPC solution.

Linus put it quite simply that he still is planning to merge it once it’s been reviewed and called for inclusion (after failing to be merged for Linux 4.1), “So I am still expecting to merge it, mainly for a rather simple reason: I trust my submaintainers, and Greg in particular. So when a major submaintainer wants to merge something, that pulls a *lot* of weight with me.”

One of the major selling points of KDBUS has been “better performance” than the user-space D-Bus solution on which it is based. However, this doesn’t carry much weight with Torvalds. He explained, “I have to admit to being particularly disappointed with the performance argument for merging it. Having looked at the dbus performance, and come to the conclusion that the reason dbus performs abysmally badly is just pure shit user space code, I am not AT ALL impressed by the performance argument. We don’t merge kernel code just because user space was written by a retarded monkey on crack. Kernel code has higher standards, and yes, that also means that it tends to perform better, but no, ‘user space code is shit’ is not a valid reason for pushing things into the kernel. So quite frankly, the “better performance” argument is bogus in my opinion. That still leaves other arguments, but it does weaken the case for kdbus quite a bit.”

RADVD and DHCPv6

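These two are easy to confuse. At first I thought RADVD was meant to replace DHCPv6, but that turned out not to be the case.
RADVD stands for Router ADVertisement Daemon. It sends the Router Advertisement (RA) carried in ICMPv6 messages, from which clients zero-configure their IPv6 prefix and gateway. The receiving side is part of the kernel's IPv6 stack, so there is no client program. The kernel handles the RA in an incoming ICMPv6 message along this path:
icmpv6_rcv() -> ndisc_rcv() -> ndisc_router_discovery() -> rt6_route_rcv(), mostly in net/ipv6/ndisc.c

One branch in ndisc_router_discovery() deserves attention, ndisc_ra_useropt():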

net/ipv6/ndisc.c:

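static void ndisc_ra_useropt(struct sk_buff *ra, struct nd_opt_hdr *opt)
{
    struct icmp6hdr *icmp6h = (struct icmp6hdr *)skb_transport_header(ra);
    struct sk_buff *skb;
    struct nlmsghdr *nlh;
    struct nduseroptmsg *ndmsg;
    struct net *net = dev_net(ra->dev);
    int err;
    /* nd_opt_len counts 8-octet units, so << 3 converts it to bytes */
    int base_size = NLMSG_ALIGN(sizeof(struct nduseroptmsg)
                                + (opt->nd_opt_len << 3));
    size_t msg_size = base_size + nla_total_size(sizeof(struct in6_addr));

    skb = nlmsg_new(msg_size, GFP_ATOMIC);
    if (!skb) {
        err = -ENOBUFS;
        goto errout;
    }

    nlh = nlmsg_put(skb, 0, 0, RTM_NEWNDUSEROPT, base_size, 0);
    if (!nlh)
        goto nla_put_failure;

    ndmsg = nlmsg_data(nlh);
    ndmsg->nduseropt_family = AF_INET6;
    ndmsg->nduseropt_ifindex = ra->dev->ifindex;
    ndmsg->nduseropt_icmp_type = icmp6h->icmp6_type;
    ndmsg->nduseropt_icmp_code = icmp6h->icmp6_code;
    ndmsg->nduseropt_opts_len = opt->nd_opt_len << 3;

    /* the raw option follows the header unchanged */
    memcpy(ndmsg + 1, opt, opt->nd_opt_len << 3);

    /* kernels from 4.1 on spell this nla_put_in6_addr() */
    if (nla_put(skb, NDUSEROPT_SRCADDR, sizeof(struct in6_addr),
                &ipv6_hdr(ra)->saddr))
        goto nla_put_failure;
    nlmsg_end(skb, nlh);

    rtnl_notify(skb, net, 0, RTNLGRP_ND_USEROPT, NULL, GFP_ATOMIC);
    return;

nla_put_failure:
    nlmsg_free(skb);
    err = -EMSGSIZE;
errout:
    rtnl_set_sk_err(net, RTNLGRP_ND_USEROPT, err);
}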

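RA has an extension, RFC 5006 (http://tools.ietf.org/html/rfc5006), that lets an RA carry DNS configuration. The kernel, however, has never handled DNS configuration; that is left to user space. ndisc_ra_useropt exists to deal with this kind of extended option. All it does is pass the option over netlink to applications listening on the matching netlink group (RTNLGRP_ND_USEROPT); the application then handles options of type ND_OPT_RDNSS.

The useropt types are defined in include/net/ndisc.h: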

enum {
    __ND_OPT_PREFIX_INFO_END = 0,
    ND_OPT_SOURCE_LL_ADDR = 1,  /* RFC2461 */
    ND_OPT_TARGET_LL_ADDR = 2,  /* RFC2461 */
    ND_OPT_PREFIX_INFO = 3,     /* RFC2461 */
    ND_OPT_REDIRECT_HDR = 4,    /* RFC2461 */
    ND_OPT_MTU = 5,             /* RFC2461 */
    __ND_OPT_ARRAY_MAX,
    ND_OPT_ROUTE_INFO = 24,     /* RFC4191 */
    ND_OPT_RDNSS = 25,          /* RFC5006 */
    ND_OPT_DNSSL = 31,          /* RFC6106 */
    __ND_OPT_MAX
};
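To make the option layout concrete, here is a minimal sketch of how a user-space consumer might decode one RDNSS option once it has the raw bytes. It assumes the RFC 5006 wire format (type, length in 8-octet units, two reserved bytes, a 32-bit lifetime, then 16-byte addresses). The helper names rdnss_addr_count, rdnss_lifetime and rdnss_addr_str are made up for this example and do not come from the kernel or libndp.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OPT_RDNSS     25   /* same value as ND_OPT_RDNSS */
#define RDNSS_HDR_LEN 8    /* type + length + reserved + lifetime */
#define ADDR_LEN      16   /* one IPv6 address */

/* Hypothetical helper: number of DNS server addresses in one RDNSS option,
 * or -1 if the buffer does not hold a well-formed option. */
int rdnss_addr_count(const uint8_t *opt, size_t buf_len)
{
    size_t opt_len;

    assert(opt != NULL);
    if (buf_len < RDNSS_HDR_LEN || opt[0] != OPT_RDNSS)
        return -1;
    opt_len = (size_t)opt[1] << 3;      /* length field is in 8-octet units */
    if (opt_len < RDNSS_HDR_LEN + ADDR_LEN || opt_len > buf_len)
        return -1;
    if ((opt_len - RDNSS_HDR_LEN) % ADDR_LEN != 0)
        return -1;
    return (int)((opt_len - RDNSS_HDR_LEN) / ADDR_LEN);
}

/* Hypothetical helper: lifetime in seconds (big-endian on the wire).
 * The caller must have validated the option with rdnss_addr_count(). */
uint32_t rdnss_lifetime(const uint8_t *opt)
{
    uint32_t be;

    assert(opt != NULL);
    memcpy(&be, opt + 4, sizeof(be));
    return ntohl(be);
}

/* Hypothetical helper: write address number idx into out as text.
 * Returns 0 on success, -1 if idx is out of range or conversion fails. */
int rdnss_addr_str(const uint8_t *opt, size_t buf_len, int idx,
                   char *out, size_t out_len)
{
    struct in6_addr addr;
    int n = rdnss_addr_count(opt, buf_len);

    if (n < 0 || idx < 0 || idx >= n)
        return -1;
    memcpy(&addr, opt + RDNSS_HDR_LEN + (size_t)idx * ADDR_LEN, sizeof(addr));
    return inet_ntop(AF_INET6, &addr, out, (socklen_t)out_len) ? 0 : -1;
}
```

A listener on RTNLGRP_ND_USEROPT would find these bytes right after the struct nduseroptmsg header, since the kernel copies the option there unchanged.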

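So which user-space program ends up consuming ND_OPT_RDNSS?
On most desktop distributions it is NetworkManager, which reads the option with ndp_msg_opt_rdnss_addr from libndp. (libndp appears to receive RAs on its own raw ICMPv6 socket rather than through the RTNLGRP_ND_USEROPT netlink group, so the netlink path above is the one other listeners would use.)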


/**
 * ndp_msg_opt_rdnss_addr:
 * @msg: message structure
 * @offset: in-message offset
 * @addr_index: address index
 *
 * Get Recursive DNS Server address.
 * User should use this function only inside ndp_msg_opt_for_each_offset()
 * macro loop.
 *
 * Returns: address.
 **/
NDP_EXPORT
struct in6_addr *ndp_msg_opt_rdnss_addr(struct ndp_msg *msg, int offset,
                                        int addr_index)
{
    static struct in6_addr addr;
    struct __nd_opt_rdnss *rdnss =
            ndp_msg_payload_opts_offset(msg, offset);
    size_t len = rdnss->nd_opt_rdnss_len << 3; /* convert to bytes */

    len -= in_struct_offset(struct __nd_opt_rdnss, nd_opt_rdnss_addresses);
    if ((addr_index + 1) * sizeof(addr) > len)
        return NULL;
    memcpy(&addr, &rdnss->nd_opt_rdnss_addresses[addr_index * sizeof(addr)],
           sizeof(addr));
    return &addr;
}

NetworkManager does this in its src/rdisc/nm-lndp-rdisc.c to obtain the DNS information:


static int
receive_ra (struct ndp *ndp, struct ndp_msg *msg, gpointer user_data)
{

..................

    /* DNS information */
    ndp_msg_opt_for_each_offset(offset, msg, NDP_MSG_OPT_RDNSS) {
        static struct in6_addr *addr;
        int addr_index;

        ndp_msg_opt_rdnss_for_each_addr (addr, addr_index, msg, offset) {
            NMRDiscDNSServer dns_server;

            memset (&dns_server, 0, sizeof (dns_server));
            dns_server.address = *addr;
            dns_server.timestamp = now;
            dns_server.lifetime = ndp_msg_opt_rdnss_lifetime (msg, offset);
            /* Pad the lifetime somewhat to give a bit of slack in cases
             * where one RA gets lost or something (which can happen on unreliable
             * links like WiFi where certain types of frames are not retransmitted).
             * Note that 0 has special meaning and is therefore not adjusted.
             */
            if (dns_server.lifetime && dns_server.lifetime < 7200)
                dns_server.lifetime = 7200;
            if (add_dns_server (rdisc, &dns_server))
                changed |= NM_RDISC_CONFIG_DNS_SERVERS;
        }

.............
}

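Looking back: why not simply use DHCPv6 instead of taking this detour?
It is a good question, and the answer is as follows:

1. RADVD is deployed on a router, because it was designed to give hosts IPv6 zero config. DHCP is deployed on a server, and that server is not necessarily a router.
2. RADVD is stateless. It cannot know what each host's IPv6 address is; it is only a publisher, without the complex policies and local storage that DHCP has.
3. RADVD does less than DHCP. Fortunately it already supports DNS, but other settings such as NTP are still unsupported.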
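Point 2 is easier to see with an example of what the host does on its own. The sketch below assumes the classic modified EUI-64 scheme, where the host appends an interface identifier derived from its MAC address to the advertised /64 prefix. Real hosts may use other schemes such as privacy or stable-privacy addresses; slaac_eui64 is a name invented for this example.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Combine an advertised /64 prefix with a modified EUI-64 interface
 * identifier built from a 48-bit MAC address.
 * Returns 0 on success, -1 if the prefix cannot be parsed or out is too small. */
int slaac_eui64(const char *prefix, const uint8_t mac[6],
                char *out, size_t out_len)
{
    struct in6_addr addr;

    assert(prefix != NULL && mac != NULL && out != NULL);
    if (inet_pton(AF_INET6, prefix, &addr) != 1)
        return -1;

    /* The lower 64 bits become MAC[0..2] ff:fe MAC[3..5],
     * with the universal/local bit of the first byte flipped. */
    addr.s6_addr[8]  = mac[0] ^ 0x02;
    addr.s6_addr[9]  = mac[1];
    addr.s6_addr[10] = mac[2];
    addr.s6_addr[11] = 0xff;
    addr.s6_addr[12] = 0xfe;
    addr.s6_addr[13] = mac[3];
    addr.s6_addr[14] = mac[4];
    addr.s6_addr[15] = mac[5];

    return inet_ntop(AF_INET6, &addr, out, (socklen_t)out_len) ? 0 : -1;
}
```

The router only ever announces the prefix; the final address is computed on the host, which is why the advertising side has no record of it.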

Addendum: enabling RADVD requires adjusting some protocol-stack parameters, which ties in with the first point above.

forwarding

Type: BOOLEAN

Default: FALSE if global forwarding is disabled (default), otherwise TRUE

Configure interface-specific Host/Router behaviour.

Note: It is recommended to have the same setting on all interfaces; mixed router/host scenarios are rather uncommon.

Value FALSE: By default, Host behaviour is assumed. This means:

1. IsRouter flag is not set in Neighbour Advertisements.
2. Router Solicitations are being sent when necessary.
3. If accept_ra is TRUE (default), accept Router Advertisements (and do autoconfiguration).
4. If accept_redirects is TRUE (default), accept Redirects.

Value TRUE: If local forwarding is enabled, Router behaviour is assumed. This means exactly the reverse from the above:

1. IsRouter flag is set in Neighbour Advertisements.
2. Router Solicitations are not sent.
3. Router Advertisements are ignored.
4. Redirects are ignored.

By setting forwarding to TRUE your machine will behave like a router. You either need to force it to accept Router Advertisements (RAs) or you need to manually configure your addresses and routes.

To enable autoconf even when using forwarding use this:

accept_ra

Type: BOOLEAN

Accept Router Advertisements; autoconfigure using them.

Possible values are:

0: Do not accept Router Advertisements.
1: Accept Router Advertisements if forwarding is disabled.
2: Overrule forwarding behaviour. Accept Router Advertisements even if forwarding is enabled.

Yes, it’s a boolean with values 0, 1 and 2 🙂
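The interaction between forwarding and accept_ra described above can be written down as a small truth function. This is only an illustration of the documented rule, not kernel code, and ra_accepted is a name chosen for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: does an interface process Router Advertisements,
 * given its forwarding flag and its accept_ra value (0, 1 or 2)? */
bool ra_accepted(int forwarding, int accept_ra)
{
    assert(accept_ra >= 0 && accept_ra <= 2);
    if (accept_ra == 2)
        return true;                    /* overrules forwarding */
    return accept_ra == 1 && !forwarding;
}
```

On a running system the two inputs can be read from /proc/sys/net/ipv6/conf/<interface>/forwarding and /proc/sys/net/ipv6/conf/<interface>/accept_ra.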

 

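Addendum 2:

Now a look at netconfig.

I went through all the material I could find, and the clearest description is still the man page. Note that this means netconfig(8); netconfig(5) is the configuration file used by the RPC library on Unix. To recap, distributions offer the following ways to configure the network:

NetworkManager (NM)

network service scripts: ifcfg-*, ifup and so on

The latter is powerful and can persist the configuration for the vast majority of cases. But it lacks dynamic behaviour, for example joining a network and then changing the address, netmask, gateway or DNS by hand, and it does not support access methods such as WiFi, VPN or PPP. NM overcomes these shortcomings of the scripts with a very complex network management system written in C (about 700,000 lines of code).

NM has a --with-netconfig option at build time, which defaults to yes on SUSE. With it, whenever NM automatically configures the DNS servers, DNS search list and similar information, it calls the netconfig script to manage and generate the configuration files instead of using its internal implementation.

There is also a --with-resolvconf option. With it, NM calls the resolvconf script for the same job instead of using its internal implementation.

If neither option is enabled, NM modifies /etc/resolv.conf directly.

The code is in the update_dns function in src/dns-manager/nm-dns-manager.c.

NM invokes netconfig and writes the DNS configuration to its standard input through a pipe: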

src/dns-manager/nm-dns-manager.c:


#if defined(NETCONFIG_PATH)
/**********************************/
/* SUSE */

static void
netconfig_child_setup (gpointer user_data G_GNUC_UNUSED)
{
    pid_t pid = getpid ();
    setpgid (pid, pid);

    /*
     * We blocked signals in main(). We need to restore original signal
     * mask for netconfig here so that it can receive signals.
     */
    nm_unblock_posix_signals (NULL);
}

static GPid
run_netconfig (GError **error, gint *stdin_fd)
{
    char *argv[5];
    char *tmp;
    GPid pid = -1;

    argv[0] = NETCONFIG_PATH;
    argv[1] = "modify";
    argv[2] = "--service";
    argv[3] = "NetworkManager";
    argv[4] = NULL;

    tmp = g_strjoinv (" ", argv);
    nm_log_dbg (LOGD_DNS, "spawning '%s'", tmp);
    g_free (tmp);

    if (!g_spawn_async_with_pipes (NULL, argv, NULL, 0, netconfig_child_setup,
                                   NULL, &pid, stdin_fd, NULL, NULL, error))
        return -1;

    return pid;
}

static void
write_to_netconfig (gint fd, const char *key, const char *value)
{
    char *str;
    int x;

    str = g_strdup_printf ("%s='%s'\n", key, value);
    nm_log_dbg (LOGD_DNS, "writing to netconfig: %s", str);
    x = write (fd, str, strlen (str));
    g_free (str);
}

static gboolean
dispatch_netconfig (char **searches,
                    char **nameservers,
                    const char *nis_domain,
                    char **nis_servers,
                    GError **error)
{
    char *str, *tmp;
    GPid pid;
    gint fd;
    int ret;

    pid = run_netconfig (error, &fd);
    if (pid < 0)
        return FALSE;

    /* NM is writing already-merged DNS information to netconfig, so it
     * does not apply to a specific network interface.
     */
    write_to_netconfig (fd, "INTERFACE", "NetworkManager");

    if (searches) {
        str = g_strjoinv (" ", searches);

        write_to_netconfig (fd, "DNSSEARCH", str);
        g_free (str);
    }

    if (nameservers) {
        str = g_strjoinv (" ", nameservers);
        write_to_netconfig (fd, "DNSSERVERS", str);
        g_free (str);
    }

    if (nis_domain)
        write_to_netconfig (fd, "NISDOMAIN", nis_domain);

    if (nis_servers) {
        str = g_strjoinv (" ", nis_servers);
        write_to_netconfig (fd, "NISSERVERS", str);
        g_free (str);
    }

    close (fd);

    /* Wait until the process exits */

again:

    ret = waitpid (pid, NULL, 0);
    if (ret < 0 && errno == EINTR)
        goto again;
    else if (ret < 0 && errno == ECHILD) {
        /* When netconfig has already exited, the errno is ECHILD; it should return TRUE */
        return TRUE;
    }

    return ret > 0;
}
#endif
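What netconfig receives on standard input is just a series of KEY='value' lines, with list values joined by spaces. The following sketch builds the same text without GLib so the format is easy to see. append_kv and append_str are hypothetical helpers written for this example, not NetworkManager functions, and like the original they do no escaping of quotes inside values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append s to the NUL-terminated string in buf.
 * Returns 0 on success, -1 if it would not fit. */
static int append_str(char *buf, size_t buf_len, const char *s)
{
    size_t used = strlen(buf);
    size_t need = strlen(s);

    if (used + need + 1 > buf_len)
        return -1;
    memcpy(buf + used, s, need + 1);    /* copies the terminating NUL too */
    return 0;
}

/* Append one "KEY='v1 v2 ...'\n" line to buf.
 * values is a NULL-terminated array of strings.
 * Returns 0 on success, -1 if buf is too small (buf may then hold a
 * partial line). */
int append_kv(char *buf, size_t buf_len, const char *key,
              const char *const *values)
{
    size_t i;

    assert(buf != NULL && key != NULL && values != NULL);
    if (append_str(buf, buf_len, key) || append_str(buf, buf_len, "='"))
        return -1;
    for (i = 0; values[i] != NULL; i++) {
        if (i > 0 && append_str(buf, buf_len, " "))
            return -1;
        if (append_str(buf, buf_len, values[i]))
            return -1;
    }
    return append_str(buf, buf_len, "'\n");
}
```

Calling append_kv once per key on an initially empty buffer yields the text that would be written to the pipe before it is closed.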