momo zone


Monthly Archives: July 2011


KDE 4.5 seems to have a serious bug: I opened an NTFS partition with Dolphin and the whole of KDE crashed, along with the kernel. After the hard reboot the partition could not be read; the kernel reported uncorrectable sectors, and only after a dozen or so such messages did initialization continue. Under Windows, chkdsk i: /f found that the partition's security descriptor could not be opened; fortunately it was recoverable. Presumably it sat right on the bad sector. Of course, a filesystem repair cannot fix bad sectors, and infuriatingly this drive won't run smartctl --test=long and can't do a surface scan either. Absolutely terrible...

I found an article that looks at handling bad disk sectors properly from the perspective of S.M.A.R.T. technology:

[Repost] Citrix IMA Architecture and Principles

by Brian Madden

Remember that a Citrix Presentation Server farm is really just a database (called the IMA Data Store), and Presentation Servers are said to be part of the same farm if they’re sharing the same data store. The data store stores all configuration information for all farm servers. When a Presentation Server starts up (or, more correctly, when the IMA service on a Presentation Server starts up), the following process takes place:

  1. The IMA service checks the registry to find out which DSN contains the connection information for the data store. This registry location is HKLM\SOFTWARE\Citrix\IMA\DataSourceName
  2. By default, that registry key points to a file called MF20.dsn in the “%ProgramFiles%\Citrix\Independent Management Architecture” folder.
  3. The IMA service connects to the database specified in that DSN file. (Database credentials are encrypted and stored in the registry.)
  4. The IMA service downloads information that pertains to it from the central database into a local MS Jet (Access) database.
  5. Throughout its operation, the IMA service interacts with the locally cached subset of the central data store.
  6. Every 30 minutes, the IMA service contacts the central data store to see if anything has changed.

The Local Host Cache

As previously stated, the IMA service running on each Presentation Server downloads the information it needs from the central data store into a local MDB database called the local host cache, or “LHC.” (The location of the local host cache is specified via a DSN referenced in the registry of the Presentation Server, at HKLM\SOFTWARE\Citrix\IMA\LHCDatasource\DataSourceName. By default this is a file called “Imalhc.dsn” and is stored in the same place as MF20.dsn.)

Each Presentation Server is smart enough to only download information from the data store that is relevant to it, meaning that the local host cache is unique for every server. Citrix created the local host cache for two reasons:

  • Increased Redundancy. If communication with the central data store is lost, the Presentation Server can continue to function since the information it needs is available locally.
  • Increased Speed. Since the local host cache contains information the Presentation Server refers to often, the server doesn’t have to access the IMA data store across the network every time any bit of information is needed.

The LHC is critical in a CPS environment. In fact, it’s the exclusive interface of the data store to the local server. (In other words, the local server’s IMA service only interacts with the LHC. It never contacts the central data store except when it’s updating the LHC.)

If the server loses its connection to the central data store, there’s no limit to how long it will continue to function. (In the days of MetaFrame XP, this limit was 48 or 96 hours, but that was because the data store also stored license information.) But today, the server can run forever from the LHC and won’t even skip a beat if the central connection is lost. In fact now you can even reboot the server when the central data store is down, and the IMA service will start from the LHC no problem. (Older versions of MetaFrame required a registry modification to start the IMA service from the LHC.)

The LHC file is always in use when IMA is running, so it’s not possible to delete it or anything. In theory it’s possible that this file could become corrupted, and if this happens I guess all sorts of weird things could happen to your server. If you think this is the case in your environment, you can stop the IMA service and run the command “dsmaint recreatelhc” to recreate the local host cache file, although honestly I don’t think this fixes anything very often. (I think it’s more to make people feel better. “Ahhh. I recreated the LHC, so we’ll see if the problem goes away.”)

Data Store Architecture

Now let’s take a closer look at the actual database that’s used to power the IMA data store. If you open this database with SQL Enterprise Manager (or whatever Oracle calls their database management tool), you’ll see it has four tables:


If you’re at all familiar with databases, you’re probably thinking this is kind of weird. Wouldn’t the central database of a complex product like Citrix Presentation Server have hundreds of tables? Shouldn’t there be tables that list servers, apps, users, and policies, not to mention more tables linking them all together? The reason you don’t see the database structure you’d expect is because the IMA data store is not a real relational database. It’s actually an LDAP database that Citrix sort of hacked to work on top of a relational database like SQL Server.

This is because Citrix first came up with the concept of the IMA data store when they were working on MetaFrame XP in 2000. At that time they had planned to use Active Directory as the data store instead of a database. They developed the entire MetaFrame XP product around an LDAP-based data store instead of a relational database-based data store. Then towards the end of the development process, Citrix (smartly) realized that not too many people would want to extend their AD schemas just to use Citrix, so they quickly moved to using a regular database instead. The only problem was that the entire IMA service and data store were all ready to go using LDAP, and Citrix couldn’t just re-write the entire product to use a relational database instead. The solution was that Citrix had to implement their own LDAP-like engine that runs on top of a normal relational database. (On top of all that, Citrix encrypts this whole thing, so the contents really are gobbledygook to the casual observer.)

This is the reason you can’t just access the IMA data store directly through SQL Enterprise Manager. (Well, technically you can, but if you run a query you’ll get meaningless hex results.) If you try to edit any of the contents of the data store directly in the database, you will definitely break it and have to restore from backup.

For those curious to learn more about the LDAP-like structure of the data store, there’s a tool on the Presentation Server installation CD called “dsview.” DSview is fun to play with but not really that useful.

One final word of caution: There is a tool in existence called “dsedit.” As you can probably guess from the name, dsedit is basically a “write-enabled” version of dsview. If you happen to find this tool out on the Internet, DO NOT use it in your environment! This is an internal Citrix tool that is not meant for general use.

Now if you’re thinking, “I know what I’m doing, so I can play with dsedit,” I’ll warn you again: Don’t do it! The problem is that since dsedit is an internal-only tool, it’s not externally version-controlled. Citrix has many different compiled versions of this tool for all different versions of Presentation Server (and in some cases with specifics for certain hotfixes). So if you just happen to find some random hacker site with dsedit for download, you have no idea whether that dsedit version is the version that’s compiled to work with your specific version of the data store. (Chances are it’s not.) And using the wrong version of dsedit with your data store can easily corrupt the entire store (since data store items are maintained in long HEX strings that represent the LDAP-like node items).

IMA service to data store communication

Let’s take a closer look at how a Presentation Server communicates with the central data store. We initially outlined the process that takes place when the IMA service starts up. In it, we described the IMA service downloading information from the central data store that’s used to create the local host cache. Of course if the local host cache is already on the server (and up-to-date) when IMA starts, there’s no need to download everything again.

So how does the server know whether its local host cache is current? Citrix makes this possible via a series of “sequence numbers.” Every single configuration change made to the data store is assigned a number. The number of the most recent change is stored in the local host cache. Then when the IMA service checks the central data store for changes, it only needs to download the value of the most recent sequence number. If that number is the same as what it was last time (i.e. the same number that’s in the local host cache), then no further action is needed and the server knows its local host cache is up-to-date.

If the sequence number of the most recent change in the central data store is newer than what’s in the local host cache, then more data is exchanged to determine what the changes are. If they apply to the specific server requesting the updates, they’re downloaded to that server and the local host cache is updated accordingly. If the changes do not apply to the requesting server, that server still updates the most recent sequence number in its local host cache so it can continue to look for changes in the future.

The IMA service on each Presentation Server looks for changes in the central data store every 30 minutes. You can adjust this value via the registry of the Presentation Server (CTX111914), although there’s typically no reason to do that since this exchange is less than 1k if there’s no change.

IMA Data Store Database Type

Since Citrix’s implementation of the IMA data store runs on top of a regular relational database, you can pretty much use whatever kind of database server you want. Most people end up using SQL Server, although others are supported. (See CTX112591 for a complete list.)

For smaller environments, Citrix used to recommend using a Microsoft Access database running locally on one of your Presentation Servers. Nowadays that’s not really used anymore, having been replaced by SQL Server Express. (SQL Express is free and based on “real” SQL Server technology.)

A big topic of discussion has been what constitutes a “smaller” environment? Or to be more blunt, at what point do you need to switch to using a real database instead of using Access or SQL Express? A lot of people argue about this in the online forums, with the general consensus being in the five-to-ten server range. I don’t agree though. I’ve personally seen farms (even back in the MetaFrame XP days) of 50 servers running their data stores on Access, and that was fine. Since each Presentation Server only really interacts with its local host cache, a 50-server farm using Access still wouldn’t put much strain on the Access database.

To be honest, the real problem with using Access or SQL Express for your data store is that it has to be accessed “indirectly” (to use Citrix’s term). This means that the actual files that make up your data store are physically sitting on one of your Presentation Servers. The IMA service on that server accesses the database locally, and every other server in your farm accesses the data store via the IMA protocol (on port 2512) through the Presentation Server that’s hosting it. This is bad because it’s a single point of failure. If that Presentation Server goes down, your data store won’t be accessible and you won’t be able to manage your environment.

This might not be a problem in a small farm of just a few servers, but you’ll probably want a more redundant database long before your farm outgrows this architecture from a technical capacity standpoint.

IMA Data Store Size

Another question that often comes up when designing Presentation Server environments is, “How big will this IMA data store get?” The answer, very seriously, is “Not very big!”

Of course “very big” is a relative term, but in today’s world of multi-core servers with gigabytes of memory, the data store just isn’t going to grow large enough to really matter. Citrix very roughly estimates 1MB per server. And even if you built a single farm with 1,000 servers, a 1GB database in today’s world just isn’t that big anymore.

If you want more precise numbers as to the size of your data store, the Advanced Concepts Guide for CPS 4 (CTX107159) has a chart that lists exactly how many bytes each object type needs in the data store. (I have not been able to find this info for CPS 4.5, but I’m going to assume it’s pretty close to 4.0.)

IMA Data Store replication strategy

If your server farm spans multiple physical locations, you might want to replicate your data store so that a local copy is running at each location. There are two (potential) advantages to this:

  • Redundancy. You don’t want a single database server failure to negatively impact your overall environment.
  • Performance. If your farm spans multiple WAN locations, you might want to have a local database at each location.

Before we discuss this further, I want to make a few things clear. First, we’re talking about doing a full replication of the entire data store, so that each replica is 100% identical. Unfortunately, due to the binary LDAP structure of the data store, it’s not possible to replicate just a subset of the data store to a remote site.

Second, we’re talking about replicating the data store between physical sites for site-to-site performance and redundancy reasons. If you want to cluster your data store servers, this is entirely possible, but not what we’re talking about now. (For more information about clustering your data store servers, read the High Availability chapter later in this book.)

Figure 3.x [Replicated data store between two physical locations]

Replicating your data store for redundancy

If your farm spans multiple physical locations, you might be concerned about what happens when a WAN link goes down. There’s an entire chapter later in this book dedicated to helping you design a fully-redundant environment based on everything that you’ll read up until that point. But right now we can discuss the mechanics of the data store when it comes to replication for redundancy purposes.

The first and most important thing to know is this: A Citrix Presentation Server will work indefinitely even if it loses connectivity to the central data store. (Again, remember that the local IMA service on a Presentation Server works off of its local host cache, not the central data store.) So really, before you can decide whether you want to replicate your database for redundancy purposes, you have to understand what the impact of losing connectivity to the data store actually is.

The main thing is that in order to use either one of the two CPS management consoles, you have to connect to a Citrix server that is communicating with its data store. So if your data store is lost, even though your Presentation Server will run and will accept new connections and otherwise be totally normal, you won’t actually be able to connect to that server with a management console.

What’s interesting is that this doesn’t mean that you can’t manage sessions on that server. If you can connect to a different server in your farm that is connected to the data store, then you can view all activity and all sessions from your farm–even the ones from servers that aren’t connected to the data store. But think about this for a minute. How is it possible that your management console is able to connect to a server that can access the data store, and it’s able to see servers not connected to the data store? If this is the case, wouldn’t your “down” servers also be able to see the data store?

A more likely scenario is that you have multiple WAN locations each with their own Presentation Servers all in the same farm. If a WAN link goes down and some sites do not have their own replica of the data store, the servers, sessions, and users on that site will be fine. The problem will be that admins and help desk folks won’t be able to connect to any admin consoles at that site. (And people at the site with the data store will be able to connect, but of course they won’t be able to see or manage servers from the site with the down WAN link.)

A solution to this is to replicate your data store so that if a WAN link goes down, there’s a local replica at each location. This means that local admins will be able to connect to the management tools on those local servers and perform their typical routine maintenance tasks. (Resetting sessions, shadowing, etc.)

Of course if any admin from the “down” site makes any configuration change that’s saved to the data store, that change will be lost once the WAN link comes back up and the central data store re-replicates with the local data store. (As you can imagine, “merge” replication is not possible with this binary encrypted LDAP data store format.)

Replicating your data store for performance reasons

Some people also choose to replicate their data stores to multiple locations for performance reasons. The idea is that by doing this, your Presentation Servers can always access the data store via a local network instead of via the WAN. To be honest, this probably isn’t that big of a deal. Remember that each Presentation Server interacts with its own local host cache for standard operational purposes. The central data store is only accessed to download additional configuration changes. Sure, recreating the local host cache will require the download of all the contents to rebuild the MDB cache file, but that too is not typically very large. (A few megabytes maybe?) And if your WAN can’t support the transfer of a few megabytes every once in a while, then you probably shouldn’t have a single farm that spans multiple sites anyway.

All that said, it’s a nice “clean” solution when all the Presentation Servers of a remote location can access everything they need on their own local LAN, and there’s certainly nothing wrong with that scenario.

Advantages of replicating

  • You can manage your servers when the WAN is down
  • Less WAN traffic (Read the “zones” section of this chapter to understand why.)
  • It just “feels” better, especially for a global environment

Disadvantages of replicating

  • More complex
  • Additional database servers required

Configuring IMA data store replication

If you decide that you’d like to replicate your data store, you’ll need to do two things:

  1. Configure the database software for replication
  2. Reconfigure your Presentation Servers to point them towards the local replica

Configuring the database for replication

All the real database servers support replication. (i.e. if you want to do this, you can’t use Access or SQL Express.) Configuring the replication of your data store is 100% a function of your database software. In fact, your Citrix servers won’t even know they’re connecting to a database replica versus the real thing.

If you’ve never done this before and you’re in the unfortunate position of making it happen in your environment, CTX112125 has more information and links to step-by-step instructions for configuring replication with SQL Server. The main thing you have to do is make sure that your replicas are “writable” by the local Presentation Servers. There are a few ways this can be set up, but in a CPS environment you need to ensure that one master copy of the database is in charge of all changes to it. (With that weird binary encrypted LDAP format, you don’t want the database server to try to sort out two changes entered into two different replicas at the same time.)

Pointing your Presentation Servers to a new replica

Once you’ve got your data store replicated to your new location, you need to reconfigure the local Presentation Servers there to use the new replica instead of the old central location. Remember from the beginning of this section that a Presentation Server knows where to find the central data store via a file called MF20.dsn (which is specified in the registry). If you want to point your Presentation Server to another database (i.e. a local replica), all you have to do is to change that DSN and then restart the IMA service. (There are some command-line options for the dsmaint utility that let you change the location of the data store, but I personally find it easier to just edit the MF20.dsn file itself.)
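For illustration, a file DSN is just a small INI-style text file. A minimal sketch of what an MF20.dsn pointing at a SQL Server replica might contain (the server and database names here are invented, not Citrix defaults):

```
[ODBC]
DRIVER=SQL Server
SERVER=REPLICA-DB01
DATABASE=CitrixDataStore
```

After changing the SERVER line to name the local replica and restarting the IMA service, the Presentation Server connects to the replica exactly as if it were the original data store.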

Again, this will only work if you’re pointing a server to a new database that is 100% identical to the old database. You cannot use this technique to “migrate” a server between farms since a new farm wouldn’t know anything about a server that was just randomly pointed to it.


A long while back someone asked me: “On a Wi-Fi board, how does the CPU connect to the RF chip?” At the time I honestly had no idea, since I knew nothing about embedded-system buses, so I offered the most efficient mechanism in computing and answered: “shared memory”... Looking back, that was idiotic; the other person surely lost all interest in continuing the conversation. To this day I’m not sure exactly what the question was asking, because it’s unclear what that CPU was: a DSP, an MCU, an SoC, or something else. Over the past few days I went through some product briefs, looked over the offerings from Broadcom, Marvell, and Atheros (since acquired by Qualcomm), and summarized what I found.

Generally speaking, a digital wireless system consists of an AP (MCU), a BB, and an RF stage. In small, highly integrated systems, though, the AP and BB are often combined into a baseband processor (BBP); in the Wi-Fi industry specifically, this is simply called an SoC.

1. First, what I found at Marvell:

This is Marvell’s top-dog solution: a high-end SoC (with 3x3 MIMO) plus dual-band 2.4 GHz and 5 GHz:

  On the left is the SoC, on the right the RF chip; they are connected through a BBU (baseband transfer unit). What is I/Q? If that’s unfamiliar, you’ll have to consult a communications text (on OFDM). The PCIe interface on the right side of the 88W8366 SoC is this card’s interface to the host — familiar territory. There are also JTAG and UART/GPIO pins, and the EEPROM is attached over SPI.

2. Next, the Atheros solution:

On the left, the AR7010 is an SoC; on the right, the AR9280 is what they call a signal accelerator. Most unusually, the two are connected over PCIe, while the SoC connects to the host interface over USB. If it can sit on PCIe, the signal accelerator must be more complex than a bare RF chip.

There is also the AR4100, an Atheros single-chip Wi-Fi solution:

The blue portion integrates the baseband and MAC/RF into a single package (SiP — system-in-package); the grey part is the MCU. The interface is SPI, with the SPI slave connected to the host interface.

3. Finally, Broadcom:

It is essentially the same as Marvell: the baseband signal goes straight to the RFIC over a BBU. The product PDF doesn’t say so explicitly, but the block diagram makes it clear:

How are the BCM5352 and BCM2050 connected? The answer is a BBU, just like Marvell — as the I/Q inputs and outputs in the figure below show.

What does the __KERNEL__ macro do?

This macro shows up in both kernel and application code. It is used only in preprocessor tests; it is never substituted into actual code logic, much like the include-guard pattern discussed earlier that avoids duplicate definitions. Unlike an include guard, though, its purpose is not to prevent redefinition. So what does this macro actually mean? First look at this:

cmd_kernel/sched.o := gcc -Wp,-MD,kernel/.sched.o.d  -nostdinc -isystem /usr/lib/gcc/i586-suse-linux/4.5/include -I/usr/src/linux- -Iinclude  -include include/generated/autoconf.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m32 -msoft-float -mregparm=3 -freg-struct-return -mpreferred-stack-boundary=2 -march=i686 -mtune=core2 -mtune=generic -maccumulate-outgoing-args -Wa,-mtune=generic32 -ffreestanding -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -pipe -Wno-sign-compare -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wframe-larger-than=2048 -fno-stack-protector -fno-omit-frame-pointer -fno-optimize-sibling-calls -fasynchronous-unwind-tables -g -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fconserve-stack -DCC_HAVE_ASM_GOTO -fno-omit-frame-pointer    -D"KBUILD_STR(s)=\#s" -D"KBUILD_BASENAME=KBUILD_STR(sched)"  -D"KBUILD_MODNAME=KBUILD_STR(sched)" -c -o kernel/.tmp_sched.o kernel/sched.c

These are the compiler flags the kernel build uses for kernel/sched.c; note the -D__KERNEL__ passed to gcc — in fact almost every kernel compilation command line includes it. Now look at include/linux/sched.h:

#ifndef _LINUX_SCHED_H
#define _LINUX_SCHED_H

/* cloning flags: */
#define CSIGNAL		0x000000ff	/* signal mask to be sent at exit */
#define CLONE_VM	0x00000100	/* set if VM shared between processes */
#define CLONE_DETACHED		0x00400000	/* Unused, ignored */
#define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_NEWPID		0x20000000	/* New pid namespace */
#define CLONE_NEWNET		0x40000000	/* New network namespace */
#define CLONE_IO		0x80000000	/* Clone io context */

/* Scheduling policies */
#define SCHED_NORMAL		0
#define SCHED_FIFO		1
#define SCHED_RR		2
#define SCHED_BATCH		3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE		5
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK     0x40000000

#ifdef __KERNEL__

struct sched_param {
	int sched_priority;
};

#include <asm/param.h>	/* for HZ */

#include <linux/capability.h>

You can see that when __KERNEL__ is defined, an extra section is compiled in, and that extra section is needed only by code like the kernel itself.
Consider this scenario: I write a driver for some device, and __KERNEL__ is of course defined when it is compiled. The driver usually also ships a library that gives applications an interface to certain variables and functions. Finally, we write an application to operate the device. Suppose the library’s headers don’t provide everything we need — say they don’t define the #define CLONE_DETACHED 0x00400000 shown above, yet one of the library’s API functions can return CLONE_DETACHED. Then we cannot write if (CLONE_DETACHED == fun(arg)) as a test (short of digging through the kernel sources and comparing against 0x00400000 by hand), so the application code would have to include the kernel header (a practice that is discouraged nowadays). That creates a problem: letting applications include kernel headers exposes many kernel internals and inflates the object file, since applications have no use for most kernel structures and variables. So a boundary must be drawn between what is visible only to kernel code and what is open to all code — and that boundary is the __KERNEL__ macro. In practice, whatever an application needs from the kernel headers is redefined in the library’s own headers, and applications include those instead.


Paul Mackerras writes:

> The only valid reason for userspace programs to be including kernel
> headers is to get definitions that are part of the kernel API. (And
> in fact others here will go further and assert that there are *no*
> valid reasons for userspace programs to include kernel headers.)
> If you want some atomic functions or whatever for your userspace
> program and the ones in the kernel look like they would be useful,
> then take a copy of the relevant kernel code if you like, but don’t
> include the kernel headers directly.

Sure. That copy belongs in /usr/include/asm for all programs
to use, and it should match the libc that will be linked against.
(note: “copy”, not a symlink)

Red Hat 7 gets this right:

$ ls -ldog /usr/include/asm /usr/include/linux
drwxr-xr-x 2 root 2048 Sep 28 2000 /usr/include/asm
drwxr-xr-x 10 root 10240 Sep 28 2000 /usr/include/linux

Debian’s “unstable” is correct too:

$ ls -ldog /usr/include/asm /usr/include/linux
drwxr-xr-x 2 root 6144 Mar 12 15:57 /usr/include/asm
drwxr-xr-x 10 root 23552 Mar 12 15:57 /usr/include/linux

> This is why I added #ifdef __KERNEL__ around most of the contents
> of include/asm-ppc/*.h. It was done deliberately to flush out those
> programs which are depending on kernel headers when they shouldn’t.

What, is </usr/src/linux/asm/foo.h> being used? I doubt it.

If /usr/include/asm is a link into /usr/src/linux, then you
have a problem with your Linux distribution. Don’t blame the
apps for this problem.

Adding “#ifdef __KERNEL__” causes extra busywork for someone
trying to adapt kernel headers for userspace use. At least do
something easy to rip out. Three lines, all together at the top:

#ifndef __KERNEL__
#error Raw kernel headers may not be compatible with user code.
#endif

[Repost] ccache


ccache is a compiler cache. It speeds up recompilation by caching previous compilations and detecting when the same compilation is being done again. Supported languages are C, C++, Objective-C and Objective-C++.




  • Prefix the compile command with ccache: $ ccache gcc xxx
  • Create a symlink: $ ln -s ccache /usr/local/bin/gcc

The first approach is recommended, because ccache occasionally gets confused, and when it is the source of an error, the cause is very hard to spot. Once, while building a certain codebase, ccache misjudged a compiler option and the build failed, and I simply could not figure out why. So when something strange happens, try building the normal way first.



  • Building a software package
  1. [/tmp/bash-4.1 0]$ uname -a
  2. Linux AP 2.6.37-gentoo #1 SMP PREEMPT Sun Jan 16 14:55:15 CST 2011 i686 Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz GenuineIntel GNU/Linux
  3. [/tmp/bash-4.1 0]$ CC=”ccache gcc” ./configure
  4. [/tmp/bash-4.1 0]$ time make
  5. real 0m47.343s
  6. user 0m39.572s
  7. sys 0m3.244s
  8. [/tmp/bash-4.1 0]$ make clean
  9. [/tmp/bash-4.1 0]$ time make
  10. real 0m10.131s
  11. user 0m5.597s
  12. sys 0m1.077s

As the numbers show, with ccache the build became about 5x faster (and for a long stretch in the middle it wasn’t gcc doing the compiling, otherwise it would be faster still). Wonderful..


  • Building the kernel
  1. [/tmp/linux-2.6.34 0]$ uname -a
  2. Linux boeye-AP 2.6.37-gentoo #1 SMP PREEMPT Wed Jan 12 20:06:14 CST 2011 x86_64 AMD Athlon(tm) II X4 630 Processor AuthenticAMD GNU/Linux
  3. [/tmp/linux-2.6.34 0]$ grep “make” build
  4. 28:make -j4 ARCH=arm CROSS_COMPILE=”ccache arm-linux-” O=$outdir $@
  5. [/tmp/linux-2.6.34 0]$ time ./build
  6. real 3m4.146s
  7. user 10m30.640s
  8. sys 0m37.138s
  9. [/tmp/linux-2.6.34 0]$ ./build clean
  10. [/tmp/linux-2.6.34 0]$ time ./build
  11. real 0m23.714s
  12. user 0m31.603s
  13. sys 0m12.777s



  • Building Android

In Android, using ccache only requires adding the environment variable ‘$ export USE_CCACHE=1′. The difference is that by default the build does not use the host’s ccache program but the one bundled with the source tree. Building Android needs a larger cache:

  1. $ ccache -M 3G    // set the cache size to 3G


ARM was born against this larger backdrop, which guaranteed that its founders neither would nor wished to turn ARM into a giant corporation; it is also the most important reason why ARM, for all its achievements, still employs fewer than two thousand people. ARM originally stood for Acorn RISC Machine. Acorn Computer was founded in 1978 and headquartered in Cambridge, created by Andy Hopper (Cambridge University), Chris Curry (Sinclair Research) and Hermann Hauser (Cambridge University)[48]

Acorn initially built processor systems around the MOS Technology 6502, an 8-bit processor designed by engineers who had come from Motorola’s MC6800 team[48]. On the 6502, Acorn developed the system it was proudest of, the BBC Micro[49]

From the 1980s into the 1990s, the BBC Micro dominated the British education market. Another 6502-based system of the era was the Apple II[50]. From then on Acorn and Apple, two companies similar in design philosophy and product form, were bound together, and some called Acorn “The British Apple”[51]. It was also around this time that Acorn met the rival of its life: Intel. The PC built on Intel’s x86 architecture was a nightmare for the processor vendors of the day, and few of them ever woke up. Submit or perish; there was no other choice. Acorn chose to submit and asked Intel for samples of the 80286 processor. Intel refused[52]


In October 1983 Acorn started a project code-named Acorn RISC, with VLSI Technology responsible for manufacturing. On April 26, 1985, VLSI produced the first Acorn RISC processor, the ARM1. The ARM1’s structure was extremely simple, with only 25,000 transistors and not even a multiplier[52]. At the time nobody paid the chip much attention; far more people were watching the 80386, which Intel launched on October 17, 1985[36]


Acorn had no choice but to keep out of Intel’s way, and that settled the design philosophy of the ARM processor: low-cost, low-power, and high-performance. That philosophy happens to fit the needs of the 21st-century smartphone, yet it was forced on ARM by Intel. Without meaning to, Intel had set up a powerful rival for itself, one that grew up step by step under Intel’s own shelter. It is no exaggeration to say that without Intel there would be no ARM as we know it today.



Acorn hit a wall both financially and technically. BBC Micro sales of 1.5 million units did not bring Acorn enough wealth; next to the PC sweeping the world, it was trivial[54]. Nor did the ARM3 bear much comparison with the 80486 that Intel released in 1989. The crisis finally came to this young company: in February 1985 the then IT giant Olivetti paid £12M for a 49.3% stake in Acorn[55]. Olivetti’s protection brought Acorn no new opportunities.


After acquiring Acorn, Olivetti used ARM processors mostly for research, while its actual products used Zilog parts. These were Acorn’s hardest days. Acorn founder Andy Hopper eventually chose to break away from Olivetti and, unexpectedly, Olivetti backed his decision.

In November 1990, Acorn (in reality the Olivetti Research Lab), Apple and VLSI jointly funded the creation of ARM, and Acorn RISC Machine was formally renamed Advanced RISC Machine[55]. In 1996, at its most difficult moment, Olivetti sold the 14.7% of Acorn it still held to Lehman Brothers[56]

At the time, Apple was looking for a low-power processor for its project code-named Newton, whose ultimate goal was to build the first tablet on Earth. Apple had high hopes for the tablet and named the project directly after the Isaac Newton on the company logo — Apple’s original logo was Newton deep in thought under an apple tree. The two Steves[i] named the company Apple not because they liked eating apples, but because it was an apple, not a pear, that fell on Newton’s head.

The Newton tablet was too far ahead of its time, and worst of all, Jobs was not at Apple then. Apple took no small amount of time to prove a truth: an Apple without Jobs is no different from the Bulls without Jordan. In March 1996 Steve Jobs returned to Apple, and two years later he cancelled the unsuccessful project[57]. By the time Jobs again launched a tablet, the iPad, more than a decade had passed[58]






In ARM’s start-up phase it was Apple that helped the most, yet the first to license an ARM core was Britain’s own GEC semiconductor business. In 1993, thanks to Apple’s introduction, the ARM processor crossed mountains and seas to Japan and formed a partnership with Sharp, which until then had been working with Apple on the Newton project.

These partnerships did not ease ARM’s financial crisis, and ARM kept searching for customers truly its own. In 1993, Cirrus Logic[iv] and Texas Instruments (TI) joined the ARM camp in turn. TI’s help came at ARM’s moment of greatest need: TI was persuading a then-obscure Finnish company, Nokia, to enter the mobile communications market alongside it. TI had already won the leading position in DSPs but was unfamiliar with the CPU business, and among the handful of companies it could steer, it finally chose ARM[67]


That same year ARM received the most important processor core since the company’s founding, the ARM7[67]. The ARM7’s die was one sixteenth the size of the Intel 80486’s, and it sold for only about $50[v]. The smaller die gave the ARM7 lower power consumption, suiting it to handheld applications[67]

The ARM7 drew the attention of DEC, then a processor giant. In 1995 DEC began developing StrongARM. Unlike other vendors licensing ARM cores, DEC obtained a full license to the ARM architecture: it could use the ARM instruction set while designing a new processor microarchitecture of its own, a privilege later inherited in turn by Intel and Marvell. On February 5 of the following year, DEC formally announced the SA110 processor and began supplying samples[68]. The SA110 quickly won the industry’s approval, and Apple began using it to develop the MessagePad 2000[69]

The StrongARM design absorbed some elements of the Alpha processor: a 5-stage in-order pipeline, separate instruction and data caches, DMMU and IMMU units each containing a 32-entry fully associative TLB, and a 16-entry-deep write buffer (WB)[70]. At that point the ARM processor looked more like a microprocessor and no longer a microcontroller.


StrongARM’s success did not help DEC escape its financial crisis, but DEC found an easier way to make money. In May 1997 DEC formally sued Intel, claiming that the Pentium, Pentium Pro and Pentium II designs infringed ten DEC patents. In September 1997 Intel countersued, claiming that the Alpha series infringed as many as fourteen Intel patents[72]

In the IT world, most such suits come to nothing. On November 27, 1997, DEC and Intel chose to settle. DEC licensed to Intel all of its hardware designs except the Alpha processor and agreed to further support Intel’s IA64 development, while Intel spent $625M to buy DEC’s fab in Hudson and its chip design centers in Jerusalem, Israel and Austin, Texas. The two companies also signed a cross-licensing agreement lasting ten years[72]



For a time, XScale processors reached every corner of embedded computing: the PXA series for handheld terminals, the IXC/Intel CE series for consumer electronics, the IOP series for storage, and the IXP series for communications. Intel’s processor technology greatly advanced the ARM core, and by borrowing the PC empire’s ecosystem it put the ARM processor a step ahead of every competitor in the embedded industry, from manufacturing through design. The first to serve as XScale’s touchstone was Motorola Semiconductor’s 68K processor.

Before the XScale series appeared, the 68K ruled the embedded world; the original Apple Macintosh also used a 68K. In 1997 Motorola sold 79M 68K processors, while Intel sold 75M x86 processors in total[73]. It was the 68K’s last moment of glory. The ARM processors driven by Intel and TI finished the 68K off. Motorola Semiconductor was utterly unprepared for ARM’s offensive, and ARM ate away at the 68K’s market share until it held all of it.

In 1995, Motorola Semiconductor’s Hong Kong design center released the first DragonBall processor for handheld devices, the MC68328 (EZ/VZ/SZ)[74] — the best of times for Hong Kong’s semiconductor industry. StrongARM and XScale soon ended the design center’s happy life. Facing ARM’s challenge, DragonBall finally yielded: the DragonBall MX (later Freescale i.MX) series switched to the ARM9. Adopting an ARM core did not change the Hong Kong design center’s fate, and in the end it ceased to exist.




Intel kept its ARM processor license but left the ARM camp for good — a very cautious yet firm choice, for Intel had a fire raging in its own backyard to put out. In the PC arena, AMD had been first to ship the 64-bit K8 processor[75] and unveiled dual-core Athlon 64 processors at Computex 2005. The performance advantage x86 was proudest of no longer existed.

During this stretch, Intel could only hold off AMD’s Athlon 64 processors with its process technology and formidable commercial muscle. In November 2008 Intel formally launched the Nehalem-based Core i7 for desktops[76] and Xeon for servers, with the Core i3/i5 arriving on schedule. The Nehalem core let Intel defeat AMD decisively. It was also the third milestone product since Intel began building x86 processors, after the 80386 and the Pentium Pro. From that point on, AMD’s processors never again surpassed Intel’s in performance. But having removed its greatest threat, Intel found that the ARM processor was no longer the weakling it remembered.




The ARM9's execution pipeline separated the Memory and Write Back stages, used respectively for accessing memory and writing results back to registers. These techniques let the ARM9 complete Load and Store instructions in a single cycle, whereas on the ARM7 a Load took 3 cycles and a Store took 2.


The sensible positioning of the ARM7 and ARM9 drove the ARM camp's rapid growth: SoCs based on the two cores quickly spread to every corner of the world. And the ARM core kept advancing. At the 1998 Embedded Processor Forum (EPF) the ARM10 core was announced, and on April 12, 2000, Lucent released the first ARM10-based processor chip[83].

The ARM10's design goal was again to double the ARM9's performance at the same process node. The first step in raising performance was raising the pipeline clock frequency, which is limited by the slowest logic stage. The ARM10 used a 6-stage pipeline, not simply the ARM9's 5 stages plus one but a carefully rebalanced design. The result: at the same process node, the ARM10 core could be clocked 1.5x faster than the ARM9[82][84].

The ARM10 reused the ARM8's system bus, widening the ARM9's 32-bit bus to 64 bits. This let the ARM10 transfer two registers' worth of data between registers and memory per clock, greatly improving the efficiency of the Load Multiple and Store Multiple instructions[84].

The ARM10 reworked the cache subsystem, improving memory-system efficiency over the ARM9. Its instruction and data caches were virtually addressed and 64-way set associative, and it introduced the streaming buffer and cache-line-fill units found in high-end processors for moving data between the pipeline and the cache[84].




While preserving XScale's low power, Intel brought in the superpipelined RISC techniques already matured on the Pentium Pro series[85]; with Intel's process advantage, the XScale eventually reached a top clock of 1.25GHz[86]. By then Intel's processors had walked into the high-frequency, low-efficiency trap: a 1.25GHz PXA3xx performed only about 25% better than a 624MHz PXA270[86].

The XScale architecture never made money for Intel. The ICG (Intel Communication Group) and WCCG (Wireless Communications and Computing Group) divisions ran heavy losses: ICG alone lost $817M, $824M, and $791M in 2002, 2003, and 2004[87]. On December 11, 2003, Intel announced that WCCG would be folded into ICG, effective January 1, 2004.


Before that, Intel sold off the parts of the XScale business that Marvell was willing to take[12]. What Marvell wanted was not the XScale core but the full ARM instruction-set license Intel had inherited from DEC; Marvell soon shipped processors based on standard ARM v5/v6/v7, no longer relying on XScale alone. XScale, the architecture Intel had poured so much effort into, had reached the end of the road.


The beneficiary of ARM's cheap licensing was ARM itself. As processor vendors kept joining, the ARM camp grew explosively, which also accelerated the vendors' survival of the fittest. But the truth Intel had discovered still applies to every semiconductor vendor holding an ARM license.




The ARM11 is based on the ARMv6 instruction set; before it ARM had developed the v1, v2, v2a, v3, v4, and v5 instruction sets. ARM cores and instruction sets do not map one-to-one: the ARM9 used v4 and v5, XScale used v5, and the ARM7 started with v3, moved to v4, and finally to v5. ARM ISA names can also carry suffixes, as in ARMv5TEJ: T means Thumb support, E means Enhanced DSP instructions, and J means Jazelle DBX support.

ARMv4 contains the most basic ARM instructions. v5 strengthened ARM/Thumb interworking and added the CLZ (Count Leading Zeros) and BKPT (software breakpoint) instructions. ARMv5TE added a series of Enhanced DSP instructions such as PLD (Preload Data), LDRD (Dual Word Load), STRD (Dual Word Store), and 64-bit transfer instructions such as MCRR and MRRC. v4 and v5 differ little at the ISA level, and v5 remains compatible with v4 code[94].


The ARM ISA follows the RISC approach, yet it still contains many CISC elements. Compared with the PowerPC ISA, ARM's is far messier, which creates no small trouble for the pipeline's decode unit. An ARM core supports three instruction sets: 32-bit ARM instructions, 16-bit Thumb instructions, and the variable-length, byte-oriented Jazelle DBX (Direct Bytecode eXecution) set. Within ARM's modest instruction repertoire, two features deserve special attention: conditional execution and the shift operations.

Most ARM data-processing instructions support conditional execution, meaning an instruction executes or not depending on the status flags. This can reduce, to a degree, the penalty paid when a conditional branch is mispredicted. Conditional execution shines in computing the GCD (Greatest Common Divisor), as shown in Figure 2.

Figure 2: implementation of the gcd algorithm[94]

The figure shows that because SUBGT and SUBLE execute or not based on the flags set by CMP, the code is markedly shorter. The ARM ISA also treats shifting specially: there is no standalone shift instruction. Instead a barrel shifter is combined with other instructions to implement shifts, which makes some computations noticeably more efficient, as shown in Figure 3.

Figure 3: using the barrel shifter


Conditional execution occupies 4 status bits in the instruction encoding, constraining the expansion of the ISA and the register file: most RISC processors have 32 general-purpose registers, while the ARM core has only 16[x]. ARM's special shift operations also increase inter-instruction dependencies, which in some cases works against multi-issue pipelines and complicates the implementation of reservation stations (RS) in the pipeline.


The ARM11 core adopted several of the IPC-raising techniques common in modern processors, an important milestone for ARM. It drew the attention of two towering figures of computer science, David A. Patterson and John L. Hennessy, who made the ARM11 core, rather than MIPS, the central example of their authoritative text 《Computer Organization and Design, Fourth Edition: The Hardware/Software Interface》. That was academia's greatest recognition of the ARM processor to date.

The ARM11 supports multiple cores and uses an 8-stage pipeline; the first cores shipped at 350-500MHz, with a top clock of 1GHz. At 0.13μm and 1.2V, the ARM11's power-to-frequency ratio was only 0.4mW/MHz. It added SIMD instructions, doubling MPEG4 codec performance relative to the ARM9, and changed the cache organization to index cache lines by physical address[95]. The ARM11 also finally adopted dynamic branch prediction, with a 64-entry, 4-state BTAC (Branch Target Address Cache)[95].

The ARM11 further optimized the pipeline's access to the memory system, especially reads and writes after a cache miss. In the ARM11 core, a pending memory read does not block subsequent independent instructions, even if they too are memory reads; only when three outstanding reads have all missed the cache does the pipeline stall[95].

Although the ARM11 did not adopt the out-of-order plus superscalar techniques common in RISC processors, issuing only one instruction per cycle, in order, it does support out-of-order completion: independent instructions in the execution units may finish out of order rather than waiting for earlier instructions to complete.


On the strength of its performance-per-watt, the ARM11 core was a huge commercial success. The ARM11 was not an especially fast processor, but as performance kept climbing, quantity turned into quality: the ARM11 made the smartphone possible.

Before that, phones based on ARM9 or XScale were merely feature phones with a few smart parts bolted on. The ARM11 accelerated the shakeout among handset makers: Apple and HTC rose suddenly in smartphones while Motorola collapsed. After the ARM11, ARM entered explosive growth, quickly releasing the Cortex-A8 and A9 cores in succession.

The rapid succession of ARM cores pushed Nokia, a company slow to react to new technology, step by step into decline. The Nokia N8, which began shipping at the end of September 2010[96], was still using a 680MHz ARM11[97], yet was billed as Nokia's latest flagship while its competitors had long since moved to 1GHz Cortex-A8 processors.



By now, ARM's ambition toward the PC and x86's toward the phone were plain for all to see. On September 9, 2010, ARM formally announced the Cortex-A15 core, codenamed Eagle, claiming five times the performance of the ARM9 architecture and targeting high-end phones, home entertainment, wireless infrastructure, and low-end servers[98]. With the Cortex-A15, ARM announced to the world that besides the PC, it wanted the server too.


[i] Both of Apple's founders were named Steve: Steve Wozniak and the better-known Steve Jobs. Steve Wozniak invented the Apple I and Apple II. The two Steves founded the famous Apple in a garage in April 1976.

[ii] Britain's barn culture, like America's garage culture, is a cradle of new technology.

[iii] ARM jumped directly from the ARM3 to the ARM6.

[iv] The first ARM processor I planned to use was Cirrus Logic's EP7312. At the time I was using an Altera EPLD named EP7132, and I occasionally mixed up the two part numbers. By sheer coincidence, a careless buyer purchased the EP7312 instead of the EP7132 I needed, and that chip became the first ARM processor I ever bought.

[v] Processor prices were absurdly high back then; $50 was already quite cheap.

[vi] I first touched ARM processors with the SA1110, in an era always worth remembering.

[vii] The first processor I worked with in Motorola's semiconductor division was the ColdFire, which is still evolving today. It is assembly-language compatible with the 68K, but not object-code compatible.

[viii] The QorIQ series is based on the E500mc core, which differs slightly from the E500 v2. My first book, 《Linux PowerPC详解—核心篇》, was based on the E500 core. I had planned a pair of volumes, a core volume and an applications volume; the applications volume was to cover peripherals, and the later 《PCI Express体系结构导读》 grew out of 《Linux PowerPC详解—应用篇》. The applications volume was to include network protocols, PCI Express, and the USB bus, but the network-protocol and USB parts were later dropped.

[ix] In processor architecture, three classes of hazards get the most attention: RAW, WAR, and WAW. Register renaming resolves WAR and WAW hazards.

[x] Given that ARM cores before the ARM11 supported neither dynamic branch prediction nor multi-issue, conditional execution could still improve the execution efficiency of the ARM7/9 cores.

Hardirq, Softirq, Tasklet and Workqueue

Interrupts are a tangled topic, and the problems they raise easily spark debate. Besides the question of sleeping in interrupt context, which I covered before, there are also tasklets and workqueues. It is worth reorganizing and summarizing all of this here.



1. A synchronous interrupt is generated by the CPU control unit while executing instructions. It is called synchronous because the CPU issues it only after an instruction completes, not at an arbitrary point during execution; a system call is an example.

2. An asynchronous interrupt is generated by other hardware devices at random with respect to the CPU clock, meaning it can arrive between any two instructions; a keyboard interrupt is an example.

In Intel's official terminology, synchronous interrupts are called exceptions and asynchronous interrupts are called interrupts.

Interrupts divide into maskable interrupts (e.g. a printer interrupt) and non-maskable interrupts (NMI). Exceptions divide into three classes: faults (e.g. a page fault), traps (e.g. a debug exception), and aborts.

Broadly speaking, interrupts fall into four categories: interrupts, faults, traps, and aborts. Their similarities and differences are shown in Table 1.

Table 1: Interrupt categories and their behavior
Category    Cause                           Async/Sync   Return behavior
Interrupt   Signal from an I/O device       Async        Always returns to the next instruction
Trap        Intentional exception           Sync         Always returns to the next instruction
Fault       Potentially recoverable error   Sync         Returns to the current instruction
Abort       Unrecoverable error             Sync         Does not return

Each interrupt in the x86 architecture is assigned a unique number, or vector (an 8-bit unsigned integer). Non-maskable interrupt and exception vectors are fixed, while maskable interrupt vectors can be changed by programming the interrupt controller.

OK, everything above is the boilerplate of textbooks and manuals. When a concrete operating system actually implements interrupt handling, things are nowhere near that simple.


What we covered so far is essentially the hard interrupt (a limitation of textbooks and the rigor of manuals), i.e. the traditional handling model, with hardware support running through the entire process. It was later realized that in the sequence disable interrupts -> handle interrupt -> enable interrupts, masking interrupts can cause interrupts to be lost, especially when the handler runs for a long time. So starting with Linux 1.x, interrupt handlers were conceptually split into a top half and a bottom half.

The top half runs immediately when the interrupt occurs; since it runs with interrupts fully masked, it must be fast, or other interrupts will not be serviced in time. The bottom half (if there is one) does almost all of the handler's real work and can be deferred. The kernel treats the two halves as independent functions: the top half's job is to "register" the interrupt and decide whether its bottom half needs to run. Work that must happen immediately belongs in the top half; deferrable work belongs in the bottom half. The bottom half does the work that is closely related to the interrupt but that the top half does not do itself, such as querying the device for information about the interrupt (usually by reading the device's registers) and acting on it. The bottom half is thus triggered by the top half: when a printer port raises an interrupt, its handler immediately runs the corresponding top half, which raises a softirq (one kind of bottom half, described later) into the kernel, and the kernel uses that softirq to wake the process sleeping on the printer task queue.

Their biggest difference: the top half cannot be interrupted, while the bottom half can. Ideally the top half would hand everything over to the bottom half so that it does as little as possible and returns as quickly as possible. In practice, though, the top half must do some work: acknowledging the interrupt to the hardware, plus time-sensitive work such as copying data from the device. Everything else can go to the bottom half (a typical scenario: a network packet arrives at the NIC, the top half must timestamp the packet, and the rest of the processing is deferred to the bottom half).

The kernel's interrupt machinery keeps changing, and the changes are in the bottom half, not the top. As described above, the top half still follows the traditional, hardware-driven interrupt model, so it can also be called the hard interrupt. The bottom half only processes work the top half has pushed onto it and is implemented entirely in software, so it can loosely be understood as a "soft interrupt". But this is not the softirq of the kernel documentation: the real softirq (one of the bottom-half implementations) appeared during 2.3 development and shipped in 2.4. The bottom-half mechanisms have kept evolving, from the original BH (bottom half) to softirqs (introduced in 2.3), tasklets (introduced in 2.3), and workqueues (introduced in 2.5). In 2.6 the traditional BH mechanism was removed; today, mentioning "BH" really means these three implementations: softirq, tasklet, and workqueue.







After SMP became common, the drawbacks above became fatal. Softirqs support SMP: the same softirq can run on different CPUs simultaneously, so softirq handlers must be reentrant. One idea runs through the whole softirq design and implementation: "who marks, who runs", i.e. each CPU is solely responsible for the softirqs it raises, without interfering with the others. This exploits the performance and character of SMP systems and greatly improves processing efficiency.


struct softirq_action
{
	void (*action)(struct softirq_action *);
};



static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;

This is an array of NR_SOFTIRQS softirq descriptors, each a softirq_action. The kernel predefines the meanings of the softirq vectors for us:

enum
{
	HI_SOFTIRQ=0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	BLOCK_IOPOLL_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
	HRTIMER_SOFTIRQ,
	RCU_SOFTIRQ,	/* Preferable RCU should always be the last softirq */

	NR_SOFTIRQS
};


void open_softirq(int nr, void (*action)(struct softirq_action *))
{
	softirq_vec[nr].action = action;
}


Execution point 1: softirqs invoked directly from the hard interrupt

1. The top-half (hard interrupt) handler do_IRQ in arch/x86/kernel/irq.c:

/*
 * do_IRQ handles all normal device IRQ's (the special
 * SMP cross-CPU interrupts have their own specific
 * handlers).
 */
unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
{
	struct pt_regs *old_regs = set_irq_regs(regs);

	/* high bit used in ret_from_ code */
	unsigned vector = ~regs->orig_ax;
	unsigned irq;

	exit_idle();
	irq_enter();

	irq = __get_cpu_var(vector_irq)[vector];

	if (!handle_irq(irq, regs)) {
		ack_APIC_irq();

		if (printk_ratelimit())
			pr_emerg("%s: %d.%d No irq handler for vector (irq %d)\n",
				__func__, smp_processor_id(), vector, irq);
	}

	irq_exit();

	set_irq_regs(old_regs);
	return 1;
}

A more special case is the APIC timer interrupt, in arch/x86/kernel/apic/apic.c:

/*
 * Local APIC timer interrupt. This is the most natural way for doing
 * local interrupts, but local timer interrupts can be emulated by
 * broadcast interrupts too. [in case the hw doesn't support APIC timers]
 *
 * [ if a single-CPU system runs an SMP kernel then we call the local
 *   interrupt as well. Thus we cannot inline the local irq ... ]
 */
void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
{
	struct pt_regs *old_regs = set_irq_regs(regs);

	/*
	 * NOTE! We'd better ACK the irq immediately, because timer
	 * handling can be slow. update_process_times() expects us to
	 * have done irq_enter(). Besides, if we don't timer interrupts
	 * ignore the global interrupt lock, which is the WrongThing (tm).
	 */
	ack_APIC_irq();
	exit_idle();
	irq_enter();
	local_apic_timer_interrupt();
	irq_exit();

	set_irq_regs(old_regs);
}


2. The top-half (hard interrupt) exit function irq_exit() in kernel/softirq.c:

void irq_exit(void)
{
	sub_preempt_count(IRQ_EXIT_OFFSET);
	if (!in_interrupt() && local_softirq_pending())
		invoke_softirq();

	/* Make sure that timer wheel updates are propagated */
	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
		tick_nohz_stop_sched_tick(0);
	preempt_enable_no_resched();
}

3. The softirq dispatch function invoke_softirq() in kernel/softirq.c:

/*
 * If __ARCH_IRQ_EXIT_IRQS_DISABLED is defined, IRQs are guaranteed to be
 * disabled when irq_exit() runs, so the kernel can skip disabling them
 * again and call __do_softirq() directly instead of do_softirq().
 */
static inline void invoke_softirq(void)
{
	if (!force_irqthreads)
		do_softirq();	/* __do_softirq() on arches defining the macro above */
	else
		wakeup_softirqd();
}

x86 actually calls do_softirq here, while ARM calls __do_softirq; the difference is that in the former case hard interrupts are not yet guaranteed to be disabled, so do_softirq must disable them itself.

But this do_softirq is not the one defined in kernel/softirq.c: since the macro __ARCH_HAS_DO_SOFTIRQ is defined on x86, the real do_softirq lives in arch/x86/kernel/irq_32.c:

asmlinkage void do_softirq(void)
{
	unsigned long flags;
	struct thread_info *curctx;
	union irq_ctx *irqctx;
	u32 *isp;

	if (in_interrupt())
		return;

	local_irq_save(flags);

	if (local_softirq_pending()) {
		curctx = current_thread_info();
		irqctx = __this_cpu_read(softirq_ctx);
		irqctx->tinfo.task = curctx->task;
		irqctx->tinfo.previous_esp = current_stack_pointer;

		/* build the stack frame on the softirq stack */
		isp = (u32 *) ((char *)irqctx + sizeof(*irqctx));

		call_on_stack(__do_softirq, isp);
		/*
		 * Shouldn't happen, we returned above if in_interrupt():
		 */
		WARN_ON_ONCE(softirq_count());
	}

	local_irq_restore(flags);
}


Here lies a puzzling point: interrupts are disabled between local_irq_save(flags) and local_irq_restore(flags), so __do_softirq runs with interrupts off. Doesn't that contradict the design intent that the bottom half runs with interrupts enabled? The answer is inside __do_softirq.

Ultimately we still end up executing __do_softirq in kernel/softirq.c:

/*
 * We restart softirq processing MAX_SOFTIRQ_RESTART times,
 * and we fall back to softirqd after that.
 *
 * This number has been established via experimentation.
 * The two things to balance is latency against fairness -
 * we want to handle softirqs as soon as possible, but they
 * should not be able to lock up the box.
 */
#define MAX_SOFTIRQ_RESTART 10

asmlinkage void __do_softirq(void)
{
	struct softirq_action *h;
	__u32 pending;
	int max_restart = MAX_SOFTIRQ_RESTART;
	int cpu;

	pending = local_softirq_pending();

	__local_bh_disable((unsigned long)__builtin_return_address(0),
				SOFTIRQ_OFFSET);

	cpu = smp_processor_id();
restart:
	/* Reset the pending bitmask before enabling irqs */
	set_softirq_pending(0);

	local_irq_enable();

	h = softirq_vec;

	do {
		if (pending & 1) {
			unsigned int vec_nr = h - softirq_vec;
			int prev_count = preempt_count();

			h->action(h);

			if (unlikely(prev_count != preempt_count())) {
				printk(KERN_ERR "huh, entered softirq %u %s %p"
				       "with preempt_count %08x,"
				       " exited with %08x?\n", vec_nr,
				       softirq_to_name[vec_nr], h->action,
				       prev_count, preempt_count());
				preempt_count() = prev_count;
			}
		}
		h++;
		pending >>= 1;
	} while (pending);

	local_irq_disable();

	pending = local_softirq_pending();
	if (pending && --max_restart)
		goto restart;

	if (pending)
		wakeup_softirqd();

	__local_bh_enable(SOFTIRQ_OFFSET);
}



Ha: the softirq handler invocation h->action(h) is sandwiched between local_irq_enable() and local_irq_disable(). The earlier puzzle is solved.

Is that all there is to the softirq flow? Of course not. Besides being triggered directly from irq_exit after the hard interrupt completes, a softirq can also be booked first and run later (deferred). This is where the deferred nature of softirqs shows.

Execution point 2: the ksoftirqd kernel thread runs softirqs


The booking is done via the raise_softirq function:

in kernel/softirq.c

void raise_softirq(unsigned int nr)
{
	unsigned long flags;

	local_irq_save(flags);
	raise_softirq_irqoff(nr);
	local_irq_restore(flags);
}

inline void raise_softirq_irqoff(unsigned int nr)
{
	__raise_softirq_irqoff(nr);

	/*
	 * If we're in an interrupt or softirq, we're done
	 * (this also catches softirq-disabled code). We will
	 * actually run the softirq once we return from
	 * the irq or softirq.
	 *
	 * Otherwise we wake up ksoftirqd to make sure we
	 * schedule the softirq soon.
	 */
	if (!in_interrupt())
		wakeup_softirqd();
}

in include/linux/interrupt.h

static inline void __raise_softirq_irqoff(unsigned int nr)
{
	or_softirq_pending(1UL << nr);
}

After peeling through these layers, all that really happens is that the bit for the given softirq number is set in the softirq pending bitmap.

2. The ksoftirqd kernel thread

Once booked, raise_softirq_irqoff wakes ksoftirqd:

in kernel/softirq.c

void wakeup_softirqd(void)
{
	/* Interrupts are disabled: no need to stop preemption */
	struct task_struct *tsk = __get_cpu_var(ksoftirqd);

	if (tsk && tsk->state != TASK_RUNNING)
		wake_up_process(tsk);
}

static int run_ksoftirqd(void * __bind_cpu)
{
	set_current_state(TASK_INTERRUPTIBLE);

	while (!kthread_should_stop()) {
		preempt_disable();
		if (!local_softirq_pending()) {
			preempt_enable_no_resched();
			schedule();
			preempt_disable();
		}
		__set_current_state(TASK_RUNNING);

		while (local_softirq_pending()) {
			/* Preempt disable stops cpu going offline.
			   If already offline, we'll be on wrong CPU:
			   don't process */
			if (cpu_is_offline((long)__bind_cpu))
				goto wait_to_die;
			do_softirq();
			preempt_enable_no_resched();
			cond_resched();
			preempt_disable();
		}
		preempt_enable();
		set_current_state(TASK_INTERRUPTIBLE);
	}
	__set_current_state(TASK_RUNNING);
	return 0;

wait_to_die:
	preempt_enable();
	/* Wait for kthread_stop */
	set_current_state(TASK_INTERRUPTIBLE);
	while (!kthread_should_stop()) {
		schedule();
		set_current_state(TASK_INTERRUPTIBLE);
	}
	__set_current_state(TASK_RUNNING);
	return 0;
}

Execution point 3: running softirqs explicitly via local_bh_enable
void local_bh_enable(void)
{
	_local_bh_enable_ip((unsigned long)__builtin_return_address(0));
}

void local_bh_enable_ip(unsigned long ip)
{
	_local_bh_enable_ip(ip);
}

static inline void _local_bh_enable_ip(unsigned long ip)
{
	WARN_ON_ONCE(in_irq() || irqs_disabled());

	/* Are softirqs going to be turned on now: */
	if (softirq_count() == SOFTIRQ_DISABLE_OFFSET)
		trace_softirqs_on(ip);

	/* Keep preemption disabled until we are done with softirq processing: */
	sub_preempt_count(SOFTIRQ_DISABLE_OFFSET - 1);

	if (unlikely(!in_interrupt() && local_softirq_pending()))
		do_softirq();

	dec_preempt_count();
	preempt_check_resched();
}


This path is used heavily in the networking stack, since the stack tends to generate floods of interrupts, and nudging softirq processing along here seems a good choice. Moreover, the NET_TX_SOFTIRQ and NET_RX_SOFTIRQ vectors rank above all other softirqs except HI and TIMER, which helps ensure they are handled promptly.




When we looked at the softirq vectors above we saw TASKLET_SOFTIRQ; it is precisely the softirq that serves the tasklet mechanism. Plainly put, tasklets are built on softirqs:

HI_SOFTIRQ also serves tasklets; it is simply the highest-priority softirq of all.

The main differences between tasklets and raw softirqs:

1. A softirq allows the same handler to run on different CPUs truly simultaneously (on SMP there really are two or more cores), so the handler must be reentrant. Kernel and driver developers therefore have to deal with mutual exclusion inside the handler, i.e. locking; and since sleeping locks are forbidden here, only spinlocks can be used.

2. The tasklet mechanism guarantees that a given tasklet runs on only one CPU at a time, though different tasklets may run on different CPUs, so developers can put mutual exclusion out of their minds. In addition, tasklet executions do not accumulate: if a tasklet is scheduled, say, 3 times before it gets to run, its handler actually runs only once (it is not reentrant). Finally, a tasklet always runs on the CPU that first scheduled it, which is good for CPU cache locality.

The tasklet structure:

in include/linux/interrupt.h:

struct tasklet_struct
{
	struct tasklet_struct *next;
	unsigned long state;
	atomic_t count;
	void (*func)(unsigned long);
	unsigned long data;
};

struct tasklet_head {
	struct tasklet_struct *list;
};
(2) state holds the tasklet's current status. It is a 32-bit unsigned integer of which only bit 0 and bit 1 are used: bit 0 set means the tasklet has been scheduled to run, and bit 1 exists specifically for SMP systems. The kernel predefines the meanings of these two bits:

TASKLET_STATE_SCHED, /* Tasklet is scheduled for execution */
TASKLET_STATE_RUN     /* Tasklet is running (SMP only) */

(3) count is an atomic reference count (in practice only 0 or 1) on the tasklet, whose purpose is to enable or disable the tasklet even while it is already queued. Only when count is 0 may the tasklet body execute, i.e. only then is the tasklet enabled; a nonzero count means the tasklet is disabled. So before running a tasklet's body, the kernel must first check that the atomic count is 0.


The tasklet dispatch function:

A tasklet can be seen as step 2 of a softirq, so its softirq handler is the tasklet execution/dispatch function:
in kernel/softirq.c:
static void tasklet_action(struct softirq_action *a)
{
	struct tasklet_struct *list;

	local_irq_disable();
	list = __get_cpu_var(tasklet_vec).head;
	__get_cpu_var(tasklet_vec).head = NULL;
	__get_cpu_var(tasklet_vec).tail = &__get_cpu_var(tasklet_vec).head;
	local_irq_enable();

	while (list) {
		struct tasklet_struct *t = list;

		list = list->next;

		if (tasklet_trylock(t)) {
			if (!atomic_read(&t->count)) {
				if (!test_and_clear_bit(TASKLET_STATE_SCHED,
							&t->state))
					BUG();
				t->func(t->data);
				tasklet_unlock(t);
				continue;
			}
			tasklet_unlock(t);
		}

		local_irq_disable();
		t->next = NULL;
		*__get_cpu_var(tasklet_vec).tail = t;
		__get_cpu_var(tasklet_vec).tail = &(t->next);
		__raise_softirq_irqoff(TASKLET_SOFTIRQ);
		local_irq_enable();
	}
}

The tasklet_trylock() macro attempts to lock the tasklet about to run (pointed to by t). If locking succeeds (no other CPU is currently executing this tasklet), the atomic read function atomic_read() then checks the count member; if count is 0, the tasklet is allowed to run. If tasklet_trylock() fails, or the tasklet may not run because its count is nonzero, we must put the tasklet back on the current CPU's tasklet queue so that it runs the next time this CPU services the TASKLET_SOFTIRQ vector. The steps are: (1) disable CPU interrupts to make the following operations atomic; (2) append the tasklet back onto the current CPU's tasklet queue; (3) raise the TASKLET_SOFTIRQ softirq on the current CPU again; (4) re-enable interrupts.




The workqueue is a bottom-half mechanism added in the Linux 2.6 kernel. Its biggest difference from the other bottom-half mechanisms is that it defers work to a kernel thread, the worker thread. Kernel threads run only in kernel space and have no user address space of their own, yet like ordinary processes they can be scheduled and preempted. Work submitted to a workqueue therefore always runs in process context. The default worker threads are called events/n, where n is the CPU number. If you need to do a large amount of processing in a worker thread, you can create worker threads of your own. Code run through a workqueue thus enjoys all the advantages of process context, the most important being that it is allowed to reschedule and even sleep.

Because softirqs and tasklets execute serially on a given CPU, they are poorly suited to real-time multimedia tasks and other demanding workloads. Some systems therefore replace the softirq mechanism with workqueues for the deferred processing that follows network receive interrupts: a worker thread with the highest real-time priority handles real-time multimedia or other demanding tasks, while a worker with the next-highest priority handles ordinary, non-real-time traffic. The Linux 2.6 scheduler, with kernel preemption and O(1) scheduling, can meet soft real-time requirements, so the worker threads serving real-time multimedia or demanding tasks can almost always run first, guaranteeing that such tasks are handled preferentially.



What exactly is IOWAIT?

iowait shows up in vmstat, iostat, and top alike, but what exactly is it? The article below seems to clear everything up. It is long and I don't have time to translate it for now, so make do with the original.

What exactly is "iowait"?

To summarize it in one sentence, 'iowait' is the percentage
of time the CPU is idle AND there is at least one I/O
in progress.

Each CPU can be in one of four states: user, sys, idle, iowait.
Performance tools such as vmstat, iostat, sar, etc. print
out these four states as a percentage.  The sar tool can
print out the states on a per CPU basis (-P flag) but most
other tools print out the average values across all the CPUs.
Since these are percentage values, the four state values
should add up to 100%.

The tools print out the statistics using counters that the
kernel updates periodically. On AIX, these CPU state counters
are incremented at every clock interrupt, which occurs
at 10 millisecond intervals.
When the clock interrupt occurs on a CPU, the kernel
checks the CPU to see if it is idle or not. If it's not
idle, the kernel then determines if the instruction being
executed at that point is in user space or in kernel space.
If user, then it increments the 'user' counter by one. If
the instruction is in kernel space, then the 'sys' counter
is incremented by one.

If the CPU is idle, the kernel then determines if there is at least one I/O currently in progress to either a local disk or a remotely mounted disk (NFS) which had been initiated from that CPU. If there is, then the 'iowait' counter is incremented by one. If there is no I/O in progress that was
initiated from that CPU, the 'idle' counter is incremented
by one.

When a performance tool such as vmstat is invoked, it reads
the current values of these four counters. Then it sleeps
for the number of seconds the user specified as the interval
time and then reads the counters again. Then vmstat will
subtract the previous values from the current values to
get the delta value for this sampling period. Since vmstat
knows that the counters are incremented at each clock
tick (10ms), it then divides the delta value of
each counter by the number of clock ticks in the sampling
period. For example, if you run 'vmstat 2', this makes
vmstat sample the counters every 2 seconds. Since the
clock ticks at 10ms intervals, there are 100 ticks
per second or 200 ticks per vmstat interval (if the interval
value is 2 seconds). The delta values of each counter
are divided by the total ticks in the interval and
multiplied by 100 to get the percentage value for that
interval.
iowait can in some cases be an indicator of a limiting factor
to transaction throughput whereas in other cases, iowait may be completely meaningless.
Some examples here will help to explain this. The first
example is one where high iowait is a direct cause
of a performance issue.

Example 1:
Let's say that a program needs to perform transactions on behalf of
a batch job. For each transaction, the program will perform some
computations which takes 10 milliseconds and then does a synchronous
write of the results to disk. Since the file it is writing to was
opened synchronously, the write does not return until the I/O has
made it all the way to the disk. Let's say the disk subsystem does
not have a cache and that each physical write I/O takes 20ms.
This means that the program completes a transaction every 30ms.
Over a period of 1 second (1000ms), the program can do 33
transactions (33 tps).  If this program is the only one running
on a 1-CPU system, then the CPU would be busy about 1/3 of the
time (10ms of computation per 30ms transaction) and waiting
on I/O the rest of the time - so 66% iowait and 34% CPU busy.

If the I/O subsystem is improved (let's say a disk cache is
added) such that a write I/O takes only 1ms, then it takes
11ms to complete a transaction, and the program can now do
around 90-91 transactions a second. Here the iowait time
would be around 8%. Notice that a lower iowait time directly
improves the throughput of the program (higher tps).

Example 2:

Let's say that there is one program running on the system - let's assume
that this is the 'dd' program, and it is reading from the disk 4KB at
a time. Let's say that the subroutine in 'dd' is called main() and it
invokes read() to do a read. Both main() and read() are user space
subroutines. read() is a libc.a subroutine which will then invoke
the kread() system call at which point it enters kernel space.
kread() will then initiate a physical I/O to the device and the 'dd'
program is then put to sleep until the physical I/O completes.
The time to execute the code in main, read, and kread is very small -
probably around 50 microseconds at most. The time it takes for
the disk to complete the I/O request will probably be around 2-20
milliseconds depending on how far the disk arm had to seek. This
means that when the clock interrupt occurs, the chances are that
the 'dd' program is asleep and that the I/O is in progress. Therefore,
the 'iowait' counter is incremented. If the I/O completes in
2 milliseconds, then the 'dd' program runs again to do another read.
But since 50 microseconds is so small compared to 2ms (2000 microseconds), the chances are that when the clock interrupt occurs, the CPU will again be idle with a I/O in progress. So again, 'iowait' is incremented.  If 'sar -P ' is run to show the CPU
utilization for this CPU, it will most likely show 97-98% iowait.
If each I/O takes 20ms, then the iowait would be 99-100%.
Even though the I/O wait is extremely high in either case,
the throughput is 10 times better in one case.

Example 3:

Let's say that there are two programs running on a CPU. One is a 'dd'
program reading from the disk. The other is a program that does no
I/O but is spending 100% of its time doing computational work.
Now assume that there is a problem with the I/O subsystem and that
physical I/Os are taking over a second to complete. Whenever the
'dd' program is asleep while waiting for its I/Os to complete,
the other program is able to run on that CPU. When the clock
interrupt occurs, there will always be a program running in
either user mode or system mode. Therefore, the %idle and %iowait
values will be 0. Even though iowait is 0 now, that does not mean there is NOT a I/O problem because there obviously is one
if physical I/Os are taking over a second to complete.

Example 4:

Let's say that there is a 4-CPU system where there are 6 programs
running. Let's assume that four of the programs spend 70% of their
time waiting on physical read I/Os and the 30% actually using CPU time.
Since these four programs do have to enter kernel space to execute the
kread system calls, each spends a percentage of its time in
the kernel; let's assume that 25% of the time is in user mode,
and 5% of the time in kernel mode.
Let's also assume that the other two programs spend 100% of their
time in user code doing computations and no I/O, so that two CPUs
will always be 100% busy. Since the other four programs are busy
only 30% of the time, they can share the two CPUs that are not busy.

If we run 'sar -P ALL 1 10' to run 'sar' at 1-second intervals
for 10 intervals, then we'd expect to see this for each interval:

         cpu    %usr    %sys    %wio   %idle
          0       50      10      40       0
          1       50      10      40       0
          2      100       0       0       0
          3      100       0       0       0
          -       75       5      20       0

Notice that the average CPU utilization will be 75% user, 5% sys,
and 20% iowait. The values one sees with 'vmstat' or 'iostat' or
most tools are the average across all CPUs.

Now let's say we take this exact same workload (same 6 programs
with same behavior) to another machine that has 6 CPUs (same
CPU speeds and same I/O subsytem).  Now each program can be
running on its own CPU. Therefore, the CPU usage breakdown
would be as follows:

         cpu    %usr    %sys    %wio   %idle
          0       25       5      70       0
          1       25       5      70       0
          2       25       5      70       0
          3       25       5      70       0
          4      100       0       0       0
          5      100       0       0       0
          -       50       3      47       0

So now the average CPU utilization will be 50% user, 3% sys,
and 47% iowait.  Notice that the same workload on another
machine has more than double the iowait value.


The iowait statistic may or may not be a useful indicator of
I/O performance - but it does tell us that the system can
handle more computational work. Just because a CPU is in iowait state does not mean that it can't run other threads on that CPU; that is, iowait is simply a form of idle time.


About #ifndef #define

Almost every header file is written this way, to guard against double inclusion. For example,
in the kernel source tree, include/linux/times.h:
#ifndef _LINUX_TIMES_H
#define _LINUX_TIMES_H

#include <linux/types.h>

struct tms {
	__kernel_clock_t tms_utime;
	__kernel_clock_t tms_stime;
	__kernel_clock_t tms_cutime;
	__kernel_clock_t tms_cstime;
};

#endif /* _LINUX_TIMES_H */
So the entire contents of virtually every header file sit between #ifndef and #endif.

Copying a file can crash the kernel?!

Copy a certain 1.6GB file to a USB-to-SCSI device (a PATA disk) with a FAT32 filesystem, and at around 630MB the kernel dies; yet copying a different file goes through without a hitch. I tried other filesystems and got the same result.

Jul 17 22:59:52 dekernel kernel: [11660.092862] usb 2-6: USB disconnect, device number 26
Jul 17 22:59:52 dekernel kernel: [11660.096871] sd 31:0:0:0: [sdg] Unhandled error code
Jul 17 22:59:52 dekernel kernel: [11660.096874] sd 31:0:0:0: [sdg]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul 17 22:59:52 dekernel kernel: [11660.096877] sd 31:0:0:0: [sdg] CDB: Write(10): 2a 00 00 73 6c d8 00 00 f0 00
Jul 17 22:59:52 dekernel kernel: [11660.096885] end_request: I/O error, dev sdg, sector 7564504
Jul 17 22:59:52 dekernel kernel: [11660.098355] sd 31:0:0:0: [sdg] Unhandled error code
Jul 17 22:59:52 dekernel kernel: [11660.098358] sd 31:0:0:0: [sdg]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul 17 22:59:52 dekernel kernel: [11660.098360] sd 31:0:0:0: [sdg] CDB: Write(10): 2a 00 00 73 6d c8 00 00 f0 00
Jul 17 22:59:52 dekernel kernel: [11660.098367] end_request: I/O error, dev sdg, sector 7564744
Jul 17 22:59:52 dekernel kernel: [11660.124608] FAT: FAT read failed (blocknr 1930)
Jul 17 22:59:52 dekernel kernel: [11660.124835] FAT: FAT read failed (blocknr 1656)
Jul 17 22:59:52 dekernel kernel: [11660.124854] FAT: FAT read failed (blocknr 1930)
Jul 17 22:59:52 dekernel kernel: [11660.124871] FAT: FAT read failed (blocknr 1602)
Jul 17 22:59:52 dekernel kernel: [11660.154598] BUG: unable to handle kernel paging request at 36391000
Jul 17 22:59:52 dekernel kernel: [11660.154642] IP: [<c042271f>] __percpu_counter_add+0x1f/0xd0
Jul 17 22:59:52 dekernel kernel: [11660.154678] *pdpt = 0000000021a7c001 *pde = 0000000000000000
Jul 17 22:59:52 dekernel kernel: [11660.154714] Oops: 0000 [#1] PREEMPT SMP
Jul 17 22:59:52 dekernel kernel: [11660.154743] last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/class
Jul 17 22:59:52 dekernel kernel: [11660.154779] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat tun af_packet snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd vboxnetadp vboxnetflt vboxdrv mperf binfmt_misc fuse ext4 jbd2 crc16 loop snd_hda_codec_analog arc4 ecb rtl8187 iwlagn snd_hda_intel snd_hda_codec mac80211 snd_hwdep snd_pcm cfg80211 snd_timer firewire_ohci snd firewire_core sr_mod eeprom_93cx6 skge iTCO_wdt sg 8139too cdrom pcspkr i2c_i801 floppy 8139cp sky2 soundcore asus_atk0110 snd_page_alloc rfkill iTCO_vendor_support crc_itu_t button reiserfs radeon ttm drm_kms_helper drm i2c_algo_bit dm_snapshot dm_mod fan thermal processor thermal_sys ata_generic pata_jmicron [last unloaded: speedstep_lib]
Jul 17 22:59:52 dekernel kernel: [11660.155003]
Jul 17 22:59:52 dekernel kernel: [11660.155003] Pid: 17, comm: bdi-default Not tainted #1 System manufacturer System Product Name/P5B-Deluxe
Jul 17 22:59:52 dekernel kernel: [11660.155003] EIP: 0060:[<c042271f>] EFLAGS: 00010002 CPU: 0
Jul 17 22:59:52 dekernel kernel: [11660.155003] EIP is at __percpu_counter_add+0x1f/0xd0
Jul 17 22:59:52 dekernel kernel: [11660.155003] EAX: 00000000 EBX: f661f374 ECX: ffffffff EDX: ffffffff
Jul 17 22:59:52 dekernel kernel: [11660.155003] ESI: f2c7be40 EDI: 00000000 EBP: f3531d2c ESP: f3531d14
Jul 17 22:59:52 dekernel kernel: [11660.155003]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jul 17 22:59:52 dekernel kernel: [11660.155003] Process bdi-default (pid: 17, ti=f3530000 task=f34db2c0 task.ti=f3530000)
Jul 17 22:59:52 dekernel kernel: [11660.155003] Stack:
Jul 17 22:59:52 dekernel kernel: [11660.155003]  ffffffec c0a766f4 00000000 00000292 f2c7be40 00000000 f3531d40 c02dd5d0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  00000018 f4b5b5a0 00000000 f3531dd0 c02dd861 00000000 0000000e 00000001
Jul 17 22:59:52 dekernel kernel: [11660.155003]  00001747 00000001 00000000 f3531db4 c0355ad0 00000a83 0002914a 00000002
Jul 17 22:59:52 dekernel kernel: [11660.155003] Call Trace:
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c02dd5d0>] clear_page_dirty_for_io+0xb0/0xe0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c02dd861>] write_cache_pages+0x141/0x370
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c035515a>] mpage_writepages+0x5a/0xa0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<f84d75fd>] fat_writepages+0xd/0x10 [fat]
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c02de997>] do_writepages+0x17/0x30
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c0346bd9>] writeback_single_inode+0xc9/0x200
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c0346f42>] writeback_sb_inodes+0xb2/0x180
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c0347885>] wb_writeback+0x155/0x3e0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c0347ba3>] wb_do_writeback+0x93/0x1f0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c02ee0e9>] bdi_forker_thread+0x89/0x3d0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c02655d4>] kthread+0x74/0x80
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c069a8e6>] kernel_thread_helper+0x6/0xd
Jul 17 22:59:52 dekernel kernel: [11660.155003] Code: c3 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 83 ec 18 89 5d f4 89 c3 89 e0 25 00 e0 ff ff 89 75 f8 89 7d fc 83 40 14 01 8b 43 14
Jul 17 22:59:52 dekernel kernel: [11660.155003] EIP: [<c042271f>] __percpu_counter_add+0x1f/0xd0 SS:ESP 0068:f3531d14
Jul 17 22:59:52 dekernel kernel: [11660.155003] CR2: 0000000036391000
Jul 17 22:59:52 dekernel kernel: [11660.168399] ---[ end trace f0a2c1711cf79bb9 ]---

Yes, this disk does have bad sectors, but it is truly strange that a bad sector can kill the kernel, and stranger still that only copying this one file triggers the problem.

2011/07/22 update:

It looks like the drive enclosure is to blame. Its bridge controller is from a domestic vendor called Super Top and apparently has a bug: the content of the transferred data triggers the controller bug, which then brings the kernel down ~??!!

Bus 002 Device 009: ID 14cd:6600 Super Top USB 2.0 IDE DEVICE
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  idVendor           0x14cd Super Top
  idProduct          0x6600 USB 2.0 IDE DEVICE
  bcdDevice            2.01
  iManufacturer           1 Super Top
  iProduct                3 USB 2.0  IDE DEVICE
  iSerial                 2 ??????????
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           32
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xc0
      Self Powered
    MaxPower                2mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     80 Bulk (Zip)
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
Device Qualifier (for other device speed):
  bLength                10
  bDescriptorType         6
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  bNumConfigurations      1
Device Status:     0x0001
  Self Powered

Is this the legendary no-name brand? The serial number is a string of question marks.
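If this bridge really is the culprit, it may be worth trying the usb-storage driver's per-device quirk flags before throwing the enclosure away — the driver accepts `quirks=VID:PID:flags` as a module parameter or on the kernel command line. A sketch that builds the parameter for this device; picking the `r` flag (US_FL_IGNORE_RESIDUE) is my guess at a workaround, not a confirmed fix:

```shell
# Build the usb-storage quirks parameter for the suspect bridge.
# VID:PID comes from the lsusb output above; 'r' tells the driver
# to ignore the data residue the bridge reports (an assumption).
vid_pid="14cd:6600"
quirk="usb-storage.quirks=${vid_pid}:r"
echo "$quirk"

# Hypothetical application without rebooting (module must be unloadable):
#   modprobe -r usb-storage && modprobe usb-storage quirks=14cd:6600:r
```

Other flag letters documented for the driver include `m` (limit transfers to 64 sectors), which is another common workaround for flaky bridges.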



Bus 002 Device 010: ID 152d:2338 JMicron Technology Corp. / JMicron USA Technology Corp. JM20337 Hi-Speed USB to SATA & PATA Combo Bridge
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  idVendor           0x152d JMicron Technology Corp. / JMicron USA Technology Corp.
  idProduct          0x2338 JM20337 Hi-Speed USB to SATA & PATA Combo Bridge
  bcdDevice            1.00
  iManufacturer           1 JMicron
  iProduct                2 USB to ATA/ATAPI bridge
  iSerial                 5 8020A4C30450
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           32
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          4 USB Mass Storage
    bmAttributes         0xc0
      Self Powered
    MaxPower                2mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     80 Bulk (Zip)
      iInterface              6 MSC Bulk-Only Transfer
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
Device Qualifier (for other device speed):
  bLength                10
  bDescriptorType         6
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  bNumConfigurations      1
Device Status:     0x0001
  Self Powered