momo zone

A kernel tuner's blog

Monthly Archives: July 2011

Another disk/filesystem failure, and it's driving me mad

KDE 4.5 seems to have a serious bug: I opened an NTFS partition with Dolphin and the whole KDE session crashed, and then the kernel crashed too. After the hard reboot that partition would not read; the kernel reported uncorrectable sectors a dozen or so times before it would continue initializing. Under Windows, chkdsk i: /f found that the partition's security descriptor could not be opened; luckily it was still recoverable. My guess is it sat exactly on the bad sector. A filesystem repair of course cannot fix bad sectors, and to make things worse this damned disk refuses to run smartctl --test=long and cannot do a surface scan either. Absolutely terrible...

I did find an article that looks at how to handle bad disk sectors properly, from the standpoint of S.M.A.R.T. technology:

http://smartmontools.sourceforge.net/badblockhowto.html


[Repost] Citrix IMA architecture and fundamentals

by Brian Madden

Remember that a Citrix Presentation Server farm is really just a database (called the IMA Data Store), and Presentation Servers are said to be part of the same farm if they’re sharing the same data store. The data store stores all configuration information for all farm servers. When a Presentation Server starts up (or, more correctly, when the IMA service on a Presentation Server starts up), the following process takes place:

  1. The IMA service checks the registry to find out what DSN contains the connection information to the data store. This registry location is HKLM\SOFTWARE\Citrix\IMA\DataSourceName
  2. By default, that registry key points to a file called MF20.dsn in the “%ProgramFiles%\Citrix\Independent Management Architecture” folder.
  3. The IMA service connects to the database specified in that DSN file. (Database credentials are encrypted and stored in the registry.)
  4. The IMA service downloads information that pertains to it from the central database into a local MS Jet (Access) database.
  5. Throughout its operation, the IMA service interacts with the locally cached subset of the central data store.
  6. Every 30 minutes, the IMA service contacts the central data store to see if anything has changed.

The Local Host Cache

As previously stated, the IMA service running on each Presentation Server downloads the information it needs from the central data store into a local MDB database called the local host cache, or “LHC.” (The location of the local host cache is specified via a DSN referenced in the registry of the Presentation Server, at HKLM\SOFTWARE\Citrix\IMA\LHCDatasource\DataSourceName. By default this is a file called “Imalhc.dsn” and is stored in the same place as MF20.dsn.)

Each Presentation Server is smart enough to only download information from the data store that is relevant to it, meaning that the local host cache is unique for every server. Citrix created the local host cache for two reasons:

  • Increased Redundancy. If communication with the central data store is lost, the Presentation Server can continue to function since the information it needs is available locally.
  • Increased Speed. Since the local host cache contains information the Presentation Server refers to often, the server doesn’t have to access the IMA data store across the network every time any bit of information is needed.

The LHC is critical in a CPS environment. In fact, it’s the exclusive interface of the data store to the local server. (In other words, the local server’s IMA service only interacts with the LHC. It never contacts the central data store except when it’s updating the LHC.)

If the server loses its connection to the central data store, there’s no limit to how long it will continue to function. (In the days of MetaFrame XP, this limit was 48 or 96 hours, but that was because the data store also stored license information.) But today, the server can run forever from the LHC and won’t even skip a beat if the central connection is lost. In fact now you can even reboot the server when the central data store is down, and the IMA service will start from the LHC no problem. (Older versions of MetaFrame required a registry modification to start the IMA service from the LHC.)

The LHC file is always in use when IMA is running, so it’s not possible to delete it or anything. In theory it’s possible that this file could become corrupted, and if this happens I guess all sorts of weird things could happen to your server. If you think this is the case in your environment, you can stop the IMA service and run the command “dsmaint recreatelhc” to recreate the local host cache file, although honestly I don’t think this fixes anything very often. (I think it’s more to make people feel better. “Ahhh. I recreated the LHC, so we’ll see if the problem goes away.”)

Data Store Architecture

Now let’s take a closer look at the actual database that’s used to power the IMA data store. If you open this database with SQL Enterprise Manager (or whatever Oracle calls their database management tool), you’ll see it has four tables:

  • DATATABLE
  • DELETETRACKER
  • INDEXTABLE
  • KEYTABLE

If you’re at all familiar with databases, you’re probably thinking this is kind of weird. Wouldn’t the central database of a complex product like Citrix Presentation Server have hundreds of tables? Shouldn’t there be tables that list servers, apps, users, and policies, not to mention more tables linking them all together? The reason you don’t see the database structure you’d expect is because the IMA data store is not a real relational database. It’s actually an LDAP database that Citrix sort of hacked to work on top of a relational database like SQL Server.

This is because Citrix first came up with the concept of the IMA data store when they were working on MetaFrame XP in 2000. At that time they had planned to use Active Directory as the data store instead of a database. They developed the entire MetaFrame XP product around an LDAP-based data store instead of a relational database-based data store. Then towards the end of the development process, Citrix (smartly) realized that not too many people would want to extend their AD schemas just to use Citrix, so they quickly moved to using a regular database instead. The only problem was that the entire IMA service and data store were all ready to go using LDAP, and Citrix couldn't just re-write the entire product to use a relational database instead. The solution was that Citrix had to implement their own LDAP-like engine that runs on top of a normal relational database. (On top of all that, Citrix encrypts this whole thing, so the contents really are gobbledygook to the casual observer.)

This is the reason you can’t just access the IMA data store directly through SQL Enterprise Manager. (Well, technically you can, but if you run a query you’ll get meaningless hex results.) If you try to edit any of the contents of the data store directly in the database, you will definitely break it and have to restore from backup.

For those curious to learn more about the LDAP-like structure of the data store, there’s a tool on the Presentation Server installation CD called “dsview.” DSview is fun to play with but not really that useful.

One final word of caution: There is a tool in existence called “dsedit.” As you can probably guess from the name, dsedit is basically a “write-enabled” version of dsview. If you happen to find this tool out on the Internet, DO NOT use it in your environment! This is an internal Citrix tool that is not meant for general use.

Now if you’re thinking, “I know what I’m doing, so I can play with dsedit,” I’ll warn you again: Don’t do it! The problem is that since dsedit is an internal-only tool, it’s not externally version-controlled. Citrix has many different compiled versions of this tool for all different versions of Presentation Server (and in some cases with specifics for certain hotfixes). So if you just happen to find some random hacker site with dsedit for download, you have no idea whether that dsedit version is the version that’s compiled to work with your specific version of the data store. (Chances are it’s not.) And using the wrong version of dsedit with your data store an easily corrupt the entire store (since data store items are maintained in long HEX strings that represent the LDAP-like node items.)

IMA service to data store communication

Let’s take a closer look at how a Presentation Server communicates with the central data store. We initially outlined the process that takes place when the IMA service starts up. In it, we described the IMA service downloading information from the central data store that’s used to create the local host cache. Of course if the local host cache is already on the server (and up-to-date) when IMA starts, there’s no need to download everything again.

So how does the server know whether its local host cache is current? Citrix makes this possible via a series of “sequence numbers.” Every single configuration change made to the data store is assigned a number. The number of the most recent change is stored in the local host cache. Then when the IMA service checks the central data store for changes, it only needs to download the value of the most recent sequence number. If that number is the same as what it was last time (i.e. the same number that’s in the local host cache), then no further action is needed and the server knows its local host cache is up-to-date.

If the sequence number of the most recent change in the central data store is newer than the one in the local host cache, then more data is exchanged to determine what the changes are. If they apply to the specific server requesting the updates, they're downloaded to that server and the local host cache is updated accordingly. If the changes do not apply to the requesting server, that server still updates the most recent sequence number in its local host cache so it can continue to look for changes in the future.

The IMA service on each Presentation Server looks for changes in the central data store every 30 minutes. You can adjust this value via the registry of the Presentation Server (CTX111914), although there’s typically no reason to do that since this exchange is less than 1k if there’s no change.

IMA Data Store Database Type

Since Citrix’s implementation of the IMA data store runs on top of a regular relational database, you can pretty much use whatever kind of database server you want. Most people end up using SQL Server, although others are supported. (See CTX112591 for a complete list.)

For smaller environments, Citrix used to recommend using a Microsoft Access database running locally on one of your Presentation Servers. Nowadays that’s not really used anymore, having been replaced by SQL Server Express. (SQL Express is free and based on “real” SQL Server technology.)

A big topic of discussion has been what constitutes a “smaller” environment? Or to be more blunt, at what point do you need to switch to using a real database instead of using Access or SQL Express? A lot of people argue about this in the online forums, with the general consensus being in the five-to-ten server range. I don’t agree though. I’ve personally seen farms (even back in the MetaFrame XP days) of 50 servers running their data stores on Access, and that was fine. Since each Presentation Server only really interacts with its local host cache, a 50-server farm using Access still wouldn’t put much strain on the Access database.

To be honest, the real problem with using Access or SQL Express for your data store is that it has to be accessed “indirectly” (to use Citrix’s term). This means that the actual files that make up your data store are physically sitting on one of your Presentation Servers. The IMA service on that server accesses the database locally, and every other server in your farm accesses the data store via the IMA protocol (on port 2512) through the Presentation Server that’s hosting it. This is bad because it’s a single point of failure. If that Presentation Server goes down, your data store won’t be accessible and you won’t be able to manage your environment.

This might not be a problem in a small farm of just a few servers, but you’ll probably want a more redundant database long before your farm outgrows this architecture from a technical capacity standpoint.

IMA Data Store Size

Another question that often comes up when designing Presentation Server environments is, “How big will this IMA data store get?” The answer, very seriously, is “Not very big!”

Of course “very big” is a relative term, but in today’s world of multi-core servers with gigabytes of memory, the data store just isn’t going to grow large enough to really matter. Citrix very roughly estimates 1MB per server. And even if you built a single farm with 1,000 servers, a 1GB database in today’s world just isn’t that big anymore.

If you want more precise numbers as to the size of your data store, the Advanced Concepts Guide for CPS 4 (CTX107159) has a chart that lists exactly how many bytes each object type needs in the data store. (I have not been able to find this info for CPS 4.5, but I'm going to assume it's pretty close to 4.0.)

IMA Data Store replication strategy

If your server farm spans multiple physical locations, you might want to replicate your data store so that a local copy is running at each location. There are two (potential) advantages to this:

  • Redundancy. You don’t want a single database server failure to negatively impact your overall environment.
  • Performance. If your farm spans multiple WAN locations, you might want to have a local database at each location.

Before we discuss this further, I want to make a few things clear. First, we're talking about doing a full replication of the entire data store, so that each replica is 100% identical. Unfortunately, due to the binary LDAP structure of the data store, it's not possible to replicate just a subset of the data store to a remote site.

Second, we’re talking about replicating the data store between physical sites for site-to-site performance and redundancy reasons. If you want to cluster your data store servers, this is entirely possible, but not what we’re talking about now. (For more information about clustering your data store servers, read the High Availability chapter later in this book.)

Figure 3.x [Replicated data store between two physical locations]

Replicating your data store for redundancy

If your farm spans multiple physical locations, you might be concerned about what happens when a WAN link goes down. There’s an entire chapter later in this book dedicated to helping you design a fully-redundant environment based on everything that you’ll read up until that point. But right now we can discuss the mechanics of the data store when it comes to replication for redundancy purposes.

The first and most important thing to know is this: A Citrix Presentation Server will work indefinitely even if it loses connectivity to the central data store. (Again, remember that the local IMA service on a Presentation Server works off of its local host cache, not the central data store.) So really, before you can decide whether you want to replicate your database for redundancy purposes, you have to understand what the impact of losing connectivity to the data store actually is.

The main thing is that in order to use either one of the two CPS management consoles, you have to connect to a Citrix server that is communicating with its data store. So if your data store is lost, even though your Presentation Server will run and will accept new connections and otherwise be totally normal, you won’t actually be able to connect to that server with a management console.

What’s interesting is that this doesn’t mean that you can’t manage sessions on that server. If you can connect to a different server in your farm that is connected to the data store, then you can view all activity and all sessions from your farm–even the ones from servers that aren’t connected to the data store. But think about this for a minute. How is it possible that your management console is able to connect to a server that can access the datas store, and it’s able to see servers not connected to the data store? If this is the case, wouldn’t your “down” servers also be able to see the data store?

A more likely scenario is that you have multiple WAN locations each with their own Presentation Servers all in the same farm. If a WAN link goes down and some sites do not have their own replica of the data store, the servers, sessions, and users on that site will be fine. The problem will be that admins and help desk folks won’t be able to connect to any admin consoles at that site. (And people at the site with the data store will be able to connect, but of course they won’t be able to see or manage servers from the site with the down WAN link.)

A solution to this is to replicate your data store so that if a WAN link goes down, there’s a local replica at each location. This means that local admins will be able to connect to the management tools on those local servers and perform their typical routine maintenance tasks. (Resetting sessions, shadowing, etc.)

Of course if any admin from the “down” site makes any configuration change that’s saved to the data store, that change will be lost once the WAN link comes back up and the central data store re-replicates with the local data store. (As you can imagine, “merge” replication is not possible with this binary encrypted LDAP data store format.)

Replicating your data store for performance reasons

Some people also choose to replicate their data stores to multiple locations for performance reasons. The idea is that by doing this, your Presentation Servers can always access the data store via a local network instead of via the WAN. To be honest, this probably isn't that big of a deal. Remember that each Presentation Server interacts with its own local host cache for standard operational purposes. The central data store is only accessed to download additional configuration changes. Sure, recreating the local host cache will require the download of all the contents to rebuild the MDB cache file, but that too is not typically very large. (A few megabytes maybe?) And if your WAN can't support the transfer of a few megabytes every once in a while, then you probably shouldn't have a single farm that spans multiple sites anyway.

All that said, it’s a nice “clean” solution when all the Presentation Servers of a remote location can access everything they need on their own local LAN, and there’s certainly nothing wrong with that scenario.

Advantages of replicating

  • You can manage your servers when the WAN is down
  • Less WAN traffic (Read the “zones” section of this chapter to understand why.)
  • It just “feels” better, especially for a global environment

Disadvantages of replicating

  • More complex
  • Additional database servers required

Configuring IMA data store replication

If you decide that you’d like to replicate your data store, you’ll need to do two things:

  1. Configure the database software for replication
  2. Reconfigure your Presentation Servers to point them towards the local replica

Configuring the database for replication

All the real database servers support replication. (i.e. if you want to do this, you can’t use Access or SQL Express.) Configuring the replication of your data store is 100% a function of your database software. In fact, your Citrix servers won’t even know they’re connecting to a database replica versus the real thing.

If you’ve never done this before and you’re in the unfortunate position of making it happen in your environment,CTX112125 has more information and links to step-by-step instructions for configuring replication with SQL Server. The main thing you have to do is make sure that your replicas are “writable” by the local Presentation Servers. There’s a few ways this can be set up, but in a CPS environment you need to ensure that one master copy of the database is in charge of all changes to it. (With that weird binary encrypted LDAP format, you don’t want the database server to try to sort out two changes entered into two different replicas at the same time.)

Pointing your Presentation Servers to a new replica

Once you’ve got your data store replicated to your new location, you need to reconfigure the local Presentation Servers there to use the new replica instead of the old central location. Remember from the beginning of this section that a Presentation Server knows where to find the central data store via a file called MF20.dsn (which is specified in the registry). If you want to point your Presentation Server to another database (i.e. a local replica), all you have to do is to change that DSN and then restart the IMA service. (There are some command-line options for the dsmaint utility that let you change the location of the data store, but I personally find it easier to just edit the MF20.dsn file itself.)

Again, this will only work if you’re pointing a server to a new database that is 100% identical to the old database. You cannot use this technique to “migrate” a server between farms since a new farm wouldn’t know anything about a server that was just randomly pointed to it.

The answer to an old open question: how the MCU connects to the RF chip

Ages ago someone asked me: "On a WiFi board, how does the CPU connect to the RF chip?" I honestly didn't know at the time; I knew nothing about embedded-system buses, so I offered the most efficient mechanism in computing I could think of and answered: "shared memory"... Looking back, I was an idiot; the other person surely lost all interest in continuing the conversation. To this day I'm not sure exactly what was being asked, because it's unclear what that CPU was supposed to be: a DSP, an MCU, a SoC, or something else. These past few days I went through a pile of product briefs covering Broadcom, Marvell and Atheros (since acquired by Qualcomm), and here is a summary.

Generally speaking, a digital wireless system consists of an AP (MCU), a baseband (BB) and an RF part. In small, highly integrated systems, the AP and BB are usually integrated together as a baseband processor (BBP); in the WiFi business that combination is simply called a SoC.

1. First, what Marvell shows:

This is Marvell's top-dog solution: a high-end SoC (with 3x3 MIMO) plus a dual-band 2.4 GHz / 5 GHz radio:

On the left is the SoC, on the right the RF chip, connected through a BBU (baseband transfer unit). What is I/Q? If you don't know, go read a communications text (the OFDM chapters). The PCIe interface on the right of the SoC, the 88W8366, is this card's interface to the host, familiar enough. There are also JTAG and UART/GPIO, and the EEPROM is attached over SPI.

2. Next, the Atheros solution:

On the left, the AR7010 is a SoC; on the right, the AR9280 is what they call a signal accelerator. The unusual part is that the two are connected over PCIe, while the SoC talks to the host interface over USB. If it can sit on PCIe, this signal accelerator must be more complex inside than a bare RF chip.

Then there is the AR4100, an Atheros single-chip WiFi solution:

The blue block integrates the baseband and MAC/RF into a single package (SIP, system-in-package); the grey block is the MCU. The interface is SPI, with the SPI slave side connected to the host interface.

3. Finally, Broadcom

It appears to be the same as Marvell's: the baseband signal is passed straight to the RFIC over a BBU. The product PDF doesn't say so explicitly, but the block diagram shows it:

How are the BCM5352 and the BCM2050 connected? The answer is a BBU, just like Marvell, as the I/Q inputs and outputs in the figure below show.

What is the __KERNEL__ macro for?

You can see this macro in both kernel and application code. It is only ever tested, never substituted into actual code logic, much like the duplicate-definition guard discussed earlier. But unlike an include guard, its purpose is not to prevent redefinition, so what does it actually mean? First look at this:

cmd_kernel/sched.o := gcc -Wp,-MD,kernel/.sched.o.d  -nostdinc -isystem /usr/lib/gcc/i586-suse-linux/4.5/include -I/usr/src/linux-2.6.39.1-4/arch/x86/include -Iinclude  -include include/generated/autoconf.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m32 -msoft-float -mregparm=3 -freg-struct-return -mpreferred-stack-boundary=2 -march=i686 -mtune=core2 -mtune=generic -maccumulate-outgoing-args -Wa,-mtune=generic32 -ffreestanding -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -pipe -Wno-sign-compare -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wframe-larger-than=2048 -fno-stack-protector -fno-omit-frame-pointer -fno-optimize-sibling-calls -fasynchronous-unwind-tables -g -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fconserve-stack -DCC_HAVE_ASM_GOTO -fno-omit-frame-pointer    -D"KBUILD_STR(s)=\#s" -D"KBUILD_BASENAME=KBUILD_STR(sched)"  -D"KBUILD_MODNAME=KBUILD_STR(sched)" -c -o kernel/.tmp_sched.o kernel/sched.c

Those are the flags used to compile kernel/sched.c inside the kernel; note the gcc -D__KERNEL__. Practically every kernel compile command includes it. Now look at include/linux/sched.h:

#ifndef _LINUX_SCHED_H
#define _LINUX_SCHED_H

/*
 * cloning flags:
 */
#define CSIGNAL		0x000000ff	/* signal mask to be sent at exit */
#define CLONE_VM	0x00000100	/* set if VM shared between processes */
.......
#define CLONE_DETACHED		0x00400000	/* Unused, ignored */
#define CLONE_UNTRACED		0x00800000	/* set if the tracing process can't force CLONE_PTRACE on this clone */
.......
#define CLONE_NEWPID		0x20000000	/* New pid namespace */
#define CLONE_NEWNET		0x40000000	/* New network namespace */
#define CLONE_IO		0x80000000	/* Clone io context */

/*
 * Scheduling policies
 */
#define SCHED_NORMAL		0
#define SCHED_FIFO		1
#define SCHED_RR		2
#define SCHED_BATCH		3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE		5
/* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
#define SCHED_RESET_ON_FORK     0x40000000

#ifdef __KERNEL__

struct sched_param {
	int sched_priority;
};

#include <asm/param.h>	/* for HZ */

#include <linux/capability.h>
......
#endif

As you can see, if __KERNEL__ is defined an extra section gets compiled, and that extra section is only of use to code like the kernel itself.
Consider this scenario: I write a device driver for some device; the driver is of course compiled with __KERNEL__ defined. The driver often also ships a library that gives applications an interface to certain variables and functions. Finally we write an application that drives the device. Now suppose the library's headers are missing something, say the #define CLONE_DETACHED 0x00400000 shown above, while some library API can return CLONE_DETACHED: then you cannot write if (CLONE_DETACHED == fun(arg)) to test for it (short of digging through the kernel sources and comparing against 0x00400000 by hand). So the application would have to include the kernel header itself (a practice that is discouraged nowadays), and that raises a problem: letting applications include kernel headers exposes a lot of kernel internals and bloats the object files, because applications have no use for most kernel structures and variables. A boundary is therefore needed: which declarations are visible only to kernel code, and which are open to all code. That boundary is the __KERNEL__ macro. In practice, whatever kernel headers, variables or structures a library needs are redeclared in the library's own headers, and applications include those instead.
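As a minimal sketch of that boundary, here is a hypothetical driver header (not taken from the kernel): applications get only the ABI constants, while everything behind __KERNEL__ exists only for code compiled with -D__KERNEL__:

/* mydev.h -- hypothetical header shared by a driver and its applications */
#ifndef _MYDEV_H
#define _MYDEV_H

/* visible to everyone: the user-space ABI */
#define MYDEV_IOC_RESET 0x4d01
#define MYDEV_MAX_LEN   256

#ifdef __KERNEL__
/* visible only under -D__KERNEL__: kernel internals */
#include <linux/cdev.h>

struct mydev_state {
        struct cdev cdev;
        char buf[MYDEV_MAX_LEN];
};

int mydev_hw_reset(struct mydev_state *s);
#endif /* __KERNEL__ */

#endif /* _MYDEV_H */

An application can include this header for MYDEV_IOC_RESET without ever seeing struct cdev or any other kernel detail.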

Here is an explanation from another guru:

Paul Mackerras writes:

> The only valid reason for userspace programs to be including kernel
> headers is to get definitions that are part of the kernel API. (And
> in fact others here will go further and assert that there are *no*
> valid reasons for userspace programs to include kernel headers.)
>
> If you want some atomic functions or whatever for your userspace
> program and the ones in the kernel look like they would be useful,
> then take a copy of the relevant kernel code if you like, but don’t
> include the kernel headers directly.

Sure. That copy belongs in /usr/include/asm for all programs
to use, and it should match the libc that will be linked against.
(note: “copy”, not a symlink)

Red Hat 7 gets this right:

$ ls -ldog /usr/include/asm /usr/include/linux
drwxr-xr-x 2 root 2048 Sep 28 2000 /usr/include/asm
drwxr-xr-x 10 root 10240 Sep 28 2000 /usr/include/linux

Debian’s “unstable” is correct too:

$ ls -ldog /usr/include/asm /usr/include/linux
drwxr-xr-x 2 root 6144 Mar 12 15:57 /usr/include/asm
drwxr-xr-x 10 root 23552 Mar 12 15:57 /usr/include/linux

> This is why I added #ifdef __KERNEL__ around most of the contents
> of include/asm-ppc/*.h. It was done deliberately to flush out those
> programs which are depending on kernel headers when they shouldn’t.

What, is </usr/src/linux/asm/foo.h> being used? I doubt it.

If /usr/include/asm is a link into /usr/src/linux, then you
have a problem with your Linux distribution. Don’t blame the
apps for this problem.

Adding “#ifdef __KERNEL__” causes extra busywork for someone
trying to adapt kernel headers for userspace use. At least do
something easy to rip out. Three lines, all together at the top:

#ifndef __KERNEL__
#error Raw kernel headers may not be compatible with user code.
#endif

[Repost] ccache

If you regularly build large C/C++ projects and you're not using ccache, you're behind the times.

   ccache is a compiler cache. It speeds up recompilation by caching previous compilations and detecting when the same compilation is being done again. Supported languages are C, C++, Objective-C and Objective-C++.

 

Usage

To use ccache:

  • Prefix the compiler command with ccache: $ ccache gcc xxx
  • Or create a symlink: $ ln -s ccache /usr/local/bin/gcc

The first approach is recommended, because ccache occasionally gets confused, and when the error comes from ccache it is very hard to spot. Once, while building a particular codebase, ccache misjudged one compiler option and the build failed, and I simply could not figure out why. So when something bizarre happens, try building the normal way first.

 

Example 

  • Building a software package
  1. [/tmp/bash-4.1 0]$ uname -a
  2. Linux AP 2.6.37-gentoo #1 SMP PREEMPT Sun Jan 16 14:55:15 CST 2011 i686 Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz GenuineIntel GNU/Linux
  3. [/tmp/bash-4.1 0]$ CC="ccache gcc" ./configure
  4. [/tmp/bash-4.1 0]$ time make
  5. real 0m47.343s
  6. user 0m39.572s
  7. sys 0m3.244s
  8. [/tmp/bash-4.1 0]$ make clean
  9. [/tmp/bash-4.1 0]$ time make
  10. real 0m10.131s
  11. user 0m5.597s
  12. sys 0m1.077s

As you can see, with ccache the rebuild is about five times faster (and for a long stretch in the middle it isn't even gcc doing the compiling, otherwise the gain would be bigger). Wonderful...

 

  • Building a kernel
  1. [/tmp/linux-2.6.34 0]$ uname -a
  2. Linux boeye-AP 2.6.37-gentoo #1 SMP PREEMPT Wed Jan 12 20:06:14 CST 2011 x86_64 AMD Athlon(tm) II X4 630 Processor AuthenticAMD GNU/Linux
  3. [/tmp/linux-2.6.34 0]$ grep “make” build
  4. 28:make -j4 ARCH=arm CROSS_COMPILE=”ccache arm-linux-” O=$outdir $@
  5. [/tmp/linux-2.6.34 0]$ time ./build
  6. real 3m4.146s
  7. user 10m30.640s
  8. sys 0m37.138s
  9. [/tmp/linux-2.6.34 0]$ ./build clean
  10. [/tmp/linux-2.6.34 0]$ time ./build
  11. real 0m23.714s
  12. user 0m31.603s
  13. sys 0m12.777s

Cross-compiling the kernel likewise got nearly nine times faster.

 

  • Building Android

For Android, using ccache only takes the environment variable '$ export USE_CCACHE=1'. The difference is that by default it does not use the host's ccache binary but the one bundled with the source tree. Building Android needs a larger cache:

  1. $ ccache -M 3G    // set the cache size to 3G

[Repost] The Metamorphosis of ARM

A good article on ARM's past and present.

The Metamorphosis of ARM

The Great Depression that began in 1929 changed the world order. The Soviet Union's apparent immunity drew many people to Marx; those who feared Bolshevik red power threw themselves into the arms of fascism; the rest chose compromise and middle ground. The rapid fracturing of the world made the Second World War inevitable.

In 1933 Roosevelt became the thirty-second president of the United States and launched the New Deal. It pulled America out of the crisis and decided the course of the war. Behind Roosevelt stood Keynes, whose state capitalism defused the greatest crisis the capitalist camp had ever faced. "Compromise and middle ground" lived on.

The postwar world belonged to giant corporations, which sustained their monopolies with the help of state capital. Not all of these monopolies were malicious in origin, yet in America this was something legislated against as early as 1890. In 1911 American Tobacco was broken up; in 1982, AT&T.

Such breakups rarely reached the IT industry. Microsoft, founded in 1975, was pushed to the brink of a breakup more than once but was never split; IBM and Intel faced repeated threats of dismemberment and emerged unscathed. Even the advocates of breakups noticed one fact: these companies achieved dominance not through state capital but through painstakingly accumulated intellectual property.

The danger of this kind of monopoly lies not outside but within the walls. In these giant IT companies, the lowest-level employee is separated from the chief executive by a dozen or more levels of reporting, and what travels up that chain most readily is whitewash.

Even trivial problems can trigger endless discussion in these big companies. A meeting called to solve some issue balloons without end, from one meeting into many, from a handful of participants to dozens. By the time the issue no longer matters, there is still no internal consensus.

The weakness of the giant Western corporation is that the democracy Europe and America champion is very expensive. Mrs. Thatcher was the first head of government in Europe to recognize these problems clearly. Historically Britain paid little attention to small and medium enterprises, and in the 1950s, with Keynesianism in full bloom, Britain went through three waves of large-scale mergers. By the time Thatcher took office, giant enterprises ruled; more and more people found that they did not improve productivity, and widespread monopoly and concentration had left the British economy struggling.

In the 1980s Thatcher set about reform; wherever the blade pointed, giant companies dissolved and small and medium enterprises sprang up like mushrooms after rain. Her four measures of privatization, monetary control, welfare cuts and curbing the Labour unions objectively rescued the British economy, and made this admirable lady as widely praised as she was reviled.

ARM was born against this backdrop, which guaranteed that its founders neither would nor wished to turn ARM into a giant; that is the most important reason why ARM, for all it has achieved, still employs fewer than two thousand people. ARM originally stood for Acorn RISC Machine. Acorn Computer was founded in 1978, headquartered in Cambridge, by Andy Hopper (Cambridge University), Chris Curry (Sinclair Research) and Hermann Hauser (Cambridge University) [48].

Acorn initially built its systems around the MOS Technology 6502, an 8-bit processor designed by engineers who had come from Motorola's MC6800 team [48]. On top of the 6502, Acorn developed the system it was proudest of, the BBC Micro [49].

Through the 1980s and into the 1990s, the BBC Micro dominated the British education market. There was another 6502-based system at the time: the Apple II [50]. From then on Acorn and Apple, two companies alike in design philosophy and product shape, were bound together; some called Acorn "The British Apple" [51]. It was also around then that Acorn met the adversary of its life, Intel. The x86-based PC was a nightmare for every processor vendor of the era, and few ever woke from it: submit or die, there was no third choice. Acorn chose to submit and asked Intel for 80286 samples. Intel refused [52].

The engineers thoroughly evaluated the remaining processors on the market, with disappointing results. With no alternative left, Acorn seriously considered building a processor of its own. They had no processor design experience whatsoever; the handful of engineers had nothing but talent and a dream. Talent and dreams, as it happens, are exactly what change the world.

In October 1983 Acorn launched the project code-named Acorn RISC, with VLSI Technology as the manufacturer. On April 26, 1985, VLSI produced the first Acorn RISC processor, the ARM1. Its structure was extremely simple: a mere 25,000 transistors, without even a multiplier [52]. Nobody paid attention to this chip at the time; far more people cared about the 80386 Intel announced on October 17, 1985 [36].

No one believed this rather shabby ARM1 could threaten the 80386 in any way, not even the Acorn engineers who built it. For a processor vendor, living in the same era as Intel was a tragedy, whether you were Acorn, IBM or the imperious DEC. It is not that Intel never made mistakes, only that its few mistakes were always repaired in time. Intel's brilliant engineers pushed the processor story to its summit, and their competitors descended into hell.

Acorn had no choice but to step aside, and that dictated the ARM processor's design philosophy: low cost, low power, high performance. That philosophy happens to match the needs of the 21st-century smartphone, yet it was forced on ARM by Intel. Without meaning to, Intel raised a formidable rival that grew up under its own shadow. It is no exaggeration to say that without Intel there would be no ARM as we know it.

The pursuit of low cost and low power led Acorn to RISC rather than CISC. In the 1980s the RISC-versus-CISC debate was unresolved; the visible advantage of RISC then was that a relatively high-performance processor could be implemented with less silicon and fewer engineers [53]. That Intel used CISC also, to a large extent, decided Acorn's choice of RISC. Liu Bei's maxim "always do the opposite of Cao Cao and things will work out" became, for Acorn, "be different from Intel and there is a chance."

ARM's growth remained slow; the ARM2 and ARM3 that followed made no waves. Only a handful of companies built products on the ARM3, and some used it in research, the most famous being Apple [53], which at the time was one of the very few companies friendly to ARM.

Acorn hit a wall both financially and technically. The BBC Micro sold 1.5 million units but never brought Acorn real wealth; next to the PC sweeping the world it was negligible [54]. Nor could the ARM3 really be compared with the 80486 Intel released in 1989. The crisis finally came: in February 1985 the then IT giant Olivetti paid 12M pounds for 49.3% of Acorn [55]. Olivetti's protection brought Acorn no opportunities.

Olivetti was founded at the start of the twentieth century; its exacting devotion to intelligence and quality put its products in New York's Museum of Modern Art and in many classic films. None of that changed the company's ultimate fate. Olivetti eventually entered the PC business with Zilog's Z8000, challenging Intel on ground where Intel was invincible.

After acquiring Acorn, Olivetti confined ARM processors mostly to research and used Zilog parts in its real products. These were Acorn's hardest days. Acorn founder Andy Hopper finally chose to break away from Olivetti, and, against all expectation, Olivetti supported his decision.

In November 1990, Acorn (in effect the Olivetti Research Lab), Apple and VLSI jointly founded ARM, and Acorn RISC Machine was formally renamed Advanced RISC Machine [55]. In 1996 Olivetti, at its most desperate hour, sold its 14.7% stake in Acorn to Lehman Brothers [56].

Apple at the time was hunting for a low-power processor for a project code-named Newton, whose ultimate goal was the first tablet on Earth. Apple pinned great hopes on the tablet and named the project directly after the Isaac Newton in the company logo; Apple's original logo was Newton musing under an apple tree. The two Steves [i] named the company Apple not out of any fondness for eating apples, but because it was an apple, not a pear, that fell on Newton's head.

The Newton tablet was too far ahead of its time, and worst of all, Jobs was not at Apple then. Apple took a none-too-short time to prove a truth: an Apple without Jobs is no different from the Bulls without Jordan. In March 1996 Steve Jobs returned to Apple, and two years later he cancelled the unsuccessful project [57]. By the time Jobs launched the Newton idea again as the iPad, more than a decade had passed [58].

Apple put in 3 million dollars for 43% of ARM [60], but it did not bet on ARM; what Apple really cared about was AIM, the alliance it formed with IBM and Motorola in 1991 [59]. After ARM listed in Britain and the US simultaneously in 1998, Apple gradually sold the shares off. By 2010, even with 8 billion dollars in hand, Apple could no longer have bought ARM.

In the early 1990s ARM's finances were still tight; the first 12 employees worked crammed into a barn [ii], and the cheap-license business model was widely doubted. With Apple's generous help the ARM6 [iii] came into the world, but it changed neither Apple's nor ARM's fate. The Newton project was designing a tablet that belonged to the next century, and the ARM6, overshadowed by PC processors and the countless RISC processors of the day, achieved little.

The 1990s belonged to the PC. AMD's sudden rise and its contest with Intel formed the decade's most dazzling spectacle in processors, while the server world belonged to DEC. On February 25, 1992, DEC announced the Alpha 21064 at 150 MHz [61]; the Pentium Intel shipped the following year ran at only 66 MHz [62].

Throughout the 1990s the processor world marveled at the wonders Alpha created. The Alpha processors DEC kept releasing would not look dated in their design philosophy even today; DEC's engineers were designing processor chips for the 21st century. In the numbering of the Alpha 21x64 series, '21' stands for the 21st century and '64' for a 64-bit processor [63].

God showed DEC no favor; the severe divergence of technology from business brought disaster. Before Alpha's technology had reached its peak, DEC's finances were already threadbare. From 1994 to 1998 DEC kept selling assets around the world; by 1997 the sales spanned five continents and more than twenty countries [64]. On January 26, 1998, DEC was formally acquired by Compaq [65]. In DEC's final days two companies benefited most: one was Intel, the other ARM.

In ARM's early days the great helper was Apple, but the first to license an ARM core was Britain's own GEC semiconductor business. In 1993, on Apple's introduction, the ARM processor crossed mountains and seas to Japan and a partnership with Sharp, which until then had been working with Apple on the Newton project.

These deals did not ease ARM's financial crisis, and ARM kept searching for customers truly its own. In 1993 Cirrus Logic [iv] and Texas Instruments (TI) joined the ARM camp. TI's help was charcoal in the snow. TI was persuading a then little-known Finnish company, Nokia, to enter the mobile communications market with it. TI already led in DSPs but did not know the CPU business, and among the handful of companies it could steer, it finally chose ARM [67].

ARM was handed a heaven-sent opportunity. Working closely with Nokia and TI, ARM invented the 16-bit Thumb instruction set and, in a real sense, created the ARM/Thumb-based SoC business model [67]. ARM gradually escaped its financial crisis and its business kept growing; by the end of 1993 ARM had 50 employees and sales of 10M pounds.

The same year brought the most important processor core since the company's founding, the ARM7 [67]. The ARM7's die was one-sixteenth the size of an Intel 80486's, and it sold for around 50 dollars [v]. The small die gave the ARM7 low power consumption, suiting it to handheld applications [67].

The ARM7 caught the attention of the processor giant of the day, DEC, which in 1995 began developing StrongARM. Unlike other vendors licensing ARM cores, DEC obtained a full ARM architecture license: it could take the ARM instruction set and design a new processor architecture of its own, a privilege later inherited by Intel and then Marvell. On February 5 of the following year DEC formally announced the SA-110 and began sampling [68]. The SA-110 quickly won industry acceptance, and Apple used it to develop the MessagePad 2000 [69].

StrongARM's design carried elements of the Alpha. It used a 5-stage in-order pipeline, split the instruction and data caches, added DMMU and IMMU units, each MMU holding 32 fully associative TLB entries, and added a 16-entry write buffer (WB) [70]. With this, the ARM processor began to look like a microprocessor rather than a microcontroller.

DEC's help lifted the ARM processor to unprecedented heights. More important, this 160 MHz processor delivered 185 DMIPS while consuming under 500 mW [70], which drew keen interest from industry and, for the first time, real attention from academia. In 1997 DEC delivered its second StrongARM chip on schedule, the SA-1100, essentially an SA-110 with added external logic. The next year Intel provided a companion chip, the SA-1101, and the SA-1100 plus SA-1101 became the first choice of many PDA vendors. In 1999 Intel released the last StrongARM, the SA-1110 [vi], with its companion chip, the SA-1111.

StrongARM's success did not rescue DEC from its financial crisis, but DEC found an easier way to make money. In May 1997 DEC formally sued Intel, claiming that the Pentium, Pentium Pro and Pentium II infringed ten DEC patents. In September 1997 Intel countersued, claiming the Alpha series infringed as many as fourteen Intel patents [72].

In IT, such lawsuits mostly fizzle out. On November 27, 1997, DEC and Intel settled: DEC licensed to Intel all of its hardware designs except the Alpha processor and agreed to further support Intel's IA-64 development, while Intel paid 625M dollars for DEC's fab in Hudson and its chip design centers in Jerusalem, Israel and Austin, Texas. The two companies also signed a ten-year cross-licensing agreement [72].

The infusion of DEC technology carried Intel's x86 processors into a new era; Intel gained the strength to declare war on all RISC processors at once and finally unified the PC and server markets. Intel also got StrongARM from DEC. Craig Barrett saw it as a gift from heaven: the combination of x86 and StrongARM would put Intel processors everywhere in the world a processor was needed.

To welcome StrongARM, Barrett abandoned Intel's own RISC processors, the i860 and i960. Intel gave StrongARM the dazzling name XScale and, mobilizing the mightiest ecosystem ever accumulated, marched forcefully into the embedded market.

For a while XScale processors reached every corner of embedded applications: the PXA series for handheld terminals, the IXC/Intel CE series for consumer electronics, the IOP series for storage, the IXP series for communications. Intel's processor technology greatly advanced the ARM core, and borrowing the PC empire's ecosystem put the ARM processor a step ahead of every embedded competitor, from manufacturing to design. The first touchstone for XScale was Motorola Semiconductor's 68K.

Before the XScale series, the 68K ruled the embedded world; the original Apple Macintosh used a 68K too. In 1997 Motorola sold 79M 68K processors while Intel's x86 sold 75M in total [73]. It was the 68K's last blaze of glory. The ARM processors championed by Intel and TI finished the 68K off. Motorola Semiconductor was utterly unprepared for ARM's offensive, and ARM ate away its market share until nothing was left.

In 1995 Motorola Semiconductor's Hong Kong design center released the first DragonBall processor for handheld devices, the MC68328 (EZ/VZ/SZ) [74]; those were the best days of Hong Kong's semiconductor industry. StrongARM/XScale soon ended the design center's happy life. Facing ARM's challenge, DragonBall finally capitulated: the DragonBall MX (Freescale i.MX) series adopted the ARM9. Adopting ARM cores did not change the Hong Kong design center's fate; it eventually ceased to exist.

In industrial control, the 68K core evolved into ColdFire [vii], whose success in HP's low-to-mid-range printers was nearly its swan song. In communications, Motorola Semiconductor abandoned the 68K-based MC68360 and developed the PowerPC-based MPC860, a classic of the communications era; it followed with a string of PowerPC-based communication processors but never again reigned as it had in the MPC860 days. Its recent QorIQ [viii] series is always a beat behind the multicore MIPS processors.

Motorola Semiconductor, the former king, declined gracefully. It launched the first germanium transistor in 1955, founding the semiconductor IC industry; it ran far ahead of the field through the 1960s and reached the 68K's glory at the end of the 1970s. Even in 1985 Motorola was still the world's third-largest semiconductor company. But the ambition to own the whole value chain and the devotion to closed systems tripped Motorola, together with its semiconductor arm, over the same stone again and again. Into the 21st century, Motorola Semiconductor (Freescale) ranked around tenth; in 2009 it ranked only 17th.

Intel, having defeated Motorola Semiconductor, felt no joy, only a chill. In 2006 Intel's results hit bottom, which pushed then-CEO Barrett into a hard choice: on June 27, 2006, Intel sold the PXA processor line to Marvell [12].

Intel kept its ARM license but withdrew completely from the ARM camp, a very cautious yet resolute choice: Intel had a raging fire in its own backyard to put out. In the PC market AMD had been first to ship the 64-bit K8 [75] and announced the dual-core Athlon 64 at Computex 2005; the performance advantage x86 was proudest of no longer existed.

Through that stretch Intel could only fend off the Athlon 64 with process technology and formidable commercial muscle. In November 2008 Intel formally launched the Nehalem-based Core i7 for desktops [76] and Xeon for servers, with Core i3/i5 following on schedule. The Nehalem core defeated AMD decisively. It was also the third milestone product since Intel began making x86 processors, after the 80386 and the Pentium Pro. From then on AMD's processors never again surpassed Intel's in performance. But having removed its greatest threat, Intel discovered that the ARM processor was no longer the weakling it remembered.

After the ARM7, the ARM8 core was released in 1996, at exactly the wrong moment. Compared with the ARM7, the ARM8 doubled performance without significantly raising power, yet it still could not contend with DEC's StrongARM [77][78]. Only a few handsets considered the ARM8 in prototype designs, and ARM supplied customers little more than CPU sample boards.

The ARM8's failure did not block the ARM core's further development; competition with StrongARM did not weaken the ARM camp but spurred it forward. In 1997 the ARM9 was formally released, the first ARM whose DMIPS/MHz figure crossed the 1.0 mark. The ARM9 is a true milestone: with it the ARM processor formally entered microprocessor territory and was no longer a mere microcontroller.

The ARM9 stretched the ARM7's 3-stage instruction pipeline to 5 stages, rather similar to StrongARM's pipeline structure. The finer pipeline let the ARM9 reach a top clock of 220 MHz, where the ARM8 managed only 72 MHz [78]. The ARM9 further optimized Load and Store efficiency by abandoning the Princeton (von Neumann) structure for the Harvard structure, with separate instruction and data caches.

The ARM9's execution unit separated the Memory and Write Back stages, which access memory and write results back to registers, respectively. These techniques let the ARM9 complete Load and Store instructions in a single cycle, whereas on the ARM7 a Load took 3 cycles and a Store took 2.

The ARM9 could rely on an enhanced compiler to reorder instructions around RAW (read-after-write) [ix] hazards. Together these enhancements made the ARM9 roughly twice as fast as the ARM7 on the same process [79]. The ARM7 was not retired: its lean design kept power consumption very low, and the iPod Apple announced on October 23, 2001 [80] still used an ARM7 [81].

The sensible positioning of ARM7 and ARM9 let the ARM camp grow at full speed; SoCs based on the two cores quickly spread to every corner of the world. The core kept moving forward: the ARM10 was introduced at the 1998 EPF (Embedded Processor Forum), and on April 12, 2000, Lucent released the first ARM10-based processor chip [83].

The ARM10's design goal was again to double the ARM9's performance on the same process. The first step to higher performance is a faster pipeline clock, and the clock is limited by the slowest logic stage. The ARM10 used a 6-stage pipeline, not the ARM9's 5 stages plus one, but a carefully rebalanced and tuned design; the result was that, on the same process, the ARM10 could clock 1.5 times as fast as the ARM9 [82][84].

The ARM10 reused the ARM8's system bus, widening the ARM9's 32-bit system bus to 64 bits. That let the ARM10 complete two register-memory data transfers in one clock cycle, greatly improving the efficiency of the Load Multiple and Store Multiple instructions [84].

The ARM10 reworked the cache memory system, improving memory-system efficiency over the ARM9. Its instruction and data caches use virtual addresses with a 64-way set-associative structure, and it introduced the streaming buffer and cache line filling units that high-end processors use to move data between the pipeline and the cache [84].

The ARM10 optimized memory loads with the simplest possible out-of-order mechanism: while a load is still outstanding, subsequent unrelated instructions may continue to execute. Multiplication received special attention: a new 16x32 multiply and multiply-accumulate unit with a two-stage multiply pipeline sustains one multiply instruction per clock cycle [84]. The ARM10 core also added floating-point support.

Technically the ARM10 far surpassed the ARM9, but commercially it could not compete with it. Its fate matched the ARM8's with uncanny precision: the ill-timed ARM8 ran into StrongARM, and the ARM10 lived in the same era as XScale.

Intel's engineers could not keep their hands off ARM's pipeline. Next to earlier ARM cores the ARM10's pipeline was a leap; next to the high-end processors of the same era it was a toy. Intel's help greatly advanced the ARM processor's development.

While preserving the XScale architecture's low power, Intel brought in the superpipelined RISC techniques already matured on the Pentium Pro series [85]; with Intel's process advantage, XScale's top clock reached 1.25 GHz [86]. By then Intel's processors had walked into the high-frequency, low-efficiency trap: a 1.25 GHz PXA3xx performed only about 25% better than a 624 MHz PXA270 [86].

The XScale architecture never made Intel a profit. The ICG (Intel Communication Group) and WCCG (Wireless Communications and Computing Group) units brought Intel heavy losses; ICG lost $817M, $824M and $791M in 2002-2004 [87]. On December 11, 2003, Intel announced the merger of WCCG into ICG, effective January 1, 2004.

The merger did not save XScale. In 2006, under AMD's relentless pressure, Intel posted its worst quarterly results in twenty years and began the largest layoffs in its history: on July 13, 2006, Intel eliminated 1,000 management positions [89], and on September 5, 2006, it cut 10% of its staff [90].

Before that, Intel had sold off the parts of the XScale business Marvell was still willing to take [12]. What Marvell wanted was not the XScale core but the full ARM instruction-set license Intel had inherited from DEC; Marvell soon released processors based on standard ARM v5/v6/v7 rather than relying on XScale alone. XScale, the architecture that had consumed nearly all of Intel's devotion, had reached the end of the road.

Intel did not leave the ARM camp for lack of $600M in cash. Contrary to what many assumed, Intel did not abandon XScale in order to push the Atom processor; rather, eight long years had taught Intel one fact: ARM's cheap licensing could not make Intel money, so Atom it had to be.

The beneficiary of ARM's cheap licensing is ARM itself. As processor vendors kept joining, the ARM camp grew rapidly, which also accelerated the survival of the fittest among those vendors. But the fact Intel discovered still applies to every semiconductor vendor using an ARM license today.

The greatest embarrassment for the ARM core is that the ones reaping windfall profits from this supposedly most open of processor cores are companies running the most closed systems in history. On the back of ARM, Qualcomm found the perfect vehicle for its 3G patents, and Apple keeps peddling all manner of novel gadgets. From the communications side, Cisco and Huawei have joined the ARM camp in turn. ARM, a processor born of the semiconductor industry, has not much benefited that industry: its arrival drastically lowered the barrier to processor design, leaving the vendors who make processors for the sake of making processors, on semiconductor technology alone, struggling to survive.

Intel bore the brunt, and its mistake had been made more than a decade earlier. Barrett's most advantageous move would have been to take StrongARM from DEC and then strangle StrongARM with his own hands. His careless slip raised a mighty rival for Intel's future, and also made the whole processor world more interesting.

ARM drew enough energy from the XScale processor to stand free of any vendor; its fate was now firmly in its own hands. In December 2002 the ARM1136 core was released [91]; on July 19, 2004, the ARM1176 [92]; on March 10, 2005, the ARM1156 [93]. The ARM processors before these were widely used, but from a purely technical standpoint they were unremarkable.

The ARM11 is based on the ARMv6 instruction set; before it ARM had developed the v1, v2, v2a, v3, v4 and v5 sets. ARM cores and instruction sets do not map one to one: the ARM9 used v4 and v5, XScale used v5, and the ARM7 started with v3, moved to v4, and finally upgraded to v5. Instruction-set names also carry suffixes, as in ARMv5TEJ: T means Thumb support, E the Enhanced DSP instructions, and J the Jazelle DBX instructions.

ARMv4 contains the most basic ARM instructions; v5 strengthened ARM/Thumb interworking while adding the CLZ (Count Leading Zeros) and BKPT (software breakpoint) instructions; ARMv5TE added a series of Enhanced DSP instructions such as PLD (Preload Data), LDRD (Dual Word Load), STRD (Dual Word Store) and 64-bit register transfer instructions like MCRR and MRRC. v4 and v5 differ little, and v5 remains backward compatible with v4 [94].

The v6 set, however, is not 100% backward compatible with v5: ARMv6's sweeping changes to the memory access model made full backward compatibility impossible. Judged by the obsessive backward compatibility of x86 processors these changes are imperfect, but it is exactly such imperfections that let the ARM core travel light.

ARM's instruction set follows the RISC architecture, yet it still contains many CISC elements. Compared with the PowerPC instruction set, ARM's is much messier, which makes no small trouble for the pipeline's decode unit. An ARM core carries three classes of instructions: the 32-bit ARM instructions, the 16-bit Thumb instructions, and the variable-length, byte-based Jazelle DBX (Direct Bytecode eXecution) set. Among ARM's few instruction classes, two deserve special attention: the conditional execution instructions and the shift operations.

Most ARM data-processing instructions support conditional execution, meaning an instruction executes, or not, according to the status flags. This can, to a degree, reduce the delay incurred when a conditional branch is mispredicted. In computing the GCD (greatest common divisor), ARM's conditional execution instructions play a big role, as Figure 2 shows.

[Figure 2: an implementation of the gcd algorithm [94]]

As the figure shows, because SUBGT and SUBLE execute or not according to the flags produced by CMP, the code length drops significantly. ARM also gives shifts special treatment: there is no standalone shift instruction; instead a barrel shifter lets a shift combine with other instructions, which can make certain computations markedly more efficient, as Figure 3 shows.

[Figure 3: using the barrel shifter]

These special instructions set the ARM core apart from other processor cores, but they do not automatically mean much higher efficiency. For one thing, the CMP, SUBGT and SUBLE instructions are strongly dependent and cannot execute in parallel. For another, the branch-prediction units of modern processors already remove most of the cost of conditional branches. Some processors achieve ARM's conditional execution at a smaller price, for example x86's CMOV instruction and PowerPC's isel instruction.

Conditional execution costs the ARM core 4 condition bits per instruction, constraining the growth of the instruction set and the register file: most RISC processors have 32 general-purpose registers, while the ARM core has only 16 [x]. ARM's special shift operations increase inter-instruction dependence, which in some cases hinders multi-issue pipelines and complicates the implementation of reservation stations (RS) in the pipeline.

Computer architecture is an art of trade-offs; a foot can be too short and an inch too long. Every core has the applications that suit it best, and no verdict of better or worse can be drawn without careful quantitative analysis. One conclusion does still hold today: after years of natural selection, the few surviving processor cores are converging under fierce competition.

The ARM11 core adopted a number of IPC-raising techniques common in modern processors, an important milestone for the ARM processor. It drew the attention of the two towering figures of computer science, David A. Patterson and John L. Hennessy, who made the ARM11 core, rather than MIPS, the centerpiece of their authoritative "Computer Organization and Design, Fourth Edition: The Hardware/Software Interface". That is the greatest recognition academia has ever given the ARM processor.

The ARM11 supports multiple cores and uses an 8-stage pipeline; the first cores shipped at 350-500 MHz, with a maximum of 1 GHz. On a 0.13 um process at 1.2 V, the ARM11's power-to-frequency ratio is only 0.4 mW/MHz. The ARM11 added SIMD instructions, doubling MPEG-4 codec speed relative to the ARM9, and changed the cache memory structure to index cache lines by physical address [95]. The ARM11 finally adopted dynamic branch prediction, with a 64-entry, 4-state BTAC (Branch Target Address Cache) [95].

The ARM11 further optimized the pipeline's access to the memory system, especially reads and writes under cache misses. In the ARM11 core, an outstanding memory read does not block subsequent unrelated instructions, even if those are also memory reads; only when three memory reads have all missed the cache does the pipeline stall [95].

Although the ARM11 does not use the out-of-order plus superscalar approach common in RISC processors, and can issue only one instruction per clock, in order, it does support out-of-order completion: unrelated instructions in the execution units may finish out of order without waiting for earlier instructions to complete.

These enhancements make the ARM11 a considerable technical leap over the ARM9/10. But against the x86, PowerPC and MIPS processors of the same era, the gap remained substantial. The ARM11 core's means of survival was still performance per watt.

On the strength of that performance-per-watt ratio, the ARM11 core achieved huge commercial success. The ARM11 was not a high-performance processor, but as processor performance kept climbing, quantity turned into quality: the appearance of the ARM11 core made the smartphone possible.

Before it, handsets based on ARM9 or XScale processors were merely feature phones with a few smart parts added. The ARM11's arrival accelerated the weeding-out of the handset camp: Apple and HTC surged in smartphones while Motorola never recovered. After the ARM11, ARM entered explosive growth and quickly released the Cortex A8 and A9 cores in succession.

The rapid refresh of ARM cores pushed Nokia, a company slow to react to new technology, step by step toward decline. The Nokia N8, which began shipping at the end of September 2010 [96], was still using an ARM11 processor at 680 MHz [97], yet this product was billed as Nokia's newest flagship while its competitors had long since moved to 1 GHz Cortex A8 processors.

The Cortex processors are a watershed: the ARM line begun in 1983 finally produced a modern processor in the true sense. ARM had broken out of its chrysalis; it was no longer a low-power processor shackled to low capability. From this moment, the ARM processor could trade blows with Intel. On April 3, 2010, Apple's Jobs formally launched the iPad, and ARM entered the tablet market with it [99]. ARM had carried the fire into Intel's backyard.

Intel, having discarded the XScale architecture, did not give up on handset processors. On January 23, 2009, Nokia and Intel formed a long-term partnership in the handset space [103]. On June 4, 2009, Intel acquired Wind River [102]. On May 4, 2010, Intel formally launched the processor code-named Moorestown for smartphones and tablets [100]. On August 29, 2010, Intel bought Infineon's wireless unit [104]. Around 2011, Intel is expected to release the smartphone processor code-named Medfield [101]. This chain of partnerships and acquisitions has given Intel the ability to enter the handset market.

By now ARM's ambitions in the PC market, and x86's in the handset market, are plain for all to see. On September 9, 2010, ARM formally announced the Cortex A15 core, code-named Eagle and billed at five times the performance of today's smartphone-class processors, targeting high-end handsets, home entertainment, wireless infrastructure, and also low-end servers [98]. With the Cortex A15, ARM announced to the world that besides the PC, it wants the server too.

ARM, the processor Intel once scorned, then fostered, then abandoned, now challenges Intel's x86 head on. This contest will be the main theme of the processor world for the next 5 to 10 years, and its outcome will shape the landscape for the 20 years after that. Do not assume the ARM processor has no chance of entering the PC market, and do not assume it can keep sweeping all before it in handsets.


[i] Both of Apple's founders were named Steve: one is Steve Wozniak, the other the universally known Steve Jobs. Steve Wozniak invented the Apple I and Apple II. The two Steves founded the famous Apple in a garage in April 1976.

[ii] Britain's barn culture is the counterpart of America's garage culture: the cradle of new technology.

[iii] ARM went straight from the ARM3 to the ARM6.

[iv] The first ARM processor I planned to use was Cirrus Logic's EP7312. At the time I was using Altera's EPLD, the EP7132, and I occasionally mixed up the two part numbers. By a stroke of luck, a careless purchaser bought EP7312s instead of the EP7132 I had asked for, and that chip became the first ARM processor I ever bought.

[v] Processor prices were absurdly high back then; 50 dollars was already very cheap.

[vi] I first touched ARM processors with the SA-1110; those are days forever worth remembering.

[vii] The first processor I worked with in Motorola's semiconductor unit was ColdFire, which is still being developed today. It is compatible with the 68K at the assembly-language level, but not at the object-code level.

[viii] The QorIQ series is based on the E500mc core, which differs slightly from the E500 v2. My first book, 《Linux PowerPC详解—核心篇》, was based on the E500 core; I had planned a set of books, a core volume and an applications volume. The applications volume was to cover peripheral devices; the later 《PCI Express体系结构导读》 grew out of 《Linux PowerPC详解—应用篇》, which was to have covered network protocols, PCI Express and the USB bus, but the network-protocol and USB parts were later dropped.

[ix] In processor architecture, three classes of hazards get the most attention: RAW, WAR and WAW. Register renaming can eliminate WAR and WAW hazards.

[x] Considering that ARM cores before the ARM11 supported neither dynamic branch prediction nor multi-issue, conditional execution really could improve the efficiency of the ARM7/9 cores.

Hardirq, Softirq, Tasklet and Workqueue

Interrupts are a tangled topic, and the problems they raise easily start arguments. Besides the question of sleeping in interrupt context, which I covered before, there are tasklets and workqueues. It's time to reorganize and summarize.

Textbooks and the Intel manuals classify interrupts like this:

Interrupts can be divided into synchronous and asynchronous interrupts:

1. A synchronous interrupt is produced by the CPU control unit as instructions execute; it is called synchronous because the CPU only raises it once the current instruction has finished, never in the middle of an instruction. A system call is an example.

2. An asynchronous interrupt is generated by some other hardware device at an arbitrary time relative to the CPU clock, which means it can arrive between instructions. A keyboard interrupt is an example.

Per Intel's official documentation, synchronous interrupts are called exceptions, and asynchronous interrupts are called interrupts.

Interrupts divide into maskable interrupts (a printer interrupt, say) and non-maskable interrupts. Exceptions divide into three classes: faults (e.g. page faults), traps (e.g. debug exceptions), and aborts.

Broadly speaking, then, interrupts fall into four categories: interrupts, faults, traps and aborts. Table 1 compares their similarities and differences.

Table 1: Interrupt categories and their behavior

Category    Cause                            Async/Sync   Return behavior
Interrupt   signal from an I/O device        async        always returns to the next instruction
Trap        intentional exception            sync         always returns to the next instruction
Fault       potentially recoverable error    sync         returns to the current instruction
Abort       unrecoverable error              sync         does not return

Every interrupt in the x86 architecture is assigned a unique number, its vector (an 8-bit unsigned integer). The vectors of non-maskable interrupts and exceptions are fixed; maskable interrupt vectors can be changed by programming the interrupt controller.

OK, all of the above is the standard fare of textbooks and manuals. When a concrete operating system actually implements interrupt handling, things are nowhere near that simple.

The traditional interrupt mechanism works like this: BIOS initialization -> interrupt vector initialization -> kernel installs the interrupt descriptor table -> an interrupt fires -> look up the IDT for the handler -> disable interrupts -> handle the interrupt, done -> enable interrupts.

Everything above describes hard interrupts (a consequence of the textbooks' limits and the manuals' rigor), i.e. the traditional way of handling interrupts, with hardware support running through the whole process. It was later realized that in the "disable interrupts -> handle -> enable interrupts" window, interrupts can be lost while they are disabled, especially when handling takes a long time. So starting with Linux 1.x, interrupt handlers were conceptually split into a top half and a bottom half.

The top half runs the moment the interrupt arrives; because it runs with interrupts fully masked, it must be fast, or other interrupts will not be serviced in time. The bottom half (if there is one) does nearly everything the handler has to do and can be deferred. The kernel treats the two halves as independent functions. The top half's job is to "register" the interrupt and decide whether its associated bottom half needs to run: work that must happen immediately belongs in the top half, work that can be postponed may go in the bottom half. The bottom half does the work that is closely tied to the interrupt but that the top half does not do itself, such as querying the device for information about when the interrupt occurred (usually by reading the device's registers) and processing accordingly. Note that the bottom half is really spawned by the top half: when the printer port raises an interrupt, the handler's top half runs immediately and posts a soft interrupt (one kind of bottom half, introduced below) into the kernel, which the kernel then uses to wake the sleeping handler process on the printer task queue.

Their biggest difference: the top half cannot be interrupted, while the bottom half can. Ideally the top half would hand all the work to the bottom half, so that the top half itself does very little and can return as fast as possible. But the top half must do some work, for example acknowledging the interrupt's arrival to the hardware, plus time-sensitive jobs like copying data out of the device. Everything else can be left to the bottom half (a typical scene: a network packet arrives at the NIC, the top half must timestamp the packet, and all further processing is deferred to the bottom half).

The kernel's interrupt handling keeps changing, and the changes center on the bottom half, not the top. As shown above, the top half still follows the traditional mechanism, i.e. it depends on the hardware interrupt, so the top half can also be called the hard IRQ. The bottom half merely processes the work the top half pushed off and is implemented entirely in code, so it can loosely be understood as a "soft interrupt"; but this "soft interrupt" is not the softirq of the kernel documentation. The real softirq, one of the bottom-half implementations, arrived in the 2.3 development series and shipped with 2.4. The kernel's bottom-half machinery has kept evolving, from the original BH (bottom half) to softirq (introduced in 2.3), tasklets (introduced in 2.3), and workqueues (work queue, introduced in 2.5). In 2.6 the traditional BH mechanism has been removed entirely; when people mention BH now, they really mean these three implementations: softirq, tasklet and workqueue.

##############################################################

SOFTIRQ

##############################################################

softirq was introduced to replace the traditional BH because:

1. only one CPU in the system could execute BH code at a time;

2. BH functions could not nest.

Once SMP became widespread, these flaws turned fatal. softirq supports SMP: the same softirq can run on different CPUs at the same time, so a softirq must be reentrant. One idea runs through the whole softirq design and implementation: "who marks, who runs", i.e. each CPU is solely responsible for the soft interrupts it triggered, without interfering with the others. That makes effective use of the SMP system's parallelism and greatly improves processing efficiency.

include/linux/interrupt.h defines a softirq_action structure to describe a softirq request, as shown below:

struct softirq_action
{
	void (*action)(struct softirq_action *);
};

The function pointer action points at the softirq request's service routine.

kernel/softirq.c defines a global softirq vector table, softirq_vec[NR_SOFTIRQS]:

static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;

i.e. NR_SOFTIRQS soft interrupt descriptors, each represented by a softirq_action structure. The kernel predefines the meaning of several softirq vectors for us:

enum
{
 HI_SOFTIRQ=0,
 TIMER_SOFTIRQ,
 NET_TX_SOFTIRQ,
 NET_RX_SOFTIRQ,
 BLOCK_SOFTIRQ,
 BLOCK_IOPOLL_SOFTIRQ,
 TASKLET_SOFTIRQ,
 SCHED_SOFTIRQ,
 HRTIMER_SOFTIRQ,
 RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */

 NR_SOFTIRQS
};

Right: this soft interrupt vector table mirrors the hard interrupt vector table, with priority decreasing from top to bottom. The enum usage here is a neat trick: each enumerator takes the next index in order, so HI_SOFTIRQ is 0 and the index of NR_SOFTIRQS comes out as exactly the number of enumerators, 10. Macros could define the same thing:
#define HI_SOFTIRQ 0
#define TIMER_SOFTIRQ 1
.......
#define NR_SOFTIRQS 10

open_softirq registers a softirq with the kernel; in essence it fills the corresponding slot of the softirq vector table with the handler:

void open_softirq(int nr, void (*action)(struct softirq_action *))
{
 softirq_vec[nr].action = action;
}
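
To make this concrete, here is a minimal sketch of registering and raising a softirq; the vector MY_SOFTIRQ and all my_* names are invented, and adding a private vector means editing the enum above, which is exactly why ordinary drivers use tasklets instead:

#include <linux/interrupt.h>

/* Sketch only: assumes MY_SOFTIRQ was added to the enum above. */
static void my_softirq_action(struct softirq_action *a)
{
	/* softirq context: interrupts enabled, sleeping forbidden */
	pr_info("my softirq ran on CPU %d\n", smp_processor_id());
}

static int __init my_init(void)
{
	open_softirq(MY_SOFTIRQ, my_softirq_action); /* fill the vector slot */
	return 0;
}

static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
	/* top half: ack the hardware, then just mark the softirq pending */
	raise_softirq(MY_SOFTIRQ);
	return IRQ_HANDLED;
}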

Now let's walk through the softirq processing flow:

Occasion 1: softirqs executed directly from the hard interrupt path

1. The top-half (hard interrupt) handler do_IRQ in arch/x86/kernel/irq.c:

/*
 * do_IRQ handles all normal device IRQ's (the special
 * SMP cross-CPU interrupts have their own specific
 * handlers).
 */
unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
{
 struct pt_regs *old_regs = set_irq_regs(regs);

 /* high bit used in ret_from_ code */
 unsigned vector = ~regs->orig_ax;
 unsigned irq;

 exit_idle();
 irq_enter();

 irq = __get_cpu_var(vector_irq)[vector];

 if (!handle_irq(irq, regs)) {
 ack_APIC_irq();

 if (printk_ratelimit())
 pr_emerg("%s: %d.%d No irq handler for vector (irq %d)\n",
 __func__, smp_processor_id(), vector, irq);
 }

 irq_exit();

 set_irq_regs(old_regs);
 return 1;
}

A slightly special case, the APIC timer interrupt in arch/x86/kernel/apic/apic.c:

/*
 * Local APIC timer interrupt. This is the most natural way for doing
 * local interrupts, but local timer interrupts can be emulated by
 * broadcast interrupts too. [in case the hw doesn't support APIC timers]
 *
 * [ if a single-CPU system runs an SMP kernel then we call the local
 * interrupt as well. Thus we cannot inline the local irq ... ]
 */
void __irq_entry smp_apic_timer_interrupt(struct pt_regs *regs)
{
 struct pt_regs *old_regs = set_irq_regs(regs);

 /*
 * NOTE! We'd better ACK the irq immediately,
 * because timer handling can be slow.
 */
 ack_APIC_irq();
 /*
 * update_process_times() expects us to have done irq_enter().
 * Besides, if we don't timer interrupts ignore the global
 * interrupt lock, which is the WrongThing (tm) to do.
 */
 exit_idle();
 irq_enter();
 local_apic_timer_interrupt();
 irq_exit();

 set_irq_regs(old_regs);
}

2. The top-half (hard interrupt) exit function irq_exit() in kernel/softirq.c:

void irq_exit(void)
{
 account_system_vtime(current);
 trace_hardirq_exit();
 sub_preempt_count(IRQ_EXIT_OFFSET);
 if (!in_interrupt() && local_softirq_pending())
 invoke_softirq();

 rcu_irq_exit();
#ifdef CONFIG_NO_HZ
 /* Make sure that timer wheel updates are propagated */
 if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
 tick_nohz_stop_sched_tick(0);
#endif
 preempt_enable_no_resched();
}

3. The softirq dispatch function invoke_softirq() in kernel/softirq.c:

/*macro if defined, means that the IRQs are guaranteed to be disabled when irq_exit() function is called.
 In such a case, the kernel may skip some instructions (disabling IRQs etc).... and thus call __do_IRQ() instead of do_IRQ.*/

#ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED 
static inline void invoke_softirq(void)
{
 if (!force_irqthreads)
 __do_softirq();
 else
 wakeup_softirqd();
}
#else
static inline void invoke_softirq(void)
{
 if (!force_irqthreads)
 do_softirq();
 else
 wakeup_softirqd();
}
#endif

On x86 this actually calls do_softirq, while the ARM build calls __do_softirq; the __do_softirq path is for architectures that guarantee hard interrupts are already disabled when irq_exit() runs.

Note that this do_softirq is not the one defined in kernel/softirq.c: because the macro __ARCH_HAS_DO_SOFTIRQ is defined on x86, the real do_softirq lives in arch/x86/kernel/irq_32.c:

asmlinkage void do_softirq(void)
{
 unsigned long flags;
 struct thread_info *curctx;
 union irq_ctx *irqctx;
 u32 *isp;

 if (in_interrupt())
 return;

 local_irq_save(flags);

 if (local_softirq_pending()) {
 curctx = current_thread_info();
 irqctx = __this_cpu_read(softirq_ctx);
 irqctx->tinfo.task = curctx->task;
 irqctx->tinfo.previous_esp = current_stack_pointer;

 /* build the stack frame on the softirq stack */
 isp = (u32 *) ((char *)irqctx + sizeof(*irqctx));

 call_on_stack(__do_softirq, isp);
 /*
 * Shouldn't happen, we returned above if in_interrupt():
 */
 WARN_ON_ONCE(softirq_count());
 }

 local_irq_restore(flags);
}

Here is a puzzling point: interrupts are disabled between local_irq_save(flags) and local_irq_restore(flags), so __do_softirq apparently executes with interrupts off. Doesn't that contradict the design intent of running bottom halves with interrupts enabled? The answer lies inside __do_softirq.

In the end it is kernel/softirq.c's __do_softirq that runs:

/*
 * We restart softirq processing MAX_SOFTIRQ_RESTART times,
 * and we fall back to softirqd after that.
 *
 * This number has been established via experimentation.
 * The two things to balance is latency against fairness -
 * we want to handle softirqs as soon as possible, but they
 * should not be able to lock up the box.
 */
#define MAX_SOFTIRQ_RESTART 10

asmlinkage void __do_softirq(void)
{
 struct softirq_action *h;
 __u32 pending;
 int max_restart = MAX_SOFTIRQ_RESTART;
 int cpu;

 pending = local_softirq_pending();
 account_system_vtime(current);

 __local_bh_disable((unsigned long)__builtin_return_address(0),
 SOFTIRQ_OFFSET);
 lockdep_softirq_enter();

 cpu = smp_processor_id();
restart:
 /* Reset the pending bitmask before enabling irqs */
 set_softirq_pending(0);

 local_irq_enable();

 h = softirq_vec;

 do {
 if (pending & 1) {
 unsigned int vec_nr = h - softirq_vec;
 int prev_count = preempt_count();

 kstat_incr_softirqs_this_cpu(vec_nr);

 trace_softirq_entry(vec_nr);
 h->action(h);
 trace_softirq_exit(vec_nr);
 if (unlikely(prev_count != preempt_count())) {
 printk(KERN_ERR "huh, entered softirq %u %s %p"
 "with preempt_count %08x,"
 " exited with %08x?\n", vec_nr,
 softirq_to_name[vec_nr], h->action,
 prev_count, preempt_count());
 preempt_count() = prev_count;
 }

 rcu_bh_qs(cpu);
 }
 h++;
 pending >>= 1;
 } while (pending);

 local_irq_disable();

 pending = local_softirq_pending();
 if (pending && --max_restart)
 goto restart;

 if (pending)
 wakeup_softirqd();

 lockdep_softirq_exit();

 account_system_vtime(current);
 __local_bh_enable(SOFTIRQ_OFFSET);
}

Ha: the execution of the softirq handlers, h->action(h), sits between local_irq_enable() and local_irq_disable(). The earlier puzzle is solved.

Is that the whole softirq flow? Of course not. Besides being triggered directly from irq_exit when the hard interrupt finishes, a softirq can also be booked first and executed later at an opportune (deferred) time. This is where the deferred character of soft interrupts shows.

Occasion 2: the ksoftirqd kernel thread executes softirqs

1. Booking

The booking itself is done by the raise_softirq function:

in kernel/softirq.c

void raise_softirq(unsigned int nr)
{
	unsigned long flags;

	local_irq_save(flags);
	raise_softirq_irqoff(nr);
	local_irq_restore(flags);
}

inline void raise_softirq_irqoff(unsigned int nr)
{
	__raise_softirq_irqoff(nr);

	/*
	 * If we're in an interrupt or softirq, we're done
	 * (this also catches softirq-disabled code). We will
	 * actually run the softirq once we return from
	 * the irq or softirq.
	 *
	 * Otherwise we wake up ksoftirqd to make sure we
	 * schedule the softirq soon.
	 */
	if (!in_interrupt())
		wakeup_softirqd();
}

in include/linux/interrupt.h

static inline void __raise_softirq_irqoff(unsigned int nr)
{
	trace_softirq_raise(nr);
	or_softirq_pending(1UL << nr);
}

Peel back the layers and all it really does is set the bit for the given softirq number in the pending bitmap.

2. The ksoftirqd kernel thread

Once booked, raise_softirq_irqoff wakes ksoftirqd:

in kernel/softirq.c

void wakeup_softirqd(void)
{
	/* Interrupts are disabled: no need to stop preemption */
	struct task_struct *tsk = __get_cpu_var(ksoftirqd);

	if (tsk && tsk->state != TASK_RUNNING)
		wake_up_process(tsk);
}

 static int run_ksoftirqd(void * __bind_cpu)
 {
         set_current_state(TASK_INTERRUPTIBLE);

         while (!kthread_should_stop()) {
                 preempt_disable();
                 if (!local_softirq_pending()) {
                         preempt_enable_no_resched();
                         schedule();
                         preempt_disable();
                 }

                 __set_current_state(TASK_RUNNING);

                 while (local_softirq_pending()) {
                         /* Preempt disable stops cpu going offline.
                            If already offline, we'll be on wrong CPU:
                            don't process */
                         if (cpu_is_offline((long)__bind_cpu))
                                 goto wait_to_die;
                         local_irq_disable();
                         if (local_softirq_pending())
                                 __do_softirq();
                         local_irq_enable();
                         preempt_enable_no_resched();
                         cond_resched();
                         preempt_disable();
                         rcu_note_context_switch((long)__bind_cpu);
                 }
                 preempt_enable();
                 set_current_state(TASK_INTERRUPTIBLE);
         }
         __set_current_state(TASK_RUNNING);
         return 0;

 wait_to_die:
         preempt_enable();
         /* Wait for kthread_stop */
         set_current_state(TASK_INTERRUPTIBLE);
         while (!kthread_should_stop()) {
                 schedule();
                 set_current_state(TASK_INTERRUPTIBLE);
         }
         __set_current_state(TASK_RUNNING);
         return 0;
 }

Occasion 3: softirqs run explicitly from local_bh_enable
void local_bh_enable(void)
 {
         _local_bh_enable_ip((unsigned long)__builtin_return_address(0));
 }
void local_bh_enable_ip(unsigned long ip)
 {
         _local_bh_enable_ip(ip);
 }

static inline void _local_bh_enable_ip(unsigned long ip)
 {
         WARN_ON_ONCE(in_irq() || irqs_disabled());
 #ifdef CONFIG_TRACE_IRQFLAGS
         local_irq_disable();
 #endif
         /*
          * Are softirqs going to be turned on now:
          */
         if (softirq_count() == SOFTIRQ_DISABLE_OFFSET)
                 trace_softirqs_on(ip);
         /*
          * Keep preemption disabled until we are done with
          * softirq processing:
          */
         sub_preempt_count(SOFTIRQ_DISABLE_OFFSET - 1);

         if (unlikely(!in_interrupt() && local_softirq_pending()))
                 do_softirq();

         dec_preempt_count();
 #ifdef CONFIG_TRACE_IRQFLAGS
         local_irq_enable();
 #endif
         preempt_check_resched();
 }

This pattern is used a lot in the protocol-stack code, because the stack tends to generate storms of interrupts, so nudging softirq processing along looks like a good choice. Besides, the softirq vectors NET_TX_SOFTIRQ and NET_RX_SOFTIRQ rank above everything except HI and TIMER, which helps guarantee they are handled promptly.
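
The same pair is also how process-context code keeps the local CPU's bottom halves away from data it shares with a softirq handler. A minimal sketch, with my_lock and my_stats invented for illustration:

static DEFINE_SPINLOCK(my_lock);
static unsigned long my_stats;

static void update_stats(unsigned long n)	/* process context */
{
	local_bh_disable();	/* no bottom halves on this CPU for now */
	spin_lock(&my_lock);	/* still needed against other CPUs */
	my_stats += n;
	spin_unlock(&my_lock);
	local_bh_enable();	/* runs any softirqs that went pending */
}

In real code the combined helpers spin_lock_bh()/spin_unlock_bh() are the idiomatic way to get the same effect.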

##############################################################

TASKLET

##############################################################

When we walked through the softirq vectors above we met TASKLET_SOFTIRQ; as the name suggests, it is the softirq that serves the tasklet mechanism. Put bluntly, tasklets are built on top of softirqs:

HI_SOFTIRQ also serves tasklets (the high-priority variant); it just happens to be the highest-priority softirq of all.

The main differences between a tasklet and a raw softirq are:

1. A softirq lets the same handler run on different CPUs at the same time (truly simultaneously, since an SMP system has two or more CPU cores), so the handler must be reentrant. Kernel and driver developers therefore have to take care of mutual exclusion inside the handler, i.e. take locks; and since sleeping is forbidden here, only spinlocks will do, never sleeping locks (a minimal sketch follows this list).

2. The tasklet mechanism guarantees that a given tasklet runs on only one CPU at a time, while different tasklets may run on different CPUs, so developers can put the mutual-exclusion problem out of their minds. Also, tasklet runs do not accumulate: if a tasklet is triggered three times before it gets its turn, the tasklet handler actually runs only once (it is not reentrant). And a tasklet runs on the CPU that scheduled it, which benefits the CPU cache.
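
Here is the minimal sketch of point 1. The lock and the counter are illustrative, not from any real driver, but the rule is real: data shared by a softirq handler that may run concurrently on several CPUs must be protected by a spinlock, never a sleeping lock:

static DEFINE_SPINLOCK(demo_lock);
static unsigned long demo_hits;

/* the same handler may be running on other CPUs at this very moment */
static void demo_shared_action(struct softirq_action *a)
{
	spin_lock(&demo_lock);	/* softirq context: spinlock only */
	demo_hits++;		/* shared state touched under the lock */
	spin_unlock(&demo_lock);
}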

The tasklet structure:

in include/linux/interrupt.h:

struct tasklet_struct
{
	struct tasklet_struct *next;
	unsigned long state;
	atomic_t count;
	void (*func)(unsigned long);
	unsigned long data;
};


The members mean the following:
(1) next points to the next tasklet; it is used to link multiple tasklets into a singly-linked list. For this purpose the kernel also defines a dedicated tasklet_head structure in softirq.c to represent a per-CPU tasklet queue:
struct tasklet_head { struct tasklet_struct *head; struct tasklet_struct **tail; };
(2) state holds the tasklet's current state. It is a 32-bit unsigned integer of which only bit 0 and bit 1 are used so far: bit 0 set means the tasklet has already been scheduled for execution, while bit 1 exists specifically for SMP and, when set, means the tasklet is currently running on some CPU, preventing several CPUs from executing the same tasklet at once. The kernel predefines the meaning of these two bits:

enum
{
	TASKLET_STATE_SCHED,	/* Tasklet is scheduled for execution */
	TASKLET_STATE_RUN	/* Tasklet is running (SMP only) */
};

(3) count is an atomic reference count on the tasklet. Its purpose is to allow a tasklet that is already queued to be enabled or disabled: the tasklet body may run only while count is 0, which is when the tasklet is enabled; a non-zero count means the tasklet is disabled. The dispatch code therefore has to check that count is 0 before executing the tasklet body. (The tasklet_disable/tasklet_enable pair in the sketch below is what moves this counter.)
(4) func is a function pointer to the tasklet body, and data is the argument passed to func.
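
Putting the pieces together, here is a minimal usage sketch. my_tasklet and my_tasklet_func are hypothetical names; DECLARE_TASKLET, tasklet_schedule, tasklet_disable and tasklet_enable are the real API from include/linux/interrupt.h:

/* runs in softirq context: must not sleep */
static void my_tasklet_func(unsigned long data)
{
	/* deferred work goes here */
}

/* defines the tasklet enabled (count = 0), with data = 0 */
static DECLARE_TASKLET(my_tasklet, my_tasklet_func, 0);

/* from a hardirq handler: sets TASKLET_STATE_SCHED and raises
 * TASKLET_SOFTIRQ; repeated calls before it runs collapse into one run */
tasklet_schedule(&my_tasklet);

/* bump / drop count to disable and re-enable it while queued */
tasklet_disable(&my_tasklet);
tasklet_enable(&my_tasklet);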


 

The tasklet dispatch function:

A tasklet can be seen as "step 2" of a softirq, so the softirq handler registered for TASKLET_SOFTIRQ is precisely the function that dispatches and runs tasklets:
in kernel/softirq.c:
static void tasklet_action(struct softirq_action *a)
{
	struct tasklet_struct *list;

	local_irq_disable();
	list = __get_cpu_var(tasklet_vec).head;
	__get_cpu_var(tasklet_vec).head = NULL;
	__get_cpu_var(tasklet_vec).tail = &__get_cpu_var(tasklet_vec).head;
	local_irq_enable();

	while (list) {
		struct tasklet_struct *t = list;

		list = list->next;

		if (tasklet_trylock(t)) {
			if (!atomic_read(&t->count)) {
				if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
					BUG();
				t->func(t->data);
				tasklet_unlock(t);
				continue;
			}
			tasklet_unlock(t);
		}

		local_irq_disable();
		t->next = NULL;
		*__get_cpu_var(tasklet_vec).tail = t;
		__get_cpu_var(tasklet_vec).tail = &(t->next);
		__raise_softirq_irqoff(TASKLET_SOFTIRQ);
		local_irq_enable();
	}
}

tasklet_trylock() tries to lock the tasklet about to run (pointed to by t). If the lock succeeds (no other CPU is currently running this tasklet), atomic_read() then checks the count member; count == 0 means the tasklet is allowed to run. If tasklet_trylock() fails, or the tasklet may not run because its count is non-zero, the tasklet has to be put back on the current CPU's tasklet queue, to be handled the next time this CPU services the TASKLET_SOFTIRQ vector. That takes four steps: (1) disable local interrupts so the following operations are atomic; (2) append the tasklet to the tail of the current CPU's tasklet queue; (3) call __raise_softirq_irqoff() to raise TASKLET_SOFTIRQ on this CPU once more; (4) re-enable interrupts.

##################################################################

Workqueue

#################################################################

The workqueue is a bottom-half mechanism added in the Linux 2.6 kernel. Its biggest difference from the other bottom-half mechanisms is that it defers work to a kernel thread, the worker thread. A kernel thread runs only in kernel space and has no user address space of its own; like an ordinary process it can be scheduled, and it can be preempted. Work queued this way therefore always executes in process context. The default worker threads are named events/n, where n is the CPU number. If a lot of processing has to be done in a worker thread, you can create your own dedicated worker threads. Code run through a workqueue thus enjoys all the advantages of process context; most importantly, workqueue code is allowed to reschedule and even to sleep.
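
A minimal sketch of this API; my_work and my_work_func are hypothetical names, while DECLARE_WORK and schedule_work are the real interface from include/linux/workqueue.h:

#include <linux/workqueue.h>

/* runs in process context inside a worker thread: it may sleep,
 * take mutexes, perform blocking I/O, and so on */
static void my_work_func(struct work_struct *work)
{
	/* deferred work goes here */
}

static DECLARE_WORK(my_work, my_work_func);

/* from an interrupt handler (or anywhere): hand the work item to
 * the kernel's default worker thread for this CPU */
schedule_work(&my_work);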

Because softirqs and tasklets execute serially on any given CPU, they are not well suited to real-time multimedia tasks or other demanding workloads. Some systems therefore replace the softirq mechanism with workqueues for the deferred processing that follows a network receive interrupt: a worker thread with the highest real-time priority handles the real-time multimedia or otherwise demanding tasks, while a worker thread with the next-highest priority handles ordinary, non-real-time traffic. The Linux 2.6 scheduler, with kernel preemption and O(1) scheduling, can meet soft real-time requirements, so it can almost always guarantee that the worker thread serving the demanding tasks runs first, and those tasks get handled preferentially.

Since workqueues rely on kernel threads, they may incur context switches (whenever the task sleeps or blocks and has to be rescheduled), so their overhead is comparatively high.

Since kernel threads fall outside the scope of kernel interrupt handling, I won't go into the implementation details here; I'll cover them when I summarize kernel threads later.

What exactly is IOWAIT?

iowait shows up in vmstat, iostat, and top alike; what is it, really? After reading the article below everything seems to fall into place. It's long and I haven't had time to translate it, so here it is as-is.
Source: http://blog.pregos.info/wp-content/uploads/2010/09/iowait.txt

What exactly is "iowait"?

To summarize it in one sentence, 'iowait' is the percentage
of time the CPU is idle AND there is at least one I/O
in progress.

Each CPU can be in one of four states: user, sys, idle, iowait.
Performance tools such as vmstat, iostat, sar, etc. print
out these four states as a percentage.  The sar tool can
print out the states on a per CPU basis (-P flag) but most
other tools print out the average values across all the CPUs.
Since these are percentage values, the four state values
should add up to 100%.

The tools print out the statistics using counters that the
kernel updates periodically. On AIX, these CPU state counters
are incremented at every clock interrupt, which occurs
at 10 millisecond intervals.
When the clock interrupt occurs on a CPU, the kernel
checks the CPU to see if it is idle or not. If it's not
idle, the kernel then determines if the instruction being
executed at that point is in user space or in kernel space.
If user, then it increments the 'user' counter by one. If
the instruction is in kernel space, then the 'sys' counter
is incremented by one.

If the CPU is idle, the kernel then determines if there is at least one I/O currently in progress to either a local disk or a remotely mounted disk (NFS) which had been initiated from that CPU. If there is, then the 'iowait' counter is incremented by one. If there is no I/O in progress that was
initiated from that CPU, the 'idle' counter is incremented
by one.

When a performance tool such as vmstat is invoked, it reads
the current values of these four counters. Then it sleeps
for the number of seconds the user specified as the interval
time and then reads the counters again. Then vmstat will
subtract the previous values from the current values to
get the delta value for this sampling period. Since vmstat
knows that the counters are incremented at each clock
tick (10ms), it then divides the delta value of
each counter by the number of clock ticks in the sampling
period. For example, if you run 'vmstat 2', this makes
vmstat sample the counters every 2 seconds. Since the
clock ticks at 10ms intervals, then there are 100 ticks
per second or 200 ticks per vmstat interval (if the interval
value is 2 seconds).   The delta values of each counter
are divided by the total ticks in the interval and
multiplied by 100 to get the percentage value in that
interval.
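
In C, the computation just described boils down to something like this (a sketch with hypothetical names, not vmstat's actual source):

/* delta of one state's tick counter over the sampling interval,
 * divided by the total ticks in the interval, times 100 */
static double state_pct(unsigned long prev, unsigned long now,
			unsigned long ticks_in_interval)
{
	return 100.0 * (double)(now - prev) / (double)ticks_in_interval;
}

/* 'vmstat 2' with 10 ms ticks gives ticks_in_interval == 200, so a
 * delta of 134 iowait ticks over the interval prints as 67% wa */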

iowait can in some cases indicate a factor limiting transaction
throughput, whereas in other cases iowait may be completely meaningless.
Some examples here will help to explain this. The first
example is one where high iowait is a direct cause
of a performance issue.

Example 1:
Let's say that a program needs to perform transactions on behalf of
a batch job. For each transaction, the program will perform some
computations which takes 10 milliseconds and then does a synchronous
write of the results to disk. Since the file it is writing to was
opened synchronously, the write does not return until the I/O has
made it all the way to the disk. Let's say the disk subsystem does
not have a cache and that each physical write I/O takes 20ms.
This means that the program completes a transaction every 30ms.
Over a period of 1 second (1000ms), the program can do 33
transactions (33 tps).  If this program is the only one running
on a 1-CPU system, then the CPU would be busy one third of the
time (10 ms of computation out of each 30 ms transaction) and
waiting on I/O the rest of the time - so about 67% iowait
and 33% CPU busy.

If the I/O subsystem is improved (let's say a disk cache is
added) so that a write I/O takes only 1ms, then it takes 11ms
to complete a transaction, and the program can now do around
90-91 transactions a second. Here the iowait time would be
around 8%. Notice that the lower iowait directly improves the
throughput of the program (a higher tps).

Example 2:

Let's say that there is one program running on the system - let's assume
that this is the 'dd' program, and it is reading from the disk 4KB at
a time. Let's say that the subroutine in 'dd' is called main() and it
invokes read() to do a read. Both main() and read() are user space
subroutines. read() is a libc.a subroutine which will then invoke
the kread() system call at which point it enters kernel space.
kread() will then initiate a physical I/O to the device and the 'dd'
program is then put to sleep until the physical I/O completes.
The time to execute the code in main, read, and kread is very small -
probably around 50 microseconds at most. The time it takes for
the disk to complete the I/O request will probably be around 2-20
milliseconds depending on how far the disk arm had to seek. This
means that when the clock interrupt occurs, the chances are that
the 'dd' program is asleep and that the I/O is in progress. Therefore,
the 'iowait' counter is incremented. If the I/O completes in
2 milliseconds, then the 'dd' program runs again to do another read.
But since 50 microseconds is so small compared to 2ms (2000 microseconds), the chances are that when the clock interrupt occurs, the CPU will again be idle with an I/O in progress. So again, 'iowait' is incremented.  If 'sar -P ' is run to show the CPU
utilization for this CPU, it will most likely show 97-98% iowait.
If each I/O takes 20ms, then the iowait would be 99-100%.
Even though the I/O wait is extremely high in either case,
the throughput is 10 times better in one case.

Example 3:

Let's say that there are two programs running on a CPU. One is a 'dd'
program reading from the disk. The other is a program that does no
I/O but is spending 100% of its time doing computational work.
Now assume that there is a problem with the I/O subsystem and that
physical I/Os are taking over a second to complete. Whenever the
'dd' program is asleep while waiting for its I/Os to complete,
the other program is able to run on that CPU. When the clock
interrupt occurs, there will always be a program running in
either user mode or system mode. Therefore, the %idle and %iowait
values will be 0. Even though iowait is now 0, that does not mean there is no I/O problem, because there obviously is one
if physical I/Os are taking over a second to complete.

Example 4:

Let's say that there is a 4-CPU system where there are 6 programs
running. Let's assume that four of the programs spend 70% of their
time waiting on physical read I/Os and 30% actually using CPU time.
Since these four programs do have to enter kernel space to execute the
kread system calls, each spends a percentage of its time in
the kernel; let's assume that 25% of the time is in user mode,
and 5% of the time in kernel mode.
Let's also assume that the other two programs spend 100% of their
time in user code doing computations and no I/O so that two CPUs
will always be 100% busy. Since the other four programs are busy
only 30% of the time, they can share the two CPUs that are not busy.

If we run 'sar -P ALL 1 10' to run 'sar' at 1-second intervals
for 10 intervals, then we'd expect to see this for each interval:

         cpu    %usr    %sys    %wio   %idle
          0       50      10      40       0
          1       50      10      40       0
          2      100       0       0       0
          3      100       0       0       0
          -       75       5      20       0

Notice that the average CPU utilization will be 75% user, 5% sys,
and 20% iowait. The values one sees with 'vmstat' or 'iostat' or
most tools are the average across all CPUs.

Now let's say we take this exact same workload (same 6 programs
with same behavior) to another machine that has 6 CPUs (same
CPU speeds and the same I/O subsystem).  Now each program can be
running on its own CPU. Therefore, the CPU usage breakdown
would be as follows:

         cpu    %usr    %sys    %wio   %idle
          0       25       5      70       0
          1       25       5      70       0
          2       25       5      70       0
          3       25       5      70       0
          4      100       0       0       0
          5      100       0       0       0
          -       50       3      47       0

So now the average CPU utilization will be 50% user, 3% sys,
and 47% iowait.  Notice that the same workload on another
machine has more than double the iowait value.

Conclusion:

The iowait statistic may or may not be a useful indicator of
I/O performance - but it does tell us that the system can
handle more computational work. A CPU being in the iowait state does not mean that other threads can't run on that CPU; that is, iowait is simply a form of idle time.



Another analysis of iowait, in Chinese:
http://www.yybean.com/iowait%E7%9A%84%E6%88%90%E5%9B%A0%E3%80%81%E5%AF%B9%E7%B3%BB%E7%BB%9F%E5%BD%B1%E5%93%8D%E5%8F%8A%E5%AF%B9%E7%AD%96

About #ifndef / #define

Almost every header file is written this way; the point is to guard against being included more than once. For example:
in include/linux/times.h
#ifndef _LINUX_TIMES_H
#define _LINUX_TIMES_H

#include <linux/types.h>

struct tms {
 __kernel_clock_t tms_utime;
 __kernel_clock_t tms_stime;
 __kernel_clock_t tms_cutime;
 __kernel_clock_t tms_cstime;
};

#endif

Here,
#ifndef _LINUX_TIMES_H
#define _LINUX_TIMES_H
is really just a marker: the first inclusion defines the macro, so on any later inclusion the #ifndef test fails and everything up to the matching #endif is skipped.
That is why almost the entire content of every header file sits between #ifndef and #endif.

Copying a file can crash the kernel?!

Copying a certain 1.6 GB file to a USB-to-SCSI device (a PATA disk behind a bridge) formatted FAT32 kills the kernel at around the 630 MB mark, yet copying a different file completes without a hitch. I tried other filesystems and got the same result.

Jul 17 22:59:52 dekernel kernel: [11660.092862] usb 2-6: USB disconnect, device number 26
Jul 17 22:59:52 dekernel kernel: [11660.096871] sd 31:0:0:0: [sdg] Unhandled error code
Jul 17 22:59:52 dekernel kernel: [11660.096874] sd 31:0:0:0: [sdg]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul 17 22:59:52 dekernel kernel: [11660.096877] sd 31:0:0:0: [sdg] CDB: Write(10): 2a 00 00 73 6c d8 00 00 f0 00
Jul 17 22:59:52 dekernel kernel: [11660.096885] end_request: I/O error, dev sdg, sector 7564504
Jul 17 22:59:52 dekernel kernel: [11660.098355] sd 31:0:0:0: [sdg] Unhandled error code
Jul 17 22:59:52 dekernel kernel: [11660.098358] sd 31:0:0:0: [sdg]  Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul 17 22:59:52 dekernel kernel: [11660.098360] sd 31:0:0:0: [sdg] CDB: Write(10): 2a 00 00 73 6d c8 00 00 f0 00
Jul 17 22:59:52 dekernel kernel: [11660.098367] end_request: I/O error, dev sdg, sector 7564744
Jul 17 22:59:52 dekernel kernel: [11660.124608] FAT: FAT read failed (blocknr 1930)
Jul 17 22:59:52 dekernel kernel: [11660.124835] FAT: FAT read failed (blocknr 1656)
Jul 17 22:59:52 dekernel kernel: [11660.124854] FAT: FAT read failed (blocknr 1930)
Jul 17 22:59:52 dekernel kernel: [11660.124871] FAT: FAT read failed (blocknr 1602)
Jul 17 22:59:52 dekernel kernel: [11660.154598] BUG: unable to handle kernel paging request at 36391000
Jul 17 22:59:52 dekernel kernel: [11660.154642] IP: [<c042271f>] __percpu_counter_add+0x1f/0xd0
Jul 17 22:59:52 dekernel kernel: [11660.154678] *pdpt = 0000000021a7c001 *pde = 0000000000000000
Jul 17 22:59:52 dekernel kernel: [11660.154714] Oops: 0000 [#1] PREEMPT SMP
Jul 17 22:59:52 dekernel kernel: [11660.154743] last sysfs file: /sys/devices/pci0000:00/0000:00:1d.7/class
Jul 17 22:59:52 dekernel kernel: [11660.154779] Modules linked in: nls_iso8859_1 nls_cp437 vfat fat tun af_packet snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd vboxnetadp vboxnetflt vboxdrv mperf binfmt_misc fuse ext4 jbd2 crc16 loop snd_hda_codec_analog arc4 ecb rtl8187 iwlagn snd_hda_intel snd_hda_codec mac80211 snd_hwdep snd_pcm cfg80211 snd_timer firewire_ohci snd firewire_core sr_mod eeprom_93cx6 skge iTCO_wdt sg 8139too cdrom pcspkr i2c_i801 floppy 8139cp sky2 soundcore asus_atk0110 snd_page_alloc rfkill iTCO_vendor_support crc_itu_t button reiserfs radeon ttm drm_kms_helper drm i2c_algo_bit dm_snapshot dm_mod fan thermal processor thermal_sys ata_generic pata_jmicron [last unloaded: speedstep_lib]
Jul 17 22:59:52 dekernel kernel: [11660.155003]
Jul 17 22:59:52 dekernel kernel: [11660.155003] Pid: 17, comm: bdi-default Not tainted 2.6.39.11-2-desktop #1 System manufacturer System Product Name/P5B-Deluxe
Jul 17 22:59:52 dekernel kernel: [11660.155003] EIP: 0060:[<c042271f>] EFLAGS: 00010002 CPU: 0
Jul 17 22:59:52 dekernel kernel: [11660.155003] EIP is at __percpu_counter_add+0x1f/0xd0
Jul 17 22:59:52 dekernel kernel: [11660.155003] EAX: 00000000 EBX: f661f374 ECX: ffffffff EDX: ffffffff
Jul 17 22:59:52 dekernel kernel: [11660.155003] ESI: f2c7be40 EDI: 00000000 EBP: f3531d2c ESP: f3531d14
Jul 17 22:59:52 dekernel kernel: [11660.155003]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Jul 17 22:59:52 dekernel kernel: [11660.155003] Process bdi-default (pid: 17, ti=f3530000 task=f34db2c0 task.ti=f3530000)
Jul 17 22:59:52 dekernel kernel: [11660.155003] Stack:
Jul 17 22:59:52 dekernel kernel: [11660.155003]  ffffffec c0a766f4 00000000 00000292 f2c7be40 00000000 f3531d40 c02dd5d0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  00000018 f4b5b5a0 00000000 f3531dd0 c02dd861 00000000 0000000e 00000001
Jul 17 22:59:52 dekernel kernel: [11660.155003]  00001747 00000001 00000000 f3531db4 c0355ad0 00000a83 0002914a 00000002
Jul 17 22:59:52 dekernel kernel: [11660.155003] Call Trace:
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c02dd5d0>] clear_page_dirty_for_io+0xb0/0xe0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c02dd861>] write_cache_pages+0x141/0x370
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c035515a>] mpage_writepages+0x5a/0xa0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<f84d75fd>] fat_writepages+0xd/0x10 [fat]
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c02de997>] do_writepages+0x17/0x30
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c0346bd9>] writeback_single_inode+0xc9/0x200
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c0346f42>] writeback_sb_inodes+0xb2/0x180
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c0347885>] wb_writeback+0x155/0x3e0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c0347ba3>] wb_do_writeback+0x93/0x1f0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c02ee0e9>] bdi_forker_thread+0x89/0x3d0
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c02655d4>] kthread+0x74/0x80
Jul 17 22:59:52 dekernel kernel: [11660.155003]  [<c069a8e6>] kernel_thread_helper+0x6/0xd
Jul 17 22:59:52 dekernel kernel: [11660.155003] Code: c3 8d 74 26 00 8d bc 27 00 00 00 00 55 89 e5 83 ec 18 89 5d f4 89 c3 89 e0 25 00 e0 ff ff 89 75 f8 89 7d fc 83 40 14 01 8b 43 14
Jul 17 22:59:52 dekernel kernel: [11660.155003] EIP: [<c042271f>] __percpu_counter_add+0x1f/0xd0 SS:ESP 0068:f3531d14
Jul 17 22:59:52 dekernel kernel: [11660.155003] CR2: 0000000036391000
Jul 17 22:59:52 dekernel kernel: [11660.168399] ---[ end trace f0a2c1711cf79bb9 ]---

Yes, this disk does have bad sectors, but it is odd that bad sectors can bring the kernel down, and odder still that only copying this particular file triggers the problem.

2011/07/22 update:

It looks like the drive enclosure is to blame. The bridge controller is from a domestic vendor called Super Top, and it apparently has a bug: something in the transferred data triggers the controller bug, which then crashes the kernel ~ ??!!

Bus 002 Device 009: ID 14cd:6600 Super Top USB 2.0 IDE DEVICE
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  idVendor           0x14cd Super Top
  idProduct          0x6600 USB 2.0 IDE DEVICE
  bcdDevice            2.01
  iManufacturer           1 Super Top
  iProduct                3 USB 2.0  IDE DEVICE
  iSerial                 2 ??????????
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           32
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          0
    bmAttributes         0xc0
      Self Powered
    MaxPower                2mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     80 Bulk (Zip)
      iInterface              0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
Device Qualifier (for other device speed):
  bLength                10
  bDescriptorType         6
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  bNumConfigurations      1
Device Status:     0x0001
  Self Powered

Is this what they call a no-name part? The serial number is a string of question marks.

#################################################################

Now compare JMicron's JM20337

Bus 002 Device 010: ID 152d:2338 JMicron Technology Corp. / JMicron USA Technology Corp. JM20337 Hi-Speed USB to SATA & PATA Combo Bridge
Device Descriptor:
  bLength                18
  bDescriptorType         1
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  idVendor           0x152d JMicron Technology Corp. / JMicron USA Technology Corp.
  idProduct          0x2338 JM20337 Hi-Speed USB to SATA & PATA Combo Bridge
  bcdDevice            1.00
  iManufacturer           1 JMicron
  iProduct                2 USB to ATA/ATAPI bridge
  iSerial                 5 8020A4C30450
  bNumConfigurations      1
  Configuration Descriptor:
    bLength                 9
    bDescriptorType         2
    wTotalLength           32
    bNumInterfaces          1
    bConfigurationValue     1
    iConfiguration          4 USB Mass Storage
    bmAttributes         0xc0
      Self Powered
    MaxPower                2mA
    Interface Descriptor:
      bLength                 9
      bDescriptorType         4
      bInterfaceNumber        0
      bAlternateSetting       0
      bNumEndpoints           2
      bInterfaceClass         8 Mass Storage
      bInterfaceSubClass      6 SCSI
      bInterfaceProtocol     80 Bulk (Zip)
      iInterface              6 MSC Bulk-Only Transfer
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x81  EP 1 IN
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
      Endpoint Descriptor:
        bLength                 7
        bDescriptorType         5
        bEndpointAddress     0x02  EP 2 OUT
        bmAttributes            2
          Transfer Type            Bulk
          Synch Type               None
          Usage Type               Data
        wMaxPacketSize     0x0200  1x 512 bytes
        bInterval               0
Device Qualifier (for other device speed):
  bLength                10
  bDescriptorType         6
  bcdUSB               2.00
  bDeviceClass            0 (Defined at Interface level)
  bDeviceSubClass         0
  bDeviceProtocol         0
  bMaxPacketSize0        64
  bNumConfigurations      1
Device Status:     0x0001
  Self Powered