| riel | ok ... | 
| riel | it's good to be back at Umeet | 
| riel | this is now the 4th time I've participated at Umeet | 
| riel | and it's always been fun | 
| riel | today I'll talk about a few cool patches (and projects) that are almost ready to go into the 2.6 kernel | 
| riel | I don't have any text prepared and will just be talking "live" | 
| riel | so don't worry about interrupting me | 
| riel | you can ask questions at ANY time, in the #qc channel | 
| riel | there is also going to be a translation into spanish in #redes | 
| riel | and into dutch in #taee | 
| riel | I guess I should start by saying that the 2.6.0 kernel looks way better than the 2.0.0, 2.2.0 or 2.4.0 kernels | 
| riel | so you should all try the 2.6 kernel and report bugs to linux-kernel@vger.kernel.org ;)) | 
| riel | I'll be talking a bit about the following projects | 
| riel | - execshield | 
| riel | - 4/4 split | 
| riel | - CKRM (class-based kernel resource manager) | 
| riel | - memory hotplug | 
| riel | ... | 
| riel | the first patch I am going to talk about is security related | 
| riel | as you probably already know, the 2.6 kernel has some security improvements to help limit the damage that can be done through a security hole | 
| riel | for example, with selinux you can limit the damage that is done when sendmail is exploited AGAIN | 
| riel | or bind ;) | 
| riel | with exec-shield you can do another step towards making the system secure | 
| riel | basically the layout of your process memory gets changed a little bit | 
| riel | and data and the stack are by default not executable any more | 
| riel | this makes it a lot harder for a normal buffer overflow to turn into an exploit | 
| riel | on x86 CPUs the page tables do not have an executable permission bit, so exec-shield needs to do really ugly segmentation tricks | 
| riel | luckily most other CPUs Linux runs on, including AMD64, have executable bits in the page tables, so the segmentation tricks will no longer be needed in the future | 
| riel | <hans> riel: will this process not bring down the speed of the os? | 
| riel | hans: yes, absolutely | 
| riel | segmentation makes the program run a little bit slower | 
| riel | however, the increased security will be worth the speed difference for some people | 
| riel | at this point I should also point out that exec-shield is NOT the most luxurious memory management change for security | 
| riel | PaX is probably a lot more flexible | 
| riel | but at the cost of more overhead than exec-shield | 
| riel | I suspect that exec-shield will be a reasonable compromise between extra security and performance for most people | 
| riel | it goes together with some tricks like randomising the start address of the stack, the heap and the executables and libraries | 
| riel | so using buffer overflows to jump to a libc function becomes very improbable | 
| riel | the attacker will just crash sendmail, instead of getting a root shell | 
| riel | (or more likely, the overflow will not be an attack at all) | 
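The effect of address randomisation on a "jump to a libc function" attack can be put into rough numbers. The sketch below is only an illustration: `entropy_bits` (how many bits of the base address are randomised) is a hypothetical parameter, not a figure from the talk or from exec-shield, and it assumes every guess is independent because a wrong guess crashes the process and it restarts with a fresh layout.

```python
def guess_success_probability(entropy_bits: int, attempts: int = 1) -> float:
    """Chance that at least one of `attempts` independent guesses hits the
    right base address, when that address is drawn uniformly from
    2**entropy_bits possible positions (ASSUMPTION: uniform, independent)."""
    slots = 2 ** entropy_bits
    miss_once = 1.0 - 1.0 / slots
    return 1.0 - miss_once ** attempts


if __name__ == "__main__":
    # Illustrative entropy values only; real layouts differ per kernel and arch.
    for bits in (0, 8, 16):
        print(bits, guess_success_probability(bits))
```

With no randomisation (0 bits) a single guess always succeeds; every extra bit of randomisation halves the chance per attempt, and each failed attempt shows up as a crash.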
| riel | <xtingray> riel: what is the cheapest method, in terms of resources, that you know? | 
| riel | xtingray: well, the best would be using an AMD64 chip ;)) | 
| riel | that hardware has the executable bit for page tables built right in | 
| riel | and there is no performance penalty for a non-exec stack or heap ;) | 
| riel | <hans> riel: what makes this better than for example a chrooted environment for applications? | 
| riel | ok, exec-shield is not "better" ... I would use the two together | 
| riel | you can (and probably should) run named in a chroot environment | 
| riel | but once somebody breaks in, they can still use your computer to send network packets to somewhere else (maybe to help a DDoS?) | 
| riel | so reducing the chance that a buffer overflow can actually be exploited is probably a good thing to do | 
| riel | oh, you can get exec-shield as part of Arjan's 2.6 kernel RPM | 
| riel | at http://people.redhat.com/~arjanv/ | 
| riel | I have not seen it in any other kernel patch sets yet | 
| riel | <amplifiel> other distros like adamantix use rsbac and pax for security, what about these patches? | 
| riel | ok | 
| riel | rsbac is a bit like selinux | 
| riel | it helps a lot to reduce the damage after a program is broken into | 
| riel | but it does not help prevent break-ins into one program | 
| riel | PaX helps prevent such break-ins, but at a much higher cost than exec-shield | 
| riel | if you are really paranoid, you will probably prefer PaX over exec-shield | 
| riel | but personally I suspect that the performance impact of PaX (in particular the extra space use, meaning your programs have less address space) will make it too "expensive" for most people | 
| riel | are there any other questions on exec-shield ? | 
| riel | (otherwise I'll move on to the 4/4 split) | 
| riel | ... | 
| riel | ok, 4/4 split ;) | 
| riel | I'll now explain what is probably the biggest problem Linux has on 32 bit x86 systems | 
| riel | the problem is that x86 can have up to 64GB of physical memory, but only 4GB virtual memory | 
| riel | and the classical Linux virtual memory layout means that the kernel only has 1GB of space! | 
| riel | that means, 1GB of space to manage 64GB of memory | 
| riel | that is simply not enough space if you run the kind of programs anybody with a 64GB server runs | 
| riel | to make a long story short, with 1GB kernel space, a system with more than about 24GB RAM is nearly useless | 
| riel | because you do not have the kernel memory to run the programs people with a big system run | 
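A back-of-the-envelope calculation shows one reason why 1GB of kernel space gets tight. The kernel keeps a small descriptor for every physical page, and on 32 bit x86 that array has to live in kernel space. The per-page size used below (32 bytes) is an assumed round number for illustration, not a value from the talk; the real size depends on kernel version and configuration.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2
PAGE_SIZE = 4096           # standard x86 page size
PER_PAGE_BYTES = 32        # ASSUMPTION: illustrative size of one page descriptor


def page_bookkeeping_bytes(ram_bytes: int, per_page: int = PER_PAGE_BYTES) -> int:
    """Bytes of kernel space needed just to describe every physical page."""
    return (ram_bytes // PAGE_SIZE) * per_page


def fraction_of_kernel_space(ram_bytes: int, kernel_space: int = 1 * GIB) -> float:
    """Share of the kernel's virtual space eaten by the page descriptors."""
    return page_bookkeeping_bytes(ram_bytes) / kernel_space


if __name__ == "__main__":
    for ram_gib in (16, 32, 64):
        used = page_bookkeeping_bytes(ram_gib * GIB)
        print(f"{ram_gib} GiB RAM: {used // MIB} MiB of page descriptors, "
              f"{fraction_of_kernel_space(ram_gib * GIB):.0%} of a 1 GiB kernel space")
```

Under that assumption, 64GB of RAM needs 512MB of page descriptors, which is half of a 1GB kernel space before any other kernel data is counted. With a 4GB kernel space the same descriptors take one eighth.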
| riel | in 2.6, and later 2.4 kernels, the page tables were moved to high memory | 
| riel | that is, they are stored outside of the 1GB of kernel space | 
| riel | that increased the limit from 16GB to 24 or 32GB | 
| riel | but still, nowhere near the 64GB that x86 systems can use | 
| riel | of course, the real solution is for the people with really big servers to use a 64 bit CPU | 
| riel | so the kernel has all the space it needs | 
| riel | but noooo, they want a cheap server ;(( | 
| riel | so they buy x86 | 
| riel | of course, the software people are always the ones left with the problem ;) | 
| riel | the simplest thing we can do is increase the kernel space to 4GB | 
| riel | but, there is only 4GB total available in Linux, divided between userspace and kernel space | 
| riel | so we need to change that | 
| riel | Ingo Molnar made a patch that does something pretty ugly, that just happens to work well and needs few changes in the rest of the kernel code | 
| riel | you know that every process has its own memory space | 
| riel | with Ingo's 4/4 split patch, the _kernel_ also has its own 4GB memory space | 
| riel | and every time you make a system call or an interrupt happens, the system does a memory context switch | 
| riel | into the 4GB large kernel memory space | 
| riel | this way the kernel has enough memory to manage 64GB of physical memory and the programs running in it | 
| riel | however, it does come at quite a cost | 
| riel | it commonly costs 10% performance | 
| riel | because the CPU needs to switch memory address spaces all the time | 
| riel | on some benchmarks the cost is as high as 30% ... | 
| riel | also, this is the last big change that can be done on 32 bit systems | 
| riel | if Intel ever comes out with a 32 bit chip that can address more than 64GB of physical memory, there is no next trick we can use | 
| riel | that is why I think that the only real solution is to use a 64 bit chip | 
| riel | if you need lots of memory | 
| riel | <jamesm> riel: how long do you think people will keep using ia32 for large systems? | 
| riel | I think they will keep using ia32 until Intel has a cheap 64 bit CPU | 
| riel | or until they need more than 128GB of memory | 
| riel | I am afraid that IA64 will never really become cheap | 
| riel | because it is designed as a very high-end chip | 
| riel | however, with AMD marketing their cheap 64 bit chip, I think Intel will have to come up with something | 
| riel | I really hope they do ... ;) | 
| riel | any other questions about the 4/4 split, or memory management issues ? | 
| riel | ok, I'll hold a 1 minute break to give the translators a chance to catch up | 
| riel | then I'll continue with CKRM, the class-based kernel resource manager | 
| riel | ... | 
| riel | CKRM, class-based kernel resource manager | 
| riel | this is the kind of project I have been dreaming about since the 2.0 kernel ;) | 
| riel | and some small aspects of it are in the kernel | 
| riel | basically, CKRM consists of two parts: | 
| riel | 1) a classifier, to group tasks into resource classes based on | 
| riel |   - pid | 
| riel |   - gid | 
| riel |   - uid | 
| riel |   - name | 
| riel |   - resource class id | 
| riel |   - ... | 
| riel | 2) resource control modules, that plug into the CKRM core and | 
| riel |   - divide the CPU fairly between resource classes | 
| riel |   - enforce memory limits between resource classes | 
| riel |   - ... | 
| riel | basically, with CKRM you will be able to do things like: | 
| riel | "I want sendmail and all processes started by sendmail to consume no more than 10% of memory or 20% of the CPU" | 
| riel | so no matter how overloaded your mail queue is, your system as a whole will not be overloaded | 
| riel | or at a university, you could specify  "the students get between 10% and 50% of memory, the staff get between 30% and 80% of memory, the system administrator gets as much as he wants" | 
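The classifier idea can be sketched in a few lines. This is not CKRM code and does not use its interface: every name below (`Task`, `ResourceClass`, `Rule`, `classify`) and the group ids are hypothetical, chosen only to mirror the sendmail and university examples. The sketch also ignores the "all processes started by sendmail" part, which would need class inheritance across fork.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    pid: int
    uid: int
    gid: int
    name: str


@dataclass
class ResourceClass:
    name: str
    mem_min_pct: int   # guaranteed share of memory
    mem_max_pct: int   # upper limit on memory


@dataclass
class Rule:
    matches: Callable[[Task], bool]
    target: ResourceClass


def classify(task: Task, rules: List[Rule], default: ResourceClass) -> ResourceClass:
    """Return the class of the first matching rule, so rule order matters."""
    for rule in rules:
        if rule.matches(task):
            return rule.target
    return default


# Hypothetical policy modelled on the examples above; the gids are made up.
MAIL = ResourceClass("mail", 0, 10)
ADMIN = ResourceClass("admin", 0, 100)
STAFF = ResourceClass("staff", 30, 80)
STUDENTS = ResourceClass("students", 10, 50)
DEFAULT = ResourceClass("default", 0, 100)

RULES = [
    Rule(lambda t: t.name == "sendmail", MAIL),
    Rule(lambda t: t.uid == 0, ADMIN),
    Rule(lambda t: t.gid == 200, STAFF),
    Rule(lambda t: t.gid == 100, STUDENTS),
]
```

A resource control module would then look only at the class a task landed in, not at the task itself, which is what keeps the classifier and the schedulers separate.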
| riel | the possibilities of what you can do with CKRM are nearly endless | 
| riel | I am sure those of you with BOFH inspiration can come up with some creative ideas ... | 
| riel | [again, if you have questions ask them in #qc] | 
| riel | you can find information on CKRM on http://ckrm.sourceforge.net/ | 
| riel | of course, CKRM has some serious downsides too | 
| riel | it is very cool and very flexible, but also very complex | 
| riel | I would not be surprised if CKRM was too complex for Linus | 
| riel | and things need to be made simpler before it can be merged into the 2.7 kernel | 
| riel | <BigSam72> ok, when CKRM is implemented and limits are set, for example for sendmail, what happens when sendmail reaches a limit? do memory allocations fail? | 
| riel | in the most common case, sendmail would get swapped out | 
| riel | it would get virtual memory, just not physical memory | 
| riel | also, if the system has free memory that is not being used at the time, a resource class can just borrow that memory | 
| riel | <jamesm> riel: what is the performance overhead? | 
| riel | I cannot answer the performance overhead question yet, since CKRM is in very early stages | 
| riel | the code is not quite ready yet and needs a lot of work | 
| riel | I suspect the performance overhead will be small for most resource schedulers | 
| riel | <franl> Can CKRM control only CPU and memory usage, or can it control other things, like fork()s and send()s per second? | 
| riel | franl: currently CKRM can control CPU, memory and IO use only | 
| riel | but people are planning more resource modules | 
| riel | for CPU and IO, the CKRM module is a scheduler | 
| riel | so you can give certain bandwidth guarantees and maximum limits to resource groups | 
| riel | memory is fairly similar, except for one big difference | 
| riel | you have a new second of CPU time every second, but memory doesn't grow ;) | 
| riel | in computer science terms, memory is a non-renewable resource | 
| riel | so if a resource group uses more memory than its limit but something else needs it, the system needs to do work to take it away (swapping out) | 
| riel | for CPU, IO bandwidth or network scheduling the system does not need to do such work | 
| riel | for system administrators there is another issue to keep in mind | 
| riel | if you give every resource group in your system a 10% guaranteed minimum, make sure you don't have more than 10 resource groups ;)) | 
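The arithmetic behind that warning is simple: guaranteed minimums cannot add up to more than 100% of the resource. A minimal check, with hypothetical function names that are not part of CKRM, could look like this:

```python
from typing import Iterable


def guarantees_fit(min_shares_pct: Iterable[int]) -> bool:
    """True if the guaranteed minimums can all be honoured at the same time."""
    return sum(min_shares_pct) <= 100


def max_groups(min_share_pct: int) -> int:
    """How many groups can each get the same guaranteed minimum share."""
    return 100 // min_share_pct


if __name__ == "__main__":
    print(guarantees_fit([10] * 10))   # ten groups at 10% each
    print(guarantees_fit([10] * 11))   # one group too many
    print(max_groups(10))
```

Maximum limits are different: they may add up to more than 100%, since they only cap what a group can take when the resource happens to be available.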
| riel | <franl> What's the system call interface to CKRM look like?  Is it just a bunch of ioctl()s? | 
| riel | franl: the interface to userspace is still in flux | 
| riel | CKRM A0* used system calls, but CKRM B0* seems to be using a /proc interface | 
| riel | this could change again in the future, until Linus is happy ;) | 
| riel | <franl> Does Linus support CKRM in principle for 2.7 development? | 
| riel | I don't think he has been asked yet ;)) | 
| riel | it may be difficult to convince him that CKRM is cool | 
| riel | he never likes server-only things | 
| riel | Linus wants functionality to be useful for everybody | 
| riel | ... and he is right | 
| riel | however, CKRM may be useful for desktop systems | 
| riel | for example, the desktop user could get a guaranteed minimum amount of the system resources so updatedb cannot make the desktop slow | 
| riel | yes it's a hack, but if it helps making the desktop better ... ;) | 
| riel | any other questions about CKRM, before I move on to "memory hotplug" ? | 
| riel | ... | 
| riel | ok, memory hotplug ;) | 
| riel | big server manufacturers are working on a new piece of functionality | 
| riel | the idea is that system administrators can plug new memory (DIMMs) into the system, while the system is running | 
| riel | some even want the system administrator to be able to remove DIMMs | 
| riel | now, adding memory should be doable during the 2.6 kernel | 
| riel | we already have NUMA support in the kernel, to support different areas of memory in a system | 
| riel | when the system administrator adds new memory, we could create a new memory zone for that memory | 
| riel | and then hook up the new zone in the list of other memory zones | 
| riel | after that we can start using the memory | 
| riel | "simple" ... except for some details I will not bother you with now ;) | 
| riel | <franl> How do you remove DIMMs that have dirty pages in them? | 
| riel | ok ... memory removal is a BIG PROBLEM ;) | 
| riel | I don't think Linux is going to support that any time soon | 
| riel | if all the memory in a DIMM belongs to user programs, we could just swap them out when the administrator says he wants to remove a DIMM from the system | 
| riel | but what if the memory is mlocked and we're forbidden from swapping it out ? | 
| riel | or worse, what if the dimm contains kernel data structures that are referenced by physical memory address ?! | 
| riel | I don't see any good way to deal with that | 
| riel | I can think of a few BAD ways, but we don't want that ;) | 
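The removal problem boils down to a check over everything stored in the DIMM. The sketch below is a toy model with hypothetical names (`PageUse`, `can_hot_remove`), not kernel code: it only encodes the three cases just described, where ordinary user pages can be swapped out but mlocked pages and kernel data cannot be moved.

```python
from enum import Enum
from typing import Iterable


class PageUse(Enum):
    FREE = "free"
    USER = "user"          # ordinary process memory, can be swapped out
    MLOCKED = "mlocked"    # pinned with mlock(), must not be swapped out
    KERNEL = "kernel"      # kernel data, may be referenced by physical address


# Only these kinds of pages can be emptied out of a DIMM in this toy model.
EVACUABLE = {PageUse.FREE, PageUse.USER}


def can_hot_remove(pages: Iterable[PageUse]) -> bool:
    """A DIMM is removable only if every page on it can be evacuated."""
    return all(page in EVACUABLE for page in pages)


if __name__ == "__main__":
    print(can_hot_remove([PageUse.FREE, PageUse.USER]))
    print(can_hot_remove([PageUse.USER, PageUse.KERNEL]))
```

A single kernel or mlocked page is enough to block removal of the whole DIMM, which is why hot-remove is so much harder than hot-add.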
| riel | <franl> Even if a DIMM can be purged of kernel pages and dirty user pages, you still have to hope the sysadmin pulls the right DIMM. :) | 
| riel | I guess that's what the little green and red lights are for ;)) | 
| riel | memory hotplug cards tend to have all kinds of status lights on them, luckily | 
| riel | also, why would you ever want to remove memory from a system ? | 
| riel | I can think of 2 things a system administrator needs to do: | 
| riel | 1) add more memory, because the programs need more | 
| riel | 2) replace a piece of bad memory with good memory ... but that could be done in hardware, with the hardware mirroring the bad memory to a piece of good memory and then letting the sysadmin pull the old DIMM | 
| riel | in this case "bad memory" would be memory that gets correctable ECC errors | 
| riel | so the data is still good | 
| riel | <hans> <riel> also, why would you ever want to remove memory from a system ? | 
| riel | <hans> maybe to exchange it with faster ram? | 
| riel | <franl> Or to upgrade to higher capacity DIMMs. | 
| riel | ok, two good points ;) | 
| riel | especially the higher capacity DIMM argument is a valid one | 
| riel | I forgot all about that | 
| riel | somebody from VALinux Japan is working on memory hot remove, btw | 
| riel | but he is running into the fundamental problems I just described | 
| riel | so his patch only works most of the time | 
| riel | also, he can only remove memory that has no kernel data in it | 
| riel | in short, for the 2.6 kernel you should only expect memory hot-add | 
| riel | hot-remove is very complex ... | 
| riel | ... | 
| riel | are there any other questions about the memory hotplug support ? | 
| riel | ok, then I guess this presentation is done ;) | 
| riel | thanks to the Umeet organisers for putting this event together | 
| riel | I know how much work it is and am thankful they organised another Umeet | 
| riel | I would also like to thank the translators, who are working hard to get talks translated (live!) into other languages | 
| riel | if you are still awake, I'd also like to thank the audience | 
| riel | it just wouldn't be the same if I was talking to myself ;) | 
| riel | thanks everyone, this Umeet was great again |