Umeet :: Presentation

Talk

20031223-3.en

`riel`	`ok ...`
`riel`	`it's good to be back at Umeet`
`riel`	`this is now the 4th time I've participated at Umeet`
`riel`	`and it's always been fun`
`riel`	`today I'll talk about a few cool patches (and projects) that are almost ready to go into the 2.6 kernel`
`riel`	`I don't have any text prepared and will just be talking "live"`
`riel`	`so don't worry about interrupting me`
`riel`	`you can ask questions at ANY time, in the #qc channel`
`riel`	`there is also going to be a translation into spanish in #redes`
`riel`	`and into dutch in #taee`
`riel`	`I guess I should start by saying that the 2.6.0 kernel looks way better than the 2.0.0, 2.2.0 or 2.4.0 kernels`
`riel`	`so you should all try the 2.6 kernel and report bugs to linux-kernel@vger.kernel.org ;))`
`riel`	`I'll be talking a bit about the following projects`
`riel`	`- execshield`
`riel`	`- 4/4 split`
`riel`	`- CKRM (class-based kernel resource manager)`
`riel`	`- memory hotplug`
`riel`	`...`
`riel`	`the first patch I am going to talk about is security related`
`riel`	`as you probably already know, the 2.6 kernel has some security improvements to help limit the damage that can be done through a security hole`
`riel`	`for example, with selinux you can limit the damage that is done when sendmail is exploited AGAIN`
`riel`	`or bind ;)`
`riel`	`with exec-shield you can do another step towards making the system secure`
`riel`	`basically the layout of your process memory gets changed a little bit`
`riel`	`and data and the stack are by default not executable any more`
`riel`	`this makes it a lot harder for a normal buffer overflow to turn into an exploit`
`riel`	`on x86 CPUs the page tables do not have an executable permission bit, so exec-shield needs to do really ugly segmentation tricks`
`riel`	`luckily most other CPUs Linux runs on, including AMD64, have executable bits in the page tables, so the segmentation tricks will no longer be needed in the future`
`riel`	`<hans> riel: will this process not bring down the speed of the os?`
`riel`	`hans: yes, absolutely`
`riel`	`segmentation makes the program run a little bit slower`
`riel`	`however, the increased security will be worth the speed difference for some people`
`riel`	`at this point I should also point out that exec-shield is NOT the most luxurious memory management change for security`
`riel`	`PaX is probably a lot more flexible`
`riel`	`but at the cost of more overhead than exec-shield`
`riel`	`I suspect that exec-shield will be a reasonable compromise between extra security and performance for most people`
`riel`	`it goes together with some tricks like randomising the start address of the stack, the heap and the executables and libraries`
`riel`	`so using buffer overflows to jump to a libc function becomes very improbable`
`riel`	`the attacker will just crash sendmail, instead of getting a root shell`
`riel`	`(or more likely, the overflow will not be an attack at all)`
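The effect of address randomisation can be put into rough numbers. The sketch below is a toy model, not a description of exec-shield itself: `ENTROPY_BITS` is a hypothetical figure (the talk does not say how many address bits are randomised), and each guess is assumed to face a freshly randomised layout.

```python
# Toy model of why randomised addresses make "jump to a libc function"
# attacks unlikely to succeed. ENTROPY_BITS is a hypothetical figure,
# not a number quoted in the talk.

ENTROPY_BITS = 12   # assumed number of randomised address bits


def success_probability(entropy_bits, attempts=1):
    """Chance that at least one of `attempts` independent guesses hits the right address."""
    slots = 2 ** entropy_bits           # number of equally likely layouts
    return 1 - (1 - 1 / slots) ** attempts


for attempts in (1, 10, 100):
    p = success_probability(ENTROPY_BITS, attempts)
    print(f"{attempts:3d} attempt(s): {p:.4%} chance of a correct guess")
```

Every wrong guess in this model corresponds to a crashed process rather than a root shell, which is the outcome described above.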
`riel`	`<xtingray> riel: what is the cheapest method, in terms of resources, that you know?`
`riel`	`xtingray: well, the best would be using an AMD64 chip ;))`
`riel`	`that hardware has the executable bit for page tables built right in`
`riel`	`and there is no performance penalty for a non-exec stack or heap ;)`
`riel`	`<hans> riel: what makes this better than, for example, a chrooted environment for applications?`
`riel`	`ok, exec-shield is not "better" ... I would use the two together`
`riel`	`you can (and probably should) run named in a chroot environment`
`riel`	`but once somebody breaks in, they can still use your computer to send network packets to somewhere else (maybe to help a DDoS?)`
`riel`	`so reducing the chance that a buffer overflow can actually be exploited is probably a good thing to do`
`riel`	`oh, you can get exec-shield as part of Arjan's 2.6 kernel RPM`
`riel`	`at http://people.redhat.com/~arjanv/`
`riel`	`I have not seen it in any other kernel patch sets yet`
`riel`	`<amplifiel> other distros like adamantix use rsbac and pax for security, what about these patches?`
`riel`	`ok`
`riel`	`rsbac is a bit like selinux`
`riel`	`it helps a lot to reduce the damage after a program is broken into`
`riel`	`but it does not help prevent break-ins into one program`
`riel`	`PaX helps prevent such break-ins, but at a much higher cost than exec-shield`
`riel`	`if you are really paranoid, you will probably prefer PaX over exec-shield`
`riel`	`but personally I suspect that the performance impact of PaX (in particular the extra space use, meaning your programs have less address space) will make it too "expensive" for most people`
`riel`	`are there any other questions on exec-shield ?`
`riel`	`(otherwise I'll move on to the 4/4 split)`
`riel`	`...`
`riel`	`ok, 4/4 split ;)`
`riel`	`I'll now explain about what is probably the biggest problem Linux has on 32 bit x86 systems`
`riel`	`the problem is that x86 can have up to 64GB of physical memory, but only 4GB virtual memory`
`riel`	`and the classical Linux virtual memory layout means that the kernel only has 1GB of space!`
`riel`	`that means, 1GB of space to manage 64GB of memory`
`riel`	`that is simply not enough space if you run the kind of programs anybody with a 64GB server runs`
`riel`	`to make a long story short, with 1GB kernel space, a system with more than about 24GB RAM is nearly useless`
`riel`	`because you do not have the kernel memory to run the programs people with a big system run`
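A back-of-the-envelope calculation shows how quickly 1GB of kernel space runs out. The figures in this sketch are assumptions and do not come from the talk: 4 KiB pages, a 32-byte per-page bookkeeping structure, and roughly 896 MiB of the 1GB being directly usable by the kernel. It counts only the per-page descriptors, before any page tables, caches or other kernel data.

```python
# Rough budget for the kernel's share of a 32-bit address space.
# Assumptions (not from the talk): 4 KiB pages, a 32-byte per-page
# descriptor, and about 896 MiB of directly usable kernel space.

GIB = 1024 ** 3
MIB = 1024 ** 2

PAGE_SIZE = 4096
PAGE_DESCRIPTOR_BYTES = 32      # assumed size of the per-page bookkeeping struct
KERNEL_DIRECT_MAP = 896 * MIB   # assumed usable part of the 1 GiB kernel space


def page_descriptor_overhead(ram_bytes):
    """Bytes of kernel space needed just to describe every physical page."""
    pages = ram_bytes // PAGE_SIZE
    return pages * PAGE_DESCRIPTOR_BYTES


for ram_gib in (4, 16, 24, 32, 64):
    overhead = page_descriptor_overhead(ram_gib * GIB)
    share = overhead / KERNEL_DIRECT_MAP
    print(f"{ram_gib:2d} GiB RAM -> {overhead // MIB:3d} MiB of descriptors "
          f"({share:.0%} of kernel space)")
```

Under these assumptions a 64 GiB machine spends 512 MiB, more than half of the kernel's space, on page descriptors alone.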
`riel`	`in 2.6, and later 2.4 kernels, the page tables were moved to high memory`
`riel`	`that is, they are stored outside of the 1GB of kernel space`
`riel`	`that increased the limit from 16GB to 24 or 32GB`
`riel`	`but still, nowhere near the 64GB that x86 systems can use`
`riel`	`of course, the real solution is for the people with really big servers to use a 64 bit CPU`
`riel`	`so the kernel has all the space it needs`
`riel`	`but noooo, they want a cheap server ;((`
`riel`	`so they buy x86`
`riel`	`of course, the software people are always the ones left with the problem ;)`
`riel`	`the simplest thing we can do is increase the kernel space to 4GB`
`riel`	`but, there is only 4GB total available in Linux, divided between userspace and kernel space`
`riel`	`so we need to change that`
`riel`	`Ingo Molnar made a patch that does something pretty ugly, that just happens to work well and needs few changes in the rest of the kernel code`
`riel`	`you know that every process has its own memory space`
`riel`	`with Ingo's 4/4 split patch, the _kernel_ also has its own 4GB memory space`
`riel`	`and every time you make a system call or an interrupt happens, the system does a memory context switch`
`riel`	`into the 4GB large kernel memory space`
`riel`	`this way the kernel has enough memory to manage 64GB of physical memory and the programs running in it`
`riel`	`however, it does come at quite a cost`
`riel`	`it commonly costs 10% performance`
`riel`	`because the CPU needs to switch memory address spaces all the time`
`riel`	`on some benchmarks the cost is as high as 30% ...`
`riel`	`also, this is the last big change that can be done on 32 bit systems`
`riel`	`if Intel ever comes out with a 32 bit chip that can address more than 64GB of physical memory, there is no next trick we can use`
`riel`	`that is why I think that the only real solution is to use a 64 bit chip`
`riel`	`if you need lots of memory`
`riel`	`<jamesm> riel: how long do you think people will keep using ia32 for large systems?`
`riel`	`I think they will keep using ia32 until Intel has a cheap 64 bit CPU`
`riel`	`or until they need more than 128GB of memory`
`riel`	`I am afraid that IA64 will never really become cheap`
`riel`	`because it is designed as a very high-end chip`
`riel`	`however, with AMD marketing their cheap 64 bit chip, I think Intel will have to come up with something`
`riel`	`I really hope they do ... ;)`
`riel`	`any other questions about the 4/4 split, or memory management issues ?`
`riel`	`ok, I'll hold a 1 minute break to give the translators a chance to catch up`
`riel`	`then I'll continue with CKRM, the class-based kernel resource manager`
`riel`	`...`
`riel`	`CKRM, class-based kernel resource manager`
`riel`	`this is the kind of project I have been dreaming about since the 2.0 kernel ;)`
`riel`	`and some small aspects of it are in the kernel`
`riel`	`basically, CKRM consists of two parts:`
`riel`	`1) a classifier, to group tasks into resource classes based on`
`riel`	`- pid`
`riel`	`- gid`
`riel`	`- uid`
`riel`	`- name`
`riel`	`- resource class id`
`riel`	`- ...`
`riel`	`2) resource control modules, that plug into the CKRM core and`
`riel`	`- divide the CPU fairly between resource classes`
`riel`	`- enforce memory limits between resource classes`
`riel`	`- ...`
`riel`	`basically, with CKRM you will be able to do things like:`
`riel`	`"I want sendmail and all processes started by sendmail to consume no more than 10% of memory or 20% of the CPU"`
`riel`	`so no matter how overloaded your mail queue is, your system as a whole will not be overloaded`
`riel`	`or at a university, you could specify "the students get between 10% and 50% of memory, the staff get between 30% and 80% of memory, the system administrator gets as much as he wants"`
`riel`	`the possibilities of what you can do with CKRM are nearly endless`
`riel`	`I am sure those of you with BOFH inspiration can come up with some creative ideas ...`
`riel`	`[again, if you have questions ask them in #qc]`
`riel`	`you can find information on CKRM on http://ckrm.sourceforge.net/`
`riel`	`of course, CKRM has some serious downsides too`
`riel`	`it is very cool and very flexible, but also very complex`
`riel`	`I would not be surprised if CKRM was too complex for Linus`
`riel`	`and things need to be made simpler before it can be merged into the 2.7 kernel`
`riel`	`<BigSam72> ok, when CKRM is implemented and limits are set, for example for sendmail, what happens when sendmail reaches a limit? do memory allocations fail?`
`riel`	`in the most common case, sendmail would get swapped out`
`riel`	`it would get virtual memory, just not physical memory`
`riel`	`also, if the system has free memory that is not being used at the time, a resource class can just borrow that memory`
`riel`	`<jamesm> riel: what is the performance overhead?`
`riel`	`I cannot answer the performance overhead question yet, since CKRM is in very early stages`
`riel`	`the code is not quite ready yet and needs a lot of work`
`riel`	`I suspect the performance overhead will be small for most resource schedulers`
`riel`	`<franl> Can CKRM control only CPU and memory usage, or can it control other things, like fork()s and send()s per second?`
`riel`	`franl: currently CKRM can control CPU, memory and IO use only`
`riel`	`but people are planning more resource modules`
`riel`	`for CPU and IO, the CKRM module is a scheduler`
`riel`	`so you can give certain bandwidth guarantees and maximum limits to resource groups`
`riel`	`memory is fairly similar, except for one big difference`
`riel`	`you have a new second of CPU time every second, but memory doesn't grow ;)`
`riel`	`in computer science terms, memory is a non-renewable resource`
`riel`	`so if a resource group uses more memory than its limit but something else needs it, the system needs to do work to take it away (swapping out)`
`riel`	`for CPU, IO bandwidth or network scheduling the system does not need to do such work`
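The difference between borrowing free memory and having to swap can be sketched as a toy policy. This is an illustration only, not CKRM's actual algorithm: the function name, its parameters and the "free memory first" rule are all assumptions made for the example.

```python
def pages_to_swap_out(usage, limit, free_pages, requested):
    """Pages to take back from an over-limit class so another class can get `requested` pages.

    Toy policy (an assumption, not CKRM's real behaviour): free memory is
    handed out first; only the shortfall is reclaimed, and never more than
    the amount by which the class exceeds its limit.
    """
    shortfall = max(0, requested - free_pages)
    over_limit = max(0, usage - limit)
    return min(shortfall, over_limit)


# A class 50 pages over its limit keeps everything while free memory lasts ...
print(pages_to_swap_out(usage=150, limit=100, free_pages=50, requested=30))
# ... and only loses pages once another class needs more than what is free.
print(pages_to_swap_out(usage=150, limit=100, free_pages=10, requested=30))
```

A CPU or IO scheduler has no equivalent of the second case: unused time in one second is simply not handed out in the next.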
`riel`	`for system administrators there is another issue to keep in mind`
`riel`	`if you give every resource group in your system a 10% minimum guaranteed, make sure you don't have more than 10 resource groups ;))`
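The arithmetic behind that warning is simple enough to check automatically. The following is a minimal sketch with hypothetical class names; it does not use any real CKRM interface, which was still in flux at the time of the talk.

```python
def check_guarantees(min_shares):
    """Return the unreserved percentage, or raise if the minimums oversubscribe.

    `min_shares` maps a (hypothetical) resource class name to its guaranteed
    minimum, as a whole percentage of the resource.
    """
    total = sum(min_shares.values())
    if total > 100:
        raise ValueError(f"guaranteed minimums add up to {total}%, which exceeds 100%")
    return 100 - total


# The university example from the talk: students 10%, staff 30% minimum.
print(check_guarantees({"students": 10, "staff": 30}))

# Eleven groups with a 10% guarantee each cannot all be satisfied.
try:
    check_guarantees({f"group{i}": 10 for i in range(11)})
except ValueError as err:
    print("rejected:", err)
```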
`riel`	`<franl> What's the system call interface to CKRM look like? Is it just a bunch of ioctl()s?`
`riel`	`franl: the interface to userspace is still in flux`
`riel`	`CKRM A0* used system calls, but CKRM B0* seems to be using a /proc interface`
`riel`	`this could change again in the future, until Linus is happy ;)`
`riel`	`<franl> Does Linus support CKRM in principle for 2.7 development?`
`riel`	`I don't think he has been asked yet ;))`
`riel`	`it may be difficult to convince him that CKRM is cool`
`riel`	`he never likes server-only things`
`riel`	`Linus wants functionality to be useful for everybody`
`riel`	`... and he is right`
`riel`	`however, CKRM may be useful for desktop systems`
`riel`	`for example, the desktop user could get a guaranteed minimum amount of the system resources so updatedb cannot make the desktop slow`
`riel`	`yes it's a hack, but if it helps making the desktop better ... ;)`
`riel`	`any other questions about CKRM, before I move on to "memory hotplug" ?`
`riel`	`...`
`riel`	`ok, memory hotplug ;)`
`riel`	`big server manufacturers are working on a new piece of functionality`
`riel`	`the idea is that system administrators can plug new memory (DIMMs) into the system, while the system is running`
`riel`	`some even want the system administrator to be able to remove DIMMs`
`riel`	`now, adding memory should be doable during the 2.6 kernel`
`riel`	`we already have NUMA support in the kernel, to support different areas of memory in a system`
`riel`	`when the system administrator adds new memory, we could create a new memory zone for that memory`
`riel`	`and then hook up the new zone in the list of other memory zones`
`riel`	`after that we can start using the memory`
`riel`	`"simple" ... except for some details I will not bother you with now ;)`
`riel`	`<franl> How do you remove DIMMs that have dirty pages in them?`
`riel`	`ok ... memory removal is a BIG PROBLEM ;)`
`riel`	`I don't think Linux is going to support that any time soon`
`riel`	`if all the memory in a DIMM belongs to user programs, we could just swap them out when the administrator says he wants to remove a DIMM from the system`
`riel`	`but what if the memory is mlocked and we're forbidden from swapping it out ?`
`riel`	`or worse, what if the dimm contains kernel data structures that are referenced by physical memory address ?!`
`riel`	`I don't see any good way to deal with that`
`riel`	`I can think of a few BAD ways, but we don't want that ;)`
`riel`	`<franl> Even if a DIMM can be purged of kernel pages and dirty user pages, you still have to hope the sysadmin pulls the right DIMM. :)`
`riel`	`I guess that's what the little green and red lights are for ;))`
`riel`	`memory hotplug cards tend to have all kinds of status lights on them, luckily`
`riel`	`also, why would you ever want to remove memory from a system ?`
`riel`	`I can think of 2 things a system administrator needs to do:`
`riel`	`1) add more memory, because the programs need more`
`riel`	`2) replace a piece of bad memory with good memory ... but that could be done in hardware, with the hardware mirroring the bad memory to a piece of good memory and then letting the sysadmin pull the old DIMM`
`riel`	`in this case "bad memory" would be memory that gets correctable ECC errors`
`riel`	`so the data is still good`
`riel`	`<hans> <riel> also, why would you ever want to remove memory from a system ?`
`riel`	`<hans> maybe to exchange it with faster ram?`
`riel`	`<franl> Or to upgrade to higher capacity DIMMs.`
`riel`	`ok, two good points ;)`
`riel`	`especially the higher capacity DIMM argument is a valid one`
`riel`	`I forgot all about that`
`riel`	`somebody from VALinux Japan is working on memory hot remove, btw`
`riel`	`but he is running into the fundamental problems I just described`
`riel`	`so his code only works most of the time`
`riel`	`also, he can only remove memory that has no kernel data in it`
`riel`	`in short, for the 2.6 kernel you should only expect memory hot-add`
`riel`	`hot-remove is very complex ...`
`riel`	`...`
`riel`	`are there any other questions about the memory hotplug support ?`
`riel`	`ok, then I guess this presentation is done ;)`
`riel`	`thanks to the Umeet organisers for putting this event together`
`riel`	`I know how much work it is and am thankful they organised another Umeet`
`riel`	`I would also like to thank the translators, who are working hard to get talks translated (live!) into other languages`
`riel`	`if you are still awake, I'd also like to thank the audience`
`riel`	`it just wouldn't be the same if I was talking to myself ;)`
`riel`	`thanks everyone, this Umeet was great again`
