IV International Conference of Unix at Uninet
Talk

20031223-3.en

rielok ...
rielit's good to be back at Umeet
rielthis is now the 4th time I've participated at Umeet
rieland it's always been fun
rieltoday I'll talk about a few cool patches (and projects) that are almost ready to go into the 2.6 kernel
rielI don't have any text prepared and will just be talking "live"
rielso don't worry about interrupting me
rielyou can ask questions at ANY time, in the #qc channel
rielthere is also going to be a translation into Spanish in #redes
rieland into Dutch in #taee
rielI guess I should start by saying that the 2.6.0 kernel looks way better than the 2.0.0, 2.2.0 or 2.4.0 kernels
rielso you should all try the 2.6 kernel and report bugs to linux-kernel@vger.kernel.org ;))
rielI'll be talking a bit about the following projects
riel- execshield
riel- 4/4 split
riel- CKRM (class-based kernel resource manager)
riel- memory hotplug
riel...
rielthe first patch I am going to talk about is security related
rielas you probably already know, the 2.6 kernel has some security improvements to help limit the damage that can be done through a security hole
rielfor example, with selinux you can limit the damage that is done when sendmail is exploited AGAIN
rielor bind ;)
rielwith exec-shield you can do another step towards making the system secure
rielbasically the layout of your process memory gets changed a little bit
rieland data and the stack are by default not executable any more
rielthis makes it a lot harder for a normal buffer overflow to turn into an exploit
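A minimal sketch, not from the talk, of how one might check the "stack and data are not executable" property from userspace. It parses text in the `/proc/<pid>/maps` format of current Linux kernels and reports whether the `[stack]` and `[heap]` mappings carry the `x` permission. The function name `executable_regions` and the sample text are assumptions made up for illustration; the sample is not output from a real system.

```python
# Sketch: report whether the [stack] and [heap] mappings of a process are
# executable, by parsing text in the /proc/<pid>/maps format.

def executable_regions(maps_text):
    """Return {region_name: bool} for the [stack] and [heap] lines."""
    result = {}
    for line in maps_text.splitlines():
        fields = line.split()
        # Line format: address perms offset dev inode [pathname]
        if len(fields) < 6:
            continue  # anonymous mapping without a name
        perms, name = fields[1], fields[5]
        if name in ("[stack]", "[heap]"):
            result[name] = "x" in perms
    return result


# Hypothetical sample, invented for illustration only.
SAMPLE = """\
08048000-0804c000 r-xp 00000000 03:01 12345 /bin/cat
0804c000-0804d000 rw-p 00003000 03:01 12345 /bin/cat
0804d000-0806e000 rw-p 00000000 00:00 0 [heap]
bffdf000-c0000000 rw-p 00000000 00:00 0 [stack]
"""

if __name__ == "__main__":
    print(executable_regions(SAMPLE))
```

On a live Linux system the same function could be fed `open("/proc/self/maps").read()` instead of the sample.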
rielon x86 CPUs the page tables do not have an executable permission bit, so exec-shield needs to do really ugly segmentation tricks
rielluckily most other CPUs Linux runs on, including AMD64, have executable bits in the page tables, so the segmentation tricks will no longer be needed in the future
riel<hans> riel: will this process not bring down the speed of the OS?
rielhans: yes, absolutely
rielsegmentation makes the program run a little bit slower
rielhowever, the increased security will be worth the speed difference for some people
rielat this point I should also point out that exec-shield is NOT the most luxurious memory management change for security
rielPaX is probably a lot more flexible
rielbut at the cost of more overhead than exec-shield
rielI suspect that exec-shield will be a reasonable compromise between extra security and performance for most people
rielit goes together with some tricks like randomising the start address of the stack, the heap and the executables and libraries
rielso using buffer overflows to jump to a libc function becomes very improbable
rielinstead, the attacker will just crash sendmail, instead of getting a root shell
riel(or more likely, the overflow will not be an attack at all)
riel<xtingray> riel: what is the cheapest method, in terms of resources, that you know of?
rielxtingray: well, the best would be using an AMD64 chip ;))
rielthat hardware has the executable bit for page tables built right in
rieland there is no performance penalty for a non-exec stack or heap ;)
riel<hans> riel: what makes this better than, for example, a chrooted environment for applications?
rielok, exec-shield is not "better" ... I would use the two together
rielyou can (and probably should) run named in a chroot environment
rielbut once somebody breaks in, they can still use your computer to send network packets to somewhere else (maybe to help a DDoS?)
rielso reducing the chance that a buffer overflow can actually be exploited is probably a good thing to do
rieloh, you can get exec-shield as part of Arjan's 2.6 kernel RPM
rielat http://people.redhat.com/~arjanv/
rielI have not seen it in any other kernel patch sets yet
riel<amplifiel> other distros like adamantix use rsbac and pax for security, what about these patches?
rielok
rielrsbac is a bit like selinux
rielit helps a lot to reduce the damage after a program is broken into
rielbut it does not help prevent break-ins into one program
rielPaX helps prevent such break-ins, but comes at a much higher cost than exec-shield
rielif you are really paranoid, you will probably prefer PaX over exec-shield
rielbut personally I suspect that the performance impact of PaX (in particular the extra space use, meaning your programs have less address space) will make it too "expensive" for most people
rielare there any other questions on exec-shield ?
riel(otherwise I'll move on to the 4/4 split)
riel...
rielok, 4/4 split ;)
rielI'll now explain about what is probably the biggest problem Linux has on 32 bit x86 systems
rielthe problem is that x86 can have up to 64GB of physical memory, but only 4GB virtual memory
rieland the classical Linux virtual memory layout means that the kernel only has 1GB of space!
rielthat means, 1GB of space to manage 64GB of memory
rielthat is simply not enough space if you run the kind of programs anybody with a 64GB server runs
rielto make a long story short, with 1GB kernel space, a system with more than about 24GB RAM is nearly useless
rielbecause you do not have the kernel memory to run the programs people with a big system run
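To make the "1GB to manage 64GB" squeeze concrete, here is a rough worked calculation, not from the talk. It considers only one kernel cost that grows with RAM: a fixed-size bookkeeping record per physical page, kept in kernel space. The 32-byte record size is an illustrative assumption, and the names `STRUCT_PAGE_BYTES`, `page_array_bytes` and `fraction_of_kernel_space` are hypothetical.

```python
# Sketch: how much kernel address space per-page bookkeeping alone would
# consume, under an ASSUMED fixed cost per physical page.

PAGE_SIZE = 4096          # bytes per page on 32-bit x86
STRUCT_PAGE_BYTES = 32    # ASSUMPTION: illustrative per-page record size
GB = 1024 ** 3
MB = 1024 ** 2


def page_array_bytes(ram_bytes, per_page=STRUCT_PAGE_BYTES):
    """Bytes of bookkeeping needed to describe every page of RAM."""
    return (ram_bytes // PAGE_SIZE) * per_page


def fraction_of_kernel_space(ram_bytes, kernel_space=1 * GB):
    """Share of the kernel's address space taken by that bookkeeping."""
    return page_array_bytes(ram_bytes) / kernel_space


if __name__ == "__main__":
    for ram_gb in (4, 16, 32, 64):
        used = page_array_bytes(ram_gb * GB)
        share = fraction_of_kernel_space(ram_gb * GB)
        print(f"{ram_gb:2d}GB RAM: {used // MB} MB bookkeeping, "
              f"{share:.1%} of a 1GB kernel space")
```

Under that assumption, 64GB of RAM already spends half of a 1GB kernel space on bookkeeping before any caches, buffers or per-process kernel data are counted.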
rielin 2.6, and later 2.4 kernels, the page tables were moved to high memory
rielthat is, they are stored outside of the 1GB of kernel space
rielthat increased the limit from 16GB to 24 or 32GB
rielbut still, nowhere near the 64GB that x86 systems can use
rielof course, the real solution is for the people with really big servers to use a 64 bit CPU
rielso the kernel has all the space it needs
rielbut noooo, they want a cheap server ;((
rielso they buy x86
rielof course, the software people are always the ones left with the problem ;)
rielthe simplest thing we can do is increase the kernel space to 4GB
rielbut, there is only 4GB total available in Linux, divided between userspace and kernel space
rielso we need to change that
rielIngo Molnar made a patch that does something pretty ugly, that just happens to work well and needs few changes in the rest of the kernel code
rielyou know that every process has its own memory space
rielwith Ingo's 4/4 split patch, the _kernel_ also has its own 4GB memory space
rieland every time you make a system call or an interrupt happens, the system does a memory context switch
rielinto the 4GB large kernel memory space
rielthis way the kernel has enough memory to manage 64GB of physical memory and the programs running in it
rielhowever, it does come at quite a cost
rielit commonly costs 10% performance
rielbecause the CPU needs to switch memory address spaces all the time
rielon some benchmarks the cost is as high as 30% ...
rielalso, this is the last big change that can be done on 32 bit systems
rielif Intel ever comes out with a 32 bit chip that can address more than 64GB of physical memory, there is no next trick we can use
rielthat is why I think that the only real solution is to use a 64 bit chip
rielif you need lots of memory
riel<jamesm> riel: how long do you think people will keep using ia32 for large systems?
rielI think they will keep using ia32 until Intel has a cheap 64 bit CPU
rielor until they need more than 128GB of memory
rielI am afraid that IA64 will never really become cheap
rielbecause it is designed as a very high-end chip
rielhowever, with AMD marketing their cheap 64 bit chip, I think Intel will have to come up with something
rielI really hope they do ... ;)
rielany other questions about the 4/4 split, or memory management issues ?
rielok, I'll hold a 1 minute break to give the translators a chance to catch up
rielthen I'll continue with CKRM, the class-based kernel resource manager
riel...
rielCKRM, class-based kernel resource manager
rielthis is the kind of project I have been dreaming about since the 2.0 kernel ;)
rieland some small aspects of it are in the kernel
rielbasically, CKRM consists of two parts:
riel1) a classifier, to group tasks into resource classes based on
riel  - pid
riel  - gid
riel  - uid
riel  - name
riel  - resource class id
riel  - ...
riel2) resource control modules, that plug into the CKRM core and
riel  - divide the CPU fairly between resource classes
riel  - enforce memory limits between resource classes
riel  - ...
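The classifier half described in the list above can be sketched as a first-match rule table. This is a toy model under stated assumptions, not CKRM's actual interface: the `Task` type, the `RULES` table, the class names and the `classify` function are all hypothetical.

```python
# Sketch: a toy classifier that maps tasks to resource classes using
# first-match rules on task attributes. Not CKRM's real API.

from dataclasses import dataclass


@dataclass
class Task:
    pid: int
    uid: int
    gid: int
    name: str


# Each rule is (attribute, value, resource_class); the first match wins.
# The values below are made up for illustration.
RULES = [
    ("name", "sendmail", "mail"),
    ("gid", 500, "students"),
    ("uid", 0, "admin"),
]


def classify(task, rules=RULES, default="default"):
    """Return the resource class of the first rule that matches the task."""
    for attr, value, resource_class in rules:
        if getattr(task, attr) == value:
            return resource_class
    return default


if __name__ == "__main__":
    print(classify(Task(pid=101, uid=0, gid=0, name="sendmail")))
    print(classify(Task(pid=202, uid=1001, gid=500, name="bash")))
    print(classify(Task(pid=303, uid=1002, gid=100, name="vi")))
```

Rule order matters in this sketch: a sendmail process running as root lands in the "mail" class because the name rule is listed before the uid rule.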
rielbasically, with CKRM you will be able to do things like:
riel"I want sendmail and all processes started by sendmail to consume no more than 10% of memory or 20% of the CPU"
rielso no matter how overloaded your mail queue is, your system as a whole will not be overloaded
rielor at a university, you could specify "the students get between 10% and 50% of memory, the staff get between 30% and 80% of memory, the system administrator gets as much as he wants"
rielthe possibilities of what you can do with CKRM are nearly endless
rielI am sure those of you with BOFH inspiration can come up with some creative ideas ...
riel[again, if you have questions ask them in #qc]
rielyou can find information on CKRM on http://ckrm.sourceforge.net/
rielof course, CKRM has some serious downsides too
rielit is very cool and very flexible, but also very complex
rielI would not be surprised if CKRM was too complex for Linus
rieland things need to be made simpler before it can be merged into the 2.7 kernel
riel<BigSam72> ok, once CKRM is implemented and limits are set, for example for sendmail, what happens when sendmail reaches a limit? do memory allocations fail?
rielin the most common case, sendmail would get swapped out
rielit would get virtual memory, just not physical memory
rielalso, if the system has free memory that is not being used at the time, a resource class can just borrow that memory
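The borrowing behaviour in this answer can be sketched as a small calculation, assuming a simplified model that is not CKRM's actual algorithm: a class over its limit only loses pages when a new request cannot be met from free memory, and it never loses more than the amount it is over its limit. The function name `pages_to_swap_out` and its parameters are hypothetical.

```python
# Sketch: how many pages a resource class would have to give back under a
# simplified "borrow while memory is free" model. Not CKRM's real algorithm.

def pages_to_swap_out(usage, limit, free, demand):
    """Pages to reclaim from a class using `usage` pages with a `limit`,
    when other classes request `demand` pages and `free` pages are unused."""
    shortfall = max(0, demand - free)      # what free memory cannot cover
    over_limit = max(0, usage - limit)     # what the class has borrowed
    return min(shortfall, over_limit)


if __name__ == "__main__":
    # Plenty of free memory: the class keeps what it borrowed.
    print(pages_to_swap_out(usage=150, limit=100, free=80, demand=50))
    # Memory is tight: part of the borrowed pages get swapped out.
    print(pages_to_swap_out(usage=150, limit=100, free=10, demand=50))
```

The swapped-out pages remain part of the process's virtual memory, which matches the point above that sendmail keeps virtual memory but loses physical memory.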
riel<jamesm> riel: what is the performance overhead?
rielI cannot answer the performance overhead question yet, since CKRM is in very early stages
rielthe code is not quite ready yet and needs a lot of work
rielI suspect the performance overhead will be small for most resource schedulers
riel<franl> Can CKRM control only CPU and memory usage, or can it control other things, like fork()s and send()s per second?
rielfranl: currently CKRM can control CPU, memory and IO use only
rielbut people are planning more resource modules
rielfor CPU and IO, the CKRM module is a scheduler
rielso you can give certain bandwidth guarantees and maximum limits to resource groups
rielmemory is fairly similar, except for one big difference
rielyou have a new second of CPU time every second, but memory doesn't grow ;)
rielin computer science terms, memory is a non-renewable resource
rielso if a resource group uses more memory than its limit but something else needs it, the system needs to do work to take it away (swapping out)
rielfor CPU, IO bandwidth or network scheduling the system does not need to do such work
rielfor system administrators there is another issue to keep in mind
rielif you give every resource group in your system a 10% minimum guarantee, make sure you don't have more than 10 resource groups ;))
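The arithmetic behind that warning is simply that minimum guarantees must not add up to more than 100%. A minimal sketch of such a sanity check follows; the function name `check_guarantees` and the group names are hypothetical, and no real CKRM configuration format is implied.

```python
# Sketch: verify that the minimum guarantees of all resource groups fit
# into 100% of a resource.

def check_guarantees(min_percent_by_group):
    """Return (ok, total), where ok is True if the guarantees can all be met."""
    total = sum(min_percent_by_group.values())
    return total <= 100, total


if __name__ == "__main__":
    ten_groups = {f"group{i}": 10 for i in range(10)}
    eleven_groups = {f"group{i}": 10 for i in range(11)}
    print(check_guarantees(ten_groups))
    print(check_guarantees(eleven_groups))
```

Ten groups at 10% each fit exactly; an eleventh group makes the guarantees impossible to honour at the same time.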
riel<franl> What's the system call interface to CKRM look like?  Is it just a bunch of ioctl()s?
rielfranl: the interface to userspace is still in flux
rielCKRM A0* used system calls, but CKRM B0* seems to be using a /proc interface
rielthis could change again in the future, until Linus is happy ;)
riel<franl> Does Linus support CKRM in principle for 2.7 development?
rielI don't think he has been asked yet ;))
rielit may be difficult to convince him that CKRM is cool
rielhe never likes server-only things
rielLinus wants functionality to be useful for everybody
riel... and he is right
rielhowever, CKRM may be useful for desktop systems
rielfor example, the desktop user could get a guaranteed minimum amount of the system resources so updatedb cannot make the desktop slow
rielyes it's a hack, but if it helps making the desktop better ... ;)
rielany other questions about CKRM, before I move on to "memory hotplug" ?
riel...
rielok, memory hotplug ;)
rielbig server manufacturers are working on a new piece of functionality
rielthe idea is that system administrators can plug new memory (DIMMs) into the system, while the system is running
rielsome even want the system administrator to be able to remove DIMMs
rielnow, adding memory should be doable during the 2.6 kernel series
rielwe already have NUMA support in the kernel, to support different areas of memory in a system
rielwhen the system administrator adds new memory, we could create a new memory zone for that memory
rieland then hook up the new zone in the list of other memory zones
rielafter that we can start using the memory
riel"simple" ... except for some details I will not bother you with now ;)
riel<franl> How do you remove DIMMs that have dirty pages in them?
rielok ... memory removal is a BIG PROBLEM ;)
rielI don't think Linux is going to support that any time soon
rielif all the memory in a DIMM belongs to user programs, we could just swap them out when the administrator says he wants to remove a DIMM from the system
rielbut what if the memory is mlocked and we're forbidden from swapping it out ?
rielor worse, what if the dimm contains kernel data structures that are referenced by physical memory address ?!
rielI don't see any good way to deal with that
rielI can think of a few BAD ways, but we don't want that ;)
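The conditions just described can be summed up in a toy predicate: a DIMM is only safe to pull if every page on it can be moved out of the way, which in this simplified model means it is neither kernel data nor mlocked. This is a sketch under those assumptions, not kernel code; the `Page` type and `dimm_removable` function are hypothetical.

```python
# Sketch: decide whether a DIMM could be emptied before removal, in a
# simplified model where only plain user pages can be swapped out.

from dataclasses import dataclass


@dataclass
class Page:
    kernel: bool = False    # holds kernel data referenced by physical address
    mlocked: bool = False   # user page that must not be swapped out


def dimm_removable(pages):
    """True if every page on the DIMM is a swappable user page."""
    return all(not p.kernel and not p.mlocked for p in pages)


if __name__ == "__main__":
    user_only = [Page(), Page(), Page()]
    with_mlock = [Page(), Page(mlocked=True)]
    with_kernel = [Page(), Page(kernel=True)]
    print(dimm_removable(user_only))
    print(dimm_removable(with_mlock))
    print(dimm_removable(with_kernel))
```

A single pinned or kernel page is enough to block removal of the whole DIMM in this model.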
riel<franl> Even if a DIMM can be purged of kernel pages and dirty user pages, you still have to hope the sysadmin pulls the right DIMM. :)
rielI guess that's what the little green and red lights are for ;))
rielmemory hotplug cards tend to have all kinds of status lights on them, luckily
rielalso, why would you ever want to remove memory from a system ?
rielI can think of 2 things a system administrator needs to do:
riel1) add more memory, because the programs need more
riel2) replace a piece of bad memory with good memory ... but that could be done in hardware, with the hardware mirroring the bad memory to a piece of good memory and then letting the sysadmin pull the old DIMM
rielin this case "bad memory" would be memory that gets correctable ECC errors
rielso the data is still good
riel<hans> <riel> also, why would you ever want to remove memory from a system ?
riel<hans> maybe to exchange it with faster ram?
riel<franl> Or to upgrade to higher capacity DIMMs.
rielok, two good points ;)
rielespecially the higher capacity DIMM argument is a valid one
rielI forgot all about that
rielsomebody from VALinux Japan is working on memory hot remove, btw
rielbut he is running into the fundamental problems I just described
rielso his code only works most of the time
rielalso, he can only remove memory that has no kernel data in it
rielin short, for the 2.6 kernel you should only expect memory hot-add
rielhot-remove is very complex ...
riel...
rielare there any other questions about the memory hotplug support ?
rielok, then I guess this presentation is done ;)
rielthanks to the Umeet organisers for putting this event together
rielI know how much work it is and am thankful they organised another Umeet
rielI would also like to thank the translators, who are working hard to get talks translated (live!) into other languages
rielif you are still awake, I'd also like to thank the audience
rielit just wouldn't be the same if I was talking to myself ;)
rielthanks everyone, this Umeet was great again

© 2003 - www.uninet.edu