Wednesday, September 02, 2009

Pedigree's "Services" Feature

Well, the Pedigree team is currently in bug-fixing mode as we prepare for our first release, codenamed "Foster", and as such there's not been a lot to talk about innovation-wise in the kernel. So I've decided instead to chat about a kernel feature that was implemented recently.

For our "live CDs" we realised we would need a read/write file system, in memory, with a set of applications and files to allow people to try Pedigree out without making changes to their computers. This meant that we needed a way to mount a disk image as a usable disk, which is a feature we lacked at that stage in Pedigree. So I spent a couple of hours whipping up a module to support such a feature. However, this module required a couple of tricky design decisions: caching and write support.

Cache is a must-have in such a module, as without it each read from the disk image ends up in a read to its parent hardware - not so great if a slow CD drive needs to seek! However, the cache must also be big enough to hold at least one sector of disk data (preferably more). I eventually settled upon a 4096-byte buffer cache, which holds two full CD sectors, and eight hard disk sectors. Initial reads will miss the cache and read straight from the hardware into the cache. Further reads come from that buffer in cache. For an image formatted as FAT, for example, this improves read times from the file allocation table and root directory (high-use areas) significantly.
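To make the idea concrete, here's a minimal sketch of a single-block read cache along those lines. It is not the Pedigree module itself: `BackingStore`, `CachedImage` and `kBlockSize` are names I've made up for illustration, and the backing store is just a byte vector that counts how often it gets read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// 4096 bytes: two 2048-byte CD sectors, or eight 512-byte hard disk sectors.
constexpr std::size_t kBlockSize = 4096;

// Hypothetical stand-in for the slow parent device. It counts reads so the
// effect of the cache is visible. Its size is assumed to be a multiple of
// kBlockSize.
struct BackingStore
{
    std::vector<std::uint8_t> data;
    std::size_t reads = 0;

    void read(std::size_t offset, std::uint8_t *out, std::size_t len)
    {
        assert(offset + len <= data.size());
        ++reads;
        std::memcpy(out, data.data() + offset, len);
    }
};

// Hypothetical disk-image wrapper holding exactly one cached block.
class CachedImage
{
public:
    explicit CachedImage(BackingStore &store) : m_Store(store) {}

    // Copy len bytes starting at offset into out, touching the backing
    // store only when the wanted block is not the one already cached.
    void read(std::size_t offset, std::uint8_t *out, std::size_t len)
    {
        while (len > 0)
        {
            std::size_t base = offset - (offset % kBlockSize);
            if (!m_Valid || base != m_Base)
            {
                // Cache miss: pull the whole block in from the device.
                m_Store.read(base, m_Block, kBlockSize);
                m_Base = base;
                m_Valid = true;
            }
            std::size_t within = offset - base;
            std::size_t chunk = kBlockSize - within;
            if (chunk > len)
                chunk = len;
            std::memcpy(out, m_Block + within, chunk);
            offset += chunk;
            out += chunk;
            len -= chunk;
        }
    }

private:
    BackingStore &m_Store;
    std::uint8_t m_Block[kBlockSize];
    std::size_t m_Base = 0;
    bool m_Valid = false;
};
```

With this sketch, two small reads that land in the same 4096-byte block cost one device read between them, which is the behaviour that helps for hot areas like a FAT or root directory.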

Writing to the virtual disk makes things slightly more complicated. I decided upon an implementation which gives the option for the virtual disk to be write-through, or totally in RAM. These options need a bit of explanation. A write-through virtual disk will place writes into cache, and also write them to the real disk image itself. This is an ideal setup for something like Linux-style loopback disks. However, consider the "live CD" situation: we can't write back to a CD! So the second option writes only into the cache without affecting the original file. This means all changes are kept for as long as the system is running, but do not persist beyond that. Implementing these two write options generalised the module - a massive bonus.
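The two write options can be sketched roughly as follows. This is an illustration under my own assumptions rather than the real module: `VirtualDisk`, `WriteMode` and `kCacheBlock` are hypothetical names, the disk image is modelled as a byte vector, and the cache is a map of blocks so that RAM-only changes have somewhere to live.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

enum class WriteMode { WriteThrough, RamOnly };

constexpr std::size_t kCacheBlock = 4096;

// Hypothetical virtual disk layered over an in-memory "image".
class VirtualDisk
{
public:
    VirtualDisk(std::vector<std::uint8_t> &image, WriteMode mode)
        : m_Image(image), m_Mode(mode)
    {
        assert(image.size() % kCacheBlock == 0);
    }

    void write(std::size_t offset, const std::uint8_t *in, std::size_t len)
    {
        assert(offset + len <= m_Image.size());
        while (len > 0)
        {
            std::size_t within = offset % kCacheBlock;
            std::size_t chunk = std::min(len, kCacheBlock - within);
            // Every write lands in the cache first...
            std::memcpy(cached(offset / kCacheBlock).data() + within, in, chunk);
            // ...and only a write-through disk also touches the image.
            if (m_Mode == WriteMode::WriteThrough)
                std::memcpy(m_Image.data() + offset, in, chunk);
            offset += chunk;
            in += chunk;
            len -= chunk;
        }
    }

    void read(std::size_t offset, std::uint8_t *out, std::size_t len)
    {
        assert(offset + len <= m_Image.size());
        while (len > 0)
        {
            std::size_t within = offset % kCacheBlock;
            std::size_t chunk = std::min(len, kCacheBlock - within);
            std::memcpy(out, cached(offset / kCacheBlock).data() + within, chunk);
            offset += chunk;
            out += chunk;
            len -= chunk;
        }
    }

private:
    // Return the cached copy of a block, loading it from the image on first use.
    std::vector<std::uint8_t> &cached(std::size_t index)
    {
        auto it = m_Cache.find(index);
        if (it == m_Cache.end())
        {
            auto first = m_Image.begin() + static_cast<std::ptrdiff_t>(index * kCacheBlock);
            auto last = first + static_cast<std::ptrdiff_t>(kCacheBlock);
            it = m_Cache.emplace(index, std::vector<std::uint8_t>(first, last)).first;
        }
        return it->second;
    }

    std::vector<std::uint8_t> &m_Image;
    WriteMode m_Mode;
    std::map<std::size_t, std::vector<std::uint8_t>> m_Cache;
};
```

In `RamOnly` mode a read after a write sees the new data while the image stays untouched, which is the live CD case; in `WriteThrough` mode the image is updated as well, which is the loopback case.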

However, all is not well at this stage. A module to provide a disk is only one part of the battle. At this stage, all we have is an abstraction of a disk - no partitions, no file systems - so it's useless for normal usage. At the time of implementation, Pedigree had no way to dynamically detect and mount such disks. This simply will not do for a modern operating system where storage devices are hardly static - USB mass storage devices come and go, hot-pluggable hard disks exist, and so on.

I didn't want to have my loopback disk module talk directly to the partition driver though. Exposing the internals of the partitioner to other modules makes changing the partitioner’s interface awfully complicated, and creates an explicit dependency on a specific module.

I decided instead to implement what I call Pedigree’s service manager. This kernel feature sits between different parts of the operating system and provides a standardised interface to other modules. Each service provides the following types of functionality (at the time of writing):

  • Write: Send data to the module
  • Read: Read data from the module
  • Touch: Inform the module of new state
  • Probe: Probe the module for a specific state or piece of information

Each service decides which features it provides, so it is possible for a service to provide only read as a function. The service manager takes these potential features and provides a generic interface for drivers and modules to talk to named services. In effect, this idea of services is a method of inter-process communication using named destinations. Therefore, with this new service manager, I was able to modify the partitioner to add support for the touch service. There is no need for a partitioner to support read, write, or probe, as the only notification to be sent to the partitioner is to inform it of a new disk.
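As a rough picture of the idea, here's a heavily simplified sketch of a registry of named services with per-service feature flags. It is not the Pedigree source: `MiniServiceManager`, `MiniService`, `Feature` and `DemoPartitioner` are made-up names, and the bodies are only my guess at the smallest thing that would demonstrate the concept.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Bit flags for the four kinds of operation a service may offer.
enum Feature : unsigned { Write = 1, Read = 2, Touch = 4, Probe = 8 };

// Hypothetical base class: a service handles whichever operations it advertises.
class MiniService
{
public:
    virtual ~MiniService() {}
    virtual bool serve(Feature op, void *data, std::size_t len) = 0;
};

// Hypothetical registry mapping service names to services and their features.
class MiniServiceManager
{
private:
    struct Entry
    {
        MiniService *service;
        unsigned features;
    };

public:
    void addService(const std::string &name, MiniService *service, unsigned features)
    {
        assert(service != nullptr);
        m_Services[name] = Entry{service, features};
    }

    // Feature mask for a named service, or 0 if nothing is registered.
    unsigned features(const std::string &name) const
    {
        auto it = m_Services.find(name);
        return it == m_Services.end() ? 0u : it->second.features;
    }

    // The service itself, or nullptr if nothing is registered under that name.
    MiniService *get(const std::string &name) const
    {
        auto it = m_Services.find(name);
        return it == m_Services.end() ? nullptr : it->second.service;
    }

private:
    std::map<std::string, Entry> m_Services;
};

// A partitioner-like service that only wants to be told a disk exists.
class DemoPartitioner : public MiniService
{
public:
    int disksSeen = 0;

    bool serve(Feature op, void *data, std::size_t) override
    {
        if (op != Touch || data == nullptr)
            return false;
        ++disksSeen;
        return true;
    }
};
```

A caller only ever needs the name "partition" and the feature it wants; it never sees the partitioner's own types, so the partitioner can change (or be swapped out) without the caller noticing.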

With a quick modification to the loopback disk code, I was able to inform the partitioner of the presence of the new disk with error handling and no direct partitioner-specific functionality used:

// Chat to the partition service and let it pick up that we're around now
ServiceFeatures *pFeatures = ServiceManager::instance().enumerateOperations(String("partition"));
Service         *pService  = ServiceManager::instance().getService(String("partition"));
NOTICE("Asking if the partition provider supports touch");
// Guard against a null feature list in case no partition service is registered
if(pFeatures && pFeatures->provides(ServiceFeatures::touch))
{
    NOTICE("It does, attempting to inform the partitioner of our presence...");
    if(pService)
    {
        // Hand the partitioner a pointer to this disk object
        if(pService->serve(ServiceFeatures::touch, reinterpret_cast<void*>(this), sizeof(FileDisk)))
            NOTICE("Successful.");
        else
            ERROR("Failed.");
    }
    else
        ERROR("FileDisk: Couldn't tell the partition service about the new disk presence");
}
else
    ERROR("FileDisk: Partition service doesn't appear to support touch");



This feature has already been added to other areas of the kernel (mainly talking to the partitioner), but has the potential to even be expanded to call applications that the user runs. This means it would be theoretically possible to replace the partitioner at runtime, or replace a component of the network stack to provide a different level of service. That means that Pedigree can be modular and flexible, even though it uses the conventionally rigid “monolithic kernel” design. Now that’s something to write home about!

NOTE: Blogger simply will not let that code sample work without wrapping it (it looks right in the preview and text editor). You should be able to get the idea that I'm trying to convey though.

Saturday, June 13, 2009

Pedigree Progress

When I last posted I chatted a bit about CDs and the Pedigree installer. Since then, I have successfully written a driver for the CD drive, and that means Pedigree now installs off a CD without a hitch. Naturally, there are still bugs and minor things that need fixing (such as creating a default user!), but it's at least mostly complete.

Since then though, we've been refactoring the POSIX subsystem a fair bit. Rather than have a lot of POSIX-specific functionality at the kernel level, I've developed a Subsystem abstraction that sits between the kernel and the individual subsystems (POSIX, native, DOS, etc.). As a result of this new abstraction the POSIX subsystem has been significantly changed - it's now far neater, and a lot of sneaky bugs have been fixed.

The biggest test of Pedigree itself so far has been our brand new Apache port. Through attempting to run Apache (over and over again) I've found some pretty big bugs, as well as a fair few incorrectly implemented functions. At the moment I'm working on file locking in order to allow Apache to actually serve a document.

This file locking concept introduces some serious design choices. One of the biggest is whether to implement it generically, and have it available to the kernel, or implement it specifically for the POSIX subsystem. Both have their pros and cons.

Generic Kernel-Wide Interface
Pros:
  • Global - can be accessed and used anywhere in the kernel
  • Cleaner - no subsystem-specific code, #defines, or functions makes the final interface cleaner
  • Can wrap around already-existing objects
Cons:
  • Potentially very slow - locks on ranges of bytes need to be checked on every file operation, as the subsystem doesn't control the lock (subsystems would be able to decide better which file operations to lock)
  • May end up biased towards one subsystem rather than being truly generic

Subsystem-Specific Interface
Pros:
  • Keeps a complicated, and potentially very subsystem-specific, interface out of the kernel
  • Faster as it can use the subsystem's structures and functions when needed
Cons:
  • A real pain to implement - requires changing every file-based operation in the subsystem to use the proper locking mechanism, rather than just replacing the file object
  • Code duplication for locking in other subsystems

Two very plausible interfaces - each with equally valid pros and cons. However, the final decision is made based on the point "Code duplication for locking in other subsystems". Why should such a generic concept as file locking be duplicated for each subsystem?

There is a compromise to be made here though. Note that I have mentioned the generic interface has a serious problem: "Potentially very slow - locks on ranges of bytes need to be checked on every file operation". This means that on every read, every write, the range of bytes being affected must be checked against every lock that the file has active. In the case of multiple shared locks, this is an O(n) search.
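To show where that cost comes from, here's a small sketch of the check a byte-range scheme would need on each read or write. The names (`LockedRange`, `accessAllowed`) and the exact rules (a shared lock held by someone else blocks writes, an exclusive lock blocks everything) are my assumptions for illustration, not Pedigree code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record for one locked byte range [start, start + length).
struct LockedRange
{
    std::size_t start;
    std::size_t length;
    int owner;       // process holding the lock
    bool exclusive;  // false = shared (read) lock
};

// Returns true if `owner` may access [start, start + length).
// Every active lock on the file is examined, so the cost is O(n) per I/O call.
bool accessAllowed(const std::vector<LockedRange> &locks, int owner,
                   std::size_t start, std::size_t length, bool isWrite)
{
    assert(owner >= 0);
    for (const LockedRange &r : locks)
    {
        if (r.owner == owner)
            continue;  // our own locks never block us
        bool overlap = start < r.start + r.length && r.start < start + length;
        if (!overlap)
            continue;
        if (r.exclusive || isWrite)
            return false;  // exclusive blocks all access; shared blocks writes
    }
    return true;
}
```

The loop is cheap for a handful of locks, but it runs on every single file operation, which is the part that bothers me.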

A simple solution exists. Although it is a definite deviation from the POSIX definition of a file lock, removing the ability to lock individual ranges of bytes and instead locking the entire file allows housekeeping in the kernel to be kept to a minimum - the lock can be kept as a simple Mutex rather than a list of locked ranges - and only slightly affects runtime speed.

I consider this an appropriate compromise to make, as I believe it is ineffective to lock small regions of files (and I expect that the majority of file locks in POSIX applications are made for the entire file). I also feel that implementing the ranges of bytes properly is merely an introduction of more POSIX functionality into the kernel, which is something we're trying to avoid.
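For comparison, a whole-file lock needs almost no state at all. The sketch below uses a `std::atomic<bool>` as a stand-in for the kernel's Mutex; `WholeFileLock` is a hypothetical name and this is only meant to show how little housekeeping is involved.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical whole-file lock: one flag instead of a list of byte ranges.
class WholeFileLock
{
public:
    WholeFileLock() : m_Held(false) {}

    // Try to take the lock; returns false immediately if it is already held.
    bool tryAcquire()
    {
        bool expected = false;
        return m_Held.compare_exchange_strong(expected, true);
    }

    // Give the lock back. Releasing a lock that isn't held is a caller bug.
    void release()
    {
        assert(m_Held.load());
        m_Held.store(false);
    }

    bool held() const { return m_Held.load(); }

private:
    std::atomic<bool> m_Held;
};
```

Checking the lock is a single constant-time test, no matter how many processes have the file open.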

EDIT:

The use of a generic interface means it's also possible to distinguish advisory and mandatory locking in the same object without requiring major changes. For instance, adding advisory locking with the generic interface to the POSIX subsystem was a 10 minute job, as it only involved editing the FileDescriptor object and (naturally) fcntl. In this implementation, mandatory locking is just an additional function call in file I/O functions, rather than multiple calls and checks per function.
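Here's a sketch of how one lock object could serve both kinds. It isn't the real FileDescriptor code: `LockableFile`, `LockKind` and `ioPermitted` are hypothetical names, and the "file" is just a string, but it shows mandatory locking as a single extra call in an I/O function.

```cpp
#include <cassert>
#include <string>

enum class LockKind { Advisory, Mandatory };

// Hypothetical file object carrying a single whole-file lock.
struct LockableFile
{
    std::string contents;
    bool locked = false;
    int owner = -1;
    LockKind kind = LockKind::Advisory;

    // fcntl-style request: succeeds if the file is unlocked or already ours.
    bool lock(int pid, LockKind k)
    {
        assert(pid >= 0);
        if (locked && owner != pid)
            return false;
        locked = true;
        owner = pid;
        kind = k;
        return true;
    }

    void unlock(int pid)
    {
        if (locked && owner == pid)
            locked = false;
    }

    // The one extra call a mandatory lock adds to each I/O function.
    bool ioPermitted(int pid) const
    {
        return !(locked && kind == LockKind::Mandatory && owner != pid);
    }

    bool write(int pid, const std::string &data)
    {
        if (!ioPermitted(pid))
            return false;
        contents += data;
        return true;
    }
};
```

An advisory lock only affects other callers of `lock`, so a process that never asks for the lock can still write; a mandatory lock makes `write` itself refuse.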

Monday, May 11, 2009

The Plot Thickens (and the trail lengthens)

I started work on the Pedigree installer scripts recently. I now have an extremely powerful Python + ncurses installer script (it copies files from A to B, with installer pages, destination selection, and MD5 verification of installed files), but I have realised that Pedigree isn't quite ready for it yet.

You see, for all of our testing we've been booting the kernel off a floppy disk with an initrd and using the hard disk to store all of our files (applications, config, etc). However, that approach doesn't quite work so well when we don't actually have a Pedigree-ready system state.

So I sat down and whipped up some ISO generation scripts, which means we now have Pedigree installer CDs. However, we don't have any method to read CDs in Pedigree yet. What started as a simple concept (a Python installer) has now turned into writing drivers for the CD drive and the on-disk filesystem format.

There's nothing wrong with that, of course, but it is slightly frustrating to not be able to jump straight into testing my installation scripts on Pedigree. If you're really interested, you can check out the progress of the installation side of Pedigree.

For all those excited about the concept of a Python + ncurses installer (highly customisable too :) ), you can see the most up-to-date scripts here. Keep in mind Pedigree has some unique requirements, and I haven't documented anything at the time of posting, so you're totally on your own if you do try them out.

Thursday, May 07, 2009

More Updates!

My 3C90X emulation for QEMU works - and by works I mean I tried multiple guest operating systems and they all connected to the internet correctly via my emulation.

You can patch your QEMU 0.10.3 source tree with the patches at http://lists.gnu.org/archive/html/qemu-devel/2009-05/msg00205.html to enable the emulation.

And in Pedigree-related news, I've started work on a basic installer system which uses Python to copy the files to the disk. This will let me create a quick and easy install method which is highly customisable.

The installer will be under Pedigree's license and should be portable enough to apply in numerous other environments.

Friday, May 01, 2009

Something Unusual...

A lot has happened since I last posted here. I've dropped Mattise and rejoined the Pedigree project (http://code.google.com/p/pedigree).

The Pedigree project has allowed me to do some pretty crazy things - writing all sorts of code including a fully functional TCP/IP implementation.

And that's what's brought me back here.

I've got a few 3Com 3C90x NICs lying around the place here, and I thought it'd be nice to write a driver for them so I could test out Pedigree on some of the boxes I have around the house. However, I struck a problem very quickly - no emulator for x86 has a 3Com card emulated!

A quick Google search showed me that PearPC had an emulation of one, but because it's a PPC emulator it wasn't useful to me (our PPC port is still a bit behind, unfortunately). PearPC's emulation is licensed under the GPLv2, as is QEMU's (our emulator of choice). So I thought it'd be fun to port it across.

The first hurdle was that PearPC was in C++ and QEMU in C, and their entire interface to the virtual state was different. After an hour or two I'd fixed that up. However, I ran into a slight problem - the emulation was incomplete. So for the past day and a half I've been reading the specification for the card and updating my emulation code, while also updating my own driver.

So I've been working on a driver for (and an emulation of) this NIC, which has been a huge challenge and all up, a lot of fun!

It's an amazing feeling when you successfully create and close your first TCP connection. You can imagine how much more incredible it is when you watch "lynx" (the text-based browser) running through your NIC driver, controlling your emulated NIC, do a full request and print the page.

I will be submitting the 3C90x emulation as soon as it's more complete, cleaned up a bit, and most importantly - tested!