SL5 LCFG Port Diary

This is our diary of the day to day progress we make in porting LCFG to Scientific Linux 5.0. We're mainly trying to document all the problems we encounter. The most recent entries appear nearest the top of the page.

ChrisCooke
PanagiotisKritikakos
KennyMacDonald
StephenQuinney

The initial project to port LCFG to SL5 is now completed so this page is closed. There is a new page for logging ongoing activities which should be used instead.

26 October 2007

  • (Kenny) Following Stephen's creation of the lcfg_sl5_kernel.rpms package list and updates of the headers to use it, I was easily able to replace this with a package list which installed the RHEL 5.1 Beta kernel kernel-2.6.18-53.el5.i686.rpm. I needed to remove policycoreutils and audit-libs-python, which depend on kernel-headers (no RPM for them). The good news is that the Dell Optiplex 745 I tried it on doesn't have the slow detect/timeout in the SATA driver at boot time! Waiting on getting the Dell Optiplex 755 back from Geosciences to try this kernel on it.

22 October 2007

  • (Kenny) I stumbled over RHEL 5.1 release notes here. In particular this has the list of new and updated RPMs at the end of the document. There are also more here that are also useful.

5 October 2007

  • (Kenny) I added perl-XML-Simple-2.14-4.fc6/noarch in to our mdp_sl5_min.rpms (and sl5_64 one too) as it was removed from lcfg_sl5_lcfg.rpms. Perhaps it should be added to installbase?
  • (Stephen) Fixed the installbase package lists so they include perl-XML-Simple to satisfy perl-Template-Toolkit
  • (Kenny) Our minimal build was stung by the kernel nightmare Stephen mentioned. Our "full" build was fine at reboot time. Added ndiswrapper-1.41-1.SL (dependency of kernel-module-ndiswrapper), openafs-1.4.4-42.SL5 (dependency of kernel-module-openafs), and needed the extra kernel flavours kernel-xen-2.6.18-8.1.3.el5 on i386 and x86_64 and kernel-PAE-2.6.18-8.1.3.el5 on i386.

3 October 2007

  • (Stephen) The updates package lists lcfg_sl5_updates.rpms and lcfg_sl5_64_updates.rpms now list all the available updates. I am not sure the packages are all listed in the correct order yet, though; Chris is going to check them. This has revealed a problem with the SL5 way of naming kernel module packages: the name changes with every updated kernel, which doesn't mix well with the LCFG '-/+/?' system. We will have to come up with a better way of handling kernel packages; this should be discussed at the next LCFG deployers meeting. Note that this might mean updaterpms will break on some machines if they use the current package lists. Hopefully this will be fixed before the next testing/stable release cycle next week.
  • (Stephen) I've added back into lcfg_sl5_64_base.rpms a number of kernel modules which were commented out as the package file names had been corrupted on initial mirroring.
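The kernel module naming problem can be made concrete with a small sketch. It assumes, purely for illustration, that a kernel module package name has the form kernel-module-<module>-<kernel version>; the helper name kmod_pkg_name is hypothetical and is not an LCFG or SL tool. The two kernel versions are the ones mentioned in the 5 October and 26 October entries.

```shell
# Sketch only. Assumption: kernel module package *names* follow the
# pattern kernel-module-<module>-<kernel version>.
# kmod_pkg_name is a hypothetical helper, not an LCFG tool.
kmod_pkg_name() {
    printf 'kernel-module-%s-%s\n' "$1" "$2"
}

# The same module under two kernels gives two different package names,
# so a package list entry keyed on the old name cannot follow the update.
old=$(kmod_pkg_name openafs 2.6.18-8.1.3.el5)
new=$(kmod_pkg_name openafs 2.6.18-53.el5)
[ "$old" != "$new" ] && echo "package name changed: $old -> $new"
```

If the assumption holds, each kernel update needs a new package list entry rather than a version bump on an existing one, which is why the problem shows up in the updates lists.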

28 September 2007

  • (Stephen) I've created lcfg-release-sl5 and lcfg-release-sl5_64, built and submitted them and added them to lcfg/defaults/profile.h
  • (Kenny) Successfully compiled profiles on an SL5 "minimal" server. As noted before, the lcfg-server RPM doesn't depend on any RPMs, but the server doesn't run without cpp installed. Perhaps this requirement should be added to the lcfg-server RPM. gcc then only required three other RPMs (see mdp_sl5_min.rpms for example).
  • (Kenny) Added a supported NIC into the Dell 755, disabled the onboard NIC and tried installing SL5 i386 from the LCFG install CD (choose sr0 at boot). The build seemed to progress well: the NIC was detected and configured as expected, the hard disc detected, partitioned and filesystems created, the installbase installed. But the resulting system couldn't boot with the same symptoms as I saw with the PIE image build. It can't find the hard disc, so can't mount root and kernel panics. The computer is up for grabs next week. Donald Grigor would like his hands on it too for his RHEL tests.
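Since the lcfg-server RPM does not declare its need for cpp, a quick pre-flight check on a minimal server might look like the sketch below. The helper name missing_cmds is hypothetical, and only cpp is a requirement actually noted in this diary; any other command names passed to it would be guesses.

```shell
# Sketch only; missing_cmds is a hypothetical helper name.
# Print each named command that cannot be found on the PATH.
missing_cmds() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Possible use before starting the server:
#   missing=$(missing_cmds cpp)
#   [ -z "$missing" ] || echo "install these first: $missing" >&2
```

Adding the dependency to the RPM itself, as suggested above, would make a check like this unnecessary.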

26 September 2007

  • (Chris) lcfg-server has now been built for both platforms and submitted to autolcfg.
  • (Chris) I've belatedly gone back to the project plan and marked stages 11 (installation technology) and 12 (copying MPU stuff to the dice level) as complete. I'm now working on stage 13, which is documenting the thing; and Panos is flying to Greece today, the lucky man. He has already written most of the release notes.
  • A draft of the release notes is now up at http://www.lcfg.org/doc/sl5.html
  • The defaults packages being used by the LCFG servers on FC5 have now also been submitted for SL5 (into autolcfg - so, not tested), so it should now be possible to at least attempt to run an LCFG server on SL5.

25 September 2007

  • (Kenny) I discovered that the lcfg-server RPM doesn't bring in any extra dependencies beyond those listed in lcfg_sl5_base.rpms but as expected the schema RPMs are missing from the SL5 repository. Is it much work to get these into autolcfg?

  • (Kenny) Brought up a mirror of the SL5 (i386) RPM repository from master.rpms.inf.ed.ac.uk and successfully built a client from it. Will point all our clients at it once we're happy it's updating every night - or probably every hour while development is rapid.

  • (Kenny) Got our pam_sessionshell.so PAM module working on i386 and x86_64. Rolled it out to all our MDP SL5 clients.

  • (Panos) - Built and submitted lcfg-pkgtools (includes lcfg-pkgtools-devel), lcfg-pkgstools-perl for sl5 and sl5_64 repositories.

  • As lcfg-pkgstools includes qxpack, lcfg-rpmcache is no longer needed. It has been removed from the packages/lcfg/lcfg_sl5_lcfg.rpms list and added to the packages/lcfg/lcfg_sl5_extras.rpms list. The packages/lcfg/lcfg_sl5_extras.rpms and packages/lcfg/lcfg_sl5_lcfg_installroot.rpms lists now include the latest version of lcfg-rpmcache.

  • SL5 entries were also added to the include/dice/options/qlogic_base.h header. The packages that are needed for this were also copied from the FC5 repository to the SL5 one (available only for i386 version).

21 September 2007

  • I've realised that we didn't properly copy all the inf level settings to the dice level, so I've been going through the dice level files. So far I've added SL5 settings to these files:

include/dice/options/environment.h
include/dice/options/afs-client.h
include/dice/options/bwnode.h
include/dice/options/compute-server.h
include/dice/options/console_server.h
include/dice/options/desktop.h
include/dice/options/external-access-server.h
include/dice/options/inventory.h
and I've created these files:

include/dice/os/sl5.h
include/dice/os/sl5_64.h
packages/dice/dice_sl5_i386_teaching_and_research.rpms
packages/dice/dice_sl5_rat_env.rpms
packages/dice/dice_sl5_i386_rat_env.rpms
packages/dice/dice_sl5_teaching_and_research.rpms
These files look as if they might need changing for SL5, but I haven't changed them yet, either because it looks like they belong to a unit other than the MPU, or because I've put them aside to be done later (e.g. if there's some package building and submitting to be done), or because I just don't understand what's needed:

include/dice/options/axnet.h
include/dice/options/authentication.h
include/dice/options/dns.h
include/dice/options/installbase.h
include/dice/options/ipfilter.h
include/dice/options/iptables.h
include/dice/options/jabber-client.h
include/dice/options/jabberd.h
include/dice/options/kerberos.h
include/dice/options/kernel.h
include/dice/options/lcfg-slave-server.h
include/dice/options/lcfgweb.h
include/dice/options/openldap.h
include/dice/options/openldap-server.h
include/dice/options/openvpn.h
include/dice/options/postgresql-server.h
include/dice/options/qlogic_base.h
include/dice/options/quotas.h
include/dice/options/rfe-server.h
include/dice/options/routing.h
include/dice/options/snmptrap.h
include/dice/options/test_fc6_updates.h
include/lcfg/options/test_fc6_updates.h
include/dice/options/test_updates.h
include/dice/options/triggers.h
include/dice/options/webmark-server.h

  • I've gone through the list above and this is what I reckon. These ones belong to the MPU:

include/dice/options/test_sl5_updates.h  (created)
include/dice/options/test_updates.h   (modified)
include/lcfg/options/test_sl5_updates.h   (created)
packages/lcfg/lcfg_sl5_64_testupdates.rpms   (created)
packages/lcfg/lcfg_sl5_testupdates.rpms   (created)
include/dice/options/qlogic_base.h

Whoever these belong to, we can autobuild some packages for them:

include/dice/options/axnet.h
include/dice/options/installbase.h

These belong to other units and we shouldn't touch them:

include/dice/options/dns.h
include/dice/options/ipfilter.h
include/dice/options/iptables.h
include/dice/options/jabber-client.h
include/dice/options/jabberd.h
include/dice/options/openldap.h
include/dice/options/openldap-server.h
include/dice/options/openvpn.h
include/dice/options/postgresql-server.h
include/dice/options/quotas.h
include/dice/options/routing.h
include/dice/options/snmptrap.h
include/dice/options/webmark-server.h

Maybe nothing needs to be done here just yet after all:

include/dice/options/kernel.h
include/dice/options/lcfg-slave-server.h
include/dice/options/lcfgweb.h

I don't know what to do here:

include/dice/options/authentication.h
include/dice/options/kerberos.h
include/dice/options/rfe-server.h
include/dice/options/triggers.h

  • I've had a look on http://www.qlogic.com for the sansurfer and scli packages for SL5 but it gives you a vast array of models to choose from before offering you the software. Not sure which to pick.

20 September 2007

  • Concerning the problem with the fuse group: the problem must be in the package version. On the MDP-installed machines, the package that controlled /etc/passwd and /etc/group was lcfg-sl5-defetc-0.99.4-1, while the Informatics machines had the latest version, lcfg-sl5-defetc-0.99.6.1, which includes the fuse group. I have now removed the earlier version from the RPM repository. Chris also noticed that the lcfg_sl5_lcfg_installroot.rpms list was using the general defetc package, lcfg-defetc-0.101.0-1, so I replaced that one with the SL5 one.
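A quick way to tell which machines are affected is to check whether the group file has a fuse entry at all. This is a minimal sketch; group_in_file is a hypothetical helper name, and the follow-up rpm query shown in the comment simply reports which defetc version is installed.

```shell
# Sketch only; group_in_file is a hypothetical helper name.
# Succeeds if group $1 has an entry in the group file $2
# (the file defaults to /etc/group).
group_in_file() {
    grep -q "^$1:" "${2:-/etc/group}"
}

# Possible use on a client:
#   if ! group_in_file fuse; then
#       echo "fuse group missing from /etc/group" >&2
#       rpm -q lcfg-sl5-defetc
#   fi
```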

19 September 2007

  • I removed piix from /etc/modprobe.conf but the delay and errors at boot time were still there. I also added the ahci module as Kenny suggested, with the same results. I used the hardware component:


!hardware.modlist                       mREMOVE(piix)
!hardware.modlist                       mADD(ahci)
!hardware.mod_ahci                      mSET(alias scsi_hostadapter ahci)

Maybe I am missing something, but the problem seems to be in the kernel source. The patch I found, http://thread.gmane.org/gmane.linux.ide/13408/focus=13439 (I am not sure whether it works), would need the kernel to be recompiled and a new RPM built. However, I can't see the file it needs, drivers/ata/libata-core.c, in the kernel source tree.

  • (Kenny) When the hardware component changes /etc/modprobe.conf for something that's needed in the initrd, the initrd doesn't seem to get updated. Not only does the initrd not get regenerated, but even when this is done manually (using /sbin/mkinitrd) the new initrd reflects the currently loaded kernel modules, so I suspect the hardware component can't change anything in the initrd. In this particular instance, it turns out that ahci.ko is loaded by the initrd anyway, as well as ata_piix.ko. To look at the initrd do the following ...


# cd /tmp
# mkdir initrd
# cd initrd
# zcat /boot/initrd.img | cpio -i
# less init

  • (Chris) I'm still on the trail of the non-working CD installs on serial consoles. I've made progress, largely thanks to Iain Rae who pointed me towards isolinux/syslinux, the program that does the booting-Linux-from-a-CD thing and is used by the installroot. The isolinux config file can be found on the CD in /media/CDROM/isolinux/isolinux.cfg and the boot message can be found in /media/CDROM/isolinux/boot.msg. If you consult the syslinux FAQ you'll see that the boot.msg file can contain "display options" which do things like clear the screen, set colours and display graphics. And if you do a web search on "isolinux serial console" you'll find advice which tells you that the "serial" isolinux option is incompatible with Dell's "Console Redirection" BIOS setting. Sure enough, if I go back to the SC1425, set Console Redirection to "off", and try booting from an install CD, the boot message comes up perfectly well on the serial console (in white on black - no colours) and the install works. I also tried an fc6_64 install CD and an sl5_64 install CD; the boot messages on them came up perfectly well too, and the isolinux configurations were the same on all. So, this is a pre-existing problem, and it's not a killer to work round. SL5 is performing as well as other OSes in this respect. Surely there must be a way of making our isolinux and pxelinux installroots work properly with the same console redirection setting though?

  • (Kenny) Noticed an error (warning?) message at boot time ... fuse: no such group (from memory). It seems that an RPM (probably one of the fuse ones funnily enough) is adding the group (gid=103) in its postinstall script, but the LCFG auth component doesn't have it listed.
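Kenny's initrd inspection in this entry can be taken one step further by listing just the driver modules an initrd contains, which is how you would confirm that ahci.ko and ata_piix.ko are both present. This is a sketch: it assumes the initrd is a gzip-compressed cpio archive, as the zcat | cpio command above implies, and both helper names are hypothetical.

```shell
# Sketch only; initrd_modules and list_initrd_modules are hypothetical
# helper names.

# Read a file listing (one path per line) on stdin and print the names
# of the kernel module files in it, without their directories.
initrd_modules() {
    grep '\.ko$' | sed 's#.*/##' | sort -u
}

# List the modules in a gzip-compressed cpio initrd image.
# Assumption: the image path is passed as $1, e.g. /boot/initrd.img.
list_initrd_modules() {
    zcat "$1" | cpio -it 2>/dev/null | initrd_modules
}
```

Running list_initrd_modules before and after /sbin/mkinitrd would show whether a change made through the hardware component has actually reached the initrd.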

18 September 2007

  • (Chris) Sadly today is largely a Solaris day for me, so there hasn't been time for much SL5 work. I've been wondering why the orange LCFG blurb that appears when you boot from an LCFG install CD doesn't appear when you use a serial console. It does on other OSes, but not on SL5. So far I've tracked the blurb down to the lcfg-pxelinux package but so far I can't even see why it's orange in the first place...

  • eucs-sslcerts was tested and seems to work fine. It has been submitted to the ed layer of the 64 repository.

14 September 2007

  • Concerning the welcome message on the LCFG CD: it didn't display the appropriate message because I made the image with /usr/sbin/buildinstallroot -f -p installroot-sl5-develop -o lcfginstall-sl5-develop.img rather than /usr/sbin/buildinstallroot --force --profile sl5-develop -o lcfginstall-sl5-develop.img
  • I've lost track of what hardware tests we've done, what we haven't done, what's OK and what's broken; so I've put them all in a table at SL5InstallTests. Add results there as you get them.
  • I've done lots more install tests on servers.
  • Tests on the 2650, 850 and SC1425 have been completed.
  • All installs with a directly connected console are fine.
  • Serial PXE installs can be persuaded to work after a fashion if the Console Redirection settings in the BIOS are set carefully to:
Console Redirection       Serial Port 1
Failsafe Baud Rate        115200
Remote Terminal Type      VT100/VT220
Redirection After Boot    Enabled

This is worth noting as the serial console settings used to have to be almost entirely the opposite of this to work properly. The fly in the ointment here is that the serial console freezes during the serial PXE install a couple of times: once for a few seconds during the initial boot and once for a longer period during the reboot into the installbase. The second freeze ends when the Linux kernel starts up.
  • Serial CD installs do not work: the orange LCFG welcome screen is not visible on the serial console. Instead the serial console goes entirely blank at this point.
  • Still to be tested: all Dell 860 serial console installs, and retry SC1425 direct CD.
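The 19 September entry traces the blank serial console on CD installs to the isolinux "serial" option clashing with the BIOS Console Redirection setting. When testing a new CD it can help to know up front whether its isolinux configuration uses that option. This is a minimal sketch; has_serial_directive is a hypothetical helper name, and it only looks for a line starting with the syslinux serial directive.

```shell
# Sketch only; has_serial_directive is a hypothetical helper name.
# Succeeds if the syslinux/isolinux config file $1 contains a "serial"
# directive (matched case-insensitively at the start of a line).
has_serial_directive() {
    grep -qi '^[[:space:]]*serial[[:space:]]' "$1"
}

# Possible use with a mounted install CD:
#   has_serial_directive /media/CDROM/isolinux/isolinux.cfg &&
#       echo "serial option in use: check Console Redirection in the BIOS"
```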

13 September 2007

  • Dell Poweredge 2650 with SL5, without serialconsole.h, starts to install OK from PXE but then mysteriously goes silent on rebooting into the installbase. Eventually figure out that this is because I haven't disabled console redirection in the BIOS. Sigh.
  • Dell Poweredge 2650 with SL5, without serialconsole.h, with console redirection disabled in the BIOS, installs properly from PXE.
  • Dell Poweredge 2650 with SL5, without serialconsole.h, with console redirection disabled in the BIOS, does not install from CD. It boots from the CD, the orange LCFG install screen comes up, but the prompt is immediately followed by six lines of nonsense characters. Any input on the keyboard, valid or invalid, produces the response could not find kernel image (or similar) followed by another six lines of nonsense characters. The same CD was used successfully for the 850 yesterday. I tried it again on the 850 straight after its failure on the 2650 and it again worked perfectly well on the 850, so the CD has not suddenly deteriorated overnight.
  • Also, the welcome message on the CD declares the release to be -stable when it should say sl5-develop.
  • Dell Poweredge 860, SL5, no serialconsole.h or console redirection, installs fine with PXE.
  • Dell Poweredge 860, SL5, no serialconsole.h or console redirection, installs fine with CD.
  • Dell Poweredge SC1425, SL5, no serialconsole.h or console redirection, fails to install with CD. Media errors. This is the same CD that worked well with the 860 but failed on the 2650.
  • Dell Poweredge SC1425, SL5, no serialconsole.h or console redirection, installs fine with PXE.
  • Dell Poweredge 860, SL5_64, no serialconsole.h or console redirection, installs fine with PXE.

  • SL5 was installed as Virtual Machine hosted on VMWare ESX Server on a Dell PowerEdge 2850.

  • Alastair has just tested LCFG SL5 under vmplayer and it works fine (dunssl5.inf.ed.ac.uk - but not always up!)

  • Kenny tested the rest of the SelectPC models with SL5 built using PIE - i386 on RM Accelerator (RM'02), HP d530 (HP'04) and HP dc7100 (HP'05). x86_64 on Dell Optiplex GX620 (Dell'06) and Dell Optiplex 745 (Dell'07).

  • Kenny now has access to the new Dell Optiplex 755 for the next month. The NIC isn't recognised by the SL5 kernel, but DST have PIE running on it and will work on getting the SL5 kernel booting on it, most likely using Intel's e1000 module.

  • The slow boot on the Dell 745 might be fixed by loading the ahci module instead of the ata_piix module. I'll try setting this via the hardware component tomorrow. - Kenny.

12 September 2007

  • Dell Optiplex 745 was built with SL5 x86_64 successfully. Both PXE and CD worked.
  • Dell Optiplex 745 was built with SL5 i386 successfully. Both PXE and CD worked.

  • Dell Poweredge 850 with SL5 and lcfg/options/serialconsole.h failed to build from CD - output stopped as soon as the kernel found the serial port. Thinking about it later though I'm no longer sure that I typed the right option to get a serial install. I'll have to retry this one.
  • Dell Poweredge 850 with SL5 and lcfg/options/serialconsole.h failed to build from PXE - output stopped as soon as grub started. Thinking about it later though I'm no longer sure that I typed the right option to get a serial install. I'll have to retry this one.
  • Dell Poweredge 850 with SL5 but without serialconsole.h failed to build from CD. I suspect this is my cursed CD drive again as there were hundreds of media errors, buffer I/O errors and the like. I'll not try making any CDs with that drive again. I'll retry this one with another CD.
  • Dell Poweredge 850 with SL5 but without serialconsole.h builds successfully with both PXE and CD.
  • Dell Poweredge 2650 with SL5 and PXE does not build over a serial console. It doesn't even get to the installroot. Is it even meant to work? I can't remember. Edit: Yes it is. It works for FC6 installs.
  • Built SL5 i386 on the following SelectPCs using PIE - RM Innovator (RM'03), Dell Optiplex GX620 (Dell'06), Dell Optiplex 745 (Dell'07) and have sourced an HP d530 (HP'04) and HP dc7100 (HP'05) to test. - Kenny.

  • Found a horrible bug in the X server (I presume) on Dell Optiplex 745. It autodetected the monitor such that it set the resolution to 800x648 and had the cursor hot point offset from the displayed mouse pointer. Adding one of the lcfg/options/monitor_*.h options to the profile fixed this. Looks like autodetection is failing. - Kenny

11 September 2007

  • We seem to have finished stage 12 of the project in a few hours, rather than the forecast 3 (or 9!) days. Are we missing something? Or did we just have to make far fewer changes to header files than usual, since SL5 has proved to be so similar to FC6 from an LCFG point of view? Still, there's plenty of work left to do for stage 11.

  • I gave Kenny an ISO install CD image for 32 bit SL5 LCFG a day or two ago, and he's had some success combining it with PIE. I've now given him a 64 bit image too. Later today he'll announce a "tech preview" PIE MDP build on SL5, on condition that people hassle him about it rather than us :-)

  • We're now testing PXE installs for sl5_64. We're keeping our fingers crossed...
  • The first attempt failed: it couldn't find the kernel. I'd forgotten to run updaterpms on the PXE server solti.
  • The second attempt failed: it couldn't mount the filesystem from /export/linux/installroot on 129.215.202.127. I'd forgotten to make a symlink in /export/linux/installroot on roc from sl5_64 to sl4_64-develop. I've now made the symlink. The symlink isn't mentioned in the instructions but when I did the PXE support for 32 bit I seem to have remembered to make it anyway. This time I was going too quickly to notice.
  • The third attempt appears to be succeeding! It's got as far as handing over to the installbase, so we reckon that we have a working sl5_64 PXE install system!

  • panos is back to normal :-)

  • Kenny reported that the SL5 install we gave him hung on his Dell Optiplex 745 during boot with ata errors. This is the same thing that happens on fetlar (a Dell Optiplex GX270) during booting too. Typically booting stops for about 90 seconds and says things like this:

ata1: port failed to respond (30 secs)
ata1: SRST failed (status 0xFF)
ata1: SRST failed (err_mask=0x100)
ata1: softreset failed, retrying in 5 secs
ata1: SRST failed (status 0xFF)
ata1: SRST failed (err_mask=0x100)
ata1: softreset failed, retrying in 5 secs
ata1: SRST failed (status 0xFF)
ata1: SRST failed (err_mask=0x100)
ata1: reset failed, giving up
ata2: port failed to respond (30 secs)
ata2: SRST failed (status 0xFF)
ata2: SRST failed (err_mask=0x100)
ata2: softreset failed, retrying in 5 secs
ata2: SRST failed (status 0xFF)
ata2: SRST failed (err_mask=0x100)
ata2: softreset failed, retrying in 5 secs
ata2: SRST failed (status 0xFF)
ata2: SRST failed (err_mask=0x100)
ata2: reset failed, giving up

Then booting starts once more. This is a known problem (as is actually announced if you turn off the grub quiet option) and as Kenny says it's good enough for a proof-of-concept service. However it'll be annoying for users, so we've been trying to see if there was anything we could do about it.
!hardware.modlist   mADD(piixnoprobe)
hardware.mod_piixnoprobe options ata_piix noprobe

but the next two reboots were sadly just like all the others: the "ata" delays happened as usual. I tried enabling SATA in the bios too but this made the machine unable to boot ("Strike F1 key to continue, F2 to enter setup").
  • Panos's next comment: I've found a patch but it seems everything has to be done manually: http://thread.gmane.org/gmane.linux.ide/13408/focus=13439 I don't know if we should spend too much time on that, given that the system works and that, as far as we know, it's a kernel bug. Maybe we should use another kernel? A newer one? Or wait until a new one is officially supported by the SL people? I agree completely - I don't know what to do next about this either. Can anyone offer guidance here?
  • In other news, Panos has made a start on stage 13, writing release notes. I've made a list of machines on which we can test the PXE and CD installs. Of the MPU's test servers, the Dell Optiplex 650, 750 and 1750 are connected to a SAN using fibrechannel and I've been asked not to touch them (don't they trust me? I've only ever accidentally destroyed one SAN partition in an LCFG install :-) ); the 1750 is currently doing another job; leaving just the 850 (figgy), the 860 (prague) and the 2650 (tummy). Hopefully I'll get the bulk of those tests done tomorrow.
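When comparing kernels or module settings for the ata delay described in this entry, it helps to have a quick, repeatable way of seeing which ports gave up during boot. The sketch below filters saved boot messages in the format quoted above; ata_giveups is a hypothetical helper name.

```shell
# Sketch only; ata_giveups is a hypothetical helper name.
# Read kernel boot messages on stdin and print each ATA port that
# logged "reset failed, giving up", one port per line.
ata_giveups() {
    sed -n 's/.*\(ata[0-9][0-9]*\): reset failed, giving up.*/\1/p' | sort -u
}

# Possible use on a running machine:
#   dmesg | ata_giveups
```

An empty result after a reboot would suggest that a change (a different kernel, or a different module) has removed the delay.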

10 September 2007

  • I have looked through the inf level headers to find those that have been modified for SL5 and will need to be ported to the dice level. These are:

core/include/inf/defaults.h
core/include/inf/options/afs-client.h
core/include/inf/options/amd.h
core/include/inf/options/environment.h
core/include/inf/options/kerberos-client.h
core/include/inf/options/mailcap.h
core/include/inf/options/openssh.h
core/include/inf/options/packages.h

  • The only MPU-owned component from the above is amd. The amd resources amd.gvariables were already set in dice/options/fileysystem.h. I added the SL5 repositories to dice/options/packages.h. I have written down the changes we made to non-MPU components at the inf level; they can be found here: http://www2.epcc.ed.ac.uk/~pkritika/sl5/changes_on_inf_levell.odt

  • An empty list, packages/dice/dice_sl5_env.rpms, was also created.

  • I didn't get any word back from George, so I hoped that this meant that he didn't radically object to the idea of putting some pxe settings in the local pxe server's config just to test them. I tested the sl5 PXE booting by putting the extra sl5 resources from ed/options/pxe_server.h into solti. After a bit of poking around in our DNS map I guessed that solti might be the local PXE server for the wire our test machines are on. I thought it was worth a go, anyway. So I added the settings, I ran updaterpms on solti, I rebooted fetlar and PXE-booted it. Success!! Entries for "sl5" and "sl5serial" were on the PXE boot list. I chose "sl5" and the install succeeded with no errors - so fetlar has now been installed using an SL5 PXE installroot. The next thing to do is to get the 64 bit PXE installs also working. Meanwhile Panos is getting started on stage 12 of the project.
  • I asked the weekly MPU meeting about testing PXE and they reckoned that it should be tested on a wide variety of servers - so I'm going to test it on each of the MPU's test servers, except for the fibrechannel ones.

07 September 2007

By special request, here's how we changed the SL5 filesystem and SysVinit RPMs to make them work properly as part of the SL5 installroot. These were the same changes that were made to the equivalent FC6 packages for the FC6 installroot.

Changes to filesystem
Remove media and tmp from the long list of arguments to the mkdir -p command; add ln -snf /var/tmp tmp; add ln -snf /tmp media; replace the %files %attr(1777,root,root) /tmp line with just /tmp
Changes to SysVinit
Add Patch999: sysvinit-lcfginstall.patch consisting of:
diff -ru sysvinit-2.86.orig/src/sulogin.c sysvinit-2.86/src/sulogin.c
--- sysvinit-2.86.orig/src/sulogin.c    2007-02-26 15:49:34.000000000 +0000
+++ sysvinit-2.86/src/sulogin.c 2007-02-26 15:51:40.000000000 +0000
@@ -453,6 +453,10 @@
                sleep(2);
        }
 
+#define LCFG_INSTALLROOT
+#ifdef LCFG_INSTALLROOT
+       sushell(pwd);
+#else
        /*
         *      Ask for the password.
         */
@@ -468,6 +472,7 @@
                        sushell(pwd);
                printf("Login incorrect.\n");
        }
+#endif
 
        /*
         *      User pressed Control-D.

  • A new release was made for lcfg-buildinstallroot and lcfg-skeleton. Both new RPMs have been submitted to the repository, and both fetlar and panos have been updated through updaterpms. lcfg-install and lcfg-installcfg don't require any changes so we leave them as they are.

  • A default SL5 system passes more options to the kernel:

[root@esx-sl5 ~]# grep -v ^# /etc/sysctl.conf | sed '/^$/d'
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 4294967295
kernel.shmall = 268435456

However, the options kernel.sysrq, kernel.core_uses_pid and net.ipv4.tcp_syncookies are also included by default in the kernel options on an FC6 system, but the kernel component isn't adding them. So I added the last four, which are a bit more important, in lcfg/defaults/kernel.h:

#ifdef LINUX_SL5

!kernel.set                     mADD(msgmnb)
kernel.tag_msgmnb               kernel.msgmnb
kernel.value_msgmnb             65536

!kernel.set                     mADD(msgmax)
kernel.tag_msgmax               kernel.msgmax
kernel.value_msgmax             65536

!kernel.set                     mADD(shmmax)
kernel.tag_shmmax               kernel.shmmax
kernel.value_shmmax             4294967295

!kernel.set                     mADD(shmall)
kernel.tag_shmall               kernel.shmall
kernel.value_shmall             268435456

#endif

Stephen suggested declaring the new tags individually in order to avoid possible mutation conflicts.

  • Stephen also suggested creating a new component for SL5 to handle the /etc/passwd and /etc/group files. This is lcfg-defetc-sl5, which is in the CVS repository. After a few tries, and breaking panos, I managed to get the right files. I have rebuilt panos using the lcfg-defetc-sl5 component and it works fine.
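The kernel.h stanzas earlier in this entry follow a regular pattern, so they could be generated from sysctl.conf lines rather than typed by hand. The following is only a sketch of that idea: sysctl_to_lcfg is a hypothetical helper name, and it assumes the tag is the last dot-separated part of the sysctl key, as in the hand-written entries (kernel.msgmnb becomes msgmnb). Two keys with the same last part would collide, so the output would still need checking.

```shell
# Sketch only; sysctl_to_lcfg is a hypothetical helper name.
# Assumption: the tag is the last dot-separated part of the sysctl key,
# as in the hand-written entries (kernel.msgmnb -> msgmnb).
# Reads "key = value" lines on stdin and prints one stanza per key.
sysctl_to_lcfg() {
    awk -F'[ \t]*=[ \t]*' '
        /^[ \t]*#/ { next }        # skip comment lines
        NF != 2    { next }        # skip blank or malformed lines
        {
            n = split($1, part, ".")
            tag = part[n]
            printf "%-32s%s\n", "!kernel.set", "mADD(" tag ")"
            printf "%-32s%s\n", "kernel.tag_" tag, $1
            printf "%-32s%s\n\n", "kernel.value_" tag, $2
        }'
}

# Possible use:
#   grep '^kernel\.\(msg\|shm\)' /etc/sysctl.conf | sysctl_to_lcfg
```

Each key gets its own mADD line, which matches Stephen's suggestion of declaring the tags individually.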


I've been following the Managing the PXE install process instructions, hoping to make a PXE installroot for SL5.

  • I created the "root" from the sl5-develop profile and put it on roc as described.
  • I've added the standard SL5 versions of the filesystem and setup packages.
  • I've rebuilt busybox and pci_scan packages from the FC6 source RPMs.
  • I've tried creating a kernel-pxe package using nsu -c "/usr/sbin/lcfg-mkpxeboot --os sl5 --verbose". However this doesn't seem to produce the right results - the script thinks that I have specified an OS called "1" rather than "sl5":
-bash-3.1$ nsu -c "/usr/sbin/lcfg-mkpxeboot --os sl5 --verbose"
Fetching profile installroot-pxe-1 from http://lcfghost/profiles
** failed to fetch profile: http://lcfghost/profiles/inf.ed.ac.uk/installroot-pxe-1/XML/profile.xml
**   Not Found
lcfg-mkpxeboot failed : failed to fetch profile installroot-pxe-1 from http://lcfghost/profiles
-bash-3.1$

  • I see that fetlar had a file called /etc/LCFG-RELEASE but it was empty. lcfg-mkpxeboot was looking at this file to find out the OS. I put the string sl5 in the file and ran the script again. Still failed. I ran it again without the --os option (nsu -c "/usr/sbin/lcfg-mkpxeboot --verbose") and it succeeded this time. I've submitted the resulting file to the sl5/ed, fc5/ed and fc6/ed repositories and I've added some suitable entries to ed/options/pxe_server.h.
  • However it looks as if the pxe servers are all on a stable release, so we won't be able to test any of this stuff until next Friday! Unless we cheat. This being a Friday afternoon, I'm not going to start mucking about with the configuration of the main network infrastructure servers!
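Since lcfg-mkpxeboot was looking at /etc/LCFG-RELEASE to find the OS, an empty file there is worth catching before running it. This is a minimal pre-flight sketch, not part of lcfg-mkpxeboot: lcfg_release is a hypothetical helper name, and reading only the first line of the file is an assumption.

```shell
# Sketch only; lcfg_release is a hypothetical helper name.
# Print the first line of the release file $1 (default /etc/LCFG-RELEASE).
# Fail with a message if the file is missing or empty.
lcfg_release() {
    file=${1:-/etc/LCFG-RELEASE}
    [ -s "$file" ] || { echo "$file is missing or empty" >&2; return 1; }
    head -n 1 "$file"
}

# Possible use:
#   os=$(lcfg_release) && echo "release file says: $os"
```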

06 September 2007

We're not yet having the same success with the 32 bit SL5 installroot :-( Problems, problems. Panos has made a 32 bit installroot CD image containing a suitable kernel and updated software, as he did for 64 bit, but I'm not having much luck with it so far. The CD doesn't even start to boot. If I enable PXE on fetlar and select a CD boot then the PXE boot starts up immediately; if I disable PXE then try again, I get Strike F1 to retry boot. I'm not confident that the CD writer I'm using is making good CDs so I've asked a colleague to write a CD for me, but neither is working. I also had a malfunctioning CD drive on fetlar. The drive's been replaced by another which seems to be working better, but I can't rule out the possibility that something's wrong with that too. Panos is going to make a CD from the same image and try it on his machine.

S U C C E S S ! ! !

The 32 bit installroot worked too. I'm typing this entry from a newly reinstalled (with an SL5 installroot CD) fetlar. All the CDs I make seem to be cursed in some way, but Panos's CD worked perfectly.

Next up:

  • PXE installs
  • update the stage 11 components to mark their SL5 support

05 September 2007

  • Still going through Stephen's list of tips. This is the filesystem package tip.
  • I've removed the fc6-derived filesystem-2.4.0-1.lcfg.2.i386.rpm from the repository (and .filesystem-2.4.0-1.lcfg.2.i386.rpm too)
  • and from the lcfg_sl5_lcfg_installroot.rpms package list.
  • I've compared the differences between the native fc6 filesystem source RPM and the one in use in lcfg_fc6_lcfg_installroot.rpms
  • obtained an SL5 filesystem source RPM
  • applied the same changes to that, to produce an RPM I called filesystem-2.4.0-1.lcfg.1.i386.rpm
  • submitted that to the SL5 lcfg repository
  • and changed lcfg_sl5_lcfg_installroot.rpms to use filesystem-2.4.0-1.lcfg.1.i386.rpm (the new SL5-derived one) rather than filesystem-2.4.0-1.lcfg.2.i386.rpm (the old FC6-derived one).
  • It looks as if the same may be required for the SysVinit package.
  • I've now obtained a SysVinit source RPM for FC6 and compared the contents to our lcfg-hacked FC6 SysVinit source RPM.
  • I then obtained an SL5 SysVinit source RPM and made similar changes there.
  • I built and submitted that on SL5 as SysVinit-2.86-14.lcfg.2.
  • I've changed lcfg_sl5_lcfg_installroot.rpms to use this new version.
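The compare-and-reapply procedure used above for filesystem and SysVinit can be sketched roughly as follows. This is only an outline under assumptions: both source RPMs have already been unpacked into separate directories (for example with rpm2cpio and cpio), and compare_srpm_trees is a made-up name, not an existing LCFG tool:

```shell
#!/bin/sh
# compare_srpm_trees: hypothetical helper for illustration only.
# Takes two directories, each holding the unpacked contents of a source RPM
# (for example from: rpm2cpio foo.src.rpm | cpio -id), and prints a unified
# diff of whatever differs between them.
compare_srpm_trees() {
    native_dir=$1
    lcfg_dir=$2
    diff -ruN "$native_dir" "$lcfg_dir"
    status=$?
    # diff exits 0 for "identical" and 1 for "differences found"; both are
    # fine here. Anything higher means diff itself ran into trouble.
    [ "$status" -le 1 ]
}
```

The spec file and any patches show up as readable diffs; tarballs are only reported as differing binary files.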

  • The same has been done for the 64 bit version. I'm trying to create a new image to burn to a CD, but one package is still missing: [WARNING] /usr/sbin/updaterpms: couldn't find RPM header file for lcfg-installcfg-0.99.12-1/noarch
  • I fetched a copy of lcfg-installcfg from the CVS repository. I can see some interesting files related to the dhclient problem:

-bash-3.1$ ls -l | grep dhclient
-rw-r--r-- 1 pkritika people  297 Aug  5  2003 dhclient.conf.cin
-rwxr-xr-x 1 pkritika people 1003 Aug  5  2003 dhclient-lcfg-script.cin

  • I have made a new image and burned it to a CD. dhclient now works fine, but lcfg-fstab doesn't. I noticed that the image was using version 1.1.33, whereas the version that supports SL5 is 1.1.34.
  • I changed the lcfg-fstab package version in lcfg_sl5_lcfg_installroot.rpms.
  • lcfg-grub needed to be changed as well. From 1.6.3-1 to 1.6.4-1.
  • lcfg-defetc as well. The package that was used by installroot was lcfg-defetc-fc6-1.0.1-1/noarch but we are using lcfg-defetc-0.101.0-1/noarch.
  • lcfg-hackparts was the last that needed to be changed. From version 0.100.23-1 to 0.100.24-1.
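Most of the fixes in this group came down to the installroot package list naming an older version than the one built for SL5. Here is a small sketch for looking up what a list currently specifies, assuming one name-version-release/arch entry per line as in the entries quoted above. The function name list_version is illustrative only:

```shell
#!/bin/sh
# list_version: hypothetical helper. Prints the version-release that a
# package list specifies for the named package, e.g. "0.99.12-1".
list_version() {
    list_file=$1
    package=$2
    # Require a digit straight after "<package>-" so that a lookup for
    # lcfg-defetc does not also match lcfg-defetc-fc6, then keep everything
    # up to the "/arch" suffix.
    sed -n "s|^${package}-\([0-9][^/]*\).*|\1|p" "$list_file"
}
```

For example, list_version lcfg_sl5_lcfg_installroot.rpms lcfg-fstab prints the version the installroot would try to install, which can then be compared by eye with the version known to support SL5.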

  • SUCCESS! installroot is now working! The machine managed to boot from CD and install the packages from the installbase list. Now it's installing all of its packages :-)

  • panos has finished installation and works fine :-)

04 September 2007

  • Stephen just popped in and we had a chat about the install stuff. He gave us some good tips on what to do.
  • Look carefully at buildinstallroot. It may have sections in it for each platform ("if fc3...", "if fc5..."), and there won't be a section for SL5 there yet.
  • Should we even bother to make an installroot running SL5, if the FC6 one works so well? Yes we should, it's a good idea. The FC6 one may work at the moment, but if one of the platforms changes a key bit of software - such as the version of Berkeley DB it uses - we may find that our installroot is creating files on the machine's hard disk which the machine's software cannot use. So we should develop an SL5 installroot.
  • However we'll need a kernel that's different to the standard SL5 one - see below. There's no reason we couldn't just use the FC6 installroot kernel for the SL5 installroot. The rest of the installroot will be running SL5 - it doesn't matter so much where the kernel comes from as long as it does the job OK.
  • In the FC6 installroot we don't use our normal FC6 kernel, because it doesn't include support for booting from SCSI CDs, from IDE CDs or from NFS/PXE roots. We add all those things to our installroot kernel. The latest installroot kernel is /pkgs/master/rpms/fc6/lcfg/kernel-2.6.20-1.2933_FC6_lcfg_1.1.i686.rpm. There's a matching source RPM in /pkgs/master/srpms/.
  • The fc6 PXE installroot kernel is in an RPM called kernel-pxe-install-fc6.
  • Look carefully at the filesystem RPM. We replaced the SL5 one with an FC6 one to get the installroot to build. The FC6 one used in the installroot is a modified version of the standard FC6 one. It had to be changed to solve some conflicts. Instead of using this FC6 one, we would be better comparing our FC6 filesystem RPM with the standard FC6 filesystem RPM and noting the changes, then making similar changes to the SL5 filesystem RPM to make an SL5 filesystem RPM suitable for the installroot.
  • The details of the PXE stuff are at: https://wiki.inf.ed.ac.uk/DICE/MPUPxeRoot which is linked from the MPU wiki "internal procedures" page: https://wiki.inf.ed.ac.uk/DICE/MPUInternalProcedures
  • I've taken a look at buildinstallroot and I can't find any build-time OS tests, or any mention of strings like "fc6" or "fedora" or "redhat" anywhere except in the ChangeLog or config.mk files.
  • Panos has already put the FC6 installroot kernel into the SL5 installroot and found that it understands how to boot from a CD.

03 September 2007

  • Creating a testing release image today worked fine, as expected, for both the i386 and x86_64 versions.
  • After a few hours I got something more. In lcfg/options/installroot.h, the entry for FC6 adds an LCFG version of the Linux kernel. I thought that maybe that was the problem. I added the same entry for SL5. I got the source RPM and made a new RPM on panos. I submitted it to the repository for sl5_64 and then created the installroot image again with the new kernel. I burned a CD and tried to reboot. This time passing "hda" as the parameter worked fine! However, after 2 seconds the process failed because /etc/rc_install was missing. This is the script that handles the installation, written by Alastair. Somehow it must get into /etc during the creation of the installroot.
  • rc_install is part of the lcfg-installroot component. I guess that this must be submitted to the repository.
  • lcfg-buildinstallroot has been built and submitted to the repository. There was no release as it's not tested yet. Also, lcfg-install was submitted to the RPM repository after being built on panos. Again, no new release was made.

  • I tried again with the new image... Different problem this time. rc_install is found, but when requesting an IP address via DHCP it reports the error:


/sbin/dhclient-script: conf for eth0 not found continuing with defaults
/etc/sysconfig/network-scripts/network-functions: line 78: eth0: No such file or directory
cp: cannot stat /etc/resolv.conf: No such file or directory

However the machine gets its IP and /etc/resolv.conf exists with the right DNS entry.

31 August 2007

  • SUCCESS! Fetlar is back. And this time it's installed with decent physical partitions, none of your namby-pamby easily-wrecked LVM stuff:

-bash-3.1$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/hda1              25G   11G   13G  47% /
/dev/hda3              12G  158M   11G   2% /disk/scratch
none                  248M     0  248M   0% /dev/shm
AFS                   8.6G     0  8.6G   0% /afs
-bash-3.1$ 

Hmm - I thought that AFS needed a separate /var/cache/afs/ partition to work properly...? Certainly DICE machines all seem to have one. However AFS seems to be working perfectly well on fetlar: I get my AFS home directory and I can see other AFS directories perfectly well too. There are no complaints in /var/lcfg/log/afs either. EDIT: Simon assures me that this is OK - it'll probably be slower and it may be a bit more dangerous in the long run, but it's fine for now. Alastair tells me that the separate /var/cache/afs/ partition is a DICE level thing and isn't created for inf or lcfg level machines (fetlar is inf level). So anyway, I'd say the sl5 installbase is working. Next up: the sl5_64 installbase, and tackling the installroot. We have been working on both of those already so hopefully getting them working won't be too painful.

  • SUCCESS! panos is installed with the LCFG process as well :-) Things that need to be mentioned:

1) I didn't include the lcfg-install component at the beginning, so the process wouldn't start. I included it in the panos profile and it worked fine.
2) The fstab component was failing, and the installation with it, because the filesystem was read-only and it couldn't create /root/disk/extra1, which was the mount point for sdb1 specified in the panos profile. I removed all the fstab entries, along with all the unnecessary ones such as those of the grub component.
3) updaterpms would be called and fail due to 1386 conflicts. There was a warning that the kernel RPM header couldn't be found. The x86_64 version of GLIBC 2.5 was missing. I checked the lists and it was specified as an i686 package, so I removed the /i686 and rebooted. All packages would then be flagged for installation. After that, it wouldn't be able to find the kernel package. The kernel package was also specified as i686 while it is x86_64, so I removed /i686 from that one as well. After a reboot the system started installing 2158 packages :-)
(for fetlar there was no need to take out /i686 as the packages used for i386 are i686)
4) Machine finally ready! I can log in. However, somebody with admin privileges on Kerberos must create a keytab for the machine.
5) This should be first, but it has nothing to do with the SL5 process itself. I booted from PXE and chose the fc664 installation. Choosing fc6 on a 64 bit machine would fail, as the packages specified were for the x86_64 architecture.
6) The installbase lists for both i386 and x86_64 specified a "bacl" package which is never used, so I removed it.
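Point 3 is the sort of thing that can be checked before an install is attempted. Here is a rough sketch, assuming package list entries end in /arch, optionally followed by a colon and flags or by whitespace. The function name entries_for_arch is made up for this example:

```shell
#!/bin/sh
# entries_for_arch: hypothetical helper. Prints, with line numbers, the
# entries in a package list that carry an explicit /<arch> suffix.
entries_for_arch() {
    list_file=$1
    arch=$2
    # Accept end of line, a colon or whitespace after the arch, so that
    # "/i686" does not match some longer suffix by accident.
    grep -n -E "/${arch}(:|[[:space:]]|\$)" "$list_file"
}
```

Running entries_for_arch against an sl5_64 list with i686 as the second argument would have listed the glibc and kernel entries that had to be edited by hand.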

  • After a few hours of trying to get it to work, the installroot list seems to be fine now. There were about 20 conflicts. Some packages needed to be added, and one which was not needed had to be removed. That was fine. The interesting thing was that three packages would not be flagged for installation by updaterpms, and these were
  • perl-Time-modules
  • filesystem
  • SysVinit
    The first one was not in the installroot list for x86_64 or i386, which seemed strange at first, and the other two were in the installroot list but wouldn't be flagged, and packages would still ask for them. Then I remembered that there is also another list called lcfg_sl5_lcfg_installroot.rpms. These three packages were specified in there but they were not in the RPM repository. I also noticed from the updaterpms log files that it couldn't find the headers of these packages, so they definitely were not in the repository.
    The problem with perl-Time-modules was that the entry in lcfg_sl5_lcfg_installroot.rpms was defined as the name of the package built for FC6, but we had built a different one for SL5. So I renamed it and that worked fine. It was already in the repository so I didn't have to submit, but the package with the other name wasn't in the repository.
    Then, I copied the filesystem-2.4.0-1.lcfg.2.src.rpm and SysVinit-2.86-14.lcfg.1.src.rpm from the SRPM repository of fc6_64 to panos and rebuilt them. Then I submitted them to the RPM repository and no more conflicts. The installroot building was working. I did the same with i386 version of fetlar.

  • Having the iso file ready, I burned it to a CD and tried to boot from it (I made an image of sl5_64-develop). The boot process wouldn't get very far, as it couldn't open the root device "hdc" or unknown block(0,0). Chris suggested checking where the DVD drive is actually mapped and trying again. The drive is mapped to hda, so I tried setting the root device at the beginning as cd root=hda, but the result was the same. Instead of can't open root device "hdc", now it was can't open root device "hda". I also tried cd root=/dev/hda but unfortunately the result was the same.

  • One more problem I just noticed. When trying to create the installroot from sl5_64-develop then it works fine. It downloads around 250 packages. If you try to do it from sl5_64-testing and sl5_64-stable it takes all the packages that are normally installed on the system (2000+)

Chris gave me an answer about the last "problem": This is because the testing release is only made once a week, on Monday mornings. It's a snapshot of whatever is in subversion at that point. The stable release is made once a week on Thursday afternoons, and it's just a copy of that Monday's testing release. The third type of release, "develop", is just a copy of whatever is in the live subversion repository. You can find out more about the releases here: https://wiki.inf.ed.ac.uk/DICE/ReleasesFAQ

If you try making an sl5_64-testing installroot on Monday lunchtime, it will have about 250 packages in it - because I will have made next week's testing release by then, and all of our changes from this week will be in the new testing release. sl5_64-stable won't work properly until late on Thursday (about 16:00).

30 August 2007

  • Firstly, Paul has sent us some useful advice on debugging LCFG components: "The best way to debug it would be to dump the resources in to a file, and then issue the corresponding sxprof command by hand (or from a simple script) referencing the resources in the file rather than the profile. This will confirm which command is failing and let you play around more easily with the resources and the input to work out what is going on ..."
  • And secondly, an incautious test of lcfg-fstab yesterday seems to have totally trashed the LVM information on the main disk on fetlar. As far as I can remember, I enabled lcfg-fstab, put in what I thought were the right resources, and started the component. Then (for some reason) I decided that this would be a great time to reboot the machine. It went down, but it wouldn't boot up again. Instead of the normal grub menu I got a grub prompt. Booting into the SL5 install DVD and selecting Linux Rescue mode, it gave me a shell prompt but rather worryingly told me that it couldn't find any linux partitions on the hard disk and so it hadn't mounted anything on /mnt/sysimage as it normally would. I found that I could mount /dev/hda1 (/boot) manually, so I did that and checked the grub.conf file - it looked fine. I rebooted again, and got the grub prompt once more. I typed the "root" and "kernel" and "initrd" and "boot" commands from grub.conf. The machine tries to boot, but fails because it can't find any logical volumes. In particular it says Volume group "VolGroup00" not found which is bad news as that's the root partition. Sigh - my first big disaster of the project! Time to test the install procedure we've been working on ...
  • I PXE-booted fetlar and chose an fc6 installation. It fell over when it tried to get the installbase-sl5-develop profile. I went and made that: rfe -t lcfg/installbase-fc6-develop -n lcfg/installbase-sl5-develop, then changed "fc6" in the contents to "sl5". The installbase profiles include lcfg/options/installbase.h so I added an SL5 section to that. It appears from that file that we need a package called lcfg-installfixups, so I've built and submitted a version of that for SL5 (32 bit only as yet). Luckily Stephen has a 32 bit SL5 machine up and running on which I can build packages. A second attempt at installation got further - it started the updaterpms run of the installbase, but fell over with

LCFG updaterpms: failed to fetch RPM http://k.rpmsinf.ed.ac.uk/master/rpms/sl5/updates/.vim-minimal-7.0.109-3.el5.3.i386.rpm : 404    [ ERR ]
LCFG updaterpms: updaterpms failed [FAILED]

  • cd /pkgs/master/rpms/sl5/updates; nsu linux
  • for i in  *.rpm; do /usr/sbin/genhdfile $i; done
  • ... and I'm now trying the install again. updaterpms gets slightly further this time, but eventually fails with LCFG updaterpms: There were 519 conflicts.
  • Adding ncurses and perl to lcfg_sl5_installbase.rpms gets it down to 25 conflicts. Once the 25 conflicts were all sorted out, the next attempt threw up 8 conflicts. Ditto, one conflict. Ditto, two conflicts. Ditto, two other conflicts. And on the next attempt - it started installing RPMs! I'm now excitedly waiting for it to finish. Unfortunately I can't keep an eye on its progress as something has timed out, the screen has cleared, and a login prompt has appeared, meaning that none of the install messages is visible any more. This is what usually happens with our LCFG installs, I remember. Eventually when the install really finishes, the machine reboots (I think) then you get a proper login prompt (that is, one that will let you login).
  • I'm not confident that this install is going well. The login prompt is still showing. The machine's disk activity light blips once every few seconds, like a slow heartbeat. If updaterpms was still running it'd be blinking furiously. Has something got stuck or fallen over?
  • I eventually got fed up and rebooted the machine. Updaterpms failed again but the boot carried on regardless. I couldn't login when the machine finished booting. Tried booting single user, but it didn't like any of the passwords I tried - maybe I've forgotten what I set the root password to. I booted from PXE and dropped to a shell and mounted /dev/hda1 and finally took a look at the machine's /var/lcfg/log/updaterpms. It turned out that updaterpms had failed as soon as I had looked away. It was complaining that an i386 version of openssl was scheduled for deletion but that it was already installed. Which I found slightly odd. But I noticed that the base package list specified i686 openssl whereas the installbase specified i386. I've changed it to i686 and I'm trying again.
  • Panos has made all the rest of the sl5 installbase profiles, and has attempted to correct the package lists for sl5 installroot and sl5_64 installroot and installbase according to the changes I've had to make to the sl5 installbase. We won't be able to test much of this until we can use it in an install.
  • My last update of the day. The fetlar install has got to the next stage: updaterpms has finished checking dependencies and has consented to start installing packages! It's currently on 120 out of 2192. We'll see tomorrow how far it gets through that lot.
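The 404 earlier in this entry was for a dot-prefixed header file, which the genhdfile loop then regenerated for every RPM in the directory. Here is a sketch of how to list only the RPMs whose header is missing, assuming (as the file names in this diary suggest) that the header for foo.rpm lives beside it as .foo.rpm. The function name missing_headers is hypothetical:

```shell
#!/bin/sh
# missing_headers: hypothetical helper. Lists the RPMs in a repository
# directory that have no matching header file next to them.
# Assumption: the header for foo.rpm is stored as .foo.rpm in the same
# directory.
missing_headers() {
    repo_dir=$1
    for rpm in "$repo_dir"/*.rpm; do
        [ -e "$rpm" ] || continue      # glob matched nothing: no RPMs here
        name=${rpm##*/}
        [ -e "$repo_dir/.$name" ] || printf '%s\n' "$name"
    done
}
```

The output could then be fed to /usr/sbin/genhdfile one file at a time, in the same way as the loop above, rather than regenerating every header.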

29 August 2007

Kenny was having a look in this diary and he informed Chris that "genparts call fails because it gets back rubbish (probably empty) from listparts when it's trying to find the cylinders and heads and sectors". That happens because there are no SL5 entries for LCFG_OS_* in listparts.c.cin. After specifying SL5 I tried to run the adddisk method:


[root@panos noarch]# /usr/sbin/genparts /dev/sdb
p:1:0:1274:ext3
[root@panos noarch]# om fstab adddisk sdb
[FAIL] fstab: Error: Can't have the end before the start! mkparts: failed ped_partition_new

I had a look at mkparts.c.cin and it seemed that an SL5 entry was necessary to this one as well. After adding it:

[root@panos x86_64]# om fstab adddisk sdb
[INFO] fstab: Preserving filesystem on /dev/sdb1
[FAIL] fstab: e2fsck failed for preserved /dev/sdb1

I suppose that happens because this partition was an existing one and was NTFS. After I deleted it from the partition table, lcfg-fstab would re-create it but report the same error again. Then I specified one more partition in the panos profile (as sdb2), which is in the table but doesn't seem to get formatted, and there is the same error message for sdb1. Reading the guide a second time wouldn't be a bad idea. Chris told me that I had to set preservation to no:

!fstab.preserve_sdb1    mSET(no)
!fstab.preserve_sdb2    mSET(no) 

After that lcfg-fstab worked as it should. The partitions were created normally and formatted with ext3 and ext2 respectively, as specified in the profile. The component also generated a new /etc/fstab file. However, that configuration will not work on the existing system, so I'll restore the old fstab file.

(A note from Chris: although lcfg-fstab did its job properly, the /etc/fstab it made wasn't suitable for our test machine, because the main sda disk used logical partitions. The standard SL5 install process makes the root filesystem a logical partition. lcfg-fstab does not understand logical partitions, so we couldn't describe the existing partition and filesystem arrangement on sda in the fstab resources - so we just had to leave sda out. So when Panos ran the fstab component, the resulting /etc/fstab only contained entries for sdb.)


[root@panos lcfg]# om fstab adddisk sdb
[INFO] fstab: Making fsys (/sbin/mke2fs -j) for /disk/extra1 on /dev/sdb1
[INFO] fstab: Mounting /dev/sdb1 on /disk/extra1
[INFO] fstab: Making fsys (/sbin/mke2fs ) for /disk/extra2 on /dev/sdb2
[INFO] fstab: Mounting /dev/sdb2 on /disk/extra2
[INFO] fstab: /etc/fstab has changed - requesting reboot
[OK] fstab: adddisk
[root@panos lcfg]# cat /etc/fstab
# LCFG generated /etc/fstab - do not edit
/dev/sdb1 /disk/extra1 ext3 defaults 1 2
/dev/sdb2 /disk/extra2 ext2 defaults 1 3
none /proc proc defaults 0 0
none /dev/pts devpts gid=5,mode=620 0 0
none /dev/shm tmpfs defaults 0 0
none /sys sysfs defaults 0 0

The mount points were also created and the partitions are mounted as well:

-bash-3.1$ mount | grep sdb
/dev/sdb1 on /disk/extra1 type ext3 (rw)
/dev/sdb2 on /disk/extra2 type ext2 (rw)

The resources in the machine's profile are as follows:

!fstab.disks            mSET(sdb)
!fstab.dopartition_sdb  mSET(yes)
!fstab.partitions_sdb   mSET(sdb1 sdb2)
!fstab.size_sdb1        mSET(10000)
!fstab.size_sdb2        mSET(10000)
!fstab.type_sdb1        mSET(ext3)
!fstab.type_sdb2        mSET(ext2)
!fstab.mpt_sdb1         mSET(/disk/extra1)
!fstab.mpt_sdb2         mSET(/disk/extra2)
!fstab.preserve_sdb1    mSET(no)
!fstab.preserve_sdb2    mSET(no)

  • New release of lcfg-fstab is made. The new RPM is copied to the RPM repository, package list lcfg_sl5_lcfg.rpms is updated. fetlar and panos have been updated with updaterpms.

  • New release and RPM of lcfg-hackparts is made. fetlar and panos updated by updaterpms.


!fstab.entries          mADD(root boot swap)

fstab.spec_root         /dev/VolGroup00/LogVol00
fstab.vfstype_root      ext3
fstab.file_root         /
fstab.passno_root       1

fstab.spec_boot         /dev/sda1
fstab.vfstype_boot      ext3
fstab.file_boot         /boot
fstab.passno_boot       2

fstab.vfstype_swap      swap
fstab.file_swap         swap
fstab.spec_swap         /dev/VolGroup00/LogVol01

After lcfg-fstab was reconfigured, a new fstab file was generated with the appropriate entries:

-bash-3.1$ cat /etc/fstab
# LCFG generated /etc/fstab - do not edit
/dev/sdb1 /disk/extra1 ext3 defaults 1 2
/dev/sdb2 /disk/extra2 ext2 defaults 1 3
none /proc proc defaults 0 0
none /dev/pts devpts gid=5,mode=620 0 0
none /dev/shm tmpfs defaults 0 0
none /sys sysfs defaults 0 0
/dev/VolGroup00/LogVol00 / ext3 defaults 0 1
/dev/sda1 /boot ext3 defaults 0 2
/dev/VolGroup00/LogVol01 swap swap defaults 0 0

I rebooted the machine and everything worked fine.

  • The installroot and installbase lists for SL5 seem to have, finally, been sorted out. The exact lists are lcfg_sl5_installbase.rpms and lcfg_sl5_installroot.rpms for i386 architecture, lcfg_sl5_64_installbase.rpms and lcfg_sl5_64_installroot.rpms for x86_64 architecture.

28 August 2007

  • Stephen has cleared up the remaining lcfg-hardware concern. The LoadModules code does do the right thing even when hardware.modname_tag is not defined, because if it isn't then the modname variable is simply set to the name of the tag. So the end result is the command modprobe pcspkr rather than just modprobe as I said yesterday.
  • lcfg-hardware now officially works with SL5. A new release has been made, and packages have been made, submitted and installed for 32 and 64 bit.

  • Trying to partition the second hard drive on panos with lcfg-fstab. So far, no luck. The entry in the profile is:

!fstab.disks            mSET(sdb)
!fstab.dopartition_sdb  mSET(yes)
!fstab.partitions_sdb   mSET(sdb1)
!fstab.size_sdb1        mSET(10000)
!fstab.type_sdb1        mSET(ext3)
!fstab.mpt_sdb1         mSET(/disk/extra1)

And then trying to run adddisk method as user and then as root:

[root@panos components]# om fstab adddisk sdb
[FAIL] fstab: Use of uninitialized value in multiplication (*) at /usr/sbin/genparts line 192. Use of uninitialized value in multiplication (*) at /usr/sbin/genparts line 192. Illegal division by zero at /usr/sbin/genparts line 192.

The problem is actually reported by genparts:

[root@panos components]# /usr/sbin/genparts
Usage: genparts disk
[root@panos components]# /usr/sbin/genparts /dev/sdb
Use of uninitialized value in multiplication (*) at /usr/sbin/genparts line 192.
Use of uninitialized value in multiplication (*) at /usr/sbin/genparts line 192.
Illegal division by zero at /usr/sbin/genparts line 192.
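The messages above are what an unchecked calculation looks like when the disk geometry comes back empty. The diary does not show line 192 of genparts, so the following is only an illustration of the guard that is missing, written in shell with made-up names rather than taken from the real Perl code:

```shell
#!/bin/sh
# cylinder_count: hypothetical illustration, not the real genparts code.
# Computes total_sectors / (heads * sectors_per_track), refusing to divide
# when any geometry value is missing, non-numeric or zero.
cylinder_count() {
    total=$1
    heads=$2
    sectors=$3
    for value in "$total" "$heads" "$sectors"; do
        case $value in
            ''|*[!0-9]*)
                echo "bad geometry value: '$value'" >&2
                return 1
                ;;
        esac
    done
    if [ "$heads" -eq 0 ] || [ "$sectors" -eq 0 ]; then
        echo "geometry has a zero dimension" >&2
        return 1
    fi
    echo $(( total / (heads * sectors) ))
}
```

With a check like this, an empty reply from whatever supplies the geometry produces a message naming the bad value instead of a division by zero.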

  • I tried out lcfg-kernel on fetlar. It seems to work properly. I adjusted the grub resources to use the vmlinuz and initrd.img symlinks made by kernel. A reboot worked fine. I noticed in passing that the splashimage line in /boot/grub/grub.conf wasn't working - no splash image was displayed at boot time - and established with a couple more reboots that the color line appearing later in grub.conf was having the effect of making the splash image disappear: when the color line was commented out, the splash image was displayed by grub. The color line is part of the template file so I've told Stephen about the problem and he's agreed to fix it at some point.

27 August 2007

  • Still puzzling over the failing sxprof in lcfg-hardware on fetlar. If I run it by hand it seems to work perfectly:

[root@fetlar cc]# eval `qxprof -e hardware`
[root@fetlar cc]# echo $LCFG_hardware_modloader_pcspkr
modprobe
[root@fetlar cc]# export root=/
[root@fetlar cc]# export modconf=/etc/modprobe.conf
[root@fetlar cc]# sxprof -i hardware - $root/$modconf << 'EOF'
> #
> # LCFG generated (hardware component)
> #
> <%for: e=<%modlist%>%>
> <%mod_<%e%>%><%end:%>
> EOF
[root@fetlar cc]# cat /etc/modprobe.conf
#
# LCFG generated (hardware component)
#

alias eth0 e1000
alias snd-card-0 snd-intel8x0
options snd-card-0 index=0
install snd-intel8x0 /sbin/modprobe --ignore-install snd-intel8x0 && /usr/sbin/alsactl restore >/dev/null 2>&1 || :
remove snd-intel8x0 { /usr/sbin/alsactl store >/dev/null 2>&1 || : ; }; /sbin/modprobe -r --ignore-remove snd-intel8x0
alias usb-controller ehci-hcd
alias usb-controller1 uhci-hcd
alias scsi_hostadapter ata_piix
[root@fetlar cc]# 

  • The lcfg-hardware mystery has been solved. The definition of modconf depended on build-time variables:

@FEDORA_ONLY@   modconf=/etc/modprobe.conf
@RH9_ONLY@      modconf=/etc/modules.conf

These configuration variables evaluate to a comment character if not true, and to nothing if true, so in this case both lines were commented out when built on SL5. This meant that sxprof was trying to modify a file called // instead of /etc/modprobe.conf (or even ///etc/modprobe.conf). I've changed this to

@FEDORA_ONLY@   modconf=/etc/modprobe.conf
@SL5_ONLY@      modconf=/etc/modprobe.conf
@RH9_ONLY@      modconf=/etc/modules.conf

... pending a patch which Stephen kindly offered which will define this through a resource instead, which will be far more convenient, not to mention far less mysterious for future LCFG porters.
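To make the failure concrete, here is a small simulation of the substitution described above. It is an illustration only: the real substitution happens at build time, and the function names expand_for and modconf_for are invented for this sketch. A marker becomes empty on its own platform and a comment character everywhere else.

```shell
#!/bin/sh
# Illustration only: mimics the build-time substitution described above,
# where @X_ONLY@ becomes empty on platform X and "#" on any other platform.
template='@FEDORA_ONLY@   modconf=/etc/modprobe.conf
@RH9_ONLY@      modconf=/etc/modules.conf'

# expand_for PLATFORM: print the template as it would look when built there.
expand_for() {
    printf '%s\n' "$template" |
        sed -e "s/@${1}_ONLY@//" -e 's/@[A-Z0-9]*_ONLY@/#/'
}

# modconf_for PLATFORM: evaluate the expanded lines and print modconf.
modconf_for() {
    modconf=
    eval "$(expand_for "$1")"
    printf '%s\n' "$modconf"
}

root=/
for platform in FEDORA RH9 SL5; do
    printf '%s: %s\n' "$platform" "$root/$(modconf_for "$platform")"
done
```

For SL5 neither marker is true, both lines end up commented out, modconf stays empty, and the path printed is //, which matches the file name in the sxprof error.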
  • So, having finally got the hardware component building /etc/modprobe.conf instead of trying to build // instead, I've taken a look at the other files it's trying to configure. It looks as if a lot of the capability of the hardware component isn't currently being used, at least on this platform. I noticed one thing which I'm not entirely happy about. These two resources are defined in lcfg/defaults/hardware.h:

!hardware.permmodules           mADD(pcspkr)
hardware.modloader_pcspkr       modprobe

However since neither hardware.modname_pcspkr nor hardware.modopt_pcspkr is defined, this basically makes the hardware component execute the command modprobe with no arguments or options.
  • I asked Stephen and Alastair how to go about testing lcfg-fstab and lcfg-install.
  • Alastair said that when he tested fstab he usually used an external disk. Just plug in the external disk and use the fstab component's "adddisk" method to partition the external disk and make filesystems and so on. He says it uses most of the fstab component's code.
  • To test the install component, Stephen suggested starting off with the fc6 installroot. We could use the PXE (network) install - just choose "fc6" from the menu - and when it gets the machine's profile it'll use the sl5 stuff in that. This means that to start with we will not need any sl5 "installroot" package files. We will need an sl5 installbase file though - lcfg/lcfg_sl5_installbase.rpms and lcfg/lcfg_sl5_64_installbase.rpms. These need to contain the minimum set of RPMs necessary to get the machine installed. Once the installbase stuff is working, we can then try making an sl5 installroot, and getting buildinstallroot working for sl5.
  • Also, Stephen is soon going to write some new software which will solve the problem with lcfg-rpmcache and qxpack on 64 bit. He says that fc5_64 and fc6_64 both suffer from the same problem, so don't worry about it, just mark rpmcache as "done" :-)

24 August 2007

  • I'm bravely trying out lcfg-hardware on fetlar. So far the machine hasn't been trashed. The component isn't working properly though, and I can't see why. The log file is saying:

sxprof: can't rename file: //#=>//
sxprof: Device or resource busy

This happens both after a start and after a configure. sxprof is used in the LoadProfile routine in ngeneric but this is only called after a start, so I reckon it can't be that. The other place in which sxprof is called is in the component itself, in UpdateModuleConf:

##########################################################################
UpdateModuleConf() {
##########################################################################
#
#
   root=$1

   [ -f $root/$modconf ] || touch $root/$modconf

   sxprof -i $_COMP - $root/$modconf << 'EOF'
#
# LCFG generated (hardware component)
#
<%for: e=<%modlist%>%>
<%mod_<%e%>%><%end:%>
EOF

   status=$?
   if [ $status = 2 ]  ; then
      Info "$modconf has changed - request reboot"
      RequestReboot "$modconf has changed"
   fi
}

Now there might be something wrong there if, for instance, one of the tags in the modlist taglist didn't have a corresponding mod_tag resource, but that doesn't seem to be the case:

-bash-3.1$ qxprof hardware.modlist
modlist=ether snd sndo sndi sndr ehcihcd uhcihcd piix
-bash-3.1$ qxprof hardware|grep mod_
mod_ehcihcd=alias usb-controller ehci-hcd
mod_ether=alias eth0 e1000
mod_piix=alias scsi_hostadapter ata_piix
mod_snd=alias snd-card-0 snd-intel8x0
mod_sndi=install snd-intel8x0 /sbin/modprobe --ignore-install snd-intel8x0 && /usr/sbin/alsactl restore >/dev/null 2>&1 || :
mod_sndo=options snd-card-0 index=0
mod_sndr=remove snd-intel8x0 { /usr/sbin/alsactl store >/dev/null 2>&1 || : ; }; /sbin/modprobe -r --ignore-remove snd-intel8x0
mod_uhcihcd=alias usb-controller1 uhci-hcd
-bash-3.1$ 
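The same check can be done mechanically rather than by eye. Here is a sketch that works on a saved copy of the qxprof hardware output, assuming the key=value format shown above. The function name missing_mod_tags is made up:

```shell
#!/bin/sh
# missing_mod_tags: hypothetical helper. Reads a file holding key=value
# lines (as printed by "qxprof hardware" above) and prints every tag in
# modlist that has no corresponding mod_<tag> line.
missing_mod_tags() {
    dump=$1
    modlist=$(sed -n 's/^modlist=//p' "$dump")
    for tag in $modlist; do
        grep -q "^mod_${tag}=" "$dump" || printf '%s\n' "$tag"
    done
}
```

For example, save the resources with qxprof hardware > /tmp/hardware.dump and run missing_mod_tags /tmp/hardware.dump; no output means every tag has its resource.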

All present and correct, as far as I can see. So why is it failing? The LCFG template syntax hasn't changed. sxprof hasn't been changed. The resources are OK. The same code works on other platforms.
Hmm. So, if $root is / and $modconf is /etc/modprobe.conf then $root/$modconf will be ///etc/modprobe.conf, no? Which doesn't look healthy. So why does it work elsewhere?
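On that last question: POSIX path resolution treats a run of slashes as a single slash (a path starting with exactly two slashes is implementation-defined, though Linux treats it the same way), so ///etc/modprobe.conf names the same file as /etc/modprobe.conf. A quick check, with same_inode as an illustrative helper name:

```shell
#!/bin/sh
# same_inode: succeeds if both paths report the same inode number.
same_inode() {
    first=$(ls -di "$1" | awk '{ print $1 }')
    second=$(ls -di "$2" | awk '{ print $1 }')
    [ -n "$first" ] && [ "$first" = "$second" ]
}

if same_inode /etc ///etc; then
    echo "///etc and /etc are the same directory"
fi
```

So the extra slashes are ugly but harmless, which suggests the failure has to come from somewhere else.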

  • I have add in panos profile the lcfg-kernel component and from a first view seems to work fine. It creates the link of the kernel and initrd image under /boot/vmlinuz and /boot/initrd.img. I thought after that trying again lcfg-grub but there was no luck. I changed also the configuration to match the machine's one but didn't work, it fails to boot the kernel, it doesn't find it. I suppose it has something to with partition schema, but not sure.

  • Chris suggested changing kernel /boot/vmlinuz  root=/dev/VolGroup00/LogVol00 ro quiet selinux=0 to kernel /boot/vmlinuz  root=/dev/sda3 ro quiet selinux=0. I change the resource through panos profile but the result was unfortunately the same.

  • We finally realized what the problem was with lcfg-grub. SL5 puts /boot on a separate partition, while LCFG uses a /boot directory on the root partition. That stopped lcfg-grub's entries from booting, as grub couldn't find anything under /boot. I then changed the entries from /boot/vmlinuz and /boot/initrd.img to /vmlinuz and /initrd and the system booted up normally. The root had to be set to root=/dev/VolGroup00/LogVol00 as in SL5. Changing it to root=/dev/sda3 would cause a kernel panic. I suppose that won't be a problem once the system is partitioned by the LCFG components using the usual LCFG partition scheme.
  • fetlar also now boots OK with lcfg-grub enabled! It needed these resource settings:

!grub.initrd_defaultboot_disk1      mSET(/initrd-2.6.18-8.1.3.el5.img)
!grub.initrd_defaultboot_disk1single   mSET(/initrd-2.6.18-8.1.3.el5.img)
!grub.kernel_defaultboot_disk1      mSET(/vmlinuz-2.6.18-8.1.3.el5)
!grub.kernel_defaultboot_disk1single   mSET(/vmlinuz-2.6.18-8.1.3.el5)
!grub.kroot_defaultboot_disk1      mSET(/dev/VolGroup00/LogVol00)

Once lcfg-kernel is enabled on fetlar the first four lines won't be necessary. Once we're able to reinstall with our usual disk partition arrangements we won't need the fifth line either. It looks as if lcfg-grub is behaving itself!
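The /boot confusion comes down to GRUB reading paths relative to the filesystem that holds the file, not relative to the running system's root. A rough sketch of that mapping (grub_path is a made-up helper for illustration, not part of lcfg-grub):

```shell
#!/bin/bash
# Hypothetical helper: translate an absolute path on the running system
# into the path GRUB sees, given the mount point of the filesystem that
# holds the file.
grub_path() {
    local path=$1
    local mnt=${2%/}          # "/" becomes "", "/boot" stays "/boot"
    case "$path" in
        "$mnt"/*) printf '%s\n' "${path#"$mnt"}" ;;
        *)        printf '%s\n' "$path" ;;
    esac
}

# SL5 default layout: /boot is its own partition.
grub_path /boot/vmlinuz-2.6.18-8.1.3.el5 /boot
# Usual LCFG layout: /boot is a directory on the root partition.
grub_path /boot/vmlinuz-2.6.18-8.1.3.el5 /
```

With a separate /boot partition the prefix is stripped, which is why the resource values above start at /vmlinuz and /initrd rather than /boot.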

23 August 2007

  • There is a new release of lcfg-network. The new RPM has been copied to the RPM repository, lcfg/lcfg_sl5_lcfg.rpms has been updated and the new package is installed with updaterpms on both fetlar and panos. Stephen reports that it's also installed and working fine on funzie.
  • I tried installing lcfg-mailng on fetlar and sending mail. The mail arrived OK but looked a bit odd, like a forwarded message. The version of sendmail on SL5 looks identical to the one on FC6. I took a look at the sendmail template files and saw that the one for SL5 looked a bit different from the FC6 one. I tried running with a copy of the FC6 sendmail template and that seemed to work better - test mail arrived looking normal, and from a normal address (username@domain, rather than username@machine.domain). So the sendmail template for SL5 is now a copy of the FC6 one with the string "FC6" changed to "SL5". I've made a new release and made, submitted and installed it on both fetlar and panos and mail sends properly now from both machines. I mailed our mail guru to confess about the new releases, let's hope he doesn't mind (I didn't touch any other files save "ChangeLog"). I submitted it to the autolcfg bucket rather than the lcfg one as it's not a Services Unit approved release yet.
  • I've been looking at lcfg-hardware. Not wanting to take a huge leap in the dark (yet), I looked through all the subroutines to figure out what they would do on fetlar. It all looks OK except for ConfigureDisks. I can see two possible problems:
  • hardware.disks inherits its default value from fstab.disks, and we're not yet using lcfg-fstab. However the inf/hw/ include files all seem to define hardware.disks so maybe that's not a problem.
  • As far as I can see, ConfigureDisks won't actually do anything if hardware.diskparams_ isn't defined, and none of our header files defines any hardware.diskparams resources at all. Maybe that's not a problem though...? It doesn't seem to delete anything (there's a rm -f $diskcfg at the end of the routine, but as far as I can see, $diskcfg is empty).
  • I've just realised that there's another lcfg-hardware difference with fetlar that I'd better check out. The redhat default /etc/modprobe.conf on fetlar has these snd options:

alias snd-card-0 snd-intel8x0
options snd-card-0 index=0
options snd-intel8x0 index=0
remove snd-intel8x0 { /usr/sbin/alsactl store 0 >/dev/null 2>&1 || : ; }; /sbin/modprobe -r --ignore-remove snd-intel8x0

However the LCFG-created one would have these:

alias snd-card-0 snd-intel8x0
options snd-card-0 index=0
install snd-intel8x0 /sbin/modprobe --ignore-install snd-intel8x0 && /usr/sbin/alsactl restore >/dev/null 2>&1 || :
remove snd-intel8x0 { /usr/sbin/alsactl store >/dev/null 2>&1 || : ; }; /sbin/modprobe -r --ignore-remove snd-intel8x0

Is this significant? Edit: Stephen reckons it's not, or if anything the LCFG one is better since the extra install makes sure that alsa is correctly configured at start time. After going through some man pages I was tentatively forming a similar opinion.
  • I've now gone through panos as well and I think that lcfg-hardware should work OK there too. Famous last words. Time to try it out...

22 August 2007

  • lcfg-rpmcache on panos is still trying to download a lot of i386 RPMs instead of x86_64 RPMs - 1307 of them to be precise. It has 2502 RPMs installed. Stephen reckons that this may be some sort of default behaviour in the Perl module which rpmcache is using. He's going to take a look at the module to see if he can find the problem there.
  • I see that qxpack is also getting it wrong on panos. It thinks that a package is i386:

-bash-3.1$ qxpack -v zisofs-tools
zisofs-tools:
 version=1.0.6-3.2.2
    arch=i386
  derive=/var/lcfg/conf/server/releases/develop/core/packages/lcfg/lcfg_sl5_64_base.rpms:3159

... but it's actually x86_64:

-bash-3.1$ rpm -q zisofs-tools
zisofs-tools-1.0.6-3.2.2.x86_64
-bash-3.1$ 

However if I try making a version of /usr/bin/qxpack which has "x86_64" in place of "i386", it still gets it wrong in exactly the same way - perhaps confirming Stephen's suspicion that the default in qxpack is being ignored in favour of a default elsewhere, e.g. in the Perl module.

-bash-3.1$ chrisqxpack -v zisofs-tools
zisofs-tools:
version=1.0.6-3.2.2
    arch=i386
  derive=/var/lcfg/conf/server/releases/develop/core/packages/lcfg/lcfg_sl5_64_base.rpms:3159
-bash-3.1$
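To see how widespread the mismatch is, the architecture that rpm reports could be compared with what the profile claims. A small sketch, assuming package strings in the name-version-release.arch form printed by rpm -q above (rpm_arch is a hypothetical helper, and the profile value is simply copied from the qxpack output):

```shell
#!/bin/bash
# Hypothetical helper: pull the architecture off the end of an
# "rpm -q" style string such as zisofs-tools-1.0.6-3.2.2.x86_64.
rpm_arch() {
    printf '%s\n' "${1##*.}"
}

installed=$(rpm_arch "zisofs-tools-1.0.6-3.2.2.x86_64")
profile=i386     # what qxpack -v reported in the transcript above

if [ "$installed" != "$profile" ]; then
    echo "arch mismatch: installed=$installed profile=$profile"
fi
```

On a real machine the string would come from rpm -q; rpm -q --queryformat '%{ARCH}\n' prints the architecture on its own.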

  • lcfg-network has been tested and works fine. After setting the IP in panos's profile with !network.ipaddr_eth0    mSET(111.111.111.111) and restarting the component:

-bash-3.1$ om network restart
[OK] network: restart
-bash-3.1$ cat /etc/sysconfig/network-scripts/ifcfg-eth0 
DEVICE=eth0
ONBOOT=yes
IPADDR=111.111.111.111
NETMASK=255.255.255.0
NETWORK=111.111.111.0
BROADCAST=111.111.111.255

I also tested adding an entry to /etc/hosts:

!network.extrahosts     mADD(panostest)
!network.hentry_panostest      mSET(129.215.46.109 panostest)

And then checking the file itself, as well as trying to ping the host panostest:

-bash-3.1$ cat /etc/hosts
# LCFG generated /etc/hosts
127.0.0.1 localhost localhost.inf.ed.ac.uk
129.215.46.109 panos.inf.ed.ac.uk panos
129.215.46.109 panostest
-bash-3.1$ ping -c 5 panostest
PING panostest (129.215.46.109) 56(84) bytes of data.
64 bytes from panos.inf.ed.ac.uk (129.215.46.109): icmp_seq=1 ttl=64 time=0.023 ms
64 bytes from panos.inf.ed.ac.uk (129.215.46.109): icmp_seq=2 ttl=64 time=0.016 ms
64 bytes from panos.inf.ed.ac.uk (129.215.46.109): icmp_seq=3 ttl=64 time=0.012 ms
64 bytes from panos.inf.ed.ac.uk (129.215.46.109): icmp_seq=4 ttl=64 time=0.012 ms
64 bytes from panos.inf.ed.ac.uk (129.215.46.109): icmp_seq=5 ttl=64 time=0.012 ms

--- panostest ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 0.012/0.015/0.023/0.004 ms
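The NETWORK and BROADCAST lines in the ifcfg-eth0 listing earlier in this entry follow from IPADDR and NETMASK by bitwise arithmetic. A sketch of that calculation in plain bash (the function names are made up here; this is not how lcfg-network computes them, just a way to cross-check the generated values):

```shell
#!/bin/bash
# Convert a dotted quad to a 32-bit integer.
ip_to_int() {
    local IFS=.
    set -- $1                 # split on dots into $1..$4
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Convert a 32-bit integer back to a dotted quad.
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# Network address: address AND netmask.
network_of() {
    int_to_ip $(( $(ip_to_int "$1") & $(ip_to_int "$2") ))
}

# Broadcast address: network OR inverted netmask.
broadcast_of() {
    local ip mask
    ip=$(ip_to_int "$1")
    mask=$(ip_to_int "$2")
    int_to_ip $(( (ip & mask) | (~mask & 0xFFFFFFFF) ))
}

network_of 111.111.111.111 255.255.255.0
broadcast_of 111.111.111.111 255.255.255.0
```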

  • WARNING: It seems that something broke seriously after using the network component. The component itself does at least work, as I could edit /etc/hosts and /etc/sysconfig/network-scripts/ifcfg-eth0 through it. After rebooting, the machine couldn't communicate with the network. I checked the ... ifcfg-eth0 and there was no MAC address specified, so I thought that might be the reason. I edited the file back to how it normally was (actually copied it from fetlar and changed the values) and then rebooted. The network component replaced it again, so I did it again and then renamed /usr/lib/lcfg/components/network so it wouldn't start at boot. I did the same with pam as it wouldn't allow me to log in as root. I finally logged in. I checked the network configuration and it looks fine, but the machine still can't communicate with the outside world, not even the Informatics network.
  • SOLVED: I should have checked more carefully: the gateway was missing from /etc/sysconfig/network-scripts/ifcfg-eth0 and /etc/sysconfig/network. I booted in rescue mode from the installation DVD and edited the files.
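Since a missing gateway is easy to overlook, a quick check for required keys in the generated files might save a rescue boot next time. A sketch (has_key is a hypothetical helper, and the scratch file stands in for the real ifcfg-eth0):

```shell
#!/bin/bash
# Hypothetical helper: succeed if KEY is set to a non-empty value in FILE.
has_key() {
    grep -Eq "^[[:space:]]*$2=.+" "$1"
}

cfg=$(mktemp)
trap 'rm -f "$cfg"' EXIT
# Same keys as the generated ifcfg-eth0 shown earlier, with no GATEWAY.
printf 'DEVICE=eth0\nONBOOT=yes\nIPADDR=111.111.111.111\nNETMASK=255.255.255.0\n' > "$cfg"

for key in IPADDR NETMASK GATEWAY; do
    has_key "$cfg" "$key" || echo "missing in $cfg: $key"
done
```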

  • Roger Burroughes has written a detailed explanation of how the install process works. Basically the machine boots from a special image called the installroot. This uses the install component to partition the machine's disk, then install and configure the essential bits of the operating system and LCFG (collectively called the installbase). The machine then reboots into the installbase, which then installs all the rest of the RPMs which the machine needs. One more reboot and the machine is up and running normally, with all of the machine's LCFG components doing their normal configuration jobs.
  • Roger has also documented the buildinstallroot process which builds the installroot.
  • There's also a third page detailing other notes on the install process.
  • The top level page for all this is http://homepages.inf.ed.ac.uk/roger/LCFG/.
  • We have a customer! Stephen has installed an SL5 machine funzie (pronounced "finnie" and named after this place).

  • lcfg-network must have been broken because live/wire_k.h was missing from panos's profile. After including it, the gateway was inserted in /etc/sysconfig/network. I specified the other crucial network information in the profile and rebooted. It worked fine. Then I commented out the network information in the profile, rebooted and everything was fine again.

21 August 2007

  • Stephen has removed the lcfg-gdm default session from the schema file. This makes the SL5 gdm default session appear properly. This means that there's no more need to mSET the value of gdm.sessiondesktopdir in inf/options/environment.h so I've removed it. After the removal lcfg-gdm on fetlar still gives a proper choice of sessions.

  • lcfg-xinetd has been tested and works fine. New services can be created and the daemon is being controlled. I copied the Subversion xinetd entry from lcfg/options/subversion-server.h and added it to panos's profile:

!xinetd.services                        mADD(svn)
!xinetd.enableservices                  mADD(svn)
!xinetd.attributes_svn                  mSET(flags socket_type protocol wait \
                                        instances server server_args user \
                                        log_on_failure)
!xinetd.value_svn_flags                 mSET(NOLIBWRAP)
!xinetd.value_svn_socket_type           mSET(stream)
!xinetd.value_svn_protocol              mSET(tcp)
!xinetd.value_svn_wait                  mSET(no)
!xinetd.value_svn_instances             mSET(1000)
!xinetd.value_svn_server                mSET(<%subversion.tool_svnserve%>)
!xinetd.value_svn_server_args           mSET(-i -r <%subversion.path_repository%>)
!xinetd.value_svn_user                  mSET(<%subversion.user%>)
!xinetd.assignop_svn_log_on_failure     mSET(+=)
!xinetd.value_svn_log_on_failure        mSET(HOST USERID)

That created a new service file /etc/xinetd.lcfg/svn:


-bash-3.1$ cat /etc/xinetd.lcfg/svn 
#######################################################################
#
# service specific xinetd configuration file, built from resources
#
# Toby Blake 
# Version 0.99.7 : 29/12/03 07:31
#
# ** Generated file : do not edit **
#
#######################################################################
 
service svn
{
    flags = NOLIBWRAP
    socket_type = stream
    protocol = tcp
    wait = no
    instances = 1000
    server = 
    server_args = -i -r 
    user = 
    log_on_failure += HOST USERID
}

And after restarting xinetd we can see in the logfile:

21/08/07 10:44:27: >> restart
21/08/07 10:44:27:    Enabling services: svn
21/08/07 10:44:27:    No file for svn found in /etc/xinetd.d
21/08/07 10:44:27:    Using LCFG resources for svn
Starting xinetd:                                           [  OK  ]
21/08/07 10:44:28:    Ran /etc/rc.d/init.d/xinetd start
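One thing worth noticing in the generated /etc/xinetd.lcfg/svn above is that server and user came out empty and server_args has no repository path, presumably because the subversion.* resources referenced in the header aren't defined in panos's profile. A sketch of a check that would flag empty attributes (empty_attrs and the sample file are illustrative assumptions, not part of lcfg-xinetd):

```shell
#!/bin/bash
# List xinetd attributes whose value is empty in a service file.
empty_attrs() {
    awk -F'=' '/=/ && !/^[ \t]*#/ {
        name = $1; value = $2
        gsub(/[ \t+]/, "", name)              # drop spaces and the + of "+="
        gsub(/^[ \t]+|[ \t]+$/, "", value)    # trim the value
        if (value == "") print name
    }' "$1"
}

svc=$(mktemp)
trap 'rm -f "$svc"' EXIT
# Cut-down copy of the generated service file shown above.
cat > "$svc" <<'EOF'
service svn
{
    flags = NOLIBWRAP
    server =
    server_args = -i -r
    user =
    log_on_failure += HOST USERID
}
EOF

empty_attrs "$svc"
```

So the component is clearly writing the file and controlling the daemon, but the svn service itself would need those resources set before it could actually run.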

  • New releases have been made of lcfg-xfree, lcfg-gdm and lcfg-xinetd. The new RPMs have been copied to the RPM repository, the package/lcfg/lcfg_sl5_lcfg.rpms list has been updated and the new packages have been installed using updaterpms on both fetlar and panos.
  • I've enabled lcfg-mailcap on fetlar and now the /etc/mailcap file is empty. I see the mailcap resources on dice machines are all set at the Ed level. I've been trying to include these resources in the inf level for SL5 machines.
  • I've changed inf/defaults.h so that for SL5 it includes ed/defaults.h instead of lcfg/defaults.h.
  • I've edited ed/defaults.h to add an SL 5 section, adding SL5 ed env package lists.
  • I've taken the ed env SL5 package list out of inf/options/packages.h where I'd previously put it.
  • I've created inf/options/mailcap.h - for SL5 only it includes ed/options/mailcap.h
  • I've included inf/options/mailcap.h in inf/defaults.h
  • With Iain's permission I've changed ed/options/mailcap.h to make it support SL5 and other Linuxes too instead of just FC5 and FC6.
  • Again with Iain's permission Stephen changed lcfg/defaults/mailcap.h in the same way.
  • Finally - we now have some useful mailcap resources, and /etc/mailcap has some useful-looking entries, so it's at least not much worse than the default redhat one; so we can declare lcfg-mailcap to be OK with a clear conscience.
  • I couldn't actually find a machine locally that was using lcfg-rpmcache, so I must have missed something there, since I know we have some RPM cache machines! But on all the RPM slaves I tried, qxprof rpmcache said that there was no such component. Odd. Well, anyway, I tried lcfg-rpmcache out by including lcfg/options/rpmcache in lcfg/fetlar. om rpmcache start succeeded. om rpmcache run started copying all the installed RPMs into a new directory called /disk/scratch/rpmcache. This looks pretty convincing behaviour for lcfg-rpmcache to me. I've therefore declared it to be runnable on SL5, said so in the component's source in the usual config.mk place, made a release, made and submitted new RPMs, and installed it on fetlar. It's not yet on panos as that currently has errors in its profile so it's not picking up the change in the package list yet.

  • A new release of lcfg-prelink has been made and copied to the RPM repository, and lcfg_sl5_lcfg.rpms has been updated. The new RPM was successfully installed via updaterpms. As a test I compared the /etc/prelink.conf files of a normal FC6 and a normal SL5 installation with each other and then with the one generated by LCFG. All the files appear to be identical.

  • lcfg-rpmcache on panos has problems as it tries to download i386 rpm copies from the x86_64 repository.

  • I have started testing lcfg-grub. I didn't get very far: after installing the resources and starting the component, on the next reboot the entries didn't correspond to the system. At least we know that it can edit /boot/grub/menu.lst and /boot/grub/grub.conf fine. The problems I spotted at a quick look were:

1) /boot/vmlinuz and /boot/initrd links don't exist
2) The current system is partitioned into Logical Volumes while grub is using normal partition entries, and lcfg-fstab doesn't know about Logical Volumes.

20 August 2007

  • I realised that lcfg-gdm wasn't really finished until we put a fix for the session type menu into a header file. I've chosen to put a temporary fix for SL5 in to inf/options/environment.h because the broken value for gdm.sessiondesktopdir appears to be set in the gdm-2.def file and I'm not going to touch that, as it belongs to the Infrastructure Unit. I've set the value to /usr/share/gdm/BuiltInSessions/:/usr/share/xsessions/:/usr/lib/lcfg/conf/gdm/sessions/. I've added /usr/lib/lcfg/conf/gdm/sessions/ to the path to get our "Custom" session back, and I've added it at the end so that our faulty "default" session doesn't override the working SL5 one.
  • While there I noticed that other platforms have defined gdm.graphtheme in inf/options/environment.h so I moved the SL5 definition there from lcfg/defaults/gdm.h where we had put it earlier.
  • I've mailed the Infrastructure Unit about the gdm.sessiondesktopdir problem.
  • I've pulled down copies of the source RPMs for all of the prerequisite RPMs we had to get from independent repositories (Fedora EPEL, AT, DAG) earlier in the project, and copied them to /pkgs/master/srpms/sl5/lcfg and /pkgs/master/srpms/sl5_64/lcfg.
  • Oops! lcfg-gdm actually belongs to the MPU. Stephen is going to have a look at the problem.
  • Some stage 10 advice from Stephen: One of the mail components usually needs a new blob of sendmail configuration for each new platform. We can try reusing an existing one but it hasn't worked in the past. Neil or another mail expert can help with this.

17 August 2007

  • We went to the monthly LCFG Deployers' Meeting yesterday and told people how we thought the port was going (pretty well, mostly, we think) and what we thought of SL5 (pretty similar to Fedora, few difficulties adapting). They didn't disagree with our using binary RPMs from a few well respected external repositories, but they suggested that we pull down copies of the SRPMs for each of the RPMs we got from third party repositories - because we expect SL5 to keep going for several years, and in a period of several years it wouldn't be inconceivable for an independent software repository to shut down or disappear. We also agreed that it was likely that we'd be able to make some sort of initial version of SL5 LCFG available for people to try out around the time of the next meeting (20 September).
  • I just tried logging in to fetlar on the console and starting a Gnome session. This worked OK when I last tried it a day or two ago. Now however I just get a blank grey screen instead of a proper Gnome session, like a basic X session with no clients or window manager. When I kill the X server (Ctl-Alt-Backspace) I get the following screenful of text:
figuration path file /etc/gconf/2/path doesn't contain any databases or wasn't
found 2) somehow we mistakenly created two gconfd processes 3) your operating system is misconfigured so NFS file locking doesn't work in your home directory or 4) your NFS client machine crashed and didn't properly notify the server on reboot that file locks should be dropped. If you have two gconfd processes (or had
two at the time the second was launched), logging out, killing all copies of gconfd, and logging back in may help. If you have stale locks, remove ~/.gconf*/*lock. Perhaps the problem is that you attempted to use GConf from two machines at
once, and ORBit still has its default configuration that prevents
Aug 17 09:41:46 fetlar gconfd (cc-23511): Error setting value for `/desktop/gnome/accessibility/keyboard/slowkeys_delay': Unable to store a value at key '/desktop/gnome/accessibility/keyboard/slowkeys_delay', as the configuration server has no writable databases. There are some common causes of this problem: 1) your configuration path file /etc/gconf/2/path doesn't contain any databases or wasn't
found 2) somehow we mistakenly created two gconfd processes 3) your operating system is misconfigured so NFS file locking doesn't work in your home directory or 4) your NFS client machine crashed and didn't properly notify the server on reboot that file locks should be dropped. If you have two gconfd processes (or had
two at the time the second was launched), logging out, killing all copies of gconfd, and logging back in may help. If you have stale locks, remove ~/.gconf*/*lock. Perhaps the problem is that you attempted to use GConf from two machines at
once, and ORBit still has its default configuration that prevents
Aug 17 09:42:03 fetlar gdm[9260]: failsafe dialog failed (inhibitions: 0 0)
Aug 17 09:42:15 fetlar gdm[9260]: failsafe dialog failed (inhibitions: 0 1)
So presumably the switch yesterday from a local home directory to an AFS home directory has affected Gnome badly. I don't know, but I shouldn't imagine that NFS file locking is available for an AFS filesystem.
  • I've just tried it again several more times. It seems that Gnome does respond eventually. On the grey background it eventually manages to pop up two warning messages from Nautilus and gnome-panel - here's a screen shot:
Screenshot.png

  • Looking at ~/.nautilus, it proves to be a link to /platspec/DotFiles/cc/.nautilus/. I recall the platspec thing being a DICE-only bodge for smoothing the transition from FC3 to FC5? Maybe this platspec thing isn't needed any more? Don't know. Must check. I seem to have a number of these platspec things in my home directory. Neither /platspec nor /amd/platspec (which on DICE it's a link to) exists on fetlar.

bash-3.1$ ls -la |grep platspec
lrwxr-xr-x   1 cc   people        27 Aug 22  2006 .cpan -> /platspec/DotFiles/cc/.cpan
lrwxr-xr-x   1 cc   people        30 Aug 22  2006 .esdauth -> /platspec/DotFiles/cc/.esdauth
lrwxr-xr-x   1 cc   people        28 Aug 22  2006 .gconf -> /platspec/DotFiles/cc/.gconf
lrwxr-xr-x   1 cc   people        29 Aug 22  2006 .gconfd -> /platspec/DotFiles/cc/.gconfd
lrwxr-xr-x   1 cc   people        36 Aug 22  2006 .gnome-desktop -> /platspec/DotFiles/cc/.gnome-desktop
lrwxr-xr-x   1 cc   people        39 Aug 22  2006 .gtkrc-1.2-gnomec -> /platspec/DotFiles/cc/.gtkrc-1.2-gnomec
lrwxr-xr-x   1 cc   people        33 Aug 22  2006 .gtkrc.mime -> /platspec/DotFiles/cc/.gtkrc.mime
lrwxr-xr-x   1 cc   people        30 Aug 22  2006 .mailcap -> /platspec/DotFiles/cc/.mailcap
lrwxr-xr-x   1 cc   people        27 Aug 22  2006 .mcop -> /platspec/DotFiles/cc/.mcop
lrwxr-xr-x   1 cc   people        29 Aug 22  2006 .mcoprc -> /platspec/DotFiles/cc/.mcoprc
lrwxr-xr-x   1 cc   people        31 Aug 22  2006 .metacity -> /platspec/DotFiles/cc/.metacity
lrwxr-xr-x   1 cc   people        33 Aug 22  2006 .mime.types -> /platspec/DotFiles/cc/.mime.types
lrwxr-xr-x   1 cc   people        31 Aug 22  2006 .nautilus -> /platspec/DotFiles/cc/.nautilus
-rw-r--r--   1 cc   people         0 Jul 13  2006 .noplatspecdots
lrwxr-xr-x   1 cc   people        33 Aug 22  2006 .openoffice -> /platspec/DotFiles/cc/.openoffice
lrwxr-xr-x   1 cc   people        25 Aug 22  2006 .qt -> /platspec/DotFiles/cc/.qt

I'll see if it's safe to get rid of these things, and if it is, hopefully that'll help make Gnome work again. (Later.) Stephen reckons it's not only safe but highly desirable to get rid of the platspec stuff. As long as you rescue the contents of whatever it's pointing to of course and use that to replace the platspec links. He also says it was cooked up for the RH9 to FC3 upgrade, so it's even older than I thought! All this ancient rubbish littering my homedir. I'll clean it up then try Gnome on fetlar again. (Later still.) All the platspec files are now gone, their contents saved. The Gnome sessions on both my normal DICE machine and on my SL5 machine both work fine now. I won't trouble you with a picture of my normal gnome session; suffice it to say that I've attempted to de-gnomify it as much as possible. I miss twm. Anyway; eliminating platspec hasn't fixed KDE.
  • I have at least found where the gdm sessions are defined: the resource gdm.sessiondesktopdir has the value /usr/lib/lcfg/conf/gdm/sessions/:/etc/X11/sessions/:/etc/X11/dm/Sessions/:/usr/share/gdm/BuiltInSessions/:/usr/share/xsessions/ and at least one of these directories doesn't exist. It probably needs tweaking.
  • KDE is now fixed.
  • /usr/lib/lcfg/conf/gdm/sessions has two files in it, custom.desktop and default.desktop.
  • /etc/X11/sessions doesn't exist.
  • /etc/X11/dm/Sessions doesn't exist.
  • /usr/share/gdm/BuiltInSessions does exist and contains default.desktop.
  • /usr/share/xsessions exists and contains gnome.desktop, icewm.desktop and kde.desktop.
  • I've spotted one reason why KDE is not starting properly. The kde.desktop file says to start KDE with the command startkde. Look at it:

-bash-3.1$ ls -la /usr/bin/startkde
-rwxr-xr-x 1 root lcfg 0 Aug  3 18:21 /usr/bin/startkde
-bash-3.1$

Zero bytes?! No wonder it's not working. One quick rpm -i --force /pkgs/master/rpms/sl5/distro/kdebase-3.5.4-13.5.el5.i386.rpm later, and:

[root@fetlar cc]# ls -l /usr/bin/startkde
-rwxr-xr-x 1 root root 13277 Aug 17 14:56 /usr/bin/startkde
[root@fetlar cc]#

Logging in with a session type of KDE now starts a convincing-looking KDE session. I wonder why startkde was like that though?
  • Shortening gdm.sessiondesktopdir to /usr/share/gdm/BuiltInSessions/:/usr/share/xsessions/ then restarting gdm seems to be a good move. The session type menu then becomes:

Last session
1. GNOME
2. Default System Session
3. Ice
4. KDE

Each option works properly. Default System Session starts a GNOME session. I think we can say that lcfg-gdm works for us, but we'll pass on to the Infrastructure Unit (which owns lcfg-gdm) our experience with gdm.sessiondesktopdir, and ask for their advice on whether or where to make the change for SL5.
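Since gdm.sessiondesktopdir is just a colon-separated list of directories, the by-hand check of which ones exist could be scripted. A sketch (check_session_dirs is a made-up name; the argument here is the original resource value quoted earlier in this entry):

```shell
#!/bin/bash
# Report which directories in a colon-separated list exist.
check_session_dirs() {
    local IFS=: dir
    for dir in $1; do
        if [ -d "$dir" ]; then
            echo "exists:  $dir"
        else
            echo "missing: $dir"
        fi
    done
}

check_session_dirs "/usr/lib/lcfg/conf/gdm/sessions/:/etc/X11/sessions/:/etc/X11/dm/Sessions/:/usr/share/gdm/BuiltInSessions/:/usr/share/xsessions/"
```

This only shows which directories are present; it says nothing about whether the .desktop files inside them define sensible sessions, which was the other half of the problem.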
  • lcfg-alias has been tested. I tried to add myself as an admin. In panos's profile:

!alias.aliases          mADD(pkritika)
alias.alias_pkritika    _ADMIN

and then:

-bash-3.1$ cat /etc/aliases.lcfg | grep pkritika
pkritika:       root@inf.ed.ac.uk
-bash-3.1$ qxprof alias | grep pkritika
alias_pkritika=root@inf.ed.ac.uk
aliases=root pkritika

  • lcfg-ntp now seems to work fine on fetlar as well:

-bash-3.1$ /usr/sbin/ntpq -c peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
-solti.inf.ed.ac 193.62.22.98     2 u   60  128  377    0.193    1.143   0.723
*jochum.inf.ed.a 193.62.22.90     2 u   52  128  377    0.323    1.637   0.384
+goodall.inf.ed. 193.62.22.74     2 u   56  128  377    0.258    2.029   0.299
+haitink.inf.ed. 193.62.22.98     2 u   63  128  377    0.288    1.788   0.339 
-bash-3.1$ cat /var/lcfg/conf/ntp/ntp.drift
50.916 
-bash-3.1$ ntpstat 
synchronised to NTP server (129.215.144.15) at stratum 3 
   time correct to within 28 ms
   polling server every 128 s

The tests were actually based on George Ross's guidelines:

Any ideas on how to test lcfg-ntp (in the SL5 LCFG port project)?


Use "ntpq -c peers" to see whether it's synchronised or not.  If it stays
synchronised it's probably working.  If it doesn't, it isn't, and will
probably also log things to syslog.  Unless your machine is seriously
broken it'll either work and stay working, or not and stay broken.  Offset
and delay are in ms, and jitter is something like RMS ms; poll is in
seconds, is the time between polls, and will probably ramp up to 512 or
1024 if things are OK; reach is an octal bitmap of the last 8 polls, and
should be 377 if things are OK.

Also have a look at the driftfile, which one of the ntp resources will point
you at.  If it has a number in it within the range +-150 then you're
probably OK, and if it doesn't then you're probably not.

As far as I can remember from years ago, once it's going ntp gives up if your machine goes much more than a second out from the timeservers, and as far as I can see it keeps no logs anyway, so I'm not sure how to go about demonstrating that lcfg-ntp is doing its job properly...?


It gives up if the machine's clock's drift is too far out to correct.  The
component will do an ntpdate (or equivalent) first to get things reasonably
close before firing up ntpd.

There should be stuff getting logged in /var/lcfg/log/ntp.stats/, and more
can be configured if required.
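George's two rules of thumb, reach equal to 377 and a drift value within ±150, are easy to check mechanically. A sketch (the function names are invented for illustration; reach is octal, so 377 means all of the last eight polls succeeded):

```shell
#!/bin/bash
# Count how many of the last 8 polls succeeded, given ntpq's octal reach.
polls_ok() {
    local bits=$(( 8#$1 )) n=0
    while [ "$bits" -gt 0 ]; do
        n=$(( n + (bits & 1) ))
        bits=$(( bits >> 1 ))
    done
    echo "$n"
}

# Succeed if the drift value lies within +-150.
drift_ok() {
    awk -v d="$1" 'BEGIN { exit !(d >= -150 && d <= 150) }'
}

echo "polls answered out of 8: $(polls_ok 377)"
drift_ok 50.916  && echo "drift 50.916 is within range"
drift_ok -32.156 && echo "drift -32.156 is within range"
```

The two drift values are the ones read from /var/lcfg/conf/ntp/ntp.drift on fetlar and panos in these entries.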

  • lcfg-mailng needs a template file for sendmail.mc. This must be in lcfg-mailng/mc/ in the folder fetched from the repository. The template is a copy of /etc/mail/sendmail.mc. After creating it, I submitted it to the repository and made a new release including the template for SL5. I then made a new RPM and was able to start and stop the sendmail daemon:

-bash-3.1$ om mailng start
[OK] mailng: start
-bash-3.1$ ps aux | grep sendmail
root     28991  0.0  0.0  80948  2264 ?        Ss   14:33   0:00 sendmail: accepting connections
smmsp    28999  0.0  0.0  50980  1728 ?        Ss   14:33   0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
pkritika 29013  0.0  0.0  60236   696 pts/0    S+   14:33   0:00 grep sendmail
-bash-3.1$ om mailng stop
[OK] mailng: stop
-bash-3.1$ ps aux | grep sendmail
pkritika 29064  0.0  0.0  60236   696 pts/0    S+   14:33   0:00 grep sendmail

Without the template for SL5, trying to start the component would generate errors:

-bash-3.1$ tail -f /var/lcfg/log/mailng
17/08/07 12:50:51: >> configure
sxprof: can't open template file: /usr/lib/lcfg/conf/mailng/sendmail.tmpl.m4
sxprof: No such file or directory
17/08/07 12:50:51: ** failed to configure /var/lcfg/conf/mailng.mc (see logfile)

16 August 2007

  • lcfg-nsswitch has been tested on fetlar as well. It is working fine, as it does on panos. A new release has been made and copied to the RPM repository, and lcfg_sl5_lcfg.rpms has been updated. The new version is also installed through updaterpms on both fetlar and panos.

  • lcfg-dns tested on fetlar. It works fine. I specified the DNS server in fetlar's profile with !dns.servers          mSET(129.215.216.248). Before and after updating the DNS server:

-bash-3.1$ om dns start
[WARNING] dns: Null , using <129.215.32.241 129.215.202.253>
[OK] dns: start
-bash-3.1$ cat /etc/resolv.conf 
nameserver 129.215.32.241
nameserver 129.215.202.253
search inf.ed.ac.uk
sortlist  129.215.46.0/255.255.255.0 129.215.144.0/255.255.255.0 129.215.41.0/255.255.255.0 129.215.32.0/255.255.255.0
-bash-3.1$ om dns restart
[OK] dns: restart
-bash-3.1$ cat /etc/resolv.conf
nameserver 129.215.216.248
search inf.ed.ac.uk
sortlist  129.215.46.0/255.255.255.0 129.215.144.0/255.255.255.0 129.215.41.0/255.255.255.0 129.215.32.0/255.255.255.0

lcfg-dns is not an MPU component so a new release will not be made.

  • lcfg-openssh has been tested on fetlar. It works fine, as it does on panos. Adding and changing resources:

!openssh.sshdopts        mPREPEND(Port)
!openssh.sshdopt_Port    mSET(1400)
!openssh.sshdopt_X11Forwarding  mSET(no)

And then trying to connect to fetlar:


[rydell]pkritika: ssh fetlar
ssh: connect to host fetlar port 22: Connection refused
[rydell]pkritika: ssh fetlar -p 1400
The authenticity of host 'fetlar (129.215.46.133)' can't be established.
RSA key fingerprint is d7:d1:bd:8a:61:56:7f:aa:5c:e7:e7:2e:c8:97:07:f2.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'fetlar,129.215.46.133' (RSA) to the list of known hosts.
Last login: Thu Aug 16 10:03:23 2007 from rydell.inf.ed.ac.uk
test lcfg-file on sl5 32
-bash-3.1$

lcfg-openssh is not an MPU component so a new release will not be made.

  • A new lcfg-pam release has been made to mark SL5 support, and new packages have been built, submitted and installed on fetlar and panos, and the os headers and lcfg_sl5_lcfg.rpms package list have been adjusted to match. This marks the end of stage 7, so I've marked that in the project doc.
  • pam is not appearing in profile.components on fetlar but it is on panos. So far I can't see why. In the course of investigating this I've taken the temporary SL5 profile.packages settings out of inf/defaults.h because they're now set in inf/options/packages.h. I noticed that ed_sl5_env.rpms was missing so I added that.
  • Panos has solved the fetlar pam mystery. I had made the necessary changes to inf/os/sl5.h but I had forgotten to commit the changed file to subversion. Good grief.
  • The problems with inf/options/openldap-client.h and inf/options/kerberos-client.h have been solved. The problem was the same as with inf/options/afs-client.h - multiple inclusion. Both were already included so further inclusions had no effect. Removing the mREMOVE macros for these components from the inf/os headers made things work OK.
  • I've been trying to find out what's wrong with gdm on fetlar, starting with the odd-looking choice in the session type menu (e.g. "foo"). I noticed an error in each of the /var/log/gdm/ log files: (EE) AIGLX: DRI module not loaded. I wondered if some glx thing was missing from SL5, but rpm -qa|grep -i glx returns the same results on SL5 as on FC5 (glx-utils). Comparing the results of rpm -qa | grep -i xorg (and ignoring version numbers) shows a bit more variation: FC5 is the only one with xorg-x11-xdm and xorg-x11-xkbdata RPMs and SL5 is the only one with xorg-x11-drv-ast, xorg-x11-drv-vmmouse and xorg-x11-server-Xephyr RPMs. The other 104 xorg-x11 RPMs are on both. So, apparently we're no further forward with that yet.
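A comparison like this, ignoring version numbers, can be scripted. A sketch assuming two files holding rpm -qa output, one per platform (the file names, the strip_version helper and the version strings in the sample data are all invented for illustration; rpm -qa --qf '%{NAME}\n' would give bare names directly):

```shell
#!/bin/bash
# Strip the trailing -version-release from each package string, then
# sort so the lists can be fed to comm.
strip_version() {
    sed -E 's/-[^-]+-[^-]+$//' "$1" | sort -u
}

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
# Invented sample data standing in for "rpm -qa | grep -i xorg" output.
printf '%s\n' xorg-x11-xdm-1.0.1-2 xorg-x11-server-Xorg-1.0.1-9 > "$work/fc5.txt"
printf '%s\n' xorg-x11-drv-ast-0.81.0-3 xorg-x11-server-Xorg-1.1.1-48 > "$work/sl5.txt"

strip_version "$work/fc5.txt" > "$work/fc5.names"
strip_version "$work/sl5.txt" > "$work/sl5.names"

echo "only on FC5:"; comm -23 "$work/fc5.names" "$work/sl5.names"
echo "only on SL5:"; comm -13 "$work/fc5.names" "$work/sl5.names"
```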

  • lcfg-ntp has been tested and seems to work fine. Following George Ross's suggestion to Chris on how to test the ntp component, we checked the following values:


-bash-3.1$ /usr/sbin/ntpq -c peers
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+solti.inf.ed.ac 193.62.22.98     2 u  734 1024  377    0.200   -0.698   0.785
-jochum.inf.ed.a 193.62.22.90     2 u  767 1024  377    4.931   -1.538   0.157
*goodall.inf.ed. 193.62.22.90     2 u  830 1024  377    1.913   -0.115   0.765
+haitink.inf.ed. 193.62.22.98     2 u  484 1024  377    0.264   -0.456   0.133
-bash-3.1$ cat /var/lcfg/conf/ntp/ntp.drift 
-32.156
-bash-3.1$ ntpstat
synchronised to NTP server (129.215.41.251) at stratum 3 
   time correct to within 45 ms
   polling server every 1024 s
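
The ntpq check above can also be done mechanically. The following is only a sketch, assuming the usual meaning of the ntpq columns (the leading "*" marks the peer currently selected for synchronisation, and "reach" is an octal register in which 377 means the last eight polls all succeeded); the function names are made up for illustration.

```python
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+solti.inf.ed.ac 193.62.22.98     2 u  734 1024  377    0.200   -0.698   0.785
-jochum.inf.ed.a 193.62.22.90     2 u  767 1024  377    4.931   -1.538   0.157
*goodall.inf.ed. 193.62.22.90     2 u  830 1024  377    1.913   -0.115   0.765
+haitink.inf.ed. 193.62.22.98     2 u  484 1024  377    0.264   -0.456   0.133
"""


def parse_ntpq_peers(text):
    """Parse `ntpq -c peers` output into a list of dicts, one per peer."""
    peers = []
    for line in text.splitlines():
        # Skip blank lines, the "====" ruler and the column header row.
        if not line.strip() or line.startswith("=") or "refid" in line:
            continue
        tally, fields = line[0], line[1:].split()
        if len(fields) != 10:
            continue  # not a peer line in the layout shown above
        peers.append({
            "tally": tally,                 # "*", "+", "-" or " "
            "remote": fields[0],
            "stratum": int(fields[2]),
            "reach": int(fields[6], 8),     # reach is printed in octal
            "offset_ms": float(fields[8]),
        })
    return peers


def looks_synchronised(peers):
    """True if some peer is selected ("*") and fully reachable (377 octal)."""
    return any(p["tally"] == "*" and p["reach"] == 0o377 for p in peers)


if __name__ == "__main__":
    for peer in parse_ntpq_peers(SAMPLE):
        print(peer["tally"], peer["remote"], peer["offset_ms"])
    print("synchronised:", looks_synchronised(parse_ntpq_peers(SAMPLE)))
```

This only inspects the text of one ntpq run; it says nothing about whether the drift file or ntpstat agree.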

15 August 2007

  • I've checked all the lcfg-pam things I can think of on fetlar and all seem to work properly.
  • I've logged in on the console
  • I've logged in remotely with ssh (after starting sshd - we don't yet have lcfg-openssh doing it for us automatically)
  • I've rebooted the machine
  • I've started X
  • I've locked the screen and tried to unlock it with an incorrect password and had it recognised as incorrect
  • I've locked the screen and tried to unlock it with a correct password and had it recognised as correct, and had my Kerberos and AFS tickets renewed
  • I've tried su and had it working correctly
  • I've tried sudo and had it say "cc is not in the sudoers file.  This incident will be reported.", which is what happens when you try it on another LCFG-managed machine too.
  • We now need to try it all on panos too. Once that's done, we can declare lcfg-pam ported to SL5 and make new releases to celebrate; and that will be stage 7 completed ahead of schedule!

  • The same tests were tried on panos as well and the results are the same.

  • lcfg-xfree and lcfg-gdm have been tested. lcfg-xfree works without any problems. I changed the keyboard layout from uk to us; the change was applied successfully and the keys now behave as on a us keyboard.

In the profile:


!xfree.inputopt_kbd_xkblayout   mSET(Option "XkbLayout" "us")

and then:

-bash-3.1$ grep XkbLayout /etc/X11/xorg.conf
   Option "XkbLayout" "us"

lcfg-gdm had some problems. It was generating errors while loading GDM, as it was trying to load happygnome-list as the theme for the login screen, but SL5 has only one theme, named EaseOfBlue. I edited lcfg/defaults/gdm.h to add an entry for SL5 so that it uses EaseOfBlue by default.


#ifdef LINUX_SL5
!gdm.graphtheme                 mSET(EaseOfBlue)
#endif

After fetching the new profile and restarting gdm, no errors were generated. The themes are located in /usr/share/gdm/themes.

The first few times I tried to log in, I got the error "No Exec line in the session file: default.  Running the GNOME failsafe session instead." After rebooting the machine I was able to log in without any errors or warnings. (I suppose this was due to a conflict with the home directory mounted by afs.)

  • I've tried the same lcfg-xfree and lcfg-gdm tests on fetlar too with the same results (minus the errors, because Panos has already fixed the headers).
  • This is the list of gdm sessions we're presented with on login:

Last session
1. GNOME
2. foo
3. Custom
4. Ice
5. KDE

These all work on panos (except foo) but on fetlar KDE doesn't work. This is what happens when I try each session: Trying "GNOME" session I get a normal GNOME login.

Trying "foo" session I get:


No Exec line in the session file: default.  Running
the GNOME failsafe session instead.

Then if I click OK I get this message:

This is the Failsafe GNOME session. You will be logged into the
'Default' session of GNOME without startup scripts being run.
This should only be used to fix problems in your installation.

If I click OK again I get into GNOME.

Trying the "Custom" session I get GNOME.

Trying the "Ice" session I get Icewm.

Trying the "KDE" session I get the "Your session only lasted less than 10 seconds" message. At this point it tells me that the ~/.xsession-errors file contains this:


localuser: cc being added to access control list
No profile for user 'cc' found
Launching a SCIM daemon with Socket FrontEnd...
Loading simple Config module ...
Creating backend ...

What's this foo session anyway? That can't be right, can it? And is there meant to be a failsafe session visible in the menu? Because I can't see it.

  • I'm getting confused about what's been done and what's still to be done, so here's a list of things we need to do:
  • release, build, submit and install lcfg-pam

  • then mark stage 7 as finished.

  • test lcfg-ntp (how? I've asked MP and Infrastructure units for ideas)
  • sort out the problem with inf/options/kerberos-client.h (however lcfg-kerberos is otherwise working fine in client mode)

  • sort out the problem with inf/options/openldap-client.h (however lcfg-openldap is otherwise working fine in client mode)

  • test lcfg-dns, lcfg-nsswitch, lcfg-openssh on fetlar using the tests already done on panos.

  • When tested, release, build, submit, install lcfg-nsswitch, and enable in headers. This is the only MPU-owned component in stage 8.

  • When tested, mark lcfg-dns and lcfg-openssh as done, and enable in headers.

  • When header problems sorted, mark lcfg-kerberos and lcfg-openldap as done, and enable them in headers.

  • When that lot's all done, that'll be stage 8 finished.
  • Sort out the KDE problem on fetlar.
  • Sort out the gdm "foo" problem.
  • Sort out the gdm "failsafe" menu problem (if it is a problem).

14 August 2007

  • I've made progress with lcfg-pam.
  • I edited lcfg/defaults/pam.h to make the FC6 sections apply to SL5 too.
  • /lib/security/pam_afs2.so was needed by the pam configuration. I've built it from our source RPM. It needs gafstoken so I've built that too, also from our source RPM. I submitted both to the ed bucket on fetlar.
  • I've added an SL5 section to inf/options/packages.h so that the ed and dice repositories for sl5 and sl5_64 are added to the list of locations where updaterpms should look for RPMs (the updaterpms.rpmpath resource). I also created a package file packages/ed/sl5_env.rpms. It's shared between sl5 and sl5_64.
  • updaterpms is now happy with this.
  • A warning to anyone else: heed the wise words of those who have gone before you! I forgot Stephen's sage pam advice until it was too late ("Make a backup copy of /etc/pam.d and leave yourself logged in as root before you dare to change anything to do with pam"). I enabled lcfg-pam and started it and instantly screwed up everything. Thankfully I had some working shells and was able to copy the contents of /etc/pam.d from panos to recover.
  • As far as I've been able to tell so far the pam configuration created by lcfg-pam now works. At least, it lets me reboot and login and ssh to fetlar. This is perhaps not a complete test of pam security :-) but it's a definite sign of progress.
  • I'll leave lcfg-pam disabled in inf/defaults.h until Panos gets a chance to check things. For now I've added it to profile.components in lcfg/fetlar only.
  • I've built and submitted pam_afs2 and gafstoken for sl5_64 too, and installed them on panos. When pam is working, these have the effect of automatically getting your AFS token at login time, so you don't have to bother with kinit or aklog to get at your AFS home directory - very handy. This works on fetlar, I've tried it.

13 August 2007

  • Started testing lcfg-kerberos, lcfg-nsswitch and lcfg-openldap this morning. The tests were made by including the headers from lcfg/defaults/, and everything worked fine. For instance, changing the values of some Kerberos settings:

In panos profile:


!kerberos.admin         mSET(test.test)
!kerberos.domain        mSET(test2)
!kerberos.kdcrealm      mSET(kdctest)

and then:

-bash-3.1$ qxprof kerberos | grep test
admin=test.test
domain=test2
kdcrealm=kdctest
-bash-3.1$ cat /etc/krb5.conf | grep test
  admin_server = test.test
  default_domain = test2

Also, testing a new entry to /etc/nsswitch.conf

In panos profile:


!nsswitch.maps          mADD(test)
nsswitch.mods_test      files
!nsswitch.mods_shadow   mSET(ldap)

and then:

-bash-3.1$ qxprof nsswitch | grep -e test -e shadow
maps=passwd group hosts services networks protocols rpc ethers netmasks aliases automount shadow netgroup test
mods_shadow=ldap
mods_test=files
-bash-3.1$ grep -e test -e shadow /etc/nsswitch.conf 
shadow: ldap
test:   files
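
A check like the grep above can be scripted. This is a minimal sketch that reads nsswitch.conf-style text ("database: source source ...", with "#" starting a comment) into a dictionary, so the expected entries can be compared directly; the function name and the sample text are illustrative only.

```python
def parse_nsswitch(text):
    """Map each database name to its list of sources."""
    maps = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and padding
        if ":" not in line:
            continue
        name, sources = line.split(":", 1)
        maps[name.strip()] = sources.split()
    return maps


if __name__ == "__main__":
    sample = "shadow: ldap\ntest:   files\n"
    conf = parse_nsswitch(sample)
    # The two resources set in the profile should both show up in the file.
    print(conf["shadow"], conf["test"])
```

On a real machine the text would come from reading /etc/nsswitch.conf instead of the inline sample.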

For lcfg-openldap I removed the attribute that specifies mounting the AFS home directory. It worked: after fetching the new profile the entry was removed from /etc/ldap.conf.

However, including the headers from lcfg/defaults/ is not the right way. We must include those from inf/options, but then we have the same problem as we did with afs.

  • lcfg-kerberos needs kdcregister to create the host entry in /etc/krb5.keytab, but in order to build the RPM for kdcregister you need kadm-devel. I got kadm-devel from /pkgs/master/srpms/rc6_64/ and built the RPM. After installing that one and kdcregister, as well as lcfg-kerberos and lcfg-openldap, I had to add the new packages to !profile.packages so that updaterpms would not remove them. This has been done for both the 32 and 64 bit versions. I had to add new entries to the headers. For instance, in inf/options/kerberos-client.h I had to add:


#ifdef LINUX_SL5
!kerberos.hostkeyless           mSET(false)
!profile.packages               mADD(kadm-devel-1.5-1.inf.1)
!profile.packages               mADD(kdcregister-1.11-1)
#endif

and in lcfg/options/kerberos.h:


#ifdef LINUX_SL5
!profile.packages               mADD(lcfg-kerberos-2.1.24-1/noarch)
#endif

For lcfg-openldap I had to edit lcfg/options/openldap.h and add:


#ifdef LINUX_SL5
!profile.packages           mADD(lcfg-openldap-3.1.37-1/noarch)
#endif

After that, nothing will be removed from panos:


-bash-3.1$ om updaterpms run -- -t
[OK] updaterpms: run

  • thanks to Alastair for solving two perplexing mysteries!
  • inf/options/afs-client.h was being included but the mutation to profile.components wasn't happening. This was happening because the file was included twice - once in the machine's lcfg file, and once from inf/options/filesystem.h which was being pulled in by inf/defaults.h. So by the time we included it, it had already been included, the mutation had already happened, and because the file contents are protected against multiple inclusion, the second inclusion was having no effect. And it was the first inclusion, not the second one, that was showing up in qxprof -v profile.components, not ours.
  • The lack of messages on the screen at boot time. Editing the grub configuration /etc/grub.conf solved this. Change this line

kernel /vmlinuz-2.6.18-8.1.3.el5 ro root=/dev/VolGroup00/LogVol00 rhgb quiet  console=tty0 console=ttyS0,9600

to this:

kernel /vmlinuz-2.6.18-8.1.3.el5 ro root=/dev/VolGroup00/LogVol00

and the messages appear.
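
The multiple-inclusion mystery described above is easier to see in a toy model. The sketch below is not how the LCFG compiler works internally; it only imitates include guards and list mutations. The include order, and the idea that an os header removes "afs" between the two inclusions, are assumptions made for the illustration.

```python
def expand(name, headers, components, seen):
    """Process header `name`; a guarded header is only ever expanded once."""
    if name in seen:
        return                      # second inclusion: silently does nothing
    seen.add(name)
    for op, arg in headers[name]:
        if op == "include":
            expand(arg, headers, components, seen)
        elif op == "mADD" and arg not in components:
            components.append(arg)
        elif op == "mREMOVE" and arg in components:
            components.remove(arg)


def final_components(top, headers):
    components, seen = [], set()
    expand(top, headers, components, seen)
    return components


# Hypothetical include tree: afs-client.h is reached first via filesystem.h,
# then an os header removes "afs", then the machine file includes it again.
HEADERS = {
    "lcfg/fetlar": [("include", "inf/defaults.h"),
                    ("include", "inf/os/sl5.h"),
                    ("include", "inf/options/afs-client.h")],
    "inf/defaults.h": [("mADD", "client"),
                       ("include", "inf/options/filesystem.h")],
    "inf/options/filesystem.h": [("include", "inf/options/afs-client.h")],
    "inf/options/afs-client.h": [("mADD", "afs")],
    "inf/os/sl5.h": [("mREMOVE", "afs")],
}

if __name__ == "__main__":
    print(final_components("lcfg/fetlar", HEADERS))
```

In this model the mutation in afs-client.h does run, but only at its first inclusion, so anything that undoes it later cannot be repaired by including the file again.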

10 August 2007

I just tried running om updaterpms run -- -t on fetlar:


-bash-3.1$ om updaterpms run -- -t
[INFO] updaterpms: 0 installs, 2 removals
[INFO] updaterpms: Flagging lcfg-afs-0.99.10-1/noarch for deletion
[INFO] updaterpms: Flagging lcfg-amd-0.100.17-1/noarch for deletion
[WAIT] updaterpms: Checking dependencies
[OK] updaterpms: run

Oops! It shouldn't be trying to delete either of those. For lcfg-amd I've now added this to lcfg/options/amd.h:

#ifdef LINUX_SL5
!profile.packages       mADD(lcfg-amd-0.100.17-1/noarch)
#endif /* LINUX_SL5 */

and for lcfg-afs I've added this to inf/options/afs-client.h:

#ifdef LINUX_SL5
/* On SL5 we use the standard AFS packages so we just need the afs component. */
!profile.packages      mADD(+lcfg-afs-0.99.10-1/noarch)
#endif /* LINUX_SL5 */

All is now well on fetlar:

-bash-3.1$ om updaterpms run -- -t
[OK] updaterpms: run

And also on panos after a quick build and submit of lcfg-afs for sl5_64.
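
A dry run like the ones above can be checked by a script before anything is really removed. This is a rough sketch that relies only on the "Flagging ... for deletion" message format visible in the output above; the function names and the idea of a "must keep" prefix list are illustrative.

```python
import re


def flagged_for_deletion(output):
    """Return the package specs that updaterpms says it would delete."""
    return re.findall(r"Flagging (\S+) for deletion", output)


def unexpected_removals(output, keep_prefixes):
    """Packages flagged for deletion whose names start with a prefix we want kept."""
    return [pkg for pkg in flagged_for_deletion(output)
            if pkg.startswith(tuple(keep_prefixes))]


if __name__ == "__main__":
    dry_run = """\
[INFO] updaterpms: 0 installs, 2 removals
[INFO] updaterpms: Flagging lcfg-afs-0.99.10-1/noarch for deletion
[INFO] updaterpms: Flagging lcfg-amd-0.100.17-1/noarch for deletion
[WAIT] updaterpms: Checking dependencies
[OK] updaterpms: run
"""
    print(unexpected_removals(dry_run, ["lcfg-"]))
```

An empty result for the prefixes you care about is the scripted equivalent of seeing only "[OK] updaterpms: run".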

  • lcfg-init (and lcfg-lcfginit as well) was re-installed on panos and included in its profile. The machine got the new LCFG-managed /etc/inittab file and can boot now. I first tried commenting this line out of inittab, as it caused problems on the first attempts:


lo:2345:wait:/usr/lib/lcfg/waitonboot

But then I also tried leaving it in along with the others, and it worked without any problem. The only issue raised is that the splash screen is gone and no messages or other information are given about the stage of the boot process. The machine goes directly to the login prompt. Also, X does not start automatically; you have to run startx after logging in on the console. I suppose that's because we don't make use of the LCFG X-related components yet.

  • I tried editing one entry of the LCFG-managed inittab file to check whether lcfg-init works, and it does. In panos' profile I added !init.entry_lo mSET(lo:2345:wait:/usr/lib/lcfg/waitonboot_test), and after the machine fetched the new profile and reconfigured the init component, /etc/inittab was edited at that specific line.


bash-3.1$ cat /etc/inittab | grep waitonboot
lo:2345:wait:/usr/lib/lcfg/waitonboot_test
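
For readers less familiar with inittab, each entry has four colon-separated fields: id, runlevels, action and process. The sketch below just splits an entry into those fields so that an edit like the one above can be checked field by field; the helper name is made up.

```python
from collections import namedtuple

InittabEntry = namedtuple("InittabEntry", "id runlevels action process")


def parse_inittab_line(line):
    """Split one inittab line into its four fields; None for blanks/comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Split on the first three colons only: the process field may contain more.
    ident, runlevels, action, process = line.split(":", 3)
    return InittabEntry(ident, runlevels, action, process)


if __name__ == "__main__":
    entry = parse_inittab_line("lo:2345:wait:/usr/lib/lcfg/waitonboot_test")
    print(entry.id, entry.runlevels, entry.action, entry.process)
```

Here "wait" means init runs the process once on entering one of the listed runlevels and waits for it to finish, which is why a waitonboot that never returns stops the boot from proceeding.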

  • lcfg-syslog seems to work straight out of the box. It's installed on fetlar. I added it to profile.components then did om syslog start and it started logging to /var/lcfg/log/syslog. Logging in and sending a syslog message with logger produced more syslog messages.

10/08/07 11:36:02: >> start
10/08/07 11:36:02:    configuration changed
Aug 10 11:36:02 fetlar syslogd 1.4.1: restart.
10/08/07 11:36:02:    logrotate changed
Aug 10 11:36:04 fetlar syslogd 1.4.1: restart.klogd: Already running.
Aug 10 11:39:13 fetlar sshd[3408]: Authorized to cc, krb5 principal cc@INF.ED.AC.UK (krb5_kuserok)
Aug 10 11:39:13 fetlar sshd[3408]: Accepted gssapi-with-mic for cc from 129.215.46.78 port 50438 ssh2
Aug 10 11:39:13 fetlar sshd[3408]: pam_unix(sshd:session): session opened for user cc by (uid=0)
Aug 10 11:39:32 fetlar cc: hello, testing, does this make it to the syslog file?
Aug 10 11:39:42 fetlar sshd[3408]: pam_unix(sshd:session): session closed for user cc

I've noted sl5 support for lcfg-syslog, made a new release, and made and submitted RPMs on fetlar and panos. I'm adding it to lcfg_sl5_lcfg.rpms now too. Also added syslog support in inf/os/sl5.h. Panos has added it to inf/os/sl5_64.h.

  • I have tested lcfg-boot by adding auditd for SL5 to the lcfg/defaults/boot.h resources. The boot component successfully read that resource and, after rebooting panos, the script /etc/auditd started successfully.

  • New releases made for lcfg-boot, lcfg-init and lcfg-lcfginit. They are also copied to the RPM repositories for both sl5 and sl5_64. lcfg_sl5_lcfg.rpms has also been updated. After running updaterpms the packages were upgraded to their new versions.

  • I thought about testing lcfg-pam. I hadn't edited anything in pam.h; however, it seems that it needs lots of editing. At first it seemed to work. For example, I wasn't allowed to run su - as it was returning "incorrect password" without my typing anything. Then I wanted to check whether I'd get the same message when trying to open a GUI configuration tool. I couldn't log in locally or by ssh. Fortunately, the ssh session I already had wasn't killed, so I removed pam from sl5_64.h and restored the old /etc/pam.d/ files, which I had saved in my home directory.

  • Using the boot component to start up services, we don't have sshd running at boot time; we have to start it manually. I tried lcfg-openssh and it works fine. Rebooting the machine after declaring it in sl5_64.h and fetching the new profile made sshd start. I also tried changing /etc/ssh/sshd_config through the component's resources, and that works fine as well. I set a new port number in panos' profile:


!openssh.sshdopts        mPREPEND(Port)
!openssh.sshdopt_Port    mSET(1400)

Then checking if the new resources are applied:


[root@panos pkritika]# qxprof openssh | grep 1400
sshdopt_Port=1400
[root@panos pkritika]# cat /etc/ssh/sshd_config | grep 1400
Port 1400

After restarting the component I tried to login:


[rydell]pkritika: ssh panos
ssh: connect to host panos port 22: Connection refused
[rydell]pkritika: ssh panos -p 1400
Last login: Fri Aug 10 15:26:20 2007 from rydell.inf.ed.ac.uk
test lcfg-file on sl5 64
-bash-3.1$ 
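
The "connection refused on 22, accepted on 1400" check above can be reproduced without an interactive login. This is a small sketch using a plain TCP connect; it only shows that something is listening on the port, not that it is sshd, and the host name in the usage lines is simply the one from the test above.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    for port in (22, 1400):
        print(port, port_open("panos", port))
```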

  • lcfg-dns in client mode has been tested and works. I included two entries in panos' profile:


!dns.search             mADD(epcc.ed.ac.uk)
!dns.servers            mADD(129.215.56.230)

And after fetching the new profile:


-bash-3.1$ cat /etc/resolv.conf 
nameserver 129.215.56.230
search inf.ed.ac.uk epcc.ed.ac.uk
sortlist  129.215.46.0/255.255.255.0 129.215.144.0/255.255.255.0 129.215.41.0/255.255.255.0 129.215.32.0/255.255.255.0

The sort list is added to /etc/resolv.conf after using the component.

  • (Chris) I've spent all afternoon editing inf/options/afs-client.h trying to fix a weird problem - so far without success. The file contains !profile.components mADD(afs) but according to qxprof profile.components the profile.components resource does not have "afs" added to its value. However the output of qxprof -v profile.components lists the line number of the line in inf/options/afs-client.h that does the profile.components mutation. The file is included. The client component does run. The line still has no effect even if it's moved to the start of the file, to the end of the file, to the #ifdef LINUX_SL5 section of the file, to other bits of the file. It still has no effect if ten identical mutations all appear in the file. The same mutation when cut and pasted into lcfg/fetlar works perfectly. (TODO)

  • lcfg-init and lcfg-lcfginit and lcfg-boot are all working on fetlar as well as on panos. However although components and rc scripts are started at boot time, the boot messages are not displayed on screen. (TODO)

09 August 2007

  • Yesterday I started writing down the differences between the rc scripts on a plain FC6 machine and on an LCFG-built FC6 machine: in which runlevel each script runs, in what order, how these change when controlled by LCFG, and the similarities to and differences from a normal SL5 machine. An SL5 machine seems to make use of all the components that are controlled by default by LCFG, and the runlevels and order appear to be the same between FC6 and SL5. All the figures are in this file: http://www2.epcc.ed.ac.uk/~pkritika/sl5/rc_scripts.ods I have also identified which scripts are used by FC6 but not by LCFG FC6 and the other way round, as well as the extra scripts in SL5 and the scripts that have been taken out in SL5. Based on these we must create the boot order for SL5 which, in my opinion, will not actually differ from FC6, which means that we may not need to edit anything else in boot.h.

  • Stephen advised us in this morning's meeting to reverse the logic we've been using to add new components to the profiles. What we had done until now was to remove all the components and then add them back one at a time. What Stephen advised was to remove from profile.components only those that we haven't yet tested and aren't using. That also solved the problem we had with boot.h and the rc_* script resources, where we had to re-specify them in the machine's profile. For instance, the sl5_64.h header now has entries like those below for components that are used:


!profile.components     mSET(client)
!profile.components     mADD(inv)
!profile.components     mADD(file)
...

and like these for components that are not used:

!boot.services  mREMOVE(lcfg_alias)
!boot.services  mREMOVE(lcfg_dns)
!boot.services  mREMOVE(lcfg_fstab)
...

  • lcfg-boot seems to work a bit better. I tested the configure method with different runlevels, like this:

om boot configure 5

The component seems to respond correctly, as it stops or starts services according to the runlevel specified.
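
What "according to the runlevel specified" should mean can be sketched from the boot resources. The following assumes, without having checked the component's code, that a service listed in boot.services is wanted in exactly the runlevels named by its levels_<service> resource; the sample values are copied from qxprof output elsewhere in this diary.

```python
def parse_qxprof(text):
    """Turn `qxprof <component>` style key=value lines into a dict."""
    resources = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            resources[key.strip()] = value.strip()
    return resources


def services_for_runlevel(resources, runlevel):
    """Services whose levels_<service> resource includes the runlevel."""
    wanted = []
    for service in resources.get("services", "").split():
        levels = resources.get("levels_" + service, "").split()
        if str(runlevel) in levels:
            wanted.append(service)
    return wanted


if __name__ == "__main__":
    sample = """\
services=lcfg_amd rc_cpuspeed
levels_lcfg_amd=3 4 5
levels_rc_cpuspeed=2 3 4 5
"""
    res = parse_qxprof(sample)
    for level in (2, 5):
        print(level, services_for_runlevel(res, level))
```

Comparing such a list with what is actually running after om boot configure would make the test less dependent on eyeballing.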

  • I've put in deadline dates for the project on the Informatics project pages (https://devproj.inf.ed.ac.uk/project/show/74). Dates are based on us having taken roughly three times as long as the "expert" time estimates for each stage so far. We're slightly ahead of schedule.
  • I've got lcfg-afs working (hooray!). I realised that it should be set up by including inf/options/afs.h so I put the SL5 changes in there. The SL5 changes consist of firstly a comment pointing out that we're using totally standard packages for openafs on SL5 so need no profile.packages changes whatsoever; and secondly these changes to the locations of the openafs rc scripts:

#ifdef LINUX_SL5
!afs.rcscript           mSET(/etc/init.d/afs)
!afs.serverrcscript     mSET(/etc/init.d/afs-server)
#endif /* LINUX_SL5 */

  • I also realised that we should have set up lcfg-amd similarly (using inf/options/amd.h) so I removed amd settings we'd put in elsewhere and included that file in individual profiles.

08 August 2007

  • Chris figured out yesterday that the boot component would not take the resources of the RH scripts defined in the boot.h header. After defining them in the machine's profile, the boot component would accept them:

For example, re-defining rc_cpuspeed in panos' profile


!boot.services  mADD(rc_cpuspeed)

and:


-bash-3.1$ qxprof boot | grep rc_cpuspeed
levels_rc_cpuspeed=2 3 4 5
reboot_rc_cpuspeed=restart
services=lcfg_client lcfg_file lcfg_logserver lcfg_authorize lcfg_updaterpms lcfg_amd lcfg_auth lcfg_cron lcfg_tcpwrappers rc_cpuspeed rc_network rc_irqbalance rc_mdmonitor rc_smartd rc_acpid rc_gpm rc_xfs rc_atd rc_messagebus rc_haldaemon rc_killall rc_halt rc_single rc_reboot rc_readahead_early rc_readahead_later
start_rc_cpuspeed=06 restart
stop_rc_cpuspeed=99 restart
user_rc_cpuspeed=root

What needs to be done is to check the boot component's files and see why, on SL5, it does not accept the resources defined in boot.h.

  • I have started testing lcfg-tcpwrappers, as it is much simpler than lcfg-boot and it gives us one more component tested and released. It works fine. I have redefined the hosts that are allowed to ssh into panos:


!tcpwrappers.allow_sshd         mSET(sshd : fetlar.inf.ed.ac.uk)

And the actual test after /etc/hosts.allow had been updated:


[rydell]pkritika: ssh panos
ssh_exchange_identification: Connection closed by remote host
[rydell]pkritika: ssh fetlar
Last login: Wed Aug  8 10:45:30 2007 from rydell.inf.ed.ac.uk
test lcfg-file on sl5 32
Could not chdir to home directory /home/pkritika: No such file or directory
-bash-3.1$ ssh panos
Could not create directory '/home/pkritika/.ssh'.
Last login: Wed Aug  8 10:50:00 2007 from fetlar.inf.ed.ac.uk
test lcfg-file on sl5 64
-bash-3.1$ 

  • lcfg-updaterpms, updaterpms, lcfg-om, lcfg-tcpwrappers and lcfg-nsu all now support SL5 and have been made and submitted for both platforms. To get lcfg-om to build we needed cvs2cl. I couldn't find this in the SL distro or in ATrpms or in DAG or in EPEL. I eventually resorted to installing the source RPM from FC6 (it's part of FC6 extras, so maybe it'll appear in EPEL eventually) on sl5 and sl5_64. I edited config.mk and ChangeLog and committed the change in both into CVS, then did make release then make rpm to build the new release.

07 August 2007

  • lcfg-updaterpms seems to have run during the night and to have installed lots of packages. No reboot yet to see if anything has broken.
  • As well as all the package installs, the updaterpms run on panos removed the i386 packages. One of them refused to be removed. This is from a subsequent run of updaterpms:

07/08/07 10:04:34:    Flagging librsvg2-2.16.1-1.el5/i386 for deletion
ls: /etc/gtk-2.0/i?86*: No such file or directory
/usr/bin/update-gdk-pixbuf-loaders: line 44: /etc/gtk-2.0/i686-redhat-linux-gnu/gdk-pixbuf.loaders: No such file or directory
07/08/07 10:04:53: ++ %postun(librsvg2-2.16.1-1.el5.i386) scriptlet failed, exit status 1
07/08/07 10:04:53: ++  (code 7210499)
   %postun(librsvg2-2.16.1-1.el5.i386) scriptlet failed, exit status 1
07/08/07 10:04:53: ** rpmtsRun failed
07/08/07 10:04:53: ++ updaterpms failed
07/08/07 10:04:53: ++ updaterpms generated some warnings

In order to execute the postuninstall script for librsvg2-2.16.1-1.el5/i386 it needs to have gtk2-2.10.4-16.el5/i386 installed, by the look of it. This seems broken to me. I removed librsvg2-2.16.1-1.el5.i386 like this:

[root@panos cc]# rpm --noscripts -e librsvg2-2.16.1-1.el5.i386
[root@panos cc]#

I'm reasoning that it won't be a problem when we're finished because librsvg2-2.16.1-1.el5.i386 will never be installed in the first place on our machines.
  • Both panos and fetlar have rebooted cleanly after their updaterpms adventures. On fetlar there is no boot component running yet, so I had to remember to login and start up the client component explicitly after booting had finished. I also started up amd though I forgot to check beforehand whether it had started up automatically or not (it did shut down automatically as part of the reboot).

-bash-3.1$ om client start
[FAIL] client: failed to start component; already running
-bash-3.1$ ps axuww|grep xprof
cc        2832  0.0  0.1   3884   684 pts/0    S+   11:03   0:00 grep xprof
-bash-3.1$ om client stop
[WARNING] client: no client process (18992)
[OK] client: stop
-bash-3.1$ om client start
[OK] client: start
-bash-3.1$ ps axuww|grep xprof
root      2931  6.0  1.4  10148  7480 ?        Ss   11:03   0:00 /usr/bin/perl /usr/sbin/rdxprof -R -d -C client -p 10m+30s -A 7s+30s -u http://lcfg1.inf.ed.ac.uk/profiles,http://lcfg3.inf.ed.ac.uk/profiles -n -v -a
cc        2945  0.0  0.1   3884   684 pts/0    S+   11:03   0:00 grep xprof
-bash-3.1$ nsu
[root@fetlar cc]# /etc/init.d/amd start
Starting amd: 2973
                                                           [  OK  ]
[root@fetlar cc]# 

LCFG thought that the client component was still running because it hadn't been cleanly shut down as part of the reboot (the boot component also does the job of shutting down LCFG components when necessary, and fetlar has no boot component yet). Officially stopping the client component cleared this erroneous status, then I was able to start it up. Searching for "xprof" should show the client component's rdxprof process when the client component is running.
  • I tried running updaterpms on fetlar. The correct versions of lcfg-file and lcfg-auth and lcfg-openssh were missing. I built them all and submitted lcfg-openssh to the autolcfg bucket (as it hasn't yet been tested) and the others to the lcfg bucket. Updaterpms now runs cleanly again.
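
The stale "already running" status of the client component described above is the classic leftover-PID problem. As an illustration only (this is not LCFG's own code, and it assumes the status is tracked through a recorded process id, as the "no client process (18992)" warning suggests), a recorded PID can be tested for liveness like this:

```python
import os


def pid_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)          # signal 0 performs the check without signalling
    except ProcessLookupError:
        return False             # no such process: the recorded PID is stale
    except PermissionError:
        return True              # process exists but belongs to another user
    return True


if __name__ == "__main__":
    print(pid_alive(os.getpid()))
```

A PID can of course be reused by an unrelated process after a reboot, so a check like this can only show that a record is stale, never prove that it is still valid.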

  • The problem with lcfg-init persists. First of all, the message that was displayed is generated by /usr/lib/lcfg/waitonboot. lcfg-init replaces the /etc/inittab file, and the new inittab file contains the line (from init.entry_lo) lo:2345:wait:/usr/lib/lcfg/waitonboot. Commenting this out let the system actually boot, though not in a normal way: neither the system's RH boot scripts nor the lcfg ones were executed. After trying the lcfg-init component today, there was no splash screen on boot and no dmesg output. The screen hangs for a while at the ata2 error messages right after the kernel is loaded. It also takes a long time to boot with lcfg-init (and waitonboot commented out). However, the splash screen and dmesg are also gone when lcfg-init isn't used, which didn't happen yesterday. At least the system boots faster without lcfg-init for now. Running qxprof on the boot component shows that it doesn't handle the Red Hat scripts. So maybe that is why the RH scripts didn't execute when lcfg-init was used.


-bash-3.1$ qxprof boot | grep start_rc | wc -l
      0

While on an FC6 LCFG built machine:


[lochranza]pkritika: qxprof boot | grep start_rc | wc -l
20

and


[lochranza]pkritika: qxprof boot | grep start_rc
start_rc_acpid=44 restart
start_rc_atd=100 restart
start_rc_cpuspeed=06 restart
start_rc_gpm=100 restart
start_rc_haldaemon=98 restart
start_rc_halt=1 restart
start_rc_irqbalance=13 restart
start_rc_killall=0 restart
start_rc_mdmonitor=15 restart
start_rc_messagebus=97 restart
start_rc_network=10 restart
start_rc_nfslock=14 restart
start_rc_portmap=13 restart
start_rc_psacct=99 restart configure
start_rc_readahead_early=04 restart
start_rc_readahead_later=96 restart
start_rc_reboot=1 restart
start_rc_single=0 restart
start_rc_smartd=96 restart
start_rc_xfs=90 restart
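
To compare boot order between machines it helps to sort these resources numerically rather than alphabetically. The sketch below assumes the leading number in each start_<service> value is an ordering priority, in the manner of SysV "S" numbers (an assumption, not something confirmed here); the sample is a subset of the lochranza output above.

```python
def start_order(lines):
    """Return service names sorted by the leading number of start_<service>."""
    entries = []
    for line in lines:
        key, sep, value = line.partition("=")
        if not sep or not key.startswith("start_"):
            continue
        priority = int(value.split()[0])      # "06 restart" -> 6
        entries.append((priority, key[len("start_"):]))
    return [name for priority, name in sorted(entries)]


if __name__ == "__main__":
    sample = [
        "start_rc_network=10 restart",
        "start_rc_atd=100 restart",
        "start_rc_cpuspeed=06 restart",
        "start_rc_xfs=90 restart",
    ]
    print(start_order(sample))
```

Sorting the values as plain text would put 100 before 90, which is why the number is converted first.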

  • lcfg-nsu has been tested and works. I specified a new user in panos' profile to be allowed to use nsu. A new version of the component has been released and committed to CVS, the RPMs have been copied to the RPM repository, and the lcfg_sl5_lcfg.rpms list has also been updated.


!nsu.access             mADD(pkritika)
nsu.access_pkritika     allow 

and then:


-bash-3.1$ hostname
panos.inf.ed.ac.uk
-bash-3.1$ whoami
pkritika
-bash-3.1$ nsu
[root@panos home]# whoami
root

06 August 2007

  • lcfg-amd has been tested and seems to work fine. The configuration file /var/lcfg/conf/amd/amd.conf can be edited as normal through the machine's profile. For example, I added these entries to panos' profile:


!amd.maplist            mADD(platspec)
amd.name_platspec       platsepc
amd.path_platspec       /var/lcfg/conf/amd/amd.platspec.map
amd.type_platspec       file

!file.files             mADD(platspec)
file.file_platspec      /var/lcfg/conf/amd/amd.platspec.map
file.type_platspec      literal
file.tmpl_platspec      /var/lcfg/conf/amd/amd.platspec.map

!amd.maplist            mADD(platform)
amd.name_platform       platform
amd.path_platform       /var/lcfg/conf/amd/amd.platform.map
amd.type_platform       file

!file.files             mADD(platform)
file.file_platform      /var/lcfg/conf/amd/amd.platform.map
file.type_platform      literal
file.tmpl_platform      /var/lcfg/conf/amd/amd.platform.map

The component's methods can also be called using om.

  • lcfg-auth has also been tested. I specified a new password in its resources and it was successfully accepted. Also, new users and groups can be specified in the profile and they will then be added to /etc/passwd and /etc/group (done already with the avahi user and group).

  • lcfg-boot has been tested by changing the values of resources, specifically those for lcfg-amd and lcfg-auth. The resources are changed successfully:


!boot.start_lcfg_amd    mSET(80 restart)
!boot.levels_lcfg_amd   mSET(3 4 5)
!boot.reboot_lcfg_auth  mSET(start)
!boot.user_lcfg_auth    mSET(pkritika)

and:


-bash-3.1$ qxprof boot | grep -e start_lcfg_amd -e levels_lcfg_amd -e reboot_lcfg_auth -e user_lcfg_auth
levels_lcfg_amd=3 4 5
reboot_lcfg_auth=start
reboot_lcfg_authorize=restart
start_lcfg_amd=80 restart
user_lcfg_auth=pkritika

TODO: add to and/or edit the boot.h header with all the appropriate entries for SL5

  • lcfg-cron is in panos' profile and its resources are installed. At first sight it seems to work fine. A new user was added to cron.allow:

in the profile:


!cron.allow             mADD(cc)

having as results:


-bash-3.1$ qxprof cron | grep cc
allow=cc
-bash-3.1$ cat /etc/cron.allow 
cc

Trying to list crontabs entries for pkritika:


-bash-3.1$ crontab -l
You (pkritika) are not allowed to use this program (crontab)
See crontab(1) for more information
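
The refusal above is what the usual cron.allow/cron.deny rules predict: once cron.allow exists, only the users listed in it may use crontab. A small sketch of that decision, with lists standing in for the two files (None meaning the file does not exist) and a made-up function name:

```python
def may_use_crontab(user, allow=None, deny=None):
    """Apply the cron.allow / cron.deny rules described in crontab(1).

    allow and deny are lists of user names, or None if the file is absent.
    Returns None when neither file exists, since that case is site-dependent.
    """
    if allow is not None:
        return user in allow          # cron.allow wins when it exists
    if deny is not None:
        return user not in deny
    return None


if __name__ == "__main__":
    # The situation tested above: cron.allow contains only "cc".
    print(may_use_crontab("cc", allow=["cc"]))
    print(may_use_crontab("pkritika", allow=["cc"]))
```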

  • New releases made for lcfg-amd, lcfg-auth and lcfg-cron. The new versions of the RPMs have also been copied to the RPM repository and lcfg_sl5_lcfg.rpms has also been updated.

  • lcfg-etcservices was actually tested a few weeks ago when we released a new version with the services.tmpl_sl5 template file for SL5 services. That did work, as the etcservices component adds two new lcfg services to the file. I just tried adding one more entry to see if it still works, and it does (no new release needs to be made):

in the profile:


!etcservices.extras             mADD(test)
etcservices.service_test        test    1000/test       test

and the /etc/services after fetching the new profile:


-bash-3.1$ cat /etc/services | grep 1000/test
test    1000/test       test

  • lcfg-updaterpms is now runnable on panos. The lcfg_sl5_64_base.rpms list has been sorted out. On Stephen's advice I tried to limit it to purely x86_64 packages. I had to make a few changes:
    • Exclude all i386, i586 and i686 packages. They're still in lcfg_sl5_64_base.rpms but commented out.
    • lcfg-file, perl-W3C-SAX-XmlParser and perl-W3C-Itol-Basekit were missing. I built them from SRPM and did rpmsubmit.
    • lcfg-openssh-1.0.7 was missing. It turned out that this was because it'd been upgraded to 1.0.9 recently. I changed lcfg_sl5_lcfg.rpms to match, making sure that /pkgs/master/rpms/sl5/autolcfg/ had a copy of the new version too.
    • For some reason gcc needs both 32 bit and 64 bit versions of libgcc. This isn't the case on Fedora. I've kept both x86_64 and i386 versions of libgcc in the lcfg_sl5_64_base.rpms list in case this is intentional and not a horrible mistake.
    • The following RPMs are not included in lcfg_sl5_64_base.rpms as their filenames end in .r or .rp instead of .rpm. Why are the filenames wrong? This isn't the case for 32 bit. A mistake? Part of a cunning plan? Anyway, until it gets sorted (TODO) I've commented out both kernel-module-madwifi and madwifi.
      • kernel-module-madwifi-2.6.18-8.1.3.el5xen-0.9.3-10.sl5.x86_64.rp
      • kernel-module-openafs-2.6.18-8.1.3.el5xen-1.4.4-42.SL5.x86_64.rp
      • kernel-module-madwifi-hal-2.6.18-8.1.3.el5-0.9.3-10.sl5.x86_64.r
      • kernel-module-ndiswrapper-2.6.18-8.1.3.el5xen-1.41-1.SL.x86_64.r
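All four of the names listed above are exactly 64 characters long, which hints that they were cut off by a file name length limit somewhere on the way into the repository. The 64-character limit of Joliet ISO images would be one candidate, but that is a guess: the diary does not say how the files were copied. A small Python sketch for spotting such names in a directory listing; is_truncated_rpm is an illustrative helper, not an existing tool.

```python
import re

# File names copied from the list above.
NAMES = [
    "kernel-module-madwifi-2.6.18-8.1.3.el5xen-0.9.3-10.sl5.x86_64.rp",
    "kernel-module-openafs-2.6.18-8.1.3.el5xen-1.4.4-42.SL5.x86_64.rp",
    "kernel-module-madwifi-hal-2.6.18-8.1.3.el5-0.9.3-10.sl5.x86_64.r",
    "kernel-module-ndiswrapper-2.6.18-8.1.3.el5xen-1.41-1.SL.x86_64.r",
]


def is_truncated_rpm(name):
    """True if the name ends in a cut-off '.rpm' suffix ('.r' or '.rp')."""
    return re.search(r"\.(r|rp)$", name) is not None


for name in NAMES:
    # Print the length next to each name so a common cut-off point stands out.
    print(len(name), is_truncated_rpm(name), name)
```

If the length limit is the cause, any RPM whose full file name exceeds 64 characters would be affected, which would explain why only the long kernel-module names show the problem.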

  • I tried to test lcfg-init. The profile gave no errors and the resources were installed fine, but as soon as I started the component it killed X and this message appeared on the screen:

[WAIT]boot: Wait on boot finishing....

but it wouldn't do anything further. I rebooted the machine and at first it seemed to work, as the components were being started (visible in the splash screen details), but it stuck again on the boot component with the same message. I then had to copy the original inittab from another SL5 machine to my USB flash disk, boot from the SL DVD in rescue mode and replace the inittab file. After that the system was back to normal. Maybe the problem lies in the boot sequence of the components. Needs to be checked.

03 August 2007

  • Trying to get lcfg-amd working. First, I had a look at the inf/options/amd.h header to see how I could prevent it overwriting the /home directory. I saw that OS-specific definitions existed for amd.gvariables and amd.gvar_osver, so I added new ones for SL5:


#ifdef LINUX_SL5
!amd.gvariables         mADD(osver)
amd.gvar_osver          osver = SL5
#endif 

There was also a section saying that if the system is FC6 the home link should be removed. I added SL5 there as well:


#if defined(LINUX_FC6) || defined(LINUX_SL5)
!file.files             mREMOVE(home)
#endif

After that, everything would be applied and work as it should but the /home directory was still there, mounted locally.

  • lcfg-amd seems to work fine. I added new amd entries to the panos profile and /var/lcfg/conf/amd/amd.conf was successfully updated. The component's methods are also working fine.

  • The lcfg-auth component somehow removed the avahi user and group from /etc/passwd and /etc/group respectively. That caused avahi-daemon to fail at boot. I added new auth resources to the panos profile to solve this:

!auth.extrapasswd       mADD(avahi)
auth.pwent_avahi        avahi:x:70:70:Avahi daemon:/:/sbin/nologin

!auth.extragroup        mADD(avahi)
!auth.grpent_avahi      mADD(avahi:x:70:) 

After that, avahi-daemon was able to start as normal. Maybe we should edit the lcfg/defaults/auth.h header so that this user and group are created if the system is SL5.

  • lcfg_sl5_lcfg.rpms has been updated: all the components that have new releases copied to the RPM repository are now marked as MPU rather than MPU - autobuild.

  • I've got updaterpms running on fetlar (32 bit). It's currently installing 1488 packages. To get it to work I had to do this:
    • remove lcfg_sl5_optional.rpms from profile.packages in inf/defaults.h
    • make lcfg_sl5_base.rpms a complete list of packages in the SL5 distro.
    • from the list, remove any i386 packages that had i686 equivalents, as the i686 ones will be far more efficient
    • mark any i586 or i686 or noarch packages with their arch after the package name
    • careful though: if you want packages with the same name but multiple architectures installed (you can do this on 64 bit) then you have to specify the architecture before the RPM name rather than after it
    • I also had to build and submit a number of LCFG prerequisites and add them to lcfg_sl5_lcfg.rpms.
    • This was all interleaved with repeated runs of om updaterpms run -- -t, checking /var/lcfg/log/updaterpms for details of the conflicts, until there were no conflicts left.
    • I also had to remove a package called R-devel for having a dependency on a nonexistent package (Xfree86-devel). Maybe this is fixed in updates. At some point we'd better start regularly collecting updates and adding them to sl5 updates package files. Maybe Stephen's getupdates script will be handy for this. (TODO)
    • I created an ed/ed_sl5_env.rpms list and put our locally patched openssh RPMs in it. I added it to profile.packages in inf/defaults.h too.
    • We still need to do most of this for x86_64 and panos.
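The two list-cleaning rules above (drop i386 packages that have an i686 equivalent, and mark the arch after the package name) can be sketched in Python as follows. This is only an illustration: the package names and versions are placeholders, the function names are made up, and the name-version/arch form is assumed from package list entries quoted elsewhere in this diary (for example perl-XML-Simple-2.14-4.fc6/noarch).

```python
def drop_redundant_i386(packages):
    """Remove i386 entries for which an i686 build of the same package exists.

    packages is a list of (name, version, arch) tuples.
    """
    has_i686 = {name for name, _version, arch in packages if arch == "i686"}
    return [
        (name, version, arch)
        for name, version, arch in packages
        if not (arch == "i386" and name in has_i686)
    ]


def format_entry(name, version, arch, default_arch="i386"):
    """Format one list entry, marking the arch only when it is not the default."""
    entry = f"{name}-{version}"
    return entry if arch == default_arch else f"{entry}/{arch}"


# Placeholder data, not real SL5 packages.
packages = [
    ("pkg-a", "1.0-1", "i386"),
    ("pkg-a", "1.0-1", "i686"),
    ("pkg-b", "2.3-4", "i386"),
    ("pkg-c", "0.9-2", "noarch"),
]

for pkg in drop_redundant_i386(packages):
    print(format_entry(*pkg))
```

The multi-architecture case mentioned in the list (arch given before the RPM name) is not covered here, since the diary does not show that syntax.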

  • lcfg-boot was added to the panos profile and installed on the system. The components now start normally at boot time. The boot.h header still needs to be checked to see what must be added or edited for SL5.

02 August 2007

(Chris) I've been trying to get rpmsubmit working. I installed am-utils to give us access to amd; I copied /var/lcfg/conf/amd/amd.conf from a DICE machine to /etc/amd.conf on fetlar, and changed its LDAP server from 127.0.0.1 to infdir.inf.ed.ac.uk. amd then started up OK. (And /home did not disappear - that unlucky event must be down to lcfg-amd rather than amd itself?) However when I tried to access /pkgs/master/rpms I wasn't allowed to. The trouble was that the repositories are exported only to machines which are members of netgroup HOST_managed. Our two machines are not in this netgroup. I found in one of the defaults headers that a machine is added to HOST_managed using roles.netgroups. Adding roles to our machines' profile.components resource adds the machines to the netgroup. After a quick om openldap kick on the ldap server and the repository server, we can ls /pkgs/master/rpms on fetlar. However rpmsubmit still doesn't work. Something more is needed.


-bash-3.1$ /usr/sbin/rpmsubmit -B lcfg ~/Linux/RPMS/noarch/lcfg-client-2.2.30-1.noarch.rpm 
Copying SRPM /home/cc/Linux/SRPMS/lcfg-client-2.2.30-1.src.rpm to /pkgs/master/srpms/sl5/lcfg
cp: cannot create regular file `/pkgs/master/srpms/sl5/lcfg/lcfg-client-2.2.30-1.src.rpm': Permission denied
rpmsubmit: failed to copy /home/cc/Linux/SRPMS/lcfg-client-2.2.30-1.src.rpm to /pkgs/master/srpms/sl5/lcfg - Operation not permitted
Copying RPM /home/cc/Linux/RPMS/noarch/lcfg-client-2.2.30-1.noarch.rpm to /pkgs/master/rpms/sl5/lcfg
cp: cannot create regular file `/pkgs/master/rpms/sl5/lcfg/lcfg-client-2.2.30-1.noarch.rpm': Permission denied
rpmsubmit: failed to copy /home/cc/Linux/RPMS/noarch/lcfg-client-2.2.30-1.noarch.rpm to /pkgs/master/rpms/sl5/lcfg - Operation not permitted
Creating dependency file for lcfg-client-2.2.30-1.noarch.rpm
genhdfile: failed to open /pkgs/master/rpms/sl5/lcfg/lcfg-client-2.2.30-1.noarch.rpm - No such file or directory
rpmsubmit: failed to create dependency file for lcfg-client-2.2.30-1.noarch.rpm - Operation not permitted
-bash-3.1$ 

  • The home directory on panos that held all the RPMs and components has been restored to its original state. The packages that were built from source RPMs have been rebuilt.
  • panos is also able to see the contents of /pkgs/master but has the same problem as on fetlar when trying to submit packages.
  • lcfg-auth was given a basic test on panos. I removed the manually added lcfg group from /etc/group to check whether lcfg-auth would create the group (as it is supposed to), and it did. A full test of this and the rest of the components will come after finishing step 6 and copying all the appropriate RPMs to the repository.
  • rpmsubmit now works! Stephen gave me the final missing piece of the puzzle.
This is what rpmsubmit is like on panos:

-bash-3.1$ ls -l /usr/sbin/rpmsubmit
-r-xr-xr-x 1 root root 20000 Jul 18 13:20 /usr/sbin/rpmsubmit
-bash-3.1$ 

and that's what it was like on fetlar too. What you need to do is add the auth component to profile.components. This adds the "linux" account to /etc/passwd. THEN you can install rpmsubmit. If you install it with the linux account in /etc/passwd, rpmsubmit gets the correct ownership. If you install it with no local "linux" account, it gets owned by root, and doesn't work properly. So I just reinstalled rpmsubmit:

[root@fetlar SPECS]# rpm -i --force ~cc/Linux/RPMS/i386/rpmsubmit-0.99.8-1.i386.rpm
[root@fetlar SPECS]# ls -l /usr/sbin/rpmsubmit
-r-sr-xr-x 1 linux local 11328 Jul 18 17:17 /usr/sbin/rpmsubmit
[root@fetlar SPECS]# 

and it now works!

-bash-3.1$ /usr/sbin/rpmsubmit -B lcfg ~/Linux/RPMS/noarch/lcfg-client-2.2.30-1.noarch.rpm
Copying SRPM /home/cc/Linux/SRPMS/lcfg-client-2.2.30-1.src.rpm to /pkgs/master/srpms/sl5/lcfg
Copying RPM /home/cc/Linux/RPMS/noarch/lcfg-client-2.2.30-1.noarch.rpm to /pkgs/master/rpms/sl5/lcfg
Creating dependency file for lcfg-client-2.2.30-1.noarch.rpm
-bash-3.1$ 
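Comparing the two ls -l listings above, the working copy differs in two ways: it is owned by linux rather than root, and its mode shows the setuid bit (the s in -r-sr-xr-x). A hedged Python sketch that checks both on an ls -l line; rpmsubmit_looks_usable is a hypothetical helper, and treating these two properties as sufficient is an assumption drawn from the transcripts, not from the rpmsubmit documentation.

```python
def rpmsubmit_looks_usable(ls_line, want_owner="linux"):
    """Check an 'ls -l' line for the setuid bit and the expected owner."""
    fields = ls_line.split()
    mode, owner = fields[0], fields[2]
    # Character 3 of the mode string is the owner's execute slot;
    # 's' there means execute plus setuid.
    setuid = mode[3] == "s"
    return setuid and owner == want_owner


# Lines copied from the transcripts above.
BEFORE = "-r-xr-xr-x 1 root root 20000 Jul 18 13:20 /usr/sbin/rpmsubmit"
AFTER = "-r-sr-xr-x 1 linux local 11328 Jul 18 17:17 /usr/sbin/rpmsubmit"

print(rpmsubmit_looks_usable(BEFORE))  # False
print(rpmsubmit_looks_usable(AFTER))   # True
```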

  • Panos noticed that the rpmlist files in the repository directories were not being updated when we did an rpmsubmit. It seems that rpmsubmit does not update the rpmlist file. Instead, all the rpmlist files are updated every minute by a cron job on the repository server pezenas. The script which does this is /disk/rpms/bin/refreshrpms. It's not under the control of RPM or LCFG - you just have to edit it. I've added the sl5 and sl5_64 trees to it, and the rpmlist files in those directories are now staying up to date.

  • As rpmsubmit works fine, a new release has been made in the CVS repository, a new RPM has been copied to the RPM repository and the lcfg_sl5_lcfg.rpms list has been updated with the new version of rpmsubmit.

01 August 2007

  • lcfg-updaterpms reports thousands of conflicts because the rpms list is wrong. Chris's Perl script for converting the comps.xml file to an rpms list seems to be correct. However, comparing the packages in comps.xml with those actually installed shows mismatches: some packages marked as default or mandatory are not installed, some marked as optional are installed, and lots of installed packages are either optional or not listed at all.
(Chris) It looks as if my assumption that comps.xml contained a comprehensive list of the packages in the base distribution was wrong. Quite a few packages are missing from comps.xml ("grep" for instance). We think this may explain most of the thousands of errors which updaterpms is throwing up.
I've found a short explanation of the XML metadata files in the yum repository here: http://linux.duke.edu/projects/metadata It looks as if "primary.xml" might be the file that contains a list of all the files. At this stage we could take either of two approaches:
  1. Abandon this troublesome split in the distribution packages between those installed by default and those not installed by default - and replace both of our "base" and "optional" sl5 package lists with one single one which we'd make from the primary.xml file and which would contain the whole of the distribution. This is the approach taken by LCFG porters for various Fedora releases: just install the whole base distribution as it's easier.
  2. Persist with the split. I'm not sure how we'd construct the "base" and "optional" lists accurately if we did this though. Maybe there's something in the metadata?
    If we could use yum instead of updaterpms it would be a lot easier.
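The comparison between comps.xml and the installed packages described above can be sketched in Python like this. The XML is a tiny stand-in with placeholder package names (pkg-a, pkg-b, pkg-c). The installed set is hard-coded here, where in practice it would come from something like rpm -qa. packages_by_type and compare are illustrative names, not part of Chris's script.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for comps.xml; pkg-a, pkg-b and pkg-c are placeholder names.
COMPS_XML = """\
<comps>
  <group>
    <id>base</id>
    <packagelist>
      <packagereq type="mandatory">pkg-a</packagereq>
      <packagereq type="default">pkg-b</packagereq>
      <packagereq type="optional">pkg-c</packagereq>
    </packagelist>
  </group>
</comps>
"""


def packages_by_type(comps_xml):
    """Group the package names in a comps.xml document by packagereq type."""
    by_type = {}
    for req in ET.fromstring(comps_xml).iter("packagereq"):
        by_type.setdefault(req.get("type"), set()).add(req.text.strip())
    return by_type


def compare(comps_xml, installed):
    """Report the three kinds of mismatch described in the entry above."""
    by_type = packages_by_type(comps_xml)
    listed = set().union(*by_type.values())
    expected = by_type.get("mandatory", set()) | by_type.get("default", set())
    return {
        "installed_but_unlisted": installed - listed,
        "expected_but_missing": expected - installed,
        "optional_but_installed": by_type.get("optional", set()) & installed,
    }


# "grep" stands for a package that is installed but absent from comps.xml.
report = compare(COMPS_XML, {"pkg-a", "pkg-c", "grep"})
for key, names in report.items():
    print(key, sorted(names))
```

A large installed_but_unlisted set would support the idea that comps.xml is not a complete package list for the distribution.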

  • I added lcfg-amd to the panos profile and started testing the component. At first it wouldn't work, as amd.h is not a default header, so I had to include it from lcfg/inf/options rather than just specifying it in the sl5_64.h header. Unluckily, although the component seemed to work, its default configuration in the amd.h header uses the file component to create new files and links. So it linked the /home directory to /amd/nethome, where there was nothing, and I couldn't access my local directories under /home. I couldn't find any way to remove the link and restore the directory to its original state. The lcfg-sl5 directory under /home held all the LCFG component packages and dependencies. I have already rebuilt all the LCFG component RPMs and re-run the tests. Fortunately, the results are the same. I also noticed that the tests for lcfg-client only pass if you run them as a normal user, and the tests for lcfg-syslog only if you run them as root. The important thing to do now is to take the source RPMs from fetlar and build those packages again on panos, so we can be sure that the right RPMs exist in the RPM repository.

31 July 2007

  • The problem with om on fetlar was that the group lcfg didn't exist in /etc/group. I suppose that one of the components we will use in the next stage will replace the existing group file with a correct one. Probably, and hopefully, this will be lcfg-auth.

  • AMD is configured on both fetlar and panos. However, we don't have access to the rpm repository as it is not yet exported for these machines. We need this to use rpmsubmit.

  • With lcfg-updaterpms in the profile and its resources installed, I ran a test with:


om updaterpms run -- -t

which makes it pretend to update the RPMs without actually making any changes. In this mode it still reports that it's deleting and installing RPMs even though it isn't. At first the test was unable to run because rpmlists were missing from the sl5_64 rpm directories in the repository. After creating those lists, updaterpms would start and flag packages but then fail because of 4870 conflicts. From a quick look through the logfiles it seems that something strange is happening, as there are packages that ask for "bash", "grep" and other things that are already installed.

30 July 2007

  • Started testing the lcfg-authorize component. Trying to run om as a normal user I got this error:

-bash-3.1$: om client configure
Can't do setuid (cannot exec sperl)

I then installed perl-suidperl to make it work:


yum install perl-suidperl

The component has been tested and works fine.

TODO: We'd better mark perl-suidperl in the lcfg package list as a core prerequisite.

With the user pkritika set up to run all the om methods:


-bash-3.1$ whoami
pkritika
-bash-3.1$ om client restart
[OK] client: restart

A new release has been made in the CVS repository.

  • Have been testing the commands provided by lcfg-utils and they work fine. A new release has been made and uploaded to the CVS repository.
  • lcfg-ngeneric has also been tested. In fact it is tested whenever the components' methods are tested, as these methods are provided by ngeneric. A new release has also been made and uploaded to the CVS repository.

  • lcfg-updaterpms is being tested. It gets the correct profile and rpms list:


[root@panos components]# qxprof updaterpms | grep -e rpmcfg -e rpmpath
rpmcfg=/var/lcfg/conf/profile/rpmcfg/panos
rpmpath=http://master.rpms.inf.ed.ac.uk/master/rpms/sl5_64/distro,http://master.rpms.inf.ed.ac.uk/master/rpms/sl5_64/updates,
http://master.rpms.inf.ed.ac.uk/master/rpms/sl5_64/extras,http://master.rpms.inf.ed.ac.uk/master/rpms/sl5_64/autolcfg,
http://master.rpms.inf.ed.ac.uk/master/rpms/sl5_64/lcfg,

However, when starting the component it fails to fetch the rpmlist from distro directory:


[root@panos components]# om updaterpms start
[ERROR] updaterpms: failed to fetch rpmlist http://master.rpms.inf.ed.ac.uk/master/rpms/sl5_64/distro/rpmlist : 403
[WARNING] updaterpms: updaterpms failed
[OK] updaterpms: start

  • lcfg-utils, lcfg-ngeneric, lcfg-client, lcfg-inventory and lcfg-logserver seem to work fine on fetlar. There is a problem with running om, though. Whenever we try to invoke a component's method through om it fails with the error:


[root@fetlar components]# om file configure
Component group does not exist

Note: this also fails for users who have been given permission:


-bash-3.1$ whoami
cc
-bash-3.1$ hostname
fetlar.inf.ed.ac.uk
-bash-3.1$ lcfgcap om/all
cc: has capability for om/all
-bash-3.1$ om file configure
Component group does not exist
-bash-3.1$ om client configure
Component group does not exist
-bash-3.1$ qxprof client | egrep ^om
om_acl_configure=om/all
om_acl_context=om/all
om_acl_install=om/all
om_acl_logrotate=om/all
om_acl_monitor=om/all
om_acl_reset=om/all
om_acl_restart=om/all
om_acl_run=om/all
om_acl_start=om/all
om_acl_status=om/all
om_acl_stop=om/all
om_acl_unlock=om/all
om_authorization=LCFG::Authorize
om_methods=configure start stop restart run logrotate monitor reset unlock status context install
om_user=root
-bash-3.1$ qxprof authorize
caps_people=om/all
groups=people
ng_statusdisplay=nocomp
schema=1
users_people=pkritika cc
-bash-3.1$ 

As far as I can see, pkritika and cc should be able to use "om" to run this component's methods.

27 July 2007

  • Finished testing the lcfg-client component. It seems to work without any problems. To test it I tried all of the component's methods to see whether they work and whether any errors appear in the logfile (/var/lcfg/log/client). I also ran all the client's methods on the machine gx620pk.epcc.ed.ac.uk, which is already built with LCFG FC6, and compared the log files. They are identical; the methods work exactly the same.
  • panos is able to pick up values from the profile on the server. As a test, I changed the value of inv.manager from root@inf.ed.ac.uk to pkritika@epcc.ed.ac.uk.

fetching the new profile from the server:


27/07/07 11:07:08:    new profile: http://lcfg3.inf.ed.ac.uk/profiles/inf.ed.ac.uk/panos/XML/profile.xml
27/07/07 11:07:08:      last modified Fri Jul 27 11:05:56 2007
27/07/07 11:07:09: -- [changes] changed resource: inv.manager
27/07/07 11:07:09:    profile accepted: 61db3c34e931f56548fcc15a73d54584
27/07/07 11:07:09: -- [changes] RPMs not changed

and qxprof results:


[root@panos components]# qxprof inv | grep manager
display=model location Serial~No=sno allocated manager owner os padlock tags Release~Version=release_version
manager=pkritika@epcc.ed.ac.uk

  • lcfg-client is now officially ported to SL5! Have made a new release (2.2.30), submitted it to the CVS repository and changed the lcfg-client RPM version in lcfg_sl5_lcfg.rpms in the subversion repository.
  • lcfg-inventory has been tested as well (the previous tests already exercised lcfg-inventory anyway). The values of model, allocated, manager and location change normally. Made a new release in the CVS repository and updated lcfg_sl5_lcfg.rpms.
  • lcfg-om works fine. It calls and executes all the methods of the client component without any problem. However, trying to make a new release fails with this error:

bash-3.1$ pwd
/home/lcfg-sl5/lcfg-om
bash-3.1$ make release
Some files are not up to date: ChangeLog config 
make: *** [checkcommitted] Error 1

  • lcfg-file will not be configured and started unless lcfg-nsu and lcfg-etcservices are in the profile and installed, and the lcfg-prelink RPM is installed. It needs their resources. The methods of the file component seem to work, but trying to create a new file or edit one from the profile doesn't really work. There are no errors, nothing; it just doesn't work.
  • Oops! One comment in the profile ended with /* instead of */, so the file component's declarations were not being read from the profile. (Shouldn't there be compile errors for the profile though?) Files can now be created and edited.
  • The tests of lcfg-file seem to work fine. It can create files, edit files, use templates, create templates, directories and links:

edit message of the day /etc/motd:


!file.files             mADD(testmotd)
file.file_testmotd      /etc/motd
file.type_testmotd      literal
file.tmpl_testmotd      test lcfg-file on sl5 64

create a new literal file:


!file.files             mADD(test2)
file.file_test2         /home/testfile
file.type_test2         literal
file.tmpl_test2         hope it works
file.mode_test2         0666
file.owner_test2        pkritika
file.group_test2        users

create a new directory:


!file.files             mADD(testdir)
file.file_testdir       /home/testdir
file.type_testdir       dir
file.mode_testdir       0644

create a new template file in the created directory:


!file.files             mADD(test4)
file.file_test4         /home/testdir/testfile.tmpl
file.type_test4         literal
file.tmpl_test4         template test

create a new file in the created dir using the template file that was just created:


!file.files             mADD(test5)
file.file_test5         /home/testdir/testfile
file.type_test5         template
file.tmpl_test5         /home/testdir/testfile.tmpl

create a link to a file:


!file.files             mADD(testlink)
file.file_testlink      /home/testlink
file.type_testlink      link
file.tmpl_testlink      /home/testfile

  • lcfg-file is ported to SL5! New release made and uploaded to CVS, and lcfg_sl5_lcfg.rpms updated with the new version of the lcfg-file RPM.

  • lcfg-logserver is ported to SL5! All of its methods do what they are supposed to. After starting the component you can see the links to the resources, logs, status and docs at the machine's profile URL: http://lcfg1.inf.ed.ac.uk/cgi/status.cgi/inf.ed.ac.uk/panos.html The log file of the component shows which link of which component the user is visiting.
  • Have made a new release of lcfg-logserver in the CVS repository and updated the lcfg_sl5_lcfg.rpms list with the new version. To make a new release the system needed html2ps, which I installed. The release process runs the tests first and then uploads the new release. The test was failing because lcfg-logserver was already running. Stopping it and killing another process that was using the logserver (shown below) made the test pass.

pkritika 13365  0.0  0.3 115204 11904 ?        Ss   15:50   0:00 /usr/bin/perl ./logserver start

26 July 2007

(Chris) While trying to round up people for a quick meeting on how to proceed with stage 6, I got chatting to Stephen, who made a couple of remarks:

  1. Don't forget to populate the repository with source RPMs as well as with binary ones. Just the RPMs you've had to rebuild from source are needed, don't bother with source for the RPMs you've downloaded and installed purely in binary form. (Now done.)
  2. The autobuilt RPMs should be in the autolcfg directory in the repository rather than in the lcfg- directory. Whoops. (Now done.)
  3. rpmsubmit would be very handy - it might be a good idea (TODO) to add a new stage before stage 5 in future ports in which rpmsubmit is installed. You'll need an NFS client (hopefully installed already) and AMD (install a couple of packages; copy across a config file from a DICE machine; tweak it to look for LDAP information elsewhere rather than on the local machine); and buildtools; then just make and install rpmsubmit. I heartily agree that this would have made stage 5 significantly easier, as it automates a lot of work:
    • it copies the files to the right directories in the repository
    • it remakes the rpmlist files for you
    • it runs genhdfile for you

I just found that I couldn't ssh to fetlar.


[fishpond]cc: ssh -v -v -v fetlar
OpenSSH_4.3p2, OpenSSL 0.9.8a 11 Oct 2005
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to fetlar [129.215.46.133] port 22.
debug1: connect to address 129.215.46.133 port 22: Connection refused
ssh: connect to host fetlar port 22: Connection refused

It turned out that the sshd daemon was no longer running. It'd been fine when I'd left it yesterday. I wondered if yum had interfered overnight. /var/log/yum.log said yes it had - it had "upgraded" our locally patched openssh RPMs to the standard EL5 ones. Then when sshd had been restarted as a result of the rpm installation, it had failed to restart because of the local configuration option supported by our local patched version:

[root@fetlar log]# /etc/init.d/sshd start
Starting sshd: /etc/ssh/sshd_config: line 77: Bad configuration option: GSSAPIKeyExchange
/etc/ssh/sshd_config: terminating, 1 bad configuration options
                                                           [FAILED]

rpm -U --force with our local openssh RPMs has fixed it for now, but we're really going to have to stop yum from interfering like this. Must find out how to stop it doing its overnight updates. There's some message about this in the boot messages so it can't be too hard.

  • On panos ssh is working, but trying to restart sshd failed for the same reason as on fetlar. The yum nightly update can be stopped, both for now and permanently, like this:

/etc/init.d/yum stop
chmod -x /etc/init.d/yum

  • have moved lcfg-* from the lcfg repository directory to the autolcfg directory, for both sl5 and sl5_64. Have remade rpmlist in all four locations. Have deleted the lcfg-* header files in the lcfg directories and recreated them with genhdfile in the autolcfg directories.
  • have created srpms repository directory trees for sl5 and sl5_64. Have copied over the SRPMS from fetlar and panos to the sl5 and sl5_64 trees respectively. In both cases I put everything called lcfg-* into an autolcfg dir and everything else into the lcfg dir. Have created rpmlist files. No need to run genhdfile for source RPMs, it seems.
  • Using a minimal profile for panos generates mutation errors in the form of:

mutation error: panos.logserver.components=mADD(client) (/var/lcfg/conf/server/releases/default/core/include/lcfg/defaults/client.h:7)
  did you forget to include "mutate.h"? or misspell a macro?
  mutation error: panos.profile.release=mSET(develop) (/var/lcfg/conf/server/data/profiles/panos: 13)

and at the end the error:


profile.release unstable: develop => stable => default => stable ?
  can't locate package file: lcfg_sl5_64_postship.rpms (/var/lcfg/conf/server/releases/default/core/include/lcfg/defaults/profile.h:94)

For the package file, the list lcfg_sl5_64_postship.rpms had to be created under core/packages/lcfg.

  • Using the default components from the defaults.h header generates errors about package files that cannot be located:

can't locate package file: lcfg_OS_ID_base.rpms (/var/lcfg/conf/server/releases/develop/core/include/inf/defaults.h:99 /var/lcfg/conf/server/releases/develop/core/include/lcfg/defaults/profile.h:94)
  can't locate package file: lcfg_OS_ID_optional.rpms (/var/lcfg/conf/server/releases/develop/core/include/inf/defaults.h:99 /var/lcfg/conf/server/releases/develop/core/include/lcfg/defaults/profile.h:94)
  can't locate package file: lcfg_OS_ID_postship.rpms (/var/lcfg/conf/server/releases/develop/core/include/inf/defaults.h:99 /var/lcfg/conf/server/releases/develop/core/include/lcfg/defaults/profile.h:94)
  can't locate package file: lcfg_OS_ID_updates.rpms (/var/lcfg/conf/server/releases/develop/core/include/inf/defaults.h:99 /var/lcfg/conf/server/releases/develop/core/include/lcfg/defaults/profile.h:94)

There are no mutation errors when using defaults.h, which probably means that another header includes mutate.h. We could include the mutate.h header ourselves but it doesn't seem to be present in the repository.

Stephen solved this one for us: apparently you can't use cpp variables (like OS_ID) inside mutations. So you need to take another approach. Luckily there is an LCFG resource which contains the same value as OS_ID. The resource is called inv.os and you can use its value in inf/defaults.h by replacing this:


/**/OS_ID/**/

with this:

<%inv.os%>

wherever it occurs.
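As a rough illustration of that substitution, here is a Python sketch that rewrites a header line. The sample line is an assumption reconstructed from the error messages above (lcfg_OS_ID_base.rpms), not a copy of the real inf/defaults.h, and use_inv_os is a made-up helper name.

```python
def use_inv_os(header_text):
    """Replace the cpp form of OS_ID with a reference to the inv.os resource."""
    return header_text.replace("/**/OS_ID/**/", "<%inv.os%>")


# Assumed shape of a package list name in inf/defaults.h.
before = "lcfg_/**/OS_ID/**/_base.rpms"
print(use_inv_os(before))  # lcfg_<%inv.os%>_base.rpms
```

Because the new form depends on inv.os, the inv component has to be present in the profile for it to resolve.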

  • Stephen's suggested approach to stage 6 is to edit our os headers (inf/os/sl5.h and inf/os/sl5_64.h) to remove everything from the profile.components and boot.services resources. That gives you a nice simple profile without having to edit too many headers right off. Then, add components back to these resources one by one. Start with just client:

!profile.components     mSET(client)
!boot.services          mSET(lcfg_client)

Oops, this has the effect of taking away inv.os, which breaks the change we made to inf/defaults.h :-) So add inv back too:

/* Just the client component to start with */
!profile.components     mSET(client)
!boot.services          mSET(lcfg_client)

/* and inv too to make inv.os work */
!profile.components   mADD(inv)

Bingo! A working SL5 profile, with just client and inv components enabled.
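To see what the header fragment above ends up with, here is a deliberately simplified Python model of the two mutations it uses. It assumes only that mSET replaces a resource's value and mADD appends one item to a space-separated list. The real macros in mutate.h may do more (for example, handle duplicates), so treat this as a sketch rather than a description of LCFG.

```python
def apply_mutations(mutations):
    """Apply (resource, op, value) mutations in order and return the final values."""
    resources = {}
    for resource, op, value in mutations:
        items = resources.get(resource, [])
        if op == "mSET":
            # Replace whatever was there before.
            items = value.split()
        elif op == "mADD":
            # Append one item to the existing list.
            items = items + [value]
        else:
            raise ValueError(f"unsupported mutation: {op}")
        resources[resource] = items
    return {name: " ".join(items) for name, items in resources.items()}


# The mutations from the header fragment above, in order.
final = apply_mutations([
    ("profile.components", "mSET", "client"),
    ("boot.services", "mSET", "lcfg_client"),
    ("profile.components", "mADD", "inv"),
])
print(final)
```

Adding components back one at a time then amounts to appending further mADD lines and re-checking the profile after each one.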
  • We also got some package errors: several packages were included in both lcfg_sl5_lcfg.rpms and lcfg_sl5_optional.rpms. Not sure how to approach this as they really should be in both. Commented them out of lcfg_sl5_lcfg.rpms for now.


bash-3.1$ /usr/lib/lcfg/components/client install http://lcfg1.inf.ed.ac.uk/profiles/inf.ed.ac.uk/panos/XML/profile.xml

This fetches the profile into /var/lcfg/conf/profile/dbm:


bash-3.1$ ls -l /var/lcfg/conf/profile/dbm/
total 32
-rw-r--r-- 1 root root 32768 Jul 26 14:42 panos.DB2.db

  • Stephen advised us to build the RPMs and do all the work as a normal user rather than root, as working as root may cause unwanted conflicts. Therefore all the lcfg components were built on panos as a normal user and the tests were run (thanks to the script, which saved a lot of hassle). This time most of the components passed their tests. These are:

lcfg-alias
lcfg-authorize
lcfg-client
lcfg-cron
lcfg-example
lcfg-file
lcfg-logserver
lcfg-ngeneric
lcfg-perlex
lcfg-rpmcache
lcfg-utils

Full results on http://www2.epcc.ed.ac.uk/~pkritika/sl5/results/2007-07-26/

  • The client component is up and running on fetlar too. I changed the value of inv.location. The client component on fetlar got the new value and qxprof inv.location gave the right value. The start and stop methods work OK...

[root@fetlar noarch]# ps axuww|grep rdx
root      3723  0.0  1.5  15340  8084 ?        Ss   17:02   0:00 /usr/bin/perl /usr/sbin/rdxprof -R -d -C client -p 10m+30s -A 7s+30s -u http://lcfg1.inf.ed.ac.uk/profiles,http://lcfg3.inf.ed.ac.uk/profiles -n -v -a
root      3759  0.0  0.1   3880   668 pts/3    R+   17:08   0:00 grep rdx
[root@fetlar noarch]# /usr/lib/lcfg/components/client stop
[OK] client: stop
[root@fetlar noarch]# ps axuww|grep rdx
root      3799  0.0  0.1   3880   668 pts/3    R+   17:08   0:00 grep rdx
[root@fetlar noarch]# /usr/lib/lcfg/components/client start
[OK] client: start
[root@fetlar noarch]# ps axuww|grep rdx
root      3855  7.6  1.5  15344  8072 ?        Ss   17:08   0:00 /usr/bin/perl /usr/sbin/rdxprof -R -d -C client -p 10m+30s -A 7s+30s -u http://lcfg1.inf.ed.ac.uk/profiles,http://lcfg3.inf.ed.ac.uk/profiles -n -v -a
root      3869  0.0  0.1   3880   692 pts/3    R+   17:08   0:00 grep rdx
[root@fetlar noarch]#

The install method also works - it had to, to get client up and running in the first place :-)

25 July 2007

  • Removed the 64 bit patched openssh package and re-installed from the 32 bit SRPM (just to be sure we use exactly the same patched package). It still creates a 64 bit RPM which installs and works fine.
  • Have re-run the tests, after installing a few more dependencies yesterday, but the results are the same: http://www2.epcc.ed.ac.uk/~pkritika/sl5/results/2007-07-25/ Maybe this is because of differences in the RPMs installed on fetlar and panos, and/or because we need a fully developed LCFG machine.
  • openssh has been configured and tested with GSSAPI support, and you can now ssh to and from the machines without being asked for a password. To do so we had to obtain Kerberos keytabs as described here: https://wiki.inf.ed.ac.uk/view/DICE/MPUFcFiveInfSixtyFourBit

We also had to enable GSSAPI support for the server daemon in sshd_config:


    GSSAPIAuthentication yes
    GSSAPICleanupCredentials yes
    GSSAPIKeyExchange yes 
     

And likewise on the client side, in ssh_config:


    GSSAPIAuthentication yes
    GSSAPIDelegateCredentials yes 
    

And finally restart sshd: as root, /etc/init.d/sshd restart
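
A quick way to confirm that the settings above are really in place is to grep for them. The following is a minimal sketch; the function name check_yes is made up, and the file locations in the usage note (/etc/ssh/sshd_config and /etc/ssh/ssh_config) are the usual ones rather than something recorded in this diary.

```shell
#!/bin/sh
# Report whether each named option is set to "yes" in the given config file.
# Matching ignores case and skips commented-out lines. Returns non-zero if
# any option is missing.
check_yes() {
    file=$1; shift
    status=0
    for opt in "$@"; do
        if grep -Eiq "^[[:space:]]*${opt}[[:space:]]+yes[[:space:]]*$" "$file"; then
            echo "ok      $opt"
        else
            echo "MISSING $opt"
            status=1
        fi
    done
    return $status
}
```

For example: check_yes /etc/ssh/sshd_config GSSAPIAuthentication GSSAPICleanupCredentials GSSAPIKeyExchange, and check_yes /etc/ssh/ssh_config GSSAPIAuthentication GSSAPIDelegateCredentials.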

  • SL5 os headers created for inf/os on the LCFG subversion repository


   inf/os/sl5.h
   inf/os/sl5_64.h
   

(Chris) Most of the permissions you need to use our repositories and LCFG infrastructure are controlled via capabilities in our authorisation system. This gives Informatics COs all the permissions they need automatically, but not people from EPCC. I've got fed up of giving Panos roles, each containing one or two capabilities, one by one as we find he needs them only to find another one that he needs - so I've created a catch-all role called "portinglcfg" to hold them all. All I have to do from now on is add more capabilities to that role as necessary. So far it has

  • linuxman for rpmsubmit
  • group/cvs_dice and group/cvs_dice_locks for the CVS repository
  • @lcfgsvn for the LCFG subversion repository
  • rfe/lcfg/write, rfe/lcfg/create and rfe/lcfg/edit for editing lcfg files.
  • om/all and om/test for running components
  • lcfg/inventory/write for luck

In the absence of rpmsubmit we've copied all the autobuilt LCFG RPMs and all of their prerequisite RPMs into the repository (/pkgs/master/rpms/sl5/lcfg and /pkgs/master/rpms/sl5_64/lcfg from DICE machines). Comparing what was there with what ought to be there according to the packages/lcfg/lcfg_sl5_lcfg.rpms list showed up a few problems:

  1. The 64 bit versions had all been made with make devrpm. They were remade with make rpm.
  2. One RPM had been slightly misnamed in the list (perl-GD-Graph3d-0.63-2.2.el5 should have been perl-GD-Graph3d-0.63-2.2.el5.rf)
  3. Chris had forgotten where he'd put some of the RPMs and had to hunt for them with find. Panos had been more organised! Once the RPMs were all copied we remade the rpmlist file and reran /usr/sbin/genhdfile on each RPM.
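
The comparison between the package list and the repository contents can be roughly automated. This sketch assumes that package list entries look like name-version-release/arch (with the /arch part optional), that lines starting with "#" or "/*" are cpp directives or comments, and that rpmlist holds file names of the form name-version-release.arch.rpm. The function name is hypothetical, and list entries with other prefixes or multi-line comments are not handled.

```shell
#!/bin/sh
# Print package list entries ($1) that have no matching file in an rpmlist ($2).
missing_from_repo() {
    pkglist=$1
    rpmlist=$2
    want=$(mktemp)
    have=$(mktemp)
    # Trim whitespace, drop cpp directives, comments and blank lines,
    # then strip any trailing /arch from each entry.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        -e '/^#/d' -e '/^\/\*/d' -e '/^$/d' \
        -e 's|/[^/]*$||' "$pkglist" | sort -u > "$want"
    # Strip the .rpm extension and then the .arch component.
    sed -e 's/\.rpm$//' -e 's/\.[^.]*$//' "$rpmlist" | sort -u > "$have"
    # Lines present only in the package list.
    comm -23 "$want" "$have"
    rm -f "$want" "$have"
}
```

Run against lcfg_sl5_lcfg.rpms and the repository's rpmlist, a check like this would have flagged the misnamed perl-GD-Graph3d entry, since the list entry lacked the .rf suffix carried by the file.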

lcfg-etcservices isn't making.

  • Have created the very first version of a profile for panos but not very successfully. There are lots of "mutation" errors. Also, in the beginning the server was able to see the inf/os/sl5_64.h header but it didn't want to later on. Needs to be examined tomorrow.

We had a problem making an RPM for lcfg-etcservices. It wouldn't make as it couldn't find a template file for sl5. We added one and checked it into the repository. Then "make devrpm" worked but "make rpm" failed because it couldn't find services.tpml_sl5. The problem was that we hadn't done "make release" after adding the file. This makes a new version number for the component and tags the files with it; and "make rpm" works with the files tagged with the latest version rather than whatever's currently in the repository. One "make release" later, we'd made a new version for lcfg-etcservices (now 0.100.7 rather than 0.100.6) and this version builds properly with "make rpm". Then copy to repository, remake rpmlist, run genhdfile.

And the autobuilt RPMs and their prerequisites are DONE and in the repository at last!

We've realised that we're not entirely sure how to proceed with stage 6 now. We'll talk with Stephen or Alastair about it tomorrow.

24 July 2007

  • The 64 bit packages were also remade and their tests rerun. The results are in http://www2.epcc.ed.ac.uk/~pkritika/sl5/results/2007-07-24/. Authorize, cron, syslog and utils passed.
  • We got our patched version of openssh built and installed on i386:
  • tried the openssh in the cvs repository; it's very old and doesn't build. Simon advises getting rid of it.
  • then tried the version in /pkgs/master/srpms/fc6/ed/openssh-4.3p2-4.11.ed3.src.rpm. It needs "audit-devel-libs".
  • Find audit-devel-libs in the SL distro. Good. Install it with "yum install audit-devel-libs". TODO: Must go back later and add this to the prereqs.
  • openssh srpm now installs.
  • try building it. Doesn't build. At the "autoreconf" stage you get "configure.ac:1408: /usr/bin/m4: builtin `mkstemp' requested by frozen file is not supported". Simon advises that you need m4 1.4.8 or later to fix this. EL5 has 1.4.5.
  • get ftp://ftp.scientificlinux.org/linux/scientific/5x/SRPMS/SL/m4-1.4.8-1.src.rpm and build it for i386, install it. TODO: We'll need to go back later and add this to the prereqs in the package list, or make sure it's in the updates list when we make that. (Note: m4 is in the distro, but it's version 1.4.5, which doesn't work with the openssh build.)
  • try openssh build again. Builds this time.
  • then try applying Simon's patch. He advises use of openssh-4.3p2-gsskex-20060223-playnicely.patch (included in current DICE srpms) rather than the one at http://www.sxw.org.uk/computing/patches/openssh.html - the former being specially adapted to apply to redhat RPMs cleanly. This applies, the package builds. Call the new package openssh-4.3p2-16.ed.gsskex.1.
  • It works - as ssh. You can ssh from and to the machine! But how do we know that the local patch works? Do we need to worry about that just now? Need to test.
  • On x86_64, to install the openssh rpm you also need gtk2-devel. This is already installed on our 32 bit machine but not on the 64 bit. We don't know why not. Anyway, to install gtk2-devel you need these:

gtk2-devel              x86_64     2.10.4-16.el5    sl-base           2.8 M
gtk2-devel              i386       2.10.4-16.el5    sl-base           2.8 M
Installing for dependencies:
atk-devel               x86_64     1.12.2-1.fc6     sl-base           125 k
cairo-devel             x86_64     1.2.4-1.fc6      sl-base           130 k
glib2-devel             x86_64     2.12.3-2.fc6     sl-base           1.3 M
libXcursor-devel        x86_64     1.1.7-1.1        sl-base            14 k
libXext-devel           x86_64     1.0.1-2.1        sl-base            57 k
libXfixes-devel         x86_64     4.0.1-2.1        sl-base           9.2 k
libXft-devel            x86_64     2.1.10-1.1       sl-base            16 k
libXi-devel             x86_64     1.0.1-3.1        sl-base            52 k
libXinerama-devel       x86_64     1.0.1-2.1        sl-base           5.1 k
libXrandr-devel         x86_64     1.1.1-3.1        sl-base            15 k
libXrender-devel        x86_64     0.9.1-3.1        sl-base           8.8 k
pango-devel             x86_64     1.14.9-3.el5     sl-base           281 k

TODO: Again, need to go back and add these to prereqs or install advice.
  • We're not sure whether to use the same srpm for openssh on 64 bit as 32 bit. The 64 bit openssh on fc6 seems to differ from the 32 bit one - certainly the file size is different anyway:

[fishpond]cc: pwd
/pkgs/master/srpms
[fishpond]cc: ls -l fc6/ed/openssh-4.3p2-4.11.ed3.src.rpm fc6_64/ed/openssh-4.3p2-4.11.ed3.src.rpm
-rw-r--r-- 1 linux people 873345 Feb 21 15:32 fc6_64/ed/openssh-4.3p2-4.11.ed3.src.rpm
-rw-r--r-- 1 linux people 873952 Feb 13 09:04 fc6/ed/openssh-4.3p2-4.11.ed3.src.rpm
[fishpond]cc: 

However, the patched 64 bit version of openssh, once installed, works fine for ssh to and from the machine.
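
Since the openssh build depends on m4 being 1.4.8 or later, it may be worth checking the installed version before starting. This is a sketch only: the helper names are made up, the comparison handles purely numeric dotted versions, and the parsing assumes the version number is the last word on the first line of "m4 --version".

```shell
#!/bin/sh
# Succeed if dotted numeric version $1 is greater than or equal to $2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        # Compare field by field; missing fields count as 0.
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Print the version number reported by the installed m4.
m4_version() {
    m4 --version | head -n 1 | awk '{ print $NF }'
}

# Example:
#   version_ge "$(m4_version)" 1.4.8 || echo "m4 is too old for the openssh build"
```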

23 July 2007

Stage 5.

  • Packages matching those listed in lcfg_sl5_lcfg.rpms were downloaded for 32bit and 64bit.
  • Both machines had their prerequisite packages adjusted to match lcfg_sl5_lcfg.rpms.
  • The 32 bit core and standard LCFG packages were remade and their tests rerun. The results are in http://homepages.inf.ed.ac.uk/cc/sl5/results/2007-07-23/. Authorize, client and utils passed. Others failed or had no tests. Some tests seem to assume the presence of an LCFG profile for the machine and are failing without it. We'll carry on with the port and we'll keep revisiting the tests.

Week to 20 July 2007

Stage 5. We ran the tests on all core and standard LCFG components. All tests failed. We made independent attempts for 32bit and 64bit to find and install all the prerequisite packages for core and standard LCFG software. We then compared our findings and agreed on a standard set of prerequisite packages. We adjusted the contents of packages/lcfg/lcfg_sl5_lcfg.rpms accordingly.

Mid July 2007

The white bits are the plan for the FC6 desktop port, with "FC6" replaced with "SL5". The blue bits are reports of our progress in mid July.

  1. Install sl5 (1 day)
  2. Standard sl5 desktop machine

i386: fetlar.inf.ed.ac.uk in room 1419

x86_64: panos.inf.ed.ac.uk in room 2405

  1. Get onto the Informatics network.

i386: Yes, 129.215.46.133

x86_64: Yes, 129.215.46.109

  1. Install the development tools.

yum groupinstall development-tools

  1. Authentication with kerberos
i386: Yes, done through the SL5 install utility. The realm is INF.ED.AC.UK, the admin server is kdc.inf.ed.ac.uk:749

x86_64: Yes

  1. Directory services from ldap.
i386: Yes. Initially configured through the SL5 install utility. (Correction: it appears that I copied /etc/ldap.conf from a DICE machine...) You also need to install ldapsearch (yum install openldap-clients). You also need to change the authentication method from sasl to gssapi: copy /usr/lib/sasl2/slapd.conf from a DICE machine, then yum install cyrus-sasl-gssapi

x86_64: As for i386, except make sure you just have the 64 bit RPM installed for cyrus-sasl-gssapi; remove the 32 bit version. Later info: maybe the removal wasn't necessary; after a reinstall, LDAP worked with both versions of the RPM still present.
You also need to specify host infdir.inf.ed.ac.uk and base dc=inf,dc=ed,dc=ac,dc=uk

Extra info from Stephen: the LDAP config as currently set up on fetlar enforces use of AFS home directories, so for instance my home directory is /afs/inf.ed.ac.uk/user/c/cc. As AFS is not yet set up or working, this isn't much use! Until AFS is set up, it'll be much easier to use a local home directory. To do this, edit /etc/ldap.conf and edit out the last line (nss_map_attribute homeDirectory afsHomeDirectory). After doing that you will find that your homedir is a local one e.g. /home/cc. Then just make the directory and give it the right user and group ownership and you'll have a proper home directory. Makes things much easier.
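
Stephen's workaround could be applied with something like the following. It is only a sketch: the function names are invented, it comments the line out rather than deleting it, and it keeps a backup copy of the file next to the original.

```shell
#!/bin/sh
# Comment out the AFS home directory mapping in an ldap.conf-style file ($1),
# leaving the original in $1.bak.
disable_afs_homedir() {
    conf=$1
    cp -p "$conf" "$conf.bak" || return 1
    sed 's/^[[:space:]]*nss_map_attribute[[:space:]]\{1,\}homeDirectory[[:space:]]\{1,\}afsHomeDirectory/#&/' \
        "$conf.bak" > "$conf"
}

# Create a local home directory for user $1 with group $2 (run as root).
make_local_home() {
    mkdir -p "/home/$1" && chown "$1:$2" "/home/$1"
}

# Example (as root):
#   disable_afs_homedir /etc/ldap.conf
#   make_local_home cc people
```

The group name "people" in the example is a guess and should be replaced with the user's real group.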

  1. RPM repositories (0.5 day)
  2. Create repository directory structure

i386: Created /pkgs/master/rpms/sl5/[autodice,autolcfg,dice,distro,ed,extras,lcfg,updates]

x86_64: Created /pkgs/master/rpms/sl5_64/[autodice,autolcfg,dice,distro,ed,extras,lcfg,updates]
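
For the record, the directory structure above can be created in one go with a small loop. This is a minimal sketch with a made-up function name; the subdirectory names are the ones listed above.

```shell
#!/bin/sh
# Create the per-platform repository subdirectories.
#   $1: repository root, e.g. /pkgs/master/rpms
#   $2: platform name, e.g. sl5 or sl5_64
make_repo_dirs() {
    root=$1
    platform=$2
    for sub in autodice autolcfg dice distro ed extras lcfg updates; do
        mkdir -p "$root/$platform/$sub" || return 1
    done
}

# Example:
#   make_repo_dirs /pkgs/master/rpms sl5
#   make_repo_dirs /pkgs/master/rpms sl5_64
```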

  1. Populate base, updates, extras
i386: Copy contents of SL dir from i386 DVD into /pkgs/master/rpms/sl5/distro

x86_64: Copy contents of SL dir from x86_64 DVD into /pkgs/master/rpms/sl5_64/distro

In each case, make an rpmlist file in the same dir (just a list of all the RPM file names, one per line).

Each RPM also needed to have /usr/sbin/genhdfile run on it. This generates the RPM's updaterpms header file. Normally an RPM is submitted to a repository directory using the command rpmsubmit and this runs genhdfile automatically; but in this case I copied all of the RPMs from the DVD, so I just ran genhdfile on each one afterwards.
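
The two manual steps (make the rpmlist file, then run genhdfile on every RPM) can be combined in a loop along these lines. Note the assumption: this sketch calls genhdfile with the RPM file name as its only argument, which is a guess at the invocation and should be checked against the real tool. The function name and the GENHDFILE variable are hypothetical.

```shell
#!/bin/sh
# Header generator to run on each RPM; overridable for dry runs.
# ASSUMPTION: it accepts the RPM file name as its single argument.
GENHDFILE=${GENHDFILE:-/usr/sbin/genhdfile}

# Rewrite the rpmlist file in directory $1 (one RPM file name per line)
# and run the header generator on each RPM found there.
index_repo_dir() {
    dir=$1
    ( cd "$dir" || exit 1
      : > rpmlist
      for f in *.rpm; do
          [ -e "$f" ] || continue
          echo "$f" >> rpmlist
          $GENHDFILE "$f" || exit 1
      done )
}

# Example:
#   index_repo_dir /pkgs/master/rpms/sl5/distro
```

Setting GENHDFILE=echo first gives a harmless dry run that just lists the files it would process.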

  1. Package lists (0.5 day)
  2. Create lists for sl5 base, updates, postship

For both i386 and x86_64 I split the "base" package list into two separate files, a "base" list containing the packages which the SL5 install process installs by default, and an "optional" list containing the others. The files were created in the subversion repository in core/packages/lcfg and are called lcfg_sl5_base.rpms, lcfg_sl5_optional.rpms, lcfg_sl5_64_base.rpms and lcfg_sl5_64_optional.rpms.
I made the split into base and optional because Stephen said there had been complaints relating to other releases about our usual habit of just installing the whole base distribution in one go: some people want fewer RPMs. I'm not sure the split is a good idea as it'll make it more complicated for us to keep track of LCFG components' package dependencies, won't it?
(to which Stephen comments: "All LCFG packages should completely specify all their direct dependencies. rpmbuild (what happens when "make rpm", etc. is used) will automatically add any dependencies on perl libraries, shells, and libraries when C code is involved. Anything else should be explicitly stated. I've worked on this in fc5 and fc6 and generally most of it is sorted now.")
I also created a (currently empty) updates list for each arch, lcfg_sl5_updates.rpms and lcfg_sl5_64_updates.rpms.
lcfg_sl5_postship.rpms is also empty: "This package file lists the RPMs shipped as SL5 updates but without corresponding versions in the original SL5 base".
The base distribution RPMs are all in a directory called SL on the download servers and on the install DVD. There's a directory inside SL called repodata; this contains some XML files including one called comps.xml which looks like a list of every RPM, together with which software group it is in, whether the software group is installed by default or not, and whether the individual RPM is installed by default or not. This comps.xml file is what I used to create the entries in the "base" and "optional" lists. I wrote a perl script called comps2rpms to parse the XML and spit out updaterpms package lists.
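
The comps2rpms script itself is not shown here. As a very rough illustration of the idea, and not a replacement for it, the following shell sketch pulls package names of a given type out of comps.xml. It assumes one packagereq element per line with a type attribute of "mandatory", "default" or "optional", it ignores whether the enclosing group is itself installed by default, and it produces bare package names rather than complete updaterpms entries. The function name is made up.

```shell
#!/bin/sh
# Print the names of packages whose <packagereq> type is $1 in comps file $2.
# ASSUMPTION: each <packagereq ...>name</packagereq> sits on its own line.
comps_packages() {
    req=$1
    file=$2
    sed -n "s|.*<packagereq[^>]*type=\"$req\"[^>]*>\([^<]*\)</packagereq>.*|\1|p" "$file" | sort -u
}

# Example:
#   comps_packages default  SL/repodata/comps.xml
#   comps_packages optional SL/repodata/comps.xml
```

A real XML parser is more robust than line-based matching, so treat this only as a quick way to eyeball the contents of the file.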

  1. Create empty lists for lcfg components

lcfg_sl5_lcfg.rpms was created. It'll contain the LCFG RPMs and their prerequisites and will be shared by the two architectures.

lcfg_sl5_lcfg_installroot.rpms is also shared and empty. It'll contain LCFG RPMs and prerequisites for the first stage of the install process.

Note the cpp pragmas that both of these files have to contain. These help with the eventual export to the LCFG web site.

  1. Essential headers (0.5 day)
  2. Create any essential headers for each platform
Created core/include/hidden/sl5vars.h and core/include/hidden/sl5_64vars.h in subversion. The latter includes the former.

  1. Add basics to lcfg/defaults/profile.h and lcfg/defaults/updaterpms.h
Done. Also overrode profile.packages in info/defaults.h to include the "optional" package list as well as the "base" one. (This split into "optional" and "base" is surely more trouble than it's worth.)

  1. Auto-build and run tests for all LCFG components (2 days). Also auto-build:
  2. openafs client support - makes porting a lot easier
  3. openssh with our patches
No homedir so (a) can't put sources there and (b) don't have a ~/.rpmmacros file so can't make RPMs anywhere other than the default location /usr/src/redhat. So I check out all the LCFG components from the CVS repository (more accurately, all the ones mentioned in the lcfg_fc6_lcfg.rpms package file) into /tmp and try a "make" on each of them as root. They all fail for want of "buildtools.mk" except for "lcfg-buildtools" which builds. So I try installing the resulting lcfg-buildtools RPM. The install fails for want of a dependency, perl-Time-modules. This apparently comes with Fedora but not with SL. Where do I get one? There's no cpan2rpm in the SL distribution that I can see either so I can't (yet) use that to build it.
Couldn't find a handy source of SL5 or RHEL5 RPMs. However RHEL5 is based on FC6 - so I tried installing /pkgs/master/rpms/fc6_64/extras/perl-Time-modules-2003.1126-4.fc6.noarch.rpm and this worked.
Stop press: just found this, it seems to have lots of handy RPMs: http://ftp.scientificlinux.org/linux/extra/

Installing this allowed me to install the lcfg-buildtools RPM I'd made. This then allowed me to "make devrpm" on all the other "core", "standard" and "additional" LCFG components (as listed in core/packages/lcfg/lcfg_fc6_lcfg.rpms).
Copied all built lcfg-* RPMs to rpms/sl5/lcfg repository on pezenas. Ran "genhdfile" on each one. Added names to "rpmlist" file in same dir.
Added entries to core/packages/lcfg/lcfg_sl5_lcfg.rpms as per core/packages/lcfg/lcfg_fc6_lcfg.rpms - for RPMs just built (plus perl-Time-modules). Other prerequisites will have to wait and be tackled one at a time.
NB NO TESTS RUN YET
Tried running "make tests" in lcfg-client. Every one of the 19 tests fails saying that /usr/sbin/lcfgdiff can't be found. lcfgdiff is part of lcfg-utils. Install that. Won't install - needs perl-XML-Parser. Install that. Install lcfg-utils. Run tests again. The tests no longer complain about lcfgdiff but they still fail.
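
On the earlier point about having no ~/.rpmmacros file: once a home directory exists, a per-user build tree lets RPMs be made somewhere other than /usr/src/redhat, and as a normal user. This is a minimal sketch using rpm's standard %_topdir macro; the function name and the "rpmbuild" directory name are arbitrary choices, not something taken from the LCFG build tools.

```shell
#!/bin/sh
# Create an RPM build tree under home directory $1 and point rpmbuild at it
# through .rpmmacros. Refuses to overwrite an existing .rpmmacros file.
setup_rpm_tree() {
    home=$1
    if [ -e "$home/.rpmmacros" ]; then
        echo "$home/.rpmmacros already exists, leaving it alone" >&2
        return 1
    fi
    for sub in BUILD RPMS SOURCES SPECS SRPMS; do
        mkdir -p "$home/rpmbuild/$sub" || return 1
    done
    echo "%_topdir $home/rpmbuild" > "$home/.rpmmacros"
}

# Example:
#   setup_rpm_tree "$HOME"
```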

  1. Create basic development platform (3 days)
  2. Develop Inf level to create a basic profile with most components removed
  3. lcfg-buildtools

  1. lcfg-utils
  2. lcfg-ngeneric
  3. lcfg-client
  4. lcfg-file
  5. lcfg-inventory
  6. lcfg-logserver
  7. lcfg-authorize
  8. lcfg-om
  9. lcfg-updaterpms
  10. lcfg-amd (for rpmsubmit)
  11. rpmsubmit

  1. Components necessary to keep a machine LCFG managed (2 days)
  2. lcfg-auth
  3. lcfg-boot
  4. lcfg-cron
  5. lcfg-etcservices
  6. lcfg-init
  7. lcfg-lcfginit
  8. lcfg-nsu
  9. lcfg-pam
  10. lcfg-syslog
  11. lcfg-tcpwrappers

  1. Components for auth/authz, directory services and dns in client mode. (2 days)
  2. lcfg-dns
  3. lcfg-kerberos
  4. lcfg-nsswitch
  5. lcfg-ntp
  6. lcfg-openldap
  7. lcfg-openssh

  1. X support. (1 day)
  2. lcfg-gdm
  3. lcfg-xfree

  1. Other components, mainly just auto-build and install. (1 day)
  2. lcfg-alias
  3. lcfg-mailng
  4. lcfg-mailcap
  5. lcfg-prelink
  6. lcfg-rpmcache
  7. lcfg-xinetd

  1. Installation systems (4 days)
  2. lcfg-fstab
  3. lcfg-grub
  4. lcfg-hardware
  5. lcfg-install
  6. lcfg-kernel
  7. lcfg-network
  8. Create installroot and installbase package lists
  9. Build, install and test lcfg-buildinstallroot
  10. Set up PXE nfs root, installer, etc

  1. Port MPU managed resources to the DICE level. (3 days)

  1. Document new platforms (2 days)

  1. Back port lcfg-buildtools to all other supported platforms

  1. Add sl5 to the list of supported platforms on the LCFG website.
Topic revision: r3 - 2011-02-17 - squinney
 