This is the story of one of my "voila" moments with NFS-mounted root directories.
At our office we started with around 15 people. We mainly process claims, which from the application
perspective boils down to a browser and a word processor. Luckily, I was given the freedom to handle
the technical side of the whole setup.
I immediately decided on network booting as the best solution. No hard disks! No installation chores!
Instant updates! So I set up a Linux box with DHCP, TFTP and XDMCP forwarding. All machines had more or
less the same motherboard and configuration, so I compiled all the drivers directly into the latest 2.6
kernel (no modules), making it monolithic. Then I managed to create a rootfs with an X server (more or
less busybox + X statically compiled + some libs) in around 20 MB. All machines were equipped with
around 128 MB of RAM, which was enough to hold a 20 MB rootfs. After some tweaks here and there we were
ready with network booting.
With a staff of 15, the machines were pretty fast. All processing happened on the server and the
clients only handled the display. Soon 10 more people joined and speeds dropped noticeably. Another 15
would be joining soon. I already knew that this was going to be a big problem as the staff grew. I had
thought along the lines of load balancing, but was not too happy with the idea of adding servers, which
would also mean more cost, maintenance and administration. So an NFS-mounted rootfs was the answer.
Basically, all machines would use their own processor and memory but would have no hard disks. This way
there should be no need for any other server, and handling the growing staff should not be a problem.
So I soon created a big rootfs with all the packages needed, compiled the kernel with root-over-NFS
support and eventually got a machine to boot with its root on NFS. I started Firefox, and it opened as
it normally would. Then I tried to start OpenOffice. The first start takes some time because OpenOffice
copies its setup into the user's directory, so I patiently waited for around 30 seconds, and nothing
seemed to happen. I waited for a minute and knew that something was wrong!
I needed some sort of error message to go on, so I started oowriter from an xterm:
vinay@debdungeon:~# oowriter
Not a single error message, and oowriter just did not start: no splash screen, no error messages,
nothing. I deleted .sversionrc and the .openoffice directory and started oowriter again:
vinay@debdungeon:~# rm .sversionrc; rm -fr .openoffice
vinay@debdungeon:~# oowriter
running openoffice.org setup...
Setup complete. Running openoffice.org...
Now I could see that it had performed the setup and was trying to start oowriter, but nothing more
happened and it still did not start. I searched for debugging options for OpenOffice and ran it under
strace, logging to a file called ooolog:
vinay@debdungeon:~# OOO_DEBUG="strace -o ooolog" oowriter
Then I waited for around 20 seconds and pressed Ctrl+C to stop oowriter.
vinay@debdungeon:~# tail -n 200 ooolog | less
I found that towards the end the log was filled with messages like this:
stat64("/tmp/OSL_PIPE_0_SingleOfficeIPC_4acd679a70dd792afe65dde68cb44c2", 0xbfffc63c) = -1 ENOENT (No such file or directory)
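Paging through 200 lines works, but the repeated failure is easier to spot when the log is reduced to
the distinct paths that came back with ENOENT. A minimal sketch, assuming the log was written by
strace -o without -f, so that every line starts with the syscall name. The function name enoent_paths
is made up for this example.

```shell
#!/bin/sh
# Print each distinct path whose syscall failed with ENOENT.
# Reads a strace log on stdin and assumes lines of the form
#   syscall("path", ...) = -1 ENOENT (No such file or directory)
enoent_paths() {
    sed -n 's/^[a-z0-9_]*("\([^"]*\)".*= -1 ENOENT.*$/\1/p' | sort -u
}

# Usage against the log from this article:
#   enoent_paths < ooolog
```

A path that shows up here while the application is still running is a good candidate for "the file it
keeps waiting for".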
I immediately tried to find the file in /tmp:
vinay@debdungeon:~# ls -la /tmp/OSL_PIP*
srwxrwxr-x 1 vinay vinay 0 Oct 11 12:22 /tmp/OSL_PIPE_1000_SingleOfficeIPC_4acd679a70dd792afe65d
On closer inspection I found that the two filenames differed in a peculiar way: the name in the strace
log was 10 characters longer than the name of the actual file in /tmp. Notice that the characters
de68cb44c2 are missing from the file created in /tmp.
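The missing tail can also be pulled out mechanically instead of by eye. A small sketch using the
shell's prefix removal; missing_suffix is a hypothetical helper, and the two strings are just the hash
portions of the names shown above.

```shell
#!/bin/sh
# Print the part of $1 that is left after removing the prefix $2.
# If $2 is not a prefix of $1, the shell leaves $1 unchanged.
missing_suffix() {
    full=$1
    truncated=$2
    printf '%s\n' "${full#"$truncated"}"
}

# Hash part as seen in strace vs. hash part of the file in /tmp.
tail_part=$(missing_suffix 4acd679a70dd792afe65dde68cb44c2 4acd679a70dd792afe65d)
echo "missing: $tail_part (${#tail_part} characters)"
```

For these two strings it reports de68cb44c2, which is 10 characters long.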
Now things got interesting. I repeated the whole process and again compared the strace log with the
file in /tmp. Amazingly, even though the filenames had changed, the difference was again exactly 10
characters. Why would 10 characters be cut off from the resulting file? And why did this happen only
with oowriter and with no other application that I ran?
The first thought that occurred to me was that the filename was probably too long. To check, I
created a file with the name found in the strace log:
vinay@debdungeon:~# touch /tmp/OSL_PIPE_0_SingleOfficeIPC_4acd679a70dd792afe65dde68cb44c2
vinay@debdungeon:~# ls -la /tmp/OSL_PIPE_0_SingleOfficeIPC_4acd679a70dd792afe65dde68cb44c2
-rw-r--r-- 1 root root 0 Oct 11 13:56 /tmp/OSL_PIPE_0_SingleOfficeIPC_4acd679a70dd792afe65dde68cb44c2
The file was created without any problem, and no characters were cut. So why could oowriter not create
its file? On further inspection I noticed that the file created by oowriter was a socket: the "s" at
the start of the ls -la output marks it as one. My next thought was that socket creation itself was
broken, but I had gdm installed, which also creates a socket in /tmp, and I had logged into the machine
using gdm. So I knew the problem was specific to oowriter rather than common to all applications that
create sockets.
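Reading the first column of ls works, but the shell can also test the file type directly. A short
sketch; file_kind is a name invented here, while -S is the standard test for a socket.

```shell
#!/bin/sh
# Classify a path as socket, regular file, directory, or other.
file_kind() {
    if [ -S "$1" ]; then
        echo "socket"
    elif [ -f "$1" ]; then
        echo "regular file"
    elif [ -d "$1" ]; then
        echo "directory"
    else
        echo "other or missing"
    fi
}

# Example: classify whatever OpenOffice left behind in /tmp.
for f in /tmp/OSL_PIPE_*; do
    echo "$f: $(file_kind "$f")"
done
```

With this, the file made by touch would show up as a regular file and the one made by oowriter as a
socket, which is the difference that mattered here.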
To make sure that the OpenOffice version I had was working fine, I ran it on the server, where it
started without any problems. So now I had a combination of an NFS-mounted rootfs, OpenOffice, a socket
and a long filename. To confirm that NFS was involved, I decided to put /tmp on a ramdisk instead of
the NFS mount. I added the following to my initialization scripts:
mkfs.ext2 /dev/ram0 1024
mount /dev/ram0 /tmp
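Before drawing conclusions from this test, it is worth confirming that /tmp really left NFS. A rough
sketch that reads a mount table in the /proc/mounts layout (device, mount point, type, options);
fstype_of is a hypothetical helper name.

```shell
#!/bin/sh
# Print the filesystem type of mount point $1, reading a mount table
# in /proc/mounts format on stdin. The last matching line wins,
# because a later mount hides an earlier one on the same directory.
fstype_of() {
    awk -v mp="$1" '$2 == mp { fs = $3 } END { if (fs != "") print fs }'
}

# On the client, after the ramdisk has been mounted:
#   fstype_of /tmp < /proc/mounts
```

If this prints nothing, /tmp is not a mount point of its own and still lives on the root filesystem.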
The client booted, I started oowriter and voila, it started without any problems. Now I was sure that
the problem had something to do with NFS, sockets and long filenames. After some further digging I
finally figured it out.
Unix (AF_UNIX) or local (AF_LOCAL) sockets are created using a struct sockaddr_un, defined in sys/un.h,
which has the following definition:
struct sockaddr_un {
    __SOCKADDR_COMMON (sun_);
    char sun_path[108]; /* Path name. */
};
Now, on an NFS-mounted rootfs, what looks like /tmp to the client is actually a directory somewhere on
the server. The / directory on the client pointed to
"/mnt/disk1/work/nfs-client/rootfs/default/192.168.0.75" on the server. If I append the filename found
in the strace log, I get:
echo "/mnt/disk1/work/nfs-client/rootfs/default/192.168.0.75/"\
> "tmp/OSL_PIPE_0_SingleOfficeIPC_4acd679a70dd792afe65dde68cb44c2" | wc -c
118
That is 118 characters, while the sun_path buffer in sockaddr_un is only 108 characters long.
118 - 108 = 10, which is why exactly 10 characters were always being cut off!
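The same arithmetic as a small script. This is only a sketch of the calculation, and it takes the
conclusion above at face value, namely that the full server-side path is what has to fit. It counts
one extra byte for a terminating NUL, which gives the same 118 as the wc -c run (there the extra byte
is the newline that echo appends). SUN_PATH_MAX and path_overflow are names made up for the example.

```shell
#!/bin/sh
# Size of sun_path in the struct sockaddr_un quoted above.
SUN_PATH_MAX=108

# Print how many bytes of path $1 (plus a terminating NUL) do not
# fit into sun_path; print 0 when the path fits. Assumes an ASCII
# path, so that characters and bytes are the same thing.
path_overflow() {
    needed=$(( ${#1} + 1 ))
    if [ "$needed" -gt "$SUN_PATH_MAX" ]; then
        echo $(( needed - SUN_PATH_MAX ))
    else
        echo 0
    fi
}

root="/mnt/disk1/work/nfs-client/rootfs/default/192.168.0.75"
sock="tmp/OSL_PIPE_0_SingleOfficeIPC_4acd679a70dd792afe65dde68cb44c2"
path_overflow "$root/$sock"
```

For the path above it prints 10, matching the number of characters missing from the socket name.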
Finally, I changed the setup so that / pointed to "/nfs-client/rootfs" on the server, and OpenOffice
started normally.
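By the same reasoning, the length of the server-side root decides how long a socket name in /tmp can
get before it is cut. A hedged sketch, again assuming that the server-side prefix counts against the
108-byte sun_path and that one byte is kept for the NUL; max_socket_name is a hypothetical helper.

```shell
#!/bin/sh
# Size of sun_path in struct sockaddr_un.
SUN_PATH_MAX=108

# Print the longest file name that still fits under <root>/tmp/
# for the server-side root directory $1.
max_socket_name() {
    prefix="$1/tmp/"
    echo $(( SUN_PATH_MAX - 1 - ${#prefix} ))
}

max_socket_name "/mnt/disk1/work/nfs-client/rootfs/default/192.168.0.75"
max_socket_name "/nfs-client/rootfs"
```

This prints 48 for the old root and 84 for the new one. The OpenOffice socket name from the strace log
is 58 characters long, so it overshoots the first budget by 10 and fits comfortably within the second.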