Department of Computer Science and Technology

The department has been running a NetApp filer called elmer as its primary filestore since 2002. Since the original installation, there have been two major hardware replacements, and on each of those occasions the transition was relatively seamless. We had about a day of downtime, during which the filesystem was transferred over and an almost identical service launched on the new hardware. It is now time for a third upgrade, but this one will be a bit different. The purpose of this article is to explain what we are planning.

The existing hardware is performing reasonably well, but is about to reach the end of its supported life. More importantly, NetApp filers are changing to a significantly different operating system known as Clustered ONTAP. Although this provides all of the services we are used to, it is managed very differently and is a major change.

The new hardware was purchased and installed earlier this year. It is already in active use, providing the storage which lies behind the virtual disks in our Xen virtual machine pools. These virtual disks were migrated in the background with very little impact. The next step will be the migration of the user-facing filestore accessed using the NFS and CIFS protocols. This will have to be done in phases over the next few months. There is still some preparatory work to be done before a schedule can be announced, so this is simply advance warning that there is a period of disruption coming up.

We had been hoping to complete the migration before the start of the Michaelmas term, but the substantial preparatory work needed and the limited staff time available mean that this is now unlikely. It should be possible to start migrating some filesystems in September, but the actual file copying is quite slow and the whole task will almost certainly run into term. However, the disruption will affect individual filespaces in turn rather than the entire filer. It is a little early to predict exactly how long each will take, but rehearsals suggest that each filesystem will take a few hours to migrate.

Since the disruptions will be large in number but each one very limited in its impact, announcing each one on the teaching-research mailing list is not practical. Nor is it possible to notify individuals selectively, as we do not know which filespaces people are using (other than the trivial case of home directories). Instead, we expect to announce an opt-in mailing list which people can subscribe to if they want to be informed of the detail.

Meanwhile, here is an outline of some of the things that will change:

  • Snapshots and backups. At the moment, the primary filer elmer holds recent snapshots, and there is a separate "disaster recovery" filer echo, updated every hour, which holds long term snapshots (the oldest being about 5 years old). Although this does serve to keep a safe copy of the data, we have always known that recovery from total loss of the main filestore would be time-consuming, as the backup system is not designed to be put directly into service. The new arrangement is more symmetrical. The primary and secondary systems have a similar specification. The primary system will hold both long term and short term snapshots, and all changes will be propagated to the backup periodically. At the moment the update interval is set to five minutes; it remains to be seen whether this is sustainable in the long term, but we hope to keep the lag time as short as possible. It will not be possible to implement automatic failover, but it will be possible to put the backup system into service manually if the main system becomes unusable (and subsequently copy any changes back in the opposite direction when it returns, or make a new copy on new hardware if it was a total loss). The changes mean that we will in effect be starting from scratch with snapshots, and it will take 5 years for the collection of long term snapshots to be fully populated. In the meantime, a separate copy of the existing historical snapshots will be preserved online for reference.
  • Direct visibility of server filestore. Ever since we started supporting Kerberos authentication, it has been possible to mount the filestore directly over NFS, but we have always strongly discouraged this. The reason is that the namespace exposes many details of the physical structure of the filestore which are not necessarily stable in the long term, and the automount maps are intended to hide those details. The new operating system offers us much more control over the structure of the server-side namespace. Although the use of automount maps will continue, direct mounting of the server namespace will be far more user-friendly.
  • Quotas. Filesystem quotas will work differently. On a NetApp filer, quotas are enforced at two levels. There is an internal structure known as a qtree, which can be given an allocation of space; this is the size displayed by the df command. Within that, there can be individual user and group quotas, which are only visible using local tools. At present, a large number of users share one qtree, and there is similar qtree sharing of group filespaces. The new operating system allows us to have a much larger number of qtrees, and we therefore intend to give each user and each group filespace their own. This means that all of the space displayed by the df command will be available for use, and the ownership of the files within that space will not matter. This should be a much simpler and more intuitive approach to quota enforcement.
  • Security style. File access controls in the Unix and Windows worlds are fundamentally different, so a multi-protocol filer must make some compromises. Every file has either Unix or NTFS access controls associated with it, and if accessed with the alternative protocol, the safest equivalent access list is emulated. The style used for files is controlled by an attribute of a qtree called its security style. At present we are mainly using the mixed security style, which means that the access controls set on a file are determined by the protocol which last set them. Whilst this has worked very well in practice, it has occasionally caused problems and there are some filesystems for which it is not ideal. The finer granularity of qtree allocation in the new filestore will offer many more opportunities to set the security style more appropriately, by forcing files to stick with either Unix or NTFS access controls, depending on which is the primary protocol for updating the filesystem. In particular, users will be able to choose a different security style for their superhome if they wish. In the first instance it will not be possible to have a different security style for unix_home and windows_home, since disentangling them for all users is too complicated for the bulk migration process. However, there is a possibility of providing such separation later if there is a demand for it.
  • NFS security. The current filer supports Kerberos authentication for access using the krb5 security mechanism, but the data traffic is not protected from modification or eavesdropping. The new filer fully supports the krb5i and krb5p security mechanisms if the clients request them. We expect that managed machines will routinely use the strongest mechanism, krb5p, which provides both integrity and privacy.
  • NFS on macOS. At present, Apple Macs cannot use NFS to mount filespaces on elmer because the two systems do not have a crypto suite in common, and NetApp have declined to update this aspect of the old operating system. The new filer does work with macOS, and is likely to provide a better user experience than the current workaround of using CIFS.
  • The filer alias. At the moment, filer.cl.cam.ac.uk is a DNS-level alias for elmer.cl.cam.ac.uk. This arrangement will not work during the transition, since there will be more than one filer in use. This alias name is primarily intended to provide an abstraction for Windows users. It will therefore be changed to refer to a dedicated server which uses the DFS protocol to provide transparent referrals to the correct server. Some folder paths which are only present as an accident of the implementation will in due course disappear, but canonical paths such as \\filer.cl.cam.ac.uk\userfiles\ will continue to work. Clients who refer to \\elmer directly will need to make changes.
  • Unicode filenames. Both NFS and CIFS permit Unicode characters in filenames, but they encode them in different ways. The translation between the two sometimes causes difficulties. A particular problem is that elmer currently allows NFS users to use (almost) arbitrary byte sequences in filenames, and any byte sequences that are not valid UTF-8 are likely to cause problems during the migration. Most of the offending filenames appear to be accidental, and we hope to be able to eliminate them before copying the files. Some people may have already been contacted about this. We are currently awaiting a software update which will improve the handling of characters outside the Basic Multilingual Plane of Unicode, after which the full Unicode character set should be supported and translate correctly between NFS and CIFS protocols.
  • NTLMv2. We hope to increase security by withdrawing support for the NTLMv2 protocol, which we found was sometimes being used unintentionally when Kerberos authentication failed. This may require a little more work to ensure that Kerberos is configured correctly on clients, but will significantly reduce the exposure to theft of credentials.
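
For anyone who wants to check their own filespace for the filename problems described under "Unicode filenames" above, the following is a minimal sketch in Python. It is not the tool used for the migration; the function names classify_name and scan are purely illustrative. It walks a directory tree using raw byte paths and reports names that are not valid UTF-8, as well as names containing characters outside the Basic Multilingual Plane.

```python
import os


def classify_name(raw: bytes) -> str:
    """Return 'ok', 'non-bmp', or 'invalid-utf8' for one raw filename."""
    try:
        # Strict decoding rejects malformed or truncated byte sequences
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid-utf8"
    # Code points above U+FFFF lie outside the Basic Multilingual Plane
    if any(ord(ch) > 0xFFFF for ch in text):
        return "non-bmp"
    return "ok"


def scan(root: bytes):
    """Yield (path, status) for each entry under root whose name is not 'ok'."""
    # Passing bytes to os.walk makes it return every name as raw bytes,
    # so undecodable names are seen exactly as stored on the server
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            status = classify_name(name)
            if status != "ok":
                yield os.path.join(dirpath, name), status
```

A typical use would be to loop over scan(os.fsencode(".")) from within an NFS-mounted home directory and print repr(path) for each result, so that any undecodable bytes are shown as escapes rather than mangled on the terminal.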
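
As a small illustration of the "Quotas" item above, the figures that df displays for a mounted filespace can also be read from a script. The sketch below uses Python's standard shutil.disk_usage; the function name describe_space is hypothetical, and it is an assumption that the values reported over NFS correspond to the qtree allocation in the same way that df does.

```python
import shutil


def describe_space(path: str) -> str:
    """Summarise total, used and free space for the filesystem holding path."""
    usage = shutil.disk_usage(path)  # named tuple of total, used, free in bytes
    gib = 1024 ** 3
    return (
        f"{path}: {usage.total / gib:.1f} GiB total, "
        f"{usage.used / gib:.1f} GiB used, "
        f"{usage.free / gib:.1f} GiB free"
    )
```

Calling describe_space on a home directory path after migration should then show the whole of the free figure as usable, since each user and group filespace will have its own qtree.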

Any questions about the new filer or the migration process should be directed to Martyn Johnson <maj1@cl.cam.ac.uk> in the first instance.


Published by Martyn Johnson on Thursday 22nd August 2019