This project came about as part of the Intel grant to Cornell. Specifically, a part of the grant, under Dr. David Lifka, specified work on developing a batch scheduling system for clusters of WindowsNT machines for the purposes of parallel computation. My project focused on the security issues involved, specifically on how to make users' jobs run in their own security context. The goal of the project is to create a highly scalable yet manageable system whereby batch jobs can be queued and, later, run on slave nodes in the correct security context.
On any multi-user computer, it is paramount that the programs a user runs stay within the security bounds set by the system (not taking up too much drive space, not overwriting other users' files, etc.), and most multi-user operating systems (like WindowsNT and UNIX) handle this. But there are many things about the WindowsNT architecture that make it different from the way the UNIX flavors operate. For example, in UNIX, a user with sufficient permissions may arbitrarily "become" another user; that is, replace his or her set of permissions with that of the other user. This feature is not available in WindowsNT (the reason for this, I believe, is so that NT can maintain an exact audit log of who did what, without having to concern itself with whether User X was acting as User Y when an event occurred). That feature, however, is key to the operation of a batch scheduling system because it allows the system to become the user when the time comes to execute their job. There are only very specific ways in which one user may become another in WindowsNT, and that is the heart of the work I did--discovering and exploiting those ways.
A considerable amount of time was therefore devoted to planning how we would accomplish this "user impersonation." At first, I thought it might be possible to serialize the user's token (the symbol inside the operating system that says the user is logged on) and store it until the time came to start their job. However, I quickly discovered there was no way to simply write the token to disk. Then, noticing the ability to have a thread take on the security context of a logged-on user, I thought perhaps one might spawn a thread for each job in the queue (at the time the job was submitted, thus being able to copy the token from the user when he/she submitted the job) and suspend that thread until the time came to start their job. The problem with this was that it was only as permanent as the program which controlled the system--if that program, or the system it was running on, crashed, all the jobs in the queue would be lost, which was quite unacceptable.
Then Dave suggested we use WindowsNT services (programs which run independent of the user logged into the workstation). This certainly had its appeal because we could simply configure the service to run as the user (using normal service parameters maintained by WindowsNT). This method also solved the problem of how to allow each user access to only the machines that the system currently had designated for them. The drawback, however, was that, since we didn't want to be responsible for storing the users' passwords, each user on the cluster would have to have their own "impersonation service" on each node in the cluster. For a couple of reasons, we decided to go ahead with this idea, and we designed a system and wrote most of the code to distribute each user into the corresponding impersonation service on each node. This proved to be a considerable task. In the end, though, it was abandoned for two reasons:
Finally, we decided that we were going to have to store the user's password. Given that, and using the domain security model, we could now create instances of any user on any node of the cluster when necessary, completely eliminating the need for the impersonation services. It is still necessary to run one service per node to control the node's function in the cluster but this was clearly superior to the previous model. Thus each user's password is kept only on the server and is distributed to each node when that user's job starts on that node. To support this architecture securely, I wrote C++ classes using the CryptoAPI provided by WindowsNT, allowing me to implement full RSA public key encryption when storing or transmitting sensitive data.
The final architecture emulates, and is implemented using, a WindowsNT domain. The Primary Domain Controller functions both as a place to authenticate users and, logically, as the control center of the cluster. Not only does it manage the job queue, but it also tracks the status of each node in the cluster. Each user is a domain user having group membership according to normal group management policies. Passwords are stored encrypted in the registry. Each node is a member of the domain but has special permissions set as to who can log into it from the network (or interactively if necessary). Specifically, there is a global user group in the domain for each node in the cluster. Nodes only allow access to users who are members of the group corresponding to that node (thus a user must be a member of the Node X group in order to access Node X). To start a user's job, the master node makes that user a member of the groups whose nodes the job is to run on. Then the master node alerts the control services on each of the nodes that they are to start a job for this user, sending the user's password encrypted along with the job parameters. As for user administration, users can be added to the cluster using the normal User Manager (which wasn't possible using the impersonation service model--separate programs had to be written for all user maintenance functions). To allow users to change their password, a special service was written to run on the control node (with a corresponding client--nothing more than a secure, stripped-down, and specialized telnet client). The service accepts secure connections from the nodes specifying the username and both the old and new passwords. The control node then verifies the user using the old password and, if successful, sets the new password in that user's domain account and in the corresponding entry in the registry.
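The per-node group rule described above can be illustrated with a minimal sketch. It is not the project's code: the `NodeGroups` class and its method names are hypothetical, and an in-memory map stands in for the global groups that the real system keeps in the NT domain.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the domain's per-node global groups.
// In the real system these are NT domain groups, one per cluster node.
class NodeGroups {
public:
    // What the master node does before starting a job: make the user a
    // member of the group of every node the job is to run on.
    void grantForJob(const std::string& user,
                     const std::vector<std::string>& nodes) {
        assert(!user.empty());
        for (const auto& node : nodes) {
            members_[node].insert(user);
        }
    }

    // What a node checks: only members of its own group may log in.
    bool canAccess(const std::string& user, const std::string& node) const {
        auto it = members_.find(node);
        return it != members_.end() && it->second.count(user) > 0;
    }

private:
    std::map<std::string, std::set<std::string>> members_;  // node -> users
};
```

Calling `grantForJob("alice", {"node1", "node2"})` makes `canAccess("alice", "node1")` true while leaving every other user and node unaffected, which is the property the scheduler relies on.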
This final architecture is superior for
a number of reasons. Most notably, it maintains security in two ways:
first, by using NT's CryptoAPI to implement completely secure subsystems
(registry and sockets), and second, by using NT permissions on the registry
to make a potential hacker's job even more difficult. The architecture
also provides for very simple user administration, with no special programs
required for anything except changing the user's password. This reduces
the code base considerably and makes for a much more elegant architecture.
Finally, the final model eliminates the problem of adding nodes to the
cluster--one simply has to make a node a member of the domain.
There are two main subsystems supporting the Cluster Control Service (the service which implements all the cluster control functionality): the SecureSocketsChannel and the UserSecureRegistry. Both are C++ classes that make use of the CryptoAPI provided by WindowsNT.
The idea behind the SecureSocketsChannel object is
that all one has to do is attach it to a CSocket to be able to send encrypted
information across the socket. The CSocket is unmodified so that
the programmer can still send unencrypted information if necessary.
The majority of the work for this class is done in the attach method.
The attach method takes, in addition to the CSocket, a boolean parameter
that simply indicates which side will initiate the negotiations. It makes
no difference which side does, as long as both parties initiate the attach at the
same point in the stream and one side is the sender while the other side
receives. It performs the negotiations as follows:
|Sender|Generate a new key suitable for key exchange (public key), export it out of the security context, and send it to the Receiver|
|Receiver|Import the received "key blob," generate a symmetric key, export it using the public key from the Sender, and send it to the Sender|
|Sender|Import the symmetric key and store it for use|
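The three negotiation steps above can be sketched in portable C++. This is only a toy model of the message flow, not the CryptoAPI implementation: the `Sender`, `Receiver`, `PublicKeyBlob`, and `negotiate` names are hypothetical, the "public key" is textbook RSA with tiny fixed primes (p = 61, q = 53), and the Receiver's symmetric key is passed in rather than randomly generated so that the example is deterministic. None of this is secure; it only shows who sends what to whom.

```cpp
#include <cassert>
#include <cstdint>

// Toy RSA parameters (assumption, for illustration only):
// n = 61 * 53 = 3233, e = 17, d = 2753, so that e * d = 1 mod 3120.
constexpr std::uint64_t kToyN = 3233;
constexpr std::uint64_t kToyE = 17;
constexpr std::uint64_t kToyD = 2753;

// Square-and-multiply modular exponentiation; safe here because m is tiny.
std::uint64_t modPow(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
    std::uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = (result * base) % m;
        base = (base * base) % m;
        exp >>= 1;
    }
    return result;
}

// Hypothetical "key blob": the exported public half of the key pair.
struct PublicKeyBlob {
    std::uint64_t n;
    std::uint64_t e;
};

class Sender {
public:
    // Step 1: export the key-exchange public key for the Receiver.
    PublicKeyBlob exportPublicKey() const { return PublicKeyBlob{kToyN, kToyE}; }

    // Step 3: import (unwrap) the symmetric key and store it for use.
    void importSessionKey(std::uint64_t wrapped) {
        sessionKey_ = modPow(wrapped, kToyD, kToyN);
    }

    std::uint64_t sessionKey() const { return sessionKey_; }

private:
    std::uint64_t sessionKey_ = 0;
};

class Receiver {
public:
    // A real Receiver would generate a random key; it is injected here.
    explicit Receiver(std::uint64_t sessionKey) : sessionKey_(sessionKey) {}

    // Step 2: import the blob and export the symmetric key wrapped with
    // the Sender's public key.
    std::uint64_t wrapSessionKey(const PublicKeyBlob& pub) const {
        assert(sessionKey_ < pub.n);  // toy RSA only handles values below n
        return modPow(sessionKey_, pub.e, pub.n);
    }

    std::uint64_t sessionKey() const { return sessionKey_; }

private:
    std::uint64_t sessionKey_;
};

// Runs the three steps in order; returns true if both sides now share a key.
bool negotiate(Sender& sender, const Receiver& receiver) {
    PublicKeyBlob blob = sender.exportPublicKey();          // Sender -> Receiver
    std::uint64_t wrapped = receiver.wrapSessionKey(blob);  // Receiver -> Sender
    sender.importSessionKey(wrapped);
    return sender.sessionKey() == receiver.sessionKey();
}
```

After `negotiate` returns true, both ends hold the same symmetric key and only the wrapped form ever crossed the socket, which is why it does not matter which party plays the Sender.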
The UserSecureRegistry class uses the public key pair associated with each NT user to generate a key suitable for bulk encryption (in fact, the CryptoAPI will not allow a public key pair to be used for anything but encrypting another key). It does this by exporting the user's public key, hashing it, and using the hash data to generate a symmetric key. This symmetric key can then be used to encrypt data to be written to the registry or to decrypt data that has been read from it (all of this is, of course, done transparently by the class). Only the user who wrote the information can read it because only that user has access to their public key, which is exactly the information necessary to obtain the key used to work with the data.
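The derivation just described (export the public key, hash it, derive a symmetric key from the hash) can be sketched as follows. This is a portable illustration under loud assumptions, not the class itself: 64-bit FNV-1a stands in for the CryptoAPI hash, an XOR keystream stands in for the bulk cipher, and the function names are hypothetical. In CryptoAPI terms the real steps would presumably map onto `CryptExportKey`, `CryptHashData`, and `CryptDeriveKey`, though the original text does not name the calls.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-in hash: 64-bit FNV-1a. NOT cryptographic; illustration only.
std::uint64_t fnv1a(const std::string& data) {
    std::uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;
        h *= 0x100000001b3ULL;                // FNV prime
    }
    return h;
}

// Derive the per-user symmetric key from the exported public key blob.
std::uint64_t deriveRegistryKey(const std::string& publicKeyBlob) {
    return fnv1a(publicKeyBlob);
}

// Stand-in bulk cipher: XOR with a xorshift64 keystream seeded by the key.
// Applying it twice with the same key restores the original data.
std::string xorCrypt(const std::string& data, std::uint64_t key) {
    std::uint64_t state = (key != 0) ? key : 1;  // xorshift must not start at 0
    std::string out = data;
    for (char& c : out) {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        unsigned char k = static_cast<unsigned char>(state & 0xFF);
        c = static_cast<char>(static_cast<unsigned char>(c) ^ k);
    }
    return out;
}
```

A caller would derive the key once from the user's exported blob, pass values through `xorCrypt` before writing them to the registry, and pass them through it again after reading. Two users with different blobs derive different keys, which is what keeps one user's entries unreadable to another.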
The last piece of software I developed was the service
to change users' passwords for them. This service is necessary because
if they were allowed to change their passwords using the normal NT mechanisms,
then the password stored in the registry by the Cluster Control Service
would become out of sync with the one maintained by WindowsNT. To
prevent this and to make sure that the password list is maintained
only by the system administrator, the passwd
program (the password changing client) opens a socket to the pwsrv
service and, using the SecureSocketsChannel, sends the information necessary
to change their password (username, old password, new password).
The password service takes that information and, in order to validate their
old password, tries to log on as them using a special kind of logon (specifically
a Batch logon). If this succeeds, then we know the user is who they
say they are, and we can now change both their NT password and the one
kept in the registry by us. There are many details to note in this
implementation. First, in order to perform user logons, the password
service must be running as a user who has the Trusted Computing Base privilege
(known through the User Manager as "Act as part of the operating system").
However, I believe (I have not fully tested this hypothesis) the service
must not log on as LocalSystem because it is not really a user and hence
doesn't have a key pair associated with it (for the UserSecureRegistry
class to use). Finally, the built-in NT functions for doing user
management (functions that start with NetUser...) require Unicode strings
and thus the application must convert any ASCII strings (like the username
or password) to Unicode before passing them to the NetUserSetInfo function.
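The verify-then-update flow and the ASCII-to-Unicode step can be sketched together in portable C++. Everything here is a stand-in labeled as such: the `PasswordService` class and `widenAscii` helper are hypothetical, a wide-string map plays the role of the NT domain accounts (where the real service would attempt a Batch logon and then call NetUserSetInfo), and a plain map plays the role of the encrypted registry copy. On Windows the conversion would more likely go through `MultiByteToWideChar`; the helper below only handles 7-bit ASCII.

```cpp
#include <cassert>
#include <map>
#include <string>

// Widen a 7-bit ASCII string to a wide string, as the NetUser... functions
// require. Assumption: input is pure ASCII; other encodings need real
// conversion routines.
std::wstring widenAscii(const std::string& s) {
    std::wstring out;
    out.reserve(s.size());
    for (unsigned char c : s) {
        assert(c < 0x80 && "only 7-bit ASCII is handled here");
        out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}

// Hypothetical model of the password service's core logic.
class PasswordService {
public:
    void addUser(const std::string& user, const std::string& password) {
        domain_[widenAscii(user)] = widenAscii(password);
        registry_[user] = password;
    }

    // Validate the old password first (stand-in for the Batch logon), then
    // update both copies together so they cannot drift out of sync.
    bool changePassword(const std::string& user, const std::string& oldPw,
                        const std::string& newPw) {
        auto it = domain_.find(widenAscii(user));
        if (it == domain_.end() || it->second != widenAscii(oldPw)) {
            return false;  // unknown user or wrong old password
        }
        it->second = widenAscii(newPw);  // stand-in for NetUserSetInfo
        registry_[user] = newPw;         // stand-in for the registry entry
        return true;
    }

    // True when the domain password and the registry copy agree.
    bool inSync(const std::string& user) const {
        auto d = domain_.find(widenAscii(user));
        auto r = registry_.find(user);
        return d != domain_.end() && r != registry_.end() &&
               d->second == widenAscii(r->second);
    }

private:
    std::map<std::wstring, std::wstring> domain_;  // models NT domain accounts
    std::map<std::string, std::string> registry_;  // models the registry copy
};
```

The point of routing every change through one function is visible in `inSync`: because the domain account and the registry entry are only ever written together, a failed validation leaves both untouched.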
As is to be expected, there were aspects of the project that were left unaddressed by my initial version of the code. For the sake of brevity, I will simply enumerate them.
On the whole, I would say that this project was rather successful. We spent the right amount of time investigating ways to design the system, evaluating each by how well it accomplished our goal of a scalable, manageable (from both system administration and development points of view) batch scheduling system. We initially chose the impersonation services because it relieved us of the responsibility of maintaining the users' passwords while still allowing us to accomplish our goal reasonably well. As the project went on, we decided that we could still provide a good level of security while eliminating the many pitfalls of the impersonation service model, by using the design described in the Final Architecture section. Overall, the best decisions possible were made at each step, resulting in what I believe is quite a good solution.