SALSA - Disaster Planning & Recovery BoF
Fall 2007 Internet2 Member Meeting
October 8, 2007

*Attendee List*
Don MacLeod, Cornell U. (chair)
Heather Flanagan, Stanford U. (speaker)
Celeste Copeland, Duke U. (speaker)
Janet Poley, ADEC
Jose Conda, U. Puerto Rico
Anibel Vega, U. Puerto Rico
Mike Austin, U. Vermont
Lanny Gray, Level3
Liz Lawson, Level3
Joe St. Sauver, Internet2/U. Oregon
Rodney Petersen, EDUCAUSE
Tim Lance, NYSERNet
Mike LaHaye, Internet2
Steve Olshansky, Internet2
Jessica Bibbee, Internet2 (scribe)

*Agenda*
1. Duke & Stanford Back Each Other
  . Agreement
  . Process
  . Lessons Learned
  . Q & A
2. Upcoming Related Events
  . EDUCAUSE Annual 2007
3. Get Involved
  . SALSA-DR Working Group

*Discussion*
{Don} introduced the goals of the SALSA-DR Working Group and the cooperative agreement between Duke U. and Stanford U., which {Heather and Celeste} then explained in more detail.

-Duke & Stanford Back Each Other-
Collaboration first began with DNS, then moved to using a Duke server as another domain authority, initially without hardware transfers. Duke has now shipped out servers that are physically located at Stanford.

All ran smoothly for a few months before they hit their first snag: the patching of a DNS server at Duke did not go properly, Stanford did not know about the work, and DNS resolution for Stanford domains started failing intermittently. They had not accounted for proper change management. The incident proved to be a good learning experience for handling the change process. The information has been captured, anonymized, and made available on the website for use by this group. They also detailed the process to follow once an event triggers this agreement at the other institution. As this was new to both campuses, they erred on the side of being explicit, rather than leaving terms and conditions vague.

{Heather} acknowledged the concern around security and emphasized the importance of ensuring Duke is on another network, in case of a lockdown.

Duke chose to host at Stanford for purposes of communication. They also have a blog accessible by high-level admins to put out an alert, should a network go down. They have detailed a process of who gets called at Stanford, e.g., using a satellite phone if necessary. They also send a feed to Stanford with biodemo (Bio/Demo Data) and contact information for all students, staff, and faculty so they can still be contacted; if people can be reached, the rest of the information can be reestablished. Right now, they are using 3 servers (not scalable), meaning they will need more to handle the load in an emergency. There is a feed from their Identity Management structure to keep everything up to date.
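
As an illustration only, the contact feed described above might be produced along the following lines. This is a minimal sketch and not the process Duke actually uses: the minutes do not describe the feed's format, so the CSV layout, the field names, and the function build_contact_feed are all assumptions made for the example.

```python
import csv
import io

# Hypothetical field names; the minutes do not specify the feed's schema.
FEED_FIELDS = ["person_id", "name", "affiliation", "email", "phone"]


def build_contact_feed(records):
    """Return CSV text holding only the contact fields for each person."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FEED_FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # The feed exists so people can be reached; skip entries with no
        # email address and no phone number.
        if not (record.get("email") or record.get("phone")):
            continue
        # Copy only the listed fields so other identity data stays behind.
        writer.writerow({field: record.get(field, "") for field in FEED_FIELDS})
    return buffer.getvalue()
```

Restricting the export to a short, explicit field list keeps the copy held at the partner site limited to what is needed to reach people, which fits the concern raised in the session about data leaving a campus's physical control.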

What happens in the event that a Duke machine at Stanford is compromised, causing further security issues at Stanford? A solution may be as simple as physically removing their server from the Stanford network. This process is still being negotiated.

{Heather} shared Stanford’s nervousness surrounding sending Stanford servers to Duke with our “keys to the castle”. Once the servers are out of Stanford’s physical control, a new element of risk is added. The scope aims to ensure we have a copy of the KDC; how is that data controlled? She stressed that a key point they learned is that a project manager is vital to act as the liaison and ensure all staff have assigned roles and follow up on their assignments.

{Celeste} noted that Duke has had an easier time getting buy-in from senior management. At Stanford, they have 6 Disaster Recovery projects – which makes it a challenge for management to prioritize.

{Heather} also stressed that back-up issues, while important, are a lesser concern – the real scare is to lose communication. Ensuring communication must be the first priority.

Q: Who is allowed access to these ‘boxes’ with all the data?
A: Though the racks are unlocked and open, security is handled by only allowing trusted staff into the data center.

Q: Are these servers backed up?
A: Replication of back-up data seems like too much redundancy. Assuming a loss of data at Stanford, there ought to be a backup at Duke. It is unlikely that data at both Duke and Stanford will be compromised at the same time; in fact, geographical location was one of the factors for forming the relationship. An earthquake in CA will call upon Duke’s help, and a hurricane in the East will call upon Stanford's.

Q: How did you go about priority-setting?
A: What is broken today trumps what is broken tomorrow, {Heather} said, and so these bases were the first to be covered.

Q: Did you pay for an outside business to handle backups or cycle services?
A: Within Stanford, yes. But adding outside parties becomes a liability issue, i.e., what happens to their tapes?

Q: Are there plans to move authentication data to Duke?
A: That issue will take some concerted thought; no hard decisions have been made yet.

Q: Would Stanford be comfortable putting a tape library at Duke?
A: It is not a matter of comfort; it just seems like more trouble than it is worth.

Q: Where would you run PeopleSoft?
A: Stanford has an *opt-in* central IT system, which provides a certain amount of standard infrastructure. For example, we utilize AFS, but have little influence over PeopleSoft, Oracle, etc., since they are handled by an entirely different branch.

Q: What considerations are raised with HIPAA?
A: {Heather} said their work does not deal with HIPAA. While there is an overlap between Duke central IT and the health system, they do generally operate as two separate entities. Research data from human subjects may necessitate additional handling.

Q: Would you say that the stand-alone model is easiest?
A: The stand-alone model is more efficient now, though the long term may prove otherwise.

Though it was a coincidence that {Heather} used to work at Duke, her background knowledge of Duke’s operations has been very beneficial to their work.

{Don} made the point that it is important to have a plan that can overcome a failure in technology. {Heather} shared a story that illustrated this: after tearing out old generators and installing new temporary generators, they ran a test, but did not realize that someone had left the access door open and the kill switch was thus in the wrong position, disabling the generator. Surprisingly, the *new* generator did not run for more than 30 minutes, and in the end, they were unable to cleanly shut everything down. In the process, they managed to burn out the control panel and could not put it in bypass mode. Next to go were the circuit maps. Since the generator was so new, no one was familiar enough with it to recognize that the problem was the open access door. Fortunately, Stanford has emergency operations in place that spell out what to do and who does what. This is documented in a folder in one room, next to red phones for when cell phones died. There was a scribe to take notes and document everything that happened. The plan was not limited to central IT, but extended even to a liaison for the student radio center, and impacted construction projects. Damage was minimal, considering what could have happened.

One of the outcomes was that they now have a standing whiteboard with services already listed, reducing some of the responsibility (and stress) on the scribe to capture everything being shouted by multiple people running through the room. This particular incident did not trigger the agreement with Duke, as Stanford’s own auxiliary data centers kept the key services available to our population.

Q: Is it realistic for a campus to plan for all that might happen? Or will there always be unexpected items?
A: While you can hardly plan for everything, it is still imperative to conjure up a worst-case scenario. If it can be imagined, it can and likely will happen at some point in time.

There is much to be learned from experiences on other campuses. In the wake of the fatal shootings at Virginia Tech, Cornell U. has assigned a VP position in charge of Safety, Health, and Risk Management. Such management tasks can and perhaps should be university-wide and have the support of IT and even the Chief of Police. For example, if the responsible individual should be out of the office, there ought to be a backup plan that is fully executable.

{Don} shared how a local, but off-campus, event motivated Cornell U. to take action. There was not a clear policy on how public to make emergency announcements, i.e., how far the message ought to travel. What happens when a message gets out, not just too late, but too early? Additionally, to what extent do these top individuals contact the entire directory? What prerequisites are there before everyone in the system can be alerted in one fell swoop? Such an emergency message could spark unnecessary fears and cause more damage if it is not propagated along a well-thought-out and well-communicated plan. People acting in good faith may try to follow the rules, but what if they do not know the rules have changed?

{Don} spoke on the Available Notification System, identifying various technologies to send out an announcement. If using a mail blaster to help spread an emergency communications message, it is important to have updated student/staff information, etc. The problem is getting everyone to update their information. A system of rewards or punishment may provide incentives; e.g., prevent a student from registering for a class unless they have updated their profile, or issue discounts at the local student union to those who respond, etc. A campus also has to be careful about other departments wanting to use such a mail blaster for their own advertising purposes. Also, if there is a crisis at 3 AM, who is online to check their messages? Simpler technology may be the wiser option in some instances.
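
The registration incentive mentioned above could be sketched roughly as follows. This is a hypothetical example, not a description of any campus's system: the 180-day threshold and the names MAX_PROFILE_AGE, may_register, and stale_profiles are assumptions chosen for illustration.

```python
from datetime import date, timedelta

# Assumed policy: contact details must have been confirmed within 180 days.
MAX_PROFILE_AGE = timedelta(days=180)


def may_register(last_confirmed, today):
    """Return True if contact details were confirmed recently enough."""
    if last_confirmed is None:
        return False  # contact details were never confirmed
    return today - last_confirmed <= MAX_PROFILE_AGE


def stale_profiles(profiles, today):
    """Return IDs of people who need to re-confirm their contact details."""
    return [
        person_id
        for person_id, confirmed in profiles.items()
        if not may_register(confirmed, today)
    ]
```

The same list of stale profiles could just as well drive reminders or rewards rather than a registration hold; which incentive to use is a policy question the session left open.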

Please refer to the session page, where {Don’s} slides will soon be posted:
     < http://events.internet2.edu/2007/fall-mm/sessionDetails.cfm?session=3472&event=273 >

The attendees voiced some interest in exploring applicable technologies to be used in the Disaster Planning & Recovery space.

-Upcoming Events-
The upcoming EDUCAUSE Annual 2007 conference will have several Disaster Recovery/Business Continuity related sessions:
 < http://educause.edu/11078?ID=Business+Continuity+and+Disaster+Recovery+Planning&Session_Numbers=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 >.

-Get Involved-
For more information, please visit the SALSA-DR (Disaster Planning & Recovery) Working Group home: < http://security.internet2.edu/dr/ >. You will also find subscription information for the SALSA-DR mailing list, where the monthly Working Group calls are announced.