Submitted by Matt Callaway, WashU IT Technical Services Manager
The term “DevOps” is increasingly common in the IT industry, together with the closely related “Site Reliability Engineering” (SRE). Like other “terms of art”, these are often bandied about loosely, with their connotations shifting with the context. What is “DevOps”? What is “SRE”? What do these have to do with “Research Infrastructure”, and why is this relevant to IT services in a University setting? The Research Infrastructure Services team (RIS) is guided by the principles, methods, and tools of the DevOps and SRE community, and strives to provide a set of services, a “Service Catalog”, that serves as a toolkit for WashU research faculty. So what does that mean?
The term “DevOps” was coined by Patrick Debois with the beginning of the “DevOpsDays” conference in Belgium in 2009 [1]. The reference is to “Development” and “Operations” and comes from the Agile software development movement. That term, “Agile”, comes with its own history and controversies, and there are many articles one could read on Agile software development. But the Agile Manifesto [2] and the Principles [3] behind it formed the beginning of a culture change in software development. Along with that culture change, its principles guided a set of tools and practices. Concepts such as continuous testing, integration, delivery, and deployment of software led to significant developments in the tracking of code, change management, dependency management, and the building and packaging of software artifacts. Those tools and practices were found to apply not only to software development, but to infrastructure and operations as well. The growth of cloud services reinforced the drive to stop treating servers as permanent fixtures of infrastructure and to treat them instead as disposable units of compute capability. This was a move away from treating servers and infrastructure like “pets” and toward treating them like “cattle” [4].
As system administrators and operations staff learned the processes and tools of the development world, the principles of Agile could be extended beyond the software itself to include the “full stack”: the storage, the network, the servers, the software, the data, and user identities, along with the full lifecycle of managing each, including monitoring and alert management. The application of software development methodologies to systems administration gave rise to “Infrastructure as Code” [5,6]. Agile taught us how teams can iterate in short development cycles with fast feedback. DevOps added collaboration across cross-functional teams, joining software development to systems engineering and operations [7].
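To make the “Infrastructure as Code” idea concrete, here is a minimal sketch in Python. It is an illustration only, not a description of how RIS or any particular tool works: the `Server` type, the `plan` and `apply` functions, and the server names are all hypothetical. The point is that the desired infrastructure is written down as data that can be tracked and reviewed like any other code, and a program works out what must change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical description of one server (illustrative fields only)."""
    name: str
    cpus: int
    image: str


def plan(desired, actual):
    """Return the ordered actions needed to move `actual` to `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            # "Cattle, not pets": a drifted server is replaced, not hand-fixed.
            actions.append(("replace", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("destroy", name))
    return actions


def apply(desired, actual):
    """Return the new state after carrying out the plan (simulated in memory)."""
    new_state = dict(actual)
    for action, name in plan(desired, actual):
        if action == "destroy":
            del new_state[name]
        else:
            new_state[name] = desired[name]
    return new_state


# The desired state lives in version control; the actual state is discovered.
desired = {
    "web-1": Server("web-1", cpus=2, image="debian-12"),
    "web-2": Server("web-2", cpus=2, image="debian-12"),
}
actual = {
    "web-1": Server("web-1", cpus=2, image="debian-11"),
    "old-db": Server("old-db", cpus=4, image="debian-11"),
}

steps = plan(desired, actual)
converged = apply(desired, actual)
```

Running `plan` a second time against `converged` returns an empty list: once the infrastructure matches its description there is nothing left to do, which is what makes this style of management repeatable.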
Over the same period, Google had been developing a parallel discipline internally, which it documented in its 2016 book on “Site Reliability Engineering” (SRE) [8]. This was Google’s guide to running infrastructure, with an introduction titled “The Sysadmin Approach to Service Management”. In that introduction the author notes that “One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions” [9].
The SRE book serves as inspiration and guide for how the RIS team structures and executes its mission:
“In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).”
The RIS team strives to produce a Service Catalog that functions as a toolbox for researchers. We have learned the lessons of IT teams past: growing an infrastructure one ticket at a time, simply building by hand whatever someone asks of us, leads only to unmanageable messes. Rather, we strive for long-term, maintainable, sustainable, predictable, and reliable tools that enable researchers to stop thinking about system administration and focus on their science.
RIS believes that by leveraging the lessons of the Agile, DevOps, and SRE movements, we can structure our work into predictable, measurable “sprints”, with defined deliverables guided by our customers and stakeholders. We can measure and minimize “toil” [10], the manual, repetitive, tactical work with no enduring value, by automating it, protecting our time to focus on developing the services that do provide enduring value. Those services (high-performance storage and computing environments, container management systems, and application stacks) can be built and delivered using the practices mentioned above: continuous testing, integration, delivery, deployment, monitoring, and alerting.
DevOps and SRE are a combination of culture, practice, and tooling. Empowered by these foundational principles, tools, and methods, we strive to bring industry-standard practices to the building of leading-edge computing infrastructure, enabling a world-class University to perform foundational research.