It’s been an interesting week online for research computing; we have relevant news ranging from managing people to managing workflows at global scale.
The Subtle Signs Your Team Is in Trouble - Rachel Muenz, Laboratory Manager
An older article that Laboratory Manager sent around on Twitter this week. It’s written in the context of laboratories, but carries over very clearly to research computing teams. They mention two opposite signs to look out for:
In both cases the solutions are the same - do lots of constant listening (e.g., through one-on-ones but also just generally) and give lots of feedback (positive and negative; positive for any input even if disagreeing, and positive for doing so constructively while negative for doing so otherwise).
Tracking Toil using SRE principles - Eric Harvieux, Google Cloud Blog
Writing Runbook Documentation when you’re an SRE - Taylor Barnett, Transposit
“Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”
These two articles came out in the same week, and they work very nicely together. One of the under-appreciated advantages commercial cloud organizations (or other large operations) have at scale isn’t about hardware - it’s that they have better awareness of how much stuff that gets done is repetitive across teams, and they put the resources in to ensuring those things get done smoothly.
Technical teams tend to hate documenting, but identifying these repetitive “toil” tasks and documenting what’s involved in getting them done in a “runbook” has three really important benefits:
Are there things that have to routinely get done in your team that aren’t documented clearly in (kept-up-to-date) runbooks? Would things be easier if there were fewer such tasks?
And related to this, here’s a list of “flight rules” (i.e., very short runbooks) for common nontrivial tasks in git.
I talked last week about Andy Grove’s book High Output Management, which is aimed at management and in particular senior management in technical organizations; this article talks about lessons from the book for those still in the trenches, as technical leads (but not managers).
A Chaos Test Too Far - Kathryn Downes & Arjun Gadhia, Financial Times
I’m a big fan of “chaos testing” (or, arguably more appropriate for a lot of research computing work, Slack’s “disasterpiece theatre”) for testing the hypotheses that “our system is robust to [bad thing]”; the software equivalent is probably fuzzing. We know stuff is going to break, usually at the least opportune time; systems crash, users or operators enter invalid information… best to test it in a controlled manner.
So in fairness I should probably post examples of it going - or nearly going - horribly wrong, like this one. I’d still argue though that the series of events they tested here would almost certainly have happened eventually; far better that it happen with all hands on deck when they were primed to respond rather than towards the end of the day on a lazy Friday afternoon when people were already heading home.
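As a small-scale illustration of the fuzzing idea mentioned above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `parse_walltime` is a hypothetical function standing in for any code that accepts user or operator input, and `fuzz` is a toy harness, not a real fuzzing library. The hypothesis being tested is "this function either returns a sensible value or rejects the input cleanly, whatever it is given".

```python
import random
import string


def parse_walltime(text):
    """Hypothetical function under test: parse 'HH:MM:SS' into seconds."""
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected HH:MM:SS, got {text!r}")
    # int() raises ValueError on anything that is not an integer literal
    hours, minutes, seconds = (int(p) for p in parts)
    if hours < 0 or not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError(f"out-of-range field in {text!r}")
    return hours * 3600 + minutes * 60 + seconds


def fuzz(func, trials=2000, seed=0):
    """Throw random short strings at func; collect anything but a clean result."""
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    alphabet = string.digits + ":-+ .x\t"
    failures = []
    for _ in range(trials):
        length = rng.randint(0, 10)
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            result = func(candidate)
        except ValueError:
            continue  # a clean rejection is acceptable behaviour
        except Exception as exc:  # any other exception is a robustness bug
            failures.append((candidate, repr(exc)))
            continue
        if not isinstance(result, int) or result < 0:
            failures.append((candidate, result))
    return failures


failures = fuzz(parse_walltime)
print(f"{len(failures)} unexpected behaviours found")
```

The design choice worth noting is the fixed seed: as with a chaos test, the point is to break things in a controlled manner, so any failure the harness finds should be replayable.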
Research computing teams have a lot in common with open source communities - even if you aren’t developers or developing open source software. One of the joys of open source communities is that you’re part of a small, visible team solving problems for your users - and that’s exactly the situation we’re in. But there are downsides to that, too. Users can be incredibly demanding, and when you’re a small visible team they can be directly demanding of you, personally. Members of huge proprietary teams don’t have that feedback, for good or ill; as one of thousands of devs on MS Word or staff in AWS it can be hard to feel like you’re making a difference, but you’re not getting emails from your users yelling at you because they don’t like that latest change you made.
There’s a lot of talk in FOSS communities these days about dealing with this feedback, including dealing with burnout, and I think this is something that managers should be keeping an eye on.
This was talked about at FOSDEM this year, and will be of interest to some teams and/or their users: it’s a tool for CI testing of data tables. It will perform checks of tabular data on push to GitHub or S3.
There were a few other relevant presentations in the databases session, including on dqlite (a simple high-availability distributed database built on top of Raft and SQLite) and LumoSQL, a version of SQLite with a faster key-value store and some other advantages. I find it fascinating how a small, high-quality (with hundreds and hundreds of tests!) embedded tool like SQLite ends up being the basis for so many fun experimental projects.
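To make "checks of tabular data" concrete, here is a minimal sketch of the kind of validation a CI job could run on a CSV file. This is not the API of the tool presented at FOSDEM; `check_table` and its `numeric_columns` parameter are hypothetical names, and the sketch uses only the Python standard library.

```python
import csv
import io


def check_table(text, numeric_columns=()):
    """Return a list of human-readable problems found in CSV text."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["table is empty"]
    if len(set(header)) != len(header):
        problems.append("duplicate column names in header")
    # Row numbers assume one record per line (no embedded newlines in cells).
    for row_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} fields, got {len(row)}"
            )
            continue
        for name, value in zip(header, row):
            if name in numeric_columns:
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        f"row {row_no}: column {name!r} is not numeric: {value!r}"
                    )
    return problems


good = "sample,reads\nA,100\nB,250\n"
bad = "sample,reads\nA,100\nB\nC,lots\n"

print(check_table(good, numeric_columns=("reads",)))
for problem in check_table(bad, numeric_columns=("reads",)):
    print(problem)
```

In a CI setting, a script like this would exit with a non-zero status whenever the returned list is non-empty, so that a push containing a malformed table fails the build.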
Most of HPC Happens Under the Radar - Michael Feldman, Next Platform
This won’t come as a surprise to our community, but it’s an important point worth emphasizing; most HPC doesn’t happen at big centres whose every procurement gets lots of press.
Workflow manager news: The Pan-Cancer Analysis of Whole Genomes results came out this week (see also: Unprecedented exploration generates most comprehensive map of cancer genomes charted to date; BSC Powers Pan-Cancer Project). Disclaimer: I was somewhat involved in this project.
From a research computing point of view, one of the big accomplishments of this project, particularly given that much of the computational work was done ~five years ago, was coordinating the running of uniform pipelines on ~2800 sets of cancer genomic sequencing data at centres across the globe. This really helped catalyze work in the genomics community around workflow execution - spurring standards like the Workflow Execution Service and repositories of software and workflows like Dockstore, and helping accelerate the development of workflow runners like Toil, Cromwell, and Nextflow.
One of the things that fascinates me is the difference in workflow runners like these ones for research computing - which are really about running large complex executables that would otherwise have run with bash scripts - and evolving ML workflow runners like Lyft’s recently-announced Flyte. Are the differences in use cases fundamental, or will the tools start to converge? One of the things I’d like to do at some point if there’s interest is a bit of a deep dive into workflow managers of all types.
In Python, dictionaries are now guaranteed to be ordered by insertion - but sets aren’t. I can absolutely guarantee this is going to cause incredibly hard-to-track-down bugs in research software in the coming couple of years.
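A short sketch of the distinction, using made-up sample names: a dict keeps insertion order (a language guarantee since Python 3.7), while the iteration order of a set is an implementation detail and, for strings, can change from one interpreter run to the next because of hash randomization. The helper name `dedupe_keep_order` is just for illustration.

```python
def dedupe_keep_order(items):
    """Remove duplicates, keeping first-seen order via dict insertion order."""
    return list(dict.fromkeys(items))


samples = ["tumour_b", "normal_a", "tumour_b", "normal_c", "normal_a"]

# Deterministic: order of first appearance is preserved.
ordered = dedupe_keep_order(samples)

# Not deterministic across runs: do not rely on this order.
unordered = list(set(samples))

# If a set is the right structure, sort explicitly before any
# order-dependent step (writing output, assigning indices, and so on).
reproducible = sorted(set(samples))

print(ordered)
print(reproducible)
```

The hard-to-find bugs tend to come from code like `unordered` above feeding into something order-sensitive, which works on one run or one machine and silently differs on another.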
Jeremy’s Notes on Fast.AI coding style
A bracing reminder (ternary operators! 120-character lines!) that there aren’t “correct” coding styles; the purpose is to make sure teams reduce internal barriers to collaboration by picking one style that works for them and sticking with it.
Are you tired of your desktop’s file system working too well to be able to do good robustness testing of software? Want to mess with a process’s view of system time? Want stderr output always colorized so you can tell the difference from stdout? Check out this list of ld-preload hacks.
And that’s it for another week; the jobs listing is starting to get long so I’ll put it towards the end. One of the things I want to accomplish with the jobs listing is to show that research computing management is a real profession with a real, emerging community of practice - these job listings may be accomplishing that a little too well.
Have a great weekend, and good luck next week with your research computing team,
Discipline Leader High Performance Computing (HPC) Services and Technology - Department of Defence, Brisbane AU
As a critical member of the HPC Program leadership team you will work collaboratively to provide secure HPC systems, systems support and related HPC technology required to meet researcher needs.
Senior Program Manager, HPC - Microsoft Azure, London UK
In this role, you will work with the Azure global engineering team to drive product design, creation and improvement across different industry vertical focus areas, as well as formulating common engineering building blocks, build structure around these fast growing businesses.
Technical Research Manager - IBM, Warrington UK
The variation in scientific and computational themes and projects requires someone with the necessary experience and oversight to manage not just their area of interest, but also successfully engage in other research areas, lead a team to succeed from a business and research output, and further understand personal and career development needs of their staff.
Technical Manager - AMD, Austin TX USA
AMD Research seeks a strong, collaborative leader with sharp technical skills and the initiative to motivate an expert team. You will manage a research group and work closely with internal teams and external partners to create the next generation of computing technology.
Director-Research Computing - Oregon State University, Corvallis OR USA
The College of Earth, Ocean, and Atmospheric Sciences is seeking a Director of Research Computing. This is a full-time 1.00 FTE, 12-month, fixed term professional faculty position.
HPC Operations Manager - Florida State University, Tallahassee FL USA
The operations manager mentors, trains and supervises a team of staff members assigned to systems administration tasks and leads the management, deployment, configuration and the operation of all computer facilities within RCC.
Tactical High Performance Computing Software Architect - Moseley Technical Services, Minneapolis MN USA
The Tactical High Performance Computing Software Architect leads the development of next generation high performance radar and sensor signal processing for some of the world’s most advanced defense systems.
Software Development Manager, HPC - AWS, Seattle WA USA
You will be building a team which directly helps customers around the world better utilize HPC in the Cloud. You will be responsible for building and managing a team which builds tools both based on open source software and from the ground up.
Senior/Principal Solutions Architect (HPC) - AWS, Houston TX USA
As a trusted customer advocate, you will help organizations understand best practices around advanced cloud-based solutions, and how to migrate existing workloads to the cloud. You will have the opportunity to help shape and execute a strategy to build mind-share and broad use of AWS within enterprise customers.
Manager, Research Informatics - Beth Israel Deaconess Medical Center, Boston MA USA
The Research Informatics Manager position reports to the Director of Academic and Research Computing (ARC). […] the incumbent will meet with researchers to understand their IT needs in order to develop and advocate for an infrastructure that is consistent with BIDMC Information Services but supports the special needs of the research community.
Sr Manager, Strategic Initiatives and Administration - Princeton University, Princeton NJ USA
The Senior Manager for Strategic Initiatives partners with the Associate CIO for Research Computing to help ensure the efficiency and effectiveness of the department and staff responsible for providing research computing services for Princeton University.
Dir Research Computing (Cell and Genetic Therapies) - Vertex Pharmaceuticals, Boston MA USA
We are seeking a highly motivated computational scientist to lead delivery of software and data solutions supporting our rapidly expanding cell and genetic therapy efforts. The ideal candidate has extensive experience building and delivering exploratory and production software in a scientific or research environment.
Assistant Director for Project Management - University Of Chicago, Chicago IL USA
The Research Computing Center (RCC) is seeking an experienced and talented Project Manager who will oversee a variety of complex projects to make sure the RCC is meeting its deliverables and responsible for high quality customer service.
Director, HPC Cloud - Intel, Hillsboro OR USA
Come join the CESG HPC team as Director, HPC in the Cloud to help build the strategy and GTM activities to ensure that HPC workloads will be optimized around IA - by defining solutions and working with Tier1 and Next Wave Cloud Service Providers.
Principal Data Scientist - Sorenson Communications, Taylorsville UT USA
This senior position requires a strong background in statistics, machine learning, analytics, and data engineering. The principal data scientist leads a small team of data scientists and works with senior management to identify opportunities, set strategy, execute on quarterly and annual objectives, and drive business growth.