Research Computing Teams Link Roundup, 28 Feb 2020
Hi!
Last week’s roundup was a little link-heavy, and I’m hearing that it ended up in some people’s spam filters; sorry about that. If ever you don’t get one of these emails, you can check out the archives - it wasn’t my intention to maintain an archive of these posts, but people have found it useful.
Interesting week this week - building team members up, breaking legacy code down, and SQL takes an unexpected turn. Read on!
Managing Teams
An Engineering Team where Everyone is a Leader - Gergely Orosz
A reader wrote in referencing this article, saying it was particularly useful in the context of a growing research software team supporting multiple projects - it provides a structured way to delegate while promoting team members' development and responsibility. And it is a great article!
The idea is that, rather than running all of your team's projects yourself as the team lead, you tap your team members to run them, with you supervising and coaching. For a more junior staff member this could mean designing, kicking off, and managing a project that lasts three weeks and is mostly their own work with some contributions from someone else; after racking up some successes it can progress to leading major efforts. By delegating this project management you're both taking some work off your desk (eventually, anyway - this takes quite a lot of managing and coaching at first!) and developing your team.
This could end very messily without clear expectations, and the author - who has worked at Uber, Microsoft, Skype & JPMorgan - includes an extremely detailed ten-page Google Doc on how to run a project; in the article he describes being very prescriptive about "this is the way you will run the project" for a team member's first attempt or two, and as they become more successful they gain progressively more autonomy over how it's done. As they grow more confident in their skills, they can even be the ones to mentor the team members who are just starting out on their own project management efforts.
I really like this. I see lots of posts along the lines of "Help, I'm a New Manager", and there are articles intended to help, like this perfectly decent recent one: This 90-Day Plan Turns Engineers into Remarkable Managers. But the way people learn to manage projects and teams is to actually manage projects and teams - ideally starting small and progressing to large, and with a lot of supervision. This is a great way to help your team members do that, and to be ready for their next role, while reducing the number of tasks you are personally juggling.
This reminds me a lot of the idea of the responsibility ladder that Manager Tools introduced me to. I came from an environment where everyone I worked with had gotten a PhD, and so had the experience of managing at least one long-term project - their own dissertation research. The first time I hired someone who didn't have that experience wasn't a great time for me or for them — I'd give them a huge multi-month piece of work and just assume they'd figure it out. After all, that's just how we do things in research, right? They flailed; I was frustrated and baffled - this person was smart, what was wrong?
Diligently, intentionally walking people up that responsibility ladder, up to managing projects of significant scale, is an excellent way of tackling the two biggest responsibilities we have: getting things done, and developing the people on our staff.
10 ways to get untapped talent in your organization to contribute - Pamela Rucker, O’Reilly Blog
This article pairs quite nicely with the one above - it’s a little more general and gives less specific advice, but the basic idea is the same. While the author spells out ten ways, they all come down to combinations of four basic approaches:
Teaching: coaching and mentoring your people on strategic priorities and using their capabilities correctly
Telling: sharing stories, statistics, and strategies about where you’re going, what’s expected, and what’s worked in the past
Trusting: delegating work to others so you’ll have time to be far more strategic in your own work efforts
Trying: exploring new innovation opportunities as you look for new ways to grow the business or become more efficient.
The idea here is to understand your team members better, find what they're good at, and find new ways of matching those strengths to things that need to be done, while clearly communicating what the end goals are. It can be applied to turning underperforming team members into good contributors, helping strong performers grow new expertise, or anywhere in between.
How to Coach Employees? Ask these One-on-One Meeting Questions - Claire Lew, KnowYourTeam
The Ultimate 1-on-1 Meeting Questions Template - PeopleBox
All of the above approaches require knowing your team members well - how well they’re doing what they do now, what they’d like to do next, and where they have untapped strengths. Regular one-on-ones, where the focus is on the team member and not on you or on status updates, are by far the best tool we have to learn these things and to build strong working relationships.
Claire Lew’s post focuses specifically on questions around coaching team members so you can help their skill and career development, broken into four categories: their perception of the current state, the ideal outcome of where they need to be, what they are most motivated by, and what’s holding them back.
The resource from Peoplebox is a list of 500 (!!) one-on-one questions (plus alternates if you don't like the wording) that could be useful to skim through occasionally to see if there are any topics you haven't been covering but might want to in the future.
Research Software Development
The key points of “Working Effectively with Legacy Code” - Nicholas Carlo
Exit the Haunted Forest - John Millikan
An awful lot of the code we work with in research computing can be thought of as legacy code - whether it's functioning but old code that needs to be refactored to meet current needs, or new code from a researcher that just isn't maintainable in its current form. I like to think of reworking such code as helping the code reach its potential.
Working Effectively with Legacy Code by Michael Feathers is a hugely influential book on that topic, and it sits comfortably in the top half of a list of the twenty-five most recommended programming books on the web that was published this week. The book has been influential enough that even those of us who haven't read it are likely aware of its basic tenets, like "legacy code is code without tests." Nicholas Carlo, who has a newsletter on working with legacy code, walks us through a nice summary of the book and its five basic steps (a minimal sketch of the first few follows the list):
- Identify change points (Seams)
- Break dependencies
- Write the tests
- Make your changes
- Refactor
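To make the first couple of steps a bit more concrete, here's a minimal Python sketch (the names are made up purely for illustration) of introducing a seam and getting a test in place before touching anything else: a function with a hard-wired dependency gets that dependency passed in, so a test can substitute a fake.

```python
# Before: the routine talks straight to a production system, so there's
# no way to test it in isolation.
# def mean_sample_value():
#     rows = ProductionDatabase().fetch_all("samples")
#     return sum(r["value"] for r in rows) / len(rows)

# After: the dependency is passed in through the seam, so a test can
# substitute a fake before we make any real changes.
def mean_sample_value(source):
    rows = source.fetch_all("samples")
    return sum(r["value"] for r in rows) / len(rows)

class FakeSource:
    """Test stand-in; returns canned rows instead of hitting a real system."""
    def fetch_all(self, table):
        return [{"value": 1.0}, {"value": 3.0}]

def test_mean_sample_value():
    assert mean_sample_value(FakeSource()) == 2.0
```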
The post by John Millikan focusses more on step four, making changes. Code that has become legacy through age usually got that way because it was resistant to change - after early developers tried to make inroads into the haunted forest and were chased away by bad experiences, others started steering clear. While it's true that good code is all alike and each piece of bad code is bad in its own way, there are some common patterns in the sorts of badness that have resisted being changed, and the post describes how to wrestle with a few of them.
Early Impressions of Go, from a Rust Programmer - Nick Cameron, PingCAP
I like “[Language X] from the point of view of a [Language Y] programmer”-type posts, because you get an interesting spin on maybe-familiar capabilities.
Rust is becoming more common in some research computing circles as a C++-replacement with a package manager (and yes, memory safety), so it’s interesting to read what a Rust programmer has to say about Go, which is a deliberately simple programming language with much tighter scope that favours explicitness. If you’re familiar with either language the post will be interesting.
Quantifying Independently Reproducible Machine Learning - Edward Raff, writing at The Gradient
We worry a lot about replication and reproducibility in research computing. In this article, the author describes attempting to independently replicate the results and basic methods of 255 (!!!) ML papers. Crucial here is the independent part: it's not enough to just run the published code; the methods have to be reimplemented from the paper. He was successful 162 times.
That's enough papers to do some quantitative analysis, and it's interesting which aspects of the work were correlated with successful independent replication and which weren't. A clearly written paper and authors who answered emails mattered much more than published source code or worked sample problems.
Premortems: The solution to the Preventable Problems Paradox - Shreyas Doshi on Twitter
This is a great twitter thread, which I assume is a summary of one of the author’s presentations, giving very specific advice on how to run a pre-mortem before starting a project to identify potential issues before they arise. It’s so easy for people to see potential issues and not say anything; it could be because they’re not comfortable speaking up, but it could just as easily be because they assume someone more senior would have thought of that already and decided it wasn’t a problem. Get all those tigers, paper tigers, and elephants (you’ll see) out in the open and make sure you have plans for them if you discover them in the wild.
Product Management and Working with Research Communities
Under Discussion: The Maintenance of Large Open-Source Projects - Anne-Laure Civeyrac, Welcome To The Jungle
In this article, the author interviews Bert Belder and Vladimir Agafonkin, who lead or have led large JavaScript projects; but the key lessons here will be familiar to those who have led research computing projects too, particularly once they leave the exciting new phase and become mature and stable. A vital point identified by both was clarity of focus for the products - both in what they would do and, maybe more importantly, what they would not. With that in place (and understood by the community) it becomes much easier to close issues as something that won't be addressed, or to choose not to incorporate an out-of-scope contribution, or to recruit and onboard new team members.
Data Scientists Get all the Glamour But, Wow, Is There a Need for Data Engineers - Ori Rafael, The New Stack.
It is notably harder right now (about 2x, it would seem) to find the "back end" people to manage the data systems and do the data processing and translation work than it is to find the people to do the analytics or ML on the data. I think there's a bit of an analogy here with science and research computing teams; just as you can't do data science without the data engineering, it's hard to do cutting edge science without strong research computing teams. We just have to figure out how to convince the funders and institutions of that.
Emerging Data & Infrastructure Tools
Five Interesting Data Engineering Projects - Dmitriy Ryaboy
It’s always interesting to keep an eye on what's coming up in data engineering, because eventually it will be taken up by research computing teams who have similar data ingest and analysis needs. This article covers a few tools that are emerging in that community:
- DBT (Data Build Tool) - Version controlled, parameterized, modular SQL workflows.
- Prefect - A workflow orchestration framework for building, scheduling, and monitoring data pipelines.
- Dask - Regular readers will know I’m a big fan of Dask for complex parallel Python workflows (a minimal sketch follows this list); not mentioned here is that it now has built-in support for HPC-style networking via UCX
- DVC - Version control for analyses and the corresponding data sets. I’ve always loved this sort of idea, and it’s very relevant to research computing, but I’ve yet to see it succeed. We’ll see how this goes.
- Great Expectations - In the post author’s words, asserts for data. Seems like another good way to do CI/CD for data sets.
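Since Dask keeps coming up in these roundups, here's the minimal sketch promised above - purely illustrative, using the dask.delayed interface to turn ordinary Python functions into a task graph that runs in parallel where the dependencies allow:

```python
# Purely illustrative dask.delayed workflow: each step is an ordinary
# Python function, and Dask builds and runs the task graph, executing
# independent branches in parallel.
import dask

@dask.delayed
def load(i):
    return list(range(i * 10, i * 10 + 10))

@dask.delayed
def clean(chunk):
    return [x for x in chunk if x % 2 == 0]

@dask.delayed
def summarize(chunks):
    return sum(len(c) for c in chunks)

cleaned = [clean(load(i)) for i in range(4)]   # four independent branches
total = summarize(cleaned)                     # one reduction step
print(total.compute())                         # builds and runs the graph; prints 20
```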
CUDA-Python and RAPIDS for blazing fast scientific computing
This Inside HPC post highlights an XSEDE seminar given by Abe Stern from NVIDIA. The Python data science community has built some really amazing tooling for high-speed data analysis; RAPIDS is a really cool data analysis (yes yes, “AI”) toolchain for GPUs, and this talk shows what can be done with Numba (which for math on rectangular multidimensional arrays is frankly kind of magic) and RAPIDS put together. Those efforts, plus Dask as mentioned above, and things like BlazingSQL, which builds on RAPIDS for SQL analytics on large data sets, are powering a number of really cool applications.
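I won't try to reproduce the talk, but to give a flavour of the Numba piece, here's a minimal CPU-side sketch; the talk itself goes further, targeting GPUs with numba.cuda and RAPIDS.

```python
# A minimal Numba sketch: JIT-compile a plain Python double loop over a
# NumPy array to near-native speed.  (This is just the CPU flavour; the
# GPU side uses numba.cuda and RAPIDS.)
import numpy as np
from numba import njit

@njit
def nearest_pair_distance(points):
    """Brute-force closest-pair distance for an (n, 2) array of points."""
    n = points.shape[0]
    best = np.inf
    for i in range(n):
        for j in range(i + 1, n):
            dx = points[i, 0] - points[j, 0]
            dy = points[i, 1] - points[j, 1]
            d = (dx * dx + dy * dy) ** 0.5
            if d < best:
                best = d
    return best

pts = np.random.rand(2000, 2)
print(nearest_pair_distance(pts))   # first call compiles; subsequent calls are fast
```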
Build a Python App with CockroachDB and Pony ORM - CockroachDB
Most research computing applications don’t need multi-region databases for availability or location-based sharding of data, but it does happen, and I continue to be amazed at how relatively straightforward it is with Cockroach; by choosing to support Postgres’s wire protocol it is automatically supported by libraries and ORMs in a number of languages (I’ve linked to a Python example here, but from that page you can also see the equivalent application in C++, Go, Java, Ruby, and Rust, amongst others). Even starting a cluster isn’t too bad. Last month, Karl Seguin had a nice blog post about migrating a Postgres-based application to Cockroach; it went mostly pretty well, but there are features he found lacking. Some of those will be easily fixed, but some are likely fundamental to the differences that come with a distributed database.
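To give a sense of what speaking the Postgres wire protocol buys you in practice, here's a minimal sketch using an ordinary Postgres driver (psycopg2) rather than anything Cockroach-specific; the connection details are placeholders for a local, insecure, single-node cluster.

```python
# Because CockroachDB speaks the Postgres wire protocol, a stock Postgres
# driver works unchanged; the connection details below are placeholders
# for a local single-node test cluster.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=26257,   # CockroachDB's default SQL port
    dbname="defaultdb", user="root", sslmode="disable",
)
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS samples (id SERIAL PRIMARY KEY, value FLOAT)")
    cur.execute("INSERT INTO samples (value) VALUES (%s), (%s)", (1.5, 2.5))
    cur.execute("SELECT count(*), avg(value) FROM samples")
    print(cur.fetchone())
conn.close()
```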
Lustre HPC file systems no longer disappear in a puff on Amazon Web Services – but you'll pay more - Tim Anderson, The Register
Azure HBv2 Virtual Machines eclipse 80,000 cores for MPI HPC - Microsoft Azure Blog
AWS and Azure are making increasingly plausible plays for HPC in the cloud. Tim Anderson’s article for The Reg describes updates to AWS’s managed Lustre service with automatic data replication and file server failover (if you know anyone who manages a Lustre file system you’ll know that a decent managed Lustre service is something that’s genuinely worth paying real money for). Promises are in the millions of IOPS range - which isn’t a lot for a large installation, but there’s no reason you’d have to have only one such FSx deployment - and you can pay more or less depending on your desired throughput to and from the file system. Really interesting.
The Azure article is kind of dishearteningly sales-pitchy, but it talks about their HBv2 instances with 200 gigabit HDR InfiniBand - these nodes only really exist because Azure is going after HPC workflows. It is really interesting to see decent scaling up to 80,000 cores in the cloud, which was hard to imagine 2-3 years ago; a colleague has seen something similar on AWS.
Reperforming a Nobel Prize Discovery on Kubernetes
New to me but apparently a year old, this work (there are several links under “Publications”; the deep-dive presentation is particularly interesting) shows how Kubernetes can be used as essentially a cloud scheduler for the very large number of embarrassingly parallel but very irregular computations needed in particle physics; here they reanalyze the discovery data for the Higgs boson.
Random
Fastpages from fast.ai is a particularly slick and easy-to-get-started-with setup for starting blogs using Jekyll and GitHub Pages. It uses GitHub Actions both to get things initially set up and to make posts - including making blog posts or pages out of Jupyter notebooks, which seems like it may be of interest to a number of research computing teams.
(Is anyone in research computing using GitHub Actions for anything real yet? It seems like it should be really useful, but I’m not yet seeing the number of uses I would have expected.)
This week I learned that you can generate ASCII-Art Mandelbrot Sets and do interactive 3-d rendering in SQLite’s dialect of SQL with recursive queries.
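I won't reproduce the Mandelbrot query here, but to show the mechanism it's built on, here's a much smaller recursive common table expression run through Python's built-in sqlite3 module:

```python
# Not the Mandelbrot query itself - just a tiny illustration of the
# WITH RECURSIVE mechanism it relies on, run via Python's stdlib sqlite3.
import sqlite3

query = """
WITH RECURSIVE fib(n, a, b) AS (
    SELECT 1, 0, 1
    UNION ALL
    SELECT n + 1, b, a + b FROM fib WHERE n < 10
)
SELECT group_concat(a, ', ') FROM fib;
"""
with sqlite3.connect(":memory:") as conn:
    print(conn.execute(query).fetchone()[0])   # 0, 1, 1, 2, 3, 5, 8, 13, 21, 34
```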
In similar but more baffling "because I can" news: WebAssembly is a bytecode and virtual machine for running code in the browser with speed comparable to natively compiled code. Someone’s decided that it’s a good idea to build a tool that enables running WebAssembly in the kernel.
That’s it…
And that’s it for another week! I hope that collectively you enjoy reading these as much as I enjoy putting them together.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
Jobs Leading Research Computing Teams
Assistant Vice President for Research Computing - University of Texas at San Antonio, San Antonio TX USA
Leads and provides strategic direction to research computing initiatives. Manages specific research computing projects. Works collaboratively with campus stakeholders to create an ecosystem that increases throughput of knowledge development, intellectual property, and external funding through grants.
Director, Research Computing - Saint Louis University, St. Louis MO USA
Provides direct oversight of cloud computing strategy and serves as central conduit for cloud computing services that enable an agile approach to providing technologies for research faculty. Develops and maintains strategic vision and operation road map of research computing services, based on qualitative and quantitative data.
Director Scientific Computing systems - Moffitt Cancer Center, Tampa FL USA
The Director of Scientific Computing will work at the intersection of computer science, genomics, and computational and systems biology to ensure that computing architecture keeps up with the rate of innovation in the research institute.
Technical Manager - AMD Systems, Austin TX USA
Leadership - managerial and technical; technical experience in one or more of: AI/ML, software, architecture, systems, cloud computing, big compute, scientific computing
Chief, Automation and Research Computing - Federal Reserve Board, Washington DC USA
The Chief manages the Automation and Research Computing (ARC) section and its staff, designing and implementing its programs and services with broad guidance from division officers. The section develops and operates a dynamic computing environment optimized for economic research, analysis, and measurement, and made available to staff across the Board’s economics divisions.
Program/Product Manager - Azure High Performance Computing - Microsoft Azure, Atlanta GA USA
"A successful candidate would have some experience/interest in High Performance Computing (HPC) workloads with interest in large scale applications [think supercomputer scale]. It is critical for this candidate to have or to be able to develop long term relationships with stakeholders at key customers."