Watching the research computing community respond to the pandemic has really been remarkable. We’ve seen centres all over the world make their systems available for COVID-19 research, and even individual desktops by the tens of thousands contributing via Folding@home; teams have scrambled to update software, support new users, and install newly freely-available software for those users. We’re seeing a really gratifying flurry of international data-sharing efforts which research computing teams are supporting, reams of truly open data being made available, and rapidly assembled collaborations starting work.
I think it’s worth contrasting the world of research computing with the more general data science community which penned Medium posts ranging from the absurd to the dangerous about how the pandemic would play out; this is the difference between respecting domain expertise and not. But even other areas of research are having to be reminded to maintain some epistemic humility. There have been a number of groups trying to “pivot” to COVID-19 research, sometimes with disastrous results such as a now widely discredited seroprevalence study.
Our roles in supporting research can be frustrating at times - we enable research but often have to take a back seat in the performing of that research. But it gives us a bird’s-eye view of the landscape, and we recognize good work from bad relatively quickly. For those sitting on the COVID sidelines it can be frustrating to see our pet projects reduced to lower priority; but the work we were doing four months ago is just as important as it was then, and pivoting for the sake of short-term relevance to the detriment of important longer-term work we know how to do can only be a mistake.
The reason we can start looking back and taking stock of how the pandemic has played out is that we’re now at the “end of the beginning” of the pandemic. The initial shock and dislocation has struck, and we’ve settled into a temporary new way of doing things - but changes are still coming as decision makers in our own institutions and elsewhere are wrestling with the next phases, and the consequences of what has happened so far.
We’re now well past the stage where we need explainers about how to tweak Zoom settings or the procedures for doing video one-on-ones, but we do need to make sure that our new processes make sense for the long haul.
More grimly, as well-endowed research institutions take a look at what the economy has done to their endowments and governments start looking at future spending, many of our home organizations will start considering significant changes.
In our own careers and throughout our organizations, the next stage of pandemic response and its consequences is going to require a new focus on prioritization, identifying where we can have the most impact, and what gaps need filling to make that impact as effectively and efficiently as possible. That’s the focus of this week’s roundup.
To make this more explicit, I’ve added a new section to the link round up on managing our own careers, as well as helping develop the careers of our teammates. The Jobs section has always been partly aimed at that, but there’s a lot more to developing our own careers than just keeping an eye on job ads; in this section I’ll include information about upskilling as a professional manager and making sure that our contributions gain the recognition they deserve.
On to the roundup!
In the spirit of “the end of the beginning”, I think we have to prepare for the fact that many of us will be working from home - or at least not from our regular office - for a very long time.
Lockdowns will start to relax; in my home province we now have a target amount of spread in the community at which point restrictions will be loosened (with a very close eye kept on those numbers in case they have to be tightened again). But people who can work away from other people are going to be strongly encouraged to do so for a very long time. We may be able to see our team members again in person, which will be good! But spending every day working shoulder-to-shoulder with them remains a long way off.
The conversations Hicken describes having with their team members are very good to have. These sorts of big-picture conversations (“Let’s look over the past [X time]”, “Let’s talk about where the team is going over the next [X time] and what we think your contribution should be during that time”) are badly needed, and should not be happening only once per year. Our team has them quarterly, which might still be a little too seldom (I say that because it’s still A Big Deal to plan for them and I think it’s stressful for the team members, which I don’t like). Monthly might well be good.
But these should absolutely not take the place of frequent, shorter, less structured one-on-ones with your team members. Going an entire month without your team member having a scheduled time where they can share any concerns or anything they’re especially pleased about, and a slot where you can give feedback and coaching, is not a good idea.
I don’t have a lot to add to these - we’re all feeling this right now (whether you use Zoom or something else) and you probably don’t even need to read the articles to list off a number of reasons. Video conferences give you the energy drain of having to be “on” for a meeting without giving you (or at least you extroverts) the energy gain of being around people.
It’s vitally necessary to communicate with your team and outside of your team during this time, but it doesn’t all have to be videoconferences (and it shouldn’t). Now that you’re settled in, it’s a good time to up your team’s asynchronous communications game - sending longer design documents around for comments, talking through code choices in issue tracker discussions, etc. Save the synchronous video chats (which can be very useful!) for where they are necessary.
Maker vs Multiplier - Pat Kua
There’s an old joke about becoming a 10x developer by spending your time helping ten other developers become twice as effective. I think this article is a nice way to distinguish between the contributions of individual contributors and those doing “glue work” like us managers (not that you have to be a manager to be doing multiplier-type work).
There’s one thing I’d add as a preamble to this article. If things have advanced to the point with one of our teammates where we’re going to have the sort of conversation we need to brace ourselves for, it is almost always our fault, at least in part. We didn’t have to let things slide this long. Giving consistent feedback about small things, even if uncomfortable, will allow you to avoid say 80% of Big Talk situations.
But that still leaves 20%. And while these are hard times for everyone to some extent, that doesn’t absolve us from having difficult conversations if we need to. This article presents a good checklist of things to think about before those conversations. In the preparation, Ringer makes two points to consider I think are under-appreciated, especially by technical folks:
We are not disembodied creatures of pure reason, and we lack any special powers to peer into the souls of our team members to perceive their intentions. When preparing for these conversations, we mustn’t work ourselves up by imagining “attitudes” that may or may not exist, nor be oblivious to our own reactions. Focus on behaviours and outcomes wherever possible, and leave unknowable internal state out of it.
The four steps for actually having the conversation - inquiry, acknowledgement, advocacy, and problem solving - are solid. The article is worth a read, as are the references.
Johns Hopkins’ stark economic outlook and planned cutbacks signal what’s to come for Maryland higher education - Liz Bowie and Phil Davis, Baltimore Sun
Those of us who are grant funded are in the somewhat unusual situation of considering that a source of stability rather than precarity these days, at least during the term of the current grants.
A lot of centres rely on core institutional funding. And at research-intensive universities that rely on income from endowments as well as government contributions, that core funding looks to be in jeopardy. I use JHU as an example of recent news, but research-intensive institutions around the world are considering similar measures.
At most centres with institutional funding, there will be other places to make cuts rather than staff — especially if those cuts are driven by reduction in operating funds from endowments and tuition revenue, both expected to rebound relatively quickly. Hardware and equipment will just have to last longer, and staff will have to go through another round of “do more with less”. Even so, these institutional responses to the consequences of the pandemic are things we will have to keep an eye on.
How to Prioritize Your Work When Your Manager Doesn’t - Amy Jen Su, Harvard Business Review
If you’re reading this newsletter, you probably care about managing your team well — but that’s pretty rare in research, which means there’s a better than even chance that you don’t get the kind of communication, feedback, and career development from your manager that you try to provide for your team.
This is a pretty simple article about prioritizing absent that sort of guidance; about trying to find where you and your team have the greatest impact (“provides the most value”) and where you are passionate about the outcomes. Aiming for the impact is a no-brainer; but doing that sustainably, in the absence of external direction, is going to require internal motivation too.
Many of us have highly technical training and so we’re always looking for things to optimize; when we move into working with lots of people we keep doing that, and initially we often find it mystifying that our suggestions for doing something in a clearly better way get turned down!
But change involving people is a lot of work, so change takes a lot of buy-in. And people systems are high-dimensional, so local optimizations may violate constraints we’re unaware of. This article talks a bit about that, and the importance of talking to people to make sure you’re solving the right problem (as above) as well as making sure you’re solving it right.
I’m still sheepish about the fact that when I first learned about “prewiring” proposed solutions or changes it was a revelation. Apparently the idea that I should, you know, talk to people who would be affected before proposing a “better” approach to something was an eye-opener to me not that long ago.
Research Computing and Data Capabilities: A Tool for Assessment and Improvement - Dana Brunson, Claire Mizumoto, Patrick Schmitz, EDUCAUSE
This is really important and relevant work that I was pointed to by a long-time reader; I hadn’t even known this work was going on.
A working group between EDUCAUSE, Internet2, and the Campus Research Computing Consortium has put together a very detailed capability model of research computing in research institutions. The model is clearly of an HPC-type centre at a university, but I think it generalizes beyond that, to general research computing support at a range of institutions in both the private and public sectors, or even multi-institutional collaborations.
The model considers capabilities in five separate domains:
and within each of those domains it covers a number of key capabilities; an institution can rank its strength in those capabilities from 1-5 in either deployment (from not deployed through to deployed institution- or collaboration-wide) or service level (from none, through lights-on, to premium). Researcher-facing capabilities involve outreach, support, and training; data-facing capabilities include long-term storage, data modelling, data lifecycle support, and so on.
This is an excellent resource to use when examining strengths and weaknesses in any research computing support effort; even more narrowly-scoped efforts will be able to use this to suggest areas of growth and need (there’s a column in the worksheet which allows you to weight down areas that are not relevant to your situation).
Do follow up on this, it’s worth your time and can structure your conversation with local decision makers.
Long-Term Planning - Tom Sommer
“For teams that act as enablers — such as my infrastructure team — this can be a bit tricky though. Although we have customers… our impact is really hard to quantify directly.”
Sound familiar? Unlike salespeople (“your numbers are down 10% this quarter!”) research computing staff have a hard time quantifying goals, and the numbers that can be easily measured (number of times your software was run; utilization of the cluster) are meaningless measures of inputs to research, not outcomes.
But that doesn’t mean there can’t be goals we hold ourselves to and we can’t make meaningful plans. It does mean we need to put more thought into it. This short article suggests three steps:
The “Draft a statement from the future” approach, the middle point, is surprisingly effective. Putting aside questions of how you’d get there, deciding what you want the future to look like and then backing out the steps you need from there can work really well. Again, you have to prioritize what actually matters and find the most effective way to get there. There are no shortcuts, but it’s what’s necessary.
Building Documentation Sites with the JAM stack - Brian Rinaldi
This article walks you through setting up a documentation site with an older static site generator, Hugo, but using Netlify to host. Some of the other tools listed, like Docsify or MkDocs, are also of interest.
Scientific machine learning paves way for rapid rocket engine design - Oden Space Institute
Learning Physics-Based Reduced-Order Models for a Single-Injector Combustion Process - Renee Swischuk, Boris Kramer, Cheng Huang and Karen Willcox
This is a pretty compelling combination of machine learning and physical simulation: using ML to generate a reduced model of the combustion/fluid dynamics equations being solved in a particular case (here, a rocket engine in a very specific geometry), but in a way that maintains several physical constraints in the system. This requires deep knowledge both of the set of equations being solved and of machine learning. Reactive fluid dynamics has a long history of empirically derived subgrid models or model reductions, which are related approaches without the ML, so this isn’t a completely unfamiliar approach, just one with a plausible new twist.
The reason such approaches can work and are interesting relies on a bit of physical intuition - these systems are complex enough to be unpredictable (or we wouldn’t need simulations); but they are reproducible enough (start the flame in the same place under the same conditions, you get the same results) that there must be some robustness and redundancy in the system of equations.
Julia and I got off to kind of a rough start. The project had internal holy wars about purity (e.g. a multi-year debate about what the right operator was to use for string concatenation to properly represent the “mathematical structure” of string processing) which led to repeated pointless churn and breaking changes (such as changing the capitalization of basic types like unsigned integers: Unsigned was a modifier of a type, you see, not a type, so it should be the I that was capitalized in UInt).
But it is a super-cool language; the lisp-like nature hidden under modern syntax makes it incredibly powerful for writing DSLs, and product management seems to have stabilized to the point where you can actually count on code to keep working. It’s very strong on numerical code, but works in broader fields too.
One of the things that makes it particularly useful for fast-evolving research code is the approach to polymorphism - multiple dispatch, rather than strong OOP models. This article walks through the approach.
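As a rough illustration of the idea (in Python rather than Julia, and not taken from the linked article), multiple dispatch can be sketched as a registry keyed on the types of all the arguments. The names `register`, `interact`, `Grid`, and `Particles` are hypothetical, and this sketch only matches exact types; Julia’s real method resolution picks the most specific applicable method and is far more capable.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: maps a tuple of argument types to an implementation.
_methods: Dict[Tuple[type, ...], Callable] = {}


def register(*types: type) -> Callable:
    """Register an implementation of `interact` for one combination of argument types."""
    def decorator(func: Callable) -> Callable:
        _methods[types] = func
        return func
    return decorator


def interact(*args):
    """Choose an implementation from the runtime types of all arguments, not just the first."""
    key = tuple(type(arg) for arg in args)
    try:
        method = _methods[key]
    except KeyError:
        names = ", ".join(t.__name__ for t in key)
        raise TypeError(f"no method interact({names})") from None
    return method(*args)


class Grid:
    """Stand-in for a gridded field (illustrative only)."""


class Particles:
    """Stand-in for a particle set (illustrative only)."""


@register(Grid, Grid)
def _grid_grid(a, b):
    return "interpolate between grids"


@register(Grid, Particles)
def _grid_particles(grid, particles):
    return "gather grid values at particle positions"


@register(Particles, Grid)
def _particles_grid(particles, grid):
    return "deposit particles onto grid"


# The same function name resolves differently depending on both argument types.
print(interact(Grid(), Particles()))
print(interact(Particles(), Grid()))
```

Supporting a new data type here means registering new combinations, without editing the existing classes, which is a large part of why the approach suits code whose types keep changing.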
Software and Workflow Development Practices: Update - C. Titus Brown
How to write CRaP Rust Code - Andre Bogus
These are two quite different articles that I’m somewhat unfairly lumping together because they both advocate a pragmatic approach to software development that works quite well in research computing. The first describes the practices of a quite successful research software development lab at UC Davis, led by PI C. Titus Brown; the second describes building correct, readable, and performant (CRaP) idiomatic code in Rust.
In both cases, there’s an emphasis on exploring with code as straightforward as possible, getting the behaviour you want, then nailing that behaviour down with tests and beginning any refactoring or performance enhancements necessary (while maintaining readability). Note that this is different from test-driven development, which doesn’t work so well in research computing (we don’t necessarily know what the right answers are yet!) but still implements tests very early on in the process.
Brown’s article goes further, and is a great case study in what one successful, sustained research software development effort looks like day to day. Highly recommended.
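To make the “explore first, then nail the behaviour down” workflow concrete, here is a minimal sketch in Python. It is not drawn from either article; `moving_average` and the test names are hypothetical. The function is deliberately the simplest thing that works, and the tests pin down the few cases where the right answer is known (a constant signal, a tiny hand-checked case) so that later refactoring or optimization has something to be checked against.

```python
def moving_average(values, window):
    """Exploratory first version: straightforward and readable, no performance tricks."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Tests written once the behaviour looks right, before any refactoring.

def test_constant_signal_is_unchanged():
    # Averaging a constant must return the same constant.
    assert moving_average([2.0] * 5, 3) == [2.0, 2.0, 2.0]


def test_small_case_checked_by_hand():
    assert moving_average([1.0, 2.0, 3.0, 4.0], 2) == [1.5, 2.5, 3.5]


def test_window_of_one_is_identity():
    data = [0.5, 1.5, -2.0]
    assert moving_average(data, 1) == data


def test_output_length():
    # A window of w over n points yields n - w + 1 averages.
    assert len(moving_average(list(range(10)), 4)) == 7


if __name__ == "__main__":
    for test in (
        test_constant_signal_is_unchanged,
        test_small_case_checked_by_hand,
        test_window_of_one_is_identity,
        test_output_length,
    ):
        test()
    print("all checks passed")
```

The tests lean on limiting cases and invariants rather than a full expected result, which is often all that is available when the science is still being worked out.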
The Communicative Value of Using Git Well - Jeremy Kun
I’ve mentioned before several of Chelsea Troy’s articles on code review as a sort of asynchronous pair programming, with the benefits both of better quality code and knowledge transfer. In this article, Kun talks about crafting code changes into meaningful commits and PRs exactly to enhance that communication and knowledge transfer.
SDSC Expanse Supercomputer from Dell Technologies to serve 50,000 Users - Inside HPC
Building Innovative HPC for Massively Mixed Workloads - The Next Platform
As research computing as a whole and even within a field evolves, needs get more and more diverse. Increasingly, on-premises systems are going to have to either specialize in very specific niches — not just by field but by workflow type — or be like the new Expanse system at SDSC, built to support HPC and data and high throughput workflows.
For what it’s worth, I’m skeptical that any individual centre, even one as large as SDSC, will be able to build “one-size-fits-all” systems indefinitely, but it is really interesting to see what they’ve done with Expanse; I’ll be very interested to see how users find it.
SLOs Are the API for Your Engineering Team - Charity Majors
Majors, of observability and “Deploy on Fridays” fame, talks about (internal) service level objectives as a key measure for systems, and to guide decision making. These SLOs don’t have to be for the entire system either (although those should exist too); they can be usefully applied to different parts of the operations to establish boundaries and responsibilities. Her focus is primarily on web services, but this applies equally well to clusters, filesystems, or networks. After discussing the internal benefits of that approach, she also points out the benefits in using such measures when talking with external decision makers.
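For a cluster or filesystem, an internal SLO can be as simple as a target fraction of “good” events plus a running tally of how much of the resulting error budget has been used. The following Python sketch is my own illustration, not code from Majors’ article; the `SLO` class, its method names, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A hypothetical service level objective: a target fraction of good events."""
    name: str
    target: float  # required fraction of good events, strictly between 0 and 1

    def __post_init__(self):
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")

    def error_budget(self, total_events: int) -> float:
        """Number of bad events the objective tolerates over the window."""
        return (1.0 - self.target) * total_events

    def budget_consumed(self, total_events: int, bad_events: int) -> float:
        """Fraction of the error budget used so far; above 1.0 the SLO is missed."""
        if total_events <= 0:
            raise ValueError("total_events must be positive")
        return bad_events / self.error_budget(total_events)


# Illustrative numbers only: 99% of job submissions should be accepted
# by the scheduler within the window being measured.
submission_slo = SLO(name="job submission accepted", target=0.99)
consumed = submission_slo.budget_consumed(total_events=10_000, bad_events=40)
print(f"{submission_slo.name}: {consumed:.0%} of error budget used")
```

A number like “fraction of error budget used” is the kind of measure that works both inside the team, for deciding when to pause risky changes, and in conversations with external decision makers.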
The Power Users Path to Ceph - Guillaume Abrioux and Paul Cuzner, Red Hat
This is a nice introduction to spinning up a first Ceph object store (I know, I know) with containers and ansible (it’s actually not a bad starting point for ansible in general, for that matter). Ceph is extremely flexible and can meet a lot of use cases, and the relatively new ansible deployment has made it really easy to get started with. If you’re looking to test-drive Ceph, this is a great way to start playing with it quickly.
This is a really nice demonstration of high-throughput computing on the cloud, and a great mix of old and new technologies — a single multi-cloud (AWS, Azure, and GCP) deployment of tasks using HTCondor, a job scheduling tool that started as a workstation cycle-scavenger in 1988(!!) and whose development has continued ever since.
There are lots of workloads where individual tasks are very much preemptable, and this single day, 1 EFLOP32, embarrassingly parallel ensemble of simulations for the IceCube antarctic neutrino experiment certainly qualifies. It was run with no special arrangements or EA with the vendors, and yet it worked and was extremely cost-effective due to the use of spot instances.
There’s some great data in here: there’s no checkpointing of these simulations, once they’re cancelled they have to be restarted from scratch, and still there was only 10% wastage of cycles due to preemption. There’s significant variation of run time even within an instance type but apparently that’s expected from the different simulations. They have very good cost-effectiveness numbers for the different instance types for this particular code. I wish they had job-startup latency distributions (especially by cloud vendor!)
IEEE Cluster 2020 - Conference 14-17 Sept 2020, Abstracts due 3 May, Papers due 10 May
This year’s IEEE Cluster conference, which is still aiming to be at least partly physical in Kobe, Japan, is looking for 4- or 10-page papers on HPC/Big Data/cluster computing in four broad topics:
Linux Foundation Open Source Summit - 29 June to 1 July
The schedule is up and registration is open for this virtual meeting. This conference always seemed interesting from a distance but never so compelling that I’d make travel plans for it; this year seems to be a good year to try several meetings like that virtually. The OSS has several tracks relevant to several different types of research computing work:
How Unix Pipes are implemented. I’m always really amazed to go through old Unix system programming books and see how clear and compelling the designs were of fundamental features. (I have a friend who teaches intermediate shell concepts that way.)
SELECT wat FROM sql - SQL is a powerful and mature language, which almost by definition means that its behaviour in some cases is kind of.. well.. WTF.
But bash is like that too. Redirecting output to, e.g. $((i++)).txt works once, but not twice.
A great set of open data on Government of Canada IT projects - 70% of projects under $10M were successful, compared to 35% of those over $100M, consistent with results seen elsewhere. Big-Bang IT projects are almost never a good idea.
A GPU-accelerated terminal emulator, in case you want your monospace fonts rendered faster and don’t care about battery life.
How the pandemic has affected Stack Overflow searches.
Prepare yourself for requests to support R 4.0.0, with reference counting, matrices now supporting array operations, and, for frickin’ finally, stringsAsFactors = FALSE by default.
Deploying hobby projects cheaply with Google Cloud Run.
And that’s it for another week.
Have a great weekend, and good luck in the coming week with your research computing team,
Research Manager, Digital Health - University of Leicester, Leicester UK
You will be responsible for identifying, establishing and running project management structures to oversee and manage research projects. You will provide administrative support to ensure that meetings are arranged, sub-contracts are set up with external research groups, and that research activities are carried out smoothly with milestones in research progress achieved and reported on.
Manager, Research & Early Development (R&ED) IT - Bristol-Myers Squibb, San Diego CA USA
The ideal candidate will provide technical and domain expertise to be the IT application owner for research platforms in Bristol Myers Squibb(BMS) Research and Early Development (R&ED). He or she will be responsible for managing Research IT platform support, and continued enhance technical solutions to meet business demands and innovation activities.
Director of IT – Computing Infrastructure and Technology – AI Research Center - Unknown, Montreal QC CA
You will work in close interaction with more than 400 AI researchers-students working on fascinating research problems, 70 employees and 35 professors. You will also be responsible for the management and evolution of their most important computing infrastructure (HPC) and all of its information systems, including the design, purchase and installation of equipment. (computing clusters), IT operations management, technological developments, user support and network security.
Senior HPC Operations Manager - General Dynamics Information Technology, College Park MD USA
We are looking for individuals to join GDIT’s team to deploy, operate and support leading-edge (Cray Shasta Architecture) technology for WCOSS. Specific technology training will be provided.
Lead HPC Developer - University of Southern California’s Information Sciences Institute, Waltham MA USA
ISI is seeking a Lead HPC Developer interested in helping us develop a shared compute cluster to support our language understanding research. A successful candidate will: Collaborate with technical leadership in the design, development, installation, and maintenance of software for Linux and HPC cluster systems and ensure its scalability and fault-tolerance needs are met.
Director, Research Informatics and Information Technology - Research Institute of the McGill University Health Centre, Montreal QC CA
The incumbent will be reporting to the Director of Administration of he Director of Research Informatics and Information Technology is primarily responsible to oversee the operation of the division and to ensure it aligns with the vision and business objectives of the organization. The incumbent will provide leadership and direction, oversee the acquisition and deployment of cost-effective IT infrastructure solutions and software systems that adequately support and enhance research informatics, information technology, high performance computing and analytics.