RCT #173 - The measuring stick is the best teams, not the absence of work. Plus: 2i2c organization report; New CaRCC Capabilities Model Assessment Tool; EPSCoR CI Recommendations; Racks as the unit of infrastructure; Slurm on Kubernetes

prima facie

        November 12, 2023

        There are a number of technical research support teams where the institution would be better served by disbanding the team and freeing up the money for supporting internal research grant competitions, or better funding other teams, or hiring more staff for a research institute.  In many cases, the only thing stopping exactly that from happening is that it would take a lot of administrative work.  But eventually times get tough and suddenly that work starts looking attractive to decision makers.
One of my many goals in this newsletter is helping encourage people like you, gentle reader, to be ambitious about the impact of your team.
The nice thing about writing a 3,000 word newsletter every other week is that I know that the people who do read really care about supporting research, and prioritize learning about how to do that!
And I care deeply about the impact our teams can have.  It really bugs me when I see teams worrying they’re not good enough (#86).  An article in the roundup below about the importance of cyberinfrastructure for EPSCoR institutions shows the vital significance of the work we’re doing!
But there are other teams that I see who really do lack ambition and focus, and I see the impact in some of the responses I get to articles like last issue’s (#172), and in conversations about having trouble talking to VPR offices.
Here’s the thing.  Our teams’ goals are to advance research and scholarship in our communities and institutions as far as we can, given the constraints we face.  That’s why other teams aren’t the competition; less and worse research is (#142).
But the funding (or subsidization) our teams receive could typically go directly to researchers (sometimes through internal grant competitions, or startup grants, or other competitive allocation, depending on details of the funding source); or it could go to starting or maintaining a research institute/centre; or it could go to other teams.
So I sometimes see teams, especially teams that don’t collect enough testimonials or do enough product work around researchers’ needs, highlight a few positive outcomes from the team, or the occasional positive thing that came out of some specific effort.  They treat that as prima facie evidence that the effort was worth doing, or that the team is a net positive to the institution.  After all, that good thing wouldn’t have happened without them!
This is a dangerously naive baseline level of expectations to measure ourselves against.  It will make conversations with VPRs or funders unnecessarily challenging; it will make any kind of focus or positioning extremely challenging to sustain; it doesn’t put our team in a good light when there is an opportunity for more funding somewhere, or a need for reduced funding somewhere.
The baseline for comparison, the counterfactual we’re measuring against, isn’t “if the team wasn’t here, and the money that would have funded the team disappeared, too.”
The baseline, the counterfactual, is “if the team wasn’t here, and instead the money went directly to some highly productive new researchers” (or staff funding for a new research institute, or another research support team).
Just because doing something led to some positive outcome does not mean it was worth doing, because funding that work had an opportunity cost.  Other things could have been done instead.
So the question is - does our work have more of an impact on the research and scholarship in our community or institution than just giving that money directly to researchers?  Does it generate more papers, or more citations, or more grant funding, or more community knowledge transfer, more trainee expertise, than funding a focused cluster of researchers instead, and letting them buy services or equipment or hire people as they see fit?
Highlighting a couple of positive outcomes isn’t enough to demonstrate that we have that kind of impact, because:

The researchers would have done something in our absence, not nothing.
The researchers could have done even more had they been given an amount of funding equal to what we “spent” on our part of the work.
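To make those two points concrete, here’s a rough back-of-the-envelope sketch in Python.  Every name and number in it is made up purely for illustration (the output unit, the dollar figures, the `Scenario` fields); it isn’t a real measurement method, just the arithmetic of comparing against the right counterfactual rather than against zero.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical figures; 'output' is whatever you track (papers, grants, ...)."""
    with_team: float               # output the researchers achieved with the team's help
    without_team: float            # output they would likely have managed anyway
    direct_output_per_100k: float  # output per $100k given straight to researchers
    team_cost_100k: float          # what the team "spent" on this work, in $100k units


def naive_impact(s: Scenario) -> float:
    # The naive baseline: count everything that happened as the team's impact.
    return s.with_team


def counterfactual_output(s: Scenario) -> float:
    # The researchers would have done something, not nothing, and could have
    # done more with the same money given to them directly.
    return s.without_team + s.direct_output_per_100k * s.team_cost_100k


def net_impact(s: Scenario) -> float:
    # Impact measured against the counterfactual rather than against zero.
    return s.with_team - counterfactual_output(s)


# Illustrative, invented numbers only.
example = Scenario(with_team=12.0, without_team=7.0,
                   direct_output_per_100k=2.0, team_cost_100k=2.0)
print("naive impact:", naive_impact(example))
print("net impact vs. counterfactual:", net_impact(example))
```

With these invented numbers the naive view credits the team with all 12 units of output, while the comparison against the counterfactual leaves a net of 1.  The point isn’t the specific figures, which you’d have to estimate for your own context, but that the second number is the one funders are implicitly weighing.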

Holding ourselves to higher standards (#165), like measuring ourselves against the productivity of highly-productive researchers, matters!  It matters so that our teams can have as much impact on research and scholarship as possible.  It matters for our institutions and research communities; it matters for our team members so that they can see themselves apply their expertise and have an outsized impact.
Having these kinds of standards, holding ourselves to them, empowers us.  It can help make us trusted partners when discussing future opportunities.  It makes it easier to communicate the importance of our teams to funders and decision makers, in language that they understand, showing impact that they care about.
We are 100% capable of this.  We are uniquely qualified to look at our teams’ activities, and investigate whether they are the best and highest-impact things we could be doing.  We are more than competent to make sure our operations are run professionally and effectively (#121), doing Management 201-level work (#137) to continuously learn and improve how we do things.
Thinking this way helps focus our mind, helping us move to the realization that our teams are vendors too (#123), and in particular, hopefully, professional services firms (#127), ensuring that our teams’ deep expertise is applied and delivered in the most needed way (#157).  It can help drive collaborations, improve knowledge exchange between other teams in our institution or community, and better align ourselves with research priorities.
Our institutions, our communities, our researchers, and our team members deserve that of us - that we’re having the highest impact possible for science.

And with that, on to the roundup!

Managing Teams
Over at Manager, PhD last week, I covered:

A rather blunt and full-throated defence of management, decrying the increasing vapidness of language around “leadership”
An article by Molly Graham about focussing effort on high performers, and helping them grow further
An article on under-management, which tends to be a vastly bigger problem on our teams than micromanagement
Setting expectations as a project manager

Technical Leadership
Open organizational report: Strengths and challenges for 2i2c's team - Chris Holdgraf
Really interesting (and admirably transparent!) report published by the 2i2c team, which manages a JupyterHub service for research communities; you might have also heard of them through their work there or with Binder, Pangeo, or Jupyter executable books.
Interest in their products has grown steadily, to the point that their earlier organizational approach was showing strain, and team members were getting burned out.  They contracted with Difference Digital, a consultancy, to talk with the team, make observations, and give recommendations.   The blog post gives a brief overview of the problems, recommendations, and next steps, with a link to the report.
I cannot overstate how much of a gift it is to the community to talk openly about organizational challenges, and to transparently share reports like these.  It helps other teams look more frankly at their own challenges, with the idea that they are fixable with some advice, rather than ignoring them out of shame and trying to power through it.
Difference Digital recommended new positions, and 2i2c is following through with hiring for them:

A Product Manager, to address the fact that there is “no single owner of, or deep experience in, Product”, resulting in inconsistent focus in work and difficulty with pricing.  In my experience, almost all technical research support teams in academia lack this expertise.
A Delivery Manager/Chief of Staff, “responsible for detailed planning and day-to-day management of engineering tasks”, to support the engineering manager so that they can spend more time on strategic and collaborative work, and building processes; the lack of such a role had also contributed to a less-than-needed focus on strategy.
A People Ops type role, to make sure the team members and processes around them are well supported in this very diverse, distributed, asynchronous team.

Other recommendations included:

Reducing single points of responsibility
Better prioritizing and measuring technical work
Build a system to view and manage decisions (one might even recommend a decision lab notebook!)
Experiment with synchronous, informal connections
Reduce stress in the team

I’d really recommend reading the clear report by Difference Digital.  You should treat with extreme suspicion any consultant that buries you in arcane diagrams and buzzword-laden boilerplate advice (hello, academic strategic planning folks, yes I’m talking about you).  The report is clear, bespoke, practical, conversational, and implementable.

Here’s a wiki of resources for doing agile retrospectives - I’ve written before about the benefits of mixing up retrospective formats so they don’t get stale and repetitive (#61, Crittenden’s article on “Snowflake retros”); this resource can help with that.

Research Software Development
Research Software Engineers: Creating a Career Path—and a Career - US-RSE & IEEE Computer Society
A brochure from US-RSE and IEEE describing the RSE role, what’s involved, and its importance, which might be useful to some readers for your own advocacy efforts.  The first third or so covers the history, and the remaining two thirds are pitched at potential RSEs, those who are interested in the still-nascent career path.

Research Computing Systems
New Capabilities Model Assessment Tool Now Available! - Daphne McCanse, CaRCC
As you know, I’m a big fan of the CaRCC capabilities model.   McCanse updates us on the new tool, including a web-based portal which has usability and data analysis advantages over the older Google Sheets approach (you can benchmark against relevant parts of the community!).

Minding the Gap: EPSCoR CI Workshop report released - Daphne McCanse, CaRCC
McCanse also announces the EPSCoR CI Workshop report, “Minding the Gap: Leveraging Cyberinfrastructure to Transform EPSCoR Jurisdictions”, by Bayrd, Jacobs, Schmitz, Strachan, Clemins, & Harris.
The report describes the outcome of a series of workshops on EPSCoR and CI.  It’s typically thoughtful and includes careful consideration of the entire ecosystem, rather than focussing just on technology.  That last should go without saying, of course, and yet we read many other documents from elsewhere in the ecosystem…
The report is heartening for reminding us why we do what we do:

…access to robust CI support can translate to improvements in areas including science scalability, reproducibility, interoperability, research impact, and security, thereby accelerating competitiveness.

I particularly appreciate the discussion of the themes that emerged from the workshops, and the emphasis on alignment so that all the pieces of the CI-powered research mission are pointed in the same direction:

Theme 1: Foundational IT support: Foundational IT must be in place to support modern research.
Theme 2: CI-Research mission alignment:  Administrative, operational, and resource models must align to support the research mission.
Theme 3: Engagement at multiple levels: Institutional CI professionals must be supported to engage the broader community as well as the local research needs.
Theme 4: Workforce development: Experiential learning in CI professional roles must be based on best practices to be an effective Workforce Development Pathway.
Theme 5: CI as human capital: CI professionals with their technical and facilitation/liaison expertise form the key component in CI capital investment.

Long time readers will recognize how strongly your faithful correspondent agrees about the importance of these themes.
It’s also great to see how this work makes use of all the work done on the CaRCC capability model and resulting dataset, while not assuming that dataset is the final word and instead being largely directed and informed by EPSCoR institution research support staff.
The recommendations to NSF EPSCoR are:

Reestablish cyberinfrastructure (CI) as a required Research Infrastructure Improvement (RII) Program core component
Establish an NSF EPSCoR CI Council
Investigate models of CI human resources capacity sharing for EPSCoR
Enhance collaborative partnerships between EPSCoR, the NSF Office of Advanced Cyberinfrastructure (OAC), and Directorate for Technology Innovation and Partnerships (TIP)
Incentivize proposal stage participation by technical/CI staff

And to EPSCoR jurisdictions and Institutions:

Formalize cyberinfrastructure (CI) assessment and planning
Coordinate CI development across Research Infrastructure Improvement (RII)  projects and jurisdictions
Integrate regional network organizations
Align foundational information technology
Measure CI impacts
Communicate the role of CI

The report is clear and relatively short, and worth reading.

Emerging Technologies and Practices
How Oxide Created A Cloud Server by Stripping Components, Wires, Cables, and Chips - HPCWire
Almost 15 years ago, Google wrote about “the datacenter as a computer”, taking a somewhat broader view of what that meant than academic HPC centers typically do (e.g., Borg/Kubernetes as an operating system for a cluster).
I don’t know anything about Oxide’s offerings one way or another.  This approach, however - the idea of the tightly integrated rack as the unit of infrastructure - seems the inevitable end state of vendors and cloud providers increasingly thinking the way Google laid out, and the trends we see in nodes getting fatter, increasing interest in “appliances”, vendor-certified reference architectures, and composable computing.  It’s also very consistent with, for instance, on-prem rack or half-rack configurations that make use of Google Anthos or Azure Stack or AWS Outposts.
It’s useful to explore what this trend would mean for research computing systems if it continues.  It would likely mean reduced administration burden per unit of compute, which is an unambiguous positive; it would also greatly increase the amount necessary to buy in to a cluster under (say) a condo model, which is going to be a business model threat in some cases. (On the other hand, I can imagine teams being happy about not having to constantly be standing up onesie-twosie contributions.)  It also means less flexibility in configurations, which honestly has positives and negatives: less of our time going into design work, which vendors can spend much more time on, but increased vendor lock-in (at least until some kinds of standards develop).

CoreWeave has announced that they’ll be open-sourcing their Slurm-on-Kubernetes implementation, SUNK, early next year; their article gives an overview.   To my mind this has always seemed the most natural way to mix the two different kinds of workloads, long-running services and batch jobs, with any kind of dynamism.  CoreWeave has been operating k8s at scale and supporting clients running both kinds of workloads for some time, so I’m really interested to see how this works out.

Enabling Complex Scientific Applications -  Anne Reinarz, Linus Seelinger
The reason Slurm-on-Kubernetes matters to so many centres, of course, is that as fields mature, researchers want to run increasingly complex and dynamic workflows even with existing, mature simulation workloads.   Here Reinarz and Seelinger describe the challenges in combining two very different applications in a traditional way, and instead approach the problem with a service-oriented architecture, having the jobs run as two services with a client-server style messaging pattern.

Random
LLVM is an interesting target for research-specific DSLs.  This is a nice introduction to the LLVM IR.
Microsoft/Azure has been doing a great job lately of publishing and open-sourcing hands-on tutorials on lots of DS/ML/AI and cloud infrastructure topics.   Here’s a quite recent one on generative AI.
Cursorless, a voice-driven extension to VSCode for editing and writing code.
New and quite rich family of monospace coding fonts just dropped - Monaspace, from GitHub Next.  “Texture healing” makes it look almost proportional.
New (to me) open ebook on Linear Algebra: “Linear Algebra Done Right”, by Sheldon Axler.
An introduction to passkeys by the EFF.
Bad news for that Itanium server you’ve got in the closet there - IA-64 support removed from the Linux 6.7 kernel.
COBOL 2023 is out, and includes asynchronous messaging and transactions support.  Relatedly, SuperBOL Studio is a VSCode + GnuCOBOL + language server COBOL development studio.
std::linalg has finally been approved for C++26.  Apparently the C++ committee has decided to gamble that this whole “linear algebra” fad is here to stay.  Can’t wait to see if they’re right!

That’s it…
And that’s it for another week.  Let me know what you thought, or if you have anything you’d like to share about the newsletter or management.  Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology.  It’s teams, it’s communities, it’s product management - it’s people.  It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly.  But no one teaches us how to be effective managers and leaders in academia.  We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
