Research Computing Teams #102, 27 Nov 2021
Hi!
If you follow HPC twitter at all, you will have seen a heartfelt thread by a well-known research software developer - a key contributor to the Singularity project, among others - lamenting the frankly appalling state of developer productivity in HPC, both in what tools exist and in the support for them (and for other developer tools) at academic centres. A lot of people chimed in on the discussion, including one of the leading developers of the PETSc project, embedded software developers, and some key people at big computing centres. All agreed that there was a problem, but most zoomed in on one particular technical or procedural issue or another without coming to any conclusion.
I think the issue is a lot bigger than HPC software development - it comes up in too many contexts to be about specific technical issues of running CI/CD pipelines on fixed infrastructure. The only people to identify the correct underlying issue, in my opinion, were people from the private sector, such as Brendan Bouffler from AWS:
Too much reliance on ‘free’ labour - postgrads and postdocs, who, invariably, decide that burning their time being mechanical turks for their ‘superiors’ just sucks, so they come and work for us. And since we pay $, we’re not gonna waste them on things software can do.
The same argument got made by R&D research staff in the private sector. Their time actually has value; as a result, it gets valued.
In academic research computing, partly because of low salaries — especially for the endless stream of trainees — but also because we typically provide research computing systems for free, we tend to put roughly zero value on people’s time. If researchers have to jump through absurd hoops to get or renew their accounts, or have to distort their workflows to fit one-size-fits-all clusters and queueing systems, or postdocs have to spend hours of work by hand every month because the tools to automate it would cost $500, well, what do they expect, right?
It’s not that this is an indefensible position to take, but one can’t take this position and act surprised when researchers who can afford to are seriously investigating taking their projects into the commercial cloud even though it costs 2x as much. It turns out that people’s time is worth a lot to them, and is certainly worth some money. If we were to let researchers spend their research computing and data money wherever they pleased, I think we’d find that significantly less than 100% of researchers would use “lowest price possible” as their sole criterion for choosing providers. Certainly core facilities like animal facilities, sequencing centres, and microscopy centres compete on dimensions other than being the cheapest option available.
To be sure, there are process issues in academia which exacerbate the tendency to see people’s time as valueless - rules about capital vs operating costs, for instance - but those rules aren’t a law of nature. If we were paying people in academia what they pay in tech, we’d suddenly discover that finance departments were willing to cut some slack on small capital expenses if it meant we could be a bit more parsimonious with people’s time.
Until then, one also can’t be too surprised when the most talented and ambitious staff or researchers get poached routinely by vendors and the private sector.
Managing Teams
Delegation is a superpower - Caitlin Hudon, Lead Dev
It’s true! But superpowers sometimes take some practice to use effectively.
I personally am pretty good at the mechanics of delegating when I think to do it, but too often find myself taking on a responsibility so as not to bother anyone else with it, or because only I can do it (well, yes, if no one else gets a chance to learn how to, I guess I am the only one who can do it).
The end of the year is a good time to reflect on things you’ve spent time on, to see if they could usefully be given to someone else as a welcome growth opportunity. Hudon encourages us to see tasks as delegatable by default:
- Is this task something that only I am uniquely able to do [because of the role power of my job]? - Ok, fine, I can’t delegate it
- Do I need to perform this task because of a limiting constraint (for example, you’re running out of time or only you understand the context)? - Do it this time, but start laying the groundwork to delegate it next time
- Should I be the person responsible for this task, or is there someone better suited? - Delegate it
Managing Your Own Career
Invisible Output: Measuring the Behind-the-Scenes Work of a People Manager - Samantha Rae Ayoub, Fellow
One of the problems of being a manager is that the work we do is pretty invisible. Our teams’ accomplishments are pretty clear, but the work we do to support them can be hard to see. That means it’s hard to point out to our bosses - but even more importantly, to ourselves - the good work we’re doing, and where we need some support.
Ayoub tells us it doesn’t need to be that way; we can set clear goals, and even some quantifiable ones, for ourselves. We managers just need to:
- set measurable goals for ourselves, not just goals for our teams
- measure outcomes and impacts, not just count inputs or outputs
- be personally measured on overall effectiveness, not just team performance
On goal setting, Ayoub urges us to set goals for ourselves - not just for the output of our team - that focus on overall outcome or impact, whether internal (like staff engagement numbers) or external (more collaborations or partnerships), rather than just input activities. That will help focus our efforts on what matters, not just on activity for its own sake.
Ayoub also tells us to make sure we’re including team member metrics - things like retention and rate of promotion - but those are less relevant in our world, where job tenure is generally quite long.
Managing Up - Lessons From Scaling Teams at Credit Karma and Lyft - Matt Greenberg, Valerie Wagoner, Dor Levi, Anne Lewandowski
Managing upwards isn’t that different from managing our own team members, and it’s very similar to managing relationships with peers and external stakeholders like collaborators - situations where we also lack the ability to be directive.
Greenberg, Wagoner, Levi, and Lewandowski suggest focussing on three areas:
- Aligning your and their goals — making sure you understand their goals so you can find areas that align;
- Developing a strong relationship, and having an empathetic approach to working together — ask lots of questions and don’t make assumptions; and
- Presenting your problems in a way that makes helping easy for them — concisely present the context they need, frame the request in a way that makes it clear what they need to do, and ask for something immediately doable in the short term, or for help getting the problem well enough defined that this becomes possible.
They also talk about three common failure modes that can cause problems when working with your boss, in the short and long term:
- When you don’t realize you’re missing context: so keep asking questions and be aware of your assumptions
- When your timing is off (possibly because of the above lack of context, or just because of mis-prioritization): so keep staying aligned on goals
- When your delivery of a solution or problem is off compared to what they want to hear, perhaps souring your manager on a perfectly good idea: so tailor your communications to the audience (especially around use of data, level of detail, and communications style)
Product Management and Working with Research Communities
The missing millions: Democratizing computation and data to bridge digital divides and increase access to science for underrepresented communities - A. Blatecky et al
Research computing and data are increasingly important for STEM fields, so if we want STEM - and R&D careers - to be available to all, we need to make sure there are as few barriers as possible to being fluent with computing and digital research infrastructure[*], and to having it accessible.
More selfishly - we readers of this newsletter are all pretty familiar with how hard it is to hire in research computing and data. So we have a pretty strong and direct interest in making sure as many people as possible have access to and build skills and understanding in research computing and data technologies.
We know, though, that research computing and data as a community is less diverse than academia or tech as a whole - and those communities are not stellar examples of diversity, equity, and inclusion.
This is a long and frankly somewhat dispiriting read. It’s not great out there. But there are some pretty solid recommendations in this report - and it’s NSF-funded, so there’s real hope the recommendations will be taken up. Recommendations we on the ground can actually engage in include:
- Expand investments in data and software to make sure systems are as accessible and useful as possible
- Expand into fields that haven’t traditionally been computing- and data-heavy, and into areas like distributed and edge technologies, for similar reasons
- Engage in long-term inclusive community building
- Improve racial, gender, and other forms of inclusion
By the way, one finding from this report stuck out, as it’s one of the tenets of this newsletter:
It is computing, and software, and data
I cannot agree with this enough. Focusing on research software development, or systems, or data management, in isolation is a mistake. In 2021, those lines are so blurred as to be meaningless, and it builds silos where there ought to be none.
[*] sorry, but I just hate the NSF term “cyberinfrastructure”.
An IC’s guide to roadmap planning - David Noël-Romas, Increment
A good introduction to product roadmapping, aimed at a software developer IC trying to figure out how to contribute to product planning. It applies just as well to research computing, whether systems, software, or data products:
- Clarify your [product’s] opinions - I like the framing here of “opinions as hypothesis”, and the distinction between the opinions of the product and those of the individual team members. I also like the insistence on having opinions, rather than trying to be all things to every possible user.
- Understand the process - being clear on what the process is for your team, and up to date with what’s going on in different parts of the process (backlog, prioritization, ...)
- Identify stakeholders - customers (users), leadership, team members, and possibly others
- Talk to customers (users)
- Make an ordered list of priorities
- Advocate for underdog ideas
42 things I learned from building a production database - Mahesh Balakrishnan
Relatedly - I really do think R&D at Big Tech, or the work of startups, is pretty close to research computing. It’s certainly a lot closer than University IT! They’re working on things that may or may not work, firming up their understanding of the problem at the same time that they’re developing solutions, and reading (and writing) papers as they go.
Here an academic, Balakrishnan, describes what he learned building a production database (Facebook’s equivalent of Chubby - a central state store which is a key piece of infrastructure that other foundational services depend on). They are classic product development lessons, broken down into categories of customers, project management, design, code review, strategy, observability, and research.
Research Software Development
Go Does Not Need a Java Style GC - Erik Engheim
Engheim gives a good, short overview of why garbage collection is typically much less of an issue in languages like Go or Julia than in Java or even C#, even though all four absolutely do rely on garbage collection. This matters a lot, even in web services - with modern systems typically composed of a large number of small processes, GC pauses can ripple through the system, causing very large amounts of jitter. In desktop technical computing codes, GC pauses can be very bad for performance.
This is a pretty opinionated article, but Java GC is genuinely a harder problem than GC in other languages, and Engheim outlines why. There’s more here than I can comfortably summarize - but in short, Java bet heavily on GC in the early 1990s, allocating essentially everything on the heap and avoiding value types, meaning that even trivial arrays of simple structs become very large numbers of small heap objects.
JVM developers have worked heroically at designing runtimes and garbage collectors to mitigate this issue, but it’s a fundamental issue that isn’t easily walked back (although Project Valhalla is trying to introduce the benefits of value types, which would get things closer to C#). Engheim walks us through generational GCs, bump GCs, compacting GCs, and escape analysis, demonstrating the problems each is trying to solve and the tradeoffs made. Those tradeoffs are different than they were three decades ago - multicore is now the default, for example, so concurrent GC is very much desirable.
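To make the value-types point concrete, here’s a minimal sketch - in plain C++ as a neutral illustration, not code from Engheim’s article - of the difference between an array of values stored inline and an array of references to separately heap-allocated objects, which is roughly what an array of objects looks like to the JVM’s garbage collector today:

```cpp
#include <memory>
#include <vector>

// A small plain-data type, e.g. a 3D coordinate.
struct Point { double x, y, z; };

int main() {
    // Value semantics (what Go, Julia, C#, and C++ give you): one contiguous
    // allocation holding a million Points inline - a single block of memory
    // for an allocator or garbage collector to track.
    std::vector<Point> values(1'000'000);

    // Reference semantics (roughly how an array of objects behaves on the JVM
    // today, pending Project Valhalla): an array of pointers plus a million
    // separate little heap objects, each of which a tracing GC must find,
    // follow, and possibly move.
    std::vector<std::unique_ptr<Point>> references;
    references.reserve(1'000'000);
    for (int i = 0; i < 1'000'000; ++i) {
        references.push_back(std::make_unique<Point>());
    }

    return 0;
}
```

The first layout is one object as far as any memory manager is concerned; the second is a million and one, which is the load Java’s collectors have spent decades being engineered around.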
Research Data Management and Analysis
Sketching for set comparison in bioinformatics - Camille Marchet
In the last few years, bioinformatics has seen a lot of progress on sketching methods for calculating the similarity of large sets of data. In bioinformatics the large sets are sets of strings, and applications include “alignment-free” mapping of reads to references, metagenomic classification, and more; but the problem is more general. If you’re curious about this kind of work - MinHash and the like - in the context of other sketching methods you may have heard of, such as hyperloglog, then Marchet has a nice, short, well-referenced summary.
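If you just want the gist of MinHash before diving into the post, here’s a minimal, hypothetical C++ sketch of the core idea - not Marchet’s code, and skipping the k-mer extraction a real bioinformatics tool would do; the mix, minhash, and estimate_jaccard helpers are just names I’ve made up for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Cheap 64-bit mixer (splitmix64-style), used to derive k hash functions
// from one base hash by xoring in a per-slot seed.
uint64_t mix(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// MinHash signature: for each of k derived hash functions, keep the minimum
// hash value seen over all elements of the set.
std::vector<uint64_t> minhash(const std::vector<std::string>& elements, int k) {
    std::vector<uint64_t> sig(k, UINT64_MAX);
    for (const auto& e : elements) {
        uint64_t h = std::hash<std::string>{}(e);
        for (int i = 0; i < k; ++i)
            sig[i] = std::min(sig[i], mix(h ^ static_cast<uint64_t>(i)));
    }
    return sig;
}

// Two sets' signatures agree in any given slot with probability equal to their
// Jaccard similarity, so the fraction of matching slots estimates it - using
// k small numbers per set instead of the (huge) sets themselves.
double estimate_jaccard(const std::vector<uint64_t>& a,
                        const std::vector<uint64_t>& b) {
    int same = 0;
    for (std::size_t i = 0; i < a.size(); ++i) same += (a[i] == b[i]);
    return static_cast<double>(same) / a.size();
}
```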
Research Computing Systems
What is AMD ROCm? - woachk
In #100 we covered some pretty cool-looking AMD GPUs for compute. So how do you use these beasts? AMD’s revitalized software development ecosystem for their GPU systems is ROCm. Here we have another opinionated - but I think fair - article discussing AMD ROCm, and in particular HIP.
(ROCm also includes OpenCL, which is a known enough quantity now that it doesn’t really need “What is OpenCL” articles. I’ll add that OpenCL is certainly being used productively by many teams, but the fact that AMD chose to add HIP to their supported frameworks suggests that OpenCL may have been something less than widely beloved.)
From the article:
HIP is a wholesale clone of the CUDA APIs, including the driver, runtime and libraries’ APIs. That’s not a bad thing, it acknowledges what the industry standard is, making portability easier.
There are differences - the biggest one is that you more or less literally have to s/cuda/hip/; a tool called hipify helps with this. But the convergence is genuinely good, meaning that software developers really only have to learn one mental model for developing for GPUs, whether NVIDIA or AMD.
The article points to some real limitations of the ROCm/HIP ecosystem to suggest why it’s not taking off. There are a few; none of them are insurmountable, but they do add real friction:
- GPU support is limited to higher-end cards; one huge advantage of CUDA is that you can develop even on tiny NVIDIA cards, including low-end laptop GPUs;
- Even where it is supported, you will likely have to recompile - compiled code isn’t as portable between GPUs; and
- OS support is limited - Linux only
These aren’t insuperable, and in a world where people are using cloud instances or codespaces or the like for development, they may not matter. But an awful lot of research computing software development is still done on local laptops or workstations, where it takes a bit more effort to get into ROCm programming than CUDA programming.
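As a concrete illustration of how mechanical the port usually is, here’s roughly what a hipified SAXPY looks like, with the corresponding CUDA spellings left in comments. The hip* calls are the HIP runtime’s CUDA-mirroring API; the saxpy kernel is just an example of mine, and treat the whole thing as an untested sketch rather than code from the article:

```cpp
#include <hip/hip_runtime.h>   // CUDA version: #include <cuda_runtime.h>

// The kernel itself is identical in CUDA and HIP.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr, *y = nullptr;

    // Host-side API calls keep their CUDA shapes, with cuda* renamed to hip*.
    hipMalloc(reinterpret_cast<void**>(&x), n * sizeof(float));  // was cudaMalloc
    hipMalloc(reinterpret_cast<void**>(&y), n * sizeof(float));  // was cudaMalloc
    hipMemset(x, 0, n * sizeof(float));                          // was cudaMemset
    hipMemset(y, 0, n * sizeof(float));                          // was cudaMemset

    // Recent hipcc accepts the familiar triple-chevron launch syntax; the
    // portable macro alternative is hipLaunchKernelGGL(...).
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);

    hipDeviceSynchronize();   // was cudaDeviceSynchronize
    hipFree(x);               // was cudaFree
    hipFree(y);
    return 0;
}
```

The article’s point stands, though: a port being this mechanical only helps if your card, distro, and toolchain are on ROCm’s supported list.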
Emerging Technologies and Practices
Learning Containers From The Bottom Up - Ivan Velichko
A well-sourced learning-about-containers landing page, with a lot of links to dig into things more deeply.
Calls for Submissions
Minisymposia for Humanities and Social Sciences for Platform for Advanced Scientific Computing (PASC22) - Basel, Switzerland, 27-29 Jun, EoIs due 15 Dec
If you’re interested in arranging minisymposia for social sciences and humanities topics in HPC/scientific computing, PASC 2022 has a special call out with expressions of interest due 15 Dec, with full proposals due 15 Jan. PASC is an ACM SIGHPC sponsored conference and covers HPC and scientific computing topics broadly.
The 14th Workshop on General Purpose Processing using GPU - 12 or 13 Feb, Seoul, papers due 23 Dec
An OG GPGPU workshop, covering everything from compilers and programming languages for GPUs to GPU reliability, applications, containerization, and serverless for GPUs.
RSE Survey - US-RSE and other RSE organizations - due 14 Jan
RSE organizations have put together the 2021 RSE survey - it would be great if those of us in research computing and data broadly who might not consider ourselves primarily software developers could participate. You can complete the 2021 RSE survey in English, French, German or Spanish.
The 17th International Workshop on Automatic Performance Tuning - 3 June, Lyon, part of IPDPS 2022, papers due 31 Jan
Topics include:
- Applying machine learning to autotuning
- Automatic program generation
- Performance analysis and modeling
- Adaptive numerical algorithms and libraries
- Runtime systems
ScaDL 2022: Scalable Deep Learning over Parallel And Distributed Infrastructure - 30 May or 3 June, Lyon, part of IPDPS 2022, Papers due 24 Jan
Topics include:
- Deep learning on cloud platforms, HPC systems, and edge devices
- Communication-Efficient Training of DNNs
- Scalable and distributed graph neural networks, Sampling techniques for graph neural networks
- Federated deep learning, both horizontal and vertical, and its challenges
- Elasticity for deep learning jobs/spot market enablement
Random
Want your awk script to run faster? Transcompile it into Go.
A free undergraduate cryptography textbook - the Joy of Cryptography.
As with good debugging stories, good “dramatic performance improvement” stories almost always make it into the newsletter. Dramatically speeding up spell checking.
Useful tmux configuration examples, including much better vertical/horizontal split chars (| and -), window swapping, and tmux plugins.
Using awk with csv files - yeah yeah, sure, “-F,” , right? Except for all the commas that aren’t delimiters, and… woah, why did none of you ever tell me about csvquote?
Relatedly - latexrun, a LaTeX runner for use within build tools/workflows that handles running LaTeX the needed number of times, cleanup, etc.
Most of us in this game have had to absorb a lot of background information about open source licenses without realizing it, and the differences between contracts, IP, etc… but if you have a new team member who’s less familiar, this is an unusually good and comprehensive primer, written by actual technologist lawyers.
Nice from-scratch introduction to transformers, a recent (2017) deep learning architecture used for e.g. translation.
That’s it…
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
Jobs Leading Research Computing Teams
This week’s new-listing highlights are below; the full listing of 157 jobs is, as ever, available on the job board.
SD4Health Platform Manager - McGill University, Montreal QC CA
The SecureData4Health (hereafter SD4Health) platform will create, initially in Québec and in Ontario, the privacy, software, and computational infrastructure needed for the analysis of the genomic and health data that is being generated within Canadian hospitals and research centres. The SD4Health platform will also develop, implement, and facilitate novel information sharing modalities, where the security and privacy of participant data is paramount, and enable Canada to play a leading role in the challenging but critically important movement towards international health data sharing for research. The SD4Health Platform Manager (PM) is responsible for the implementation and oversight of SD4Health, a secure cloud platform that implements computational frameworks, policies, and infrastructure to securely store, share and analyze genomic and health data.
Sr. R&D Manager - Software Technologies - AMD, Markham ON CA
The Software Technologies Team is responsible for forward-looking research and development for AMD’s future products. Due to growth, we are seeking a senior manager who will lead a small team of high-caliber R&D engineers working on leading-edge technologies in AI, Machine Learning, Advanced Graphics, and Multimedia. You will work under the direction of software architecture leadership and be responsible for strategic technical deliverables for the company. You will oversee onsite and remote team members across global geographic regions. You will have well-established experience managing a technology R&D team in machine learning and/or 3D graphics, working in a fast-paced, multi-faceted collaborative environment. You will demonstrate a strong knowledge of technical program management and all phases of the SDLC, particularly in the areas of managing the Early Concept and Feasibility stages of a Software Engineering Project. You will also have experience with Windows, Linux and ChromeBook based systems and their respective device driver and firmware models.
Software Engineering and Solutions Manager - The University of North Carolina at Chapel Hill, Chapel Hill NC USA
The Software Engineering and Solutions Manager (SESM) is responsible for design, project management, support, and team leadership of projects supporting researchers’ and educators’ use of RENCI software systems. The SESM, in close collaboration with the Director of the Software Architecture Group, will advance RENCI’s mission by providing reliable, responsive capabilities to the research and education communities at UNC Chapel Hill and our partners in federally sponsored research. This includes developing funding proposals, both individually and in collaboration with local and national teams to federal agencies, foundations, and state and local governments supporting research at the convergence of domain sciences, data science, and cloud computing. The SESM will also effectively manage sponsored research awards, and author publications. The SESM will work with faculty and other stakeholders to advance RENCI’s mission in the research and practice of cloud native devops with continuous delivery.
Manager - IT DevOps - National Renewable Energy Laboratory, Golden CO USA
The National Renewable Energy Laboratory (NREL) is a leader in the U.S. Department of Energy’s effort to secure an energy future that is both environmentally and economically sustainable. Information Technology Services seeks a Manager for the DevOps team. Are you a passionate DevOps Manager with deep technical know-how and hands-on experience implementing various practices and tools in the DevOps toolchain? Are you interested in growing a DevOps team while promoting the DevOps culture in our research and IT organizations? In this working manager role, you will lead, manage, and work side by side with a diverse group of smart, innovative, problem-solving technical staff in the deployment and support of complex, high-visibility IT solutions.