Research Computing Teams #124, 4 June 2022
Hi!
I wanted to write more this week about the consequences of research computing teams as vendors, and what that means for strategy - I’ll do that next week.
But there was big news in the HPC world this past week - as you almost certainly already know, the Frontier supercomputer at ORNL is now the world's first exascale system as measured by the HPL benchmark.
This is an enormous technical (and project and product management!) accomplishment, the capstone of work that’s been ongoing since 2006 or so. Colleagues at AMD (which, disclaimer, has long been a worthy competitor of my current employer), HPE, and ORNL deserve all the praise they’ll rightly get. My sympathies to other colleagues at Argonne whose star-crossed Aurora system, facing a series of delays, didn’t make it across the line in time.
And yet, and yet, and yet.
Something can be a massive technical accomplishment and yet not be the solution to the most pressing problems a field has - a distraction, even, from those.
When you ask research institutions what their priorities and needs are, as in the first item in the roundup today, bigger systems don’t top the list. The biggest priorities are for user support, outreach, data engineering, analysis, and visualization expertise, and processes for handling sensitive data. The priorities are people. Us, and our teams.
Even when CaRCC was asking specifically about systems, the top five priorities were data archiving and preservation, interactive computing, commercial cloud, infrastructure automation, and high throughput computing. That's worth highlighting - at the time of the exascale announcement, growing HTC was a higher priority for research institutions in the US than growing HPC.
The push to petascale, then exascale, started in the 90s, and is the culmination of a strong vision of the future rooted in that decade. Everyone involved is to be commended for executing that plan and that vision so well!
But the thing about visions of the future, like 50s sci-fi or concept videos of personal computing from the 80s or 90s, is that they are extrapolations - lines drawn through the past to the (then-)present and extended forward. They're visions of current technology, but more so.
Possible 2020s research computing, as viewed from the 1990s, looked like the same big number crunching tasks, but on a much vaster scale. The same dense linear algebra operations, but much larger. 3D viz caves and grid computing.
But the thing is, over the following thirty years, research computing achieved much greater scope than that cramped vision. It met ambitions completely orthogonal to it, even.
The triumph of research computing and data is that it is essential everywhere, in every field. Digital humanities and computational social sciences, precision health and simulations for urban planning, 3d modelling completely routine for art students.
And that work is vitally important - it is helping drive those fields.
The drive for peta- and then exascale was largely about industrial policy, not scientific policy. It was an attempt to push the technology forward so that the 90s version of 2020s research computing could unfold (preferably while supporting local vendors). It was an enormous push, with a lot of incidental benefits for us and for technology broadly. But it was money spent on an overly narrow vision of the future, a future which thankfully didn’t come to be.
Groups driven by science accomplishments have done things differently. The vast majority of research impact, however you want to measure it (papers, citations, scientific awards and recognitions), relies on research computing - but research computing at the laptop, workstation, or small-cluster scale. Every research computing team with readers on this mailing list has a much higher bang for the buck - a much higher scientific-impact-per-dollar-spent - than the massive exascale-ish labs. And quite possibly higher scientific impact, period. There are research projects that need extreme-scale HPC, but it's an extremely niche use case, and one that research institutions are demonstrably not overly concerned about.
My fondest hope is that with the exascale race now won, the extreme-HPC part of our community can move on to more productive pursuits. No one but Intel seems to believe that the next big thing is the next power of 1,000, “zettascale”. (Intel could not possibly be more obvious in wanting a do-over on exascale).
There'll still be an over-emphasis on "big" computing, sadly, since that suits the needs of vendors and those driving industrial policy. I'm hoping that, at the very least, we'll move on to a benchmark suite based on real tools that researchers actually run.
But I also hope that there's more emphasis on people, and less on stuff. Because research computing is about research, not computing. And research is fundamentally a people-driven activity, propelled by expertise and skill, with equipment added only as necessary. Our teams are awesome, they drive research, and we need to support them well.
With that, on to the roundup!
Managing Teams
2021 RCD CM Community Data Report - Patrick Schmitz et al., CaRCC
I'm a big fan of the CaRCC Research Computing and Data Capabilities model - it's popped up from time to time here since #21. Here they report on the results collected from 51 US institutions in 2021. The aggregate results are interesting - capability coverage and level of maturity vary widely. That makes a lot of sense - different institutions will experience different needs in different orders, and so develop strengths and capabilities differently.
The model breaks capabilities down into five "facings" - researcher, data, software, system, and strategy/policy facing. As the figure in the report shows, software- and data-facing capabilities are generally somewhat less present or mature than researcher-, systems-, or strategy-and-policy-facing ones, but there's such wide scatter across institutions that it's hard to say anything definitive.
What I found especially interesting was the list of top-10 priorities the submitters had in aggregate; in order, they were:
- Researcher facing - introductory user support/training
- Strategy and Policy facing - strategic plan
- Data facing - data engineering & analysis
- Researcher facing - researcher outreach
- Researcher facing - can researcher-facing staff effectively advocate for researcher needs
- Researcher facing - marketing-type activities (my words, not theirs)
- Strategy and Policy facing - sustainability of funding
- System facing - data archive / preservation
- Data facing - data visualization experts
- Data facing - sensitive data
(Do those line up with the priorities at your institution? Drop me a line either way - jonathan@researchcomputingteams.org)
These strike me as pretty mature priorities from fairly sophisticated institutions. As more institutions realize that these teams are vendors, and are no longer in an "if you build it, they will come" situation, outreach and marketing-type activities become important. And as soon as you realize you can't do everything in the shifting and expanding world of supporting research and scholarship with computing and data, strategic planning becomes vital.
But a lot of institutions are much earlier on in their journey, and are looking for a starter version of this assessment - CaRCC now plans to put out an “Essentials” version of the instrument, which is fantastic news.
A couple other notes:
- I’d really love to get a sense for how the capabilities are actually provided on the ground at these institutions - the default up to now has been “provide everything that’s provided at all in-house”, but I think that’s no longer tenable.
- Even though software-facing capabilities are less mature/present, no software-facing issues made the top-10 list of priorities (!!)
Remote Onboarding of Software Developers - Paige Rodeghero, It will Never Work in Theory, Live!
Please Turn Your Cameras On: Remote Onboarding of Software Developers during a Pandemic - Paige Rodeghero, Thomas Zimmermann, Brian Houck and Denae Ford, arXiv:2011.08130
I'll highlight some other talks from this workshop in the Research Software Development section below, but this talk is, I think, relevant for all research computing teams - how to onboard technical staff when we're all working from home.
Getting someone up to speed on the work of a team and on the team itself is always tough, but in this lower-communications-bandwidth world where there’s a lot less in-person interaction, it’s even harder. Doing it well, though, is vital - getting people successful as early as possible is important not only for the team’s effectiveness but for the newcomer’s morale and engagement.
Rodeghero here presents her work based on survey results from 267 new hires at Microsoft in the first year of the pandemic, when processes were still very much in flux. Successful onboarding had several things in common, and from that she makes eight recommendations:
- Cameras on should be the default culture
- Promote proactive communication (e.g. new team member asking for help)
- Schedule 1:1 meetings with all team members
- Explain the Org Chart (and other information about the organization)
- Assign an onboarding buddy & technical mentor
- Support multiple onboarding speeds
- Assign a simple first task
- Provide up-to-date documentation
I think in our smaller teams, we don't generally need to distinguish between the supervisor, technical mentor, and onboarding buddy - any two of those are likely to be the same person - but everything else is very relevant. I think one thing that doesn't come up very much elsewhere is that Rodeghero found new hires really struggled to understand the team dynamics and form bonds with the team when everyone had their cameras off. I think this is easy to forget when our teams have long average tenures and most of us know each other well - we know how to interpret people's voices. But new people joining will really struggle if everyone is just an avatar in a box on a screen.
Product Management and Working with Research Communities
Following chatter about the Helmholtz AI Conference on twitter, this tweet led me to the Helmholtz AI unit's fascinating consulting model - researchers submit proposals for vouchers, which are reviewed (as they come in, it looks like). Those vouchers can be for short (<2 week) "exploration" engagements, where the project is fleshed out with experts, or longer (<6 month) "realization" engagements, where the project is executed on. 169 engagements have been completed so far.
Has anyone worked with Helmholtz AI using this model, or have experience with a similar model? Let me know!
Cool Research Computing Projects
I love computer-powered citizen science projects. This Ars Technica article on the Hubble Asteroid Hunter is a great summary of the possibilities of combining web applications, volunteer labour, and machine learning for science.
“That’s why you needed a sample of them detected by humans,” Kruk said. “What took us a year to classify with the citizen scientists—it took only about 10 hours with the [algorithm]. But you do need the training set.”
Volunteers on the Zooniverse platform went through 37,000 Hubble images to find over 1,000 new asteroids, typically smaller than those previously known.
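The pattern in that quote - volunteers hand-label a sample, then a model trained on those labels sweeps the full archive - generalizes well beyond asteroids. Here's a toy sketch of the workflow with scikit-learn on synthetic data (the features and labels are made up for illustration, not from the actual Hubble pipeline):

```python
# Toy sketch of the citizen-science-then-ML pattern from the quote above:
# humans label a sample, a classifier trained on it handles the rest.
# All data here is synthetic - nothing is from the real Hubble project.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in features extracted from 37,000 images (streak length, brightness, ...)
all_images = rng.normal(size=(37_000, 16))

# Volunteers hand-label a small subset: 1 = asteroid trail, 0 = not
labeled_idx = rng.choice(len(all_images), size=2_000, replace=False)
human_labels = (all_images[labeled_idx, 0] > 1.0).astype(int)  # fake labels

# "You do need the training set": fit on the human-labeled sample...
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(all_images[labeled_idx], human_labels)

# ...then the model classifies the whole archive in hours, not a year
candidates = clf.predict(all_images)
print(f"flagged {candidates.sum()} candidate asteroid trails")
```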
Research Software Development
It Will Never Work In Theory, Live! - Organized by Greg Wilson
At the end of April, Greg Wilson hosted a series of lightning talks focused on actionable results from software development research. There are many worthwhile talks there, covering issues relevant to our field - I'd like to highlight lightning talks (with slides available) on:
- The effect of destructive comments in code reviews,
- The hidden costs and benefits of test-driven development, and
- How code coverage can be used and abused to guide testing
Top Ten suggestions from managing a decade of undergraduate software teams (PDF here) - Weiqi Feng, Mark D. LeBlanc, J. Computing Sciences in Colleges (2019)
One of the defining features of academic research computing and data is that many of the contributors are trainees. This isn’t something you read a lot about handling in articles written for the tech industry or startups!
The authors describe a long-term interdisciplinary project, Lexos, which helps digital humanists analyze their favourite corpus of digitized texts. It will clean up, visualize, and analyze bodies of text. It started off as a set of Perl scripts, and is now a hosted or self-hosted web app with a Python backend and a d3.js-powered JavaScript front end, with releases for Windows, Mac, and Linux, hosted on GitHub. With the group based in a college, most of the trainees available to contribute are undergrads doing summer projects or working part-time during the year.
Feng and LeBlanc offer their suggestions for managing such teams:
- Recruit actively and iteratively - focus on students who can work independently and can lead tasks
- Precede the initial sprint with a bootcamp - this is a really good idea if you’re hiring cohorts like summer students
- Use CI tools, unit testing, and good modular structure, so students can work confidently on specific chunks of the overall product during their short tenure (see the sketch after this list)
- Conduct peer reviews: “Though we do not ask all students to make suggestions to pull requests, we do ask them to read through each and leave comments if they cannot understand what the code does” (great stuff there)
- Meet in daily standups
- Lead toward success: assign easy tasks early
- Improve one tool
- Add new functionalities
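To make the CI-and-unit-testing suggestion above concrete: the kind of test that lets a student safely own one module is small and isolated, like the sketch below. (The strip_punctuation helper is a hypothetical stand-in for illustration, not from the actual Lexos codebase.)

```python
# Hypothetical example of the small, isolated unit tests that let a
# short-tenure undergrad work confidently on one module; run with pytest.
# strip_punctuation is an illustrative stand-in, not real Lexos code.
import string

def strip_punctuation(text: str) -> str:
    """Remove punctuation before tokenizing - one small, testable chunk."""
    return text.translate(str.maketrans("", "", string.punctuation))

def test_removes_commas_and_periods():
    assert (strip_punctuation("Whan that Aprill, with his shoures soote.")
            == "Whan that Aprill with his shoures soote")

def test_leaves_clean_text_alone():
    assert strip_punctuation("already clean") == "already clean"
```

Tests like these double as documentation for the next cohort, which matters when much of the team turns over every summer.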
This approach provides real professional development for students, while advancing the group’s product. Some of that professional development is hard - the authors call out challenges and tensions around peer reviews in particular - but those are skills that are important for the students to learn.
The authors may undersell the level of project management skills they clearly demonstrate here. Maintaining a long list of well-scoped discrete tasks that can be handed out to short-term undergraduates is no small thing!
Research Data Management and Analysis
Time Series and FoundationDB: Millions of writes/s and 10x compression in under 2,000 lines of Go - Richard Artoul
The wide range of open-source databases available now means that it’s increasingly possible to choose something which meets the right tradeoffs for your particular use case. And if not, there’s likely something close you can build on top of.
The README for this repo is basically a blog post on using FoundationDB (Apple's open-source distributed key-value store) to build a distributed time-series database that met his needs. It's a great overview of why you can't just use Postgres, how to implement compression, and how to do secondary indexing.
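The compression section rewards a careful read. The standard trick in time-series compression, popularized by Facebook's Gorilla paper, is to store deltas-of-deltas of timestamps - for regularly sampled data they're almost all zero and pack into very few bits. A toy sketch of that idea (in Python; the repo itself is Go and bit-packs far more aggressively):

```python
# Toy delta-of-delta encoding, the core idea behind Gorilla-style timestamp
# compression: regularly spaced timestamps turn into runs of zeros, which
# bit-pack to almost nothing. Illustrative only - the repo does this in Go.

def encode(timestamps: list[int]) -> tuple[int, int, list[int]]:
    """Return (first timestamp, first delta, deltas-of-deltas)."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def decode(first: int, first_delta: int, dods: list[int]) -> list[int]:
    out, delta = [first, first + first_delta], first_delta
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out

# Samples every 10s with one second of jitter: the deltas-of-deltas are
# almost all zero, which is what makes 10x compression plausible.
ts = [1000, 1010, 1020, 1031, 1041]
first, fd, dods = encode(ts)
print(dods)                       # [0, 1, -1]
assert decode(first, fd, dods) == ts
```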
Research Computing Systems
TACC Adds Details to Vision for Leadership-Class Computing Facility - Aaron Dubrow
Atos pushes out HPC cloud services based on Nimbix tech - Dan Robinson
Dubrow’s article gives us details about TACC’s upcoming system and facility, but the part I want to highlight is this:
TACC announced as part of its preliminary design presentation that the LCCF advanced computing system will likely be hosted at a Switch commercial datacenter under construction on the Dell Round Rock campus, 10 miles north of TACC.
The culture of parts of research computing, especially HPC, was formed in the 90s and earlier, when research computing was a highly specialized niche resource. But the success, importance, and maturity of research computing and data is such that it's now everywhere, it's necessary for most research, and most needs for it are thankfully technically routine. Complex and important and requiring expertise and onboarding, to be sure, but routine.
Even though the TACC leadership system (“Horizon”) will presumably be quite big and will be able to HPL very hard indeed, its size and scale will be consistent with what commercial data centre operators can handle. And those data centre operators run dozens or more data centres, hosting hundreds or thousands of systems, full time - that is their one job. They are very good at it, with an efficiency and effectiveness that comes from that much experience.
Most research institutes are not located in a place where space and power are cheap. There's an opportunity cost to using campus or institute-owned space for machine rooms instead of labs or classrooms, in addition to the economic cost. We'll see much more use of commercial datacenters in the future, and it'll open up yet another front in the (interminable and mostly uninteresting) cloud vs on-prem debate.
Relatedly, Atos, the European vendor of HPC systems, will now, like HPE, also offer its systems as a service.
It’s fascinating to go through Facebook’s logbook for a hero run - in this case a recent large pre-trained text model. Nodes going down, bad commits, people forgetting to change parameters, an ssh server being taken out because it was running on someone’s instance… super interesting, and very relatable! I also really respect the fact that they kept such a careful log and released it with the model.
Life and leaving NERSC - Glenn K. Lockwood
A long-time HPCer and well-known high-performance storage expert, Lockwood muses on NERSC (by all accounts a great place to work), the recent history of HPC, and why he's moving to Microsoft Azure:
A more innovative approach is to start thinking about how to build a system that does more than just run batch jobs. […] Such a “more than just batch jobs” supercomputer actually already exists. It’s called the cloud, and it’s far, far ahead of where state-of-the-art large-scale HPC is today–it pioneered the idea of providing an integrated platform where you can twist the infrastructure and its services to exactly fit what you want to get done. Triggering data analysis based on the arrival of new data has been around for the better part of a decade in the form of serverless computing frameworks like Azure Functions. If you need to run a Jupyter notebook on a server that has a beefy GPU on it, just pop a few quarters into your favorite cloud provider. And if you don’t even want to worry about what infrastructure you need to make your Jupyter-based machine learning workload go fast, the cloud providers all have integrated machine learning development environments that hide all of the underlying infrastructure.
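That "trigger analysis when new data arrives" pattern is easy to see concretely. Here's a minimal sketch using Azure Functions' Python programming model - the storage path and the analysis step are hypothetical stand-ins, not anything from Lockwood's post:

```python
# Minimal sketch of "trigger data analysis on the arrival of new data"
# with Azure Functions' Python (v2) programming model. The container path
# and the analysis itself are hypothetical stand-ins.
import logging
import azure.functions as func

app = func.FunctionApp()

# Fires whenever a new blob lands in the (hypothetical) instrument-data container
@app.blob_trigger(arg_name="newdata",
                  path="instrument-data/{name}",
                  connection="AzureWebJobsStorage")
def analyze_on_arrival(newdata: func.InputStream):
    logging.info("new dataset: %s (%d bytes)", newdata.name, newdata.length)
    payload = newdata.read()
    # ...kick off the real analysis here: queue a container job, a notebook, etc.
```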
Emerging Technologies and Practices
Some really interesting announcements coming out of ISC this past week:
Red Hat Joins Forces with DOE Laboratories - HPC Wire
For things that (say) Slurm is good at, it’s objectively much better than Kubernetes. But as soon as you don’t fit neatly into the batch scheduler’s model of “N nodes for M hours” requests, it all turns to weeping and gnashing of teeth.
There was always going to be a winning resource manager for more complex workflows. I would not have bet on Kubernetes — it’s fiendishly complex, both over- and under-powered for research computing needs, and still a moving target — yet here we are. It’s standard, widely deployed, and has a fairly stable programmable interface, and most importantly it’s what we have.
This press release announces that Sandia is going to work with Red Hat on Kubernetes (Red Hat has a widely used commercial distribution of k8s, OpenShift) for more complex workflows at extreme scale, connecting next-gen batch schedulers like Flux to the Kubernetes world. And both Sandia and NERSC will work on improving the rootless container ecosystem.
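To make the contrast with the batch model concrete, here's roughly what an "N tasks, run to completion" request looks like against the Kubernetes API, via the official Python client - the image, names, and resource numbers are all illustrative:

```python
# A batch-style "run these tasks to completion" request expressed against
# the Kubernetes API with the official Python client - roughly the moral
# equivalent of an sbatch script. Image, names, and sizes are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="demo-sim"),
    spec=client.V1JobSpec(
        completions=4,   # where Slurm would say "N nodes"...
        parallelism=4,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="sim",
                    image="example.org/sim:latest",            # hypothetical
                    command=["./run_simulation", "--steps", "1000"],
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "8", "memory": "32Gi"},
                    ),
                )],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="research", body=job)
```

It's wordier than an sbatch script, to be sure - but the same API also expresses services, event triggers, autoscaling, and everything else a more-than-batch workflow needs.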
GigaIO Announces Series of Composability Appliances Powered by AMD - HPC Wire
Composable computing is actually something you can order by the rack now - AMD + GigaIO are selling racks with EPYC chips and MI210 accelerators, where rack-level resources can be merged into "nodes" for particular jobs while the system is live. Very cool!
Random
A cheap and reliable and widely-used German credit card terminal recently reached the point where it would no longer be supported - and then a certificate or key or something expired, producing mild credit-card chaos in Germany.
A computer-music language for your browser - glicol.
How fast can you push data through Linux pipes if you try hard enough? Pretty fast - up to 65 GB/s.
How fast is the typical malloc/free? With a good library, median times can be about 25ns - if you're playing DOOM3, anyway.
How fast can you push a quicksort implementation? Google's open-sourcing a library making full (and architecture-independent) use of SIMD and multicore that can sort 16- through 128-bit integers at speeds 9-19x faster than std::sort.
Digging into information theory, optimal strategy, and wordle.
Implementing Forth from scratch, where by “from scratch” I mean “first, design an instruction set”.
The first Lisp compiler.
Telefork a process onto another machine.
Unittests and mocks for bash scripts.
The unreasonable effectiveness of “have you tried turning it off, then back on again”? (Honestly, why do we ever even turn these blasted machines on in the first place?)
That’s it…
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
Jobs Leading Research Computing Teams
This week’s new-listing highlights are below; the full listing of 146 jobs is, as ever, available on the job board.
R&D Manager / Tech Lead - Synopsys, Aachen or Berlin or Stuttgart DE or Livingston UK
You are passionate about high performance simulation solutions? You want to create software tools that enable our automotive and telecommunications customers to develop embedded software for their next generation of self-driving cars, mobile devices or virtual reality applications? You burn for robust, high quality and flexible software architectures? As a tech lead you are responsible to define and drive the development of our testing tool and simulation technology in a highly competitive market. You will work in a team of high professionals and help to define the next generation of Virtual Prototypes.
Manager, Research Infrastructure and Development - Canadian Research Knowledge Network, Ottawa ON CA
The Manager is the lead product owner of the Canadiana platform which includes the Canadiana Trustworthy Digital Repository (TDR) and application (access) platform. The Canadiana platform is home to over 60 million pages of digitized documentary heritage and is a critical part of providing access to and preservation of Canadian cultural heritage material. The Manager leads a team of full-stack software developers and system administrators in building tools and features that enhance the platform for users and CRKN members, and oversees an innovative and agile technical development roadmap that is forward-looking, transparent, and prioritized by members and user needs.
Computing Infrastructure Manager - University of Edinburgh, Edinburgh UK
The School of Informatics is the UK’s biggest, world-leading centre for teaching and cutting-edge research in computing, artificial intelligence and data science. Working within the School’s computing team (~25 FTE), the post holder will be responsible for planning and maintaining the School’s data network (230 network switches supporting 7000 active data ports, 70 individual subnets, IPv4 and IPv6) and data centre infrastructure (5 server rooms hosting 400 physical servers)
Senior Data Scientist / Manager, Advanced Insights - Johnson & Johnson, Chesterbrook PA or Titusville or Raritain NJ USA or Beerse BE
Sr Data Scientist/Manager, Advanced Insights will be responsible for translating analytics and generating insights from data with high level of autonomy. The incumbent will lead data science projects developed by Integrated Clinical Operation Analytics (ICOA) group.
Research Imaging Data Manager - University Hospital Southampton NHS Foundation Trust, Southampton UK
As a result of increasing imaging research activity at UHS and the University of Southampton, we are looking for an experienced Data Manager to join our growing Scientific Computing Team within the Imaging Physics Group. The successful applicant will work with clinical and research stakeholders to develop and manage systems that ensure medical imaging data is available for research purposes in a timely and efficient manner, consistent with Information Governance and Good Clinical Practice. Applicants will be expected to possess a relevant degree in physics or computing, or a degree in a relevant scientific discipline with demonstrated abilities in informatics and computing.
Associate Director - Data Management - Australian National University, Canberra AU
National Computational Infrastructure (NCI) is Australia’s leading national provider of high-end computational and data-intensive services, with a well-respected reputation for its services, expertise and innovation. The Associate Director – Data Management is responsible for developing technical and data management which implements NCI’s national data management and delivery strategies, including delivering a high quality and responsive service to the satisfaction and benefit of NCI’s community, and support the achievement of the NCI’s strategic goals
Research Associate, Project Data Manager - Asimov, Boston MA USA
We’re hiring a full-time Research Associate / Project Data Manager to join our team in Boston, MA. The ideal candidate will bring experience in molecular biology, data management, and project management together to shepherd the development of new tools for synthetic biology. Working closely with Asimov’s world-class synthetic biology and high-throughput experimentation teams, this is a unique opportunity to work at a nimble, forward-thinking synthetic biology startup and help build the future of biological engineering.
Senior Research Software Engineer, Center for Advanced Research Computing - University College London, London UK
At any grade you will design, extend, refactor, and maintain scientific software in all subject areas, providing expert software engineering consulting services to world-leading research teams, training researchers in programming best practices, and working with scientists and scholars to build software to meet new research challenges. With such a varied job, we don’t expect our candidates to know it all from the start, but to make the most of the opportunity to develop new skills, spending time to study both the research areas we support and the specialist technologies applied in research IT. The Senior Research Software Developer will take on a leadership role within the group, either technically or managerially, helping to guide the vision for this strategically important area for UCL. You may lead the technical design for complex projects, manage research programming projects, and/or mentor and supervise other group members.
Manager, Research and Cloud Technology - UMass Chan Medical School, Worcester MA USA
Under the general direction of the Associate Chief Information Officer or designee, the Manager, IT Research Technology is responsible for the planning, design, review, programming and implementation of institutional software and biomedical big data solutions at the University of Massachusetts Medical School for advancing research and translation of research findings to clinics. This includes designing, developing and implementing various solutions for capturing and integrating heterogeneous biomedical research data. In addition, he/she will serve as product owner, scientific consultant, and instructor to the Medical School faculty and staff. Perform diverse and complex duties in a manner consistent with a dynamic and active biomedical education and research community. This position is Hybrid.
Research Computing Team Lead - University of Vermont, Burlington VT USA
This Team Lead position has responsibility for supervising and managing a portion of the Systems Architecture & Administration (SAA) department. This position is part of the SAA leadership team, with a focus on Research Computing services. This includes the Vermont Advanced Computing Center (VACC) clusters, as well as ETS’ research computing offerings such as research storage, virtualization platform for researcher VMs, and related technologies. This team lead is also responsible for technologies that support data collaboration (such as Globus) and tools that facilitate research data management plans. This team lead will also be responsible for evaluations of public cloud systems to augment our on-premises services for researchers.
Research Computing, Data Security and Compliance Manager, Center for HPC - University of Utah, Salt Lake City UT USA
This position will be the lead in planning, directing, and managing the information security posture for CHPC at the University of Utah. The CHPC resources comprise HPC Clusters, Virtual machine deployments, storage, and other systems. The applicant will work with the CHPC team to ensure that services operate in a manner consistent with U of U policy 4-004 and comply within the targeted scope (security zones). Depending on the scope, compliance must meet one or more of the following: HIPAA, FISMA Moderate, ITAR, CMMC 2.0, NIST 800-171 rev2, CUI. The applicant will work in day-to-day operations to improve security posture, analyze threats, develop counter measures, and advise department security policies and procedures. Additional daily operations may include installing new software releases and system upgrades, evaluating, and installing patches, and resolving system related problems. The applicant will also monitor system configuration and data files to ensure data integrity, system integrity and compliance.
Research Computing Program Manager, Foundations for Research Computing - Columbia University, New York NY USA
The Research Computing Program Manager will lead the activities of the Foundations for Research Computing (FORC) program. The aim of the program is to train Columbia researchers in computational skills and overall computational literacy. As part of the Columbia University Libraries’ Digital Scholarship unit, the Program Manager will advance the program and special events for researchers in close cooperation with other colleagues in the Libraries, Columbia University Information Technology, and the Office of the Executive Vice President for Research.