Research Computing Teams #126, 18 June 2022
I write a lot here on the challenges that research computing and data teams face, and the challenges that their managers and leaders face in particular. And that’s because I want them to have fewer challenges! I hope that by highlighting the challenges they at least don’t face them unawares, and that as they grow their skills they bring along the next generation of managers and leaders under clearer, less stressful, and more supported environments than we found ourselves in.
Which is all well and good, but it can get a little gloomy in here sometimes. I don’t think I spend enough time highlighting successes.
Do you have management or leadership wins you want to share, no matter how small? Just with me, or with the community (anonymously or otherwise)? Let me know - just hit reply and send me an email at firstname.lastname@example.org. I think it would be good for everyone here to see what some of those wins look like.
(And let me know if you’d like to chat about those wins, or if you have some challenges you’d like to talk out - always happy to chat with readers.)
I had a really good discussion earlier this week about support with strategy and the importance of having it aligned with your stakeholders. They gently reminded me that it isn’t everywhere as dire as I made it sound. In a lot of parts of our institutions, particularly if you’re closer to reporting to the CIO or VPO of your research institute than the VPR, there probably is a fair amount of clarity about what good looks like, and what’s next! Even on the VPR side, there may be a lot of support and enthusiasm for discussions about priorities, even if it may be up to you to initiate those conversations.
One of the fantastic and exciting things about research computing and data is that it’s getting unimaginably broad. Which is great! But the point is coming - is already here in some parts of our community - where being a generalist isn’t feasible. We who go into this line of work want to help everyone, but we’re past the one-size-fits-all stage of research computing, and we can’t do everything. That means making decisions, and is why I’m always hammering on topics like strategy and specialization. Making sure your team is pointed in the same direction as your leadership not only takes the weight of those decisions somewhat off your shoulders, but also means you’re supporting efforts the institution is trying to encourage.
Anyway, let’s get straight to it - on with the roundup:
A lot of us begin our technical leadership/management careers by growing with a team - we start off as effectively the senior sysadmin/data scientist/software developer on a team of two or three, and then the team grows. While numbers vary a bit based on the work of the team, below is a pretty typical trajectory with team size:
- 2-3: You’re mostly a team lead, supervising and directing technical work
- 4-6: You’re becoming a full-time manager; the team has to be collectively more technically independent day-to-day
- 7-9: Getting pretty big: you’re starting to need to identify technical leads for sub-teams
- 10+: Sub-teams require increasing independence; another manager is going to be needed soon
Obviously, this being the research world, no one talks us through any of that, or the different skills needed at each level.
One big reason why this is a problem is that not only do our skills need to be growing during this evolution, so do our team members’; at each step of this path the team is taking on a greater level of responsibility. Even at that 2-3 person team size, they’d benefit from us coaching them to take on new responsibilities as the team grows to the point where we’re full time managers; and then you need to identify and grow team leads, and eventually find a peer manager.
This blog post from Lighthouse covers things we’ve talked about before, but it’s a good overview of the process above, covering:
- Let go - you’re going to have to give up responsibilities
- Look for the leaders already on your team - who’s doing glue work already?
- (Assuming they’re interested) give them opportunities and set them up for success
- Coach and develop them
- Once they are managing people, meet with your new skips occasionally
Respectful, healthy conflict is fine, even good, for a team - it’s essential, even, in the storming phase of the team formation cycle. But it does have to be managed, or at least monitored, and acknowledged. There are some points here to work through if the conflict becomes A Big Thing, which needn’t be the case. But even for more quotidian cases, there’s a useful breakdown of types of conflict which are worth keeping in mind (especially since conflict can “look” like one when it’s really another)
- Task Conflict – What needs to be done?
- Process Conflict – How does it need to be done?
- Status Conflict – Who needs to do it?
- Relationship Conflict – When it’s getting personal
And the fact that explicitly calling the conflict out is the first step in moving to a resolution.
Not My Job - Silvia Botros
As a junior IC, one has the comfort of knowing exactly what one’s job is. As the scope of one’s responsibility grows, that stops being true - there’s a lot more to potentially be responsible for. But that doesn’t mean we’re responsible for everything and anything! Botros writes to push back against well-intentioned but dangerously un-nuanced points of view like ‘There is no “that’s my job” anymore’.
She makes an excellent and useful distinction between glue work and “gap filling”. Glue work is something all managers and technical leads genuinely are responsible for; being the connective tissue which stitches together individuals and teams and external stakeholders to make progress to a common goal. But sometimes there are gaps - missing expertise, absent leadership, no one to take on needed work - and we can’t selflessly fling ourselves into every such breach. That way leads to burnout for us, while the underlying problem remains unfixed.
Botros counsels being vigilant to avoid this, and to stay in contact with your manager and leadership to make sure the work you’re doing lines up with what your organization needs most:
Assessing whether what you are doing day to day needs to be an intentional process, something you and your manager re-assess routinely and compare to your goals and the organization goals. Be very aware of being pulled into projects with no measurable milestones. […]. You should use your experience and influence to shed a light on gaps and risks. But you cannot fix them all.
In our line of work I think this article and its recommendations are important not just at the individual level, but for whole teams. We’re doing our work because we want to advance science. There’s a noble but foolhardy tendency for research computing and data teams to want to jump in and solve any researcher’s problem even tangentially connected to our remit. But we can’t fill every gap. We advance science best by focussing on where we excel, doing our best work there, identifying gaps elsewhere, and then either flagging them to leadership or connecting researchers to other teams elsewhere. We can’t, and shouldn’t, do everything.
Product Management and Working with Research Communities
June 2022 database release - Nathan Benaich, Spinout.fyi
Interesting database from spinout.fyi of University spinouts, slightly over half of which were software-based, and so may be of interest to this community. Typical time to spinout was 9-12 months, although some Universities seem to have extremely capable departments for this and get it done in 3 months or less. Lots of breakdowns on negotiated royalties, equity, and the like. Generally people who went through the standard University processes were fairly unhappy with the results.
Story in Nature about how it’s increasingly difficult to recruit postdocs:
“This year is hard for me to wrestle with: … we received absolutely zero response from our posting,” one wrote. “The number of applications is 10 times less than 2018-2019,” another wrote.
Is this consistent with what your team is seeing with your researcher clients?
Cool Research Computing Projects
A couple of great biology projects to highlight this week:
The venerable UniProt project maintains comprehensive protein sequence and function information. There’s lots of linked information there - the sequences, genes, external data sets, functional information, co-expression information, etc. The result is a graph of knowledge, and the SPARQL query interface now queries 105.6 billion RDF triples, “the largest graph database free to query on the web”.
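For anyone who hasn’t poked at a SPARQL interface before, here is a minimal sketch of what preparing a query against it might look like. The endpoint URL and the `up:` vocabulary prefix are my assumptions about UniProt’s public service rather than anything stated above, and the helper name `build_request` is made up for illustration. The code only builds the request; it doesn’t send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: UniProt's public SPARQL endpoint and core vocabulary prefix.
ENDPOINT = "https://sparql.uniprot.org/sparql"

# A deliberately tiny query: ask for five things typed as proteins.
QUERY = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein
WHERE { ?protein a up:Protein . }
LIMIT 5
""".strip()


def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a GET request per the SPARQL protocol."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format via content negotiation.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(QUERY)
```

Passing `request` to `urllib.request.urlopen` would actually run the query; I’ve left that out so the sketch works offline.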
Elsewhere, there’s an article in the most recent Nature Scientific Data by Nishimura & Yoshizawa describing a catalogue with over 50,000 prokaryotic (bacterial, etc) genomes from various marine environments (including low-oxygen deep-water habitats and polar regions), spanning 8,466 species clusters in 59 phyla. The data is publicly available.
Both of these data resources are going to be highly valuable, and are cutting edge in different ways, while being based on an enormous amount of data collection and curation work. And both of them required data, software development, and systems work. The highest impact projects are going to be those that span software, systems, and data, which is why avoiding the accidental construction of silo walls between expertise in those areas is so important.
Research Software Development
One thing I like about this set of guidelines is that it’s not one-size-fits-all; there are three categories of software with increasing requirements. Most software has a brief and lonely existence, and that’s likely doubly true for research software. For one-off software, having it be publicly downloadable somewhere with a clear license is likely more than enough. Ramping up the requirements of software management plans as it becomes more vital to an effort - and, crucially, starts being used by others - makes sense. If anything here I’d say that ramp-up isn’t sharp enough.
Research Data Management and Analysis
Common DB schema change mistakes - Nikolay Samokhvalov, Postgres AI
Samokhvalov, a veteran of over 1,000 database changes large and small, has an outline (and associated slide deck) of 18 common mistakes. He breaks them into three categories - concurrency-related mistakes when doing the migration, correctness mistakes in applying the changes, and some miscellaneous issues. For Postgres specifically, his company has a tool with an open-source community version to quickly clone Postgres DBs for testing migration steps, which seems like it could be very handy.
Research Computing Systems
Now that exascale is finally done with, what’s next? While research cyberinfrastructure has been sort of hijacked to mean modest numbers of large computers, Reed muses here thoughtfully about the scientific opportunities at the edge - lots of intelligent sensors with modest compute. Handling these problems is more technically interesting and scientifically motivated than just building the same systems over and over again but larger.
Intel’s Clear Linux Outpacing Ubuntu 22.04 LTS, Fedora 36 & Other H1’2022 Distros - Michael Larabel, Phoronix
In research computing we tend to recompile a bunch of our stacks anyway, but it’s interesting to see how big a performance difference there still is between Linux distros, including on absolutely fundamental things like networking as well as codes we care about (LAMMPS!). Here Larabel runs a number of benchmarks against current out-of-the-box Linux distros, and Clear Linux comes out clearly ahead (interestingly, CentOS Stream also does very well).
Declare early, declare often: why you shouldn’t hesitate to raise an incident - Isaac Seymour, incident.io
Oops, That Almost Happened - Vanessa Huerta Granda, Jeli
If something hurts, it might be because you’re not doing it often enough. Incident reporting is an important enough skill, and produces useful enough documents (for your team’s ongoing use and for your clients), that it’s worth going through, even for minor things, as Seymour says. And Huerta Granda even encourages writing up near misses, incidents that didn’t occur.
I think we’ll see more instances of agreements like this - modest sized purchases of locally provided specialized cloud services for very specific purposes - here, for grad students and courses in computer vision and machine learning. The use for course work is particularly interesting, because the teaching part of a University’s mission requires a lot more of a professional IT approach to system stability than the usual research computing requirements can support.
Emerging Technologies and Practices
Chip Roadmaps Unfold, Crisscrossing and Interconnecting, at AMD - Timothy Prickett Morgan, The Next Platform
Morgan here goes through the recent AMD roadmap updates and does a typically thorough job of putting them in context. Between AMD’s existing on-chip fabric, the upcoming Zen x86 cores, the intention to have “APUs” combining x86 cores and GPUs on-die, and the “adaptive silicon” of the FPGA purchase, there’s a lot going on!
The complexity of upcoming CPUs - and we’ve seen this a bit from Intel too - means, I think, we’re reaching the end of one-size-fits-most “general purpose” high-end compute CPUs, and systems. (Exascale was supposed to get us there with “co-design”, but like so much of the exascale project, that was a disappointment). We’ve seen hints of it with some workloads making use of GPUs and some not, and the widely varying I/O requirements of simulation-heavy vs data-heavy workloads; but up to this point, a lot of research computing teams have been able to just buy the current generation of two-socket high-end servers and then just decide about GPUs or no and storage systems.
But even now, AWS has 51 current and immediate past generation instance types for a reason. U Michigan (my go-to example of a relatable university with a well-run research computing and data organization) has a research computing centre with 6 different systems plus cloud systems plus three different storage systems for different use cases. We’re getting to the point of having the mandatory opportunity to specialize thrust upon us, and nimble research computing systems teams are going to do well.
Quantum Algorithm Implementations for Beginners - Abhijith et al, ACM Transactions on Quantum Computing
LANL Publishes Guide to Quantum Computer Programming - HPC Wire
A sizable group from LANL has released this lovely 90-page introduction to quantum computing, with hands-on exercises (based on code samples in Jupyter Notebook for both local simulation and for running on IBM’s real quantum systems) covering 20 different algorithms.
Meta Platforms Hacks CXL Memory Tier into Linux - Timothy Prickett Morgan, The Next Platform
TPP: Transparent Page Placement for CXL-Enabled Tiered Memory - Al Maruf et al, arXiv:2206.02878
We’ve talked about CXL here several times; it’s great to see it not just being used but with code upstreamed into the Linux kernel!
Facebook here is continuing its long work on pooling memory, moving from RDMA over InfiniBand to making use of Intel’s CXL protocol 1.0 atop PCIe5. This is going to be much slower than what’s planned for later CXL and PCIe versions, but it’s a great hint at what’s to come.
The idea here is to use CXL memory sitting in a card on a system, available locally (and even remotely, eventually) with a delay like a NUMA hop, but without another CPU around it. The trick then is to decide how to manage that memory without a CPU there - first how to understand what’s being used, and second to decide when and where to migrate memory.
The paper describes a new protocol, TPP, for handling the movement of pages onto and off of CXL memory, and tests it on a number of workloads. They also report on a tool, Chameleon, for monitoring memory performance, and run it on some web-services, caching, and data warehouse workloads.
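To make the “when and where to migrate memory” question concrete, here is a toy sketch of a two-tier page placement policy. To be clear, this is not TPP’s actual mechanism: the class name, the promote-after-N-accesses rule, and the evict-least-recently-used rule are all simplifying assumptions of mine, and for simplicity every page starts out in the slow tier.

```python
from collections import OrderedDict


class TwoTierMemory:
    """Toy model: a small fast tier (local DRAM) plus an unbounded slow tier."""

    def __init__(self, fast_capacity, promote_after=2):
        if fast_capacity < 1:
            raise ValueError("fast_capacity must be at least 1")
        self.fast_capacity = fast_capacity
        self.promote_after = promote_after
        self.fast = OrderedDict()  # page -> None, least recently used first
        self.slow_hits = {}        # page -> accesses seen while in the slow tier

    def access(self, page):
        """Touch a page and return the tier that served this access."""
        if page in self.fast:
            self.fast.move_to_end(page)  # mark as most recently used
            return "fast"
        hits = self.slow_hits.get(page, 0) + 1
        self.slow_hits[page] = hits
        if hits >= self.promote_after:
            self._promote(page)  # hot enough: later accesses will be fast
        return "slow"

    def _promote(self, page):
        """Move a hot page to the fast tier, demoting the coldest one if full."""
        del self.slow_hits[page]
        if len(self.fast) >= self.fast_capacity:
            victim, _ = self.fast.popitem(last=False)
            self.slow_hits[victim] = 0  # demoted page restarts its hit count
        self.fast[page] = None


mem = TwoTierMemory(fast_capacity=1, promote_after=2)
trace = [mem.access(p) for p in ["a", "a", "a", "b", "b", "a"]]
```

With a one-page fast tier, page "a" gets promoted after its second access, and is then pushed back out once "b" becomes hot. The hard parts of the real problem, such as measuring page temperature cheaply and without a CPU next to the memory, are exactly what this toy ignores.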
With the hyperscalers taking notice, and with the CXL roadmap fairly concrete for the next couple of years, and with work already being done in the Linux kernel, this is going to be something to keep an eye on in the next couple of years.
HPCWire alerts me to a paper from ISC I hadn’t read about - work, much of which is already integrated into LLVM, for doing OpenMP 5.1’s GPU offloading onto the GPUs of remote nodes, with a proof of concept of farming out tasks to 120 GPUs! I’m a big believer in using standards (OpenMP, standard language parallelism) for making use of accelerators, so this was nice to read about.
A single beaver caused a massive, hours-long cell and internet outage in northern British Columbia, Canada.
Great to see that Arm servers have gotten to the point that now we can track down weird multi-core or multi-socket performance bugs on non-x86 systems! Hunting a NUMA performance bug on ARM.
Running Windows NT 4 for MIPS systems using QEMU, for some reason.
Generating true random numbers from bananas, by tracking the radioactive decay of potassium therein.
Animations of arithmetic on elliptic curves, as for cryptography.
I don’t think I knew you could group_by, map, or flatten values with jq?
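If, like me, you think more readily in Python than in jq filters, here is a rough sketch of what those three operations do. The sample records and the helper name `group_by` are made up for illustration, and note that the flatten step here goes only one level deep.

```python
from itertools import chain, groupby
from operator import itemgetter

# Made-up sample data for illustration.
records = [
    {"lang": "python", "stars": 3},
    {"lang": "c", "stars": 5},
    {"lang": "python", "stars": 4},
]


def group_by(items, key):
    """Roughly jq's group_by(.key): sort on the key, then collect equal runs."""
    ordered = sorted(items, key=itemgetter(key))
    return [list(run) for _, run in groupby(ordered, key=itemgetter(key))]


groups = group_by(records, "lang")
# Roughly map(map(.stars)): pull one field out of every record in every group.
stars_per_group = [[r["stars"] for r in group] for group in groups]
# Roughly flatten(1): merge the per-group lists into a single list.
all_stars = list(chain.from_iterable(stars_per_group))
```

As in jq, the groups come back sorted by the grouping key, so "c" lands ahead of "python" here.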
How fast can 1975’s 6502 processor transfer memory? About 57kB/s on the C64, up to 664.2kB/s with a new(!) 6502 motherboard.
Too young for that ARPANET/BITNET/early internet experience? Or just an oldster like me missing it? Welcome to telehack. (Anyone still running any MUDs?)
Decoding weird magic numbers, backwards compatible to the Burroughs 5700, in old Fortran code.
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
Jobs Leading Research Computing Teams
This week’s new-listing highlights are below; the full listing of 153 jobs is, as ever, available on the job board.
Program Manager, Data Science - City of Richmond, Richmond BC CA
The Data Science Architect manages and controls the increasingly complex data integrations required between the City’s applications and Data Analytics platform to design and deliver advanced analytics solution required by business departments. The role understands the customers’ data analytics needs and recommend solutions. The role controls the complexity of the platform by employing techniques that include designing common data schemas, building reusable pipelines/integrations, and documenting requirements/models/solutions.
Manager, Research Software Engineering - Chan Zuckerberg Biohub, San Francisco, CA
The Chan Zuckerberg Biohub has an exciting opportunity for an exceptional candidate to be our new Manager of Research Software Engineering (RSE). In this highly technical, hands-on role, you will form and lead a team of RSEs that will be focused on developing, optimizing, testing, and maintaining research pipelines and frameworks in multiple scientific domains such as genomics, proteomics, and image informatics. https://www.linkedin.com/jobs/view/3122492456/
Senior Manager, Data Engineering and Architecture - Abbvie, North Chicago IL USA
As a Senior Manager, you will be a core member of a high-performance team of data engineers and architects focusing on driving technology innovation and continuous improvement. This role collaborates with senior leadership, business relationship managers, product owners, program managers, enterprise architects, business analysts, infrastructure team, and service providers to deliver the solutions.
Associate Director, Biostatistics - Everest Clinical Research, Remote CA
Everest Clinical Research (“Everest”) is a full-service contract research organization (CRO) providing a broad range of expertise-based clinical research services to worldwide pharmaceutical, biotechnology, and medical device industries. To drive continued success in this exciting clinical research field, we are seeking a committed, skilled, and customer-focused individual to join our winning team as Associate Director, Biostatistics (Statistical Operations) for our Toronto/Markham, Ontario, Canada on-site location, or remotely from a home-based office anywhere in Canada in accordance with our Work from Home policy.
Software Engineering Lead, Backend - Ellison Institute for Transformative Medicine, Los Angeles CA USA
The Lawrence J. Ellison Institute for Transformative Medicine strives to leverage technology, spark innovation, and drive interdisciplinary evidence-based research to reimagine and redefine cancer treatment, enhance health, and transform lives. The Lawrence J. Ellison Institute is seeking a talented and passionate Software Engineering Lead, Backend, to join its team. In this role, the successful candidate will help build a foundational data and services infrastructure that enhances and integrates our clinical, computational and biomedical research workflows. You will contribute to the entire cycle of project development, from design to operation maintenance and quality controls. The projects require that you learn and use your experience to solve problems related to data integration, deployment of cloud-based services, orchestration, automatization, big-data processing, system analytics, security, privacy, and others.
Software Developer III, Stanford Center for Biomedical Research - Stanford, Stanford CA USA
BMIR seeks a Software Developer 3 to perform advanced and diverse user interface and user experience (UI/UX) development work involving multi-project and broad responsibilities. Successful candidates should be able to contribute to all project phases, from systems analysis through implementation, test, and evaluation, and will work on systems and programs typically covering multiple and/or large systems and functions. Candidates should expect to provide state of the practice and state of the art UI development skills across BMIR, including on the CEDAR project (https://metadatacenter.org). Lead projects, as necessary, for special systems and application development in areas of complex problems.
Senior Product Manager - Quantum Computing - AWS, Seattle WA USA
You will drive Amazon’s Working Backward process to identify customer requirements, define the features they need, and then work with our engineers, scientists, and partners to execute and deliver a compelling product. This role requires an individual who can balance feedback from customers and partners with an understanding of what is possible today and might be possible in the future and then work with our internal teams to prioritize, build a roadmap and execute the launch.
Inaugural Director of UC Berkeley’s Eric & Wendy Schmidt Center for Data Science & Environment - UC Berkeley, Berkeley CA USA
Once fully formed, the Schmidt DS4E Center will include a team of researcher software engineers, data scientists, machine learning researchers, postdoctoral scholars, environmental program managers, administrative staff, science communication experts, as well as UCB faculty and student researchers from various disciplines. The Executive Director will play a lead role in implementation of the Center’s vision, team building, strategic partnership management, operations, impact measurement, reporting, and financial/human resource administration.
Associate Director, Research Software Engineering - Princeton, Princeton NJ USA
The Research Software Engineering (RSE) Group, located institutionally in Princeton Research Computing but extending across campus, is hiring an Associate Director of Research Software Engineering. You will report to the Director of Research Software Engineering for Computational and Data Science, but your area of expertise might range beyond Computational and Data Science. The RSE Group collectively provides computational research expertise to nearly every division at Princeton: Engineering and Applied Science, Humanities, Social Sciences, Natural Sciences. The RSE group is a centralized team of software experts focused on improving the quality, performance, and sustainability of Princeton’s computational research software.
Enterprise Architect - HPC - Lenovo, Various and Remote USA
Support the HPC Services Practice in the development and delivery of HPC-related service offerings. Technical lead for HPC services offerings, working closely with the HPC Practice leader throughout the service offering lifecycle, including: assisting in the creation of new service offerings, starting with participation in the brainstorming phase, and providing expert guidance; assisting/leading early implementations with Technical Consultants.
HPC Software Development Manager, Elastic Fabric Adaptor - AWS, London UK
The AWS HPC EFA team is building the software stack that enables low-latency, high-bandwidth networking for HPC and ML workloads. This is an opportunity to build systems that enable HPC workloads to scale, interacting with numerous AWS teams and Open Source Communities. You will lead a team that designs, builds and deploys new features that extend the performance and capabilities of the AWS HPC EFA product. Your team contribute to several open source projects for the user space software that makes EFA work. You establish the charter and tenets of a new team, build a roadmap, and set an exciting vision for your engineers. You are focused on the growth of each member of your team, guiding them as they advance in their careers.
Executive Director, Office of Research Computing and Data (ORCD) - MIT, Boston MA USA
To serve as the lead administrator of the newly created ORCD, under the direction of its faculty head and vice president for IS&T. Leadership duties will include oversight of all aspects of creating and directing, recruiting, and retaining the staff necessary to meet the Institute’s research computing and data goals; achieving the mission, vision, and strategy for the further development of collaborative Research Computing Infrastructure and Data (RCID) services; and setting the strategic direction for operational effectiveness and long-term sustainability of RCID services with the goal of providing centralized delivery and support for many of MIT’s research computing capabilities.
Open Science Lead - Wellcome Sanger Institute, Hinxton UK
We have an exciting new opportunity for an Open Science Lead to join our team on a 12 month fixed term contract to ensure that our commitment to open science is met and supports our mission to maximise the benefit of our research and supports the delivery of world-leading science by our committed scientists and technical staff. Responsible for reviewing and developing the Institute’s Open Science portfolio. Help us ensure our world-leading science maximises its impact to the research community. Provide the Director and senior leadership of the Institute with expert advice on open science and develop recommendations to help us improve. You will carry out a landscape review of open science at the Sanger Institute and Connecting Science and develop recommendations for improvement. You will also develop a new overarching policy on open science for the institute, alongside identifying new ideas and areas for improving open science.
Senior Research Software Engineer - Application Engineering - Oak Ridge National Laboratory, Oak Ridge TN USA
Serve in a lead role working closely with management and staff in both CSMD and the Neutron Scattering Division to evaluate and optimize the software development practices of the group, determine appropriate staffing levels, and ensure success of the project. Lead planning and major development efforts on scientific software projects. Lead or collaborate on proposals in computing that support better software and infrastructure for the scientific endeavors. Coordinate, lead, and act as a representative of the Laboratory in international collaborations related to scientific software. Act as a mentor for project members, junior staff, post-graduates, and students to help them grow.
Senior Project Manager, Biodiversity Genomics Europe Project - Naturalis, Leiden NL
The project manager works jointly with the project coordinator to ensure the implementation of the work programme across the consortium. He/she works in close collaboration with other beneficiaries, partners, and boards. Furthermore, the project manager regularly engages with the EU project officer and financial officer, with partner organisations, officials in the research area. The project manager will provide top-level management of the project with regards to all communication, organisational, administrative, logistics, contractual and financial matters. The project manager will be supported by a project administrator throughout the duration of the project.
Systems Administrator - Rutgers Office of Advanced Research Computing: https://oarc.rutgers.edu, Rutgers, The State University of New Jersey
The Rutgers University Office of Advanced Research Computing (OARC) is seeking a Systems Administrator to join our team.
We are a large and diverse team working to create an outstanding environment for research computing at Rutgers. A key part of OARC’s responsibility to the University is to ensure that we are seeking and supporting the best solutions for constantly evolving computational research challenges. Doing that effectively requires us to carefully address the needs of the community we serve and to ensure that our support networks are diverse, inclusive, and equitable. Diversity is welcome and encouraged here to fuel innovation and ensure exploration of the broadest range of possible solutions to research problems. Prioritizing equity ensures access to resources needed to support the ideas, initiatives, and efforts of all contributors to our research support operations. Above all, we value our team members and their unique capabilities, interests, and experiences that, in concert, form OARC’s mission and vision.
Reporting to the Director, Advanced Computing Infrastructure (ACI), the Systems Administrator V is a highly skilled and experienced professional role supporting the university’s Advanced Research Computing (ARC) infrastructure, including High Performance Computing (HPC), High-Throughput Computing (HTC), and Data-Intensive Computing environments. The Systems Administrator V is expected to perform the following duties: conceive, design, develop, optimize, integrate, and maintain HPC systems and on-site cloud infrastructure, lead technical operation and continued development of HPC, on-site cloud infrastructure and storage services, and provide hardware, software, and end-user administration and support to a diverse group of end users that need access to ARC resources. The incumbent operates as a member of the ARC team with focus on one of the University’s campuses.