RCT #161 - Supporting RCD teams as standards rise; Project management for developers; LLMs, software development, and Copilot X; Test flakiness, languages, and fixes; persistent identifiers; SRE approach to risk management
I feel strongly about the RCT job board - operationally, one of the things that defines a profession is having common places to look for new career opportunities. It indicates that there is widespread recognition of and demand for the skills and expertise of the community. It demonstrates that the community has grown and diversified enough to offer a variety of career paths and options for its members, and that the community supports their professional development and mobility. It’s important for the specialized and important community of practice we’re trying to build here together, so please keep submitting those jobs!
But another terrific use of the job board is as a source of data, tracking the needs of the institutions that require our services. In maintaining the job board over the last three-ish years, I’ve already seen some changes. And going through the postings recently, there are some pretty clear shifts.
Some won’t be that surprising to… well, to anyone. There are a tonne of quantum computing lead or manager jobs, something that barely even registered as an option three years ago. There are a lot of jobs for very applied AI teams, whereas when I started they were dominated by exploratory research groups. (And when I started this, “machine learning” was declining as a term in favour of “data science”; “machine learning” has already caught back up.) The demand for bioinformatics manager roles, which skyrocketed during the early stages of the pandemic, has somehow stayed steady.
Academic job postings are still staying up forever before getting filled, but I notice amongst those jobs a richer and more challenging set of roles - increasingly, campus-wide jobs, often spanning several of our disciplines (software, data management, data science, computing platforms), and explicitly designed to interact with other teams within the same institution.
This last change is related to the biggest shift I’m seeing. RCD teams have truly arrived - it’s no longer about exploratory work for a few already sophisticated users. It’s a fundamental and necessary part of all forms of research.
And that’s being recognized in increasingly operational, professionalized, “keep the ship on course” sorts of roles, titles, and job descriptions I rarely if ever saw even 18 months ago:
- Operations managers (for data science, data engineering, or even software teams)
- Product management jobs, which I’ve mentioned before - for software, but also for data resources or pipelines
- Delivery management jobs, which I've also mentioned before - emphasizing consistent, reliable execution outside of the context of projects with clear end dates
- Research SRE (Site Reliability Engineer) jobs, or DevOps jobs with a reliability component
- Significant data science teams within Institutional Research departments, using data science for day-to-day academic operations planning
- PRISM/Research manager jobs for software, platform, or AI teams
- Responsible/Ethical data or AI jobs in addition to the data governance/data management jobs I had already been seeing
Our institutional needs are changing. (Well, they had already changed. Now the institutional decision makers have recognized that the needs have changed.) There’s growing understanding that to have the greatest impact on science and scholarship possible, “best effort” ad hoc approaches aren’t enough. Teams aligned with other teams and with institutions on priorities, executing regularly and reliably while growing and developing new capabilities, are what’s needed.
People like you, reading this newsletter, are by definition already the ones trying to find resources and use them to improve how your team supports research. We’re preferentially the ones thinking about processes, using retrospectives to build our teams and improve our work, making sure we’re meeting actual researcher needs by actually talking to them, sharing knowledge internally, working with rather than competing against other teams, defining clear services, and holding and enforcing clear standards - all in the service of making science and scholarship better.
It’s the other teams I worry a bit about.
The gap is already pretty clearly visible between the teams that used the turmoil of the last few years as a spur to become more intentional about how they worked together, more capable, more supportive of each other, more flexible and quicker-reacting, and those that just kind of muddled along. As the expectations grow from our institutions, I suspect that gap will grow.
I want those other teams to do well as standards rise - the team members and the researchers at their institutions both deserve that - but honestly, I’m not sure how best to make sure they get the support and encouragement they need.
Do you have any suggestions? What made you first start looking around for resources to help support your team better, and what did you start looking for? What are some resources that such managers need for them to get started on that journey? I’d love to hear any suggestions - please just reply, or email me at jonathan@researchcomputingteams.org.
PS - thanks for your responses to the form last week in the email edition - the results came out roughly evenly distributed between "the new way (after the Manager, Ph.D. split) is better" and "about the same", with no votes for the old way being better, so that's a relief. I'll keep using those to collect quick impressions, but as always, I'd prefer to hear from you directly. Feel free to send an email - between the newsletters and work, I'm slow to respond sometimes, but I do respond, and I read and appreciate everything you send in.
Also, a reader helped me realize I hadn't mentioned lately that I'm happy to jump on a quick call if you ever have a question - I'm happy to do anything I can to support the people who have stepped up to lead and manage our important teams.
With that, on to the roundup!
Managing Individuals
In Manager, Ph.D. earlier this week, I talked about:
- The advantages we have and can lean in to when having difficult conversations
- Using Bing Chat as an easy way to get started practicing feedback conversations
Technical and Project Leadership
Project Management for Software Engineers - Kevin Sookocheff
Sookocheff writes a good crash course in project management for those in computing fields - most importantly because it doesn’t fall into the usual trap of going straight to some tool or process.
Project management, the empirical art of successfully finishing things together, is a response to decades (centuries? millennia?) of projects failing in the same ways. It can be written out quite grandly, but it comes down to just four things, in order of importance to us in our line of work:
- Communication between humans, by means both synchronous (for getting agreement, maintaining momentum, and learning) and asynchronous (for everything else)
- Accountability - which is mostly about communication - both within the project team and of the project team as a whole,
- Adaptability as things come up,
- Planning and foresight.
Planning and foresight are vitally important, of course, but in our teams that’s usually pretty well under control.
If you couple that planning with communications (both ongoing, and of a clear, mutually understood goal), accountability for staying on track, and adaptability as things come up, the project is more likely to succeed than fail. Without them, the default outcome isn’t success.
Sookocheff does a really nice job of walking through key parts of the project process from PMBOK. He spends about half the time on project initiation, which is fantastic - if a project takes shortcuts here, it can lead to mistakes that just can’t be recovered from. Then he emphasizes communications and accountability through the rest. It’s definitely a good resource to keep handy for the next time you hand a team member their first project management responsibility.
Research Software Development
LLMs will fundamentally change software engineering - Manuel Odendahl
GitHub Copilot X: The AI-powered developer experience - Thomas Dohmke
This is going to be an important topic for a long time: how do we use LLMs effectively in our work? Software development was a very early adopter here, with GitHub Copilot, but now with free options like ChatGPT, Bing Chat, and Bard, on top of dozens of products popping up, adoption is only going to accelerate.
Odendahl’s article is a pretty positive take on the utility of the tools; I won’t be making any effort to balance it out with a negative take. As you know, in our line of work “we should keep doing things the way we’ve always done them” is pretty much the default, and we hear it enough from our colleagues. People like us who are actively seeking out better ways to lead and manage our teams don’t need to hear more of it, and most of the negative takes on the technology’s utility haven’t yet amounted to much more than that. There are real ethical concerns with the data going into the models and with LLMs’ wider impacts; I’m not qualified to curate readings on those topics. My beat, where I’ll continue focussing, is “how do we use these things that exist to advance research and scholarship in our institutions, given our priorities and constraints.”
I quite like Odendahl’s article, because it has what I think is the right starting point:
We don't know how to work with these tools yet: we just discovered alien tractor technology. Many critics try to use it as if it was a plain old gardening rake, dismissing it because it plowed through their flowerbed.
and yet it suggests some of the ways LLM tools will probably change the day-to-day of programming:
- To help the tools output useful code, we need to give them context, by visiting relevant parts of the code base (probably a useful first thing to be doing anyway) and writing a bunch of natural-language or machine-readable documentation, à la OpenAPI (again, probably a useful step at any rate) - see the sketch after this list.
- They make programming, even by yourself, more like whiteboarding and rubber-ducking, again probably useful skills to practice.
- These tools dramatically lower the cost of exploratory coding and rapid testing of ideas, meaning we can more easily try something, see it’s a dead end, and throw it away and try again. Honestly, this is probably the change in practice I would find hardest, and yet it seems like one of the most valuable.
- They lower the cost of building small useful tools, even ones only we ourselves will find helpful in the future.
- They will enable continuous code review, not just continuous testing.
- They help developers spend more time focussing on the big picture.
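To make the first point concrete, here’s a minimal, entirely hypothetical sketch (the names and domain are invented) of the kind of context-rich, machine-readable documentation that helps: explicit types and structured docstrings that give an LLM assistant - or, for that matter, a new teammate - enough context to write correct calling code without reading the implementation.

```python
# Hypothetical example - the names and domain are invented for illustration.
from dataclasses import dataclass


@dataclass
class SampleRecord:
    """One sequencing sample, as exported from our (hypothetical) LIMS.

    Attributes:
        sample_id: Laboratory identifier, e.g. "S-2023-0042".
        organism: NCBI taxonomy name, e.g. "Mus musculus".
        read_count: Total reads after QC filtering; always >= 0.
    """

    sample_id: str
    organism: str
    read_count: int


def filter_low_coverage(
    records: list[SampleRecord], min_reads: int = 1_000_000
) -> list[SampleRecord]:
    """Drop samples below a read-count threshold.

    Args:
        records: Samples to filter; the list is not modified in place.
        min_reads: Minimum post-QC read count required to keep a sample.

    Returns:
        A new list of the samples with read_count >= min_reads, in order.
    """
    return [r for r in records if r.read_count >= min_reads]
```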
It’s interesting to me that some of the negative responses to this article describe in detail experiments that failed after making none of these changes to the authors’ workflows. To my mind that supports Odendahl’s main line of argument - to make the most of these tools, our workflow would have to change.
(Steve Yegge in one of his trademark rants this week mentions one of the most frequently heard concerns, that code coming from these tools may not be trustworthy: “software engineering exists as a discipline because you cannot EVER under any circumstances TRUST CODE.”)
Meanwhile, GitHub Copilot is announcing further technology previews based on GPT-4 (the main Copilot model is significantly earlier, and built solely on source code):
- Suggesting and helping with descriptions for PRs
- Warning if there’s insufficient testing for a PR
- Chatbot based on the source code’s documentation
- Help at the CLI
Is playing with LLM-powered tools something your software team is doing? What have you found works well, and what hasn’t? I’d love to know, and to share it with our RCT community if you’re willing - just email me at jonathan@researchcomputingteams.org.
Test flakiness across programming languages - Costa et al, IEEE Transactions on Software Engineering
Test Flakiness Across Programming Languages - Greg Wilson, It Will Never Work In Theory
Interesting overview, highlighted by Wilson, of how the causes of test flakiness vary with the language of the code base (resource leaks being most common in C, dependency issues most common in JavaScript and Python), but also how just a dozen strategies account for 85% of fixes.
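To make that concrete, here’s a hypothetical pytest illustration (not drawn from the paper’s dataset) of one classic flakiness pattern - tests coupled through shared state, so the outcome depends on execution order - and its textbook fix, the sort of simple, repeatable strategy the authors catalogue:

```python
import pytest

# Flaky version: both tests touch a shared module-level dict, so the outcome
# depends on execution order. In pytest's default file order everything
# passes; under randomized ordering (e.g. pytest-randomly) it fails
# intermittently.
_shared_cache: dict = {}


def test_cache_starts_empty_flaky():
    assert "answer" not in _shared_cache


def test_stores_result_flaky():
    _shared_cache["answer"] = 42
    assert _shared_cache["answer"] == 42


# The fix: give each test its own fresh state via a fixture, removing the
# order dependency entirely.
@pytest.fixture
def cache() -> dict:
    return {}


def test_cache_starts_empty(cache):
    assert "answer" not in cache


def test_stores_result(cache):
    cache["answer"] = 42
    assert cache["answer"] == 42
```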
Research Data Management and Analysis
Incentives to invest in identifiers: A cost-benefit analysis of persistent identifiers in Australian research systems - Josh Brown, Phill Jones, Alice Meadows, Fiona Murphy, MoreBrains
Persistent data identifiers aren’t glamorous, are fiendishly tricky to get to work correctly in all the various corner cases, and yet are absolutely foundational to data reuse.
This report was commissioned by the Australian Research Data Commons, which has released a number of great reports lately; this is part of their National Persistent Identifiers strategy and roadmap planning.
Here the authors go through three case studies of the use of persistent identifiers. The first is of the research process itself - e.g. ORCiD for connecting authors and papers, researchers and grants, etc. As the research endeavour grows (there are apparently 108,873 FTE academic staff in Australia), being able to effectively communicate research outputs of individuals and institutions grows more challenging, and requires some sort of automation.
Or not - the alternative is everyone doing it themselves manually, which the report estimates would chew up $24M worth of people’s time and associated costs.
What’s interesting to me is that even in this case - where there’s a clear benefit to the people creating the IDs and using the metadata, and support within the ecosystem for taking advantage of the persistent IDs - there’s still ongoing support needed to finish the adoption challenge.
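If you haven’t seen what that automation looks like in practice, here’s a rough sketch against ORCiD’s public API - the v3.0 works endpoint as documented at the time of writing; treat the exact JSON layout as an assumption to verify:

```python
import requests


def fetch_work_titles(orcid_id: str) -> list[str]:
    """Return the titles of works attached to a public ORCiD record."""
    resp = requests.get(
        f"https://pub.orcid.org/v3.0/{orcid_id}/works",
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    titles = []
    # Works are grouped (one group per distinct output); each group holds
    # one or more work summaries from different sources.
    for group in resp.json().get("group", []):
        for summary in group.get("work-summary", []):
            title = ((summary.get("title") or {}).get("title") or {}).get("value")
            if title:
                titles.append(title)
    return titles


if __name__ == "__main__":
    # 0000-0002-1825-0097 is ORCiD's long-standing documentation example record.
    for t in fetch_work_titles("0000-0002-1825-0097"):
        print(t)
```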
Other case studies include PIDs for various use cases in Australia’s Terrestrial Ecosystem Research Network, and use of PIDs for other areas in the research ecosystem.
At a very different level of granularity, a call to use more modern SQL capabilities like window functions, PIVOT, MERGE, and INSTEAD OF triggers - Time to break free from the SQL-92 mindset.
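For a quick taste of what the article is arguing for, here’s a runnable sketch of one post-SQL-92 feature, window functions (supported in SQLite since 3.25, so Python’s built-in sqlite3 will usually run it); the table and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE runs (project TEXT, run_id INTEGER, wall_seconds REAL);
    INSERT INTO runs VALUES
        ('alpha', 1, 320.0), ('alpha', 2, 290.5), ('alpha', 3, 305.2),
        ('beta',  1,  80.1), ('beta',  2,  75.9);
    """
)

# Rank each run within its project by wall time - one pass over the data,
# where the SQL-92 approach needs an awkward correlated subquery or self-join.
query = """
    SELECT project, run_id, wall_seconds,
           RANK() OVER (PARTITION BY project ORDER BY wall_seconds)
               AS rank_in_project
    FROM runs
    ORDER BY project, rank_in_project;
"""
for row in conn.execute(query):
    print(row)
```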
Research Computing Systems
Managing Risk as an SRE - Ross Brodbeck
Brodbeck describes the basics of the "SRE" approach to risk management - something that was called out with a whole chapter in Google's book which popularized the practice of SRE for reliable uptime.
My experience has been that systems team members who have come from an experimental or observational science background have a stronger intuition for risk management than those of us who come from a theoretical or purely computational upbringing. Even so, there's often not a principled approach to risk management - thinking critically about and prioritizing risks, and having a real, nonzero risk budget (first rule of risk management: the ideal amount of risk is never zero, even if zero risk were possible, which it isn't).
Our systems' needs are different than those of a massive web services company - but that just means our risk prioritizations would be different. The approaches can still work very nicely; Broadbeck's short article is a good starting point.
This has been in the news for a while, so you may well have already seen this (and certainly if you’re UK-based), but the UK Future of Compute report and recommendations is a commendably clear document that is expected to greatly inform the UK government’s thinking and so is worth reading for that reason alone.
The arguments made here will be familiar, so I won’t summarize it, but three things I kept coming back to while reading it myself:
- It remains fascinating to me, though not surprising, how completely AI and other data-intensive use cases (including sensitive data) dominate the discussion. This would have been less marked even 5 years ago.
- It emphasizes the importance of skill development, knowledge transfer, and professional staff, which is right and proper, but every time this comes up it’s the second point, after hardware.
- It repeats a mistake made all too frequently, conflating “pioneering” usage with “largest scale” usage. Number of compute units/size of dataset/whatever is just one dimension — and, let’s be honest, the least interesting one — along which one can be at the cutting edge.
Emerging Technologies and Practices
Deep(er) dive into container labels and annotations - Christian Kniep
Kniep proposes using labels in container images, and their use in annotations and manifests, as a way to communicate execution-time metadata in HPC container systems.
Speaking of containers, I just noticed that there’s a Manning Podman in Action ebook available for free from Red Hat if you’ve registered for a free Red Hat account - I’m hopeful that Podman will have most of what systems teams need from container systems while not being quite as limited as Singularity. Kniep’s post above mentions one advantage of Podman: being able to use zstd-compressed images.
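If you haven’t worked with image labels before, here’s a rough sketch of reading them back out through podman’s JSON inspect output - the exact field layout varies by version, so treat the fallback logic here as an assumption to check:

```python
import json
import subprocess


def image_labels(image: str) -> dict:
    """Return the label dict of a locally-pulled container image via podman."""
    out = subprocess.run(
        ["podman", "image", "inspect", image],
        capture_output=True, check=True, text=True,
    ).stdout
    info = json.loads(out)[0]  # inspect returns a JSON array of matches
    # Labels may appear at the top level or under Config, depending on version.
    return info.get("Labels") or info.get("Config", {}).get("Labels") or {}


if __name__ == "__main__":
    # e.g. labels set in a Containerfile with: LABEL org.example.mpi="openmpi-4.1"
    print(image_labels("alpine:latest"))
```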
Random
Early CPUs often didn’t have explicit hardware for multiplication, much less division - reverse engineering the multiplication algorithm in the 8086 processor.
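(For reference, the underlying idea is classic shift-and-add; a sketch of the general technique, not the 8086’s exact microcode sequence:)

```python
def shift_add_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts and adds."""
    product = 0
    while b:
        if b & 1:           # low bit of the multiplier set?
            product += a    # ...then add in the shifted multiplicand
        a <<= 1             # shift multiplicand left
        b >>= 1             # shift multiplier right
    return product


assert shift_add_multiply(1234, 5678) == 1234 * 5678
```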
The rise, fall, and influence of Visual Basic.
A cool online textbook on algorithms for decision making under uncertainty.
I can’t help it, I really enjoy various markup-based typesetting systems - some of my first non-handwritten assignments were written in troff, then LaTeX, then Markdown + pandoc. This new one, typst, strikes me as interesting.
“Miller is a command-line tool for querying, shaping, and reformatting data files in various formats including CSV, TSV, JSON, and JSON Lines.”
Implementing a transformer from scratch.
Notes on FFTs for implementers.
That’s it…
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
Jobs Leading Research Computing Teams
This week’s new-listing highlights are below in the email edition; the full listing of 185 jobs is, as ever, available on the job board.
Director of Research Support, Office of Advanced Research Computing: https://oarc.rutgers.edu - Rutgers, Piscataway NJ USA
The Rutgers Office of Advanced Research Computing (OARC) is on the lookout for a Director of Research Support to join our growing team!
Rutgers is a leading national research university and the State of New Jersey’s preeminent, comprehensive public institution of higher education. We are a diverse team working to create an outstanding environment for research computing at Rutgers. A key part of OARC’s responsibility to the University is to ensure that we are seeking and supporting the best solutions for constantly evolving computational research challenges.
The Director of Research Support will oversee a team of computational research support scientists (ACI-REFs), all of whom work closely with faculty, research associates, students, and staff on understanding their research, and allowing for the customization of training and support solutions. This position will be involved with national programs and organizations (e.g., CaRCC, ERN, CASC, AMIA, Internet2, Campus Champions, ACI-REFs) focused on research support and professionalization of CI Practitioners.
If you have a desire and believe you have what it takes, we want to hear from you!
Interested candidates are welcome to apply here: https://jobs.rutgers.edu/postings/194757.
Head of Research Software Engineering - University of Sheffield, Sheffield UK
We are looking for applicants who are highly passionate about reproducible research through development of robust software and are committed to advocating software best practice through engagement with the academic community. A key responsibility of the role will be to set strategic objectives, measure team performance and report on this to ensure team and institutional objectives and deadlines are met. Applicants should have extensive experience of collaborative software development and an appreciation of how to work with academics and researchers to specify, develop, improve and deliver research software. You will lead the delivery of the collaborative RSE service across a range of research projects, departments and institutes across the university, as well as free at the point of use support to assist in the development of new strategic collaborations.
Applied Research Cloud Architect - Old Dominion University, Norfolk VA USA
The Applied Research Cloud Architect establishes and defines cloud standards & governance for center-wide applied research and conducts and supervises technical development and architectural strategy in the areas of software development and infrastructure advancement, with special focus on research technology modernization utilizing cloud computing and managed services. The incumbent may serve as principal investigator or co-principal investigator for specific sponsored research projects and is responsible for the proper completion of statements of work under delivery orders or contracts.
Lead Research Analyst - Purdue University, West Lafayette IN USA
As the Lead Research Software Engineer, you will be an integral part of our researcher-facing research software engineering team. This position will be responsible for designing and developing software and systems solutions using computational tools on large-scale research computing systems. You will collaborate with researchers to lead computational implementation needed to address questions raised by cutting edge research while streamlining the scientific process, reducing bottlenecks, and improving scientific workflows. You will also provide leadership and technical project management for shared cyberinfrastructure initiatives and evaluate current and future trends in areas of high-performance computing.
High Performance and Research Computing Manager - University of Hertfordshire, Hatfield UK
The School of Physics, Engineering and Computer Science (SPECS) is seeking to appoint a High Performance and Research Computing Manager to support research and both undergraduate and postgraduate teaching.
Research Software Engineering Manager - University of Leeds, Leeds UK
As the Research Software Engineering Manager, you will own and be accountable for IT’s strategic partnership with researchers across campus. You will build, develop and lead an established team of RSEs to meet the strategic challenges of our research partners; engaging with stakeholders from across the University including researchers, educators and other teams in IT. The existing portfolio includes a wide range of research domains including Artificial Intelligence, Bioinformatics, Computational Fluid Dynamics, and Data Science, and this is expected to expand in the coming years into a wide range of emerging research areas.
Principal Software Developer - Tech Lead - University of Cambridge, Cambridge UK
The University of Cambridge's Information Services is looking for a hands-on Tech Lead to lead and manage a team of engineers working on building new cloud native services and modernising existing applications within a Division of 40 engineers. The services that the Division maintains, some of which are public facing, are mainly used by University staff, researchers, and students, who total around 60,000 people. Your work will have a significant impact on the reputation of one of the world's leading universities.
Head of Infrastructure & Network - Children's Cancer Institute, Sydney AU
The Infrastructure and Network Manager oversees and manages the Institute's technology infrastructure and network operations. This includes implementing, maintaining, and supporting the company's hardware, software, and network systems.
Technology Group Manager, Advanced Research Computing - Federal Reserve Board, Washington DC USA
The Technology Group Manager (GM) role in the Advanced Research Computing (ARC) section of the Division of Research and Statistics (R&S) reports to the ARC Chief, is a member of the ARC leadership team, and is responsible for managing a group of technical staff. At the direction of the ARC Chief, and with guidance from division officers, the GM works with staff to operationalize and implement Research technology programs and services. ARC is a mission-focused technology team responsible for creating and operating a dynamic computing environment optimized for economic research, analysis, and measurement serving the Federal Reserve Board of Governor's four economics divisions. It works in coordination with the enterprise IT function of the Board. ARC supports 800+ customers, and manages the hardware, software, compute, storage, and associated security functions for the computing environment. Additionally, it manages and maintains a high-performance computing (HPC) cluster necessary for the work of the Board.
Quantum Platform Engineer Manager - Zapata Computing, Remote or Boston MA USA or Toronto ON CA
The Zapata Platform team helps customers transform and evolve their business by building Zapata’s workflow platform, Orquestra™. As a Software Engineer working on the Platform team, you'll have the opportunity to work on everything from the engine that powers the quantum workflow language to the systems that enable users to easily analyze workflow data. Behind everything Orquestra™ users see is the architecture built by the Platform team to keep it running. From integrating quantum hardware to building the next generation of quantum software algorithms, we make building Zapata’s quantum portfolio possible.
Senior Manager, Bioinformatics - ThermoFisher Scientific, Santa Clara CA USA
As the Senior Manager, Bioinformatics you will manage and provide technical leadership to our genetic sciences bioinformatics team responsible for custom array designs for genetic analysis in research, human health, and agricultural genomics. Leading this team requires a good understanding of both population-scale studies as well as an understanding of the importance of a variety of individual markers that might be important in a given study. You will partner with R&D bioinformatics leaders to ensure we are applying the state-of-the-art array design capabilities in a consistent high throughput setting. This position reports directly to the Director, Microarray Laboratory and Bioinformatics Services.
Director of Product Management, Quantum at Scale Federal Incubation - Microsoft, Redmond WA USA
We’re looking for a Director of Product Management to lead strategic conversations with Federal customers bringing together the scale of Azure High Performance Computing (HPC), state-of-the-art Machine Learning (ML) models, and Azure Quantum.
Software Development Manager - SCIEX, Concord ON CA
Do you want to lead a team that designs and develops advanced software solutions in one of the world’s top life sciences companies? Does it excite you to know that the software you will build helps to create new drugs, analyze food and water quality, identify chemical hazards in the environment, and perform forensic tests – and thus positively impact people’s lives? The Software Development Manager manages a team of developers to design, implement, and maintain software for Sciex products. This is a technical hands-on leadership position, and you will have the opportunity to lead, inspire, coach, and mentor a team of software developers to deliver high quality software efficiently and effectively that is compliant with our quality system.
Health Data Innovations Delivery Manager, Health Informatics - Eastern Academic Health Science Network, Cambridge UK
The post holder will work as part of the Health Informatics Team, delivering programmes of work that provide secure cloud-based environments for storing and accessing health and other sensitive data for research and innovation. The team brings together partner organisations from across different sectors to provide the programme management, operational and technical functions for both nationally and locally driven projects. The post holder will support the delivery of a number of our complex projects in parallel, working with the team and external stakeholders to ensure that each project stays on track and successfully delivers its outcomes.
Manager, Machine Learning, Applied Computer Vision - Recursion, Toronto ON CA
You will report to the Director of Applied Computer Vision. You will manage a group of Machine Learning Scientists & Engineers that develop, own, and improve upon some of our core computer vision models for cell microscopy images. In your role, you define important aspects of Recursion’s roadmap for computer vision microscopy in collaboration with other leaders in Data Science, Engineering, Biology, and Product in service of Recursion’s high-level goals and long-term vision. You will also have the opportunity to collaborate with our research team in Montreal which is housed within the prominent DL research community: Mila.
Senior Product Manager, Cloud and Infrastructure at Cohere - Cohere, San Francisco or Palo Alto CA USA or London UK or Toronto ON CA
As a Senior Product Manager, your responsibilities include: Customer Engagement and Research, Design and Product Development, Execution