Research Computing Teams #98, 29 Oct 2021
Hi!
At least one other research group has also taken to providing some interview questions ahead of time. In response to the discussion in the last issue, Titus Brown wrote in on twitter:
We’ve also started giving out interview questions in advance. Like you, we've found it leads to better discussions. I actually posted one set here, and [Rayna Harris] blogged about her hiring experience on the other side of this, here. Don't remember where the idea came from initially, but we like being open about things so :shrug:
I’m really pleased the newsletter has grown to the point that we can have back-and-forths about topics like this, because I think we need it in this community.
This week I heard of one research software developer leaving a group because they weren’t getting the kind of work they found meaningful - partly because their manager only ever really talked with them about work goals and plans for the future at the institutionally mandated annual review.
Research computing and data is vitally important, and so are the teams doing that work. But no one ever really teaches research computing team leads or managers how to support those team members and lead those teams effectively. It’s bad for research, and bad for our team members - who deserve good support, and who are scandalously underpaid compared to industry and can leave at any time.
The thing is - and stop me if you’ve heard this - in research, we pretty much know the advanced skills of managing and leading - building a multi-institutional collaboration, creating a clear vision of a necessarily somewhat nebulous research project, communicating across different kinds of domain knowledge. But the basics - how do we hire, how do we set performance goals and nudge people towards them, how do we make sure we have open lines of communication with our team members (one-on-ones), when does it make sense to delegate - no one ever tells us this stuff. Hopefully with this newsletter community we can build some of that shared knowledge together.
Let me know (just hit reply, or email jonathan@researchcomputingteams.org) if there are any particular basics you’ve struggled with that we can talk about. In the meantime, on to the roundup, and have a happy Hallowe’en:
Managing Teams
Stand-up Meetings are Dead (and what to do instead) - Ben Darfler, Honeycomb Blog
What if daily standups, but for an hour?
Darfler describes how Honeycomb has switched their standups - from the usual short round-table format to a daily hour-long gathering (a “meandering team sync”) that includes social time, and then a collaboratively-edited catch-all agenda of work items. Rather than being formulaic, it becomes the standard place for team-wide discussions (technical or process) and also explicitly includes a social component. Darfler finds that it reduces total meeting time by always providing a venue for discussion of any given topic (much the way regularly-scheduled one-on-ones typically reduce interruptions - for both team member and manager - by providing a bucket for topics to go into). Darfler recommends starting with 30 minutes to see how it goes.
I think almost any meeting format can work for a team as long as there are regular check-ins about the success of the meeting format and opportunities for course correction. I’m not sure our team would switch to this format any time soon, but it’s an interesting idea and I’d be curious to see how it worked.
What sorts of meeting rituals do you use in your team? Do you have anything other than the weekly staff meeting and sprint rituals like standups, sprint planning, and retros? What’s worked well for your team (and alternatively, what did not work?)
A framework to improve performance - TLT21
Long-time readers won’t learn much from this article, but it’s a short read:
The first step to adequately address consistent performance issues (emphasis on the "consistent") is to ensure that you have a clear and transparent standard for good performance for each role.
It’s “easy” but takes a lot of work to have shared, common expectations about performance. As a manager or lead you can help with that by sharing your expectations regularly with your team members via feedback, and by helping the team develop its own explicit expectations of each other.
Then comes the work of nudging people towards those expectations if they’re not there yet - if the issue is one of behaviours, then feedback and coaching on behaviour; if it’s one of skills, then training; and if it’s one of knowledge, then documentation (which is often a whole-of-team effort itself).
Managing Your Own Career
How to get useful answers to your questions - Julia Evans
Evans gives advice for how to get useful answers to questions - the context she uses is technical questions, but honestly the approach works just as well for getting your boss or collaborators to answer questions in email, or anything else.
She offers two pieces of advice for making it easier for the question-answerer to give you the answer:
- Ask yes/no questions
- State your current understanding
And three pieces of advice for getting more out of the answer:
- Be willing to interrupt
- Don’t accept responses that don’t answer your question
- Take a minute to think
Product Management and Working with Research Communities
How To Produce a Webinar Series - Osni Marques et al., HPC Best Practices (HPC-BP) Webinar Series
The Exascale Computing Project has hosted 58 roughly monthly webinars on “HPC Best Practices”, so they’ve gotten it down to more or less a science now. In this github repo, the organizers have a checklist, a guidance email to presenters, and a paper from 2019 describing their experiences. This might be a good starting point if your group or community wanted to start organizing such a series.
Better coordination, or better software? - Jessica Joy Kerr
Coordination models - tools for getting groups to work well together - Jade Rubick
Collaboration is good, but in large-enough collaborations it isn’t feasible or even desirable to have everyone working as if they’re on one large team. As Kerr points out, at some point you need to move beyond collaboration to coordination around well-defined interfaces (with occasional, ad-hoc deep collaboration between members of the sub-teams).
Rubick is slowly writing a pretty impressive compendium of coordination models - within but also across teams - that he’s seen work, how to make them work, and their tradeoffs. Several of them are extremely relevant to research computing and data:
- Service provider
- Consultant (not yet written)
- Liaison
- Embedded
- Community of practice
There was a lot of discussion early on in the newsletter about centralized vs embedded RSE or data science teams; it’s nice to see someone thoughtfully writing up a more detailed overview of the kinds of coordination models, with frank looks at their failure modes.
Research Software Development
Is Research Software A Tangled Mess? - Peter Schmidt and Derek Jones, Code for Thought podcast
Jones writes a blog, and now a book, on empirical data about software engineering. Earlier in the year he wrote a post, Research software code is likely to remain a tangled mess, which we mentioned in #63 and which got a lot of, let’s say, “attention” in the RSE community. His comments about research software written by researchers weren’t really that controversial; it was more his doubt that the growing RSE effort will make many inroads.
Peter Schmidt interviewed him for the RSE Code for Thought podcast. I tend to be pretty sympathetic towards what he’s saying:
- There’s very little objective evidence in favour of most software development best practices; that doesn’t mean that what we call best practices are bad, but there’s not a lot of evidence to demonstrate they lead to objectively better outcomes
- The default for developers is to write tangled messes
- Researchers aren’t trained to do software development, and don’t do it every day
- Therefore most researcher-written code is a tangled mess
- That may not even be especially bad - one of the few statements about software development for which there is a lot of supporting data is that most software in industry has a very short shelf-life.
- If that’s the case, not putting effort into software engineering until a piece of code has stood a certain test of time/users isn’t obviously a poor choice
I think it’s likely at least as true in research as in industry that software has a short life, and tends never to be used by anyone other than its authors and their immediate peers in the lab. Most ideas don’t pan out (that’s true in business just as much as in research).
Maybe more controversially, Jones argued:
- Software “sustainability” doesn’t actually mean anything. Testing does, testing is good, but “sustainability” doesn’t.
Minimum Viable CD - Minimum CD signatories
There’s a lot of disagreement about what CI/CD means, with lots of people using the terms for fairly disparate things. This is a push for clarity around these terms; the signatories have a pretty modest list of requirements for continuous integration, which I think most research software teams probably meet:
- Trunk-based development
- Work integrates to the trunk at a minimum daily
- Work has automated testing before merge to trunk
- Work is tested with other work automatically on merge
- All feature work stops when the build is red
- New work does not break delivered work
and then for continuous delivery, which I think most groups are a little further behind on, not least because I think many systems teams don’t support it:
- Use Continuous integration
- The application pipeline is the only path to deploy to production.
- The pipeline decides the releasability of changes, its verdict is definitive
- Artifacts created by the pipeline always meet the organization’s definition of deployable
- Immutable artifact. No human changes after commit.
- All feature work stops when the pipeline is red
- Production-like test environment
- Rollback on-demand
- Application configuration deploys with artifact
Research Computing Systems
Real-World HPC Gets the Benchmark It Deserves - Nicole Hemsoth, Next Platform
Following hot on the heels of the news that China might already have two exascale computers but couldn’t be bothered to submit to the always-dubiously-meaningful top-500 list, Hemsoth reports on an actual real set of HPC benchmarks, put out by the Standard Performance Evaluation Corporation (SPEC) of SPECint, SPECfp, etc. fame (although I notice now that they haven’t been called that since 2006).
SPEC has long had MPI and OpenMP benchmarks; SPEChpc is a suite of 6-9 benchmarks, including kernels modelled after weather, astro, HEP, combustion, and solar physics codes, that involve combinations of MPI and OpenMP; there are separate accelerator versions too. The SPEChpc suite isn’t open source, but it’s freely available to academic or non-profit organizations.
The best benchmarks, of course, are your real workloads. I don’t know if SPEChpc will take off, or even yet whether it’s any good, but defining some kind of “official” semi-synthetic set of reasonable and easy-to-set-up kernels is a necessary first step in moving us away from the execrable Linpack benchmark used in the top 500. And rather than reporting just a single number, it lets users weigh more heavily the individual benchmarks closest to the communication and computation patterns of their own applications.
Emerging Technologies and Practices
Why JAX Could Be The Next Platform for HPC-AI - Nicole Hemsoth, The Next Platform
Machine learning–accelerated computational fluid dynamics - Kochkov et al., PNAS
End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman - Samantha Petti et al, bioRxiv
This is a very odd pairing of papers, with two very different fields I’ve worked in - bioinformatics and fluid dynamics - connected by a python library.
I first mentioned JAX in the newsletter way back in #34 as “a really cool autodifferentiation package for python”. While that is and was a key component, even then it was something more than a traditional autodiff tool, emphasizing composability - differentiating through loops, branches, and recursion - with primitives for vectorization, and the ability to compile code straight to GPUs (or Google’s TPUs). That makes it a useful tool for traditional numerical computation as well as for deep learning and AI.
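For readers who haven’t played with it, here’s a minimal sketch of how those pieces compose - this is my own toy example, not code from any of the work discussed below - with grad() differentiating through ordinary Python control flow, jit() compiling the result, and vmap() vectorizing it:

```python
# Minimal illustration of JAX's composable primitives -- a toy example,
# not from any of the papers discussed in this issue.
import jax
import jax.numpy as jnp

def f(x, n=5):
    # an ordinary Python loop, with an elementwise branch via jnp.where;
    # JAX traces and differentiates straight through both
    for _ in range(n):
        x = jnp.where(x > 0, 0.9 * x + jnp.sin(x), -x)
    return jnp.sum(x ** 2)

df = jax.jit(jax.grad(f))        # compiled gradient of f, via XLA
xs = jnp.linspace(-1.0, 1.0, 8)
print(df(xs))                    # df/dx, evaluated elementwise

# vmap turns a scalar function into a batched one automatically
cube_grad = jax.vmap(jax.grad(lambda x: x ** 3))
print(cube_grad(xs))             # 3*x**2 for each element of xs
```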
In the first article, Hemsoth interviews Stephan Hoyer, climate physicist turned Google AI applied scientist, about his team’s work with JAX-CFD, including a paper published in PNAS. JAX-CFD uses JAX for both traditional CFD numerics and AI; in the paper, Kochkov et al. first do high-resolution runs and train AI models for interpolation and correction, building (if I’m understanding this correctly) something like a subgrid/modified-advection model. They then demonstrate - on larger domains, decaying turbulence, and more turbulent simulations - that incorporating the model allows them to get the same accuracy running at lower resolutions; or, as Hemsoth succinctly summarizes:
even though it used quite a bit more computational power (150X more FLOPS) was only 12X slower at the same resolution but 80X faster for the same accuracy.
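The general pattern, as I understand it, is a cheap coarse-grid step plus a learned correction, with JAX differentiating through the whole rollout so the correction can be trained against high-resolution reference runs. Here’s a deliberately toy sketch of that pattern - entirely my own construction, not the JAX-CFD code; a 1-D diffusion update stands in for the real solver and a tiny convolution stands in for the learned model:

```python
# Toy sketch of the "coarse solver + learned correction" pattern --
# my own construction for illustration, not the JAX-CFD code.
import jax
import jax.numpy as jnp

def coarse_step(u, dt=0.01):
    # stand-in physics: explicit 1-D diffusion on a periodic grid
    lap = jnp.roll(u, 1) + jnp.roll(u, -1) - 2.0 * u
    return u + dt * lap

def correction(params, u):
    # stand-in learned model: a small convolution with trainable weights
    w, b = params
    return jnp.convolve(u, w, mode="same") + b

def rollout(params, u0, steps=20):
    u = u0
    for _ in range(steps):
        u = coarse_step(u) + correction(params, u)
    return u

def loss(params, u0, u_ref):
    # fit the corrected coarse rollout to a (coarsened) high-resolution
    # reference state
    return jnp.mean((rollout(params, u0) - u_ref) ** 2)

# gradients w.r.t. the correction parameters, through the full rollout
grad_loss = jax.jit(jax.grad(loss))
```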
The paper by Petti et al. is a bioRxiv preprint that uses JAX for something very different - the discrete problem of multiple sequence alignment, implemented here in a proof of concept for a problem in protein structure prediction. Smith-Waterman for sequence alignment is a classic dynamic programming problem. The authors first implemented a “smoothed” version of Smith-Waterman, so that the solution is differentiable (how does the alignment score of the modified algorithm change if we make a small change to the inputs?) and so can be readily calculated with JAX. With that in place, the alignment problem can be solved jointly with AlphaFold structure predictions, improving predictions of how the protein folds compared to known proteins. The senior author explains how in a twitter thread.
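For intuition about the smoothing trick, here’s a toy version - my own sketch, not the Petti et al. implementation, which differs in important details: replacing the hard max in the Smith-Waterman recurrence with a temperature-scaled logsumexp makes the alignment score a smooth function of the substitution scores, so JAX can push gradients back through it.

```python
# Toy "smoothed" Smith-Waterman -- an illustrative sketch only, not
# the Petti et al. implementation. The hard max in the DP recurrence
# is replaced by a temperature-scaled logsumexp, making the local
# alignment score differentiable.
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

def smooth_sw(scores, gap=1.0, temp=0.1):
    """scores[i, j]: substitution score for aligning residues i and j."""
    def smax(*xs):  # soft maximum; recovers the hard max as temp -> 0
        return temp * logsumexp(jnp.stack(xs) / temp)

    n, m = scores.shape
    H = jnp.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H = H.at[i, j].set(smax(jnp.array(0.0),
                                    H[i - 1, j - 1] + scores[i - 1, j - 1],
                                    H[i - 1, j] - gap,
                                    H[i, j - 1] - gap))
    # soft version of "best local alignment score anywhere in the table"
    return smax(*[H[i, j] for i in range(n + 1) for j in range(m + 1)])

# gradients of the score w.r.t. the substitution scores can now flow
# back to whatever model produced them
align_score_grad = jax.jit(jax.grad(smooth_sw))
```

(The Python loops here get unrolled at trace time, so this only makes sense for short sequences; practical implementations vectorize along the anti-diagonals of the table.)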
I’m not sure where deep learning and research computing will be going, but the fact that JAX is being used successfully in very different fields for very different applications makes it a package worth keeping an eye on.
Developing a unique FPGA testbed for UK researchers - Nick Brown, EPCC
FPGAs have been “emerging tech” for as long as I’ve been doing research computing and data (20 years?); back then, we were waiting for FPGAs to get easier. Now it’s starting to look like everything else has gotten harder, so FPGAs no longer stand out; and there are specific applications with vendor-sold FPGA solutions whose eye-watering speedups are building interest in other areas. In this article Brown talks about EPCC’s new FPGA testbed and some of the applications being worked on there.
Calls for Submissions
22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2022) - 16-19 May, Italy, Papers due 24 Nov
Tracks for topics are:
- Track 1: Future Internet computing systems
- Track 2: Programming models and runtime systems
- Track 3: Distributed middleware and network architectures
- Track 4: Storage and I/O systems
- Track 5: Security, privacy, trust and resilience
- Track 6: Performance modeling, scheduling, and analysis
- Track 7: Sustainable and green computing
- Track 8: Scientific and industrial applications
- Track 9: Artificial intelligence, Machine Learning and Deep Learning
2nd International Conference on Image Processing and Vision Engineering - Online, 20-24 April, Papers due 30 Nov
From the site:
IMPROVE is a comprehensive conference of academic and technical nature, focused on image processing and computer vision practical applications. It brings together researchers, engineers and practitioners working either in fundamental areas of image processing, developing new methods and techniques, including innovative machine learning approaches, as well as multimedia communications technology and applications of image processing and artificial vision in diverse areas.
Events: Conferences, Training
ECP SOLLVE - OpenMP Teleconferences - Monthly calls, Fridays (usually last of month), Zoom, Free
There are roughly monthly OpenMP calls by the Exascale Computing Project, featuring an update on ECP’s SOLLVE and a talk. The talk this month - which, unfortunately, this newsletter is going out too late to let you know about in advance - is “OpenMP Tasking, Part 2: Advanced Topics” by Xavier Teruel-García; but there will be another in early December, and then a return to the regular schedule in the new year.
Workshop on the Science of Scientific-Software Development and Use - 13-15 Dec, Free, Virtual
This workshop, sponsored by the U.S. Department of Energy, builds on reports from 2019 and 2020 on building software better:
With this increasing diversity, we believe the next opportunity for qualitative improvement comes from applying the scientific method to understanding, characterizing, and improving how scientific software is developed and used.
You can submit a position paper here, or just attend; a workshop report will summarize the breakout sessions.
Random
The thing is - and I don’t like it any better than you do - Javascript is here to stay, everyone already has a development environment for it installed, and webasm has helped push multithreading support. So now there’s a book on multithreaded javascript coming out.
Learn concurrency primitives with this adversarial game where you, the player, try to break multithreaded code - Deadlock Empire.
The new M1s look like they have really interesting multithreaded floating point performance.
An intermediate language for APL-type languages.
Prolog, but instead of values being true/false, they have probabilities.
When to use each of the git diff algorithms.
Round-robin your way to your new favourite coding font. And here’s how it was built with low-code tools.
Anyone ever used database service free tiers for little side products? Cockroach labs talks about theirs here.
GitHub now has a beta feature where it will try to find “merge queues” of PRs which can be merged sequentially without conflict.
That’s it…
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
Jobs Leading Research Computing Teams
This week’s new-listing highlights are below; the full listing of 142 jobs is, as ever, available on the job board.
Technical Project Lead (Open Targets) - EMBL-EBI, Hinxton UK
Open Targets is an innovative, large-scale, multi-year, public-private partnership that uses human genetics and genomics data for systematic drug target identification and prioritisation. As a key member of the Open Targets Core team, you will design and build cloud-first software tailored for the interface between large-scale biomedical data and drug discovery. Together with the other Open Targets partners (BMS, GSK, Sanofi and Wellcome Sanger Institute) you will lead the technical strategy of current and future informatics tools designed to support the identification and prioritisation of drug targets. Using cutting-edge technologies and leveraging the expertise of our product owners and industry stakeholders, you will work in a dynamic, multidisciplinary, international environment to tackle a wide range of algorithmic and technical challenges.
Team Leader - Data and Statistics Division - Bank of England, London UK
The Data and Statistics Division (DSD), part of the Data and Analytics Transformation Directorate (DAT), is the central data division with responsibilities that span production and management of some of the Bank’s core datasets to aspects of data management and data culture that reach into the working lives of almost all Bank staff. This is a people management role, you will be responsible for day to day management of the team and for defining roles and responsibilities. You will develop your team through sharing information effectively, mentoring and communicating a clear vision to ensure that your staff are productive, engaged and motivated to deliver the area’s aims.
Senior Research Software Engineer - Imperial College London, London UK
This role presents an exciting opportunity to join the growing community of Research Software Engineers (RSE) at Imperial College via its core team within the Research Computing Service (RCS). The RCS encompasses a dedicated team of RSEs, a managed High-Performance Computing (HPC) facility; and provides a range of training for the HPC users and researchers. Imperial College London is ranked in the top ten universities globally and is home to the greatest concentration of high-impact research of any major UK university. You will actively participate in research by providing advice on the application of technologies and delivering software development projects. You will contribute by developing innovative software, promoting good software engineering that ultimately accelerates research and by mentoring less experienced developers. You will work under general direction with a clear framework of accountability, while exercising substantial personal responsibility and autonomy.
Senior Scientist Research Computing (Biomedical and Clinical Informatics) - Rutgers University, New Brunswick NJ USA
Rutgers, the State University of New Jersey is seeking a Senior Scientist Research Computing (Biomedical and Clinical Informatics) for the Office of Advanced Research Computing (OARC). Participate in development of new biomedical and clinical informatics research and support model, potentially leading to creation of a core facility and/or center of excellence.
Product Manager - Data Science - Data Cloud - Veeva Systems, Remote or Toronto ON CA
Veeva is the leader in cloud-based software for the global life sciences industry. We are the first public company to become a Public Benefit Corporation. You are excited about the productization of Data Science and building statistics at a scale that can run with quality across billions of records, delivering data sets directly to customers. You’ll work on the productization of statistical models and approaches, such as sample curation, projection methodologies, anomaly detection, scaling approaches, clustering, and more. This also will include designing features, writing detailed product specifications, and working with the Data Science and Development team to bring designs to fruition.
Software Engineering Manager - Overleaf, Remote UK or EU or CA or US
We are looking to hire an engineering manager who will be responsible for line management of a group of 5-8 engineers at different levels. This role will include some technical work such as code review and writing technical proposals. As an Engineering Manager and part of the engineering team at Overleaf, you will be helping to make Overleaf the go-to place for scientific writing by both inspiring your engineers to do their best work, and making your own contributions directly to the platform.
Software Engineering Manager - Kepler Communications, Remote US or UK or CA
Kepler is on a Mission to bring the internet to space. Incorporated in 2015, Kepler’s guiding star is to enable the space economy through the creation of a communication network in Low Earth Orbit (LEO) that will provide connectivity services to other space missions, be they on orbit in LEO, MEO, GEO, or beyond. Kepler is looking for a dynamic leader and great team player who enjoys technical challenges in a fast-paced environment, applies sound judgement in successful planning and execution, meets commitments, and communicates effectively with all stakeholders. As a Software Engineering Manager at Kepler, you will be responsible for leading a world-class team of software developers and engineers.
Senior Cloud Engineer - Harvard Medical School, Boston MA USA
The Research Computing group in the Harvard Medical School Department of Information Technology is seeking a talented Senior Cloud Engineer with a strong background, and preferably certification in Amazon Web Services (AWS), focusing on managing environments that must meet federal security standards. The Senior Cloud Engineer will be responsible for supporting the HMS IT Secure Research Computing Environment (SRCE), a FISMA certified AWS cloud infrastructure environment, enabling the groundbreaking biomedical research of some of the world’s foremost scientific investigators. Mentor junior staff.
Principal Data Engineer - Harvard Medical School, Boston MA USA
We seek a highly motivated, collaborative individual with excellent communication skills to join our team of technologists and scientists as a Principal Data Engineer. You will help build out the necessary data warehousing infrastructure to support the downstream machine learning analysis and integration of large, complex data sets that will provide a nuanced longitudinal perspective on population- and individual-level health outcomes and disease trajectories. These data sets include healthcare insurance claims, electronic health records, genomics, environmental exposure, and other data modalities. The integration of these data will allow our research teams to make ground-breaking advances in the areas of precision medicine, healthcare AI, healthcare policy/economics, and basic science, all with the goal of improving patient outcomes.
Executive Director, Academic Technologies, Innovation & Research Computing - Georgia Tech, Atlanta GA USA
The Executive Director position for Academic Tech, Innovation and Research Computing serves to support a strategic focus on the research and academic technologies of the Institute by partnering with research and academic entities, and leading and executing the aspects of IT strategy that enable the Institute to achieve its research, teaching and learning, and innovation goals. Responsible for establishing group/departmental/division goals, determining the resources needed to meet those goals, assessing group/departmental/division performance feedback, and making pay decisions. Formulate technical strategy that is responsive to the needs of the Institute and demonstrates a forward leaning view of technology having direct impact on and responsibility for, research funding and general institutional funds. Plan, guide and create a long-range vision and develop it into an executable strategic research, digital learning, and enterprise innovation program for the Institute cyber-infrastructure strategic execution.