Research Computing Teams #99, 5 Nov 2021
Hi, research computing team managers and leaders:
In our team there have been a lot of passages lately - a paper on our original work is (finally!) coming out, just as our new version is (finally!) coming together; we're gearing up for a new batch of co-ops as our current co-ops are starting to document their work and getting ready to present it; a project manager is joining the team for the first time now that the effort has reached a size and scope that needs one (well, it needed one a year ago, but here we are).
These passages - and especially the influx of new people, new tasks, new scope - are really important for a team's well being. Stasis isn't stable; systems, including systems of people, are either growing or stagnating.
In academia it's sometimes far too easy for groups to become very comfortable with "the way we do things" and set in their ways. As Boulanger points out in the first article in the roundup, that can quickly lead to problems not being addressed - or even really noticed any more - and eventually to people both within the team and "clients" of the team starting to drift away. In fact, I was talking to a colleague this week about one group's services becoming ossified to the point where consumers of those services started moving to those of a different, newer group - the first group didn't take feedback or feature requests seriously, and now there's a real chance it will simply be disbanded (or, maybe worse, left to go on indefinitely with less and less actual purpose).
Stagnation isn't inevitable, but it takes active and continual effort to avoid. In research, and technology, and certainly at the intersection of research and technology, it should be easy - there's a constant influx of new ideas available. But ideas don't adopt themselves, and practices don't adapt themselves. They're adopted and adapted in units of teams, and one of our important jobs as managers and leads is balancing real needs for degrees of stability and certainty with equally real needs for change and growth. Balancing openness with continuity is hard, but vital.
And now, the roundup:
Managing Teams
Why The Status Quo Is So Hard To Change In Engineering Teams - Antoine Boulanger
Boulanger here points out a situation that is especially common in academia, with slow-growing teams where individual team members have long tenure. The issue is that a team gets so used to the way things are that they don't even see it any more, and forget that things don't have to be this way. There can be a sort of learned helplessness about the procedural, technical, and complexity problems within an organization.
Having new people come in regularly - even short term team members like interns - can be very helpful for this, as long as they are comfortable making comments like “why is X? Isn’t that bad?” and the team takes the points they raise seriously.
Boulanger has recommendations for us managers or leads:
- Regularly put yourself in your team’s shoes
- Notice when people stop complaining about an issue - this can be a negative, not a positive
- Create some metrics around known issues so you can see if they’re getting better or if the team is just getting more inured to them
There are a lot of strengths that can come from a long-lived stable team, if you’re careful, but the default outcome is stagnation. The manager, and the team, has to be constantly and actively looking for things to improve and areas in which to grow to prevent the default.
You can be directive without being a jerk - Lara Hogan
Being Nice and Effective - Subbu Allamaraj
I think one of the hardest things for new managers - especially those coming from the very hands-off, collegial culture of research - is determining the amount of directiveness appropriate for a given situation. The usual failure modes, in order of the frequency with which I see them, are the very common laissez-faire absence of direction and the less common tech-lead-becomes-manager “do this, this, then that, and my way of doing it is exactly like this. In fact, why don’t I just…”
Hogan’s article is a followup to an earlier one on fixing a team that’s going in circles, so the main topic here is setting direction for an entire team at once. But the approach works for a particular team member, too - being specific about whose job is what, focussing on the important thing (the team’s work) rather than on individuals, and firmly but kindly applying direction at the level needed - whether that’s on tasks or goals or somewhere in between.
Allamaraj points out that being effective doesn’t necessarily mean not being nice, and being “nice” isn’t necessarily an end in itself anyway; we want to be kind, and sometimes “nice” just means being inoffensive. Letting someone trudge aimlessly in circles while smiling and not saying anything directive may look from a distance like being ‘nice’, but it’s certainly not kind.
Managing Your Own Career
Stop Looking For Mentors - Stay SaaSy
We could all use a bit more mentorship, but searching for A Mentor may make it harder to get the input we need. This article suggests making it easier on yourself:
Instead of looking for a mentor, just find somebody who can answer some questions you have. Then, if you think they can answer some more, ask them again. In reality, a mentor is mostly just somebody that answers questions more than once. That’s it. It’s not cinematic.
Product Management and Working with Research Communities
Using Amazon Service Workbench for Remote Training - Ann Gledson, Danielle Owen, Anthony Evans, and Peter Crowther, Manchester Research IT blog
So AWS Service Workbench is a free offering I hadn’t heard of before that lets you do what you might previously have done with CloudFormation or a bunch of home-grown scripts - spin up individual environments for researchers or (in this case) for a training course - but with a nice self-service UI that lets IT staff approve requests. (“Free” in the sense of no extra cost - don’t worry, they still charge you for the resources being used!)
We’ve all tried having students use their own laptops and requiring them to pre-install packages, and know how challenging that is. The Manchester RIT team used Service Workbench to support an all-virtual Python course that would normally be run in a computer lab they control. Interestingly, the blog reports feedback from the course instructor, from an RIT team developing a service to handle restricted data, from a TA for the course, and from some of the participants.
In this case, the RIT staff liked the control, the instructor and TA liked how they could get started teaching the material right away, and the participants seemed happy.
The downside of this approach, of course, is that if the students are to continue using the material on their own, they still have to go through the install process eventually - but certainly the teaching is easier.
And of course this sort of tooling could be made available for on-prem systems, but in practice it never is; cloud providers have an incentive to make their systems as easy to use for these kinds of use cases as possible, because it means more revenue, while typically fixed on-prem systems generally have different (frankly, the opposite) incentives.
Research Software Development
Bring Legacy Code under tests by handling global variables - Nicolas Carlo
When trying to implement component tests for legacy code with global variables, Carlo has a simple suggestion - don’t overthink it, just pass the global variables in as parameters. It may look ugly, but it’s not new ugliness, it’s just revealing existing ugliness; and that’s the first, necessary step in defining refactoring plans.
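A minimal sketch of the idea in Python (the names and the tax-rate global are hypothetical, just for illustration):

```python
# Before: a function that silently depends on module-level global state.
TAX_RATE = 0.13  # hypothetical global the legacy code reads

def total_legacy(prices):
    # Hard to test in isolation: behaviour depends on hidden global state.
    return sum(prices) * (1 + TAX_RATE)

# After: pass the global in as a parameter, defaulting to the old global
# so that existing callers keep working unchanged.
def total(prices, tax_rate=None):
    if tax_rate is None:
        tax_rate = TAX_RATE  # preserve legacy behaviour by default
    return sum(prices) * (1 + tax_rate)

# The function is now testable without touching global state:
assert total([10.0, 20.0], tax_rate=0.0) == 30.0
```

The default-to-the-global trick is what makes this a safe first step: the test suite can inject values, while production code paths are untouched until you're ready to refactor further.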
Well-researched advice on software team productivity - Ari-Pekka Koponen, Swarmia
Management is hard; management of something as complex and ambiguous as software development is especially hard; but that doesn’t mean we don’t know anything. There has been a lot of research on what makes teams work well, recently particularly in the area of software development. That doesn’t mean there are cookie-cutter solutions for anything, but we do have good guidelines. Koponen walks us through several well-supported (and in some cases ongoing) reports, many of which RCT readers will already know about:
- Project Aristotle (#46) - The follow-up to Google’s Project Oxygen, going beyond the effect of single managers to team behaviours. They found that five factors were very strongly correlated with high-performing teams:
- Psychological safety - team members feeling comfortable speaking up
- Dependability of other team members
- Structure and clarity
- Work having meaning, and
- Work having impact
- DevOps Research and Assessment (DORA) metrics - These are aimed at those developing and operating software systems, but many of them are relevant to those writing and releasing software. They find that teams that are performing well have:
- High deployment frequency - it’s easy to update the software and keep it working
- Low Mean Lead Time for Changes
- Low Change Failure Rate - few changes need to be rolled back
- Low Time to Recovery for the times that errors do occur
- SPACE (#66) found that focussing on these five areas was important for self-reported high-performing teams:
- Satisfaction and well-being
- Performance
- Activity
- Communication & Collaboration
- Efficiency & Flow
And most importantly, Retrospectives - learning and adapting practices based on what is actually happening on your team - let you tune all of the above.
Embedded malware in NPM package coa - GitHub advisory, RW Overdijk
Another reminder of how vulnerable software supply chains are - coa (command-option-argument, a command line argument parser), used in some 200 other packages and a gazillion repositories, had malicious releases with malware inserted:
The npm package coa had versions published with malicious code. Users of affected versions (2.0.3 and above) should downgrade to 2.0.2 as soon as possible and check their systems for suspicious activity. See this issue for details as they unfold. Any computer that has this package installed or running should be considered fully compromised. All secrets and keys stored on that computer should be rotated immediately from a different computer. The package should be removed, but as full control of the computer may have been given to an outside entity, there is no guarantee that removing the package will remove all malicious software resulting from installing it.
The good news is that, if I’m understanding what’s been happening, it seems to have been spotted quickly.
Research Data Management and Analysis
Choosing good chunk sizes in Dask - Genevieve Buckley
As with any kind of parallel or distributed computing, choosing the granularity over which to calculate is complicated. Too small, and you end up spending too much time on coordination and communication and too little on computation; too large, and you have too little flexibility in scheduling, or can even run out of memory. In simulation it’s usually pretty clear what granularity to run over; for data analysis, which is normally a lot less computationally intensive, it’s often less so.
In this article Buckley gives some rough rules of thumb:
- Rely on previous single-node prototypes for guidance
- Chunk sizes below 1MB are almost always bad
- Avoid more than 10k or 100k chunks
- Have at least as many chunks as worker cores, preferably significantly more
- Have each chunk take at least a few seconds to process
and shows how the Dask dashboard can help provide some guidance.
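The rules of thumb above are easy to sanity-check on the back of an envelope before launching anything - here's a rough sketch in plain Python (no Dask required; the thresholds are just the heuristics above, not hard limits, and the function name is my own):

```python
import math

def check_chunking(shape, chunk_shape, itemsize, n_worker_cores):
    """Sanity-check a proposed array chunking against rough rules of thumb."""
    chunk_bytes = itemsize * math.prod(chunk_shape)
    # Number of chunks along each axis, rounding up for partial chunks.
    n_chunks = math.prod(
        math.ceil(s / c) for s, c in zip(shape, chunk_shape)
    )
    warnings = []
    if chunk_bytes < 1_000_000:
        warnings.append("chunks under ~1 MB are almost always too small")
    if n_chunks > 100_000:
        warnings.append("more than ~100k chunks means heavy scheduler overhead")
    if n_chunks < n_worker_cores:
        warnings.append("fewer chunks than worker cores leaves cores idle")
    return n_chunks, chunk_bytes, warnings

# e.g. a 100,000 x 100,000 float64 array in 5,000 x 5,000 chunks on 16 cores:
n, nbytes, warns = check_chunking((100_000, 100_000), (5_000, 5_000), 8, 16)
# 400 chunks of 200 MB each - no warnings
```

In real use you'd let the Dask dashboard tell you how the chunking is actually behaving, as the article shows; this kind of arithmetic just catches the obviously bad choices up front.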
Research Computing Systems
Scaling a read-intensive, low-latency file system to 10M+ IOPs - Randy Seamans, AWS HPC Blog
This is an AWS blog post, but it’s relevant more broadly - it’s a pretty direct use of NVMe-oF (NVMe over Fabrics).
Here Seamans describes a very high-speed, read-nearly-only filesystem: a Gluster file system is replicated onto multiple instances with very fast NVMe drives, and the NVMes are then exposed read-only over NVMe-oF to provide extremely fast read access to files - for use cases like a large number of nodes doing a read-intensive analysis of a directory full of data.
The yearly backup restore test - Remy van Elst
Backups are useless; restores are invaluable. van Elst walks us through his personal annual backup restore test, marked on his calendar, including file integrity checks:
Have you done your backup restore test recently? An untested / unverified backup is the same as no backup, so doing a restore test is a major part in your backup scheme.
Emerging Technologies and Practices
Five-P factors for root cause analysis - Lydia Leong
Rather than “root cause analysis” or the “five whys”, both of which have long since fallen out of favour in areas that take incident analysis seriously, like aerospace or health care, Leong suggests that we look at Macneil’s Five P factors from medicine:
- Presenting problem
- Precipitating factors - what combination of things triggered the incident?
- Perpetuating factors - what things kept the incident going, made it worse, or harder to handle?
- Predisposing factors - what long-standing things made a bad outcome more likely?
- Protective factors - what helped limit impact and scope?
- Present factors - what other factors were relevant to the outcome?
Running 20k simulations in 3 days to accelerate early stage drug discovery with AWS Batch - Christian Kniep
Following up on earlier GROMACS benchmarking posts, in this post Kniep describes their final use case - running a large suite of simulations for early-stage drug discovery. By choosing their instance types based on the previous work, they could tune turnaround time and cost, and by using Spot Instances and Batch they could fan 20k simulations out over multiple regions relatively straightforwardly:
For our binding affinity study, we completed 20,000 jobs over the course of three days. By using benchmarks and choosing optimal Spot Instances, we were able to achieve a cost as low as $16 per free energy difference (∆∆G value). As we chose to broaden the set of instances for a shorter time-to-solution, we achieved an average of $40/∆∆G value. With AWS Batch, we were able to create pools of resources in different AWS Regions around the globe and handle orchestration within the region. By the end of this, it was clear that we could achieve both a really fast wall-clock time (and hence time-to-result) as well as a low overall cost.
Calls for Submissions
International Super Computing (ISC22) - 29 May - 2 June, Hamburg, Papers due 29 Nov
SC isn’t even here yet and papers for ISC are coming due. ISC of course covers almost everything in HPC:
- Architectures, Networks, & Storage
- HPC Algorithms & Applications
- Programming Environments & Systems Software
- Machine Learning, AI, & Emerging Technologies
- Performance Modeling, Evaluation, & Analysis
EUROSIS Industrial Simulation Conference 2022 (ISC 22) - 1-3 June Dublin, Papers due 21 February
The aim of the conference “is to give a complete overview of this year's industrial simulation related research and to provide an annual status report on present day industrial simulation research within the European Community and the rest of the world in line with European industrial research projects.” Tracks include:
- Discrete Event Simulation Methodology, Languages and Tools
- Artificial Intelligence, IoT and VR Graphics Applied to Industry
- Complex Systems Modelling
- Simulation in Robotics
- Simulation in Engineering
- Simulation in Collaborative Engineering
- Simulation in Manufacturing
- Simulation in Logistics and Traffic
- Datamining Business Processes, Geosimulation and Big Data
- Simulation in Economics and Business
- Simulation in Economic and Risk Management
- Simulation in Automotive Systems
- Simulation in the Power industry
51st International Conference on Parallel Processing - 29 Aug-1 Sept, Bordeaux, Workshop proposals due 28 Nov, Papers due 14 Apr.
ICPP 2022 is interested in “the latest research on all aspects of parallel processing”. Topics of interest include algorithms, applications, architecture, performance, software, and multidisciplinary work.
Events: Conferences, Training
Oak Ridge National Center for Computational Sciences Virtual Career Fair - 11 Nov
Four hours of talks, tours, and career tables staffed by 11 teams that are hiring.
Random
If you’ve wanted to start messing around with functional languages, OCaml is a reasonable, pragmatic choice that does get used in the wild. Here’s a getting-started-with-the-tooling guide to OCaml.
Or you could use a lisp that fits entirely into 512 bytes.
Use bash functions! They make bash scripts less crummy!
The nice thing about the internet is you can find nice resources about incredibly obscure things. Want a good annotated bibliography and sample code for tree edit distances, the minimal number of edits you can make to transform one tree to another? Great news!
A complete embedded USB stack in Ada.
Learn how X window managers work by writing one.
The anatomy of a terminal emulator.
Lovely explanation of Bezier curves, splines, and smooth surfaces.
Fascinating look at the data infrastructure around python environments that have grown over time in some banks: Bank Python.
Unicode attacks on source code.
Debugging stories are always good! Here’s one deep within the bowels of the Linux TCP stack.
Causing data leaks via maliciously-crafted log messages.
That’s it…
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, just not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
Jobs Leading Research Computing Teams
This week’s new-listing highlights are below; the full listing of 142 jobs is, as ever, available on the job board.
Data Access Project Manager - University of Edinburgh, Edinburgh UK
We are looking for an experienced project manager to join the Outbreak Data Analysis Platform (ODAP) team and manage the data governance, particularly in support of data access requests for COVID-19 data from various studies and institutions across the UK - including from the ISARIC4C; PHOSP-COVID and GenOMICC studies (see: https://isaric4c.net/analysis-platform/ ).
Senior Data Scientist - CircleCI, Remote US or CA
You will be a member of CircleCI's 10-member Analytics and Data Science team. You will report to the team’s director. The team is part of CircleCI’s product organization. The team is focused on improving our product experience, enabling data-driven product management, and improving company-wide data practices. You will partner with product managers and leaders. You will collaborate with Data Engineering, Analytics Engineering, Design, and other stakeholders.
Associate Director, Digital Innovation Lab - Swinburne University of Technology, Melbourne AU
Reporting to the Head Digital Innovation Lab with a dotted reporting line to Dean of School of Science, Computing and Engineering Technologies you will play a key role in developing and fostering opportunities for collaboration and partnerships within the university disciplines/departments, research centres and institutes, and with external organisations, including government and non-government organizations and industries.
Team Leader, HPC Performance - RIKEN Center for Computational Science, Kobe JP
R-CCS will set up a new research team in FY 2022 in order to promote further advances in and expand the use of supercomputers, including Fugaku, and to conduct research and development into next-generation High Performance Computing (HPC) systems. As the Team Leader of a new team at R-CCS, the successful candidate will be responsible for leading the team in conducting research and development on the following prospective research topics, with collaboration occurring between computer sciences and computational sciences. In particular, the candidate will work with team members to develop research proposals, carry out research and development, and promote the team’s research achievements in the context of developing HPC systems and applications.
Senior Data Scientist/Manager - University of Oxford, Oxford UK
The postholder will work at the interface between epidemiology, medical statistics, data science and clinical medicine. They will enable high impact original quantitative research by facilitating access to very large primary care databases of electronic health records linked to hospital, mortality and cancer registry data (QResearch database https://www.qresearch.org/ )
Responsibilities will include working together with researchers to define projects, capture data requirements, manage, transform and curate the electronic health record in readiness for large scale data analytics and modelling. In addition, the post-holder will generate written documentation such as reports to funders and internal policies/guidance documents. The postholder will work closely with a wider team of programmers, clinicians, clinical researchers and administrator in multiple departments within the Medical Sciences Division.
Head of Research Software Engineering - University of Glasgow, Glasgow UK
This grade 10 role is a strategic position to found and lead a Research Software Engineering (RSE) group for the College of Medical, Veterinary and Life Sciences (MVLS) at the University of Glasgow. MVLS, with over 2,500 staff, is the university’s largest college, and its RSE group will be the university’s first. Research Software Engineering will support cutting-edge research, and its impact, by developing professionally usable software tools and applying these to address computational and data challenges across the spectrum of MVLS research, impact and consultancy goals. The Head of Research Software Engineering (HoRSE) will be responsible for partnering with senior leaders within the College and the wider University to define the strategic vision for RSE and develop appropriate delivery structures and key delivery requirements.
Research Scientist/Engineer - Senior Principal (Data-intensive Astronomical Research) - University of Washington, Seattle WA USA
The Legacy Survey of Space and Time (LSST), which will be carried out by the Vera C. Rubin Observatory, is the flagship ground-based astronomical survey of the 2020s. We are looking for a seasoned and experienced Senior Manager to build and lead a distributed team of software engineers to design and build a cloud-based analysis framework that can store, search, analyze and annotate data of the volume and complexity of the LSST data. This framework will provide an interface for the astronomical community to run real time and batch analyses to, for example, search for one-in-a-million events in continuous streams of data. As the Senior Manager, you will be responsible for building and leading full-stack engineering teams at the University of Washington and Carnegie Mellon University. You will set the vision and culture for the team, and work with research astronomers to define the priorities and scope of the software infrastructure. You will be a key senior decision-maker, responsible for allocating resources and facilitating work in order to achieve deadlines.
Senior Product Manager - Anaconda, remote US
Anaconda is seeking a talented Product Manager to join our rapidly-growing company. This is an excellent opportunity for you to leverage your experience and skills and apply it to the world of data science and machine learning.
Senior Platform Engineer - Anaconda, remote US
Anaconda is seeking a talented Senior Platform Engineer to join our rapidly-growing company. This is an excellent opportunity for you to leverage your experience and skills and apply it to the world of data science and machine learning. You design solutions for large sized complexity problems in such a way that is simple and easy to understand by others inside and outside the department. You will directly and indirectly mentor others on the team to ensure that the team and department are always moving forward. You create and implement new processes that add to the team and/or department’s success. You will consistently increase productivity skills by constantly improving knowledge of core infrastructure and tooling, as well as testing best practices. You consistently have the ‘bigger’ picture understanding of business and company goals.
Product Manager - Data - The University of Auckland, Auckland NZ
NeSI is looking for a Product Manager to provide expertise and leadership to support product and service design and development. This role focuses specifically on NeSI’s data services in support of researchers, and on data describing the value of the research we support and hence our impact as an investment.
R&D FLARE Research Director in Quantum Computing and Quantum Communication - JPMorgan Chase & Co, New York City NY USA
JPMorgan Chase's Future Lab for Applied Research and Engineering (FLARE) conducts research in cutting-edge emerging technologies, specifically Quantum Computing and Quantum Communication. Our firm is seeking a passionate technologist to contribute to applied research and development of novel solutions in these fields. As a member of the FLARE R&D team, you will work closely with internal project teams, other researchers, and external communities alike to advance these research focus areas in a way that is beneficial to the technical teams in the company. You will also contribute to the firm's intellectual property by patent applications and scholarly articles.
Faculty and Research Support Manager - University of Texas Rio Grande Valley, Edinburg TX USA
Manages the day-to-day operation, security, system administration and maintenance of high-performance computing (HPC) systems to accomplish the research objectives of faculty and students at the University. Provides user support, documentation, and consulting services to ensure the effective use of research computation resources for faculty members and students.
Senior Applied Science Manager - Amazon Alexa, Cambridge UK
We collaborate extensively both within Amazon and external academic partnerships to accelerate deep learning research and invent new methods that go beyond state of the art to bring new innovations to our customers. To give an idea of our work, recent contributions from the team include the FEVER challenges https://fever.ai/, Deep Semantic Parsing https://arxiv.org/abs/2001.11458, and end to end neural data-to-text generation https://arxiv.org/abs/2004.06577.