Hi all -
The latest newsletter from the folks at Raw Signal Group points out that while a lot of bosses are managing to keep the big things going, a lot of us are dropping balls these days, especially on the little stuff. There’s a lot to handle.
When I get busy I tend to lose discipline at exactly those things that keep me focussed and productive - prioritization, todo lists, staying on task - and that leads to dropped balls. In the last week or so I’ve managed to claw my way back up from that and have got things going pretty smoothly again - but it took a while. So I hope you and your team are doing ok and that you’re spending the time you need on keeping yourself well.
With that, on to the link roundup:
What is expected of a Engineering Manager? - Rodrigo Flores
Flores, a manager of managers, has a nice article on what he expects of his managers. In priority order:
Note the order.
A recurring bit of advice from people who have been good managers is to focus on managerial activities that scale, that have high leverage. Yes, delivering what you said you’d deliver is important. If you roll up your sleeves and spend 20 hours doing that piece of the project that only you know how to do to get the next release finished in time, that’s good and all but there’s no leverage there.
If instead you spend say 40 hours working with one of your team members to get it done with them and teach them how to do it, well, you’re a little late on this deliverable, but your team is now stronger because someone else now knows how to do The Thing - every future deliverable will be a bit easier.
A job posting from Lever - Lever.co
So I skim a lot of research-computing-and-data type jobs for the newsletter, and I often see things we could all be doing better from the best of these postings. Jobs at Lever aren’t really suitable for the job board, but look at this job listing - it includes detailed 1 month/3 month/6 month/12 month expectations.
Now sure, in our line of work looking 12 months into the future might be a bit much, but how many groups in research computing have even thought the job through in this much detail, much less had the transparency to post it in the job ad? (Answer - a few, but not nearly enough) And how can we possibly make intelligent hiring decisions if we don’t even have clear, written out expectations we need them to meet?
Leading People With Experience - Mark Wood
A common issue - one in retrospect I’m sort of surprised didn’t come up in the recent “Ask Managers Anything” round - is what to do when you’re placed in a position of managing, or even being team lead for, someone with more experience in the job than you have.
This happens all the time, and it especially worries new managers or team leaders who haven’t quite understood the role yet - in a software developer team they might think of the manager as “top programmer” who makes the technical decisions, in which case managing someone who is an even better programmer (say) seems impossible. But that’s not what managing research computing teams is about! It’s a bit closer to the truth for team lead positions but even there that’s not the goal.
Wood has four areas of advice
4 Tips for Addressing—and Even Embracing—Tension - Holly Bybee & Lauren Yarmuth
Tension between ideas is good and healthy and normal in technical teams, and Bybee & Yarmuth talk about how to address it and harness it constructively.
No More Misunderstandings Paraphrasing - When, Why, and How - Padmini Pyapali
A reminder of a powerful technique for communicating with anyone - your boss, your team, or stakeholders. Double check understanding, constantly, by asking questions about what the other person has said, paraphrased in your own words. Paraphrasing is a great approach but the more fundamental method here is being aware that you may not have understood what was meant by the other person.
What Is & How To Create A Risk Management Plan - Emily Luijbregts, Digital Project Manager
Most of us in research probably started off on fairly small and well-scoped projects where risk management wasn’t a significant factor; or the risks were well-understood enough that everyone knew how to deal with them so there wasn’t a need to separately track and manage the risks. If you get involved with larger and longer projects, though, especially ones that bring together a number of skills, it may be useful to explicitly keep track of risks - it might even be a funder mandate.
Luijbregts does a good job of demystifying terms like Risk Registers, RAID logs, and related concepts in this article, and provides templates as well. Like so much of management, dealing with risk is not rocket surgery; it’s just a matter of thinking things through systematically and applying some structure.
The first step is just to create a document where you’ll keep track of things that could risk your project’s success - a risk register. You can do this with your team; “pre-mortems” like we discussed in #43 are a great way to do that. Then prioritize them and come up with ways to mitigate the risk, or how to proceed if the risky event does happen.
For large enough projects there might be some pseudo-quantitative ways to prioritize the risks or different ways you might find helpful to categorize the risks, but the basics are there above. Think things through, write them down where people can see them, revisit periodically; that’s the heart of risk (or project, or people) management.
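As a rough illustration of the "write them down, prioritize, revisit" loop, here is a minimal sketch of a risk register in Python. The 1-5 scales, the likelihood-times-impact score, and the example risks are all assumptions made up for illustration; they are not taken from Luijbregts' templates.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    description: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    mitigation: str = ""   # what we do now to make it less likely or less bad
    contingency: str = ""  # what we do if it happens anyway

    @property
    def score(self) -> int:
        # A common pseudo-quantitative priority: likelihood times impact.
        return self.likelihood * self.impact


def prioritize(register):
    """Return the risks sorted from highest to lowest score."""
    return sorted(register, key=lambda risk: risk.score, reverse=True)


# Hypothetical entries, e.g. the output of a team pre-mortem.
register = [
    Risk("Storage allocation runs out before the final runs", 3, 3,
         mitigation="Review usage monthly",
         contingency="Request an emergency allocation"),
    Risk("Key developer leaves mid-project", 2, 5,
         mitigation="Pair on critical components",
         contingency="Re-scope the next release"),
    Risk("Upstream library drops support for our platform", 1, 4,
         mitigation="Pin versions and track release notes",
         contingency="Budget time for a port"),
]

for risk in prioritize(register):
    print(f"{risk.score:>2}  {risk.description}")
```

A shared spreadsheet does the same job; the point is only that each risk gets a priority, a mitigation, and a fallback, and that the list is somewhere the team can see and revisit it.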
How to prepare for the coming CPU confusion - Mike Roberts
I’m old enough to have started my career well before the current x86 hardware monoculture, and the pendulum is swinging back. In the next year or two “works on x86 and ARM(s)” will be a hard requirement for a lot of research software, and once that happens it’s not hard to imagine there’ll be others too (RISC-V? Maybe POWER makes deeper inroads for some use cases?).
Enough people have done enough work that porting high-level code to ARM is mostly pretty boring now, but it’s still going to have to be routinely tested; and if you’re writing low-level libraries you may have a little more excitement in life. (Still though, for the foreseeable future no plausible number of common CPUs is going to make testing backend or command line code as challenging as it is for those poor souls doing front-end or mobile development).
Roberts talks about how to prepare for this. While there are a lot of things you can do, if you don’t already have a good CI/CD setup for your software - ideally cloud based so you can make use of different types of hardware - this is probably a good time to start prioritizing getting one. Automation plus a breadth of hardware available will make the transition a lot easier.
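One small piece of that preparation is making test and build scripts aware of which architecture they are running on, so that CI logs and architecture-specific skips are explicit. A minimal sketch using Python's standard `platform` module follows; the alias table and the function names are hypothetical and would need extending for your own targets.

```python
import platform

# Hypothetical alias table: different operating systems report the same
# CPU family under different names (e.g. "AMD64" on Windows, "arm64" on macOS).
ARCH_ALIASES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "aarch64": "arm64",
    "arm64": "arm64",
    "ppc64le": "ppc64le",
    "riscv64": "riscv64",
}


def normalize_arch(machine: str) -> str:
    """Map a raw machine string to a canonical name, or 'unknown'."""
    return ARCH_ALIASES.get(machine.lower(), "unknown")


def current_arch() -> str:
    """Canonical architecture name for the machine running this code."""
    return normalize_arch(platform.machine())


print(f"Running tests on: {current_arch()}")
```

A test suite could record this value alongside its results, or use it to skip tests that are known to be x86-only, which makes gaps in ARM coverage visible rather than silent.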
Video for the SORSE event I was part of is up, and Trisovic’s talk pairs nicely with this article from Daniel Katz on software in repositories.
Trisovic’s talk points out that data in data repositories increasingly has to come with code to be meaningful, and so data repositories have to support this meaningfully, both with supported formats and with code-relevant metadata. The metadata has to at least include version information and the like - Trisovic tried the experiment of getting a lot of R and Python code stored in one data repository to run and there was no machine readable option other than trial and error to try all recent R/Python versions to get the code to execute successfully - and even then there was a pretty poor success rate. Partly for that reason, Trisovic recommends data repositories linking to specialized archival code repositories rather than trying to do both.
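To make the "version information" point concrete, here is a minimal sketch of the kind of machine-readable record a code deposit could carry, built with Python's standard library only. The field names and the package list are assumptions for illustration; this is not a schema from Trisovic's talk or from any particular repository.

```python
import json
import platform
from importlib import metadata


def environment_record(packages):
    """Collect interpreter and package versions as a plain dictionary."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence rather than failing; None means "not installed".
            versions[name] = None
    return {
        "language": "python",
        "language_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "packages": versions,
    }


# Hypothetical list of the packages an analysis script imports.
record = environment_record(["numpy", "pandas"])
print(json.dumps(record, indent=2))
```

Saving a file like this next to the code at deposit time would at least tell a future reader which interpreter and package versions to try first, instead of leaving them to trial and error.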
Katz’s article makes a related point that fits nicely into the argument about code repositories. Preservation (with sufficient documentation and metadata) is enough to make data available for reuse. But while we can archive snapshots of code indefinitely, and for a briefer time archive executables, preservation isn’t enough to keep code reusable:
In other words, we can preserve source code for reading, and we can preserve executable software for re-execution, but we cannot simply preserve software for reuse. For reuse, we need to sustain the software: to supply human effort that continually works on the software to adapt it as needed for changes in the environment and dependencies.
What you can do when code is really hard to review - Nicolas Carlo, Understand Legacy Code
One distinguishing feature of research software is that it’s often subtle. Subtlety combined with how often it is legacy code makes it difficult to follow, and makes changes doubly so. In this article Carlo describes some general principles for handling hard-to-review code changes, with the caveat that the hard to review changes are the ones that especially need review, both for QA purposes and for knowledge transfer:
Assorted thoughts on zig (and rust) - Jamie Brandon
Zig (docs, tutorial) is an emerging systems language in the C++/Rust space (rather than the higher level Go, or higher still Nim or Julia). I’m not sure I’d recommend it for anything yet, but it’s starting to gain enough adherents that it’s at least something you might want to be aware of. It’s getting to the point where people are starting to write Zig vs X posts, which are useful ways of getting insights into a language if you have some familiarity with the X in question.
Brandon has some experience in research computing, and in languages like Julia as well as Rust. His post is a laundry list of thoughts on Zig vs Rust which I won’t try to summarize here, except for the fact that for the work he is currently doing, he finds Zig much simpler (and it’s not about the lifetimes).
Limiting Work In Progress - Daniel Truemper
A trap research computing managers fall into fairly frequently (including me) is seeing the big picture, seeing all the things that need to get done, and trying to start them all at once. After all, we know about parallel computing; a wider pipeline can mean higher throughput, right? But human beings don’t work like that. You get more done by diligently limiting the amount of work in progress, which has the advantage that it requires prioritization.
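One standard way to see why this works is Little's Law from queueing theory: in a stable system, average time per item equals work in progress divided by throughput. The sketch below is only an illustration with made-up numbers, and it assumes the team's throughput stays fixed as more work is started, which is optimistic since context switching usually lowers it.

```python
def average_cycle_time(wip: float, throughput: float) -> float:
    """Little's Law: average time per item = items in progress / completion rate."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return wip / throughput


# Assumed throughput: the team finishes about 2 items per week.
TEAM_THROUGHPUT = 2.0

for wip in (4, 8, 12):
    weeks = average_cycle_time(wip, TEAM_THROUGHPUT)
    print(f"WIP={wip:>2}: about {weeks:.1f} weeks from start to finish per item")
```

Under these assumptions, tripling the number of things in flight triples how long each one takes to finish, without the team completing anything more per week.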
In issue #42 we covered Palen’s article on a better tar for large datasets; here he tests several new-ish compression tools, which use either faster compression methods or faster (including multicore) implementations of existing compression methods, to see how they compare. There’s a drop-in replacement for gzip (pigz), some new bzip2 implementations (lbzip2, mpibzip2), a couple of xz methods (xz, pixz), and lz4.
His findings on a large set of numerical research data (read the article for more):
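Results like these depend heavily on the data, so it can be worth running a quick comparison on a sample of your own. The sketch below uses only the single-threaded codecs in Python's standard library, not the multicore tools Palen tested (pigz, lbzip2, pixz, lz4), and the synthetic sample is a stand-in assumption for real research data.

```python
import bz2
import gzip
import lzma
import time

# Standard-library codecs only; these are single-threaded implementations.
CODECS = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "xz": lzma.compress,
}


def benchmark(data: bytes):
    """Return {codec name: (compression ratio, seconds taken)} for the data."""
    results = {}
    for name, compress in CODECS.items():
        start = time.perf_counter()
        compressed = compress(data)
        elapsed = time.perf_counter() - start
        results[name] = (len(data) / len(compressed), elapsed)
    return results


# Synthetic, highly repetitive stand-in; replace with bytes from a real file.
sample = b"0.12345 6.78901 2.34567\n" * 20000

for name, (ratio, seconds) in benchmark(sample).items():
    print(f"{name:>5}: ratio {ratio:8.1f}, {seconds:.3f} s")
```

Real numerical data will compress far less well than this repetitive sample, so the absolute ratios mean little; the relative time and size trade-off between codecs on your own files is the useful part.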
A List of Post-Mortems - Dan Luu
In research computing, when it comes to running systems we could be a lot closer to industry best practices than we are. We’ve talked about post-mortems more than once; here’s a list of postmortems from many companies collected by Luu. It’s nice to see that they don’t necessarily have to be long or complicated or intricate; like risk management, just simple documents for ongoing clarity can be a huge step forward.
What do we (not) know about RSE? - Dr. Anna-Lena Lamprecht, Dr. Michelle Barker, Dr. Carlos Martinez, SORSE talks, 28 Oct 8h00-11h00 UTC and 30 Oct 15h00-18h00 UTC
In this talk, presented twice to be available to audiences in multiple timezones, Lamprecht, Barker, and Martinez discuss the state of research software development professional tracks, and what is not yet known about such efforts.
Configuring Sphinx from scratch: making your own documentation and making your documentation your own - Sadie Bartholomew, SORSE talks, 3 Nov 15h00-16h00 UTC
In this hands-on demonstration Bartholomew walks through configuring Sphinx from scratch to create documentation for your project.
Research Data Alliance’s 16th Plenary Meeting - 9 - 12 Nov
The 16th annual plenary meeting is being held online, with the theme being the creator of knowledge ecosystems.
Eliot is a Python logging/tracing package that works particularly well for scientific code.
An opinionated take on alternatives to JSON for object serialization.
A nice detailed walkthrough of demand memory paging.
A nice (possibly non-portable) walkthrough on how C++ exceptions work, and how to catch them in C code.
Delta, a diff viewer that knows about syntax.
Sphinx is often used for documentation - ubiquitously in the Python community but also well outside. But a lot of people (me) don’t love rST. MyST is a Markdown++ that incorporates all reStructuredText capabilities as markdown extensions.
Locked up network where the only way to get in is a simple HTTP proxy? SSH in via websockets.
Every now and then I see kickstarters for open source software and try to figure out how much of research code maintenance could be crowdfunded.
JuliaMono - a monospaced code font with a lot of unicode support for e.g. greek letters for variable names, and with some modest ligature support (but not too much).
An argument in favour of rolling release OS distros instead of “big bang” new discrete releases. TL;DR - continuous small changes, while less predictable, both prevent things from becoming stale too quickly and reduce the chaos of everything breaking at once with discrete releases.
Sure, LD_PRELOAD is cool, but have you ever tried LD_AUDIT?
You’ve probably heard of the “You are not expected to understand this” comment in v6 UNIX. Well, part of the reason it was hard to understand was that it was wrong.
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Highlights below; full listing available on the job board.
Associate Director of Research Computing Systems Software - Harvard, Boston MA USA
Harvard is seeking an Associate Director of the Systems Software Group to drive the development and adoption of leading practices in site reliability engineering within FAS Research Computing (FASRC). This position will provide leadership and expertise in designing and implementing a suite of tools and services to support Operations Team in maintaining, deploying, and securing the vast server infrastructure across three datacenters. This position will join the management team in FAS Research Computing and report to the University Research Computing Officer.
Manager of Research IT Applications - The Jackson Laboratories, US Various
Reporting to the Director of Information and Research Technologies, the Research IT Manager - Cloud Systems will closely work with the research faculty and scientists across all campuses of JAX on projects of wide ranging complexity from a diverse array of biomedical disciplines such as cancer biology, aging, immunology, infectious diseases, metabolomics, neural disorders, muscular disorders, Biostatistics & Statistical Genetics, and stem cell & developmental biology. Research projects will involve the large-scale collection, processing, and integration of biological data, with emphases toward imaging, genomics, and metabolomics.
They will be responsible for the life cycle planning of Research IT hardware, software and cloud resources through selection, operation, refresh, migration, optimization, retirement and disposal. They will also be responsible for ensuring an exceptional customer experience for JAX’s researchers and faculty; for streamlined processes and stable operations; for managing vendors; for developing a culture of and skills in process improvement; for proper documentation; and for staff training, performance and development.
Note - below is one Oracle Cloud job but there are a half-dozen or so across Europe and North America on both the HPC and the Data Science side. I’ll just list this one here but if you’re potentially interested, look at other jobs on the same Taleo site:
HPC Program Manager - Oracle Cloud, WFH Canada or US
This is your opportunity to be a core part of the team that defines and delivers exciting new capabilities on OCI. The Oracle Cloud Infrastructure team is looking for Solutions Architects (SAs) possessing deep technical skills in systems architecture, cloud computing, and HPC. The ideal candidates will have excellent customer-facing skills and a passion for educating, training, designing and building cloud solutions for a diverse and challenging set of customer solutions.
As a Solutions Architect on the High Performance Computing Team you will continuously experiment on our platform by building, deploying and running applications across thousands of cores while aligning to customer use cases in order to develop best practices and reference architectures. This role will have the opportunity to help shape the next generation of cloud computing and influence the adoption and usage patterns among enterprise and hyper scale customers.
Development Manager-Data Analytics (Python/Node.js/Angular) - CaseWare, Toronto ON CA
[TLDR; We’re bringing data science to the audit. Come help us change the world.]
The rules for audit assurance are different in every part of the world. That’s why we’ve built a web platform that enables our global distributor network to develop and customize products that fit their individual market needs. CaseWare’s Data Analytics product line is laser-focused on modernizing how the audit works in accounting firms, corporations, and governments worldwide.
We’re looking for a remote Development Manager to manage up to 10 remote developers. This is a solid, innovative, group of people who are actively building game-changing data analytics technology. Remember, we’re building a platform for resellers - we’re the AWS of Cloud Audit (Yes, we use AWS).
Senior Technical Program Manager, HPC and Cloud – AI - Nvidia, Santa Clara CA USA
Lead cross-functional teams by integrating the customer, OEM/ODM, network & storage vendors, software partners, with NVIDIA, to deliver computing platform solutions for High Performance Computing and Artificial Intelligence. Define the strategy and tactics for successfully executing GPU computing server solutions and integration of complex software stacks, from evaluation/proof-of-concept to design/development/qualification/validation, through customer acceptance and start of service. Execute the plan you craft by: scoping the project, identifying and assembling the core support team, formally kicking off the project with customers, developing a joint project timeline, supporting the design-in effort by providing technical collateral/tools, leading engineering design reviews, ensuring the implementation of any customer-required features, driving resolution of any blocking issues/bugs, and ensuring timely customer acceptance and production deployment.