Sorry for the late issue. Between some stuff going on at my end and me still figuring out what a revamped newsletter will look like, this week’s newsletter is both a little late and a bit of a hybrid between where I’d like it to go and what it has been. (As always, suggestions for things you would like to see change or stay the same are welcome - just hit reply or email me at email@example.com)
The big news of the past week has been Apple’s new M1 CPU. Don’t worry, this newsletter is not going to degenerate into the cliched HPC/research computing blog writing solely and breathlessly about various new CPUs/network cards/SSDs and endlessly comparing speeds-and-feeds. The M1’s specs in and of themselves aren’t what’s interesting. Rather, the M1 is an example of how CPUs are going to get more different as time goes on, and that will have impacts on research computing teams. The M1 is going to be a trial run for a future of more diverse computing architectures that we’d do well to get ready for.
Large-scale research computing systems have all been about “co-design” for ages, but the truth is that big-picture CPU design choices have been pretty fixed, with most of “co-design” being about choice of accelerators or mix and match between CPU, memory, and acceleration. Now that the market has accepted ARM as a platform - and with RISC-V on its way - we can expect to start seeing bolder choices for CPU design being shipped, with vendors making very different tradeoffs than have been made in the past. So whether or not you see yourself using Apple hardware in the future, M1’s choices and their consequences are interesting.
M1 makes two substantially different trade-offs. The first is having DRAM on socket. This sacrifices extensibility - you can’t just add memory - for significantly better memory performance and lower power consumption. Accurately moving bits back and forth between chips takes a surprising amount of energy, and doing it fast takes a lot of power! The results are striking:
LINPACK - solving a set of linear equations - is a pretty flawed benchmark, but it’s widely understood. The performance numbers here are pretty healthy for a chip with four big cores, but the efficiency numbers are startling. They’re not unprecedented except for the context; these wouldn’t be surprising numbers for GPUs, which also have DRAM on socket and are similarly non-extensible. But they are absurdly high for something more general-purpose like a CPU.
Having unified on-socket memory shared between the CPU and integrated GPU also makes possible some great TensorFlow performance, makes compiling code both faster and less power-hungry, and does weirdly well at running PostgreSQL.
The second tradeoff has some more immediate effects for research computing teams. Apple, as is its wont, didn’t worry too much about backwards-looking compatibility, happily sacrificing that for future-looking capabilities. The new Rosetta (x86 emulation) seems to work seamlessly and is surprisingly performant. But if you want to take full advantage of the architecture of course you have to compile natively. And on the day of release, a lot of key tools and libraries didn’t just “automatically” work the way they seemed to when most people first started using other ARM chips. (Though that wasn’t magic either; the ecosystem had spent years slowly getting ready for adoption by the mainstream.)
“Freaking out” wouldn’t be too strong a way to describe the reaction in some corners; one user claimed that GATK would “never work” on Apple silicon (because a build script mistakenly assumed that an optional library that had Intel-specific optimizations would be present - they’re on it), and the absence of a free Fortran compiler on the day of hardware release worried other people (there are already experimental gfortran builds). Having come of computational science age in the 90s, when new chips took months to get good tooling, I found the depth of concern a bit overwrought.
This isn’t to dismiss the amount of work that’s needed to get software stacks working on new systems. Between other ARM systems and M1, a lot of research software teams are going to have to put in a lot of time porting new low-level libraries and tools to the new architectures. Many teams that haven’t had to worry about this sort of thing before are going to have to refactor architecture-specific optimizations out and into libraries. Some code will simply have to be rewritten - some R code has depended on Intel-specific NaN handling to implement NA semantics (which are similar to but different from NaN) that M1 does not honour, so natively compiled R needs extra checks on M1.
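To make the R example a little more concrete, here is a minimal Python sketch of the general trick, not R’s actual implementation. It assumes that R marks NA as a NaN carrying the payload 1954 in its low 32 bits; the helper names are made up for illustration. What arithmetic then does with that payload is up to the hardware, so the sketch makes no claim about that and only shows the kind of explicit check that portable code ends up needing.

```python
import math
import struct

# Assumption: R's numeric NA is a NaN whose low 32 bits hold the value 1954.
R_NA_BITS = 0x7FF00000000007A2
R_NA_PAYLOAD = 1954


def float_from_bits(bits: int) -> float:
    """Reinterpret a 64-bit integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def bits_from_float(value: float) -> int:
    """Reinterpret an IEEE 754 double as a 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def is_r_style_na(value: float) -> bool:
    """True only for NaNs whose low word carries the NA payload (hypothetical helper)."""
    if not math.isnan(value):
        return False
    return (bits_from_float(value) & 0xFFFFFFFF) == R_NA_PAYLOAD


na_value = float_from_bits(R_NA_BITS)  # a NaN that also means "missing"
plain_nan = float("nan")               # an ordinary NaN with no NA payload
```

Both values are NaN as far as the floating point unit is concerned; only the payload tells them apart, which is why code that relies on one vendor’s payload handling needs extra checks elsewhere.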
It’s also not to dismiss the complexity that people designing and running computing systems will have to face. Fifteen years ago, the constraints on a big computing system made things pretty clear - a whackload of x86 with some suitably fast (for your application) network; the main questions were how fat are the nodes and what’s the mix of low, medium, and high-memory nodes. It’s been more complex for a while with accelerators, and now with entirely different processor architectures in the mix, it will get harder. Increasingly, there is no “best” system; a system has to be tuned to favour some specific workloads. And that necessarily means disfavouring others, which centres have been loath to do.
So the point here isn’t M1. Is M1 a good choice for your research computing support needs? Almost certainly not if you run on clusters. And if you’re good with your laptop or desktop, well, then lots of processors will work well enough - but a lot of software is going to now have to support these new systems.
And CPUs will keep coming that make radically different tradeoffs than the choices that seemed obvious before. That’s going to make things harder for research software and research computing systems teams for a while. A lot of “all the world’s an x86” assumptions - some that are so ingrained they are currently hard to see - are going to get upended, and setting things back right is going to take work. The end result will be more flexible and capable code, build systems, and better-targeted systems, but it’ll take a lot of work to get there. If you haven’t already started using build and deployment workflows and processes that can handle supporting multiple architectures, now is a good time to start.
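As a small illustration of dropping one “all the world’s an x86” assumption, here is a hedged Python sketch of architecture-aware dispatch with a portable fallback. The alias table, the `KERNELS` registry and the function names are all hypothetical; a real project would more likely do this in its build system or in a library, but the shape is the same.

```python
import platform

# Different systems report the same architecture under different names.
ARCH_ALIASES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "arm64": "arm64",
    "aarch64": "arm64",
    "riscv64": "riscv64",
}


def normalize_arch(machine: str) -> str:
    """Map a reported machine name to a canonical one; unknown names pass through."""
    key = machine.strip().lower()
    return ARCH_ALIASES.get(key, key)


def dot_portable(xs, ys):
    """Portable reference implementation that works on any architecture."""
    return sum(x * y for x, y in zip(xs, ys))


# Hypothetical registry: canonical architecture -> specialised implementation.
# It is empty here, so every architecture uses the portable fallback.
KERNELS = {}


def select_kernel(machine=None):
    """Pick a specialised kernel if one is registered, else the portable one."""
    arch = normalize_arch(machine or platform.machine())
    return KERNELS.get(arch, dot_portable)
```

Calling `select_kernel()` with no argument uses whatever `platform.machine()` reports, and an architecture nobody planned for still gets a working, if slower, code path rather than a build or import failure.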
But the new architectures, wider range of capabilities, and different tradeoff frontiers are also going to expand the realm of what’s possible for research computing. And isn’t that why we got into this field?
With that, on to the (shorter) roundup…
8 Steps to Creating a Virtual Employee Onboarding Program - Bruce Anderson
Preserving Culture When Someone Leaves the Team - Mark Wood
Whether people are leaving or joining, as managers we have to be deliberate about the team we’re helping build. Wood points out that when someone leaves, you have the opportunity and responsibility to think about what behaviours that person had that shaped the team, and what you’ll do to replace those - asking other people (maybe including yourself) to do some of those things to fill in the gap, or hiring someone who will do similar things. You also have the opportunity to think about what behaviours make sense to end or change and take steps there, too.
When onboarding, new employees are no longer immersed in the team’s culture the way they were when you were all colocated, so you have to take active steps to make sure they understand how your team works and get up to speed quickly. Make sure there are quick wins lined up for them, communicate endlessly about the things that are important, make sure you’re building horizontal communications between team members not just vertical communications between you and them, and make sure you (and other team members) are modelling the behaviour you expect to see from them.
Managing Your Own Career
Renegotiating your first vendor contract - Will Larson
Eventually we all have to negotiate or renegotiate our first contract with vendors. This short article won’t give any secrets that will win you huge concessions. It will hopefully make the process less stressful by providing a game plan and by setting some realistic expectations, on their side (your vendor wants money, and is willing to make some small trades on the margins for the exact amount of money, but not much) and yours (you should absolutely be able to expect decreasing marginal costs and reasonable timelines, and to see previously raised issues addressed).
Organising Large Miro Boards For Remote Workshops - Nick Tune
The Miro Sprint Planning Playbook - Miro
Since no longer being colocated with our team members and research communities is going to be the norm now, we have to continue to improve how we run events that used to be highly interactive and based on whiteboards or flip charts or the like.
Miro seems like a perfectly good distributed whiteboard application - there are others which also seem perfectly good, but Miro is the leader at the moment. Here are two concrete sets of recommendations for using Miro for long multi-day workshops (by Tune), and one for running sprints and retros (by Miro themselves).
There are other tools which are good for more specific applications - for strategic planning type workshops I’ve had good luck recently with Axis to generate prioritized service catalogues and do things like audience-sourced SWOT analyses.
Research Computing Systems
From Sysadmin to SRE - Josh Duffney, Octopus Deploy
As research computing becomes more complex, our systems teams are going to have more and more demands on them, moving them from sysadmin work towards site reliability engineering (SRE) responsibilities, and working more closely with software development teams. It’s an easier transition for sysadmins in research computing than in most fields, as our teams generally have pretty deep experience on the software side of research computing too.
Duffney’s article lays out how to start thinking about these changes to responsibilities and what people can start doing today to move in that direction. The key thing - it’s not about tools, it’s about how to think about your role, seeing rough spots in the build/deploy/operate cycle, and working to improve them.
Emerging Data & Infrastructure Tools
Supercomputing 2020—New MPI heights, joining the Graph500, and 1 TB/s filesystems - Evan Burness, Microsoft Azure Blog
Azure is increasingly aggressively going after research computing and HPC, and announced several quite cool things at SC2020:
- Running NAMD across 86,400 cores, with performance quite favourable compared to TACC’s 2019 Frontera system
- A top-20 showing in Graph500, demonstrating really good memory and network performance
- And 1 TB/s filesystem performance (1.46 TB/s read, 456 GB/s write - interesting choices there) using not Lustre but BeeGFS.
Mersenne twisters aren’t great random number generators.
Resources for teaching data engineering.
An opinionated list of CLI utilities for monitoring and inspecting Linux/BSD systems.
A growing list of training material for new research software developers.
A jumphost-only ssh server, lazyssh.
An argument for teaching people to program starting with testing.
Interesting book (and pointers to other resources) on career trajectories in and perspectives on software architecture. Along the same lines, here’s an architecture playbook.
AWS’s automatic S3 access-frequency tiering is getting deeper and smarter.
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Good luck in the coming week with your research computing team,
Jobs Leading Research Computing Teams
Highlights below; full listing available on the job board.
Technical Programme Manager – HPC/AI - Hartree Centre, Warrington UK
You will demonstrate the capability and potential to build upon a strong foundation of programme delivery expertise, existing technological knowledge relating to HPC and AI, and their external reputation developed through robust stakeholder relationships.
You will have a good technical understanding of the opportunities and challenges relating to the adoption of technologies such as AI, machine learning (ML) and deep learning (DL), and be able to work collectively with staff across STFC and our partners to translate these into meaningful programme deliverables and outcomes.
Director of Training - Michigan State University, East Lansing MI USA
MSU’s Institute for Cyber-Enabled Research (ICER) is hiring a Director of Training to develop and deploy a broad range of educational programs relating to research computing and computational and data science at Michigan State University. The goals of this position are to facilitate MSU researchers in their effective use of computing (including high performance computing) through hands-on and web-based training, to develop and support recruiting and outreach efforts relating to computational and data science, and to support the development of externally-funded projects in these areas. This position will involve coordination of activities both within ICER and across Michigan State University.
Young Investigator Group Leader Position “Interactive Machine Learning” - DKFZ, Heidelberg DE
We are looking for an excellent researcher with an outstanding scientific record of accomplishments in machine learning with high interest in interactive components. Within HIP, the independent group will complement the DKFZ image analysis activities by focusing on the design and implementation of algorithms and intelligent user interface frameworks that facilitate machine learning with the help of human interaction.
Potential research topics include active learning, transfer learning, incremental learning, one-shot learning, online learning, out-of-distribution modeling, human-AI collaboration, human computation, crowdsourcing, gamification, human-computer interaction, intelligent user interfaces and user modeling.
Data Management Manager - GSK, Brentford UK
The Data Management Manager will be responsible for governing processes to ensure master and reference data will be actively managed within GSK CH and will work with the Data Management Director to oversee implementation of the strategy, processes, technology and teams to manage master data, reference data, metadata and data quality within the GSK CH Data organisation, working closely with the wider organisation to ensure the compliance with the established company and line of business data policies and procedures.
Data Infrastructure Manager - Sanofi, Toronto ON CA
The Data Science team is led by the Head of Data Science and consists of Data Management, SPC and Process Modeling teams. The Data Infrastructure Manager will provide high quality data engineering solutions to support data scientists, data analysts and business users. This position will report to the Deputy Director, Data Management.
Director, Data & Analytics Centre - Johnson & Johnson, Toronto ON CA
The Director will lead the Data & Analytics Centre, which is one of the core organizations of the Customer Experience Excellence (CEE) division. This role reports directly into the Sr. Director of CEE and is part of the CEE and Sales & Marketing Leadership Teams.
Janssen Canada recently announced a major initiative to lead change in the Canadian Healthcare market. The vision behind this initiative is for Janssen to evolve from being the market leader that delivers valuable medicines and services to the market, to the leader that also delivers sustainable and measurable health outcomes for patients. The CEE division, the Data & Analytics Centre, and this role will play a critical part in realizing this vision for the Janssen Canada organization.