Research Computing Teams Link Roundup, 15 May 2020
Hi!
I asked last week about onboarding; people who responded said they weren’t bringing on student interns or new hires at all at this point in the pandemic. It sounded like at least our research computing team managers had succeeded in avoiding layoffs in their teams, sometimes despite an unfavourable - in one case very unfavourable - pre-COVID-19 environment, but several were not bringing on research students, either because of the complexities of onboarding, because they just didn’t have the time, or because schools had stopped their placement programs.
A question for this week - how are work hours and productivity going in your team right now? There’s some evidence that after an initial dip, productivity has actually risen; but I think it’s more complicated than that, especially for research computing teams. GitHub has seen more hours worked (and more variable hours) but it’s not clear that total number of PRs or commits has gone up - people are working longer but not necessarily getting more done.
Anecdotally, it seems like among some teams I’m familiar with, productivity for lots-of-small-tasks work has gone up, but productivity for deep-work will-have-to-think-about-this-for-a-few-days type tasks has dropped noticeably. Has that been your experience, or are you seeing something different? How are you handling it?
Managing Teams
The Engineering Manager Event Loop - David Loftesness via Chris Eigner
This isn’t new, but I really like the idea: what a generic tech software development manager should be thinking of daily, weekly, and monthly on people, projects, processes, and themselves. It’s not quite right for research computing - thinking about recruiting and hiring on a daily basis is, to put it mildly, not the regime we’re normally in - but a lot of the other items hold up. What other changes would we have to make?
Attitudes toward Change and Transformational Leadership: A Longitudinal Study - Matthew David Henricks, Michael Young & E. James Kehoe, J. Change Management
There’s a lot of people who spend their careers studying how workplaces work. I don’t think most of the management literature percolates into STEM fields — the papers look funny. They’re full of impenetrable jargon (unlike our normal completely buzzword-free reading) and the text seems impossibly fluffy compared to dense STEM work (but, well, people’s behaviour doesn’t lend itself to formulae and small dense sentences).
This is a longitudinal study that probably aligns with what you already believe, but it’s worth highlighting - you absolutely can change culture or practices in a team or larger organization, but you can’t do it by having a single event and then putting up some posters. It takes time and sustained effort. People systems are like large ships; it’s hard to get them to change course, but once you actually get them turning they have a lot of momentum.
Managing Your Own Career
Practical Skills for The AI Product Manager - Peter Skomoroch, Justin Norman and Mike Loukides
I’m not saying you should leave academic research computing and move into AI product management — in fact, if you care enough about managing research computing teams well to be reading this, please don’t, the community needs you — but we often undervalue how transferrable our skills are to other areas. Read some of these practical must-have skills and see if they don’t sound like your day job:
- Design/Innovation: “…It’s incredibly important to determine what outcome is desired, how that outcome will be delivered, and how the product will be used before embarking on the long (and expensive) development journey.”
- Feature Development and Data Management: Data management, you say?
- Experimentation,
- Research, and
- Modelling: yeah I think we’re good
- Serving infrastructure: check
How to expand conflict capacity in times of crisis - Marlene Chism
We’re all tired and stressed right now, and it’s easy under those circumstances for small things to cause conflict where they normally wouldn’t. This is something that we can work on, but it takes some time and practice. Pay close attention to things that trigger unhelpful reactions, try to be aware of them as they come up and short-circuit them, and then buy some time by saying you’ll get back to them and step away.
Research Software Development
NVIDIA HPC SDK - NVIDIA
The newest NVIDIA C++ compiler, NVC++, as part of a new HPC SDK, looks pretty cool - it includes GPU-accelerated versions of the C++ STL parallel algorithms. There’ll be a talk about the HPC SDK at the virtual GTC on Tues May 19th.
Five Code Review Anti-Patterns - Trisha Gee, Oracle
We’ve talked before about having clear expectations on code review; here are five common traps to avoid, which could be made explicit as part of your team’s CONTRIBUTING.md or similar:
- Nit-Picking
- Inconsistent Feedback
- Last-Minute Design Changes
- Ping-Pong Reviews
- Reviewer Ghosting
Has someone on your team been meaning to play with WebAssembly to run some C++ code in clients’ browsers? Here’s a template repository for turning C++ code into WebAssembly and deploying it as a Node.js package.
Research Computing Systems
Brit research supercomputer ARCHER's login nodes exploited in cyber-attack, admins reset passwords and SSH keys - Gareth Corfield, The Register
Look, none of us want to read about breaches of our systems in the press, but the fact that this article appeared is a mark of seriousness and professionalism on the part of the ARCHER staff and leadership.
In the private sector the emerging standard, increasingly enforced by law, is full, frank, and fast public disclosure of breaches. That is — that has to be — the future in our community, too. Our researchers deserve at least as much consideration as ecommerce site customers.
Especially since it’s not just our researchers. Research is inherently collaborative and international in nature. People have accounts on lots of different systems. I have heard about breaches in November and February that I’m 95% certain are related to this breach - and if not this one, some less-publicized ones in Germany and the US. The breach spread almost certainly from one system to another via accounts common to two or more clusters.
In research computing we have this culture of secrecy around breaches, so they were kept quiet, not publicized; maybe information was sent in dribs and drabs to a few trusted others through back channels. We are way past the time when that was good enough. These breaches could have been contained, or at the very least slowed down, by earlier public disclosure like the ARCHER team made so that other teams elsewhere knew to be on the lookout. Sure, hold back some signatures of the rootkit so that the attackers don’t make cosmetic changes to avoid detection if you want, but our researchers and our colleagues deserve to know about your breach. We’re in the middle of a pandemic, for pity’s sake! We should understand how irresponsible it is to be walking around infected while acting as if nothing is amiss.
I get it, no one wants their funders or VPRs asking awkward questions. But that’s not a good enough reason for this amateurish practice of hiding the inevitable under the rug. We’re ashamed about these breaches because we know at some level we’re not doing a good enough job on security. But the first step towards having some professionalism is to hold ourselves accountable. As a community we need to do better.
In other security news: no, you probably don't need to worry about that Thunderbolt Vulnerability if you’re running an up-to-date OS. (“All the evil maid needs to do (with an un-updated system - LJD) is unscrew the backplate, attach a device momentarily, reprogram the firmware, reattach the backplate, and the evil maid gets full access to the laptop” - oh, ok then.) But still, don’t leave your laptop lying around, and be careful what you plug into your computers.
Emerging Data & Infrastructure Tools
New – EC2 M6g Instances, powered by AWS Graviton2 - AWS
AWS’s new ARM chips, the Graviton2, are now generally available on AWS in some regions. This is a great opportunity to play with them - not only are they substantially cheaper per unit performance than Xeons but they seem to actually have better performance on some database-y type loads than other M-class nodes.
This is particularly well timed, coming alongside the announcement of what’s likely the new top supercomputer in the world, the ARM-based Fugaku.
Azure HPC Cache: File caching for high performance computing - Scott Jeschonek and Scott Hanselman, Azure
The various commercial clouds are developing products expressly around the issues research computing teams have migrating to using more cloud. HPC Cache caches your on-prem storage to an Azure region (exporting an NFS-esque POSIX FS or object store via Blob Storage) and handles transfer back and forth automatically - meaning you can burst workloads more or less automatically. It has some sophisticated features like being able to specify the type of I/O workload (read-heavy, write-heavy, mixed).
I’d really like to see more solutions like this for caching even in-cloud S3/Blob object stores to a POSIX FS, to handle the staging in and out of data between workloads that require an expensive POSIX FS and much cheaper object storage…
Sort of relatedly, Azure now finally has spot instances.
How I switched from classic hosting to Kubernetes - Anders Borch
A nice walkthrough of moving a side project from Docker Compose to Kubernetes, for those who’ve been meaning to give it a try.
Calls for Proposals
ARM Dev Summit - Proposal deadline June 9th 2020
If you’ve already been playing with ARM or intend to, this October virtual conference is calling for proposals for workshops, demos, or panels. Tracks most relevant to us are:
- AI in the Real World: From Development to Deployment
- Building the IoT: Efficient, Secure and Transformative Software Development
- Cloud Native Developer Experience
- Infrastructure of Modern Computing
ALGOCLOUD 2020 - Paper deadline June 24th 2020
This virtual conference in September “is particularly interested in novel algorithms in the context of cloud computing, cloud architectures, as well as experimental work that evaluates contemporary cloud approaches and pertinent applications. ALGOCLOUD also welcomes demonstration manuscripts, which discuss successful system developments, as well as experience/use-case articles and high-quality survey papers.”
Events: Conferences, Training
Chapel Implementers and Users Workshop 2020 - Friday May 22, 2020 8:30am–4:30pm PDT, Free via Zoom
Chapel seems to have come a long way since I last played with it; they have an all-day (Pacific time) workshop next Friday.
Random
NSF has funded an NCSA/Chicago effort to develop and deploy serverless (function-as-a-service) computing for research computing: the software product is funcX. Anyone know the advantages funcX has over e.g. Apache OpenWhisk?
Programmatically draw architectures for AWS, Azure, GCP or on-prem systems by writing Python code. The cool thing there, of course, is that now your architecture diagrams can be meaningfully version controlled.
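To make the version-control point concrete, here’s a minimal, stdlib-only sketch of the general "diagrams as code" idea. This is not the linked tool’s API: the `to_dot` helper, the node names, and the example architecture are all made up for illustration, and the output is plain Graphviz DOT text that would still need to be rendered separately.

```python
# Hypothetical sketch of "diagrams as code": the architecture lives in plain
# Python data, and the generated artifact is Graphviz DOT text.
# Nothing here is the API of the tool linked above.

def to_dot(name, nodes, edges):
    """Return Graphviz DOT source for a directed graph.

    nodes: dict mapping node id -> human-readable label
    edges: list of (source id, destination id) pairs
    """
    lines = [f'digraph "{name}" {{', "  rankdir=LR;"]
    for node_id, label in nodes.items():
        lines.append(f'  {node_id} [label="{label}", shape=box];')
    for src, dst in edges:
        # Fail loudly on typos rather than silently drawing a stray node.
        if src not in nodes or dst not in nodes:
            raise ValueError(f"edge {src} -> {dst} references an unknown node")
        lines.append(f"  {src} -> {dst};")
    lines.append("}")
    return "\n".join(lines)


# Made-up example architecture.
NODES = {
    "lb": "Load balancer",
    "web": "Web tier",
    "db": "Database",
}
EDGES = [("lb", "web"), ("web", "db")]

if __name__ == "__main__":
    print(to_dot("demo_architecture", NODES, EDGES))
```

Because the diagram’s source is just text, adding a node or an edge shows up as a one-line diff in a pull request, which is the whole appeal.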
I’ve often said that one of the problems people new to computing at scale have is that laptops and their filesystems work too darn well, so their intuitions about what’s easy and hard actively mislead them on clusters or distributed systems. This article expresses that more clearly, and might be useful for discussions with users who are having trouble making the jump.
You shouldn’t write code when avoidable. Even something “trivial” like CSV parsers.
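As a small illustration of why (a sketch with made-up sample data, not taken from the linked article): a hand-rolled split-on-commas parser falls over as soon as a quoted field contains a comma, an escaped quote, or a newline, while Python’s standard `csv` module already handles those cases.

```python
import csv
import io

# Made-up sample: one quoted field contains a comma, another contains an
# escaped quote and an embedded newline.
SAMPLE = 'name,note\n"Smith, Jane","said ""hi""\nthen left"\n'


def naive_parse(text):
    """Hand-rolled approach: split on newlines, then on commas."""
    return [line.split(",") for line in text.splitlines()]


def stdlib_parse(text):
    """Let the standard library's csv module apply the quoting rules."""
    # newline="" leaves line endings untouched so csv can interpret them.
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print("naive: ", naive_parse(SAMPLE))
    print("stdlib:", stdlib_parse(SAMPLE))
```

The naive version turns two records into three ragged rows with the quote characters left in; the stdlib version returns the header plus one two-field record.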
If you haven’t seen this already, there’s a nice survey of deep learning for scientific discovery on arXiv.
A nice-looking Coursera course on research data management.
That’s it…
And that’s it for another week.
Have a great weekend - a great long weekend if you have one! - and good luck in the coming week with your research computing team,
Jonathan
Jobs Leading Research Computing Teams
Cloud and HPC Architect - BASF, Research Triangle Park NC USA
BASF is looking for a Cloud and HPC expert who will manage and maintain our High Performance Computing Platform in AWS and continue development of it. As a key contact of the Seeds R&D business, you will assist our science community to optimally use the personalized self-service HPC clusters.
Senior Product Manager Compute - DigitalOcean, New York NY USA
We are looking for a passionate, results oriented, and motivated self-starter to lead product within our Compute products team. Looking for someone with cloud computing domain knowledge, especially with hands-on experience launching and supporting large/distributed cloud compute products and services. The Product Manager, Technical will build products and features to drive Compute product initiatives while improving operational and capital expenditure.
Senior Manager (Technical) - AMD Research, Austin TX USA
AMD Research seeks a passionate, collaborative leader with strong technical skills and the initiative to motivate an expert team. You will manage a world-class research group and innovate with internal teams and external partners to create the next generation of computing technologies.
APAC Business Development Manager, HPC - AWS, Sydney NSW AU
This business is the primary infrastructure for everything from dev/test, web applications to AI/ML to HPC workloads. The BDM will be responsible for defining, building and deploying effective and targeted programs to accelerate broad based sales and business development activities for compute workloads on AWS.
Research Computing Service Owner - University of Bradford, Bradford UK
As part of a newly established team, you will work to the Research and Innovation Environment Manager at a time of significant development of partnership working at the University of Bradford to deliver our Health Research agenda of evidence-based solutions to transform lives and reduce health inequalities. This new role is based within the Research and Innovation Services Department and has responsibility for managing the High Performance Computing facility on a higher floor in the same building.
Principal Specialist HPC - AWS, San Francisco CA USA
This business is the primary infrastructure for everything from dev/test, web applications to AI/ML to HPC workloads. The BDM will be responsible for defining, building and deploying effective and targeted programs to accelerate broad based sales and business development activities for compute workloads on AWS.
Scientific Computational Lab Manager - Institute for Research in Biomedicine, Barcelona ES
The group aims to attract a Scientific IT Lab Manager to work on managing the high-performance computing infrastructure and the genomics pipelines of the team. Their overall goal will be to organize the infrastructure that enables analyses of human genomic big data, thus fostering translational and basic research of cancer. The team is dedicated to discovering new ways to diagnose and treat cancer in the framework of the European Research Council (ERC) project HYPER-INSIGHT.
Senior Analyst - Digital Research Technologies - STFC - The Science and Technology Facilities Council, Swindon UK
This new role is part of the Cross UKRI Digital Research Infrastructure directorate. The post holder will work closely with the Director of Digital Research Infrastructure in the development of UKRI’s approach to development and delivery of a national infrastructure for digital research: comprising supercomputers, data facilities, clouds, software and networks, and the related people and skills capacity.