Sorry again about the slightly irregular time, and the lack of much of a preamble - heavy deadlines these past two weeks, and I’ve been thinking about directions we can steer the newsletter in the coming year. But the deadlines are past now and it’s back to normal.
As always, share your thoughts with me - just reply to this, or email me - about directions you’d like to see this go. There’s so little out there specifically for us research computing team managers that there are a lot of things we could usefully do! I’m always interested in direction from the community.
In the meantime, on with the link roundup!
This is a talk that Neil Chue Hong gave at the 2020 International RSE Leaders Workshop. The numbers he gave are UK-based - he’s with the Software Sustainability Institute in the UK - but they’re pretty grim. UK research software engineers (RSEs) have percentages of people who identify as women, or who are Black, Asian, or members of other ethnic minorities in the UK, as low as or lower than either academia or tech - which are themselves decidedly unrepresentative of society as a whole.
This is posed as an issue with research software development, but I think it’s more than that. One of the themes of this newsletter which I hope comes through is that binning research computing into “software development”, “systems”, “data management”, or “HPC” isn’t helpful - we all have the same challenges, the same goals, interlocking needs, and the boundaries between the bins are super fuzzy. We only solve issues by working together. And this is a research computing problem which the whole discipline needs to address together.
And it is a problem. Witness the fiasco around NumPy. The NumPy paper, which I celebrated being in Nature last newsletter, had 23 authors - every single one of them a man. When that was pointed out, the NumPy twitter account started blocking people(!!), and then a number of contributors started trolling and dog-piling on critics, which … did not dampen the concerns.
We do have this issue, it’s worse than in just academia or tech individually, and we need to start fixing it. As managers, we can make sure our own hiring processes are surfacing excellent talent from all communities.
We’ve talked a lot about the importance of psychological safety in teams - making team members comfortable expressing their opinions, including raising issues. Without that, you’re missing important input and potentially running into foreseeable (and foreseen!) problems.
Premortems give explicit encouragement to raise issues. I’ve used these to good effect in some project-kickoff situations, trying to get the team to see obstacles ahead so they can be avoided. With premortems, one step is explicitly brainstorming ways that things could go wrong. That makes it much easier for team members to chime in with foreseeable issues, and for you to hear insights they might otherwise not be willing to share. And it’s a good way to get people comfortable raising potential problems. Cohn’s article is a good intro to the idea if you haven’t seen it before.
Meeting everyone on a new team - Anna Shipman
Last time we talked about leaving a team; this time, an article about doing one specific thing when joining a new team as a manager or director - speaking with every person in your new organization. Shipman describes having 30-minute meetings with each person in her new 50-person organization over the course of several months. Long-time readers will recognize it as looking a bit like the first half of a weekly one-on-one: mostly listening, driven by the team member. Shipman made it clear that this was for informational purposes, and that she wasn’t intending to attach the team member’s name to comments, and structured the discussions around five questions:
How To Handle Email Calmly - Lukáš Linhart
Email is the bane of all managers, so any article on handling email almost always gets a quick look.
Linhart’s first suggestion is something I follow that I only recently learned not everyone does - religiously keep multiple email accounts, and keep them separate:
He also suggests
I’d add that I also have “email addresses” associated with my RSS reader account that I use to sign up for long-form reading like newsletters - bringing them out of my inbox and instead routing them somewhere that I go when I’m actually looking forward to reading the contents.
I’d also add: set your email client to check every hour or so, not every couple of minutes. There’s no possible email that you need to have read within minutes of its arriving in your inbox. I’ve tried other tricks, but those are what have worked for me. How about you?
Improving your RSE application - Ian Cosden
We don’t often get career articles written specifically for our discipline, so I wanted to include this short three-part set of articles by Cosden, the Director of Research Software Engineering for Computational & Data Science at Princeton. It came as a follow-up to an RSE community chat about hiring.
The articles are simple but useful, both for seeing what a fellow hiring manager looks for, and as a reminder of things we need to keep in mind ourselves as we look to our next roles.
Cosden’s CV tips:
and his cover letter tips:
At some point I need to do a whole issue on hiring… what issues have you had with hiring, and what are the biggest ones you face? Hit reply or email me (email@example.com) and let me know if there’s something you’d particularly want to read.
What Does This Line Do? The Challenge of Writing a Well-Documented Code - Miroslav Stoyanov, Better Scientific Software
In this article, Stoyanov describes how the team behind Tasmanian, a library for high dimensional integration and interpolation, went from PDF-based documentation to a Doxygen-powered web based documentation system. The lack of good internal documentation was a problem for Tasmanian:
Tasmanian has always had a well-documented external API, but not internal documentation, the lack of which is especially problematic when chasing moving targets such as GPU support for multiple vendors. Porting code to GPUs is hard, doubly so when it is undocumented and comes from an external contributor no longer working on the project.
They took what internal documentation existed and moved it into Doxygen and their regular build system. Moving to web-based internal documentation automatically raised the visibility of gaps in the documentation, and made it easier to spend development time filling those gaps - which in turn makes it easier to port to multiple GPUs.
Enhancing software development through project‐based learning and the quality of planning - Marco Antônio, Amaral Féris, Keith Goffin, Ofer Zwikael, and Di Fan, R&D Management
We’ve talked about sharing knowledge across the organization before - whether through talks, pair-programming, or other shared experiences (documentation, share, but not just documentation). In project management, “Project Based Learning” (PBL) is a number of techniques that builds into the project planning ways to make sure the things our team members learn from the project becomes shared knowledge within the team and organization.
This is a paper that shows that PBL works; it’s not just a good career and skills development practice (which it is), and not just a nice-to-have - it measurably improves performance on future projects. The authors looked at 47 software development projects across three multinational organizations and found that the data supported all five hypotheses they tested; project-based learning:
Since research software development is generally under time pressure, is always quite collaborative, is normally organized into teams around projects, and is notoriously uncertain, this is quite relevant to us.
Spindle: Scalable Shared Library Loading - Matthew LeGendre, Dong Ahn, Todd Gamblin, Bronis de Supinski, Wolfgang Frings, Felix Wolf
When a large number of nodes starts running a task, they often hammer the filesystem as each process independently loads the (often large number of) dynamic libraries - or .py and .pyc files, or the like - needed to start the program. From the authors:
We encountered cases where it took over ten hours for a dynamically-linked MPI application running on 16K processes to reach main.
And during that time any other processes trying to access the filesystem are also slowed.
The Spindle package, on GitHub, is a (largely) LLNL team’s approach to this problem. On task startup, for each file (dynamically linked library, dlopen()’ed library, configuration file, etc. - it’s configurable) one process is chosen to read it and broadcast it to the ramdisk of the other processes, greatly speeding startup and reducing the impact on other users. No recompilation is needed; one just starts the program with, e.g.:
`spindle mpirun -n 128 mpi_hello_world`
The Cost of Software-Based Memory Management Without Virtual Memory - Drew Zagieboylo, G. Edward Suh, Andrew C. Myers
I had just been reading an older article on virtual memory tricks, as well as one on mmap tricks - both arguing that we don’t use virtual memory features often enough to simplify our programming - when this link crossed my inbox. This article argues the opposite: that virtual memory has outlived its usefulness, and is now a drag on performance and system predictability:
With large memory workloads, virtualized environments, data center computing, and chips with multiple DMA devices, virtual memory can degrade performance and increase power usage.
It’s not hard to understand how virtual memory greatly complicates DMA devices. With virtualization, the argument is about performance: large-memory, complex-access-pattern workloads like many data analytics workflows often incur TLB misses, and with virtualization each miss requires nested page-table walks to resolve.
Embedded systems have done without virtual memory for eons, and the authors suggest it’s time this became more common. Without virtual memory the OS is dramatically simplified and performance can increase; further, and perhaps more controversially, the authors suggest that applications can now more easily handle their memory directly, forgoing the virtual memory system-provided contiguity, migration, and swapping by doing it at the application level.
The authors try their hand at implementing their recommended approaches on several SPEC and PARSEC benchmarks, claim that the programming and computational overhead of implementing memory management at the application level is modest, and encourage more work in the area.
What do you think? Most “HPC codes” do this anyway, because memory placement is so crucial, and HPC sysadmins would typically rather see a code crash than swap - but I’m a little more skeptical about its application more broadly.
Google, which is notoriously close-lipped about technology development inside the company, is getting more and more open with its training materials. This is terrific, because Google takes training materials very seriously, and they’re quite good.
In Google’s systems reliability practice, they emphasize large-systems design and “back of the envelope” estimation approaches that will seem quite familiar to those of us trained in the physical sciences. They teach this approach with quite concrete examples - their so-called “non-abstract large systems design” (NALSD) exercises. This lets them quickly evaluate the feasibility and tradeoffs of different approaches before they start building things. There’s a nice chapter in the SRE book working through a simple example.
They’ve just released a nice workshop on NALSD with a pub-sub worked example. In the package are slides, worksheets for attendees, and a workbook for workshop facilitators. It looks like a nice set of materials for you or a team member to work through if you’re curious about architecting these kinds of systems, or a cool afternoon course to offer within teams or externally.
Oracle Cloud Deepens HPC Embrace with Launch of A100 Instances, Plans for ARM, More - John Russell, HPC Wire
Oracle bulks up high-performance computing services on its cloud - Paul Gillin, SiliconAngle
Oracle, trying to carve out a niche for itself in the commercial cloud provider world, has clearly set HPC (and other areas of research computing?) as a strategic target. With bare-metal instances, the newest A100 NVIDIA GPUs, and ARM plans, Oracle is clearly betting this is a specialization that can pay off.
The SiliconAngle article points out something of particular interest for research computing workloads - much more flexible instance types, allowing you to choose your number of cores and amount of memory directly:
“If you want three cores and 15 gigabytes of RAM, you can’t get that from anyone else,” Batta said. “It’s like a slider: You pick cores and memory and we give you an instance on demand.”
Has anyone tried Oracle’s HPC efforts?
SORSE Call for Contributions - 30 Sept
The ongoing Series of Research Software Events has their monthly abstract deadline at 8pm BST on Sept 30. They are accepting proposals for talks, panels, software demos, and more.
Were you going to submit something to an RSE conference this year? Do you have a project you’d like to share? Is there a talk/workshop/panel you’d like to see happen? Then you’ve found the right place! We encourage contributions from all time zones and will schedule events on a day and at a time that suits the presenter. We would like to record events where appropriate: the Call for Contributions form asks for your permission to record the event and to publish your uploaded materials to Zenodo under a CC BY licence.
GPU Technology Conference - 5-9 Oct, multiple timezones, $99 USD
Nvidia’s GTC has gone all-virtual this year and really embraced it, with sessions, live or recorded, given in multiple time zones (and some in multiple languages). As you might expect, there are a lot of AI and deep-learning sessions, but also multiple sessions on topics of interest to research computing such as geospatial data, drug design, HPC, GPU + Infiniband, genomics with both short reads and nanopore sequencing, IoT/edge computing, and technical sessions on pseudo-spectral fluid dynamics methods, RAPIDS (GPU-powered database analytics), GPU + Spark, algebraic multigrid, CUDA programming, and more. Not hard to find $99 worth of talks to attend.
Digital Humanities RSE: King’s Digital Lab as experiment and lifecycle - 29 Sept, 15:00 – 16:30 UTC, James Smithies, Arianna Ciula, SORSE talk series
Next up in the Series of Online Research Software Events, a talk about a research software lab at King’s College London:
This SORSE event describes King’s Digital Lab (KDL), a Research Software Engineering lab operating within the Faculty of Arts and Humanities at King’s College London (UK). The KDL team of 18 project managers, analysts, designers, engineers, and systems managers specialise in arts & humanities, cultural heritage, and creative industries research and development. The talk will provide a current state overview of the lab, and describe our RSE HR roles (see https://zenodo.org/record/2564790) and a relatively recent trial initiative that defines the different ways the team can contribute to research.
Hacktoberfest 2020 - 1 Oct - 31 Oct
Not getting much coding in these days as a manager? More time spent in spreadsheets than in editors? Here’s your chance. Sponsored by Digital Ocean, this annual project encourages contributions to open-source projects. If you’re one of the first 75,000 participants to complete the challenge by submitting four valid, non-spammy PRs to any public GitHub repo (many projects label issues with #Hacktoberfest in addition to good-first-issue or the like), you’ll be eligible for a prize like a t-shirt. There are a bunch of research software projects participating in climate science and the geosciences, and of course a zillion COVID-19 projects.
Fluid Numerics Cloud HPC livestreams - Weekly starting 1 Oct
Joe Schoonover of Fluid Numerics is running weekly livestreams on doing HPC fluid dynamics in the cloud.
Graduated beyond Little-Bobby-Tables style SQL-injection mischief in the “name” field of your various web accounts? Worried your service provider is storing passwords in plaintext? Maybe try choosing antivirus test strings as your password.
With the latest Windows update you can now read Windows Subsystem for Linux files from within Windows … by way of Plan 9.
Relatedly, someone’s put together a DOS subsystem for Linux, so you can run your favourite MS-DOS commands from Linux.
A nice style guide for SQL from the folks at Kickstarter.
In fact, there’s a lot of really nice coding exercise websites out there. One I just found, exercism.io, has 118 exercises for… Tcl?
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Highlights below; full listing available on the job board.
Research Project Manager - BHF Data Science Centre - Health Data Research UK, London UK
- Work with the BHF Centre Director, Operations Director, and other members of the team to lead the definition of project tasks and resource requirements.
- Work with the Centre team to develop full-scale project plans and schedule project timelines across a complex set of work streams, work programmes, and driver projects.
- Manage day-to-day aspects of Centre work programmes/streams and projects, including proactive risk/issue management and mitigation, and tracking of and reporting on progress across work programmes/streams and projects.
- Create and execute project work plans, taking remedial action where necessary to ensure that the project deliverables are achieved.
Director, Research Software - McGill University, Montreal QC CA
The Digital Research Infrastructure (DRI) unit at McGill University includes three teams - Advanced Research Computing (ARC), Research Software Development and Analytics, and Research Data Management. The unit works in close collaboration with staff from other member institutions of Calcul Québec and Compute Canada to provide support to researchers at McGill, in Quebec, and across Canada. The Director, Research Software is responsible for a team of analysts and is expected to build expertise and capacity in research software. The incumbent works in close collaboration with the Director, IT Development and Operations, who oversees the ARC platform and services. The DRI unit provides computational support, assistance, and resources to researchers on campus, such as research application development and support to users of the ARC platform, e.g. planning resource allocation, access to ARC, help desk, and analytical support. The position sits administratively under the Office of the Vice-Principal (Research and Innovation) and reports jointly to the CIO of Information Technology Services. Collaboration with the Libraries is expected in some aspects of the mandate.
Director of Scientific Computing - University of Wisconsin at Madison, Madison WI USA
Under the general direction of the Chief of Biomedical Informatics (CBMI), the Director of Scientific Computing will work collaboratively with researchers, clinicians, informaticians, IT teams and external partners to lead the development of a state-of-the-art computing environment at the School of Medicine and Public Health (SMPH) at University of Wisconsin, Madison. The Director will leverage industry best practices and standards related to scientific biomedical computing for research programs. The incumbent will consider HPC services at peer institutions, and those provided by cloud companies with the intent of enabling a hybrid computing environment that allows for agility, flexibility and security for biomedical research.
Director, IT Infrastructure (Cloud and High-Performance Computing) - BioReference, Gaithersburg MD US
GeneDx is seeking a strategic and operational leader to develop and manage our growing computing and storage infrastructure and team. This leader will report to the Chief Technology Officer and be responsible for the successful planning, execution, and operation of infrastructure supporting all genomic data applications across a hybrid cloud and on-premises environment.