Research Computing Teams Link Roundup, 1 May 2020
Hi!
The newsletters have been getting kind of long lately, which makes them unwieldy to read. In this issue I’ll keep things much shorter (fewer items, and fewer words for most items).
Let me know (just hit reply) if there’s other approaches you’d like - like being able to subscribe just to certain sections of the newsletters, and/or having smaller more frequent issues.
And as always, please do give feedback on anything you like or don’t like about the newsletter. Research Computing Teams doesn’t have any analytics - no creepy single-pixel trackers or link redirection to track link clicks - so the only way I know whether links were popular or uninteresting is if people email me back to let me know. So do let me know!
With that, let’s get straight to the roundup…
Managing Teams
Picking problems for programming interviews - Will Larson
If you do include coding as part of your interviews, it’s tough to find something that is relevant, hard enough to distinguish between candidates, but easy enough to be doable. Here Larson plays with a few examples (one of which is a particular kind of data munging: something broadly relevant to our needs). His suggestions are to aim for problems that:
- Support simple initial solutions and compounding requirements
- Are solvable with a few dozen lines of code
- Require algorithmic thinking but not knowing a specific algorithm
- Don’t require juggling numerous variables in your head
- Support debugging and evolving the code
- Are not domain specific to [unjustly: LJD] advantage arbitrary experience
- Require writing actual code in a language of their choice
- Are done in a real code editor
14 questions to ask an underperforming employee during a one-on-one meeting - Claire Lew, Know Your Team
Good question types to get to the root of why an employee isn’t performing well on some kinds of tasks. They come in two categories:
- Looking inward, to find: “How have I been letting this person down? How have I been getting in the way?”
- Looking outward, to find: “What on the employee’s end is limiting them? What choices or capabilities of their own are keeping them from the results you want to see?”
Manager’s Playbook - Kamil Sindi
One tech manager’s distillation of a lot of resources into a short simple cheat sheet. One-on-ones, coaching, specific frequent feedback, strategy, hiring, etc.
Managing Your Own Career
Success Factors in R&D Leadership - Gritzo, Fusfeld, & Carpenter, Research-Technology Management
This is a paper from a few years ago that looked at leadership development data from 47,000 respondents - both managers and those evaluating them, in R&D and outside of R&D - and compared the two groups.
They found - well, it’s hard to read it any other way than that R&D managers were generally worse managers than non-R&D managers:
When the results were consolidated, R&D managers were rated more favorably than their non-R&D counterparts for only 3 of the 115 positive attributes: Quickly masters new technical knowledge necessary to do the job, is creative or innovative, and is calm and patient when other people have to miss work due to sick days.
That refers to just one set of questionnaires, but the whole thing is pretty grim. I think the way to understand these results is that there are some must-haves in R&D managers; they are things like:
- Understands higher management values, how higher management operates, and how they see things
- Creative/innovative, and fosters a climate of experimentation around them
- Offers novel ideas and perspectives
- Is prepared to seize opportunities when they arise
- Adjusts management style to changing situations
- Quickly masters new technology
and that the R&D managers that have those qualities can have some success despite the following common deficits:
- Can’t make mental transition from technical manager to general manager
- Hires people w/good technical skills but poor ability to work w/others
- Does not resolve conflict among direct reports
- Won’t take the lead on unpopular yet necessary actions
- Doesn’t communicate confidence and steadiness during difficult times
- Doesn’t regularly seek data about customer satisfaction
The lesson I take from this is that we all pretty much by definition have the positive “must-haves” to become research computing managers - and certainly taking a look at that list, I think that’s a fair assumption. The way we can become better managers, for our teams and for our bosses, is to work on the management fundamentals - better working relationships with our team (one-on-ones), better communication, better feedback, better decision making, better execution. Doing even modestly better than the current accepted practice will help us, our teams, and the research products we support stand out from the pack.
Product Management and Working with Research Communities
The evolution of the Software Underground - Matt Hall, Agile*
An extremely cool research computing group that I had never heard of:
There are now more than 2,160 “rocks + computers” enthusiasts in the Underground, with about 20 currently joining every week. It’s the fastest growing digital subsurface water-cooler in the world! And the only one.
One of the problems the research computing community has always had has been that relatively few people consider themselves to be part of a community as abstract as “research computing” — they tend to think of themselves as part of communities like genomics, or computational astronomy, or digital history, or…
This is a great example of how a more naturally-occurring community - even a fairly multidisciplinary one (“subsurface science” covers a lot of fields) - can grow organically if it meets a need that the people in that community have.
Extreme Interest In Crowdsourced Projects Requires More Traditional Management - Science Blog
Interesting analysis by University of Michigan researchers showing that crowdsourced projects - in our context, things like citizen science efforts or open-source development - eventually need some kind of fairly traditional management approach when the amount of interest gets large enough.
Research Software Development
Migrating large codebases to C++ Modules - Takahashi, Shadura, & Vassilev
C++ Modules in ROOT and Beyond - Vassilev, Lange, Muzzafar, Rodozov, Shadura, & Penev
C++20 is finally coming. There are several major new features - Coroutines, Concepts, Ranges, and Modules (Contracts - preconditions/postconditions/assertions, which I think are potentially extremely interesting for research computing - were pulled from C++20 and deferred to a later standard).
Modules are probably the biggest change to the language. Ever since C, the approach that’s been taken to modularizing C/C++ code has been C-preprocessor-style include statements. These are hard to reason about and slow/difficult/repetitive to compile, because macros and other definitions earlier in your file - er, ‘translation unit’ - can modify the interpretation of what you’re including. You can also easily find yourself in a twisted mess of circular includes.
With modules (a good quick introduction is here), there are import and export statements setting explicit boundaries around modules and, frankly, improving clarity - it’s now much easier to understand what the interface is to a collection of code.
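To make the contrast with textual includes concrete, here’s a minimal sketch of what a module might look like (the module name, file names, and function are made up for illustration, and the flags for compiling module interfaces still vary between compilers):

    // math_utils.cppm - a module interface unit (the file suffix varies by compiler)
    module;                      // global module fragment: ordinary #includes go here
    #include <vector>
    export module math_utils;    // everything below is part of the module

    // Only names marked 'export' are visible to code that imports the module;
    // everything else stays private, and no macros leak out.
    export double mean(const std::vector<double>& xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return xs.empty() ? 0.0 : sum / xs.size();
    }

    // main.cpp - a consumer of the module
    import math_utils;           // no textual inclusion, no macro surprises
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<double> xs{1.0, 2.0, 3.0};
        std::cout << mean(xs) << "\n";
        return 0;
    }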
In the first paper, Takahashi et al. describe the partial rewrite of CERN’s ROOT to modules. For ROOT, which is (a) a huge codebase and (b) often used in the context of a C++ interpreter/JIT system, Cling, the problem modules solve is particularly urgent. This work is complemented by the second paper by Vassilev et al., which discusses both the logistics of moving a large code base to modules and the substantial compilation/JIT performance improvements they see.
(PS, congratulations to GSOC student Arpitha Raghunandan who apparently helped with the module indexing work of the second paper).
Asymptotics of Reproducibility - Roger Peng
A reminder that reproducibility/repeatability is not an immutable property of some computational work — it decays over time, requires maintenance, and that maintenance has to be done by someone.
PostHog - a package for open source product analytics
There are research computing tools here and there in many fields that “call home” for logging and to inform further development. The research community quite often doesn’t love the idea of tools logging their use, but absent that information it’s hard to secure funding, or to identify which features of a tool are or aren’t used so as to guide effort (as opposed to just relying on the noisiest users).
PostHog is an interesting-looking new tool for aggregating whatever callbacks you put in your code - it’s aimed at web applications, but with API calls and SDKs in multiple languages, it would be possible to put this in command-line tools as well.
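As a rough sketch of what “calling home” from a command-line tool can look like - this just POSTs a small JSON event with libcurl; the endpoint URL and payload fields are placeholders rather than PostHog’s actual API, and in practice you’d want this to be opt-in and clearly documented:

    // Hypothetical usage ping for a command-line research tool (libcurl).
    // The URL and JSON fields below are placeholders, not a real analytics API.
    #include <curl/curl.h>
    #include <string>

    void send_usage_ping(const std::string& event_name) {
        CURL* curl = curl_easy_init();
        if (!curl) return;   // analytics must never break the tool itself

        const std::string payload =
            "{\"event\":\"" + event_name + "\",\"tool\":\"mytool\",\"version\":\"1.2.3\"}";

        struct curl_slist* headers = nullptr;
        headers = curl_slist_append(headers, "Content-Type: application/json");

        curl_easy_setopt(curl, CURLOPT_URL, "https://analytics.example.org/capture");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, payload.c_str());
        curl_easy_setopt(curl, CURLOPT_TIMEOUT, 2L);   // fail fast if offline

        curl_easy_perform(curl);   // best-effort: ignore errors

        curl_slist_free_all(headers);
        curl_easy_cleanup(curl);
    }

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        send_usage_ping("tool_started");   // e.g., once at startup, if the user opted in
        curl_global_cleanup();
        return 0;
    }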
Despite the problems with these approaches, I do hope that there is some way agreed on within research communities to get development or operations teams the data they need to make informed decisions.
Research Computing Systems
How to create an incident response playbook - Blake Thorne, Atlassian
This is a really good starting point for putting together an incident response playbook, and includes links to Atlassian’s own playbooks and a workshop on incident communication.
This is something we’re working on in our own team. We’re not there yet, but we’re getting there. On the other hand, colleagues-of-colleagues of mine were recently involved in a major incident at an organization that had lots of security policies in place about keys and ciphers (good) and security architecture like DMZs, bastion hosts, etc. (good), but nothing really like a playbook or even an incident response approach, which meant critical decisions were being made on the fly. That meant the response was slower, more halting, and more stressful than it needed to be, for both the staff and the researchers relying on that resource.
In research computing there’s understandable resistance to the extremely process-bound approach that corporate IT requires, because change happens so quickly. All well and good. But we’re professionals, and we have professional obligations to the researchers we support.
No playbook will tell you all the answers to an unfolding incident, and playbooks will often have to be updated in light of what you learned from a past incident. But having some pre-agreed-upon clarity on how to make certain decisions at the start frees you to focus the harried decision making on the things that are actually particular to the incident you’re responding to. This is a good way of getting started.
The Flux Resource Management Framework
This is something that has been in development for a while but that I just found out about - an emerging customizable framework for resource management and scheduling for large batch-style clusters. Development is led out of LLNL, and it appears to focus on being able to programmatically (via APIs, not just batch scripts) support very large numbers of jobs and very heterogeneous resources (including cloud resources).
Emerging Data & Infrastructure Tools
Getting storage engines ready for fast storage devices - Sasha Fedorova, MongoDB
As NVMe devices become more common, they’ll be used by researchers both directly - experimenting with NVMe to increase the performance of memory- or IOPS-intensive workloads (do you know of any interesting work? Send me a link!) - and indirectly, through file systems or tools like databases.
But the current generation of NVMe hardware has pretty complex performance characteristics, and people are still trying to figure out how to extract significant performance wins from it. It’s interesting to see fairly mainstream databases like MongoDB wrestling with how best to approach these devices. Here (motivated by some previous work) they take a surprisingly simple approach - using mmap-ed files, which makes some sense given that NVMe is something like both memory and a file system - and they keep as many operations as possible in userspace rather than going through the kernel file system code (the faster the storage, the more relatively expensive context switches become). Batching I/O operations matters too, but that’s true of almost all I/O, and they found that the underlying storage engine was already quite optimized that way.
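The mmap idea in miniature: rather than issuing a read() or write() system call for every access, you map the file into your address space once and then treat it as memory, with the kernel paging data in and out behind the scenes. A rough sketch of the pattern (this is just the general technique, not MongoDB’s actual storage-engine code, and error handling is minimal):

    #include <cstdio>       // perror
    #include <fcntl.h>      // open
    #include <sys/mman.h>   // mmap, msync, munmap
    #include <sys/stat.h>   // fstat
    #include <unistd.h>     // close
    #include <iostream>

    int main() {
        // Map an existing data file read-write into our address space.
        int fd = open("data.bin", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        char* base = static_cast<char*>(
            mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        // Reads and writes are now ordinary loads and stores in user space -
        // no per-operation system call, no copy through a read() buffer.
        base[0] = 'X';
        std::cout << "first byte is now: " << base[0] << "\n";

        // Explicitly flush dirty pages to the device when durability matters.
        msync(base, st.st_size, MS_SYNC);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }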
Calls for Proposals
There are going to be a lot of undergraduate students having a hard time finding summer employment in the coming weeks - NSF has announced supplemental funding for research experiences for undergraduates in CISE.
Random
Nature is launching Nature Computational Science in January 2021. I think Computational Science has changed a lot since the last round of computational science journals started; I hold out some hope that this will be bigger picture than the various J. Comp. [Discipline] algorithmic journals. They’re looking for an Associate Editor, Computational Biology.
Thinking about technical decisions as choosing what problem to have can be a useful frame.
The Spotify Model of technical teams was never really a thing at scale, and to the extent that it was, it sounds like it was a bog-standard matrixed teams approach.
Springer has a list of 407 books which are free to download during the COVID-19 outbreak. Of interest: a number of programming texts (Python, C++, data science, verilog/VHDL), math (linear algebra, PDEs), stats, bioinformatics, digital humanities.
Writing good meeting notes is 100% a “deep work” task, and one you can get better at with practice.
A comprehensive list of embedded data engines (e.g., sqlite and friends).
How to declutter your digital life. I’d normally think there’s no way I’d ever get around to this, but so far in this pandemic I’ve even cleaned out and organized my “maybe I’ll need it someday” cable box, so who knows.
That’s it…
And that’s it for another week.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
Jobs Leading Research Computing Teams
(Note that these are only new jobs this week; several reposted “Director of Research Computing” jobs, for instance, don’t show up in this roundup.)
Global Segment Lead - High Performance Computing - AWS, New York NY USA
The ideal candidate will possess a strong background in HPC along with strong business development, strategic alliances, and entrepreneurial skills. She/he should have a demonstrated ability to think strategically about new business models, solution selling, and technical challenges to help build and convey compelling value propositions for AWS customers with our partners.
Director, Computing and Technologies - MILA, Montreal QC CA
You will act as a strategic advisor to management on new information technologies in the Artificial Intelligence research field and will manage the existing IT team. You will also be responsible for the management and evolution of Mila’s High Performance Computing infrastructure (HPC) and all of its information systems, including the design, purchase and implementation of equipment (computing clusters), the management of IT operations, the evolution of technologies, user support and network security.
Senior HPC System administrator - Northeastern University, Boston MA USA
The successful candidate will help lead the administration, development, and expansion of both the Discovery cluster, as well as the full RC environment (VMs, storage, software, etc), within the RC team.
Vanguard Center of HPC Project Manager - University of Hawaii, Maui Meadows HI USA
Under limited supervision, provides the technical and program leadership for complex software systems development and applications projects and programs, ensuring that objectives are achieved in a cost-effective, timely manner. Leads a team of systems, software engineers, and subcontractors on development programs involving high performance computing.