Research Computing Teams Link Roundup, 13 Nov 2020
Hi!
This is email #50, and a good time to think about what’s next. It helps that events of this week - the US election, Pfizer’s vaccine - make thinking about the future much less discouraging.
I wrote last week that post-pandemic challenges that research computing teams will be facing include:
- Having really high-functioning teams, ready to work with new distributed coworkers
- Having a very clear focus to communicate to administration and funders
- Having a very clear value to communicate to researchers and other stakeholders
The current link-roundup approach has supported readers much more in 1 than 2 or 3; there’s a lot more material out there every week on managing technical teams than there are on (say) grant writing for research computing, or research product management.
In 2021 I’d like this newsletter to be much less about “links of the week” and more about building up a library of material that help research computing teams face the challenges we take on in supporting research in an uncertain environment.
To support that, I’m putting together a much improved webpage which will have a library of resources, and make it easier to search for past material. I intend that to be up and running early in the new year. I’d also like to change the format a bit, and have shorter more focussed newsletter issues.
I’ll send around a survey before the end of the year, but if you have ideas of what you’d like to see, how you’d like things structured - should everything be in one newsletter, or should you be able to pick-and-choose topics? Or even new formats would you like to see - would something like a community chat/zoom call or interviews or something else be useful to you? - please let me know - email me at jonathan@researchcomputingteams.org, or respond to this email.
Managing Teams
We Learn Faster When We Aren’t Told What Choices to Make - Michele Solis, Scientific American
A key responsibility of ours as managers is to make sure our team members are growing in their skills and abilities. Solis’ article summarizes recent work with simple game-like settings where participants learn, for instance, which symbol in the game is worth more. The authors demonstrate that the participants learn faster when its their choices that drive the learning rather than when they’re guided through the choices which show them the same things.
It’s important to delegate new tasks to team members, and while they’re learning - climbing up the responsibility ladder, or gaining experience managing increasingly complex projects - it’s our responsibility to provide guidance to team members and not leave them adrift. But we do that best by providing guardrails, not by controlling the steering wheel.
End Micromanagement: 6 Signs You’re a Micromanager (And What to Do Instead) - Dara Fontein
The Most Important Management Concept You’re Missing: Task Relevant Maturity - Lighthouse
Relatedly, one of the big questions new managers have is how much managing is too much - you don’t want to micromange. In research computing the default is to come down on the side of way, way too little managing.
These two articles read together I think help clarify things. Fontein’s article outlines common micromanaging signs, and I can almost guarantee that if you’re in research computing you don’t tick any of those boxes (except maybe “you hate delegating tasks”, but that’s a skill that needs some practice).
But when working with new people, especially junior staff, you do have to keep an eye on them as they progress. So how does that work?
The Lighthouse article on task relevant maturity (and other resources that have come up before - An Engineering Team where Everyone is a Leader and The Responsibility Ladder ) help clarify thinking around this. The idea isn’t to drop someone new into some big project and let them sink or swim; that’s not good management. But doing the work for them or rewriting their code isn’t good management either.
The idea is to give clear goals about what is to be achieved along with measures of success, and to let them run with it. Initially you follow up relatively closely - assessing the what’s not the how’s - and as they mature and demonstrate “task-relevant maturity” the training wheels slowly come off and they get more and more latitude and less oversight.
Product Management and Working with Research Communities
A web-native approach to open source scientific publishing – Nokome Bentley, OpenSource.com
The first of eLife’s Executable Research Articles are now online, allowing published, interactive Markdown and Jupyter notebooks in papers.
Readers of an ERA publication can then inspect its code, modify it, and re-execute it directly in their browser, allowing them to better understand how a figure is generated. They can change a plot from one format to another, alter the data range of a specific analysis, and more.
You can see two here:
- Replication Study: Transcriptional amplification in tumor cells with elevated c-Myc - Lewis et al.
- Inter- and intra-animal variation in the integrative properties of stellate cells in the medial entorhinal cortex - Pastoll et al.
These papers are powered by Stencila which appears to be open-source and pay-for-hosting.
Unequal effects of the COVID-19 pandemic on scientists - Kyle R. Myers et al., Nature Human Behaviour
A massive survey of 4,535 faculty/PIs and what’s happened to their allocation of work time during the pandemic.
Some headline results:
- Fields are affected differently - equipment-sensitive or field-work intensive fields have really taken a hit
- Researchers with small children have taken a big hit on their research time
- Women have taken an especially big hit on their research time, across fields and types of research
- Research time is down by a median of ~20% across the board, while grant writing, teaching, and other tasks are more field dependent
In research computing, we have the opportunity to step up and move from service providers to collaborators to shoulder some of this load that’s fallen on the community we serve (where it’s possible for us, of course - many of us have had our home/family responsibilities increase, too).
Bryn Roberts: Reflections on Pharma’s Scientific Computing Journey - Stan Gloss and Loralyn Mears, BioIT World
It’s always interesting to see why communities adopt technologies; anything that catches on quickly is because it addresses a perceived need, and solutions that address those needs go from a curiosity to being suddenly in high demand.
If you follow the job board at all you know that there are tonne of research data jobs of all sort in pharmaceuticals - governance to data platforms to software development. The article by Gloss and Mears interviewing Roberts describes why - data was very siloed in pharmaceutical companies until very recently, due to inertia and lack of incentive to change. As easy to use data analysis and platforms became available which could make use of the integration of these different data types, that siloing becomes untenable. Now there’s even (limited) data sharing across companies and a genuine interest in FAIR data standards and common data models and APIs.
Research Software Development
Ten simple rules for writing a paper about scientific software - Joseph D. Romano, Jason H. Moore, PLOS Computational Biology
This may be useful for members of your team or researchers your team works with. Much will be familiar, but it’s worth highlighting a few suggested rules:
- Publish for users, not developers - or, more generally, know your audience and what you’re trying to accomplish with a paper or any other communication
- Create a long-term software management plan
- Maintain consistency between code, documentation, websites, and papers
- Plan for follow-up publications and update the software accordingly
- Prioritize visibility and availability - again, know what you’re trying to accomplish and who you’re trying to reach.
Rust vs Go - John Arundel
Both these programming languages are getting interest in different parts of research computing. The comparison here is fair; Rust as a C++ replacement, essentially a systems programming language, and Go as a more general development language - more like a Java use case, but in a language a tiny fraction of the size.
For a lot of low-level methods development Rust is probably an increasingly decent choice (unless everything’s rectangular arrays, in which case Fortran is still hard to beat). Rusts’s parallel computing tools are still pretty primitive compared to C++ or Fortran, though rayon looks neat. I actually think it’s a failure of our community that we require tools like aligners and fluid software to be written in a systems programming language, but that’s where we are now. For more general software development, I’ll be curious to see how Go’s new generics handling works out.
How to Recalculate a Spreadsheet - Lord i/o
Reactive programming frameworks, as they mature and their performance improves, are something that could be useful for research computing. Here’s a walkthrough of how they work (using the real history of spreadsheet internals as an introduction) as a way to introduce the Rust Salsa package.
An open-source, high-performance simulator for large-scale geological carbon dioxide storage - Press Release
Extreme-Scale Scientific Software Stack (E4S) version 1.2 - E4S Collaboration
Engineering strategy every org should write - Will Larson
These are all different stories, but they’re connected a theme - the power of strategically adopting a software stack, how that makes a development team more capable of creating new applications rapidly, and how important it is to have a strategy for adding and deprecating things from that stack.
The press release is about a pretty neat multi-physics, portable simulation software for simulating geological storage of CO2 - GEOSX, which was put together by a collaboration of LLNL, Total, and Stanford.
GEOSX came together as quickly as it did because of LLNL’s RADIUSS Project - Rapid Application Development via an Institutional Universal Software Stack - which is a toolbox of components for assembling multiphysics simulation software. RADIUSS overlaps quite a bit with the Extrem-Scale Scientific Software Stack (E4S) which just had a new release.
Larson’s article is a list of strategies every development or computing team should have - from the mundane (how do we review, merge, deploy and release code) to hiring but also “What are our approved technologies for new projects” and “when and how do we deprecate internal tools”. Because it’s not just enough to have a preferred stack, there has to be a strategy behind it, and that will inform when new tools enter that stack and others leave.
Larson ends his list with a pretty thought-provoking paragraph:
For many of these, you could argue that they are more policy than strategy. That’s a reasonable argument, but for me the distinction is that you turn a policy into a strategy by explaining the rationale behind it: a policy is just an incomplete strategy.
Research Computing Systems
EPCC passes ISO 27001 Information Security and ISO 9001 Quality audits - Anne Whiting, EPCC
EPCC has passed its ISO 27001 (infosec) and ISO 9001 (quality) audits, which is a remarkable achievement for a research computing centre. It’s going to be increasingly important for centres to carve out their niches and be able to explain why they are the best to handle particular research computing needs; EPCC has clearly made the decision that part of its portfolio is sensitive data with high availability requirements and pursued certifications to demonstrate that it is an strong partner for those needs. Really commendable.
Emerging Data & Infrastructure Tools
A Complete Guide for Running SpecFEM Scientific HPC Workload on Red Hat OpenShift - Kevin Pouget, RedHat OpenShift
Demonstrating Performance Capabilities of Red Hat OpenShift for Running Scientific HPC Workloads - David Gray and Kevin Pouget, RedHat OpenShift
Following up on the article in #49, these two blog posts describe the work of RedHat’s Performance and Latency Sensitive Applications (PSAP) in running Gromacs and SpecFEM3D Globe in OpenShift, RedHat’s Kubernetes distribution.
It’s pretty clear running HPC workloads on k8s isn’t quite ready for mainstream yet - but I think it’s closer than some would think. The tipping point, when we start seeing more adoption, won’t be “when things are exactly as easy and performant on k8s as they are on HPC systems”. Instead, I think there will be two steps:
- When cloud-style data-intensive workloads are common enough and enough of a pain to run on HPC systems, there’ll be an uptick of teams running both HPC and OpenStack-style systems. I think we’re clearly in the middle of this transition.
- When running HPC workloads on the OpenStack systems is less of a pain (along all dimensions) than running two systems, it’ll start being more common to run the workflows on the OpenStack systems. We’re not there yet, but work by teams like the PSAP group at RedHat are getting us closer.
The first blog post covers SpecFEM, which requires a mesher job followed by a solver job. A golang client was written to fire up the job suite, aided by a Kubernetes datatype API for defining the job type; then the images had to be built. To run the jobs, as described before, Google’s MPI operator was used, and the team had to kludge a way to send the output of the mesher job to the input of the solver job. (It really seems to me like shared filesystems across jobs, like shared networks between jobs, are at the heart of the differences between cloud-native and HPC-style clusters).
The second blog post covers the performance results, for microbenchmarks as well as SpecFEM3D_Globe and Gromacs (the post described last week). Software defined networking definitely affects latency and bandwidth microbenchmarks, which is the reason for the isolation breakout described before. With that in place, scaling on Gromacs up to 32 nodes looks pretty decent - maybe 10% worse performance than bare metal, but performance also seems a bit more consistent. SpecFEM seems to fare better. Strong reaches about a 10-11% performance drop as you go up to 32 nodes, but weak scaling the problems the difference is more modest - 3%.
Ten simple rules for writing Dockerfiles for reproducible data science - Nüst et al, PLOS Computational Biology
This may be a good link to send to researchers and trainees looking to publish docker images for the first time, or even for those quite familiar with docker but who have been using it for a long time but who haven’t changed their workflows. The rules are:
- Use available tools - reuse existing dockerfiles for instance, or if tools like containers, docket, or repo2docker for Jupyter notebooks meet your needs, try using those.
- Build on existing images - a lot of relevant base images can help you avoid building from scratch (and will track multiple dependencies for you).
- Format for clarity - All code, even dockerfiles, are read more often then they’re written
- Document within the dockerfile - Ditto
- Specify versions - Reproducibilty is greatly aided by version pinning
- Use version control
- Mount datasets at runtime - essential for reuse
- Make it one-click runnable - Take advantage of container-running tools in Rstudio, Jupyter, or VSCode
- Order the instructions - Speed the building of images by making it easier to cache earlier steps
- Regularly use, rebuild the containers - Ensure it still works, and keep it maintained.
Events: Conferences, Training
FAIR 4 Research Software - Nov 17 7:00-10:00 UTC, Nov 19 16:00-19:00 UTC, Lamprecht et al, SORSE event
A workshop on applying FAIR research principles to research software.
Plato Elevate Winter Summit - 11 Dec, Free
Plato, which targets technical leaders, advertises this as “A half-day summit to make you a better engineering leader!”. Sessions on organizing your team, hiring and onboarding remotely, and improving alignment and communication seem very on-topic for us. Talks from the previous summit are available.
Random
If you’ve wondered about self-hosting MOOC-style teaching/learning platforms like edX, tutor is a easy-to-deploy docker distribution.
There’s growing support in the Linux maintainer community for replacing the protocol and behaviour of scp in favour of a tool with most of the same options but using the sftp protocol. This LWN article explains why.
DBDiagram.io - a slick-looking freemium entity-relationship diagram drawer for relational db schemas.
NoisePage - an really interesting project out of CMU for a postgres-compatible self-driving database system with an Arrow-based columnar storage.
That’s it…
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
Jobs Leading Research Computing Teams
Highlights below; full listing available on the job board.
Lead HPC Developer - University of Southern California, Waltham MA USA
USC’s Information Sciences Institute (ISI) is seeking a Lead HPC Developer interested in helping us develop a shared compute cluster to support our language understanding research. A successful candidate will:
Collaborate with technical leadership in the design, development, installation, and maintenance of software for Linux and HPC cluster systems and ensure its scalability and fault-tolerance needs are met. Take primary responsibility for planning, implementation, availability, performance, security, maintenance, and repair of cluster infrastructure. Support best practices across the environment
Senior Project Manager, MOONS - Science and Technology Facilities Counci, Edinburgh UK
The UK ATC Project Management Group now seeks a Senior Project Manager to manage the UK national contributions to the UK-led MOONS (Multi-Object Optical and Near-infrared Spectrograph) project, which is being developed in Edinburgh, working closely with UK and international partners. The MOONS instrument is destined for the Very Large Telescope in Chile, where it will measure millions of objects over the coming decade. This is an exciting opportunity to join a multinational team that will bring MOONS successfully into operation.
Technical Leader of Scientific Computing - Kitware, Clifton Park NY USA
Technical Leader at Kitware typically leads research projects potentially involving multiple external organizations. They seek out funding opportunities from government and other sponsors, form collaborative teams involving multiple organizations to pursue those opportunities, and manage the associated contracts. They are responsible for the functional management of a group of employees and perform other personnel and corporate duties in close coordination with all Kitware management and Teams.
In this role, you will have the opportunity to lead research efforts and collaborate with leading researchers in the area of experimental Material Science, with an emphasis on Artificial Intelligence, Machine Learning, and Deep Learning, while working on commercial and academic research and development programs. You will bring your technical expertise and project management skills to a team of highly skilled and published computational software developers, engineers, and researchers.
Group Manager, Automation and Research Computing - Board of Governors of the Federal Reserve System, Washington DC USA
Manages Operations: Manages the group operations of the section as part of the ARC leadership team, comprised of the ARC Chief and Group Managers. Works collaboratively with the ARC Chief and Group Managers to set goals and objectives, prioritize, and identify projects and activities for effective program execution. Is accountable and holds self, peers, and the staff responsible for identifying interdependencies, meeting deadlines, evaluating success based on predetermined criteria. Balances urgent short-term outcomes with making progress on important long-term priorities, and builds safeguards to maintain that balance. Creates an enabling environment that acknowledges and rewards initiative and innovation. Fosters a culture of experimentation, blameless retrospection, and learning from failure. Embodies a strong customer service philosophy and ensures its practice by group members. Participates in periodic division reviews of progress on major program objectives.
Director, Advanced Research Computing Center (ARCC) - University of Wyoming, Laramie WY USA
The Director will oversee the ARCC team, which is currently responsible for user, application, and systems support for UW researchers. The Director will also facilitate communication regarding research computing with other UW units, including Information Technology’s security, infrastructure, and campus data science groups. In addition to engaging with the UW community, the Director will maintain and develop relevant connections outside UW. This includes relationships and partnerships with regional and national research computing entities; colleagues at academic, government, and private institutions; service providers; vendors; industry partners across multiple market sectors, and potential donors.
Senior Advanced Research Computing Analyst - University of Victoria, Victoria BC CA
This role provides subject matter expertise to various client groups. This includes
Advanced technical support to researchers using ARC systems including national, regional and locally managed clusters, clouds and storage systems. Leading strategic ARC initiatives and implementation Leading workshops and tutorials on ARC systems, tools and technologies. Developing advanced HPC, data science and bigdata solutions for research groups across campus including collaboration on codes and scripts
Principal Computing Engineer - Medical Science & Computing, Bethesda MD USA
The Principal Computing Engineer will lead the design, administer and manage the IT infrastructure and IT services, as well as support scientific computing for biomedical research projects and has the following duties:
Design, implement and deploy NIH computing services on premise and cloud. Manage, administer and support daily operation of LHC computing systems on premise and cloud. Design, implement and maintain a scalable High-Availability (HA) and Fault-Tolerant (FT) network and computing systems. Study and provide technical options to managers and researchers for selecting effective computing solutions based on requirements.