Research Computing Teams Link Roundup, 25 June 2021
Hi there, everyone!
It’s been a busy week here of contacting recommended potential job candidates, building consensus among stakeholders, debugging code, and juggling finances - a normal set of activities in research computing management land, where if we’re very lucky we got some training in maybe one of those things.
I’m starting to see some out-of-office notifications in response to newsletter issues, which genuinely delights me - it’s been a long 16 months and we all deserve a break. I hope that those who haven’t managed to take significant time off in a while get the opportunity over the summer or fall.
I also had a very nice conversation this past week with one reader; we’re looking into one possibility for a community forum (which might make a revisit of the “Ask Managers Anything” feature from last year unnecessary), and toying with the idea of community video chats. I’m pretty excited about how a couple of those things might work.
But for now, on to the roundup, and the weekend:
Managing Teams
How to Re-Onboard Employees Who Started Remotely - Rebecca Zucker
Over the last 16 months or so, many of our teams have hired new people - if you are planning to go back into the office, even part-time, Zucker reminds you that you still have some onboarding tasks for those newer team members! Introduce them to people in person, show them the facilities and where everything is and how things are done, etc.
The Next Great Disruption Is Hybrid Work—Are We Ready? - Microsoft WorkLab
Most of the material here won’t be new to long-term readers, but it’s a data-driven exploration of the fact that for office workers, especially technical staff like us, flexible work - including the huge phase-space of “hybrid” options - is here to stay, and it’s going to be challenging.
One thing that I don’t see addressed enough is their point five, about shrinking networks; Microsoft WorkLab has taken a look at anonymized Outlook email and Teams meeting statistics and found that networks have shrunk - there is more communication within immediate teams and less more broadly.
I think how that plays out will depend on the team - for us (a multi-institutional collaboration), the silos between sites within our team more or less disintegrated, but there are fewer communications between each site’s team and other groups within their site. For instance, our team moved institutions during the pandemic, and we have very little communication with other groups in the new institution.
This can be addressed deliberately, but it takes work.
How to manage former peers as a new manager - Claire Lew
It’s kind of goofy that one of the most awkward situations for new managers to navigate is the most common situation we put people in for their first management role in research computing - taking their old boss’ job and managing their peers.
This is not actually a difficult role - you have a huge advantage over taking over a new team both by knowing the work of the team and the team members, and being a known quantity. Coming in from outside to run a team is challenging too! But for new managers who are uncomfortable getting used to the idea of being the manager, managing their former peers feels awkward. There’s really nothing to be done about that except accept it and move forward anyway - the sooner you stop seeming awkward about things, the sooner your team members will adjust.
Lew’s specific suggestions are to not avoid one-on-ones and to take consistent action.
Managing Your Own Career
Time Management Won’t Save You - Dane Jensen, HBR
Just a reminder that time management will help you be more efficient at getting discrete tasks done, which is all well and good, but that’s far less important than being discerning about what you choose to do.
Product Management and Working with Research Communities
Experiences and lessons learned from two virtual, hands-on microbiome bioinformatics workshops - Dillon et al., PLOS Comp Bio
An overview of the (first ever) virtual QIIME-2 workshops held by this team over the pandemic. Many people on this team had been involved in multiple in-person workshops in the past, but despite requests had never run a virtual one. This gives an overview of not just one but two workshops, and how they learned from the first to improve the second.
In the first, they:
- More cleanly separated lecture from hands-on components, so the lectures could be watched asynchronously
- Pre-recorded sessions, made them available ahead of time, and streamed them live over YouTube along with live discussions and segues; this was done by an emcee using Open Broadcaster Software and Skype (which integrates with OBS more nicely than Zoom)
- Handled interaction through (paid) Slack; in addition to the all-participants channels there were instructor-only channels and “pod” channels assigning learners into smaller subgroups
- Put one instructor in charge of OBS and video streaming, and one in charge of Slack question triage, with cross-training in case either became unavailable
- As usual, provided cloud instances for participants with pre-loaded data, with easily used, supported methods for connecting (SSH app in Google Chrome)
In the second, they made a few changes:
- They used cheaper self-hosted Zulip, which also made it easier to bulk-invite participants; users seemed broadly happy with it
- They tried to do without the chat-question triage manager instructor, and it didn’t work as well; they’ll return to having a triage manager in the future
Making a PACT (Purpose, Attendees, Community, Tech Tools) for More Engaging Virtual Meetings and Events - Center for Scientific Collaboration and Community Engagement
Scientific Community Profiles - Center for Scientific Collaboration and Community Engagement
Relatedly, the CSCCE has released the fourth section of their guidebook to virtual events, and related webinars (see below). To my mind, the first section of the book, “A guide to virtual events to facilitate community building: event formats”, is the most valuable - it lists twelve kinds of events and suggests formats and tools for each.
CSCCE also released a collection of thirteen community profiles, to give a sense of what some other major groups are doing for community engagement - what kinds of events they run, what their community is like, the resources needed to manage the community.
Research Software Development
Quick Analysis for the SSID Format String Bug - Zhi
Way back in #11 when talking about the Ariane crash we had a paper describing how error handling or logging code was implicated in surprisingly many big failures. I mean, who ever tests their error-handling or logging code? Well, that bug you probably read about where connecting to WiFi APs with an SSID of “%p%s%s%s%s%n” would disable an iPhone’s WiFi? Yeah, it was in logging code.
What Every Programmer Should Know About SSDs - Viktor Leis
A quick overview of SSDs from a software developer’s point of view:
- There’s much more parallelism available in reads and writes
- On-disk cache can hide it, but writes are 10x slower than reads - to hide that latency at volume you’ll need to use the concurrency
- Out-of-place writes and garbage collection means writes can lead to significant write amplification, increasing wear and slowing writes down further
This all means that a lot of thought has to go into designing write patterns that perform well.
Don’t use raw loops - Thomas Lourseyre, Belay the C++
Subclassing in Python Redux - Hynek Schlawack
Two articles exhorting us to use modern language features and some thought in two languages in common use in research computing - C++ and Python.
In the first, Lourseyre echoes and expands on Sean Parent’s earlier call to arms, “No raw loops”. With C++’s extensive and growing algorithm library, the C-style for (i=0; i<n; i++) approach is less explicit about what is happening, less likely to be able to take automatic advantage of parallelism, and more prone to common off-by-one errors.
In the second, Schlawack goes on a deeper tour of different reasons and approaches to subclassing in Python, and along the way points out the benefits of Protocols and type checkers like Mypy.
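As a rough sketch of the Protocols idea (this is not taken from Schlawack’s article; the names Solver, Halver, and run are hypothetical and chosen only for illustration), a class can satisfy an interface structurally, without subclassing anything:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Solver(Protocol):
    """Anything with a matching step() method counts as a Solver."""

    def step(self, state: float) -> float: ...


class Halver:
    """Hypothetical example class; note that it does not inherit from Solver."""

    def step(self, state: float) -> float:
        return state / 2.0


def run(solver: Solver, state: float, n_steps: int) -> float:
    """Apply solver.step n_steps times; a type checker can verify the call sites."""
    for _ in range(n_steps):
        state = solver.step(state)
    return state


result = run(Halver(), 8.0, 3)
print(result)
```

A type checker like Mypy would accept Halver wherever a Solver is expected because the method signatures line up; the runtime_checkable decorator additionally allows an isinstance check, though that only tests for the presence of the method, not its signature.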
Using pre-commit hooks makes software development life easier - Werner Dijkerman
Dijkerman walks us through setting up git pre-commit hooks and gives us some examples (linters, code style checkers, index builders, etc.) of ways that you can automate certain routine work before commits are even made to make sure that code reviews don’t waste time on these automatable details and can focus on more meaningful review.
Research Data Management and Analysis
Datastation | The Data IDE for Developers - DataStation Project
DataStation is a cute-looking in-memory data exploration environment with an open-source version that supports JavaScript, Python, and SQL all in-browser.
Research Computing Systems
7 Lessons From 10 Outages - Tom Kleinpeter, The Downtime Project
In #75 we mentioned the new podcast, “The Downtime Project”, reviewing high-quality post-mortems of notable failures. They’ve gone through 10 post-mortems in their first season, and reflect on what they’ve learned, finding 7 lessons that each touch on several of the failures (big failures never having merely a single cause):
- Circular dependencies will break your operational tools - for instance, don’t send your telemetry data to your production database
- Dumb automation is more robust than “smart” automation
- Don’t run huge operations on production databases
- Avoid “magic” middleware in data persistence
- Backups are useless, restores are everything, test your restores
- Roll out in stages
- Have tooling (runbooks and switches) in place to prepare for failure
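The “test your restores” lesson in the list above can be sketched in a few lines. This is only a toy, assuming a SQLite database stands in for the production system; the runs table and the helper names are hypothetical, and a real restore test would go through your actual backup tooling:

```python
import os
import sqlite3
import tempfile


def make_production_db() -> sqlite3.Connection:
    """Stand-in for a production database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO runs (status) VALUES (?)",
        [("done",), ("failed",), ("done",)],
    )
    conn.commit()
    return conn


def back_up(conn: sqlite3.Connection, path: str) -> None:
    """Write a backup of conn to a file using sqlite3's online backup API."""
    dest = sqlite3.connect(path)
    try:
        conn.backup(dest)
    finally:
        dest.close()


def restore_and_verify(path: str, expected_rows: int) -> bool:
    """Restore the backup into a fresh database and check it is usable."""
    source = sqlite3.connect(path)
    restored = sqlite3.connect(":memory:")
    try:
        source.backup(restored)
        intact = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        rows = restored.execute("SELECT COUNT(*) FROM runs").fetchone()[0]
        return intact and rows == expected_rows
    finally:
        source.close()
        restored.close()


prod = make_production_db()
with tempfile.TemporaryDirectory() as tmp:
    backup_path = os.path.join(tmp, "backup.sqlite3")
    back_up(prod, backup_path)
    restore_ok = restore_and_verify(backup_path, expected_rows=3)
prod.close()
print("restore verified:", restore_ok)
```

The point is that the check runs against the restored copy, not against the backup file merely existing.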
Counterfactuals are not Causality - Michael Nygard
Relatedly, when you’re digging into the (likely multiple) causes of a failure, Nygard reminds us that things that didn’t happen can’t, necessarily, be the cause of something. To steal an example from the post, “The admin did not configure file purging” is not a cause. It can suggest future mitigations or useful lessons learned, as “we should ensure that file purging is configured by default”, but looking for things that didn’t happen is a way for blame to sneak in and takes our eyes off of the system that led to the bad outcome.
CentOS replacement distro Rocky Linux’s first general release is out - Jim Salter, Ars Technica
As you know, I think long-term stable operating systems are a trap for systems teams. But it’s a trap a lot of people are stuck in, and so it may be of interest to know that the first release of Rocky Linux 8.4, a stable plug-in replacement for CentOS 8.4, is out, with migration scripts to aid in moving from CentOS 8.4 as well as others.
Your CPU May Have Slowed Down on Wednesday - Travis Downs
Intel recently released a microcode update for Skylake and Ice Lake that significantly slows down zero-fill of memory pages, to mitigate yet another side-channel vulnerability.
Emerging Technologies and Practices
In the search for performance, there’s more than one way to build a network - Brendan Bouffler, AWS HPC Blog
One of the few major differentiators between the two biggest commercial cloud providers, AWS and Azure, is how they handle networking for large-scale HPC workloads. Azure has made the much easier-to-explain but harder-to-operate and integrate choice of having resources available that are connected by Infiniband - but because those nodes are special, they’re harder to come by. AWS has gone the engineering route of building a more internally consistent solution, at the cost of making it much harder to explain to consumers - building their own.
In this article, Bouffler argues the AWS side. What matters in real networks even on HPC clusters, he claims, is less median latency than tail latency, and by guaranteeing QoS and using datagrams rather than ordered packets, Scalable Reliable Datagram (SRD) over the Elastic Fabric Adapter (EFA) can provide robust sustained performance while still giving access to datacenter-scale compute resources.
I think it’s fair to say that scaling over Infiniband still beats EFA + SRD for a lot of real-world codes, but I also don’t think that it’s a given that things will stay that way.
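To see why tail latency can matter more than median latency at scale, here is a toy simulation. The latency distribution is entirely made up (an assumed 1% of messages hit a slow path) and is not a measurement of EFA, SRD, or InfiniBand; the function names are hypothetical. The idea is only that a synchronized step finishes when its slowest message arrives:

```python
import random
import statistics


def message_latency(rng: random.Random) -> float:
    """Hypothetical per-message latency in microseconds: usually ~10, rarely ~100."""
    if rng.random() < 0.01:  # assumed: 1% of messages hit a slow path
        return rng.gauss(100.0, 5.0)
    return rng.gauss(10.0, 1.0)


def step_latency(rng: random.Random, n_messages: int) -> float:
    """A synchronized step completes only when its slowest message arrives."""
    return max(message_latency(rng) for _ in range(n_messages))


rng = random.Random(42)
single = [message_latency(rng) for _ in range(10_000)]
steps = [step_latency(rng, 512) for _ in range(200)]

median_single = statistics.median(single)
median_step = statistics.median(steps)
print(f"median single-message latency:   {median_single:.1f} us")
print(f"median 512-message step latency: {median_step:.1f} us")
```

With 512 messages per step, almost every step contains at least one slow message, so the typical step time is set by the tail of the per-message distribution rather than its median.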
Events: Conferences, Training
JuliaCon 2021 - 28-30 July with workshops in preceding weeks, free but registration required
Julia, a programming language of growing interest in some research software development communities, is holding their annual conference at the end of July, with both talks and a number of workshops.
Scientific Community Engagement Fundamentals - 23 Sept - 4 Nov, two sessions a week, $750
The Center for Scientific Collaboration and Community Engagement - we mentioned a couple of their resources earlier in the newsletter - is running a 7-week course on managing scientific communities:
- Scientific Community Engagement Fundamentals is designed to offer new or existing community managers core frameworks and vocabulary to describe their community’s purpose, refine or create strategic engagement programming to match community member goals, and describe their own roles and value as a community manager in STEM. While the content is designed for any level of learner, it should not be thought of as a “beginner” course. Rather, it is intended to create common ground so that scientific community managers can converse across disciplines, more efficiently learn from one another, and build successful engagement strategies that are grounded in research.
Also relevant are individual webinars:
Making a PACT for engaging virtual meetings and events - 20 July, $150
Event Planning: Selecting tools to supplement your online meetings and events - 3 Aug, $150
Event facilitation: Making decisions during your virtual meetings - 17 Aug, $150
Event facilitation: Supporting information exchange before, during and after virtual meetings and events - 31 Aug, $150
Hybrid events: Making hybrid events the best of both worlds - 14 Sept, $150
Random
There are enough static analysis tools out there for code that early adopters are starting to switch to second-generation tools. Here’s Semgrep’s take on GitLab’s switch from Bandit to Semgrep for Python code.
It turns out that “for historical reasons”, you can name bash functions almost anything, including emoji, and have long been able to.
I’m a big fan of “learn how X works by building one” kinds of tutorials - here is a 10-part series of building a debugger, with the full source code on GitHub.
Cloning a PC into a VirtualBox image and running it as a VM.
NCBI has released some training material on “Getting Started with Python and Cloud Computing” for bioinformatics.
Grafana dashboards for Postgres metrics.
NVIDIA has a performance tuning and resource selection tool for choosing GPUs to run a multi-GPU workload on in a multi-node environment - NVTAGS.
This is the first I’ve heard about the Modelica language, a DSL for simulations based on … ODEs, maybe it looks like? Maybe more? … that seems to mainly be used in the electrical engineering community. Anyone have any experience with it?
An introduction to locality sensitive hashing for approximate neighbour searches, which is not only a nice description but a lovely example of clear visualizations for explaining.
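For a flavour of the idea, here is a minimal sketch of one common LSH family, random hyperplanes for cosine similarity. It is not taken from the linked article, and the function names and parameters (64 planes, 8 dimensions) are arbitrary choices for illustration. Each vector gets one bit per hyperplane, and vectors pointing in similar directions get similar bit signatures:

```python
import random
from typing import List, Sequence, Tuple


def make_hyperplanes(n_planes: int, dim: int, rng: random.Random) -> List[List[float]]:
    """Draw random hyperplane normals with Gaussian components."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def signature(vec: Sequence[float], planes: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(p * v for p, v in zip(plane, vec)) >= 0.0) for plane in planes
    )


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of differing bits; small values suggest a small angle between vectors."""
    return sum(x != y for x, y in zip(a, b))


rng = random.Random(0)
dim, n_planes = 8, 64
planes = make_hyperplanes(n_planes, dim, rng)

base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
scaled = [2.0 * x for x in base]                  # same direction as base
near = [x + rng.gauss(0.0, 0.01) for x in base]   # slightly perturbed
far = [-x for x in base]                          # opposite direction

sig_base = signature(base, planes)
dist_scaled = hamming(sig_base, signature(scaled, planes))
dist_near = hamming(sig_base, signature(near, planes))
dist_far = hamming(sig_base, signature(far, planes))
print(dist_scaled, dist_near, dist_far)
```

In a real approximate-neighbour search the signatures (or chunks of them) would be used as hash-table keys, so only vectors landing in the same bucket need an exact comparison.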
Pocketlang is a tiny, Python/Ruby-flavoured embedded language for including in tools (think e.g. Lua), with concurrency built in.
Canonical has announced that they have Ubuntu 21.04 working out of the box on some RISC-V development boards.
That’s it…
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, if you’re on the West coast of North America try to stay cool, and good luck in the coming week with your research computing team,
Jonathan
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
Jobs Leading Research Computing Teams
This week’s new-listing highlights are below; the full listing of 180 jobs is, as ever, available on the job board.
Director, Research & High Performance Computing Support (RHPCS) - McMaster University, Hamilton ON CA
Reporting to the Vice President, Research, the Director will be engaged in the support and direction of research computing services at McMaster in their capacity as Director. RHPCS partners with McMaster’s AVP and CTO office, University Technology Services (UTS), Computer Services Unit (CSU) in the Faculty of Health Science, the Privacy Office and all other Faculty and campus IT personnel to provide services for all researchers at McMaster. Therefore, it is essential that the Director have familiarity with networking technologies, virtualization technologies, and high-performance computing.
Lead Architect, Research Infrastructure - McMaster University, Hamilton ON CA
The Canadian Longitudinal Study on Aging (CLSA) is currently recruiting for a Lead Architect (Research Infrastructure) to manage, oversee and operationalize the replacement of existing, and implementation of new, research infrastructure. The CLSA is a large, national study that will follow 50,000 Canadians between the ages of 45 and 85 for a period of at least 20 years. The CLSA is one of the most comprehensive research platforms of its kind, not only in Canada but also around the world. The CLSA is a project funded by the Canadian Institutes for Health Research (CIHR) and Innovation Canada. The CLSA has recently received funding from Innovation Canada for the renewal, ongoing development and maintenance of the CLSA research infrastructure (including information technology (IT) systems for data collection and dissemination, medical equipment and physical infrastructure) across Canada.
Lead Software Engineer - OpenAQ (Open Air Quality), remote USA
We are looking for a talented, full-stack senior software engineer to serve as the lead engineer as part of our small, but growing 100% remote (US-based) team. You are passionate about open-source mission-driven technology organizations. Your role will include leading the technical growth of the OpenAQ platform through a combination of strategic technical leadership and hands on technical know-how, but also could include developing a community of open source contributors, speaking with NASA scientists, helping the Kenyan government build an open data API, helping Mongolian air quality activists access data, reviewing code from an open source contributor in Berlin, building a tool with the EPA, and more.
Associate Director/Senior Manager, Biostatistics - Precision for Medicine, Remote CA
The Senior Manager or Associate Director is responsible for leading the biostatistics efforts for specific projects and studies, including reviewing statistical sections of protocols, writing statistical analysis plans, and developing SAS programs. In addition, responsible for managing resources within the Biostatistics and Statistical Programming department. As part of the Biostatistics team, this person will also provide technical expertise to the development of programming standards and procedures.
Manager Digital Research Infrastructure Solutions - Concordia University, Montreal QC CA
Reporting to the Director, User Services, the incumbent is responsible for the management of IT support, consultation, and integration of technical environments within the research space as well as for digital research infrastructure (DRI) technical environments pertaining to advanced research computing (ARC), data management (DM) and research software (RS). They are also responsible for the overall customer satisfaction responsibility for research IT services.
Senior Manager Data Analytics - Secureworks - Dell, Remote USA
We are looking for a detail-oriented, talented, and enthusiastic engineering manager to work in a fast-paced, startup-like environment with a seasoned cross-functional team of Security Experts, Data Scientists, Data Engineers, and Machine Learning Engineers to advance the state-of-the-art in computer and network security. If you love the challenges that come with big data then this role is for you. Duties include leading a team in large-scale data routing, modeling, extraction, transformation, loading, warehousing, and composing such systems together with support for monitoring and mediation logic. You will use the latest big data platforms and technologies (e.g. Spark, Kafka, NoSQL, Docker, Kubernetes, AWS/GCP) to help build and federate algorithms and analytics as part of the next generation Secureworks platform.
Data & AI Project Manager - AstraZeneca, Cambridge UK
At AstraZeneca we are treating Scientific Computing as a strategic asset underpinning our advances in science. Leading-edge research strategies critically depend on best in class Computing capabilities. We are looking for a highly motivated, ambitious and independently working Scientific Computing Platform aligned Project Manager to join our global team. The Scientific Computing Platform (SCP) is AstraZeneca’s state-of-the-art computing environment to pursue todays and tomorrows in-silico challenges. It strongly focuses on a platform concept, building capabilities and services around central building blocks. At its heart it uses 3 compute environments, a classical InfiniBand/Slurm HPC cluster, an OpenStack private cloud as well as various public cloud for elasticity and scale. To exploit these resource pools most optimally, the SCP is deploying strong DevOps tooling and cloud native technologies. It seeks to adjust and adapt according to changing requirements and follow the science.
Director, Research Facilitation and Services - New Jersey Institute of Technology, Newark NJ USA
Collaborate with members of NJIT’s research community and support the Associate CIO - Research Computing to ensure proactive planning for infrastructure, software, services, and resources to enable NJIT’s strategic research growth. Lead the development and ongoing management of the research computing and data storage environments, with responsibility for developing the architecture and implementation of the university’s advanced research computing platforms and managing the work of a technical team. Lead the development of research technology capital planning, incorporating sustainable funding models, and grant funding sources. Lead the discussion and planning for policies related to research data and compute in collaboration with the Associate CIO of Research Computing and the CIO, consulting with the university’s research community, General Counsel, and Risk Management so that NJIT researchers can meet their computing needs. Manage ongoing service requests from NJIT’s research community to ensure that the services provided by the technical team are viable, scalable, and proactive.
Director, Scientific Computing - Johnson & Johnson, Springhouse PA or Raritan or Titusville NJ USA or Leiden NL or Beerse BE or High Wycombe UK
Janssen Research & Development, L.L.C., a Johnson & Johnson company, is recruiting a Director, Scientific Computing, for the Statistics & Decision Sciences organization. Travel up to 10% both domestic and international is required. Primary responsibilities of the position includes identifying, establishing collaboration with, and supervision of external partners and their services at multiple locations. Strategic and technical leadership is required both internally and externally. This includes collaborations with statisticians, researchers, and information technology professionals. There is a large diversity of needs to serve, such as end-to-end management of software applications for statistical evaluation or for business processes, education in-classroom and e-learning, knowledge sharing, user interface navigation, software/application acquisition and training, and high performance computing for intensive data evaluation, simulations, and statistical research.
AnVIL Project Manager - Johns Hopkins University, Baltimore MD USA
The Analysis, Visualization, & Informatics Lab-space (AnVIL) is a secure cloud based software ecosystem for genomic data analysis ( http://anvilproject.org ). The AnVIL Team is distributed at sites throughout the country, with the Leadership managed out of JHU under program director Dr. Michael Schatz and supported by several other faculty and researchers at JHU and at collaborating institutions. The scalability and access of AnVIL will be utilized in teaching environments to train the next generation of scientists to have an understanding of accessible, reproducible, and transparent genomic data science techniques. The AnVIL Project Manager will provide technical expertise and oversight for the AnVIL Project, will work closely with the AnVIL Project Director and will provide direction and communication on technical development and project progress to federal funding agencies and research laboratories.