Research Computing Teams - Stop Doing Things Challenge #2, and Link Roundup, 15 Jan 2021
Hi, all:
Most of us are now well and truly back into the swing of things; I hope you and your team are doing well.
Last time we talked about, as a manager, focusing on tasks with leverage (LeadDev also had a good article on leverage this week). Our main jobs are to support our team members and ensure the team works as effectively as possible; we're principally multipliers, not makers, and our tasks should reflect that as much as possible.
What about what the team as a whole is doing? What activities, if any, should the team stop doing? The steps are straightforward enough, but doing them properly takes some work:
- Look at where you want the team to be in a year - opportunities you want to take advantage of, threats you want to avoid.
- Look at what the team is doing that is most valuable to your researchers and other stakeholders (funding, etc.)
- Start dropping activities that are less valuable and aren't actively getting the team where you want it to go.
The first, identifying opportunities and threats, is probably something you already have a pretty good idea about. Maybe you're trying to get your systems team away from being merely a group that tends after colocated individual systems and into running a larger shared system; maybe you're trying to move your software development team towards a specialty you see as growing, like data science/ML or health/private data. Maybe there's a funding stream whose future is uncertain and you want to aim towards performing work that's more clearly under a different funding stream. Your boss will also have some ideas about directions to move towards or away from, and this should be a conversation - at least one - that you have with them.
The second, finding out what the team is doing that's most valuable to your user community, is something you have to go out and talk to your user community about. If you haven't explicitly asked one of your researchers or other stakeholders the question "what are the things we do that are most important and valuable to you" in the last six months or year, you definitely don't know the answer to this question. I've never seen anyone go through this process with their community without being repeatedly surprised.
It's generally pretty easy to arrange, say, quarterly one-on-one meetings with key research users or stakeholders under the (true!) pretence of "status updates"; that gives you the opportunity to tell them what you're doing that's relevant to them, but then - more importantly - to ask them questions and advice about what their changing priorities are. For groups you work more closely with, you can even meet monthly.
If someone won't take - or have someone else take for them! - a quarterly half-hour meeting with you where you're telling them what you're working on for them, that's a pretty clear signal that the work you're doing for them is not highly valued, and might not be worth the effort the team is putting in. These are almost certainly activities that can be phased out.
These first two steps are a pretty decent bit of work. They involve talking with a bunch of people - not as a one-off, but continuously - and distilling the likely contradictory information down into an understanding of what your most valuable activities are - valuable current activities as seen by your community, or activities that are valuable for the opportunities they will unlock.
But it's high-leverage work; it's making sure your team's activities are aligned with real needs and opportunities. The steps above are basically a mini strategic planning process for your team for the next year or so. For longer times and larger organizations it gets bigger and more comprehensive, but the basic approach is recognizable. It's also a pretty good approach to your first 90 days of a new managerial job, for pretty much the same reasons.
So doing it properly will take some time - setting up and preparing for these meetings - but it's really important if you want to make sure that what you and your team are putting all that effort into matters as much as possible.
The third thing is by far the hardest. Once you've collected all the data (and yes, as our digital humanities colleagues remind us, non-quantitative information is still data), and distilled it down, your team has to stop doing activities that are low value and not "strategic" (getting your team where you want it to go). The steps here are clear and simple, but a little intimidating, because not everyone is going to be happy:
- Communicate with your manager to make sure they're happy with the direction, if there are significant changes
- Communicate with your team - the why as much as the what
- Start communicating with your community about what your team will be focusing on
- Start saying no: phase out existing projects
- Start preemptively saying no: where possible intervene early in the design stage of new projects to steer them in directions that make sense for your team, or talk about why you won't be taking this project on.
This isn't easy, especially this last part; but without frequently saying no to low value tasks that don't get your team where it needs to go, you won't have time to say yes to the tasks that really matter, for research and for your team.
Let me know if you have questions - this stuff isn't hard, but can be intimidating - or feel free to share with the community some success stories about things that worked for you and your team. Just hit reply, or email me at jonathan@researchcomputingteams.org. And now, on to the roundup!
Managing Teams
In Bad Times, Decentralised Firms Outperform Their Rivals - Philippe Aghion and Isabelle Laporte, INSEAD Knowledge
Hold People Accountable (But Keep It Safe) - Dexter Sy, Tech Management Life
Having decision-making centralized can work really well for routine operations - which is almost never the research world - but falls apart in times of rapid change, as the article by Aghion and Laporte points out, summarizing a paper led by one of the authors with data on companies from ten countries.
Things are always changing rapidly in research - especially in research computing - and we need that same sort of distributed decision making. Our goal is to have the right people on our team - those doing the work hands-on - making as many of the decisions as possible, with us acting only as a "safety check", making sure those decisions line up with larger-picture priorities and goals (and if not, sending them back to make a new decision, not overriding the decision with one of our own).
The article by Sy talks about how to do that: how to delegate responsibility to team members who have enough task-relevant maturity. This lets them be accountable to you for their decisions and the mistakes they make, by creating an environment where they can make mistakes without it being a huge problem, and helps the team members grow in responsibility, confidence, and skills.
Maximizing Developer Effectiveness - Tim Cochran
This is aimed at software developers, but much of it would apply just as easily to those running systems or curating research data. Team members are effective if they're quickly and frequently getting feedback - did this change work, does this solution meet the requestor's needs - and not waiting for things or having their day chopped up into little pieces.
That means as managers it's important to make sure we have the tooling and processes in place to get team members their results quickly, have things ready to show users/stakeholders quickly, and be respectful of their time. Because the pace of research is generally relatively slow and fitful, we tend in research computing not to make this a focus - but team members can only move at the speed of their tools and processes.
Recommended Engineering Management Books - Caitie McCaffrey
McCaffrey has gone from an individual contributor to a director running teams totalling 20 people over the past 3.5 years, and recommends these books, some of which have come up in the newsletter and some of which haven't. McCaffrey covers the books in much more detail, but an overview is:
- The Manager's Path, Camille Fournier - Focussing on the transitions between different roles and the mindset changes required
- Thanks for the Feedback - Douglas Stone and Sheila Heen - Understanding how (and why) to receive, and implicitly to give, feedback well
- The Hard Thing About Hard Things, Ben Horowitz - covers a lot, McCaffrey felt the importance of training your team members really came through
- Accelerate, Nicole Forsgren, Jez Humble, Gene Kim - A classic DevOps but also high-performance computing team book
- Dare To Lead, Brene Brown - The importance of daring (and vulnerable) leadership
- Switch, Chip and Dan Heath - Understanding how to enact change by better understanding people
- Atomic Habits, James Clear - How to enact change in yourself by building good habits and breaking bad ones.
Product Management and Working with Research Communities
Ten simple rules for creating a brand-new virtual academic meeting (even amid a pandemic) - Scott Rich, Andreea O. Diaconescu, John D. Griffiths and Milad Lankarany, PLOS Comp. Bio
The pandemic has meant that virtual conferences are now accepted as being meaningful ways to disseminate work and gather communities. That acceptance opens up enormous opportunities to arrange workshops and conferences that would otherwise be too niche to have people travel nationally or internationally for. It also allows workshops to be put together start-to-finish on a much faster timescale than we're used to.
This paper by Rich et al. highlights some of the key points, many of which come down to finding ways of taking advantage of the flexibility that virtual events enable. Two key points worth highlighting are pretty broadly important:
- Ensure your meeting addresses a need unmet by current conferences and fills this gap in an engaging and creative fashion
- Craft a coherent, themed conference itinerary to make the content accessible to as broad an audience as possible
Presenting virtually? Here’s a checklist to make it great - Tamsen Webster, The Red Thread
The downside of lacking the in-person social cues of an "on-prem" conference is that virtual events, especially long presentations, can be easy to tune out. So presentations should be shorter. But there are other tricks too!
Here, well-known speaker and speaking coach Webster gives very concrete steps to take to make your virtual presentation as engaging as possible, whether it's live or prerecorded. Many of the visual tips are ones we've all learned by now (although the tip to stand was new to me), but the tips about making the talk sound engaging, and about doing a better job of decoupling the visuals from the words in the script, are things most of us in research could do better at - and they're especially important when you're "presenting" in one small window on someone's screen.
Research Software Development
Two Kinds of Code Review - Aleksey Kladov
This is another good article of a number we've seen here on the topic of code review as asynchronous pair programming, a way of sharing knowledge both ways - about the code itself but also about expectations and goals of the team. From the article:
- "One goal of a review process is good code."
- "Another goal of a review is good coders."
Managing technical quality in a codebase - Will Larson
This article is about the steps in improving code quality over time from an initial messy code base; the idea is marching up a ladder, solving increasingly high-level issues.
This is particularly relevant for research software development. Successful research software marches up a technical readiness/maturity ladder from proof of concept to prototype to community use to production research infrastructure. As code marches up that ladder, the tradeoffs change, and the needs for code quality change with them.
The rungs on the code quality ladder for managers, in Larson's estimation, are:
- Hot Spots - Get the bits that are causing immediate problems fixed
- Best Practices - Update team practices and tools to bring them up to best practices, so there are fewer hot spots
- Leverage Points - Clarify interfaces, data models, and other leverage points within the project to clarify overall design and make the code cleaner - this could be but doesn't necessarily mean refactoring
- Technical Vectors - Improve training and strategy to make sure the whole team is aligned on what they're building and why
- Measuring technical quality - Start proactively measuring code quality (by whatever metrics are important to your team and project) before it becomes a problem
But it works on my machine - Anton Sergeyev
A nice overview - with references and examples - on the sorts of things that can go wrong with problems that aren't reproducible between systems, what you can do to diagnose the problems, and what you can do to avoid the problems.
Sergeyev breaks the issues into six major categories (a small illustration of the "unexpected environment" case follows the list):
- Human Errors
- Unexpected environment
- Unexpected infrastructure
- Unexpected users
- Unexpected state
  - Application memory
  - Databases
  - Cache
- Poor performance
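As a small illustration of the "unexpected environment" category, here's a minimal Python sketch - the helper names and file handling are invented for the example - of how the same code can behave differently on two machines simply because it leans on platform defaults:

```python
# Minimal sketch: code that silently depends on platform defaults can
# behave differently across machines; pinning the assumption fixes it.
import locale
import sys

def report_environment():
    """Print a few of the implicit defaults this process inherited."""
    print("platform:           ", sys.platform)
    print("preferred encoding: ", locale.getpreferredencoding(False))
    print("filesystem encoding:", sys.getfilesystemencoding())

def fragile_read(path):
    # Relies on the platform's default text encoding -- may work fine on
    # the developer's UTF-8 Linux box and fail on a colleague's machine.
    with open(path) as f:
        return f.read()

def robust_read(path):
    # States the assumption explicitly, so behaviour is reproducible.
    with open(path, encoding="utf-8") as f:
        return f.read()

if __name__ == "__main__":
    report_environment()
```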
Research Computing Systems
Analyzing HPC Support Tickets: Experience and Recommendations - Alexandra DeLucia, Elisabeth Moore
This is the first paper I've ever seen trying to analyze HPC support tickets. Given that supporting research is what we do for a living, and that interacting with researchers and research staff trying to use our systems is such a key part of that effort, I'm surprised that I haven't seen efforts like this before.
The authors looked at a set of a bit under 70,000 tickets at LANL (scraped using a script from their ticketing system), looked at some patterns, and ran some NLP analysis on the text to both try to categorize tickets and see if they could predict responses.
They found some interesting things:
- Their ability to analyze the data was greatly hampered by lack of metadata - there's no summary on closing the tickets, so it's hard to even see what the underlying question was and what the solution was
- Only one category is allowed per ticket, which often mislabels tickets because tricky problems often involve interaction between two or more types of things
- The support staff's ability to look at other information about the user, or the system, was essentially nonexistent
- The support staff's ability to search past tickets by content was extremely limited
- The number of templates the support staff could use for answers to common questions was very limited
- The researchers generated (using LDA) a list of ticket clusters, which support staff then named - I'd be very interested to see the full list; only some examples are given (a rough sketch of this kind of clustering is below)
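For the curious, here's a rough, hypothetical sketch of what that kind of LDA clustering of ticket text could look like with scikit-learn - the tickets, topic count, and parameters are invented, and the paper's actual pipeline is more involved:

```python
# Toy sketch of LDA topic clustering on support-ticket text (invented data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tickets = [
    "job stuck in queue for two days on the cluster",
    "cannot log in to the cluster, ssh key rejected",
    "quota exceeded on scratch filesystem, need more disk",
    "mpi job crashes with a segfault after the upgrade",
    "password reset request for a new account",
    "slow io when reading large files from scratch",
]

# Bag-of-words counts, then fit a small LDA model.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(tickets)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)

# Print the top words per topic so support staff could name the clusters,
# as the authors had LANL staff do.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
```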
This is a very useful piece of work for understanding how we handle system support, and frankly it's not very encouraging.
I'm always surprised about how shoddy the user-facing support tools research computing teams use are. If we cared about supporting users well we'd use existing really good commercial tools with integrated client relationship management service (who is this person, how have we interacted with that group in the past, what questions do they typically have), make it easy to extract data so we can see how we were doing, and spend time writing good scripts, templates, and summaries for tickets on closing. That would help both support staff - by developing a knowledge base which could forestall questions or make them easier to answer - and the users, who would receive better service. Instead we mostly use crappy open source warmed-over perl scripts to do the bare minimum of ticket tracking, mainly so we don't get yelled at if an email falls between the cracks.
SLO — From Nothing to… Production - Ioannis Georgoulas
We've talked about Service Level Indicators/Objectives/Agreements (SLI/SLO/SLA) in the past as ways to focus operations effort in ways that are visible to users. Service Level here often means "availability" under some specific measure (the indicator) but it could just as easily be a wait time (jobs in the queue, emails awaiting responses, waiting list for training), disk space, or almost anything else (time until a new user successfully runs a nontrivial job?). The indicators are the measures you define; the objectives are internal targets for what those indicators should show; and the agreements are external-facing agreements with your users about what those numbers should be.
Other links in the roundups have focussed on the basics; Georgoulas' article focusses on developing and advocating for SLOs in your organization, with a number of useful resources linked, including some slides you could use as a starting point. This could be used by a manager or an individual contributor to advocate for adoption of internal SLOs in a team.
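To make the indicator/objective distinction concrete, here's a minimal, hypothetical sketch in Python - the metric, numbers, and names are invented, not anything from Georgoulas' article:

```python
# Minimal sketch of an SLI (a measurement) checked against an SLO (a target).
from dataclasses import dataclass

@dataclass
class SLO:
    name: str
    target: float           # e.g. 0.95 means "95% of tickets..."
    threshold_hours: float  # "...get a first response within 8 hours"

def first_response_sli(response_times_hours, threshold_hours):
    """SLI: fraction of tickets that got a first response within the threshold."""
    if not response_times_hours:
        return 1.0
    met = sum(1 for t in response_times_hours if t <= threshold_hours)
    return met / len(response_times_hours)

if __name__ == "__main__":
    slo = SLO(name="ticket first response", target=0.95, threshold_hours=8.0)
    measured = [0.5, 2.0, 7.5, 12.0, 1.0, 30.0, 4.0, 6.0]   # made-up hours
    sli = first_response_sli(measured, slo.threshold_hours)
    verdict = "met" if sli >= slo.target else "missed"
    print(f"{slo.name}: SLI = {sli:.2f} vs target {slo.target:.2f} -> {verdict}")
```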
Emerging Data & Infrastructure Tools
RISC-V Vector Instructions vs ARM and x86 SIMD - Erik Endheim
This is a nice long-read introduction to SIMD instructions like x86 SSE or AVX, as opposed to vector processing as in old supercomputing systems like Crays, or (with some modifications) in GPUs.
Fixed-width SIMD instructions can have the advantage of being specialized, but, well, they're fixed width, and code needs to be recompiled to take advantage of new wider SIMD instructions. Too, as a practical matter, there's typically a lot of overhead in preparing or moving data to take advantage of the SIMD instructions. Endheim quotes from a 2017 article by Patterson and Waterman, SIMD Instructions Considered Harmful:
Two-thirds to three-fourths of the code for MIPS-32 MSA and IA-32 AVX2 is SIMD overhead, either to prepare the data for the main SIMD loop or to handle the fringe elements when n is not a multiple of the number of floating-point numbers in a SIMD register.
On the other hand, vector processors provide a much simpler API to user code. Rather than being fixed width, there are simply vectors, of whatever known (or computed) length - go do your thing. A new version of the CPU can handle vectors better? The API probably doesn't change much, if at all.
Now, vector processing on vectors of arbitrary length isn't of much use to general-purpose computing - but it's very, very useful for scientific and data-intensive computing. And it may be coming back - in particular, upcoming RISC-V processors don't have SIMD instructions, but there is a vector-processing extension. This would be of significant interest to scientific computing, and to AI/ML workloads.
Calls for Proposals or Papers
HiCOMB 2021 - 17 May, Portland, OR, USA (hybrid); paper deadline 29 Jan
HiCOMB is the IEEE International Workshop on High Performance Computational Biology, at the intersection of HPC and computational biology. It's a workshop of the IEEE International Parallel and Distributed Processing Symposium (IPDPS), and many outstanding methods and applications papers have been published here. Other IPDPS workshops of note include the Parallel and Distributed Scientific and Engineering Computing workshop.
Events: Conferences, Training
SORSE Lightning Talks - 20 Jan, 3pm UTC
A round of lightning talks covering a poster or blogpost. Talks lined up for this session so far include:
- Development of an Automated High-Throughput Animal Training Platform
- Code Review Community
- Embedding a Jupyter Notebook
FOSDEM 2021 - 6-7 Feb, Virtual, Free
The 2021 Free and Open Source Developers' European Meeting is online this year, and the schedule is more or less in place.
Some tracks or talks likely of interest - Tools and Concepts for Successfully Open Sourcing Your Project, First Ph.D then Open Source Startup, and dev rooms like Testing and Automation, HPC, Big Data, and Data Science, CI/CD, and others for particular OSes, languages, databases, and tools.
Random
Finally, CMake is good for something - Raytracing in pure CMake.
A nice set of SciPy lecture notes starting from intro to Python to sparse matrices and optimization to scikit-learn and 3D plotting with Mayavi.
TIL: Mutation testing is kind of like Monte Carlo methods for test coverage; you change bits of your code and make sure at least one of your tests fails. If they don't, then obviously the mutated bits of code weren’t covered by your tests. Mutmut is a Python package for mutation testing; a toy sketch of the idea is below.
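Here's a toy, hand-rolled Python illustration of the idea - everything is invented, and this is not how mutmut itself works internally, which automates generating and running mutants like this:

```python
# Toy mutation-testing sketch: mutate the code under test and check
# whether the existing tests notice. A surviving mutant means the
# mutated behaviour isn't covered by any test.
def queue_priority(user_is_staff, jobs_waiting):
    """Code under test: staff jobs jump to the front of the queue."""
    if user_is_staff:
        return 0
    return jobs_waiting + 1

def queue_priority_mutant(user_is_staff, jobs_waiting):
    """A 'mutant': the staff branch's constant changed from 0 to 1."""
    if user_is_staff:
        return 1
    return jobs_waiting + 1

def weak_test(fn):
    """An intentionally weak test: it never exercises the staff branch."""
    return fn(False, 3) == 4 and fn(False, 0) == 1

if __name__ == "__main__":
    assert weak_test(queue_priority)             # the real code passes
    survived = weak_test(queue_priority_mutant)  # ...and so does the mutant
    print("mutant survived -- staff branch uncovered" if survived
          else "mutant killed")
```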
ORNL looks at GPU lifetimes, and finds that GPU lifetime is very dependent on heat dissipation.
An open-source tool similar to Roam Research - linked note-taking - based on GitHub and VSCode; foam.
A complete course for Raku/Perl 6, if you’re into that.
You can apparently compile Scala into javascript now, if you’re into that.
A fairly detailed walkthrough of setting up a WireGuard VPN.
That’s it…
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
Jobs Leading Research Computing Teams
Highlights below; full listing available on the job board.
Research Associate - Bioinformatics - Genome Sciences Centre at BC Cancer, Vancouver BC CA
This role is instrumental in the development of the CFI-funded cyberinfrastructure project, CanDIG—a state-of-the-art data sharing platform contributing to the development of an international effort to facilitate information exchange as part of the Global Alliance for Genomics & Health (GA4GH). CanDIG also supports large-scale provincial and national data sharing projects, including BC Cancer’s Personalized OncoGenomics program, the Terry Fox Research Institute (TFRI) PROFYLE project and the TFRI-led Marathon of Hope Cancer Centres Network—a major federal initiative to accelerate the adoption of precision medicine for cancer in Canada.
Data Science Manager - Intellisense.io, Cambridge UK
We are looking for a Data Science Manager to help build and improve the tooling and frameworks developed and used by our growing Data Science team, as we bring the models we develop to production.
The role would require someone with a strong background in Python development, as well as experience in CI/CD processes.
HPC/Research Computing Specialist - Unspecified, Unspecified London Area UK
I’m currently working on a permanent role for a world-renowned University with a focus on research computing - ideally, they’re looking for someone who is well versed in high performance computing.
The role would report into the Head of Product and would work closely with the Director of Research Computing. The role will own a group of applications under a Research Computing product line which covers the full product lifecycle of management including strategic technology roadmap, technology selection, release of new products, change activity on current products, ongoing support for products, technical debt, and retirement of sunset products.
The role is being described as a Product Owner, but essentially the client is looking for experience of management in the product and project space, with a blend of research / high performance computing (Supercomputing, Container Technology, Clustering, Windows-Based Clustering, Large Scale Machines, and High-Performance Storage System).
Cloud Engineer - Research Computing - Pacific Northwest National Lab, Richland WA or remote USA
We are seeking a proven and experienced Cloud Engineer to develop solutions in Azure, Google Cloud Platform (GCP), and/or Amazon Web Service (AWS). In this role, you will lead all stages of the engineering process, from vision casting to prospective projects, supporting proposals, soliciting requirements, and leading the architecture, design, and implementation of varied solutions to meet a broad range of science and engineering programs and initiatives.
As a part of our growing cloud team, you will work closely with our driven engineers - helping resolve issues and answering questions for existing projects, and influencing the strategy for better supporting existing and future projects and for leveraging the ever-growing cloud capabilities to drive the Lab’s research activities.
Senior Informatics Analyst, Research Analytics - Genentech, San Francisco CA USA
The Research Informatics and Software Engineering Department seeks a motivated individual to join our team of talented developers to create transformational software that assists in discovering groundbreaking therapeutics. As a Software Engineer, you will work with Pathologists and Lab Managers to develop and improve the next generation of Pathology Platform in Genentech Research. You will be able to understand the needs of our distinguished scientists and interface with key stakeholders in order to deliver IT solutions that support our culture of innovation.
You should be passionate about sustainable software engineering practices, the transformative potential of health-related data, and our mission to improve the lives of patients. Additionally, you should have a learning mindset, should be able to work in a fluid environment, and should have a strong desire to pursue creative solutions to challenging problems.
Manager, Agile Data Analytics - Rogers, Toronto ON CA
Reporting to the Senior Manager, Data & Analytics the candidate will be directly responsible for wireless customer analytics and insights within an Agile Marketing team.
Director of the UCL Centre for Advanced Research Computing - University College London, London UK
As part of the university’s ambitions to enhance and improve their already world-leading research capabilities, a founding Director of Advanced Research Computing is required to ensure that the vision, strategy and approach to research technology is progressive, evolving and valued by the university’s research community.
To be successful in this position, you will be an experienced senior leader with skills gained in large scale research environments. You will be experienced in the development and leadership of teams aligned to world class research technology, and be an engaging and inspiring leader. You will be curious about the Higher Education sector and attracted to its purpose. Expert in stakeholder engagement and partnering in a large scale and complex organisational structure, your track record will be to have delivered exceptional results by working with others, and to have developed your team for enduring success.
ARC will be a professional services organisation with a strong academic remit. Its goal will be to support research excellence in UCL by delivery of world-class research IT and thus help UCL address some of the most complex and important global challenges. Specifically, ARC will provide a research platform comprising infrastructure, software/tools and data management support as well as training for Researchers to develop custom software and tools.
Senior Manager, AML Data Governance - Scotiabank, Toronto ON CA
The Manager, Data Governance is a key member of the AML Data Governance Team. The mandate of the team is to define and implement data governance and data management processes aligned with policies defined by the Enterprise Data Governance group. The team will collaborate with multiple stakeholders within AML Risk and AML Technology to ensure that data management policies and standards are comprehensive and add value to existing processes while aligning with Bank standards and regulatory requirements.
The incumbent is responsible for maintaining and implementing data governance processes including documentation and compliance evidence. Additionally, responsibilities include supporting new data initiatives from a data governance perspective. Responsibilities include supporting the periodic review of the AML domain adherence to the enterprise policy and standards, as well as supporting Audit and regulatory items closure.
Director, Scientific Computing - Allen Institute for Brain Science, Seattle WA USA
The Allen Institute for Brain Science has been a leader in the field of neuroscience for over 17 years; in 2021 we are launching a new research division to understand how dynamic neuronal signals at the level of the entire brain implement fundamental computations and drive flexible behaviors. This new Allen Institute research initiative aims to generate foundational data resources of unprecedented quality and breadth, while building software tools that will help us and the community answer fundamental scientific questions.
We seek an individual to build and lead the Scientific Computing team for this division. This individual will work effectively as a technical leader, mentor, manager, and collaborator. They will be a member of the leadership group responsible for steering and implementing the new Institute’s overall research program. The Scientific Computing team will develop and implement data standards, software practices, and complex data analysis pipelines for petabyte scale data.
European Program Manager, Informatics - BD, Various in Europe
In this role you will be responsible for:
- Driving the Informatics product launches, innovations & growth, including management of organizational readiness, product enhancement, and the European roadmap
- Provision of support for implementation & execution, participation & support of troubleshooting teams, and support of platform technical informatics needs
- Management of key projects related to the informatics strategy & roll out
- Supporting the IDS Informatics strategy across the platforms by supporting the marketing strategy, pricing strategy & contracting documentation
Senior Support Specialist (Research) - University of Bristol, Bristol UK
We are seeking a Research Software Engineer (RSE) to fill the role of Senior Support Specialist for Research Computing, working as part of the ACRC’s RSE and HPC teams. This is a key role for the University and forms a bridge between specialist academic researchers and high-performance computing and software engineering expertise in the Advanced Computing Research Centre. You will have the opportunity to make a real impact by working closely with researchers to improve research software quality and performance.
Manager - Research Computing Data Science - University of Alabama at Birmingham, Birmingham AL USA
The UAB IT-Research Computing team collectively provides computational research expertise to all divisions within the University. As a central team of cyberinfrastructure experts, we are focused on improving the quality, performance, and sustainability of UAB-led computational research. Our group is committed to building collaborative environments in which the best engineering practices are valued, and to sharing and applying cross-disciplinary computational techniques to new and emerging areas. UAB’s annual research portfolio exceeds $600M/year. In this position, you will be required to work closely with colleagues in the IT-Research Computing team as well as with faculty, student/postdoctoral researchers, and technical staff across the campus to enable and accelerate their research computing efforts. You will be an integral member of a team focused on providing cutting-edge cyberinfrastructure for research.
Manager Computational & Data Ecosystem - Scientific Computing - Mount Sinai Hospital, New York City NY USA
The Manager, Computational and Data Ecosystem is responsible for managing the technical operations for Scientific Computing’s computational and data science ecosystem. This ecosystem includes high-performance computing (HPC) systems, clinical research databases, and a software development infrastructure for local and national projects. To meet Sinai’s scientific and clinical goals, the Manager brings a strategic, tactical and customer-focused vision to evolve Sinai’s computational and data-rich environment to be continually more resilient, scalable and productive for basic and translational biomedical research. The development and execution of the vision includes a deep technical understanding of the best practices for computational, data and software development systems along with a strong focus on customer service for researchers. The Manager is an expert troubleshooter. The incumbent is a productive partner for researchers and technologists throughout the organization and beyond. This position reports to the Senior Associate Dean for Scientific Computing and Data Science. Specific responsibilities are listed below.
Application Software Developer/Team Lead - University of Chicago, Chicago IL US
The Research Computing Center (RCC) is seeking an experienced software developer who will be leading the development and improvements of faculty and researchers’ software projects. The person in this position will collaborate directly with faculty, researchers, users and RCC colleagues. The Application Software Developer will work on a variety of projects with faculty campus wide.