Here our team is continuing to make post-pandemic plans for how work will work.
Others are further ahead. This week Apple confirmed their very office-centric plans, and Asana’s similar plans came up again. Apple, which is very much a product company relying on close integration of the work of software, hardware, industrial design, and even materials science teams, very dramatically doubled down on its office-based approach - with a surprisingly flexible (for Apple) approach of allowing up to two days a week remote for some teams, and an unsurprisingly (for Apple) inflexible approach of firing people who leaked internal Apple discussions to the world. Asana is requiring even more on-site time, 4 days a week (the 5th day being their long-held tradition of no-meetings Wednesdays), but with increased flexibility in working hours and conditions.
I definitely think that switching to remote-first is a huge potential opportunity for many research computing teams, and one we all should consider. But it’s not the only model available, and looking back I worry a bit that I may not have given the other obvious model its due, especially for those at Universities.
For the largely hospital-based, multi-institutional, software development team that I work on, remote-first will work quite well; but other teams have other constraints, and other benefits from being on-site. Remote-first will make hiring some candidates easier for my team, but for (say) University teams, being on campus can be a real attraction for others.
A lovely campus with quads and cafés, feeling like part of the hustle and bustle of research and teaching, being able to duck out for an hour in the afternoon to sit in on a colloquium or lecture, visiting the cutting-edge labs of researchers you are supporting — these are all major selling points to the right candidates. They’re also features we’ve typically taken for granted, and since there’s absolutely no way teams in academia can compete on compensation these days, they’re features we should be collectively playing up.
Whatever model we end up choosing (or having chosen for us by institutional requirements), there’s nothing wrong with embracing and leaning into it. If I were back in a University and we were staying on-site, I’d give serious thought to making sure a list of relevant seminars (and auditable courses) on campus was being maintained and available to the team, arranging lab tours, pushing for on-campus meetings wherever possible, etc., and including them in job ads along with some things I previously took for granted (like a couple weeks off over the holidays). The right new candidates will find this appealing, and they are perks that can’t be readily offered by tech firms. We need to play up all of our unfair advantages, which are many.
Either remote-first or office-first models are going to be easier to communicate and manage than hybrid approaches - which admittedly cover a lot of ground. I’m really interested to see what approaches pop up, and how we can use them to our advantage in research computing.
For now, on to the roundup:
There’s a couple of items this week which go back and forth between managing your own career and helping your team members, so let’s merge the sections this time:
Everyone who’s managed or been a team lead for long enough has had the experience of thinking aloud or asking an idle question and then having a team member waste hours following up on what they thought was now a Suddenly Important Thing.
As a manager we need to stay involved in the work enough to understand what issues are likely to come up while not micromanaging the work - letting the team members make the decisions they’re best placed to make. That means we need to be asking questions, probing, and checking our understanding, while providing our team members clear context that that’s what we’re doing and inviting corrections rather than them thinking these are directives.
Begbie provides some linguistic context-setters - he, following author Robin Sloan, refers to these as “linguistic instagram filters” - that he uses (and when he uses them). My favourites in this context are:
Two things I’d like to add. First - that last approach in particular, where you listen to a problem or proposed solution and then try to explain it in your own words to check that you understand, is an incredibly useful technique in almost any setting. Communication is hard, and pausing to verify that accurate communication has in fact taken place is so so so helpful. My own version of that last phrase is: “[listens carefully, asking clarifying questions here and there]… Ok, so let me see if I understand this…. [my re-explanation]. Is that close? What did I get wrong?”
Second - as a new manager, it will feel weird to have set phrases that you say over and over again. Don’t worry about it. It is good and useful to keep using the same phrases. One of the things a decent manager brings to a team is a certain amount of stability and even predictability in their actions. If you want to experiment with a different phrase to see if you get even better results, outstanding, please do, but don’t change up things you say just because you feel funny repeating yourself. Communication is hard. A conversation starter that has been used often enough for everyone on the team know exactly what context it sets is a valuable thing.
In the second article, Hogan suggests going further and actively seeking out useful approaches and phrases in meetings and conversations you’re in, then poaching them and making them your own. She particularly suggests keeping an eye on how people:
As you start watching experienced (or just talented) leaders and team members participating in meetings for these sorts of approaches, you’ll start collecting lots of potentially useful tools for the tool box. You might have to re-cast them in your own voice, and they may take a few times to work in your team, but learning from others is how we work in science and in leadership.
The Hotel Giraffe - Michael Lopp, Rands in Repose
There’s a lot in here about stress, how it builds up, and how it’s hard to see sometimes from the inside.
Lopp has four questions he asked his team members during a previous job:
On a scale of 1 to 10 (1 == low, 10 == high): How stressed are you right NOW? What is your IDEAL stress level? Ideal meaning the stress is useful and not debilitating. What is your MAX stress level? What behaviours do you see in yourself when you close or at MAX?
Most of us in this line of work like to be a bit stressed, if the stress is of the right kind - learning challenging new material, developing new skills, reaching to make stretch goals. But when stress builds up, it gets unproductive (sometimes this distinction is referred to as eustress vs distress). Lott walks through his failure modes when he’s near max stress:
Lossiness: I become unreliable. I miss on commitments and I’m not aware I’m doing so until reminded after the miss which leads to
Irritability: Small annoyances have a disproportionate effect on my mood. I have strong negative reactions to small developments that I normally easily shrug off. Then I start to become
Increasingly Pointlessly Tactical: Stuff is dropped, I’m grumpy, so I start to make lists. Lots of them. […]
Rage: The final straw. When we’re not all following my irrational unspoken script, I get rage because of my totally unrealistic expectation that everything must proceed exactly to plan.[…]
I think mine are pretty similar; certainly lossiness and irritability are early warning signs.
Much of the article is a particular example of how he only belatedly discovered he was very stressed out because he was missing on commitments to himself (food, exercise, sleep). The key is to pay attention to early warning signs - in yourself as well as your team members! - and then course correct, rather than letting things go too far.
Leadership in Data Science: Lessons Learned From Time Invested in Helping to Build the Field - Patrick J. Wolfe, Harvard Data Science Review
Data Science and Computing: The View From a Sister Campus - Hal S. Stern, Debra J. Richardson, and Marios Papaefthymiou, Harvard Data Science Review
These two short commentaries on a longer article, how data science and computing is being organized at UC Berkeley by Jennifer Chayes (Associate Provost of the Division of Computing, Data Science, and Society). Both Wolfe and Purdue and Stern, Richardson, and Papaefthymiou at UC Irvine, give some advice based on their experiences at those institutions. The approaches outlined in both papers are important not just in data science but across research computing and data and really any research support role.
Wolfe emphasizes matters of approach, that to be a trusted partner in inherently interdisciplinary work and thus to support research, it is important to:
In the second, Stern et al. at UC Irvine emphasizes structural and institutional structures, including the building of educational programs, the creation of centres for applying discipline-specific expertise to data in various domain, and cross-institutional collaborations these centres can enable within those disciplines. This work also emphasizes that lack of staff is too often the limiting factor:
A challenge that we have found at UC Irvine is that a limiting factor for advancing data science collaborations on our campus—and we imagine this is true at many universities—is that there are not enough people with the data skills and computing expertise to meet the needs of campus research teams.
Hardware Memory Models - Russ Cox
With ARM processors becoming more common in research computing - likely more popular than POWER processors ever were - it’s important that the developers of multithreaded software understand the much looser memory consistency semantics of ARM and POWER over x86.
Cox walks us through how the Total Store Order of x86 cores work, and then the more relaxed ARM/POWER memory model which has much weaker consistency guarantees. Even for developers who are relying on libraries and tooling like OpenMP runtimes to handle memory barriers for them - which hopefully is most of us! - it’s useful to know the underlying mechanisms and what can go wrong.
I’m really excited by the potentials of AI-based code completion like GitHub Copilot - research computing has too much boilerplate and drudge work compared to web development, where a lot of work has gone into frameworks to help. But others are much less enthusiastic, and there are real questions about whether all the code that went into training the Microsoft/OpenAI model was licensed to allow this.
Relatedly, this week I learned about Wingman for Haskell, which uses a logic/solver based approach (I think) to try to write out code to meet the often complex type definitions of Haskell functions.
Querying Parquet with Precision using DuckDB - Hannes Mühleisen and Mark Raasveldt, DuckDB Blog
As you know we’re big fans of embedded databases (SQLite and the ilk) around here. DuckDB is a columnar embedded database, aimed at supporting analytical queries more than row-by-row transactions. In research computing, many of our use cases are more analytical than transactional, and so these sorts of databases are interesting; I’ve played with DuckDB myself a few times.
DuckDB has its own file format for databases, which it turns out maxes out at about 4B rows (go ahead, ask me how I know), but it can also directly query from Parquet files, a standard file format for columnar data, where there is no such limit. Unlike its own internal format it can’t persist an index for such files, but parquet was meant to be streamed through quickly (and supports sharding) so it can be very fast (especially on fast storage). In this blogpost, Mühleisen and Raasveldt demonstrate how much faster DuckDB’s python client is than pandas for querying such files, but like sqlite it also has a pretty standard CLI for querying.
Introducing PathQuery, Google’s Graph Query Language - Jesse Weaver, Eric Paniagua, Tushar Agarwal, Nicholas Guy, and Alexandre Mattos, arXiv:2106.0979
Introducing PathQuery, Google’s Graph Query Language - John Zabroski
In #23, we talked about SQL/PGQ and then GQL, proposed incremental additions to SQL to support first property graphs (PGQ) and then a graph-specific query language (GQL) to query those property graphs.
Google has written up a paper on PathQuery (PQ) a graph query language written from scratch to support Google’s Knowledge Graph. As Zabroski points out, an interesting difference between PQ and any of the SQK variants is that it’s a proper programming language, with modules and a compilation step, and a relational algebra to make a number of well-known optimizations possible. Apparently Google has 40,000 path query modules already written that can be used internally. I wonder how long it will be before we see an open source implementation of this?
It’s a networking-palooza in the systems section this week:
How a Docker footgun led to a vandal deleting NewsBlur’s MongoDB database - Samuel Clay, NewsBlur Blog
Always love a good postmortem. Here NewsBlur (think Google Reader - an RSS eader service with a social component) had their MongoDB database of articles, etc. deleted because of how docker, iptables and ufw interact. That ended up exposing the MongoDB container to the world - and unfortunately, the mongo database didn’t have authentication set up, so once a vandal was through the firewall it was game over.
There’s a good story here about backups and logging, and not depending on a single line of defence. I think Clay is a little unfair to docker here though:
When I containerized MongoDB, Docker helpfully inserted an allow rule into iptables, opening up MongoDB to the world.
The issue here is upstream of Docker (and applies to podman if not running rootless, for instance). Traffic to containers are forwarded and so follow the FORWARD chain of the filter table in iptables, not INPUT, and ufw only lets you easily manage the INPUT chain. My understanding is that docker-compose, which by default sets up internal shared networks, doesn’t have this problem. And obviously this only applies to on-node firewalls, external ufw/iptables firewalls “north” of the system would be completely unaffected by this.
There are ways to address this issue on-node like running docker with —iptables=false or being very careful about the order in which you apply rules, but Clay’s description of this as a footgun is perfectly fair - it’s certainly unexpected interaction and not nearly well-known enough.
Accelerating Scientific Applications in HPC Clusters with NVIDIA DPUs Using the MVAPICH2-DPU MPI Library - Gilad Shainer, Dhabaleswar K Panda and Nick Sarkauskas
Looking Behind the Curtain of EVPN Traffic Flows - Rama Darbha, NVidia Developer Blog
Using VXLAN Routing with EVPN Through Asymmetric or Symmetric Models - Rama Darbha, NVidia Developer Blog
Even though NVIDIA’s purchase of Mellanox has been planned for over two years and was completed over a year ago, it’s still weird to me to be reading networking articles that have nothing to do with GPUs on the NVIDIA blog.
The first addresses Data Processing Units or “DPUs” - the new marketing term for accelerated network cards and/or storage interfaces. These devices are now typically somewhat more programmable and may connect network, memory, and storage. Intel has decided (not entirely unfairly) that they’ve been doing DPUs for ages - so much so that they’re already bored with it and will call their new stuff IPUs, Infrastructure Processing Units. Either way, while the new marketing term is the biggest step-change advance, the last year or two has genuinely given us real increases in the capabilities of these accelerated and smarter network processing devices.
Shainer et al. give an overview of the capabilities of their new BlueField SmartNIC/DPU/IPU/whatever — which at 8 ARM cores with 16GB of shared onboard memory is a pretty serious compute platform in its own right — and walk us through their partner’s distribution of MVAPICH2, MVAPICH2-DPU, which looks like it will be a commercial licensed product. By pushing nonblocking collectives like Ialltoall to the DPUs, it can accelerate the OSU Ialltoall benchmark by 17-23% on 32 nodes, and P3DFFT (which relies on nonblocking collectives) by 12-21% on 32 nodes.
The second and third articles have Dharba give us more general and non-NVIDIA specific deep dives (with demos you can run on your own box) into ethernet VPN (EVPN), and how it works (in the first article) and how it interacts with VXLAN (in the second article) for routing. I’ll admit to you, dear colleague, that they go deeper into networking fundamentals than I can follow, but the articles and collected resource (including the demos) seem like a good place to start if I wanted to learn more about these areas. (Corrections welcome from those of you who work with this stuff regularly!)
DevOps; a decade of confusion and frustration - Jan Harasym
DevOps started as a pretty simple idea - breaking down (or at least making more porous) the wall between two groups, developers and operations staff, with with diametrically opposed incentives. There is tooling now that helps (but isn’t essential), and skillsets that help (although as Harasym points out, sysadmins have always had coding skills, and lots of developers have long had some sysadmin skills), but the key is just alignment.
Unfortunately, devops has become a widely misused term. I’m guilty of some of the sins Harasym outlines myself, having (recently!) been on the lookout for a potential new “DevOps hire”. But that’s not a phrase that means anything. DevOps is an organizational approach, with developers and operators working together; it’s meaningless when applied to a single person.
Harasym urges us for clarity to continue to distinguish between roles which are principally operations/sysadmins, developers, and things like release/pipeline/workflow managers. If you try to hire someone for “a devops role” or advertise yourself as a “devops engineer” no one will know what you mean, because it’s an empty statement. On the other hand, an “operations/deployment professional on a devops team” actually communicates something.
HPC Wire gives us an overview of Hyperion’s HPC Market Update briefing during ISC2021, focusing on total spending and breakdown by sectors. Some things that I think matter are:
The cloud component is growing increasingly interesting, with several shots across the bow from cloud vendors on the top-500 list - Azure having 4 systems tied for 26 on the list, NVIDIA A100 + Infiniband HDR in different regions (to make it clear, presumably, that they field these special HPC Infiniband data centres across the US and in the EU), and an AWS cluster in 40 on the list with just Xeons, 25G ethernet, and (presumably) their EFA/SRD low-median-latency networking atop the ethernet we’ve talked about several times before.
ACM Europe Summer School on HPC Computer Architectures for AI and Dedicated Applications - Applications due 15 July, Virtual Summer School 31 Aug - 3 Sept
This year’s summer school hosted by BSC will be fully remote. It’s aimed at early-career CS researchers and engineers, with some spots available for outstanding MSc or undergraduate students.
2021 International Workshop on Performance, Portability & Productivity in HPC - 14 November (as part of SC21), abstracts due 27 Aug, Papers due 3 September
Topics of interest include:
Azure for Researchers part 1: Introduction to Cloud Computing - Ongoing, Microsoft Azure, 5hr45, Free
Azure for Researchers part 2: Cloud Security and Cost Management - Ongoing, Microsoft Azure, 2hr22, Free
Microsoft has just released two Microsoft Learn self-guided courses specifically for researchers - one on the basics of cloud computing in Azure for researchers, and the next one specifically looking at cost management and security.
MNHack21: 3rd Marenostrum hackathon - Team registration due 15 Sept; hackathon is 2-4 November, on-site in Barcelona
The 3rd Mare Nostrum Hackathon, using the MareNostrum4 supercomputer at BSC, is taking team applications, due 15 Sept. Registration is free, and teams will be provided mentors for the hackathon.
Oracle wants you to know that they provide commercial cloud offerings, too, and that they’ve recently expanded their free tier offerings.
For those of us using Slack (which is a lot of us), they’re rolling out some very useful new features - scheduled messages (FINALLY), and some I’m curious to try - lightweight audio huddles, and (coming soon) easy async video/audio recordings.
Cute overview of how fly.io recommends handling geographically distributed postgres instances by putting modest amounts of logic in the proxy layer for writing.
XSEDE has added some new badges to their training offerings.
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations have taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
This week’s new-listing highlights are below; the full listing of 188 jobs is, as ever, available on the job board.
Director of Research Programming, Wharton Computing - University of Pennsylvania, Philadelphia PA USA
This position will be responsible for organizing and managing Research Programming development workflow for all development projects including managing the work queue, maintaining codebases, and prioritizing historical code enhancements. Position will supervise all aspects of projects for the team including researcher relationships, application enhancements, web presence, and blogging efforts. Will work closely with the Senior Director with regard to overall strategy and direction. Will manage the team’s projects and work with professors to define and develop projects that contribute to researcher success. Responsibilities will include providing support, occasionally after business hours. Must ensure the team’s adherence to technical, quality assurance, data integrity, and security standards. This position requires a devotion to teamwork, an eagerness to learn about different domains, and a passion for delivering high-quality, high-impact projects. We welcome applicants with web-development experience in any language.
Modeling & Simulation Senior Scientist/Pharmacometrician - Vertex Pharmaceuticals, Boston MA USA
The Modeling & Simulations Senior Scientist at Vertex will lead Modeling & Simulation analyses from planning through reporting and align these analyses with the strategy determined by the modeling program lead. Types of analyses can include modeling and simulation to support evaluation of study design, dose selection, probability of success estimation, portfolio strategy, pediatric extrapolation etc. It is expected that this role will work in close collaboration with colleagues in Clinical Pharmacology, Biostatistics, Research and other colleagues within the R&D organization The M&S Senior Scientist contributes to evaluation of study design, dose selection, probability of success estimation, portfolio strategy, pediatric extrapolation etc.
Sr. Manager, Biostatistics - Moderna, Ottawa ON CA
This role is an exciting opportunity to be a critical part of the clinical development group of a high growth organization. The Senior Manager of Biostatistics will be involved in the design, execution, analysis, and reporting clinical studies with a focus on Moderna’s infectious disease vaccine programs.
Manager, Information Technology (HPC) - Memorial Sloan Kettering Cancer Centre, New York NY USA
Do you have a background in High Performance Computing and want to help enable and advance groundbreaking computational research in the frontline of the fight to cure cancer? We are seeking HPC Manger to join our Advanced Research Computing team! As a People Leader you will engage and support the academic and research scientists in research projects potentially involving multiple external organizations. Our group is committed to building collaborative environments in which the best software engineering practices are valued, and to sharing and applying cross-disciplinary computational techniques in new and emerging areas. In this position, you will m anage and maintain HPC resources, including systems administration; coordinate and direct integration and growth of these resources; guide short- and long-term planning of HPC computing at MSKCC; keep abreast of emerging technologies that may benefit HPC computing research and applications at MSKCC.
Project Manager III – Data Products Manager - UNAVCO, Boulder CO US or remote US
UNAVCO maintains and operates the Geodetic Facility for the Advancement of Geoscience (GAGE) which provides support to the NSF investigator community for geodesy, Earth Sciences research, education and workforce development with broad societal benefits. As part of GAGE, UNAVCO operates a large network of geodetic instruments (primarily GNSS) and a world class data facility and archive which provides cyber infrastructure to support the full data life cycle and interoperability with national and international Earth science Data Centers. While supporting traditional functions of the UNAVCO Data Products Manager providing support for data product generation and operations for UNAVCO and the GAGE facility, this position will also support UNAVCO’s transition to a cloud native architecture and the normalization of the NOTA network to high-rate, low latency applications like ShakeAlert. These activities support the next generation of geodetic applications from the study of plate tectonics and earthquakes to real-time precise vehicle navigation and fiducial networks.
Data Release Team Manager - NHS Digital, Southport UK
Managing the service team and ensuring effective workload planning and demand management. Providing subject matter expert advice and guidance on our processes. Investigating issues and implementing service improvements. Liaising with assurance groups to discuss the pipeline of activities, gaining feedback and implementing service changes to improve the processes and engagement with those groups. Using audit functions, provide overall assurance of applications. Engagement with the internal team to provide full end-to-end overview of the process and ensure the data access pipeline provides the required information for the wider data services to meet applicant needs.
Information & Computing Services Manager - University of Cambridge - Department of Medicine, Cambridge UK
The role-holder will report to the Department of Medicine Business and Operations Manager, one of the largest Departments in the Clinical School. While the Department operates from numerous sites across the Addenbrooke’s Biomedical Campus, the focus of the Information and Computing Services Manager will be on the delivery of IT support in the JCBC building. The role-holder will manage and oversee IT infrastructure, as well as co-ordinating current and future scientific computing provision for the JCBC. The priority will be supporting the computing needs of Principal Investigators and their research groups, who are a mix of clinical and non-clinical scientists.
HPC Manager - Stealth Mode Startup Company, Newark CA USA
We are seeking a hands-on HPC Manager with extensive experience in high-performance computing systems to join our Computing Team. You will lead a small but growing team of HPC engineers in the design and implementation of HPC system that is a crucial part of our product. You will develop highly optimized systems using the latest CPU and GPU technologies. This position is full-time based in Newark, California.
Software Engineering Manager, Bioinformatics - Roche, Pleasanton CA USA
As a Software Engineering Manager, Bioinformatics you will partner with software developers and scientists to develop the underlying bioinformatics analysis algorithms and workflows that support next generation genomic products. Manage a team of engineers and lead the effort to productionize and optimize bioinformatics software in collaboration with the scientific algorithm development team. Monitor every phase of the development process to ensure that the design and software adhere to company or department standards.
Cloud Computing Principal Solutions Architect - Roche, Madrid ES or Warsaw PL
Are you passionate about IT infrastructure, experienced in leading Cloud Technologies and convinced with the business added value of the digitalisation? Are you looking for a dynamic environment where you could make an impact? Do you have both broad and deep knowledge of networking, virtualization, operating systems, cloud native architectures, web applications, databases, and applications designed for the cloud? At Roche, we’re hiring a highly technical cloud solution architect to help our Roche Business in maximizing the benefits from public cloud infrastructure & services and together with them doing now what our patients need next.
HPC System Architect - Principal Engineer II (remote) - Roche, Santa Clara CA USA
Our Nanopore team combines expertise from a wide range of fields such as Software Engineering, Data Science, Information Theory, Biophysics, Circuit Design, Electrochemistry, and more to bridge the gap between biology and technology enabling the development of a disruptive Next-Generation Sequencing (NGS) platform. In this role, the HPC System Architect - Principal Engineer II will design and define frameworks for moving real-time data, starting from the sensor chip, and ultimately written to disk. Activities include working through trade-offs in various implementations to deliver optimal solutions for processing massively parallel data streams. In addition, the architect will be required to stay abreast of emerging technologies and advise development teams on their best utilization. In short, this role is a visionary, leveraging advances across disciplines to solve complex problems.
Sr Engineer - HPC Infrastructure (Remote) - Roche, remote CA USA
Our Senior Engineers work on complex issues and needs where analysis of situations or data requires an in-depth evaluation of variable factors, including technology dependencies, inter-organizational impact and systems thinking approaches. They are senior T-shape engineers. They have in-depth knowledge in at least two technology spaces within their infrastructure technology area and a working knowledge in a wide range of infrastructure technologies. They have a detailed understanding of how the IT infrastructure in their areas of responsibility impacts respective Roche business processes. They contribute to and technically lead challenging projects which require deep technical knowledge and infrastructure engineering skills.