Research Computing Teams #122, 14 May 2022
Hi!
I took part in a hackathon this week, working directly with researchers on improving their code for the first time in — gosh, maybe eight years? Nine?
It’s funny what sticks in the mind and what doesn’t. I kept coding on and off over the past decade, so that was fine, but all the glue stuff that I didn’t think to stay in practice on like “how do you unload all modules?” or “why doesn’t this shell script that sets environment variables work when I run it?” (yeah, that one’s kind of embarrassing) took some time to remember.
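(For the curious: the environment-variables one is just that changes to the environment only flow from parent process to child, never back, which is why you have to source that kind of script. A quick, hedged illustration of the same principle in Python, with a made-up variable name:)

```python
# Environment changes flow parent -> child, never back; that's why running a
# script that sets variables doesn't change the shell that launched it.
# MY_VAR is just a made-up example name.
import os
import subprocess
import sys

child_code = 'import os; os.environ["MY_VAR"] = "set-in-child"; print("child sees:", os.environ["MY_VAR"])'
subprocess.run([sys.executable, "-c", child_code], check=True)

print("parent sees:", os.environ.get("MY_VAR"))   # None - never propagates back
```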
There were more data analysis projects, and infinitely more AI/ML projects, than there would have been a decade ago, when it would have been 90% simulation codes. Technology has changed over the past decade — there are some new profilers and tooling, systems are faster, and hey, Power systems aren’t big endian any more! But I was mainly struck by how little change there had been in the basic challenges. Maybe the percentage of research groups with extensive in-house research computing and data skills has gone up a bit, but mostly the research groups we saw were very capable, deeply knowledgeable in their domain, with significantly less expertise on the computing, software, and data management side. They needed, and knew they needed, collaborators to work with to achieve their goals. And our responsibility was to do a little bit of hands-on-keyboard work, but mainly to give them the skills and resources they needed so they could continue to advance even after the engagement; to give guidance as trusted experts. And that expertise ends up straddling at least two of software, systems, and data - those aren’t useful in isolation.
The hackathon ended with a bunch of teams all being able to do significantly more computational science (and social science), and tackle bigger problems. It was really nice to be able to contribute hands-on to such work again!
The work we do is important. It’s important that it be funded well, but we can’t individually do much about that. What we can do is make sure the services are provided professionally, and meet the needs of our researchers; that we assemble teams of great experts that function well together; that our team members are treated well, given opportunities to grow and do meaningful work; and that we constantly improve our offerings and build the skills of the teams.
One more note before the roundup - next weekend is Victoria Day weekend here in Canada; I’ll be taking the weekend off, so issue #123 will arrive in your inbox on or around the 28th.
With that, on to the roundup!
Managing Teams
Communication is not Collaboration - Matt Schellhas
Schellhas decries the tendency to assume that if groups aren’t working together, it’s because of a lack of communication, and so the solution is some joint meeting to keep each other informed (this can go even worse, and become… shudder … joint social activities). But communication is not collaboration, and so more communication doesn’t necessarily lead to more collaboration.
To get two teams working together, Schellhas says, is not significantly different than getting one team working together:
- Build enough psychological safety so that people can try something new, like trusting that other group
- Set expectations
- Back up those expectations
- Provide constant feedback
and those actions don’t have to come from managers (although that does make it easier).
Weecology’s Wiki - Weecology lab, U Florida
Ethan White and his group made their joint lab wiki openly accessible, as a resource to others but also for people considering joining the group as grad students, etc.
We’ve mentioned onboarding materials and candidate packets many times in the past, and making these resources openly available is extremely valuable: it gets candidates (or collaborators, or clients) interested in the work of your group, builds a shared team understanding of the work, its impact, and how to bring people on board, and more. Having it as a wiki, and so readily updatable, is very helpful.
Note that this isn’t an “if you build it, they [the content] will come” sort of deal:
Most @weecology folks will remember hearing @skmorgane and I repeatedly say things like:
“Did you check the wiki?” “What’s missing from the wiki?” “OK, now that we’ve talked through how to do that can you write up that advice on the wiki?” “Nice solution! Add it to the wiki!”
It takes a lot of effort to make putting the material there and updating it the default behaviour of the team. But once it’s there, there are a lot of uses it can be put to, and making it open for candidates and others to see is one of them.
Managing Your Own Career
A love letter to LFTM - Angélique Weger
We’ve had a couple of articles here in the last couple of weeks about the need for more kinds of documents than just a to-do list to keep track of all the different things a manager has to keep track of. The Low Friction Task Manager (lftm) is one such (github-friendly) system, and here Weger confirms the advantages claimed by the system:
- Answers the question of ‘what do I do next?’, which is the ultimate productivity killer.
- Keeps my working memory uncluttered.
- Keeps me from um’ing during my daily standups. I always know what I worked on yesterday.
- Is a handy record of accomplishments that I can reference when it’s time for my review, I want to ask for a raise, or I’m updating my resume.
- Provides a reminder that I do, in fact, get things done and that I don’t, in fact, suck at my job.
Weger then describes her take on it (adding Markdown, and re-ordering things). The system has folders for notes on 1-1s, other meetings, general notes, and projects, each with a template, and then running journals where things like to-do lists are kept and archived, with scripts to keep things up to date.
As always, the best system is one that works for you, and that may change over time.
Product Management and Working with Research Communities
MIT to launch Office of Research Computing and Data - Scientific Computing World
As if there was any doubt about the rising importance of research computing and data, MIT is putting together an office directly reporting to the VPR with University-wide RCD as its remit. I don’t love that it’s being led by a professor rather than professional staff — we know that works less well for HPC centres, for instance — but still, it’s a good initiative, and makes MIT one of only a handful of large institutions that I know of that will have a single campus-wide effort rather than services and capabilities scattered all over campus (though I’m sure even MIT will take some time to move from column B to column A).
Cool Research Computing Projects
First Sagittarius A* Event Horizon Telescope Results. I. The Shadow of the Supermassive Black Hole in the Center of the Milky Way - Akiyama et al.
First Sagittarius A* Event Horizon Telescope Results. II. EHT and Multiwavelength Observations, Data Processing, and Calibration - Akiyama et al.
Increasingly, there’s no sharp line between scientific observations and experiments and the research computing and data efforts which enable them. The news this week that you probably already saw is a great example.
It’s way harder to see what’s at the centre of our own Milky Way than to see distant galaxies; looking inwards, half the Galaxy is in our way. Sagittarius A* (so called because it’s in the constellation of Sagittarius, and was the brightest radio source in the constellation when such things were being catalogued in the 50s) is not only obscured, but it’s highly variable (the heated gas glowing and orbiting around a black hole understood to be four million times the mass of our own sun varies over a timescale of minutes to hours - imagine the amounts of energy involved!). So you can’t just take a single long exposure, even if that would otherwise work.
To pull out the signal at the frequencies where Sgr A* is brightest, you need telescopes which collectively have a large collecting area. And to get at the resolutions you need to see spatial structure in this compact object, you need signals from far apart - using very long baseline interferometry, accurately correlating signals from measurements taken far apart.
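The core idea behind that correlation step can be illustrated with a toy example - this is very much a sketch of the principle, with made-up numbers, and nothing like the actual EHT pipeline: two stations record the same noisy signal with an unknown relative delay, and cross-correlating the two recordings recovers that delay.

```python
# Toy illustration of the correlation idea behind VLBI (not the EHT pipeline):
# two stations see the same noisy signal, one with an unknown delay; the peak
# of the cross-correlation recovers that delay.
import numpy as np

rng = np.random.default_rng(42)
n, true_delay = 4096, 37                  # hypothetical sample count and delay
signal = rng.normal(size=n)

station_a = signal + 0.5 * rng.normal(size=n)
station_b = np.roll(signal, true_delay) + 0.5 * rng.normal(size=n)

xcorr = np.correlate(station_b, station_a, mode="full")
lags = np.arange(-n + 1, n)
print("estimated delay (samples):", lags[np.argmax(xcorr)])   # ~37
```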
The Event Horizon Telescope is an effort, and an instrument, spanning half the globe. At each site, data is collected at 16-64 Gb/s and saved to disk, and those disks are then gathered up for processing. Because the data has to be correlated, keeping the tracks in sync is very important, and atomic clock data is saved along with the time streams. Multiple pipelines run on the data (as described in the second paper). The pipelines are tested by running multiple simulations of what one might expect the area around the black hole to look like, generating synthetic data, and running it through the pipelines to check against expected results.
None of these observations (or the galactic physics papers that follow from them) exist without the research computing and data effort, and that RCD effort is highly specialized to the observations, with no firm dividing line between the astrophysics, software development, systems, and data management pieces.
Research Software Development
Trusted Cyberinfrastructure Evaluation, Guidance, and Programs for Assurance of Scientific Software - Pignolio, Miller, and Peisert, Better Scientific Software
Best practices to keep your projects secure on GitHub - Justin Hutchings, GitHub blog
When I started in this field, research software was all numerical simulation - accuracy and testing were concerns, but the only real security issue was making sure the code didn’t get deleted. But with web applications, infrastructure software, or sensitive data analytics, the integrity of the code really matters.
The Better Scientific Software article summarizes the Trusted CI report that we mentioned in #105, and focuses largely on governance issues - making sure there’s a security point of contact, having tools and testing, having processes for commits, training and knowledge about safe code practices, etc.
The GitHub blog focuses on GitHub tools for monitoring dependencies, dependency review, and turning on Dependabot for security alerts (which you should absolutely do if the language of your project is supported). Both approaches are useful, and work well in tandem.
We talk a lot about containers for reproducibility of scientific workflows, but don’t sleep on reproducibility of development environments - that’s also a huge win. Here’s a blog form of a talk by Avdi Grimm about dev containers, with tooling like VSCode or GitHub Codespaces or any of a number of similar tool sets, and the benefits of a standard development container for onboarding new developers (hires or community members) and just for day-to-day productivity.
I just threw away a bunch of tests - Jonathan Hall
The evolution of the SciPy developer CLI - Sayantika Banik, Quantsight Labs
Related to (but not only about) ease of developer onboarding - I was just having this conversation with a friend. Test suites are code, too, and part of your product - they’re not some weird kind of “meta-code” for which the usual rules don’t apply. As Hall points out, that means keeping them documented, making them easy to run, refactoring them, and discarding some when appropriate.
As an extreme version of that, Banik walks us through the SciPy developer CLI, with tools and documentation for building, testing, static checking, benchmarking, using, and adding/editing documentation or release notes for SciPy packages. Obviously almost no other package is at SciPy’s scale, but an engaged developer community needs support and documentation, and if you make tests mandatory as part of significant commits (and you should!), then the test suites, and the instructions for running and updating them, need to be kept up to date.
The code review pyramid - focus human attention where it is needed, and have automation take care of what it does well.
Research Data Management and Analysis
sqlite-utils: a nice way to import data into SQLite for analysis - Julia Evans
As you know, I’m a big believer that many .csv files, and most directories of .csv files, could usefully be a sqlite file (or some other database). Evans discovers sqlite-utils, part of the suite of tools around Datasette, for quickly importing existing data into sqlite.
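As a concrete (and hedged - the file, table, and column names here are made up) sketch of how little friction there is, here’s the Python side of pulling a CSV into SQLite with sqlite-utils; the CLI one-liner is shorter still.

```python
# Minimal sketch: import a CSV into SQLite with sqlite-utils' Python API.
# "runs.csv" and the "runs" table are hypothetical; the CLI equivalent is
# roughly: sqlite-utils insert results.db runs runs.csv --csv
import csv
import sqlite_utils

db = sqlite_utils.Database("results.db")
with open("runs.csv", newline="") as f:
    db["runs"].insert_all(csv.DictReader(f))

# Once it's in SQLite it's queryable like any other table
for row in db.query("SELECT count(*) AS n FROM runs"):
    print(row["n"])
```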
I’m also really excited about all of these “SQL atop data just sitting there” tools that are popping up, particularly on object stores and/or in columnar formats - I think they’ll end up being really useful for a lot of data analysis, or even for log analysis for systems. Sneller is another such tool, for schemaless JSON items.
How time-series compression algorithms work, and why they matter.
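One of the tricks typically covered in pieces like this is delta-of-delta encoding for timestamps; here is a toy sketch (with made-up numbers) of why it works so well on regularly-sampled data.

```python
# Toy sketch of delta-of-delta encoding: regularly sampled timestamps become
# long runs of zeros, which a bit-packing or run-length stage then compresses
# extremely well. Values here are invented.
def delta_of_delta(timestamps):
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + dods

ts = [1000, 1010, 1020, 1030, 1041, 1051]   # mostly-regular 10s sampling
print(delta_of_delta(ts))                    # [1000, 10, 0, 0, 1, -1]
```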
Research Computing Systems
Videos are available for talks from the Science Gateways Community Institute MiniGateways 2022 spring conference. Talks on data collection from real-time sensor networks and on OpenOnDemand application development, and a lightning talk in the first session on building an open science data web application with no budget, may be of interest.
Cool - Arm’s HPC compiler suite is now freely available! The Fortran one is coming, with Arm working with LLVM on flang. There are, ahem, others available as well, but this is good news.
Emerging Technologies and Practices
Intel Unrolls DPU Roadmap, with a Two Year Cadence - Timothy Prickett Morgan, The Next Platform
ZFS without a Server Using the NVIDIA BlueField-2 DPU - Patrick Kennedy, Serve The Home
Platform Week Product Updates - Cloudflare
After years of Software Defined Networking being largely a cloud thing, pushing intelligence into the network - either the data centre fabric or the WAN - is clearly very nearly here, and I’m fascinated to see how it shifts how we’ve thought for quite some time about designing systems.
Without getting into whose DPU is better than whose for which purpose [CoI reminder: I’m at NVIDIA now], the tech bigs clearly see an important future here. AMD has made a big purchase to get into that space, and Intel at their Vision conference announced their DPU roadmap through to 2026 (although to be fair it gets fuzzy past 2023, and they still insist on calling them IPUs). Morgan gives a runthrough of Intel’s announcement, and characteristically gives us some historical perspective - disaggregation is just the pendulum swinging back after years of pushing everything onto the CPU:
In a way, all of the virtualized functions that were added to operating systems and drivers are being consolidated on a dedicated and cheaper subsystem – just like mainframes and proprietary minicomputers did five and four decades ago because mainframe CPUs were also very expensive commodities
In the second article, Kennedy runs a ZFS RAID array server entirely from the compute on two DPUs, without any x86 server in the path - one DPU exposing the block devices, and the second making a RAID array. There are plans to try to use some further offloading to specialized hardware on the DPUs. Storage is one of the areas I’m most excited about the impact of DPUs (the other is tenant isolation/security).
On a same-but-different note, this week Cloudflare announced a number of new or updated products, particularly around their edge Workers, durable objects and distributed edge databases (based on SQLite!), an open beta for their R2 object storage, and NAT and distributed API gateways. Again, this is about pushing compute and logic out into the network, but here into the WAN. The work announced this week, plus their previous work pushing zero-trust authentication into the network and edge, could have real impact on designing science gateways, DMZs, etc., changing how data is distributed and what “has” to be collocated. Cloudflare of course isn’t the only company working on this, but they’re currently the furthest ahead with a complete portfolio of services.
R2’s cost structure, as I mentioned in #94, is pay per month for the storage and no per-download-bandwidth fee. (There’s a nominal fee per million listing or reading operations, but unless those occur more than a million times a month - unlikely for our community - you won’t even pay that.) That is going to be potentially very useful for research computing groups who want to make large datasets or other downloads available to the community, and for whatever reason can’t host them themselves.
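To make that concrete, here’s a back-of-the-envelope sketch - the rates below are assumed, roughly-plausible numbers for illustration only, not quoted prices, so check current pricing before relying on them.

```python
# Back-of-the-envelope only; storage and egress rates below are assumptions
# for illustration, not quoted prices.
dataset_gb = 10_000                    # a 10 TB dataset
downloads_per_month = 50
storage_rate_per_gb_month = 0.015      # assumed object-storage rate
egress_rate_per_gb = 0.09              # assumed typical per-GB egress rate

storage = dataset_gb * storage_rate_per_gb_month
egress = dataset_gb * downloads_per_month * egress_rate_per_gb

print(f"storage:                       ${storage:>9,.0f}/month")   # ~$150
print(f"egress, with bandwidth fees:   ${egress:>9,.0f}/month")    # ~$45,000
print("egress, with no bandwidth fee:  $        0/month")
```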
Monte Cimone: Paving the Road for the First Generation of RISC-V High-Performance Computers - Bartolini et al., arXiv:2205.03725
I’m more sceptical about RISC-V than others I know. Maybe it’s lack of imagination, but I just don’t see how making an ISA open source leads to uptake and innovation. Hardware isn’t software. Even in a world with lots of FPGAs, the ISA is one thing, but most RISC-V core designs are very much not open source — and for going beyond FPGA implementations, making chips remains eye-wateringly expensive. Even setting aside the issue of turning an ISA into physical chips that implement it, one of the superpowers of open source is forking and remixing, and having forked and remixed versions of an ISA isn’t an obvious path to success for an architecture.
But lots of people I respect strongly disagree, and god knows I’ve been wrong before. For areas where standards are important, enough people wanting the standard to succeed can be enough for it to do so. So one keeps an eye on things.
Here Bartolini et al. build an 8-node RISC-V cluster, based on SiFive U740 SoCs. Impressively, enough of the software stack is RISC-V-ready for the authors to be able to run Linux, Slurm, NFS, LDAP, ExaMon for performance monitoring, the HPL and STREAM benchmarks, and Quantum ESPRESSO. (One huge and unsung benefit of the work done over the past decade to build the Arm software stack has been to identify and special-case “all the world’s an x86” assumptions that have lurked in code for a long time).
Performance isn’t brilliant — HPL gives ~2 GFLOP/s per node, the available 1GbE only gets them to ~75% efficiency on 8 nodes, and something is holding them back to 15% of theoretical bandwidth on STREAM — but power consumption (see below) is absolutely tiny; and this even being possible demonstrates that the tool chain and software stack are further along than I had realized.
The first quantum computing coding competition that I’m aware of.
Fascinating to me how cloud services like AWS Batch are adopting features familiar from HPC queueing systems, like job dependencies.
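For instance, here’s a hedged sketch (the queue and job definition names are hypothetical) of what that looks like with boto3 - the second job is held until the first one completes successfully, much like a Slurm afterok dependency.

```python
# Sketch of AWS Batch job dependencies via boto3; all names are hypothetical.
import boto3

batch = boto3.client("batch")

prep = batch.submit_job(
    jobName="preprocess",
    jobQueue="my-queue",                  # hypothetical queue
    jobDefinition="preprocess-jobdef",    # hypothetical job definition
)

# This job won't start until the preprocessing job succeeds, analogous to
# Slurm's --dependency=afterok:<jobid>.
batch.submit_job(
    jobName="analysis",
    jobQueue="my-queue",
    jobDefinition="analysis-jobdef",
    dependsOn=[{"jobId": prep["jobId"]}],
)
```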
Random
You know how you’ve wanted WiFi and Bluetooth on your Apple II? You’re in luck.
Setting up and exploring gopher in 2022. I’ll remind you all that in 1994 I was mortally certain that the web was a fad and Gopher was where things were going, so do keep that in mind when deciding how much weight to give my opinions on technology.
COLR, a font format for fonts whose letters are intrinsically drawn in multiple colours. The last 15 years of the internet have proven that we will always use new technologies wisely and with restraint, so I’m sure this will turn out fine.
Implementing TeX’s paragraph-wrapping in perl, for some reason.
Writing your own uint128 to float64 conversion by hand for some reason, and getting better results than the Rust compiler?
Accessibility in Linux is kind of a mess.
Adding CSV support to (an implementation of) awk, which honestly sounds pretty useful.
Implementing a lock-free queue with C atomics.
Uncertainty propagation for the masses - guesstimate lets you give the range of input values and compute expected output values. This is the sort of thing we know how to do in science and sort of forget to apply in the rest of our lives. The results are more useful than you’d think!
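A hedged sketch of the underlying idea (the scenario and numbers are made up), in case you want to try it without leaving a Python prompt:

```python
# Toy Monte Carlo uncertainty propagation: give ranges for the inputs, sample,
# and look at the spread of the output. Scenario and numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# "Somewhere between 40 and 60 active users, each using 100-300 core-hours/month"
users = rng.uniform(40, 60, n)
core_hours_per_user = rng.uniform(100, 300, n)

total = users * core_hours_per_user
lo, mid, hi = np.percentile(total, [5, 50, 95])
print(f"monthly core-hours: {lo:,.0f} to {hi:,.0f} (median {mid:,.0f})")
```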
The history of the Intel Hypercube, one of the first commercial multiprocessing systems, and (in posts coming soon) its influence on MATLAB and technical computing. The first parallel computer I ever used was the Paragon, an immediate descendant of this machine.
That’s it…
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Jonathan
About This Newsletter
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
Jobs Leading Research Computing Teams
This week’s new-listing highlights are below; the full listing of 143 jobs is, as ever, available on the job board.
Computing and Data Manager, Department of Earth Science & Engineering - Imperial College London, London UK
The Computing and Data Manager will provide technical expertise to support decision making of the Departmental e-infrastructure steering Committee. The role manages the Computing Support Team whose role is to lead on providing world class customer service, proactively improving service performance and driving down incidents volumes. Their primary role is to ensure that all staff and students within the department have useful, safe and efficient access to the hardware, software, skills and IT services required in order to undertake the full range of their research, teaching, learning, administration and outreach.
Principal Solutions Architect, HPC Performance Lab - AWS, Seattle WA USA
Ideally, you are someone who knows the current field of HPC solutions available to customers, has a view about the best way to use them, and also wants to help guide the broader HPC industry in new directions. You have worked with scientific applications, including analyzing and optimizing them for performance, and are passionate about the HPC space. You love to share your passion with others and exhibit good judgment in selecting strategic opportunities to do so. You don’t just want to be part of an industry movement; you want to be out front leading it. If this sounds like you, we’d like to speak with you.
Director Advanced Data and Storage Management, HPC - Princeton, Princeton NJ USA
Princeton University is seeking a talented Director of Advanced Data and Storage Management to join its Research Computing leadership team. The Director of Advanced Data and Storage Management reports to the Associate CIO for Research Computing and manages the group responsible for the vision, design and support of data storage and management for advancing innovative research at Princeton University. This role will provide leadership to the implementation and support of the TigerData service, a comprehensive set of data storage and management tools and services that provides storage capacity, reliability, functionality, and performance to campus. To successfully implement TigerData, this role will closely partner with the Director of Research Data and Open Scholarship in the Princeton University Library
Research Computing Supervisor, Sunnybrook Research Institute - Sunnybrook Health Sciences Centre, Toronto ON CA
Purpose of the Role: The Research Computing Supervisor is responsible for managing SRI computing deskside operations and supporting corporate medium to large enterprise wide projects as assigned by the corporate IT group (CS/TS/Infosecurity). Provide regular reports on team effectiveness and project status. Provide subject matter expertise to other team members on corporate and clinical applications supporting Research at SRI. Provide coaching and mentoring as required. Provide input to performance reviews and report performance related issues.
Senior Software Engineer - Prometheus Biosciences, San Diego CA or remote USA
Data Science & Engineering (DSE) at Prometheus Biosciences, Inc. is looking for a full-stack software engineer with experience in back-end infrastructure, databases and big data. Experience (or at least a strong interest) in bioinformatics, biology and machine learning is a strong plus. DSE is a software and methods group that develops computational and machine learning approaches to discover new drug targets, validate biomarkers and develop companion diagnostics. We’re a small, remote team where members have a chance to wear multiple hats, contribute to the needs across R&D, collaborate with experts from diverse backgrounds and build expertise in the underlying science and technology.
Senior Manager, Data Visualization, Alexion - AstraZeneca, Toronto ON CA
At Alexion, people living with rare and devastating diseases are our Guiding Star. The Senior Manager, Data Visualization supports the development and enhancement of digital technology and leads the data visualization initiative developed to monitor clinical trial data quality and vendor performance. The position manages complex abstract clinical data management technical projects and is responsible for tasks used to define how and where technologies are applied to collect, code, review, validate, and visualize clinical trial data received in-house and from external sources.
Data Science & Analytics Manager - GSK, Barnard Castle UK
In this role, you will coordinate a team of data and analytics professionals and digital change agents, who are truly valued and appreciated for their knowledge and skills, and are critical to our success, performance and impact we can have on our patients. You and your team will be responsible for integrating and analysing complex, high dimensional data sets, building novel, innovative models and visualisations to solve complex business problems or identify untapped opportunities.
High-Performance Computing Consultant II - User & Application Support - NCAR, Boulder CO USA
Designs, develops, deploys, and maintains tools to support the HPC user environment for the purposes of monitoring, troubleshooting, and advancing the ease-of-use of CISL HPC systems. May mentor and supervise early career staff, student assistants and visitors.
Senior Data Engineer, Media Cloud - Northeastern University, Boston MA USA
The Media Cloud project (http://mediacloud.org) is seeking a Senior Data Engineer to develop scalable text analysis pipelines, research and implement cutting-edge text classification approaches, and support and collaborate on academic research projects related to media attention, hate speech, and social media platforms. In this grant-funded role, you will wear many hats - exploratory data scientist, text analysis expert, data pipeline engineer, research collaborator, product manager, and more. You will work closely with the principal investigators and a team of media researchers to research, prototype, and develop data analysis workflows that can scale from initial prototypes to corpora of millions of documents. Some of this will rely on skills you already have, but you will have to do significant work learning new skills and exploring cutting-edge supporting technologies and algorithms. This position provides an opportunity for someone to work on leading tools that support critical research into how social mobilization interacts with media and to help make Media Cloud more useful for researchers and non-profits trying to understand the role of media for democratic processes. We expect scholarly and popular press publications to come out of this research.
Software Engineering Manager, Quantum Computing, Center for Quantum Computing - AWS, San Francisco CA USA
We are looking for a manager who can: Manage and develop 4-6 junior to senior engineers and grow and engage them on a day-to-day basis. Drive programs spanning engineering and science teams to deliver operational and performance improvements and capabilities to drive internal research and development. Work collaboratively with other teams to scope requirements and align responsibilities. Adapt to an unorthodox scientific/engineering environment