It’s been a good week here at RCT world headquarters.
First, our team finally published our paper describing our v1 platform at a high level - a mere 29 months after creating the first version’s Google Doc. The effort tied together years of not just software development and technical architecture but stakeholder engagement, privacy considerations, team building, and domain knowledge. Several co-authors were software developers who had never been on a paper before and hadn’t necessarily appreciated the “full stack” of the effort. It was fun to help them be part of the process not just of writing a paper but of creating a piece of the scientific record of humanity. Knowing they’ll be able to walk into university libraries all over the world, for decades, and find a copy of it in the stacks with their name on it, with authorship and citation records kept basically in perpetuity, is pretty cool.
Secondly, on a personal note, I spent some time at an Arm HPC hackathon, which was both exciting (new tech! With many different systems to play with!) and surprising (Oracle’s cloud seems… pretty ok?). But more importantly, it was really rewarding to see that after a hiatus of probably eight years from day-to-day performance tuning of HPC codes, some of the names and tools may have changed, but a basic understanding of the tradeoffs at play, and of the techniques used to balance between those tradeoffs, translates unscathed and can be put to use immediately. These are fundamental skills.
Both of these events drive home to me the breadth and depth of expertise we have in our profession, and how important it is for us to apply it.
And both breadth and depth are needed. Tech just learned a very expensive lesson with the Zillow Offers fiasco that we in research computing have known for a while - it turns out you need to have some domain expertise as well as technical expertise. It’s not enough just to know how to code or to run a computing system or manage a database; that needs to be paired with an understanding of why the software or system is being used, or what valid data looks like in a field. And the problems we’re dealing with are subtle - they require deep understanding of the domains we straddle in our work.
What continues to baffle me is that while the nature and importance of the expertise we bring to bear on research problems is being increasingly appreciated in the rest of academia - and elsewhere, as the burgeoning job board indicates - too many of our own teams continue to underplay it. Groups underbid on projects, are timid in proposals, and try to be a little bit of everything to everyone instead of understanding and playing to their strengths. Teams discussing cloud computing in research computing continue to emphasize first and foremost arguments like “we’re cheaper”, and “we don’t pay inflated tech salaries” as if being the bargain-basement discount brand is our natural lot, or as if us scandalously underpaying our staff is a feature instead of a bug.
We’re hitting a bit of a milestone with this issue of the newsletter - it’s not a nice round number like 128, but it’s still pretty notable. So far this newsletter community has helped at least one reader find a new job, helped another couple try new things in managing their teams, and has inspired at least one feature in a software project. We’re just getting started, and there’s a lot more to be done. If you have ideas, or questions, or want to help, just drop me a note at email@example.com.
For now, on to the roundup!
Voice or Veto (Employees’ Role in Hiring Managers) - Ed Batista
A common and avoidable source of frustration when making any high-impact decision - hiring a new team member or manager, but also any major technical or strategic direction - comes from not being clear ahead of time about how the decision is being made and by whom. Do the team members get a voice, or a veto? What are the decision criteria?
There are a lot of perfectly good answers to those questions, many of which the team members (or stakeholders, or…) would be ok with, but not making things explicit right at the beginning can make people feel like they’ve been fooled or not listened to.
Batista counsels being explicit, before soliciting input, about how important hiring decisions will be made and by whom, and communicating clearly throughout the process.
Owning your power as a manager - Rachel Hands
Related to being clear about decision-making power: one of the common mistakes I see in new research computing managers is an unwillingness to accept the fact that they now have a position of power. This is especially true when the new manager has been promoted to manage previous peers.
For a lot of people, suddenly having power is uncomfortable, and that’s ok (it’s way better than the other failure mode, of really relishing the newfound power), but you can’t just ignore it. “Ah but I’m still just the same person, you know?” Yes, you are, but now you can fire someone. And even if you choose not to see that power difference, those someones are exquisitely aware of it.
Hands outlines the role power that comes with being a manager, helpfully and correctly distinguishes it from the relationship power that comes with trust, and points out some specific, real problems that arise if you don’t acknowledge your power (my favourite: your power manifests in ways you didn’t intend) and what happens when you do.
CFFInit - Generate your citation metadata files with ease - Netherlands eScience Center
If you’ve been meaning to generate a CITATION.cff for your repos, here’s a little browser-based tool that will get you started - enter the name, authors, a message, and any identifiers, and it’ll provide a downloadable file.
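For reference, a minimal CITATION.cff is just a short YAML file; the project name, author, and DOI below are purely illustrative, not from any real repository:

```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Example Analysis Toolkit"
version: 1.0.0
date-released: 2021-11-12
authors:
  - family-names: Doe
    given-names: Jane
identifiers:
  - type: doi
    value: 10.5281/zenodo.0000000
```

Drop a file like this in the repository root and GitHub (among other tools) will surface a “Cite this repository” option automatically.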
Boykis talks about some problems she had learning a new programming language for production data work in the context of some things she’s read in The Programmer’s Brain, a book that covers cognitive science specifically in the context of programming. (There’s been a bunch of good reviews of The Programmer’s Brain, but I haven’t had a chance to read it yet)
There’s a lot of different ways that one can be confused by something, including but not limited to:
and writing code (or documentation) that’s less confusing means being clear about which of those (and other) sources of confusion are at play.
What’s more, some of the middle issue - lack of knowledge - can often be helped by making it easy to explore around, make changes, and see results; but that’s often quite hard in production code while quite easy in more exploration-friendly environments. The things one is concerned with in production tooling - robustness, logging, correctness checking - are very different from the concerns when exploring (will this even work?) - which is extremely relevant to research software development.
NVIDIA Announces Availability for cuNumeric Public Alpha - Jay Gould, NVIDIA
Worth flagging here that NVIDIA has released its first public version of a free drop-in CUDA/NVIDIA GPU-enabled replacement for numpy, called, confusingly for those of us who remember the pre-numpy days, cuNumeric.
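The drop-in claim means existing numpy array code should run unchanged on GPUs; as a sketch (written here against plain numpy so it runs anywhere - with cuNumeric installed, only the import line would change, assuming the drop-in claim holds):

```python
import numpy as np  # with cuNumeric: import cunumeric as np

# A small stencil-style computation of the sort cuNumeric targets:
# everything below is plain whole-array code, no GPU-specific calls.
grid = np.zeros((100, 100))
grid[0, :] = 1.0  # fixed boundary condition along the top edge

# A few Jacobi-style relaxation sweeps over the interior points
for _ in range(50):
    grid[1:-1, 1:-1] = 0.25 * (
        grid[:-2, 1:-1] + grid[2:, 1:-1] + grid[1:-1, :-2] + grid[1:-1, 2:]
    )

print(grid.mean())
```

The bet NVIDIA is making is that array code at this level of abstraction carries enough information to distribute and accelerate automatically; we’ll see how far that holds on messier real codes.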
I normally avoid “speeds and feeds” and new product announcements here, but this was a pretty big week for new stuff coming out and I think reflects some upcoming directions that those of us in RCD should be aware of.
NVIDIA Declares that it is a Full-Stack Platform - Jeffrey Burt, Next Platform
NVIDIA Debuts Quantum-2 Networking Platform with NDR InfiniBand and BlueField-3 DPU - John Russell, HPC Wire
NVIDIA GTC was this week, and there were a lot of announcements - like cuNumeric above - but I think these two capture the most interesting points. The first, by Burt, points out that NVIDIA leadership sees itself as building complete systems from hardware, networking, systems software, and SDKs for accelerated computing; and while AI and graphics are clearly specialties, HPC, genomics, data science, digital twins, and other research computing and data mainstays are explicitly called out, as well as emerging areas like quantum computing simulation.
An example of what this will mean in the short term in the data centre is in the second article, by Russell, where new extremely high-bandwidth InfiniBand fabrics are being paired with accelerated computing in the NIC - NVIDIA’s DPUs. That’s going to allow cloud-like network flexibility - like network isolation and encryption - at InfiniBand latencies within HPC clusters, which will hopefully support wider ranges of use cases and more flexible provisioning.
AMD: 96-Core EPYC ‘Genoa’ Due In 2022, 128-Core EPYC In 2023 - Dylan Martin, CRN
AMD Launches Milan-X CPU with 3D V-Cache and Multichip Instinct MI200 GPU - Tiffany Trader, HPC Wire
Vertical L3 Cache Raises the AMD Server Performance Bar - Timothy Prickett Morgan, The Next Platform
Azure HBv3 virtual machines for HPC, now up to 80 percent faster with AMD Milan-X CPUs - Evan Burness, Azure Blog
Slightly stomping on NVIDIA’s news was a set of announcements the day before by AMD, long considered an “x86, but cheaper” option but now well and truly taking on Intel and NVIDIA at the high end. For CPUs, the new Milan-X chips stack a large, fast L3 cache on top of the chiplet cores, as Morgan points out, which has substantial performance implications for a lot of research computing codes which are bandwidth-limited but have pretty regular access patterns.
This is available right now, as Burness points out, in Azure - commercial cloud platforms are increasingly the most reliable way to start testing out new systems.
AMD’s new Instinct MI200 GPUs also look like beasts, and seem to have made different tradeoffs than NVIDIA, going (like Ponte Vecchio?) explicitly after high double-precision FP64 performance. These aren’t yet available to play with, so we’ll have to see how this holds up on real benchmarks.
The range of interesting research computing hardware, with increasing differentiation between them, is going to make for a very interesting time, and will finally blow up the “all the world’s an x86” assumptions and monoculture we’ve had around tooling and assumptions. That’s going to make things harder for systems and software teams, but it’s going to make greater ranges of applications more feasible.
Hooking up a faucet to Niagara Falls: Seagate demos NVMe-accessed disk drives at Open Compute Summit - Chris Mellor, Blocks & Files
This is fun - hard drives are continuing to get faster, even though they’re much slower than SSDs - and so are starting to benefit from faster interfaces, especially when it’s not just one drive but JBODs of them.
Here Mellor reports on Seagate’s demo at the Open Compute Summit of an NVMe-connected JBOD of disks, and a blog post on the same, including NVMe support directly in an HDD controller.
How We Saved Millions in SSD Costs by Upgrading Our Filesystem - James Katz, Heap
Katz provides us a reminder that using SSDs for filesystems means changing some tradeoffs that previous filesystem decisions may have implicitly made.
They had used ZFS, a copy-on-write filesystem, for their database cluster. That had a number of advantages for them (higher durability, consistent snapshots, fs-level compression); but copy-on-write always starts causing problems as the disk gets full (it’s harder to find empty blocks for each write), and that interacts badly with SSDs’ natural tendency toward write amplification.
Moving to ZFS 2.x, which supports Zstandard and so offers higher (if slower) compression than lz4, was a substantial win for them. It resulted in fewer blocks written per write, so better performance overall and when things started to get full - which was less often, because of the better compression. Other workloads, of course, will experience the higher/slower compression tradeoff differently.
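The underlying ratio-versus-speed tradeoff is easy to demonstrate; here’s a sketch using Python’s stdlib zlib as a stand-in (zstd and lz4 themselves need third-party bindings), with a low level playing the fast-but-lighter role and a high level the slower-but-denser one:

```python
import time
import zlib

# Synthetic, repetitive data of the sort database pages often contain
data = b"timestamp,sensor,value\n2021-11-12,a7,3.14159\n" * 20000

for level in (1, 9):  # 1: fast, lighter compression; 9: slower, denser
    t0 = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - t0
    print(f"level {level}: {len(compressed)} bytes in {elapsed:.4f}s")
```

Denser compression means fewer blocks hitting the SSD per logical write - exactly the effect Heap was after - at the cost of more CPU time per write.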
Analyze terabyte-scale geospatial datasets with Dask and Jupyter on AWS - Ethan Fahy and Zac Flamig, AWS Public Sector Blog
The Pangeo community has been doing a lot of great work for large geospatial data, from software (where they’ve pushed on Dask quite a bit) to array-structured data formats (Xarray, iris).
Fahy and Flamig walk us through setting up a JupyterLab environment using Dask workers to access a very large climate simulation intercomparison data set, CMIP6. Here of course they use AWS - and make use of spot instances for the Dask workers - but the basic setup of Dask and JupyterHub using Kubernetes (with Helm charts) would be a pretty common Pangeo setup.
CCGrid 2022 - 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing - 16-19 May, Taormina, Italy, papers due 24 Nov
Covering future internet computing systems, programming models and runtimes, distributed middleware and network architectures, storage and I/O systems, security, privacy, trust, and resilience, performance modelling, scheduling, and analysis, sustainable and green computing, scientific and industrial applications, and AI/ML/DL.
The ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) - Minneapolis 27 June - 1 July, Papers due 27 Jan
Some relevant topics of interest include:
ESSA 2022 : 3rd Workshop on Extreme-Scale Storage and Analysis - 3 June, Lyon, Papers due 1 Feb
Covers topics ranging from storage systems to language and library support for data intensive computing at scale.
Open Confidential Computing Conference 2022 - Conference 17 Feb, Virtual, Free; Onsite Hackathon 18-19 Feb
As research computing more and more frequently involves sensitive data, there’s growing interest in confidential computing - keeping data confidential even from the systems teams. OC3 is a good venue to learn about what’s happening in this space.
Seemingly weird redirection behaviour explained, stemming from the fact that >& doesn’t just redirect stdout and stderr, but duplicates them to the same handle.
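The distinction matters because a duplicated descriptor shares its file offset with the original, while opening the same file twice gives two independent offsets that clobber each other. A sketch of the duplication semantics using the underlying dup() mechanism (the temp-file setup here is just for illustration):

```python
import os
import tempfile

# ">&file" (and "2>&1") doesn't open the file twice: the shell
# *duplicates* the handle, so stdout and stderr share one open
# file description - and, crucially, one shared file offset.
fd_out, path = tempfile.mkstemp()
fd_err = os.dup(fd_out)  # like 2>&1: a duplicate sharing the offset

os.write(fd_out, b"to stdout\n")
os.write(fd_err, b"to stderr\n")  # lands *after* the first write

os.close(fd_out)
os.close(fd_err)
with open(path) as f:
    content = f.read()
os.unlink(path)
print(content)  # prints "to stdout\nto stderr\n": no clobbering
```

Had the file been opened twice independently - two open() calls, two offsets both starting at zero - the second stream’s writes would overwrite the first’s, which is the seemingly weird behaviour the post unpacks.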
Learn low-level programming and disassembly through the use case of hacking games.
Container Layer Analyzer, a simple self-hosted web application to explore container image sizes broken down by layer and directory.
New HTTP verb (hopefully) - QUERY. I can’t even tell you how much it annoys me to have to use POST to make a complex query just so I can put the detailed query in the body of the request.
Great discussion of the issues with -ffast-math, and why the “why don’t you just” arguments for dealing with it won’t usually work.
C vs C++ vs Rust vs Cython for Python extensions.
Generative art - Samila.
HPC job scheduler issues come to cloud systems - AWS Batch can now use fair share scheduling. I have opinions!
Source code for a Commodore 64 MMORPG, Habitat.
A way deep dive into data initialization and finalization in C++ and ELF.
Free Operating Systems book, breaking OS design down into three overarching concepts - virtualization, concurrency, and persistence.
xtdb is an interesting looking open source database that keeps and makes searchable all value history.
This week I learned about the Kirkpatrick Model of training assessment - that assessments can be considered in layers from the most superficial (reactions) to increasingly meaningful and harder measures (assessing learning; assessing behaviour changes; and assessing overall results or impact).
And that’s it for another week. Let me know what you thought, or if you have anything you’d like to share about the newsletter or management. Just email me or reply to this newsletter if you get it in your inbox.
Have a great weekend, and good luck in the coming week with your research computing team,
Research computing - the intertwined streams of software development, systems, data management and analysis - is much more than technology. It’s teams, it’s communities, it’s product management - it’s people. It’s also one of the most important ways we can be supporting science, scholarship, and R&D today.
So research computing teams are too important to research to be managed poorly. But no one teaches us how to be effective managers and leaders in academia. We have an advantage, though - working in research collaborations has taught us the advanced management skills, but not the basics.
This newsletter focusses on providing new and experienced research computing and data managers the tools they need to be good managers without the stress, and to help their teams achieve great results and grow their careers.
This week’s new-listing highlights are below; the full listing of 148 jobs is, as ever, available on the job board.
Assistant Director, Applications of Artificial Intelligence and Machine Learning to Industry (AIMI) Center - Penn State, State College PA USA
The Institute for Computational and Data Sciences (ICDS) is seeking applications to join our team for the position of Assistant Director of Penn State’s Applications of Artificial Intelligence and Machine Learning to Industry (AIMI) Center. This university-wide interdisciplinary research center connects Penn State’s Artificial Intelligence and Machine Learning (AI/ML) diverse research expertise with corporate and industry needs and unites them in the pursuit of exploring novel AI/ML ideas. In coordination with the AIMI Director, ICDS leadership, and stakeholders, our AIMI Assistant Director will direct, lead, co-lead, and manage activities supporting AIMI goals and program mission.
R&D Project Manager - Data Science - ROSEN, Newcastle upon Tyne UK
ROSEN(UK) are currently recruiting for a Project Manager for our Newcastle based team of Data Scientists and Data Engineers working on Research & Development projects in the field of Integrity Analytics. The project manager will take ownership of all Integrity Analytics R&D projects and liaise with other R&D functions, mainly in Lingen, Germany, to ensure execution is in line with all defined timelines, budgets and quality requirements, co-ordinating project activities and working with the team to solve all project related challenges.
Technical Project Manager - National Renewable Energy Laboratory, Golden CO USA
NREL is seeking a Technical Project Manager to support major programs led by the Energy Security and Resilience Center (ESRC). This individual will provide project management leadership and coordination to efforts of significant scale and complexity. Responsibilities are expected to include: track and optimize staff and project planning; provide technical progress tracking and coordination across tasks and interdependent projects; facilitate the optimal use and leveraging of discretionary resources; develop project management resources for Department of Energy Annual Operating Plan (AOP), Technology Partnership Projects (TPP), and other lab initiatives; provide proactive support and summary tracking for research proposals; and continue to improve processes for information sharing and coordination for staff and project planning with other Centers across the Laboratory. The primary objective of these tasks and this role is to increase the efficiency and impact of ESRC’s operational tools and strategies and to advance ESRC’s long term mission and growth objectives.
Application for HPC Manager Science, CIMES - Princeton University, Princeton NJ USA
The Cooperative Institute for Modeling the Earth System (CIMES) at Princeton University is offering an exciting opportunity to an ambitious individual to help advance groundbreaking computational earth system science research using High Performance Computer (HPC) systems. The Atmospheric and Oceanic Sciences (AOS) Program is hiring a Manager Science (HPC-MS), working also with PU’s Institute for Computational Science and Engineering (PICSciE). The HPC-MS reports directly to the CIMES director and is (i) responsible for the management and further development of the HPC resources and activities at CIMES; and (ii) develops novel and innovative initiatives and projects to advance CIMES’ scientific mission.
Senior Research Software Engineer - University of Birmingham, Birmingham UK
Advanced Research Computing (ARC), at the University of Birmingham, is seeking to appoint a Senior Research Software Engineer on a permanent contract. As an award winning team, ARC has earned a national and international profile. As part of our RSE team you will be working with researchers, providing the knowledge and skills to develop, improve, maintain and support high quality research software that fulfils the requirements of research projects. The work will be varied and challenging, supporting a broad spectrum of research from Digital Humanities to Cancer Research on projects ranging from a few hours to months or years. You will join the Research Software Group within ARC. We will encourage and support you in developing your skills, both through formal training and with mentoring from highly skilled colleagues within ARC. You will work in a helpful and collaborative environment, interacting with other team members doing similar tasks and sharing knowhow on a daily basis.
ICDS Technical Director - Penn State, State College PA USA
The Institute for Computational and Data Sciences (ICDS) at Penn State, seeks a talented and goal-oriented individual as our Technical Director. The Technical Director manages the Engineering, Implementation, Engineering Project Management, and Senior Engineering teams ensuring adequate tasking and prioritization of projects.
System Engineering and Integration Lead - Penn State, State College PA USA
The Institute for Computational and Data Sciences (ICDS) at Penn State seeks a talented and goal-oriented individual to join our engineering team as a System Engineering and Integration Lead. The System Engineering and Integration Lead manages an engineering team by providing guidance and expertise in the areas of systems definition, specification definition, overseeing security and compliance implementation, hardware scoping and acquisition that ICDS-Roar supports.
Operations Lead, Institute for Computational and Data Sciences - Penn State, State College PA USA
The Institute for Computational and Data Sciences (ICDS) at Penn State seeks a talented and goal-oriented individual to join our Operations Engineering Team as an Operations Team Lead. The Operations Team Lead manages the operations of researcher focused HPC system leading a team of systems administrators and engineers.
Numerical Algorithms Developer and Research Manager - Arup, Manchester UK
We currently have an exciting opportunity for a numerical algorithms developer and research manager to join our Algorithms and Numerical Analysis team, which comprises specialists in computational mechanics, theoretical computer science, high performance computing and mathematics. The team’s work covers a broad range of areas including solvers for eigenvalue problems and nonlinear equations, graph machine learning, surrogate modelling, parallel computing, and quantum optimisation.
Data Analyst Lead, Center for Education Efficacy, Excellence, and Equity (E4) - Northwestern University, Evanston IL USA
The Center for Education Efficacy, Excellence, and Equity (E4) is partnering with Curriculum Associates (CA), a leading-edge online education company, to address educational excellence and equity through the development of an innovative research/practice partnership geared to produce evidence to improve outcomes for students with the greatest needs, especially given the impact of Covid-19. The Data Analyst Lead will work with the E4 Center Director and report directly to the director. The Data Analyst Lead will engage in all aspects of the E4 Center as a thought partner of the director and will provide expertise in data management, quality assurance, analysis and reporting. Establishes data accuracy and validity derived from a variety of systems. Performs data analysis using statistical techniques. Researches and analyzes information using multiple databases and creates reports of data and results. Responsible for maintaining confidential databases at Northwestern and preparing anonymized extracts for Northwestern researchers for approved projects.
Data Engineering Lead/Director - GSK, Brentford UK
The Data Engineering team at Consumer Tech is a critical part of the end to end engineering practice we are creating for the company. In this role you will lead the team that will be centrally responsible for the development of all global data platform capability and focus on equipping the company with a highly capable, expansive and value-scalable data platform.
Statistics and Data Science Innovation Hub: Director/Lead - GSK, various UK or USA
Biostatistics is the single-largest functional group of Statisticians, Programmers and Data Scientists within GSK R&D, numbering approx. 700 permanent people in the US, UK, Europe and India. Our work ensures that robust quantitative examination and statistical analysis is at the heart of R&D decision-making and enables scientists to make timely, data-driven choices about which potential new medicines and vaccines are most promising to add value for patients – ultimately making the development process more efficient and maximising success. We are investing in growing our cutting-edge statistical innovation and data science capabilities by creating the new Statistics & Data Science Innovation Hub (SDS-IH) led by Prof Nicky Best. SDS-IH’s mission is to build capability and deliver data-driven approaches to decision making across the organization. Achieving this vision requires us to adapt to new ways of working and rebalance skills across Biostatistics. The SDS-IH team itself is a model for our vision, consisting of Statisticians and Data Scientists with a variety of different skills and backgrounds, working together to develop novel quantitative methodologies, systems and tools.
Manager, Engineering Analytics - GitLab, remote
You will be stepping into a new position in a new team as a Manager of our Engineering Analytics team. The goal is to build and improve on our existing data capabilities while providing actionable insights and data intelligence to our highly productive, 500+ person Engineering Division. You’ll lead a team that garner insights from a broad range of metrics such as usability scores, code merge frequency, duration of pipelines, and code review times.
Manager, High-Performance Computer Service Program - Australian Bureau of Meteorology, Various AU
With limited guidance, the High-Performance Computer Service Programme Manager is accountable for all high-performance computer services and providing this service using the Bureau’s defined service management processes. You will develop, maintain and operationalise policies and procedures relating to HPC. As our High-Performance Computer Service Programme Manager, you will be accountable for ensuring all HPC incidents, problems and requests for service are appropriately assigned, managed, actioned and resolved on a timely basis to meet service levels. You will be required to exercise a practical understanding and be experienced with HPC technology, processes and systems and be able to manage a national team.
Principal Statistician - Australian Bureau of Meteorology, Various AU
Data is at the core of everything we do at the Bureau, as we collect millions of observations from our networks and external sources and convert these into essential weather, climate, water and ocean services. To ensure the Bureau of Meteorology maintains the highest level of data assurance, the Bureau is recruiting a Principal Statistician in a newly defined role in the Science and Innovation Group (SIG). The Principal Statistician will bolster the Bureau’s statistical capability by developing, applying and communicating statistical methods to observational data assets and with a particular focus on long-term climate and hydrological record.
Manager, Research Computing - Boston University, Boston MA USA
Manager of Research Computing, College of Engineering, Information Technology. Provide daily and long-term operational management of research computing through the management of researching, planning and implementing solutions to meet the goals of the various research labs and systems. Provide maintenance and technical support for research laboratory systems. Provide advice to researchers regarding technology. Manage the daily operation of the HPC environment. Research and analyze emerging technologies and solutions, devise a plan/strategy and implement those technologies as appropriate.
Lead Researcher – Heritage Data Science - Science Museum Group, London UK
As Lead Researcher you will be working with curators and historians of industrial history looking into the benefits of using computational techniques to bring together museum collections and other digital heritage content. This research will be undertaken as part of the AHRC-funded 36-month “Congruence Engine” project which directly brings together researchers from the Science Museum Group, four universities and five other heritage organisations. It has nine other heritage organisations, large and small, as formal partners, along with Manchester Digital Laboratory (MadLab) and Wikimedia UK.
Senior Data Scientist - Carnegie Mellon University, Pittsburgh PA USA
Are you passionate about data science and social impact? We are currently searching for a Senior Data Scientist with experience working on real-world problems, and a passion for social impact. Senior Data Scientists will guide educational/training programs, applied research projects with government agencies and non-profits in education, public health, criminal justice, environment, economic development and international development, and research in areas such as interpretability, bias and fairness, and other machine learning methods focused on problems in social sciences and public policy with a strong emphasis on releasing the work through open source code, shared curriculum, and publications. Mentoring students and research associates who work with the group.
Senior Research Scientist - Carnegie Mellon University, Pittsburgh PA USA
Are you passionate about helping governments and non-profits be more effective with your AI/ML/Data Science skills? We currently have Senior Research Scientist positions for people with PhDs in AI/ML/Data Science (or related areas), industry experience working on real-world problems, and a passion for social impact. Senior Research Scientists will guide educational and training programs, applied research projects with government agencies and non-profits in education, public health, criminal justice, environment, economic development and international development, and research in areas such as interpretability, bias and fairness, and other machine learning methods focused on problems in social sciences and public policy with a strong emphasis on releasing the work through open source code, shared curriculum, and publications. Mentoring students and research associates who work with the group.
Research Software Solutions Architect and Team Leader - University of Leeds, Leeds UK
The Centre for Computational Imaging and Simulation Technologies in Biomedicine (CISTIB), within the Faculties of Engineering & Physical Sciences and Medicine & Health, involves various academics and research groups. It focuses on algorithmic and applied research in computational imaging, computational physiology modelling, and simulation. We seek a leader with technical proficiency and the passion and ability to inspire the RSE team to achieve this vision and its full potential. You will have strong management skills and a proven record in developing robust research software engineering teams. You will work closely with and form an essential interface between engineering-based teams within the University, clinical and medical physics researchers within LTHT, industrial and third party collaborators, to ensure software developments remain focused and relevant, and that development resources are effectively used. As the leader of the RSE team, you will also work closely with academics in planning new research activities and be responsible for ensuring the team’s sustainability and ability to deliver on agreed project objectives. In this context, you will actively contribute to, and where appropriate, lead, fundraising initiatives through which the team is resourced.